Over time, several research scientists have asked me whether there are any open access databases of de-identified laboratory data coded with LOINC. Clinical laboratory data are used in a majority of medical decisions. They are also valuable in measuring the quality of care, for public health and surveillance, and in cost-effectiveness studies.

Large, openly available data sources could enable more of these kinds of analyses. Yet medical privacy is a core tenet of healthcare. De-identification of protected health information according to HIPAA can be a challenging task. And the rapid growth of genetic laboratory testing raises additional questions and challenges for de-identification.

The underlying research question will drive whether any particular data set is useful, and scientists may have different collaboration possibilities available to them that can open doors. Here are some opportunities and directions for researchers to pursue when looking for de-identified laboratory data.

The first resource to mention is the MIMIC database, an openly available critical care dataset. Now in its third iteration, MIMIC-III (Medical Information Mart for Intensive Care) is a large, single-center database that holds information about patients admitted to critical care units at a large hospital. The database contains a wealth of health information, including laboratory tests, vital signs, medications, procedure codes, imaging reports, and fluid balance. It is available for use in research, quality improvement, or education.

MIMIC is a truly unique resource. I’m not aware of anything similar. In its original form, the catalog of test variables contained only local observation codes. While she was at the NLM, my LOINC colleague Swapna Abhyankar led an effort to standardize the MIMIC data to LOINC. The LOINC mappings are now available in the MIMIC distribution.

Research Networks

Large-scale research networks have blossomed in the last few years. One reason is technological advances in informatics tools and common data models. Another is the funding opportunities that have spurred them. Examples include the Clinical and Translational Science Awards (CTSA) Program, the FDA’s Mini-Sentinel project, PCORnet, and Observational Health Data Sciences and Informatics (OHDSI).

If your institution has joined one of these initiatives, you may already have access to more resources than you realize. The data providers fulfill the role of a “safe harbor” or “honest broker” for the data, so researchers only have access to de-identified data.

The secret to the success of these networks is that the participating institutions commit to adopting a common technology platform. Many CTSA institutions use i2b2. Mini-Sentinel uses its own data model and the open source PopMedNet query tool. PCORnet created a common data model based on (but not the same as) the Mini-Sentinel approach.


OHDSI has a similar approach to these other networks in that its participants use a common data model (based on the OMOP data model) and a shared set of technologies for data analysis. Hripcsak et al. have also written a nice paper describing OHDSI’s vision and the opportunities available to researchers.

What’s unique about OHDSI is that it is an open community. You don’t need to have been selected by the grant/contract process. Their core objectives include, among others, these three great principles that match very closely to how we’re developing LOINC:

  • Community: Everyone is welcome to actively participate in OHDSI, whether you are a patient, a health professional, a researcher, or someone who simply believes in our cause.
  • Collaboration: We work collectively to prioritize and address the real world needs of our community’s participants.
  • Openness: We strive to make all our community’s proceeds open and publicly accessible, including the methods, tools and the evidence that we generate.

In addition to their common technology stack for running queries at distributed sites, OHDSI has developed an impressive set of software tools for data analytics, including clinical characterization, population level estimation, and patient prediction. All these tools are made freely available on GitHub under an open source license.

As a participant in OHDSI, it is possible to run the same analysis at multiple sites, over many data sets. The scope of data available is far greater than most scientists have access to within just their home institution. The Data Network page gives an overview of the data from participating institutions.

OHDSI uses vocabulary standards like LOINC to accomplish the data normalization needed for such analyses. You can read more about the approach they’ve taken here.
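
To make the idea of normalization concrete, here is a minimal sketch of how results carrying different local codes at different sites can be pooled once each local code is mapped to LOINC. This is not OHDSI's implementation or the OMOP schema. The site names, local codes, field names, and the `normalize` helper are all hypothetical; only the LOINC codes themselves are real.

```python
from collections import defaultdict

# Hypothetical mapping table: (site, local code) -> LOINC code.
# In practice this table is the product of a careful, reviewed mapping effort.
LOCAL_TO_LOINC = {
    ("site_a", "GLU"): "2345-7",       # Glucose [Mass/volume] in Serum or Plasma
    ("site_b", "GLUC-SER"): "2345-7",  # same test, different local code
    ("site_a", "HGB"): "718-7",        # Hemoglobin [Mass/volume] in Blood
}


def normalize(rows):
    """Split rows into those mapped to LOINC and those with no known mapping."""
    mapped, unmapped = [], []
    for row in rows:
        loinc = LOCAL_TO_LOINC.get((row["site"], row["local_code"]))
        if loinc is None:
            unmapped.append(row)  # keep for manual review rather than guessing
        else:
            mapped.append({**row, "loinc_code": loinc})
    return mapped, unmapped


# Made-up example results from two sites.
rows = [
    {"site": "site_a", "local_code": "GLU", "value": 98.0},
    {"site": "site_b", "local_code": "GLUC-SER", "value": 110.0},
    {"site": "site_a", "local_code": "HGB", "value": 13.5},
    {"site": "site_b", "local_code": "XYZ", "value": 1.0},
]

mapped, unmapped = normalize(rows)

# Once coded with LOINC, results from different sites can be pooled per test.
values_by_loinc = defaultdict(list)
for row in mapped:
    values_by_loinc[row["loinc_code"]].append(row["value"])

print(dict(values_by_loinc))
print(f"{len(unmapped)} row(s) left unmapped")
```

Setting unmapped rows aside, rather than forcing them onto a nearby code, keeps a bad guess from silently contaminating a pooled analysis.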

De-identifying your own data sets

You may also be considering de-identifying your own data to better support re-use. Much laboratory data is sent (and stored) as discrete results, which makes de-identification slightly easier. Identifying information in key fields (e.g. patient name) can be excluded.
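
As a rough illustration of excluding identifying fields from discrete results, the sketch below drops a set of named fields from each record. The field names and the `strip_identifiers` helper are hypothetical, the sample record is made up, and removing these fields alone should not be read as satisfying HIPAA de-identification requirements.

```python
# Hypothetical field names; a real laboratory export will use different ones.
IDENTIFYING_FIELDS = {"patient_name", "mrn", "date_of_birth", "address", "phone"}


def strip_identifiers(record, identifying_fields=IDENTIFYING_FIELDS):
    """Return a copy of one discrete lab result without the identifying fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in identifying_fields
    }


# Made-up example records.
results = [
    {
        "patient_name": "Jane Doe",
        "mrn": "000123",
        "date_of_birth": "1970-01-01",
        "loinc_code": "2345-7",
        "value": 98,
        "units": "mg/dL",
    },
    {
        "patient_name": "John Roe",
        "mrn": "000456",
        "loinc_code": "718-7",
        "value": 13.5,
        "units": "g/dL",
    },
]

deidentified = [strip_identifiers(record) for record in results]

for record in deidentified:
    print(record)
```

An allow-list variant, which keeps only fields known to be safe, is the more cautious design because any newly added field is excluded by default.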

However, some laboratory test results are still sent as narrative text. This is especially true in the emerging field of genetic testing, which presents another set of challenges for protecting privacy.

If you are interested in de-identifying clinical text, check out the NLM Scrubber software program. Running such a program on a clinical data set could make it easier to justify its use for research purposes to the IRB.


The promise of large-scale observational health databases is now a reality for medical researchers. With standardized laboratory data coded with LOINC, researchers can advance health by generating scientific evidence about disease history, healthcare delivery, the effects of interventions, and countless other questions.

You can’t use “but I don’t have access to…” as an excuse any longer. Go forth and analyze!