
Research IT has developed a number of novel Big Data processing techniques in database transfer, de-identification and more.



Complex Event Processing (CEP) Engine, a real time framework for alerts

Complex Event Processing (CEP) is a real-time alerting system that connects clinical researchers with incoming patients who meet study or clinical trial criteria. CEP was launched in 2009, the year the HITECH Act was passed, and uses HL7 feeds. It has been instrumental in enabling timely clinical trial enrollment. Prior to CEP, researchers had to train the registration desk or the clinical staff on duty to call them whenever a patient meeting certain criteria was in the hospital. A Maternal Fetal Medicine department researcher stated that “Before alerts were developed, one of the team members needed to be "on call" nights and weekends and checking our participant log every few hours to make sure no one was missed”. Learn more about our real time alert solution. Here are two additional resources:

  • CEP: Implementing a Real-time Complex Event Stream Processing System to Help Identify Potential Participants in Clinical and Translational Research Studies. Weber SC, Lowe HJ, Malunjkar S, Quinn J. AMIA Annu Symp Proc. 2010 Nov 13;2010:472-6. (link)
  • Product specification (g-drive document link)

Researchers can request an alert via a technology consultation service.
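As a rough illustration of how an alert rule might be evaluated against an HL7 feed, the sketch below parses a minimal HL7 v2 admission message and checks it against study criteria. It is not the CEP implementation: the sample message, the location code "LND", the age range, and the names parse_adt and matches_criteria are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AdmitEvent:
    message_type: str   # MSH-9, e.g. "ADT^A01" (admit)
    sex: str            # PID-8
    birth_date: date    # PID-7
    location: str       # first component of PV1-3


# Hypothetical sample message; HL7 v2 separates segments with carriage returns.
SAMPLE_MESSAGE = "\r".join([
    r"MSH|^~\&|ADT|HOSP|CEP|RIT|202401150830||ADT^A01|MSG0001|P|2.3",
    "PID|1||12345^^^HOSP^MR||DOE^JANE||19900704|F",
    "PV1|1|I|LND^101^A",
])


def parse_adt(message: str) -> AdmitEvent:
    """Extract the few fields the example rule needs from an ADT message."""
    segments = {}
    for line in message.split("\r"):
        if line.strip():
            fields = line.split("|")
            segments[fields[0]] = fields
    msh, pid, pv1 = segments["MSH"], segments["PID"], segments["PV1"]
    dob = pid[7]
    return AdmitEvent(
        # In MSH the field separator itself counts as MSH-1,
        # so list index 8 holds MSH-9.
        message_type=msh[8],
        sex=pid[8],
        birth_date=date(int(dob[:4]), int(dob[4:6]), int(dob[6:8])),
        location=pv1[3].split("^")[0],
    )


def age_on(birth: date, today: date) -> int:
    """Age in completed years on the given day."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - int(before_birthday)


def matches_criteria(event: AdmitEvent, today: date) -> bool:
    """Hypothetical study rule: women aged 18 to 45 admitted to unit LND."""
    return (
        event.message_type == "ADT^A01"
        and event.sex == "F"
        and 18 <= age_on(event.birth_date, today) <= 45
        and event.location == "LND"
    )


event = parse_adt(SAMPLE_MESSAGE)
should_alert = matches_criteria(event, date(2024, 1, 15))
```

In a real deployment the rule would presumably run on every message as it arrives, and a match would notify the study team instead of setting a variable.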

Moving large databases

At Stanford, the pediatric and adult EHRs are multi-terabyte databases. STARR has access to backups of the operational Clarity databases at the two hospitals. To run downstream ETLs for multiple data models, we move the Clarity data to our Google Cloud data center, where we can use Big Data technologies for fast processing. We have developed a general purpose utility to export large databases in a compressed file format, AVRO. The AVRO files are subsequently loaded into BigQuery for downstream processing. This utility has since been open sourced.

  • Code: The goal of this GitHub project (db-to-avro) is to load database backups (e.g. Oracle Data Pump exports, MS SQL Server .bak files) into a temporary Docker container running the appropriate vendor database, and then export the tables within that restored database as .avro files for loading into other Big Data systems.

For illustration, we can do a full export of our Oracle adult Clarity in 5-6 hrs (~3M patients) and upload the AVRO to GCP in 45 minutes.

DICOM de-identification

Note that medical images are considered "unstructured" clinical data. The Stanford University Privacy Office (UPO) has determined that algorithmic de-identification of unstructured data results in a High Risk dataset. These datasets may contain small amounts of residual (leaked) PHI. DICOM is a common medical image standard, and DICOM files contain PHI in metadata as well as in pixel data.

Research IT has developed a highly sophisticated DICOM de-identification pipeline that leverages and extends MIRC CTP. With the increase in Artificial Intelligence (AI) driven approaches, researchers are requesting unprecedented volumes of medical imaging data, which far exceed the capacity of traditional on-premise client-server approaches for making the data analysis-ready. Our on-demand de-identification combines mature software technologies with modern cloud-based distributed computing techniques to enable faster turnaround in medical imaging research. The solution is part of a broader platform that supports a secure high performance data center and clinical data science platform.

  • Manuscript: High performance on-demand de-identification of a petabyte-scale medical imaging data lake, Mesterhazy J, Olson G, Datta S, Aug 2020, arXiv:2008.01827 (link)
  • Code: MIRC CTP optimization for DICOM de-identification (link)

If you need imaging data, please request a data consultation service.

Learn more about STARR DICOM de-identification

Clinical Text De-identification

Note that clinical notes are free text and are considered "unstructured" clinical data. Algorithmically de-identified unstructured datasets may contain small amounts of residual (leaked) PHI. Stanford UPO has determined that algorithmic de-identification of unstructured data results in a High Risk dataset.

Text De-identification (TiDE) combines pattern matching techniques with machine learning-based named entity recognition to find protected health information (PHI), and applies techniques such as Hiding in Plain Sight as an additional privacy enhancement strategy.

  • Manuscript: Please refer to the Supplementary, Section 6 of the manuscript for details on the method, our QA process and overall statistics on PHI found in Stanford EHR clinical text.
  • Code: TiDE is a free open-source text de-identification tool that can identify and de-identify PHI in clinical note text and other free text in medical data. This code can be deployed on your laptop or server. If you have more than 20,000 clinical text documents, you will need a distributed processing approach suited to your Big Data environment.

Note that de-identification using TiDE will leave a small number of PHI leaks. With Hiding in Plain Sight, it is hard to distinguish surrogates from leaked PHI. If you need custom de-identification of Stanford clinical text, please request a data consultation service.
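The pattern-matching and Hiding in Plain Sight ideas described in this section can be sketched in a few lines. This is a simplified illustration and not TiDE's code: the two regular expressions, the fixed surrogate values, and the function name deidentify are assumptions, and the machine learning-based named entity recognition step is omitted entirely.

```python
import re

# Hypothetical patterns for two PHI types; a real tool covers many more.
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

# Hiding in Plain Sight: replace detected PHI with realistic-looking
# surrogates instead of tags such as [PHONE], so that any PHI the
# detector missed is harder to tell apart from the surrogates.
# Fixed values are used here only to keep the example deterministic.
SURROGATES = {
    "PHONE": "650-555-0100",
    "DATE": "01/01/2000",
}


def deidentify(text: str):
    """Return the de-identified text and a list of (label, original) hits."""
    found = []
    for label, pattern in PATTERNS.items():
        def replace(match, label=label):
            found.append((label, match.group(0)))
            return SURROGATES[label]
        text = pattern.sub(replace, text)
    return text, found


clean_text, hits = deidentify("Seen on 3/14/2019, call 415-555-1234.")
```

A production system would typically draw surrogates at random (and shift dates consistently per patient) instead of using one fixed value per type.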

Learn more about TiDE de-identification

Zipcode de-identification

It is common practice to completely remove zipcode information from a non-human subject dataset such as OMOP-deid. With the zipcode, you can potentially look at social determinants of health. To support such studies, Research IT, in collaboration with the University Privacy Office (UPO), has made the five digit zipcode (precisely speaking, the ZCTA, see below) available in the de-identified STARR OMOP data. Now 78.5% of the patients retain their full five digit zipcodes. These are zipcodes with more than 20,000 people living in them. We are currently using the 2010 census but will move to 2020 as soon as feasible. For the rest of the patients, we retain the first three digits and set the remaining two to zero (***00).
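The retention rule described above can be written as a short function. This is a minimal sketch under stated assumptions: the function name deidentify_zip and the population figures in the usage lines are hypothetical, and treating a code with no census entry as below the threshold is a choice made for this example, not something the source specifies.

```python
def deidentify_zip(zip5: str, zcta_population: dict, threshold: int = 20000) -> str:
    """Keep the five digit code only if its ZCTA population exceeds the threshold.

    Otherwise keep the first three digits and set the last two to zero.
    Codes missing from the population table are treated as small (assumption).
    """
    population = zcta_population.get(zip5, 0)
    if population > threshold:
        return zip5
    return zip5[:3] + "00"


# Hypothetical population counts, for illustration only.
populations = {"94305": 25000, "94020": 1500}

kept = deidentify_zip("94305", populations)        # population above threshold
truncated = deidentify_zip("94020", populations)   # population below threshold
```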

Since OMOP is in Google BigQuery, one of the first places to look for useful datasets is the Google marketplace. You can further sub-select by Social Determinants of Health (SDOH). There are some interesting datasets, such as county-level income, county-level enrollment for supplemental nutritional assistance, federally designated areas with a shortage of healthcare workers, and more.

About ZCTA: During de-identification, we use the ZCTA to look up the census population. Where we claim to retain zip5, we are essentially retaining the ZCTA. In most cases, the zip5 and ZCTA are geographically close, but in a small number of cases they are geographically dissimilar. Note further that a zipcode represents a postal delivery route and can change over the course of a decade. The zip+4 changes frequently. The five digit zipcode changes infrequently. These zipcode changes happen due to post office openings, closures and boundary changes. So, for a small number of patients, the older five digit zipcode stored in the EHR may not represent the zipcode corresponding to the street address. Also note that for social determinants of health studies, the ZCTA is often too large an area and too socio-economically diverse. A smaller, more homogeneous sub-division is likely to be the census tract. But a census tract often has a very small population, e.g. ~4000.

About Census blocks and tracts: Research IT is able to generate census block and tract level information using third party software called deGauss (link). The population of a census block or tract is smaller than 20,000, which significantly increases the ability to re-identify an individual. Therefore, in order to access this information, the Data Use Agreement or IRB must explicitly allow for it. If you need census data, please request a consultation service.

Text Processing

In the STARR-OMOP dataset, aside from the standard encounter tables in the OMOP CDM, we populate the clinical notes and note annotations (NOTE_NLP). For mapping clinical text to concepts, we use a pipeline developed by LePendu et al. that incorporates both negation detection and history detection. These contextual cues are based on NegEx and ConText and enable us to discern whether a term should not be attributed to the patient's current status (e.g., lack of valvular dysfunction, or sister has muscular dystrophy).

Please refer to our manuscript, Supplementary Material, “Section 7: Processing clinical text to identify known medical concepts”. When the terms are ambiguous such as “pad” (Peripheral Artery Disease or Pampers), you need to augment your queries to disambiguate in the context of the patient. Furthermore, the method doesn’t summarize the note or the patient’s longitudinal history.
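To make the idea of contextual cues concrete, the following sketch checks whether a trigger phrase appears before a term in a sentence. It is a much simplified illustration in the spirit of NegEx and ConText, not the pipeline used for STARR-OMOP: the trigger lists, the labels, and the function name classify_mention are assumptions, and real implementations also limit the scope of a trigger and handle triggers that follow the term.

```python
import re

# Small, hypothetical trigger lists; published lexicons are far larger.
NEGATION_TRIGGERS = ("no", "not", "denies", "without", "lack of")
EXPERIENCER_TRIGGERS = ("sister", "brother", "mother", "father", "family history of")


def _has_trigger(text: str, triggers) -> bool:
    """True if any trigger occurs in the text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text) for t in triggers)


def classify_mention(sentence: str, term: str) -> str:
    """Label a term as negated, about someone else, affirmed, or not mentioned."""
    lowered = sentence.lower()
    start = lowered.find(term.lower())
    if start == -1:
        return "not_mentioned"
    preceding = lowered[:start]
    if _has_trigger(preceding, NEGATION_TRIGGERS):
        return "negated"
    if _has_trigger(preceding, EXPERIENCER_TRIGGERS):
        return "not_patient"
    return "affirmed"


first = classify_mention("Lack of valvular dysfunction.", "valvular dysfunction")
second = classify_mention("Sister has muscular dystrophy.", "muscular dystrophy")
```

Only mentions labeled as affirmed would be attributed to the patient's current status in this simplified scheme.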

Flowsheet data

We have mapped several vital-sign flowsheets and integrated the mappings into the OMOP measurement table. Our manuscript describes the method and its impact on the CDM.

In OMOP, we present two approaches to the integration of flowsheet measures. Our first approach was computationally straightforward but of potentially limited research utility. The second approach is far more computationally and labor intensive and involved mapping to standardized terms in controlled clinical vocabularies such as Logical Observation Identifiers Names and Codes (LOINC), resulting in a research data set of higher utility to population health studies. The newly included mappings cover vitals such as blood pressure, oxygen level, heart rate and respiratory rate, as well as measurements from the Sequential Organ Failure Assessment (SOFA) score, Glasgow Coma Scale score, Deterioration Index score, etc.
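The second approach can be pictured as a lookup from local flowsheet row names to LOINC codes. The sketch below is illustrative only: the flowsheet row names, the "120/80" blood pressure format, and the function name to_measurements are assumptions, and the further step of resolving each LOINC code to an OMOP concept ID is left out.

```python
# Hypothetical local row names mapped to LOINC codes for common vitals.
LOINC_BY_ROW = {
    "pulse": ("8867-4", "Heart rate"),
    "resp": ("9279-1", "Respiratory rate"),
    "bp_systolic": ("8480-6", "Systolic blood pressure"),
    "bp_diastolic": ("8462-4", "Diastolic blood pressure"),
}


def to_measurements(row_name: str, raw_value: str) -> list:
    """Convert one flowsheet entry into zero or more measurement records."""
    key = row_name.strip().lower()
    if key == "bp":
        # Assumes blood pressure is recorded as "systolic/diastolic".
        systolic, diastolic = raw_value.split("/")
        pairs = [("bp_systolic", systolic), ("bp_diastolic", diastolic)]
    elif key in LOINC_BY_ROW:
        pairs = [(key, raw_value)]
    else:
        return []  # unmapped rows are skipped in this sketch
    return [
        {
            "loinc_code": LOINC_BY_ROW[name][0],
            "loinc_name": LOINC_BY_ROW[name][1],
            "value_as_number": float(value),
            "measurement_source_value": row_name,
        }
        for name, value in pairs
    ]


records = to_measurements("BP", "120/80")
```

Splitting a combined blood pressure entry into two records reflects that LOINC codes systolic and diastolic pressure separately.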