TiDE text deid
Introduction to TiDE
Parts of this page are reproduced from our manuscript. TiDE stands for Text DEidentification and is named after a popular detergent brand. It is designed specifically to a) process hundreds of millions of notes in a cost-effective and timely manner, and b) preserve the note format for downstream NLP research. Furthermore, the TiDE pipeline prioritizes patient privacy, i.e., sensitivity in finding PHI matters more than specificity. We also use open-source components for research transparency.
Research IT de-identifies clinical text in accordance with NIST guidelines to meet the HHS HIPAA Privacy Rule. In particular, we use the Safe Harbor approach. Our approach is similar to the one presented by Ferrández et al. (link) in that we a) maximize patient confidentiality by redacting as much PHI as possible, at the risk of accidentally redacting non-PHI; and b) leave de-identified data in a usable state, preserving as much clinical information as possible.
Prior assessments of clinical text de-identification techniques by Ferrández et al. show that it is difficult to find a single approach that performs well in all cases. Clinical narratives can be fragmented and lack formatting, which limits the usefulness of traditional Named Entity Recognition (NER) models pre-trained on newswire text. Rule-based techniques (e.g. pattern matching) do not scale well. TiDE combines pattern matching, machine-learning-based NER, and Hiding in Plain Sight (link). While each technique is effective only part of the time, together they are highly effective most of the time.
In the first step, we find the locations of HIPAA identifiers. TiDE has three separate sub-modules for this: a) an NLP named entity recognition module that uses CoreNLP (link) and can find arbitrary combinations of street name, city name, state name and zip code; b) regular expression (regex) pattern matching against known patient PHI; and c) enumerated pattern matching for entities such as MRNs, SSNs (e.g. 123-45-6789), email addresses (e.g. firstname.lastname@example.org), IP addresses and URLs.
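The enumerated pattern matching in sub-module c) can be sketched as follows. This is a minimal illustration, not TiDE's actual rule set: the regular expressions, the PATTERNS table, and the helper names find_identifiers and redact are all assumptions made for the example, and the redaction helper assumes the spans do not overlap.

```python
import re

# Illustrative patterns only (assumptions, not TiDE's real rules).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "URL": re.compile(r"\bhttps?://[^\s<>\"]+"),
}


def find_identifiers(text):
    """Return (start, end, label) spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text, spans):
    """Replace each span with a [LABEL] tag.

    Works right to left so earlier offsets stay valid.
    Assumes the spans do not overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + f"[{label}]" + text[end:]
    return text


note = "SSN 123-45-6789, contact jane.doe@example.org from 10.0.0.1"
spans = find_identifiers(note)
print(redact(note, spans))
```

Locating spans first and transforming the text afterwards keeps detection separate from replacement, so spans from several detectors could be pooled before any text is changed.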
We selected CoreNLP, open-source software from Christopher Manning's lab at Stanford (link), to identify names and addresses in the text. At the heart of CoreNLP is the Stanford Named Entity Recognizer (Stanford NER), also known as CRFClassifier. The software provides a general implementation of linear-chain Conditional Random Field (CRF) sequence models. CRFs are known to perform well for clinical text de-identification (link).
For names and places, which are two of the three most frequent PHI types in clinical text data, we use a surrogate approach (link), also known as Hiding in Plain Sight (HIPS), illustrated in Table S6 of our manuscript (reproduced below for convenience). At the time of writing, name replacement is gender aware but not ethnicity aware. Addresses are replaced randomly.
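The surrogate idea can be sketched as follows: a detected name is swapped for a different name of the same gender, and a detected address for a random address. This is a simplified illustration under stated assumptions. The tiny SURROGATE_NAMES and SURROGATE_ADDRESSES pools, the span format, and the helper names are hypothetical and do not reflect TiDE's actual surrogate database or code.

```python
import random

# Hypothetical surrogate pools for illustration only.
SURROGATE_NAMES = {
    "F": ["Maria", "Linda", "Susan"],
    "M": ["James", "Robert", "David"],
}
SURROGATE_ADDRESSES = ["12 Oak St, Springfield", "900 Pine Ave, Riverton"]


def surrogate_for(label, original, gender=None, rng=random):
    """Pick a replacement for one detected entity."""
    if label == "NAME":
        # Gender aware: draw from the matching pool, else from all names.
        pool = SURROGATE_NAMES.get(gender) or [
            name for names in SURROGATE_NAMES.values() for name in names
        ]
        # Avoid returning the original name when an alternative exists.
        candidates = [name for name in pool if name != original] or pool
        return rng.choice(candidates)
    if label == "ADDRESS":
        # Addresses are replaced at random.
        return rng.choice(SURROGATE_ADDRESSES)
    # Any other label falls back to a plain tag.
    return f"[{label}]"


def apply_surrogates(text, spans, rng=random):
    """Replace (start, end, label, gender) spans, assumed non-overlapping."""
    for start, end, label, gender in sorted(spans, key=lambda s: s[0], reverse=True):
        replacement = surrogate_for(label, text[start:end], gender, rng)
        text = text[:start] + replacement + text[end:]
    return text


rng = random.Random(0)  # fixed seed so the example is repeatable
note = "Patient Maria seen at clinic."
print(apply_surrogates(note, [(8, 13, "NAME", "F")], rng))
```

Because the output contains realistic-looking names, any name the detectors missed is harder to tell apart from the surrogates, which is the point of hiding in plain sight.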
The development of the surrogate database is illustrated in Figure S6.2 of our manuscript. We collate the following sources: a) the Bay Area Census with addresses (bayareacensus.ca.gov), b) the Health Resources and Services Administration database with US addresses (data.hrsa.gov), c) the FDA website with addresses (data.gov), d) the Social Security database with baby names (ssa.gov), and e) the Medicare services database with physician names (data.cms.gov).
Using TiDE in STARR OMOP database
The anonymized clinical text is stored in the NOTE table of the OMOP CDM STARR-OMOP-deid. In our workflow, we run the TiDE code using a distributed programming paradigm. TiDE can process approximately 100 million clinical notes in about 7 hours by deploying 800 dataflow workers in parallel, at a total cost of $440. The wall-clock processing time translates to roughly 0.00025 s/note, which is about 3 orders of magnitude less than the recently reported fastest process (0.24 s/note) by Heider et al.
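The throughput figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section. The per-worker figure is an added illustration that assumes all 800 workers are busy for the entire run.

```python
# Figures quoted in this section.
notes = 100_000_000
wall_clock_hours = 7
workers = 800
total_cost_usd = 440
reported_seconds_per_note = 0.24  # Heider et al., as cited above

# Wall-clock time per note across the whole cluster.
seconds_per_note = wall_clock_hours * 3600 / notes

# Ratio to the previously reported figure.
speedup = reported_seconds_per_note / seconds_per_note

# Cost per million notes.
cost_per_million_usd = total_cost_usd / (notes / 1_000_000)

# Assumption: every worker is busy for the full run.
worker_seconds_per_note = seconds_per_note * workers

print(f"{seconds_per_note:.6f} s/note wall clock")
print(f"{speedup:.0f}x faster than 0.24 s/note")
print(f"${cost_per_million_usd:.2f} per million notes")
print(f"{worker_seconds_per_note:.4f} worker-seconds per note")
```

The ratio comes out just under 1000, which is why the comparison is described as about 3 orders of magnitude. The 0.00025 s/note figure is cluster-wide wall-clock throughput, not the time a single worker spends on a note.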
Note that our method makes re-identification harder but does not remove the possibility of leaked PHI. The STARR-OMOP-deid dataset is deemed a High Risk dataset because of occasional leaked PHI.