TiDE text deid
Introduction to TiDE
Parts of this page are reproduced from our manuscript. TiDE stands for Text DEidentification and is named after a popular detergent brand. The output of TiDE may still contain residual PHI, but that PHI is hard to identify, and the algorithm makes it difficult to re-identify patients.
TiDE is designed specifically to a) process hundreds of millions of notes in a cost-effective and timely manner, and b) preserve the note format for downstream NLP research. Furthermore, the TiDE pipeline prioritizes patient privacy, i.e., sensitivity in finding PHI matters more than specificity. We also use open source components for research transparency.
TiDE uses the Safe Harbor approach in accordance with NIST guidelines to meet the HHS HIPAA Privacy Rule. Our approach to de-identification is similar to that presented by Ferrández et al (link) in that we a) maximize patient confidentiality by redacting as much PHI as possible, even if this occasionally redacts non-PHI; and b) leave de-identified data in a usable state, preserving as much clinical information as possible.
Prior assessments of clinical text de-identification techniques by Ferrández et al show that it is difficult to find a single approach that performs well in all cases. Clinical narratives can be fragmented and lack formatting, which limits the usefulness of traditional Named Entity Recognition (NER) approaches pre-trained on newswire text. Rule-based techniques (e.g. pattern matching) do not scale well on their own. TiDE combines pattern matching techniques, machine learning-based NER and Hiding in Plain Sight (link). While each of the techniques is only effective part of the time, together they are highly effective most of the time.
In the first step, we find the locations of HIPAA identifiers. TiDE has three separate sub-modules that find the HIPAA identifiers: a) an NLP named entity recognition module that uses CoreNLP (link) and can find arbitrary combinations of street name, city name, state name and zip code; b) regular expression (regex) pattern matching against known patient PHI; and c) enumerated pattern matching for entities such as MRNs, SSNs (e.g. 123-45-6789), email addresses (e.g. john@example.org), IP addresses and URLs.
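As a rough illustration of the enumerated pattern matching step, the sketch below locates SSNs, email addresses, IP addresses and URLs with regular expressions and replaces them with a label. The patterns, the PATTERNS table and the function names find_identifiers and redact are assumptions made for this example; they are not taken from the TiDE code base and are far less thorough than production rules would need to be.

```python
import re

# Illustrative patterns only (assumed for this sketch, not TiDE's actual rules).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "URL": re.compile(r"\bhttps?://[^\s<>\"]+"),
}


def find_identifiers(text):
    """Return (start, end, label) spans for every pattern match, sorted by position."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def redact(text, spans):
    """Replace each span with its bracketed label, skipping overlapping spans."""
    pieces = []
    cursor = 0
    for start, end, label in spans:
        if start < cursor:
            continue  # span overlaps one that was already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "SSN 123-45-6789, email john@example.org, host 10.0.0.1"
print(redact(note, find_identifiers(note)))
# prints: SSN [SSN], email [EMAIL], host [IP]
```

Collecting spans first and rewriting the text in a second pass keeps the original character offsets valid, which matters when several detectors (NER, known-PHI matching, enumerated patterns) contribute spans for the same note.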
Why CoreNLP?
We selected CoreNLP, open source software from Christopher Manning's lab at Stanford (link), to identify names and addresses in the text. At the heart of CoreNLP is the Stanford Named Entity Recognizer (Stanford NER), also known as CRFClassifier. The software provides a general implementation of linear chain Conditional Random Field (CRF) sequence models. CRFs are known to perform well for clinical text de-identification (link).
Surrogate Approach
For names and places, which are two of the three most frequent PHI types in clinical text data, we use a surrogate approach (link), also known as Hiding in Plain Sight (HIPS), illustrated in Table S6 of our manuscript (reproduced below for convenience). At the time of writing, the name replacement is gender aware but not ethnicity aware. Addresses are replaced randomly.
The development of the surrogate database is illustrated in Figure S6.2 of our manuscript. We collate the following sources: a) Bay Area census with addresses (bayareacensus.ca.gov), b) Health Resources and Services Administration database with US addresses (data.hrsa.gov), c) FDA website with addresses (data.gov), d) Social Security database with baby names (ssa.gov), and e) Medicare services database with physician names (data.cms.gov).
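The surrogate replacement described above can be sketched as follows. This is a minimal illustration under stated assumptions: the tiny SURROGATE_NAMES and SURROGATE_ADDRESSES tables, the span format (start, end, kind, gender) and the function names are all hypothetical and stand in for the much larger surrogate database and for TiDE's own data structures.

```python
import random

# Hypothetical miniature surrogate tables (placeholders for the real database).
SURROGATE_NAMES = {
    "F": ["Maria", "Linda", "Susan"],
    "M": ["James", "Robert", "David"],
}
SURROGATE_ADDRESSES = [
    "100 Example Rd, Anytown, CA 90000",
    "200 Sample Ave, Othertown, NY 10000",
]


def pick_name(original, gender, rng):
    """Pick a surrogate name of the same gender that differs from the original."""
    candidates = [name for name in SURROGATE_NAMES[gender] if name != original]
    return rng.choice(candidates)


def apply_surrogates(text, spans, rng):
    """Replace detected spans with surrogates.

    spans is a list of (start, end, kind, gender) tuples, where kind is
    "NAME" or "ADDRESS" and gender is "F", "M" or None for addresses.
    Spans are processed right to left so earlier offsets stay valid.
    """
    for start, end, kind, gender in sorted(spans, key=lambda s: s[0], reverse=True):
        if kind == "NAME":
            replacement = pick_name(text[start:end], gender, rng)
        else:
            replacement = rng.choice(SURROGATE_ADDRESSES)  # addresses: random pick
        text = text[:start] + replacement + text[end:]
    return text


rng = random.Random(0)  # seeded only to make the example repeatable
note = "Patient Linda lives at 1 Main St."
name_at = note.index("Linda")
addr_at = note.index("1 Main St")
spans = [
    (name_at, name_at + len("Linda"), "NAME", "F"),
    (addr_at, addr_at + len("1 Main St"), "ADDRESS", None),
]
print(apply_surrogates(note, spans, rng))
```

Because every detected name is replaced by a realistic surrogate, any real name that the detectors miss is indistinguishable from the surrogates around it, which is the point of hiding in plain sight.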
Using TiDE in STARR OMOP database
The anonymized clinical text is stored in the NOTE table of the OMOP CDM STARR-OMOP-deid. In our workflow, we run the TiDE code using a distributed programming paradigm. TiDE can process approximately 100 million clinical notes in about 7 hours by deploying 800 dataflow workers in parallel, at a total cost of about 440 USD. This wall-clock time translates to roughly 0.00025 s/note, which is about three orders of magnitude less than the fastest recently reported process (0.24 s/note) by Heider et al.
Note that our method makes re-identification harder. It does not remove the possibility of leaked PHI. The STARR-OMOP-deid dataset is deemed a High Risk dataset due to the presence of occasional leaked PHI.
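The per-note figures follow directly from the numbers quoted above. The short calculation below reproduces them; the variable names are chosen for this example, and the inputs are the approximate values stated in the text (100 million notes, 7 hours, 440 USD, 0.24 s/note for the comparison).

```python
import math

NOTES = 100_000_000              # approximate number of notes processed
WALL_CLOCK_HOURS = 7             # approximate wall-clock run time
TOTAL_COST_USD = 440             # approximate total cost of the run
REPORTED_SECONDS_PER_NOTE = 0.24 # figure attributed to Heider et al. in the text

# Wall-clock seconds per note, amortized over the whole parallel run.
seconds_per_note = WALL_CLOCK_HOURS * 3600 / NOTES

# Cost per note in USD.
cost_per_note = TOTAL_COST_USD / NOTES

# Ratio to the reported figure, and the same ratio in orders of magnitude.
ratio = REPORTED_SECONDS_PER_NOTE / seconds_per_note
orders_of_magnitude = math.log10(ratio)

print(f"{seconds_per_note:.6f} s/note")        # 0.000252 s/note
print(f"{cost_per_note:.7f} USD/note")         # 0.0000044 USD/note
print(f"ratio {ratio:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```

The 0.00025 s/note figure is wall-clock time divided by the number of notes, so it reflects the throughput of the whole parallel deployment rather than the time a single worker spends on one note.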
Expert Determination of TiDE output
TiDE uses Safe Harbor approaches to de-identify free text. While it uses state-of-the-art named entity recognition and privacy-preserving surrogate approaches, there is a small possibility of leaked PHI. While the TiDE algorithms make it hard to identify residual PHI or re-identify patients, the output is, technically speaking, not free of residual PHI. Stanford UPO determines that TiDE output is High Risk.
Expert Determination is an alternative mechanism for achieving de-identification. We have requested such a determination under limited circumstances, for example when the dataset is static and further risk reduction strategies are implemented, such as removal of patients or encounters with rare conditions. Unfortunately, HIPAA standards, as they stand today, do not provide guidelines for Expert Determination where free text fields are concerned. At this time, if the dataset has free text fields, it is not feasible for Stanford UPO to declare that the dataset is de-identified by Expert Determination. This statement assumes that the algorithmic de-identification of free text has been achieved through state-of-the-art text processing techniques, including Machine Learning, Deep Learning and Natural Language Processing, and other privacy-preserving approaches such as Hiding in Plain Sight. We further assume that the dataset has been statistically de-risked using approaches such as k-anonymity or removal of patients with rare conditions.
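To make the k-anonymity idea mentioned above concrete, the sketch below flags records whose combination of quasi-identifiers is shared by fewer than k records. It is a generic illustration, not the de-risking procedure applied to STARR-OMOP-deid; the function name k_anonymity_violations and the field names age_band and zip3 are hypothetical.

```python
from collections import Counter


def k_anonymity_violations(records, quasi_identifiers, k):
    """Return the records whose quasi-identifier combination occurs fewer than k times."""
    keys = [tuple(record[field] for field in quasi_identifiers) for record in records]
    counts = Counter(keys)
    return [record for record, key in zip(records, keys) if counts[key] < k]


# Hypothetical structured fields; real quasi-identifiers depend on the dataset.
records = [
    {"age_band": "30-39", "zip3": "943", "condition": "influenza"},
    {"age_band": "30-39", "zip3": "943", "condition": "asthma"},
    {"age_band": "80-89", "zip3": "940", "condition": "rare disorder"},
]

for record in k_anonymity_violations(records, ["age_band", "zip3"], k=2):
    print(record)
# prints the single record with age_band 80-89 and zip3 940
```

Records returned by such a check are candidates for generalization or removal. Note that this kind of check operates on structured fields and says nothing about PHI that may remain in free text.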