
Research IT's DICOM de-identification pipeline can deID both DICOM metadata and pixel PHI, and can do so at scale.

DICOM deid


DICOM "deID" PHI Scrubbing

DICOM is a common medical imaging standard, and Stanford UPO considers the data type "unstructured" medical data. Algorithmic de-identification may leave residual PHI behind. Output of our pipeline has been human-reviewed for more than 250,000 DICOMs, and no residual PHI has been found. Even so, UPO classifies our algorithmically de-identified images as High Risk, i.e., there is a small chance of a PHI leak.

Stanford has an estimated >5 million radiology imaging studies and >1 billion DICOM images across a variety of modalities such as X-ray, CT, and MRI. This is ~2 PB of data from the two hospitals, Children’s and Adult. The raw data is stored in its original format and includes PHI. Research IT has developed a highly scalable, on-demand DICOM PHI scrubbing ("de-id") pipeline. The pipeline can remove PHI from the DICOM header (metadata) as well as scrub PHI from the pixel data. Here are some additional resources:

  • Manuscript: High performance on-demand de-identification of a petabyte-scale medical imaging data lake, Mesterhazy J, Olson G, Datta S, Aug 2020, arXiv:2008.01827 (link)
  • Code: MIRC CTP optimization for DICOM de-identification (link)

To request imaging data linked to the EHR, please request a data consultation service.

Output of the "de-identification" PHI scrubbing pipeline:

The STARR DICOM PHI scrubbing pipeline is flexible and can support different scrubbing configurations. By default, we redact nothing from the PACS output if the requestor asks for PHI data. The following changes are made if a researcher asks for a de-identified High Risk dataset:

  • All image types:
    • Structured Reports and scanned paper documents are excluded. This includes encapsulated PDF reports, scanned intake forms, outside study alerts, etc. 
    • Free-text fields (e.g., study description, series description, patient notes) are removed.
    • Private (i.e., non-standard) DICOM tags are removed.
    • Image overlays are removed.
    • All primary and secondary identifiers (e.g., MRN, Study/Series/Instance UIDs, accession numbers) are replaced by new values that are consistent throughout the dataset.
    • All timestamps are “jittered” by up to +/- 30 days (never zero) in a manner that preserves the original ordering.
  • X-Ray
    • CR/DR images may have regions containing PHI replaced by black pixels, after which the image is recompressed using the JPEG Lossless transfer syntax. The rules used for pixel blanking can be found on GitHub.
    • Scanned x-ray film images (i.e., analog film converted to digital) are excluded.
  • CT/PET
    • Secondary or derived series (excluding fusion) are excluded. This includes dose reports, reformats, 3D reconstructions, etc. 
    • CT+PET fusion series from the most common manufacturers (GE, Siemens, Philips) will have regions containing PHI replaced by black pixels, after which the image is recompressed using the JPEG Lossless transfer syntax. The rules used for image blanking can be found on GitHub.
  • MRI
    • Secondary or derived series are excluded, including 3D reconstructions.
       
  • Mammography
    • Support for devices from common manufacturers including: Carestream, Fuji, GE, Hologic, Kodak, Lorad, Senograph and Siemens.
    • Reconstructions, 3D, CAD and screen captures are excluded.
       
  • Ultrasound (echocardiograms are handled similarly)
    • Images may have regions containing PHI replaced by black pixels, after which the image is recompressed using the JPEG Lossless transfer syntax. The rules used for pixel blanking can be found on GitHub.
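
As a rough illustration of the identifier replacement and timestamp jitter listed under "All image types", the Python sketch below derives replacement values from a keyed hash and shifts every date for a patient by one fixed offset. It is not the pipeline's actual implementation: the key, the function names, and the choice of a single offset per patient are assumptions made for illustration. A single offset per patient is one simple way to keep the original ordering, because every interval between that patient's dates is unchanged.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical key; a real deployment would keep a secret key per dataset.
SECRET_KEY = b"example-dataset-key"


def _digest(kind: str, value: str) -> bytes:
    """Keyed hash: stable within a dataset, not guessable without the key."""
    return hmac.new(SECRET_KEY, f"{kind}:{value}".encode(), hashlib.sha256).digest()


def pseudonym(kind: str, value: str) -> str:
    """Map an identifier (MRN, accession number, ...) to a stable replacement."""
    return _digest(kind, value).hex()[:16]


def jitter_days(patient_id: str) -> int:
    """Pick one offset in [-30, 30], never 0, fixed for a given patient."""
    n = int.from_bytes(_digest("jitter", patient_id)[:8], "big") % 60  # 0..59
    return n - 30 if n < 30 else n - 29  # -30..-1 or 1..30


def shift_date(patient_id: str, original: date) -> date:
    """Apply the patient's fixed offset to a date."""
    return original + timedelta(days=jitter_days(patient_id))


# Two studies four days apart stay four days apart after shifting.
first = shift_date("patient-1", date(2020, 1, 1))
second = shift_date("patient-1", date(2020, 1, 5))
print((second - first).days)
```

Because the same input always yields the same replacement, an MRN that appears in several studies maps to one new value throughout the dataset, which is the consistency property described above.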
       

Except when explicitly noted above, image pixel data is unchanged and the original transfer syntax is preserved. However, in the rare case that the original image is stored as lossy JPEG and has PHI in the pixel data that needs to be removed, the PHI region of the image is blurred (instead of blacked out) by modifying the JPEG data directly. This avoids loss of image fidelity in regions that do not contain PHI.
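
The pixel blanking applied to X-ray, fusion, and ultrasound images can be pictured as zeroing out rectangles of the pixel array. The sketch below shows only that idea on a plain 2D list of grayscale values. The `(left, top, width, height)` rectangle format and the `blank_regions` name are assumptions for illustration; the actual per-device rules are the ones published on GitHub, and this sketch does not cover the JPEG recompression or the lossy-JPEG blurring case.

```python
from typing import List, Tuple

# Assumed rectangle format: (left, top, width, height) in pixels.
Rect = Tuple[int, int, int, int]


def blank_regions(pixels: List[List[int]], regions: List[Rect]) -> List[List[int]]:
    """Return a copy of a 2D grayscale image with each region set to 0 (black)."""
    out = [row[:] for row in pixels]
    height = len(out)
    width = len(out[0]) if out else 0
    for left, top, w, h in regions:
        # Clip each rectangle to the image bounds before blanking.
        for y in range(max(top, 0), min(top + h, height)):
            for x in range(max(left, 0), min(left + w, width)):
                out[y][x] = 0
    return out


image = [[9] * 6 for _ in range(4)]              # dummy image, 6 wide by 4 high
scrubbed = blank_regions(image, [(0, 0, 6, 1)])  # blank a banner across the top row
```

Pixels outside the listed rectangles are copied through untouched, which mirrors the statement that image data is unchanged except where PHI is removed.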

Note that the above exclusions are an aggressive baseline for avoiding disclosure of PHI. If you need a specific image sub-type that is not currently supported (e.g., CT dose screen captures), let us know and we will investigate the feasibility of accommodating your request.

Researchers can also mix and match options, e.g., retain PHI in the metadata but scrub the pixels and remove PDFs. Any such combination should be specified when the request is submitted.

Directory structure

The default directory structure for PHI datasets is: /dicom/${PatientID}/${AccessionNumber}/${SeriesDescription}/${InstanceNumber}-${SOPInstanceUID}.dcm

The default directory structure for de-identified datasets (all identifiers in the path are their de-identified replacements) is: /dicom/${PatientID}/${AccessionNumber}/${SeriesInstanceUID}/${InstanceNumber}-${SOPInstanceUID}.dcm

The directory structure is flexible, and supports a subset of DICOM tags which can be used to create the path. The supported tags are PatientID, AccessionNumber, StudyDate, StudyDescription (PHI-only), SeriesDate, SeriesDescription (PHI-only), SeriesInstanceUID, SOPInstanceUID, InstanceNumber. 

When providing a custom directory structure, take care that the resulting path is unique for every image.
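
Before submitting a custom directory structure, it can help to check it against a sample of tag values. The sketch below uses Python's `string.Template`, which happens to share the `${Tag}` placeholder syntax shown above, to reject unsupported tags and to detect path collisions. The function names and the sample tag values are hypothetical, and this is not how the pipeline itself expands paths.

```python
from string import Template
from typing import Dict, List, Set

SUPPORTED_TAGS = {
    "PatientID", "AccessionNumber", "StudyDate", "StudyDescription",
    "SeriesDate", "SeriesDescription", "SeriesInstanceUID",
    "SOPInstanceUID", "InstanceNumber",
}
# Per the list above, these two are available for PHI datasets only.
PHI_ONLY_TAGS = {"StudyDescription", "SeriesDescription"}

DEID_LAYOUT = (
    "/dicom/${PatientID}/${AccessionNumber}/${SeriesInstanceUID}/"
    "${InstanceNumber}-${SOPInstanceUID}.dcm"
)


def template_tags(layout: str) -> Set[str]:
    """Names of all $name / ${name} placeholders used in the layout."""
    found = set()
    for match in Template.pattern.finditer(layout):
        name = match.group("named") or match.group("braced")
        if name:
            found.add(name)
    return found


def build_paths(layout: str, images: List[Dict[str, str]], deidentified: bool = True) -> List[str]:
    """Expand the layout for each image; fail on bad tags or duplicate paths."""
    allowed = SUPPORTED_TAGS - PHI_ONLY_TAGS if deidentified else SUPPORTED_TAGS
    unknown = template_tags(layout) - allowed
    if unknown:
        raise ValueError(f"unsupported tags in layout: {sorted(unknown)}")
    paths = [Template(layout).substitute(tags) for tags in images]
    if len(set(paths)) != len(paths):
        raise ValueError("layout does not give every image a unique path")
    return paths


# Made-up tag values for two images in the same series.
sample = [
    {"PatientID": "P1", "AccessionNumber": "A1", "SeriesInstanceUID": "1.2",
     "InstanceNumber": "1", "SOPInstanceUID": "1.2.1"},
    {"PatientID": "P1", "AccessionNumber": "A1", "SeriesInstanceUID": "1.2",
     "InstanceNumber": "2", "SOPInstanceUID": "1.2.2"},
]
paths = build_paths(DEID_LAYOUT, sample)
```

A layout such as `/dicom/${PatientID}.dcm` would fail the uniqueness check for this sample, since both images belong to the same patient. Including `SOPInstanceUID` in the file name is the simplest way to avoid that.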

Other associated data:

In addition to the delivered imaging data, a JSON manifest file is created for each study (under the path /manifests). Each manifest contains output from the anonymizer, including the image paths and how the images were modified from the original.