STARR Overview

STARR functions include Big Data ingestion, storage and processing, advanced data transformations, self-service tools and end-user services. 


AI in Medicine and STARR evolution

Stanford School of Medicine (SoM) efforts to explore the use of Artificial Intelligence (AI) in Medicine (AIM) started in the 1970s with the Stanford University Medical EXperimental computer for Artificial Intelligence in Medicine (SUMEX-AIM) project. SUMEX-AIM was a national computer resource (1973-1992), funded by NIH to promote applications of AI to biological and medical problems and available over ARPANET, one of the early communications networks of the 1970s. The report “Seeds of Artificial Intelligence”, now a historical artifact, won an award from the National Association of Government Communicators as the top federal technical publication of 1980. The SUMEX-AIM resource resided administratively within the SoM, provided computing facilities specifically tuned to the needs of AI research, and developed many tools for encouraging and facilitating community relationships among collaborating projects and medical researchers.

Fast forward to 2003: ahead of the HITECH Act of 2009, SoM decided to invest in a research Clinical Data Warehouse (r-CDW). This effort resulted in the launch of the HIPAA-compliant STRIDE platform (An Integrated Standards-Based Translational Research Informatics Platform) in 2008. The STRIDE platform comprises a database in an in-house data model, a set of self-service tools, a cohort tool and a chart review tool. Fast forward to 2017: STARR was launched to support modern AI in Medicine efforts at Stanford.

STRIDE, now rebranded as STARR Tools, is going strong at Stanford and continues to serve 600-800 researchers annually. It was migrated to the cloud in 2021 and redesigned to use a cloud-native architecture. In addition to STARR Tools, the STARR portfolio now includes other r-CDWs, cohort tools and data types. Our long-term goal is to expand STARR Tools to support self-service access to the OMOP data model and, potentially, other data modalities.

STARR Design

In 2017, SoM embarked on the journey of re-imagining our research Clinical Data Warehouse and the related ecosystem of services. Research IT, as the data steward of SoM, was given the task of building STARR. We looked to data stewardship for lessons learned, specifically complex efforts to manage and analyze data such as the Genomic Data Commons and Sage Bionetworks' Synapse platform. What emerged was the need for a comprehensive one-stop data science platform where researchers can:

  • access multi-modal linked data seamlessly, 
  • do computation on Big Data, and 
  • access services such as data and technology consultations, training and support.

STARR encompasses these functions and more. STARR is also a product of multiple collaborations. We partnered with Google to build a cloud data center to support Big Data. We partnered with the Hospitals to bring Big Data to the cloud. We partnered with UIT to build a HIPAA-compliant shared research platform, Nero, where users work securely on these Big Data sets. We partnered with UPO and ISO to streamline privacy- and security-related processes between Nero and the Research IT cloud. And finally, we co-created STARR with our researchers.

STARR functions

STARR ingests data (A) from multiple sources. These include: A.1) streams such as HL7; A.2) bulk or batch ingestion, such as DICOM images from the VNA or the SHC Epic Clarity database; and A.3) ad hoc ingestion, such as Query/Retrieve of echocardiograms from Cardiology Syngo systems. As the data lands in Research IT's HIPAA-compliant data center (B), and soon afterwards, it goes through a number of basic processing steps such as decompression (e.g., unzip), re-organization (e.g., Avro to BigQuery), cataloging and provenance management. Once the data is ready, the bulk of the data processing steps (C) kick in. For these steps, Research IT uses a common software stack covering data compliance, the ETL framework and so on. The output of the processing is a series of analysis-ready artifacts (D). For example: D.1) metadata databases, generated from bedside monitoring data so that we can subsequently create data cuts; D.2) data blobs, including analysis-ready de-identified waveforms in PhysioNet format; and D.3) data warehouses, including identified OMOP and in-house databases. Accessing some of these artifacts requires a data steward's help via consultation services. The final user-accessible consumables (E) are: E.1) office hours and training; E.2) self-service tools such as cohort tools, chart review tools and direct access pre-IRB databases; and E.3) consultation services.
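
As a reading aid, the flow from ingestion (A) to consumables (E) can be sketched as a small Python model. This is only an illustration of the stage ordering described above, not STARR's actual software: the names IngestionMode, Source, STAGES and trace are hypothetical, and the stage descriptions are paraphrased from the text.

```python
from dataclasses import dataclass
from enum import Enum


class IngestionMode(Enum):
    """Hypothetical labels for the three ingestion paths (A.1 to A.3)."""
    STREAM = "A.1"  # continuous feeds, e.g. HL7
    BATCH = "A.2"   # bulk or batch loads, e.g. DICOM images
    ADHOC = "A.3"   # on-demand pulls, e.g. Query/Retrieve of echocardiograms


@dataclass(frozen=True)
class Source:
    """A data source and the way it is ingested (illustrative only)."""
    name: str
    mode: IngestionMode


# Stages B through E in the order the text describes them.
# The descriptions are paraphrases, not an official specification.
STAGES = (
    ("B", "landing: decompress, re-organize, catalog, record provenance"),
    ("C", "main processing: data compliance and ETL"),
    ("D", "analysis-ready artifacts: metadata databases, data blobs, data warehouses"),
    ("E", "consumables: office hours and training, self-service tools, consultations"),
)


def trace(source: Source) -> list[str]:
    """Return the labelled steps a source passes through, from A to E."""
    steps = [f"{source.mode.value} ingest {source.name} ({source.mode.name.lower()})"]
    steps += [f"{label} {description}" for label, description in STAGES]
    return steps


if __name__ == "__main__":
    for step in trace(Source("HL7 feed", IngestionMode.STREAM)):
        print(step)
```

Running the script prints one line per stage for a streamed HL7 feed. Swapping in IngestionMode.BATCH or IngestionMode.ADHOC changes only the first step, which mirrors the description: the ingestion path differs by source, while stages B through E are shared.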