DICOM metadata transformation in healthcare data solutions

Note

This content is currently being updated.

This article explains how the healthcare data solutions environment extracts and transforms DICOM metadata across different lakehouse levels. You can also learn about the end-to-end metadata transformation process and understand the transformation mapping at each level.

The metadata transformation through the ingestion pipeline consists of the following three consecutive stages:

  1. Extraction and transformation of DICOM metadata to bronze delta table
  2. Metadata transformation from bronze to silver delta table
  3. Metadata transformation from silver to gold delta table

The following sections detail the transformation mapping for each stage.

Transformation mapping for DICOM metadata to bronze delta table

There are more than 5000 DICOM tags defined by the DICOM standard, including vendor-specific private tags. This section identifies which tags are retrieved and explains the extraction process in the bronze lakehouse.

The tag extraction process performs the following actions:

  1. Extraction from DICOM files: Extract a collection of all the tags from the DICOM (DCM) files in the optimized folder structure in the bronze lakehouse.

  2. Pixel data tag exclusion: Exclude the DICOM pixel data tag (7FE0,0010) and the image pixel data module attributes from the collection. The DICOM pixel data tag includes image/pixel-level details.

  3. JSON mapping: Map all the extracted DICOM tags into a JSON structure of key-value pairs in the following schema:

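    from pyspark.sql.types import ArrayType, MapType, StringType, StructField, StructType

    # Each map key identifies a DICOM tag; the value holds its VR and string values.
    METADATA_JSON_DICT_SCHEMA = MapType(
        StringType(),
        StructType([
            StructField("vr", StringType(), True),
            StructField("Value", ArrayType(StringType(), True), True),
        ]),
    )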
    

    These key-value JSON pairs are written to the metadata column in the bronze lakehouse's dicomimagingmetastore delta table.

  4. Extraction and mapping to bronze lakehouse: Further extract the following 30 DICOM tags and write them to the respective destination columns in the dicomimagingmetastore delta table:

    Source DICOM tag Destination column
    (0020,000D) [studyinstanceuid]
    (0010,0010) [patientname]
    (0010,0020) [patientid]
    (0010,0030) [patientbirthdate]
    (0010,0040) [patientsex]
    (0008,0050) [accessionnumber]
    (0008,0090) [referringphysicianname]
    (0008,0020) [studydate]
    (0008,1030) [studydescription]
    (0008,0061) [modalitiesinstudy]
    (0020,000E) [seriesinstanceuid]
    (0008,0060) [modality]
    (0040,0244) [performedprocedurestepstartdate]
    (0008,1090) [manufacturermodelname]
    (0008,0018) [sopinstanceuid]
    (0008,0030) [studytime]
    (0008,0096) [referringphysicianidentificationsequence]
    (0008,0201) [timezoneoffsetfromutc]
    (0020,1206) [numberofstudyrelatedseries]
    (0020,1208) [numberofstudyrelatedinstances]
    (0020,0011) [seriesnumber]
    (0008,103E) [seriesdescription]
    (0020,1209) [numberofseriesrelatedinstances]
    (0018,0015) [bodypartexamined]
    (0020,0060) [laterality]
    (0008,0021) [seriesdate]
    (0008,0031) [seriestime]
    (0008,0016) [sopclassuid]
    (0020,0013) [instancenumber]
    (0042,0010) [documenttitle]

  5. Execution time logging: The notebook's execution date and time are written to the created_date column in the dicomimagingmetastore delta table.

  6. DCM file path storage: The full file path for the DCM file is written to the filepath column in the dicomimagingmetastore delta table.
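
The steps above can be sketched in plain Python. This is a minimal illustration and not the pipeline's actual code. It assumes the tags were already read from the DCM file into a dictionary keyed by the eight-character hexadecimal tag. The names `build_bronze_row`, `PROMOTED_TAGS`, and `EXCLUDED_TAGS`, the sample file path, and the choice to keep only the first value of a promoted tag are all hypothetical. Only four of the 30 promoted tags are listed, and the exclusion set contains only the pixel data tag, not the image pixel module attributes.

```python
from datetime import datetime, timezone

# Hypothetical subset of the 30 promoted tags (tag -> destination column).
PROMOTED_TAGS = {
    "0020000D": "studyinstanceuid",
    "00100020": "patientid",
    "00080060": "modality",
    "00080018": "sopinstanceuid",
}

# Pixel data tag (7FE0,0010). The image pixel module attributes that the
# real process also excludes are left out of this sketch.
EXCLUDED_TAGS = {"7FE00010"}


def build_bronze_row(tags, filepath):
    """Build one bronze row from tags shaped like {tag: {"vr": ..., "Value": [...]}}."""
    # Step 2: drop excluded tags. Step 3: keep the rest as the metadata map.
    metadata = {tag: el for tag, el in tags.items() if tag not in EXCLUDED_TAGS}
    row = {"metadata": metadata}
    # Step 4: copy selected tags into their own columns.
    # Assumption: only the first value of a multi-valued tag is kept.
    for tag, column in PROMOTED_TAGS.items():
        values = metadata.get(tag, {}).get("Value") or []
        row[column] = values[0] if values else None
    # Steps 5 and 6: execution timestamp and source file path.
    row["created_date"] = datetime.now(timezone.utc).isoformat()
    row["filepath"] = filepath
    return row


sample_tags = {
    "0020000D": {"vr": "UI", "Value": ["1.2.3"]},
    "00080060": {"vr": "CS", "Value": ["CT"]},
    "7FE00010": {"vr": "OW", "Value": ["<pixel bytes>"]},
}
row = build_bronze_row(sample_tags, "Files/example/1.2.3.dcm")
```

In this sketch, a promoted tag that is missing from the file produces a null column value rather than an error, so `row["patientid"]` is `None` for the sample input.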

Transformation mapping for bronze to silver delta table

The following tables explain the complete mapping for the transformation of DICOM metadata in the bronze lakehouse dicomimagingmetastore delta table to the FHIR ImagingStudy delta table in the silver lakehouse.

| Source DICOM tag in dicomimagingmetastore | Destination column in ImagingStudy | Mapping details |
| --- | --- | --- |
| NA | id | A GUID generated using the Python UUID module. |
| NA | meta.lastupdated | Creation timestamp of the NDJSON file. |
| StudyInstanceUID (0020,000D); Accession Number (0008,0050) | identifier | ImagingStudy.identifier.where(system = 'urn:dicom:uid') => StudyInstanceUID; ImagingStudy.identifier.where(type.coding.system = 'http://terminology.hl7.org/CodeSystem/v2-0203' and type.coding.code = 'ACSN') => AccessionNumber |
| Modalities in Study (0008,0061) | modality | modality = List{code = col('ModalitiesInStudy')} |
| Patient ID (0010,0020) | subject | "subject": {"identifier": {"type": {"coding": [{"system": "lit('http://terminology.hl7.org/CodeSystem/v2-0203')", "code": "lit('MR')"}]}, "value": "col('PatientID')"}, "type": "lit('Patient')"} |
| Patient Name (0010,0010); Patient Birthdate (0010,0030); Patient Sex (0010,0040) | subject | "subject": {"extension": [{"url": "lit('name')", "valueString": "col('PatientName')"}, {"url": "lit('birthDate')", "valueDateTime": "col('PatientBirthDate')"}, {"url": "lit('gender')", "valueCode": "col('PatientSex')"}]} |
| StudyDate (0008,0020); StudyTime (0008,0030); TimezoneOffsetFromUTC (0008,0201) | started | concat_ws(' ', col('StudyDate'), col('StudyTime'), col('TimezoneOffsetFromUTC')) |
| NumberOfStudyRelatedSeries (0020,1206) | numberOfSeries | col('NumberOfStudyRelatedSeries') |
| NumberOfStudyRelatedInstances (0020,1208) | numberOfInstances | col('NumberOfStudyRelatedInstances') |
| StudyDescription (0008,1030) | description | col('StudyDescription') |
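
The identifier and subject mappings can be illustrated with a short plain-Python sketch. It is not the pipeline's Spark code: the function names `build_identifier` and `build_subject` are hypothetical, a bronze row is modeled as a dictionary keyed by the bronze column names, and omitting the accession identifier when the tag is absent is an assumption.

```python
V2_0203 = "http://terminology.hl7.org/CodeSystem/v2-0203"


def build_identifier(row):
    """Return the ImagingStudy.identifier list for one bronze row (a dict)."""
    identifiers = [{"system": "urn:dicom:uid", "value": row["studyinstanceuid"]}]
    # Assumption: the accession identifier is omitted when the tag is absent.
    if row.get("accessionnumber"):
        identifiers.append({
            "type": {"coding": [{"system": V2_0203, "code": "ACSN"}]},
            "value": row["accessionnumber"],
        })
    return identifiers


def build_subject(row):
    """Return the ImagingStudy.subject structure for one bronze row."""
    return {
        "identifier": {
            "type": {"coding": [{"system": V2_0203, "code": "MR"}]},
            "value": row.get("patientid"),
        },
        "type": "Patient",
        # Patient demographics travel as extensions on the subject reference.
        "extension": [
            {"url": "name", "valueString": row.get("patientname")},
            {"url": "birthDate", "valueDateTime": row.get("patientbirthdate")},
            {"url": "gender", "valueCode": row.get("patientsex")},
        ],
    }


bronze_row = {
    "studyinstanceuid": "1.2.840.1",
    "accessionnumber": "ACC123",
    "patientid": "P001",
    "patientname": "DOE^JANE",
    "patientbirthdate": "19800101",
    "patientsex": "F",
}
identifier = build_identifier(bronze_row)
subject = build_subject(bronze_row)
```

The two identifier entries are distinguished the same way the mapping table selects them: the study UID by its `system`, and the accession number by its `type.coding` code of `ACSN`.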

Series-level details

| Source DICOM tag in dicomimagingmetastore | Destination column in ImagingStudy | Mapping details |
| --- | --- | --- |
| SeriesInstanceUID (0020,000E) | series.uid | col('SeriesInstanceUID') |
| SeriesNumber (0020,0011) | series.number | col('SeriesNumber') |
| Modality (0008,0060) | series.modality | modality.code = col('Modality') |
| SeriesDescription (0008,103E) | series.description | col('SeriesDescription') |
| NumberOfSeriesRelatedInstances (0020,1209) | series.numberOfInstances | col('NumberOfSeriesRelatedInstances') |
| BodyPartExamined (0018,0015) | series.bodySite | bodySite.display = col('BodyPartExamined') |
| Laterality (0020,0060) | series.laterality | laterality.display = col('Laterality') |
| SeriesDate (0008,0021); SeriesTime (0008,0031); TimezoneOffsetFromUTC (0008,0201) | series.started | concat_ws(' ', col('SeriesDate'), col('SeriesTime'), col('TimezoneOffsetFromUTC')).cast(TimestampType()) |
| SOPInstanceUID (0008,0018) | series.instance.uid | col('SOPInstanceUID') |
| SOPClassUID (0008,0016) | series.instance.sopClass | sopClass.code = col('SOPClassUID') |
| InstanceNumber (0020,0013) | series.instance.number | col('InstanceNumber') |
| DocumentTitle (0042,0010) | series.instance.title | col('DocumentTitle') |
| NA | status | "available" |
| NA | series.instance.extension | "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}]. The value for FilePath includes the ABFS file path in OneLake for all instance-level DCM files that are part of this ImagingStudy. |
| NA | resourceType | "ImagingStudy" |
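
The series-level mapping can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `concat_ws` here is a local helper that imitates the Spark function of the same name (which skips null inputs), `build_series` is a hypothetical name, and grouping bronze rows by series UID to nest instances under their series is assumed from the shape of the FHIR ImagingStudy resource. The cast to a timestamp is not reproduced.

```python
def concat_ws(sep, *parts):
    """Imitate Spark's concat_ws: join the parts with sep, skipping None."""
    return sep.join(p for p in parts if p is not None)


def build_series(rows):
    """Group instance-level bronze rows (dicts) into a list of series."""
    grouped = {}
    for r in rows:
        # Create the series entry the first time its UID is seen.
        series = grouped.setdefault(r["seriesinstanceuid"], {
            "uid": r["seriesinstanceuid"],
            "number": r.get("seriesnumber"),
            "modality": {"code": r.get("modality")},
            "started": concat_ws(
                " ",
                r.get("seriesdate"),
                r.get("seriestime"),
                r.get("timezoneoffsetfromutc"),
            ),
            "instance": [],
        })
        # Each bronze row contributes one instance with its file path.
        series["instance"].append({
            "uid": r["sopinstanceuid"],
            "sopClass": {"code": r.get("sopclassuid")},
            "number": r.get("instancenumber"),
            "extension": [{"url": "file_path", "valueUrl": r["filepath"]}],
        })
    return list(grouped.values())


rows = [
    {"seriesinstanceuid": "1.2.1", "sopinstanceuid": "1.2.1.1", "modality": "CT",
     "seriesdate": "20240102", "seriestime": "101500", "filepath": "a.dcm"},
    {"seriesinstanceuid": "1.2.1", "sopinstanceuid": "1.2.1.2", "modality": "CT",
     "seriesdate": "20240102", "seriestime": "101500", "filepath": "b.dcm"},
]
series_list = build_series(rows)
```

Because null inputs are skipped, a missing TimezoneOffsetFromUTC leaves no trailing separator in the concatenated value.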

Transformation mapping for silver to gold delta table

The following table explains the complete mapping for the transformation of DICOM metadata in the silver lakehouse ImagingStudy delta table to the Observational Medical Outcomes Partnership (OMOP) Image_Occurrence delta table in the gold lakehouse.

| Source column in ImagingStudy | Destination column in OMOP Image_Occurrence | Data type | Mapping details |
| --- | --- | --- | --- |
| series.uid | image_occurrence_id | integer | Unique key given to an imaging study record. |
| subject | person_id | integer | Person ID of the person associated with the recorded procedure. |
| series.instance.extension | local_path | string | {InstanceID; StoragePath}. A collection of DCM files for all instances in that series. The collection includes a JSON array of key-value pairs. |
| series.started | image_occurrence_date | date | Imaging procedure (series) occurrence date. |
| identifier['StudyInstanceUID'] | image_study_UID | string | DICOM Study UID |
| series.uid | image_series_UID | string | DICOM Series UID |
| series.modality | modality | string | Modality of the series. |
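
As a rough illustration of this mapping, the following plain-Python sketch builds one Image_Occurrence row from one series. Several points are assumptions rather than documented behavior: the function name `build_image_occurrence` is hypothetical, `series.started` is taken to be an ISO 8601 timestamp string, and the integer values for `image_occurrence_id` and `person_id` are passed in because this article does not describe how they are derived.

```python
import json
from datetime import datetime


def build_image_occurrence(series, study_uid, image_occurrence_id, person_id):
    """Build one gold row from a silver series dict (hypothetical helper)."""
    # local_path: JSON array of {InstanceID, StoragePath} pairs, one per instance.
    local_path = json.dumps([
        {"InstanceID": inst["uid"], "StoragePath": ext["valueUrl"]}
        for inst in series["instance"]
        for ext in inst.get("extension", [])
        if ext.get("url") == "file_path"
    ])
    return {
        "image_occurrence_id": image_occurrence_id,  # assumed to be resolved elsewhere
        "person_id": person_id,                      # assumed to be resolved elsewhere
        "local_path": local_path,
        # Assumption: series.started is an ISO 8601 string; keep only the date.
        "image_occurrence_date": datetime.fromisoformat(series["started"]).date(),
        "image_study_UID": study_uid,
        "image_series_UID": series["uid"],
        "modality": series["modality"]["code"],
    }


sample_series = {
    "uid": "1.2.1",
    "started": "2024-01-02T10:15:00+01:00",
    "modality": {"code": "CT"},
    "instance": [
        {"uid": "1.2.1.1", "extension": [{"url": "file_path", "valueUrl": "a.dcm"}]},
        {"uid": "1.2.1.2", "extension": [{"url": "file_path", "valueUrl": "b.dcm"}]},
    ],
}
gold_row = build_image_occurrence(sample_series, "1.2", 1, 42)
```

Each series yields one row, so a study with several series produces several rows that share the same `image_study_UID`.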