DICOM metadata transformation in healthcare data solutions
Note
This content is currently being updated.
This article explains how the healthcare data solutions environment extracts and transforms DICOM metadata across different lakehouse levels. You can also learn about the end-to-end metadata transformation process and understand the transformation mapping at each level.
The metadata transformation through the ingestion pipeline consists of the following three consecutive stages:
- Extraction and transformation of DICOM metadata to bronze delta table
- Metadata transformation from bronze to silver delta table
- Metadata transformation from silver to gold delta table
The following sections detail the transformation mapping for each stage.
Transformation mapping for DICOM metadata to bronze delta table
The DICOM standard defines more than 5000 DICOM tags, and vendors can add their own private tags. This section identifies which tags are retrieved and explains the extraction process in the bronze lakehouse.
The tag extraction process performs the following actions:
Extraction from DICOM files: Extract a collection of all the tags from the DICOM (DCM) files in the optimized folder structure in the bronze lakehouse.
Pixel data tag exclusion: Exclude the DICOM pixel data tag (7FE0,0010) and the image pixel data module attributes from the collection. The DICOM pixel data tag includes image/pixel-level details.
JSON mapping: Map all the extracted DICOM tags into a JSON structure of key-value pairs in the following schema:
    METADATA_JSON_DICT_SCHEMA = MapType(
        StringType(),
        StructType([
            StructField("vr", StringType(), True),
            StructField("Value", ArrayType(StringType(), True), True),
        ]),
    )
These key-value JSON pairs are written to the metadata column in the bronze lakehouse's dicomimagingmetastore delta table.
Extraction and mapping to bronze lakehouse: Further extract the following 30 DICOM tags and write them to the respective destination columns in the dicomimagingmetastore delta table:
Source DICOM tag | Destination column |
---|---|
(0020,000D) | [studyinstanceuid] |
(0010,0010) | [patientname] |
(0010,0020) | [patientid] |
(0010,0030) | [patientbirthdate] |
(0010,0040) | [patientsex] |
(0008,0050) | [accessionnumber] |
(0008,0090) | [referringphysicianname] |
(0008,0020) | [studydate] |
(0008,1030) | [studydescription] |
(0008,0061) | [modalitiesinstudy] |
(0020,000E) | [seriesinstanceuid] |
(0008,0060) | [modality] |
(0040,0244) | [performedprocedurestepstartdate] |
(0008,1090) | [manufacturermodelname] |
(0008,0018) | [sopinstanceuid] |
(0008,0030) | [studytime] |
(0008,0096) | [referringphysicianidentificationsequence] |
(0008,0201) | [timezoneoffsetfromutc] |
(0020,1206) | [numberofstudyrelatedseries] |
(0020,1208) | [numberofstudyrelatedinstances] |
(0020,0011) | [seriesnumber] |
(0008,103E) | [seriesdescription] |
(0020,1209) | [numberofseriesrelatedinstances] |
(0018,0015) | [bodypartexamined] |
(0020,0060) | [laterality] |
(0008,0021) | [seriesdate] |
(0008,0031) | [seriestime] |
(0008,0016) | [sopclassuid] |
(0020,0013) | [instancenumber] |
(0042,0010) | [documenttitle] |
Note
- For more information about why we promote these particular 30 DICOM tags, see DICOM tag extraction.
- To learn more about the ingestion pattern (append), go to Append pattern in the bronze lakehouse.
Execution time logging: The notebook's execution date and time are written to the created_date column in the dicomimagingmetastore delta table.
DCM file path storage: The full file path for the DCM file is written to the filepath column in the dicomimagingmetastore delta table.
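The extraction steps above can be sketched in plain Python. This is a minimal illustration, not the notebook's actual code: the function name `build_bronze_row`, the input shape (a dictionary keyed by 8-character hexadecimal tags, as in the DICOM JSON model), and the `excluded` parameter are assumptions made for the example. Only four of the 30 promoted tags are listed to keep it short, and converting every value with `str()` is a simplification chosen to match the `ArrayType(StringType())` value field in the schema.

```python
from datetime import datetime, timezone

PIXEL_DATA_TAG = "7FE00010"  # (7FE0,0010) written as a DICOM JSON key

# Subset of the promoted tags from the mapping table (illustrative only).
PROMOTED_TAGS = {
    "0020000D": "studyinstanceuid",
    "00100020": "patientid",
    "00080060": "modality",
    "00080018": "sopinstanceuid",
}


def build_bronze_row(tags, filepath, excluded=frozenset({PIXEL_DATA_TAG})):
    """Build one bronze row from extracted tags (hypothetical helper).

    tags: dict of tag -> {"vr": str, "Value": list}; assumed input shape.
    """
    metadata = {}
    for tag, element in tags.items():
        key = tag.upper()
        if key in excluded:
            continue  # drop pixel data and any other excluded attributes
        values = element.get("Value")
        metadata[key] = {
            "vr": element.get("vr"),
            # Stringify values so they fit an array-of-strings column.
            "Value": None if values is None else [str(v) for v in values],
        }

    row = {
        "metadata": metadata,
        "filepath": filepath,
        "created_date": datetime.now(timezone.utc).isoformat(),
    }
    # Promote selected tags to their own columns (first value, or None).
    for tag, column in PROMOTED_TAGS.items():
        values = (metadata.get(tag) or {}).get("Value")
        row[column] = values[0] if values else None
    return row
```

A tag that is absent from the file leaves its promoted column as `None`, while the full set of remaining tags stays available in the `metadata` map.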
Transformation mapping for bronze to silver delta table
The following tables explain the complete mapping for the transformation of DICOM metadata in the bronze lakehouse dicomimagingmetastore delta table to the FHIR ImagingStudy delta table in the silver lakehouse.
Source DICOM tag in dicomimagingmetastore | Destination column in ImagingStudy | Mapping details |
---|---|---|
NA | id | A GUID generated using the Python UUID module. |
NA | meta.lastupdated | Creation timestamp of the NDJSON file. |
StudyInstanceUID (0020,000D), Accession Number (0008,0050) | identifier | ImagingStudy.identifier.where(system = 'urn:dicom:uid') => StudyInstanceUID; ImagingStudy.identifier.where(type.coding.system = 'http://terminology.hl7.org/CodeSystem/v2-0203' and type.coding.code = 'ACSN') => AccessionNumber |
Modalities in Study (0008,0061) | modality | modality = List{code = col('ModalitiesInStudy')} |
Patient ID (0010,0020) | subject | "subject": {"identifier": {"type": {"coding": [{"system": "lit('http://terminology.hl7.org/CodeSystem/v2-0203')", "code": "lit('MR')"}]}, "value": "col('PatientID')"}, "type": "lit('Patient')"} |
Patient Name (0010,0010), Patient Birthdate (0010,0030), Patient Sex (0010,0040) | subject | "subject": {"extension": [{"url": "lit('name')", "valueString": "col('PatientName')"}, {"url": "lit('birthDate')", "valueDateTime": "col('PatientBirthDate')"}, {"url": "lit('gender')", "valueCode": "col('PatientSex')"}]} |
StudyDate (0008,0020), StudyTime (0008,0030), TimezoneOffsetFromUTC (0008,0201) | started | concat_ws(' ', col('StudyDate'), col('StudyTime'), col('TimezoneOffsetFromUTC')) |
NumberOfStudyRelatedSeries (0020,1206) | numberOfSeries | col('NumberOfStudyRelatedSeries') |
NumberOfStudyRelatedInstances (0020,1208) | numberOfInstances | col('NumberOfStudyRelatedInstances') |
StudyDescription (0008,1030) | description | col('StudyDescription') |
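As a rough illustration of the study-level mapping, the following plain-Python sketch assembles the corresponding ImagingStudy fields from one bronze row. The function name `to_imaging_study` and the input dictionary keyed by the PascalCase names used in the `col(...)` expressions are assumptions for this example; the actual transformation runs as Spark column expressions. The `meta.lastupdated` field is left out because it comes from the NDJSON file timestamp rather than from the row.

```python
import uuid

V2_0203 = "http://terminology.hl7.org/CodeSystem/v2-0203"


def to_imaging_study(row):
    """Map one bronze row (dict, assumed shape) to study-level ImagingStudy fields."""
    identifiers = [{"system": "urn:dicom:uid", "value": row["StudyInstanceUID"]}]
    if row.get("AccessionNumber") is not None:
        identifiers.append({
            "type": {"coding": [{"system": V2_0203, "code": "ACSN"}]},
            "value": row["AccessionNumber"],
        })

    # Mirrors concat_ws(' ', ...): null parts are skipped, not rendered.
    parts = (row.get(k) for k in ("StudyDate", "StudyTime", "TimezoneOffsetFromUTC"))
    started = " ".join(p for p in parts if p is not None)

    return {
        "resourceType": "ImagingStudy",
        "id": str(uuid.uuid4()),
        "identifier": identifiers,
        "modality": [{"code": row.get("ModalitiesInStudy")}],
        "subject": {
            "type": "Patient",
            "identifier": {
                "type": {"coding": [{"system": V2_0203, "code": "MR"}]},
                "value": row.get("PatientID"),
            },
            "extension": [
                {"url": "name", "valueString": row.get("PatientName")},
                {"url": "birthDate", "valueDateTime": row.get("PatientBirthDate")},
                {"url": "gender", "valueCode": row.get("PatientSex")},
            ],
        },
        "started": started,
        "numberOfSeries": row.get("NumberOfStudyRelatedSeries"),
        "numberOfInstances": row.get("NumberOfStudyRelatedInstances"),
        "description": row.get("StudyDescription"),
    }
```

The sketch merges the two `subject` rows of the table into one object, which is how the two mappings would combine in a single resource.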
Series-level details
Source DICOM tag in dicomimagingmetastore | Destination column in ImagingStudy | Mapping details |
---|---|---|
SeriesInstanceUID (0020,000E) | series.uid | col('SeriesInstanceUID') |
SeriesNumber (0020,0011) | series.number | col('SeriesNumber') |
Modality (0008,0060) | series.modality | modality.code = col('Modality') |
SeriesDescription (0008,103E) | series.description | col('SeriesDescription') |
NumberOfSeriesRelatedInstances (0020,1209) | series.numberOfInstances | col('NumberOfSeriesRelatedInstances') |
BodyPartExamined (0018,0015) | series.bodySite | bodySite.display = col('BodyPartExamined') |
Laterality (0020,0060) | series.laterality | laterality.display = col('Laterality') |
SeriesDate (0008,0021), SeriesTime (0008,0031), TimezoneOffsetFromUTC (0008,0201) | series.started | concat_ws(' ', col('SeriesDate'), col('SeriesTime'), col('TimezoneOffsetFromUTC')).cast(TimestampType()) |
SOPInstanceUID (0008,0018) | series.instance.uid | col('SOPInstanceUID') |
SOPClassUID (0008,0016) | series.instance.sopClass | sopClass.code = col('SOPClassUID') |
InstanceNumber (0020,0013) | series.instance.number | col('InstanceNumber') |
DocumentTitle (0042,0010) | series.instance.title | col('DocumentTitle') |
NA | status | "available" |
NA | series.instance.extension | "extension": [{"url": "lit('file_path')", "valueUrl": "col('FilePath')"}] The value for FilePath includes the ABFS file path in OneLake for all instance-level DCM files that are part of this ImagingStudy. |
NA | resourceType | "ImagingStudy" |
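The series-level mapping can be pictured as grouping instance rows by series. The sketch below is an assumption about how such nesting could be done in plain Python, not the pipeline's implementation: `to_series` is a hypothetical helper, and it assumes each input dictionary represents one DCM instance from a single study, keyed by the names used in the `col(...)` expressions.

```python
def to_series(rows):
    """Group per-instance bronze rows (dicts, assumed shape) into series entries."""
    series_by_uid = {}
    for row in rows:
        uid = row["SeriesInstanceUID"]
        # Series-level fields are taken from the first row seen for that series.
        series = series_by_uid.setdefault(uid, {
            "uid": uid,
            "number": row.get("SeriesNumber"),
            "modality": {"code": row.get("Modality")},
            "description": row.get("SeriesDescription"),
            "numberOfInstances": row.get("NumberOfSeriesRelatedInstances"),
            "bodySite": {"display": row.get("BodyPartExamined")},
            "laterality": {"display": row.get("Laterality")},
            "instance": [],
        })
        series["instance"].append({
            "uid": row.get("SOPInstanceUID"),
            "sopClass": {"code": row.get("SOPClassUID")},
            "number": row.get("InstanceNumber"),
            "title": row.get("DocumentTitle"),
            # Each instance carries the path of its DCM file as an extension.
            "extension": [{"url": "file_path", "valueUrl": row.get("FilePath")}],
        })
    return list(series_by_uid.values())
```

Because dictionaries preserve insertion order, the returned series appear in the order their first instance was encountered.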
Transformation mapping for silver to gold delta table
The following table explains the complete mapping for the transformation of DICOM metadata in the silver lakehouse ImagingStudy delta table to the Observational Medical Outcomes Partnership (OMOP) Image_Occurrence delta table in the gold lakehouse.
Source column in ImagingStudy | Destination column in OMOP Image_Occurrence | Data type | Mapping details |
---|---|---|---|
series.uid | image_occurrence_id | integer | Unique key given to an imaging study record. |
subject | person_id | integer | Person ID of the person associated with the recorded procedure. |
series.instance.extension | local_path | string | {InstanceID; StoragePath} A collection of DCM files for all instances in that series. The collection includes a JSON array of key-value pairs. |
series.started | image_occurrence_date | date | Imaging procedure (series) occurrence date. |
identifier['StudyInstanceUID'] | image_study_UID | string | DICOM Study UID |
series.uid | image_series_UID | string | DICOM Series UID |
series.modality | modality | string | Modality of the series. |
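To make the silver-to-gold mapping concrete, the following plain-Python sketch produces one Image_Occurrence row per series. Several parts are assumptions, because the table doesn't describe them: the function name `to_image_occurrences` is hypothetical, the integer `image_occurrence_id` is simply numbered from a caller-supplied `first_id`, `person_id` is passed in by the caller instead of being resolved from `subject`, and `series.started` is assumed to be an ISO 8601 string.

```python
import json
from datetime import date


def to_image_occurrences(study, person_id, first_id):
    """Flatten one ImagingStudy dict (assumed shape) into Image_Occurrence rows."""
    study_uid = next(
        (i["value"] for i in study.get("identifier", [])
         if i.get("system") == "urn:dicom:uid"),
        None,
    )
    rows = []
    for offset, series in enumerate(study.get("series", [])):
        # Collect {InstanceID, StoragePath} pairs for every instance in the series.
        paths = [
            {"InstanceID": inst.get("uid"), "StoragePath": ext.get("valueUrl")}
            for inst in series.get("instance", [])
            for ext in inst.get("extension", [])
            if ext.get("url") == "file_path"
        ]
        started = series.get("started")
        rows.append({
            "image_occurrence_id": first_id + offset,  # hypothetical ID scheme
            "person_id": person_id,                    # supplied by the caller
            "local_path": json.dumps(paths),
            "image_occurrence_date": date.fromisoformat(started[:10]) if started else None,
            "image_study_UID": study_uid,
            "image_series_UID": series.get("uid"),
            "modality": (series.get("modality") or {}).get("code"),
        })
    return rows
```

Serializing `local_path` with `json.dumps` keeps the column a string while preserving the array of key-value pairs described in the table.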