Use SDOH datasets - Transformations (preview) in healthcare data solutions
[This article is prerelease documentation and is subject to change.]
This section provides guidance on how to ingest, transform, and unify the SDOH (social determinants of health) datasets using SDOH datasets - Transformations (preview) in healthcare data solutions.
After you complete the steps in Prepare the public datasets in SDOH datasets - Transformations (preview), the SDOH datasets are ready for ingestion. Also consider the following requirements:
- Ensure none of the files are open locally to avoid any temporary file copies from being uploaded.
- The files' sensitivity label must be set to General or Public.
To begin the ingestion process:
In your healthcare data solutions environment, open the healthcare#_msft_sdoh_ingestion data pipeline.
Select the Run button.
After successful execution, your SDOH datasets are ready for use in analytics workloads.
Understand the ingestion mechanism
The end-to-end execution of this capability involves the following high-level consecutive steps:
- Ingest SDOH datasets from OneLake into the Ingest folder.
- Move the SDOH datasets from the Ingest to Process folder.
- Convert SDOH datasets into dedicated delta tables in the bronze lakehouse.
- Ingest and convert bronze delta tables into an Industry Data Model (IDM) inspired data model in the silver lakehouse.
Ingest SDOH datasets from OneLake
The execution begins after you upload the SDOH datasets to the Ingest folder. The execution pipeline moves the files to the organized Process folder in the bronze lakehouse in the next stage. In any failures occur, the pipeline moves the files to the Failed folder.
To learn more about these folders and the file movement between them, see Folder descriptions.
Move SDOH datasets
The raw_process_movement notebook moves the files to the organized Process folder in the bronze lakehouse. The subfolder structure is as follows: Files\Process\SDOH\<file format>\<publisher name>\<dataset-specific folders
.
Processed files store in their respective subfolders, with the ingestion timestamp added at the beginning of the file name.
Convert SDOH datasets to delta tables
After the files move to the Process folder, the healthcare#_msft_bronze_ingestion notebook populates the metadata, layout, and data tables in the bronze lakehouse in the delta table format. Layout information populates in the SD_Layout table, metadata information populates in SD_Metadata table, and data populates in the individual data tables generated at run time. Data tables are prefixed with SD_
and contain the sheet/file name in the table name. All data sheets from every dataset retain their table structure. You can compare the original data sheet with their corresponding bronze delta table to understand the variation.
Convert delta tables to silver data model
After successful bronze ingestion, the healthcare#_msft_bronze_silver_ingestion notebook helps define a custom data model in the silver lakehouse. This notebook:
- Normalizes the data in the bronze lakehouse while preserving the context of the corresponding tables, allowing you to identify or query the data in the source context.
- Creates dedicated tables in the silver lakehouse for each of the source context.
Following are the key silver lakehouse tables:
- SocialDeterminant: Contains the actual data points for each social determinant, and the location details as entered in the Location configuration sheet.
- SocialDeterminantCategory: Contains the category of data points for each social determinant.
- SocialDeterminantSubCategory: Contains the subcategory of data points for each social determinant.
- UnitOfMeasure (IDM table): Contains the measuring unit details.
- SocialDeterminantDataSetMetadata: Contains information about the dataset, such as the dataset name, publisher, and published date.
You can compare the bronze delta lake tables with their corresponding silver lakehouse representations to understand the custom data model transformation. The custom data model tables differ in structure and organization compared to conventional FHIR-based tables.
Sample: Analyze the impact of food environment and socio-economic conditions on diabetes
Consider a scenario where we're trying to understand the impact of a county's food environment and socio-economic conditions on the number of patients with diabetes in that county.
Food environment and median household income represent SDOH information from the recently ingested SDOH public datasets, specifically USDA's Food Environment Atlas and AHRQ's SDOH Data. You can use fields such as the number of fast-food restaurants, number of grocery stores, and median household income from the SDOH data model (SocialDeterminant table) in the silver lakehouse.
SELECT
SocialDeterminantName,
SocialDeterminantValue,
SocialDeterminantDescription,
parsedJson.CountyName AS CountyName,
parsedJson.CountyFIPS AS CountyFIPS,
parsedJson.StateName AS StateName
FROM
healthcare1_msft_silver.SocialDeterminant sd
LATERAL VIEW json_tuple(sd.LocationJson, 'STATENAME', 'COUNTYNAME', 'COUNTYFIPS') parsedJson AS StateName, CountyName, CountyFIPS) sd
ON sd.CountyFIPS = fip_zip_mapping.STCOUNTYFP
WHERE
sd.SocialDeterminantName IN ('GROC16', 'FFR16', 'ACS_MEDIAN_HH_INC')
AND sd.SocialDeterminantValue IS NOT NULL
On the other hand, the number of patients with diabetes refers to clinical information from the healthcare data solutions clinical pipeline, which must also be deployed and installed. You can ingest clinical data into this pipeline or use the provided clinical sample data. Use fields such as patient address and patient condition to retrieve the required measures.
WITH ExpandedPatients AS (
SELECT
p.id_orig,
address_item.postalCode AS postalCode,
address_item.state AS state
FROM
healthcare1_msft_silver.patient p
LATERAL VIEW explode(p.address) exploded_address AS address_item
)
SELECT
fzm.STCOUNTYFP,
SUM(CASE WHEN c.code.text LIKE '%Asthma%' THEN 1 ELSE 0 END) AS Total_Asthma_Patients,
SUM(CASE WHEN c.code.text LIKE '%Diabetes%' THEN 1 ELSE 0 END) AS Total_Diabetes_Patients,
SUM(CASE WHEN c.code.text LIKE '%Hypertension%' THEN 1 ELSE 0 END) AS Total_Hypertension_Patients
FROM
ExpandedPatients ep
JOIN healthcare1_msft_silver.condition c ON ep.id_orig = c.subject.id_orig
JOIN healthcare1_msft_silver.fips_zip_mapping fzm ON ep.postalCode = fzm.ZIP
GROUP BY
fzm.STCOUNTYFP
Inherently, there's no direct relationship between these two data sets. The linking element is their location detail:
- Food environment data is available at the county level, accessible by expanding the Locationjson column in the SocialDeterminant table and using the
CountyFIPS
field. - Clinical data contains patient addresses in FHIR format, from which you can retrieve county information. If only the patient's ZIP code is available, you can retrieve it and create a mapping table of ZIP codes to FIPS codes to link with the SDOH dataset. This mapping table is readily available in public data repositories.
With the location data prepared, you can link the two data sets to build a gold lakehouse query that displays all the necessary data points. Here's a sample SQL query:
FROM
social_determinants sd
JOIN
patient_conditions pc
ON
sd.CountyFIPS = pc.STCOUNTYFP
You can now analyze and visualize the final dataset to determine the relationship between the number of patients with diabetes and the presence of fast-food restaurants. Thus, the silver layer enables robust data normalization, allowing you to construct queries and derive comprehensive insights within and across different datasets.