Loading data into the data warehouse


From: Developing big data solutions on Microsoft Azure HDInsight

The tornado source data is in tab-delimited text format, and the data analysts have used Excel to create lookup tables for dates and states in the same format. To simplify loading this data into the Sequence File formatted tables created previously, they decided to create staging tables in Text File format to which the source data will be uploaded. They can then use a HiveQL INSERT statement to load data from the staging tables into the data warehouse tables, implicitly converting the data to Sequence File format and generating partition key values for the Tornadoes table. This approach also makes it possible to compress the data as it is extracted from the staging table, reducing storage requirements and improving query performance in the data warehouse tables.
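For reference, the following is a minimal sketch of what the previously created, Sequence File formatted DW.Tornadoes table could look like. The actual definition appears in the previous topic; the column names here mirror the staging table defined later in this topic, and the STRING type of the Year partition column is an assumption based on the load script shown below.

-- Sketch of the partitioned, Sequence File formatted data warehouse table.
-- The actual definition was created in the previous topic; types are assumptions.
CREATE TABLE DW.Tornadoes
  (DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
   Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
   StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
   LengthMiles DOUBLE, WidthYards DOUBLE)
  PARTITIONED BY (Year STRING)
  STORED AS SEQUENCEFILE;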

The format and content of the data is shown in the following tables.

Date table

DateCode   | Date       | Day | Month | Year | Day of week | Day name | Month name
1932-01-01 | 01/01/1932 | 1   | 1     | 1932 | 6           | Friday   | January
1932-01-02 | 01/02/1932 | 2   | 1     | 1932 | 7           | Saturday | January
...        | ...        | ... | ...   | ...  | ...         | ...      | ...

State table

StateCode | State name
AL        | Alabama
AK        | Alaska
AZ        | Arizona
...       | ...

Tornadoes table

DateCode | StateCode | EventTime | Category | Injuries | Fatalities | PropertyLoss | CropLoss | StartLatitude | StartLongitude | EndLatitude | EndLongitude | LengthMiles | WidthYards
1934-01-18 | OK | 01/18/1934 02:20 | 2 | 4 | 0 | 4000000 | 0 | 35.4 | -96.67 | 35.45 | -96.6 | 5.1 | 30
1934-01-18 | AR | 01/18/1934 08:50 | 0 | 0 | 0 | 1600000 | 0 | 35.2 | -93.18 | 0 | 0 | 0.1 | 10
1934-01-18 | MO | 01/18/1934 13:55 | 2 | 1 | 0 | 4000000 | 0 | 36.68 | -90.83 | 36.72 | -90.77 | 4.3 | 100

Creating scripts for the data load process

The data warehouse will be updated with new data periodically, so the load process should be easily repeatable. To accomplish this, the analysts created HiveQL scripts to drop and recreate each staging table. For example, the following script is used to drop and recreate a staging table named StagedTornadoes for the tornadoes data.

DROP TABLE DW.StagedTornadoes;
CREATE TABLE DW.StagedTornadoes
  (DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
   Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
   StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
   LengthMiles DOUBLE, WidthYards DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/staging/tornadoes';

Notice that the staging table is an INTERNAL (Hive-managed) table. When it is dropped, any staged data left over from a previous load operation is deleted, and recreating the table produces a new, empty /staging/tornadoes folder that is ready for new data files, which can simply be copied into the folder. Similar scripts were created for the StagedDates and StagedStates tables.
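As an illustration of those similar scripts, the StagedStates script might look something like the following sketch; the column names and the /staging/states location are assumptions based on the state lookup data shown earlier.

-- Hypothetical staging table for the state lookup data (column names assumed).
DROP TABLE DW.StagedStates;
CREATE TABLE DW.StagedStates
  (StateCode STRING, StateName STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE LOCATION '/staging/states';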

In addition to the scripts used to create the staging tables, the data load process requires scripts to insert the staged data into the data warehouse tables. For example, the following script is used to load the staged tornadoes data.

SET mapreduce.map.output.compress=true;
SET mapreduce.output.fileoutputformat.compress=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
FROM DW.StagedTornadoes s
INSERT INTO TABLE DW.Tornadoes PARTITION (Year)
SELECT s.DateCode, s.StateCode, s.EventTime, s.Category, s.Injuries, s.Fatalities,   
       s.PropertyLoss, s.CropLoss, s.StartLatitude, s.StartLongitude, s.EndLatitude, 
       s.EndLongitude, s.LengthMiles, s.WidthYards, SUBSTR(s.DateCode, 1, 4) Year;

Notice that the script includes configuration settings to enable compression of the query output (which will be inserted into the data warehouse table). Additionally, the script for the tornadoes data includes a setting to enable non-strict dynamic partitioning and a SUBSTR function that derives the Year partition key value from the DateCode column. Similar scripts, without the partitioning functionality, were created for the states and dates data.
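For example, a non-partitioned load script for the states data could look like the following sketch; the DW.States target table and its column names are assumptions consistent with the staging table sketch above.

-- Hypothetical load script for the state lookup data (no partitioning required).
SET mapreduce.map.output.compress=true;
SET mapreduce.output.fileoutputformat.compress=true;
FROM DW.StagedStates s
INSERT INTO TABLE DW.States
SELECT s.StateCode, s.StateName;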

The scripts to create staging tables and load staged data were then uploaded to the /staging/scripts folder so that they can be used whenever new data is available for loading into the data warehouse.
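The upload can be performed with the same Azure PowerShell cmdlets used later in this topic; the following sketch shows one possible approach, and the file and variable names are illustrative.

# Illustrative upload of one HiveQL script file to the /staging/scripts folder.
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
Set-AzureStorageBlobContent -File ".\CreateStagedTornadoes.q" -Container $containerName `
  -Blob "staging/scripts/CreateStagedTornadoes.q" -Context $blobContext -Force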

Loading data

With the Hive table definition scripts in place, the data analysts could now implement a solution to automate the data load process. It is possible to create a custom application to load the data using the .NET SDK for HDInsight, but a simple approach using Windows PowerShell scripts was chosen for this scenario. A PowerShell script was created for each staging table, including the following script that is used to stage and load tornadoes data.

# Azure subscription-specific variables.
$subscriptionName = "subscription-name"
$clusterName = "cluster-name"
$storageAccountName = "storage-account-name"
$containerName = "container-name"
Select-AzureSubscription $subscriptionName

# Find the local Data folder.
$thisfolder = Split-Path -parent $MyInvocation.MyCommand.Definition 
$localfolder = "$thisfolder\Data"

# Run Hive script to drop and recreate staging table.
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/staging/scripts/CreateStagedTornadoes.q"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError

# Upload data to staging table.
$destfolder = "staging/tornadoes"
$dataFile = "Tornadoes.txt"
$storageAccountKey = (Get-AzureStorageKey -StorageAccountName $storageAccountName).Primary
$blobContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey
$blobName = "$destfolder/$dataFile"
$filename = "$localfolder\$dataFile"
Set-AzureStorageBlobContent -File $filename -Container $containerName -Blob $blobName -Context $blobContext -Force

# Run Hive script to load staged data to DW table.
$jobDef = New-AzureHDInsightHiveJobDefinition `
  -File "wasbs://$containerName@$storageAccountName.blob.core.windows.net/staging/scripts/LoadStagedTornadoes.q"
$hiveJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef
Wait-AzureHDInsightJob -Job $hiveJob -WaitTimeoutInSeconds 3600
Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $hiveJob.JobId -StandardError

# All done!
Write-Host "Data in $dataFile has been loaded!"

After setting some initial variables to identify the cluster, storage account, blob container, and the local folder where the source data is stored, the script performs the following three tasks:

  1. Runs the HiveQL script to drop and recreate the staging table.
  2. Uploads the source data file to the staging table folder.
  3. Runs the HiveQL script to load the data from the staging table into the data warehouse table.

Two similar scripts, LoadDates.ps1 and LoadStates.ps1, are run to load the dates and states into the data warehouse. Whenever new data is available for any of the data warehouse tables, the data analysts can run the appropriate PowerShell script to automate the data load process for that table.
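For example, a typical refresh could run the scripts in sequence from an Azure PowerShell session, as in the sketch below; LoadTornadoes.ps1 is an assumed name for the tornadoes script shown earlier.

# Hypothetical refresh run: load the lookup tables first, then the fact data.
.\LoadDates.ps1
.\LoadStates.ps1
.\LoadTornadoes.ps1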

Now that the data warehouse is complete, the analysts can explore how to analyze the data. This is discussed in Analyzing data from the data warehouse.
