Big Data and Advanced Analytics Virtual Workshop
I had a fantastic day (well, night in my timezone!) at the Big Data and Advanced Analytics Virtual Workshop - Day 1. You can download the slides and video recording from here.
I did two demos in "Session 5: Using Azure SQL Data Warehouse".
The first demo showed how to set up an Azure SQL Data Warehouse, explained the pause and resume feature, scaling the Data Warehouse, and pushing some data into it using bcp. I used the following scripts and this file.
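Pause and resume are driven from the Azure portal (or, at the time, from PowerShell via the Suspend-AzureRmSqlDatabase and Resume-AzureRmSqlDatabase cmdlets), while scaling can also be done straight from T-SQL. Here is a minimal sketch, assuming a hypothetical Data Warehouse named MyDW; run it while connected to the master database.
Scale the Data Warehouse
-- MyDW is a hypothetical name for your Data Warehouse.
ALTER DATABASE MyDW MODIFY (SERVICE_OBJECTIVE = 'DW400');
-- Verify the current service objective.
SELECT db.[name], ds.edition, ds.service_objective
FROM sys.database_service_objectives ds
JOIN sys.databases db ON ds.database_id = db.database_id;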
Create table in SQL DW
CREATE TABLE DimDate
(
DateId INT NOT NULL,
CalendarQuarter TINYINT NOT NULL,
FiscalQuarter TINYINT NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
);
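DimDate is a small dimension table, so ROUND_ROBIN distribution is good enough here. A large fact table would typically be hash-distributed on its most common join key instead; a hypothetical sketch:
-- Hypothetical fact table; HASH distribution co-locates rows sharing a productid,
-- which avoids data movement when joining on that key.
CREATE TABLE FactSalesLocal
(
salesdate int NOT NULL,
productid int NOT NULL,
salesamt float NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(productid)
);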
Run the bcp utility
bcp DimDate in <your directory>\DimDate2.txt -S <your DW Server name> -d <your DW Name> -U <your DW Username> -P <your DW Password> -q -c -t ','
Run SELECT query
SELECT * FROM DimDate
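SQL Data Warehouse did not create or update statistics automatically at the time, so after a load it pays to create them on the columns you join or filter on. For example (the statistics name is arbitrary):
CREATE STATISTICS stat_DimDate_DateId ON DimDate (DateId);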
The second demo was more interesting. It joined tables pointing at two other data sources: the first being Azure Blob storage and the second an ORC table in HDInsight (which uses Azure Blob storage underneath!).
For the Azure-Blob-as-a-data-source part, I created a container in my blob storage account and moved this file into it. Back in SQL Server Management Studio (SSMS), I ran the following queries.
Create a data source for Azure Blob
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH (
TYPE = HADOOP,
LOCATION = 'wasbs://<yourcontainer>@<your-storage-account-path>',
CREDENTIAL = AzureStorageCredential
);
The credential 'AzureStorageCredential' was pre-created by running the following queries.
CREATE MASTER KEY;
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH
IDENTITY = 'user',
SECRET = '<your storage key>';
Create external table
CREATE EXTERNAL TABLE [asb].DimProduct
(
[ProductID] [int] NOT NULL,
[ProductName] [nvarchar](255) NULL,
[ProductDescription] [nvarchar](500) NULL
)
WITH
(
LOCATION = '/Products.txt',
DATA_SOURCE = AzureStorage,
FILE_FORMAT = PipeDelimitedTextFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);
The schema [asb] and the file format PipeDelimitedTextFileFormat were pre-created using the following queries.
CREATE SCHEMA [asb];
CREATE EXTERNAL FILE FORMAT PipeDelimitedTextFileFormat
WITH
( FORMAT_TYPE = DELIMITEDTEXT
, FORMAT_OPTIONS ( FIELD_TERMINATOR = '|'
, STRING_DELIMITER = ''
, DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff'
, USE_TYPE_DEFAULT = FALSE
)
);
Show data from external table
SELECT * FROM [asb].DimProduct
This showed how data residing in Azure Blob storage can be queried straight from SQL Data Warehouse.
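If you want to persist the blob data inside the warehouse rather than re-read it from blob storage on every query, a CREATE TABLE AS SELECT (CTAS) over the external table is the usual next step. A minimal sketch, with a hypothetical target table name:
-- Materialize the external table into a regular DW table (hypothetical name).
CREATE TABLE [dbo].DimProductLocal
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS SELECT * FROM [asb].DimProduct;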
For the HDInsight part, I needed to create an HDInsight cluster and set it up. I ran the following queries in the Hive Editor of HDI.
Create an external table in HDI
CREATE EXTERNAL TABLE IF NOT EXISTS <your HDI database>.salestxt
(
salesdate int,
productid int,
salesamt float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
The database was pre-created using the following script.
CREATE DATABASE IF NOT EXISTS <your HDI database>;
Load data into external table
LOAD DATA INPATH 'wasb:///Sales.txt' INTO TABLE <your HDI database>.salestxt;
I had created this file as Sales.txt. For the load to work, I had to keep it at the root of the HDI container in my storage account (note that LOAD DATA INPATH moves, rather than copies, the file into the table's warehouse directory).
Create ORC table
CREATE TABLE IF NOT EXISTS <your HDI database>.salesorc
(
salesdate int,
productid int,
salesamt float
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS ORC;
Load data into ORC table
INSERT OVERWRITE TABLE <your HDI database>.salesorc SELECT * FROM <your HDI database>.salestxt;
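A quick Hive query confirms the ORC table is populated; a sanity-check sketch:
-- Peek at the first few converted rows.
SELECT * FROM <your HDI database>.salesorc LIMIT 10;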
Once the HDI setup was done, I configured SQL Data Warehouse to connect to it using the following queries.
Create data source for HDInsight
CREATE EXTERNAL DATA SOURCE ORCAzureStorage
WITH
(
TYPE = HADOOP,
LOCATION = 'wasbs://<your HDI storage container>@<your storage account path>',
CREDENTIAL = AzureStorageCredential
);
The credential 'AzureStorageCredential' was re-used from the Azure Blob data source.
Create external table
CREATE EXTERNAL TABLE [hdp].FactSales (
salesdate int,
productid int,
salesamt float
)
WITH (
LOCATION ='/hive/warehouse/<your HDI database>/salestxt',
DATA_SOURCE = ORCAzureStorage,
FILE_FORMAT = PipeDelimitedTextFileFormat
);
FILE_FORMAT 'PipeDelimitedTextFileFormat' was re-used from the Azure Blob file format, since this external table points at the pipe-delimited text files backing salestxt; pointing it at the salesorc directory would instead require an external file format with FORMAT_TYPE = ORC. Schema [hdp] was created using the query CREATE SCHEMA [hdp];
Show data from external table
SELECT * FROM [hdp].FactSales
I then showed the output of a query joining tables that point to the two data sources, i.e. Azure Blob and HDInsight, from within SQL Data Warehouse.
SELECT * FROM
[dbo].DimDate a
JOIN [hdp].FactSales c ON a.DateId = c.salesdate
JOIN [asb].DimProduct b ON b.ProductID = c.productid;
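The same setup supports more realistic analytics; for instance, a hypothetical aggregate of sales by calendar quarter and product across both external sources:
-- Total sales per quarter and product, spanning blob and HDInsight data.
SELECT a.CalendarQuarter, b.ProductName, SUM(c.salesamt) AS TotalSales
FROM [dbo].DimDate a
JOIN [hdp].FactSales c ON a.DateId = c.salesdate
JOIN [asb].DimProduct b ON b.ProductID = c.productid
GROUP BY a.CalendarQuarter, b.ProductName;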
Day 2 of the event also promises to be exciting, with the following sessions:
- Session 6: Machine Learning in Azure
- Session 7: Doing Spark on Azure
- Session 8: Running Hadoop Clusters on Azure
- Session 9: Cognitive Services and Bots
If you have not registered already, register here. See you there!