Creating the data warehouse


From: Developing big data solutions on Microsoft Azure HDInsight

A data warehouse is fundamentally a database optimized for queries that support data analysis and reporting. Data warehouses are often implemented as relational databases with a schema that optimizes query performance and the aggregation of important numerical measures, at the expense of some data duplication in denormalized tables. When creating a data warehouse for big data you can use a relational database engine designed to handle huge volumes of data (such as Microsoft Analytics Platform System), or you can load the data into a Hadoop cluster and use Hive tables to project a schema onto the data.

In this scenario the data analysts at A. Datum have decided to use a Hadoop cluster provided by the HDInsight service. This enables them to build a Hive-based data warehouse schema that can be used as a source for analysis and reporting in traditional tools such as Excel, while also allowing them to apply big data analysis tools and techniques.

Figure 1 shows a schematic view of the use case and model for using HDInsight to implement a data warehouse.

Figure 1 - Using HDInsight as a data warehouse for analysis, reporting, and as a business data source

Unlike a traditional relational database, HDInsight allows you to manage the lifetime and storage of table and index definitions (the metadata) separately from the data that populates the tables. A Hive table is simply a definition that is applied over a folder containing data, and this separation of schema and data enables one of the primary differences between Hadoop-based big data batch processing solutions and relational databases: you apply a schema when the data is read, rather than when it is written.

In this scenario you’ll see how this “schema on read” approach provides an advantage for organizations that need a data warehousing capability where data is continuously collected, but analysis and reporting are carried out only occasionally.
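As a simple illustration of schema on read, the following sketch defines a table over a folder of existing data, queries it, and then drops it. The folder name and column names here are hypothetical; it assumes /data/example already contains tab-delimited text files.

-- Assumes /data/example already holds tab-delimited text files (hypothetical example).
CREATE EXTERNAL TABLE Example (Id INT, Value STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/example';

-- The schema is applied to the existing files as they are read.
SELECT * FROM Example LIMIT 10;

-- Removes only the table definition; the files in /data/example remain in place.
DROP TABLE Example;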

Creating a database

When planning the HDInsight data warehouse, the data analysts at A. Datum needed to consider how to ensure that the data and Hive tables can easily be recreated if the HDInsight cluster is released and re-provisioned. This might happen for a number of reasons, including temporarily decommissioning the cluster to save costs during periods of non-use, or releasing the cluster in order to create a new one with more nodes to scale out the data warehouse.

A new cluster can be created over one or more existing Azure blob storage containers that hold the data, but the Hive (and other) metadata is stored separately in an Azure SQL Database instance. To be able to recreate this metadata, the analysts identified two possible approaches:

  • Save a HiveQL script that can be used to recreate EXTERNAL tables based on the data persisted in Azure blob storage.
  • Specify an existing Azure SQL Database instance to host the Hive metadata store when the cluster is created.

Using a HiveQL script to recreate tables after releasing and re-provisioning a cluster is an effective approach when the data warehouse will contain only a few tables and other objects. The script can be executed to recreate the tables over the existing data when the cluster is re-provisioned. However, selecting an existing SQL Database instance (which you maintain separately from the cluster) to be used as the Hive metadata store is also very easy, and removes the need to rerun the scripts. You can back up this database using the built-in tools, or export the data so that you can recreate the database if required.
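For example, a recreation script might look similar to the following sketch; the table definition and LOCATION paths shown here assume the default folders used later in this scenario. Using IF NOT EXISTS means the script can be rerun safely against a newly provisioned cluster, and the explicit LOCATION clauses reattach each table to the existing data in blob storage.

-- Sketch of a recreation script; in practice it would contain one
-- CREATE EXTERNAL TABLE statement for each table in the warehouse.
CREATE DATABASE IF NOT EXISTS DW LOCATION '/DW/database';

CREATE EXTERNAL TABLE IF NOT EXISTS DW.States (StateCode STRING, StateName STRING)
  STORED AS SEQUENCEFILE
  LOCATION '/DW/database/States';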

Note

See Cluster and storage initialization for more details of using existing storage accounts and a separate Azure SQL Database instance to restore a cluster.

Creating a logical database in HDInsight is a useful way to separate the contents of the data warehouse from other items in the same cluster; for example, from Hive tables used for other analytical processes. To do this, the data analysts created a dedicated database for the data warehouse by using the following HiveQL statement.

CREATE DATABASE DW LOCATION '/DW/database';

This statement creates a folder named /DW/database as the default folder for all objects created in the DW database.
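If you want to verify the location that was registered for the database, you can examine its metadata. The following statement returns the database properties, including the fully qualified storage path (the path shown in the comment is illustrative).

DESCRIBE DATABASE DW;
-- Returns the database name and location, for example
-- wasb://<container>@<account>.blob.core.windows.net/DW/database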

Creating tables

The tornado data includes the code for the state where the tornado occurred, as well as the date and time of the tornado. The data analysts want to be able to display the full state name in reports, and so created a table for state names using the following HiveQL statement.

CREATE EXTERNAL TABLE DW.States (StateCode STRING, StateName STRING)
  STORED AS SEQUENCEFILE;

Notice that the table is stored in the default location. For the DW database this is the /DW/database folder, so this is where a new folder named States is created. The table is stored in SEQUENCEFILE format; tables in this format typically provide faster query performance than tables in which the data is stored as plain text.

Note

You must use EXTERNAL tables if you want the data to be persisted when you delete a table definition or when you recreate a cluster. Storing the data in SEQUENCEFILE format is also a good idea as it can improve performance. You might also consider using the ORC file format, which provides a highly efficient way to store Hive data and can improve performance when reading, writing, and processing data. See ORC File Format for more information.
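As an illustration only (the analysts chose SEQUENCEFILE for this scenario), the same States table could be defined with ORC storage simply by changing the storage clause.

-- Illustrative alternative: the same table stored in ORC format.
CREATE EXTERNAL TABLE DW.States (StateCode STRING, StateName STRING)
  STORED AS ORC;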

The data analysts also want to be able to aggregate data by temporal hierarchies (year, month, and day) and create reports that show month and day names. While many client applications used to analyze and report on data support this kind of functionality, the analysts want to be able to generate reports without relying on specific client application capabilities.

To support date-based hierarchies and reporting, the analysts created a table containing various date attributes that can be used as a lookup table for date codes in the tornado data. The creation of a date table like this is a common pattern in relational data warehouses.

CREATE EXTERNAL TABLE DW.Dates
  (DateCode STRING, CalendarDate STRING, DayOfMonth INT, MonthOfYear INT,
   Year INT, DayOfWeek INT, WeekDay STRING, Month STRING)
  STORED AS SEQUENCEFILE;

Finally, the data analysts created a table for the tornado data itself. Since this table is likely to be large, and many queries will filter by year, they decided to partition the table on a Year column, as shown in the following HiveQL statement.

CREATE EXTERNAL TABLE DW.Tornadoes
  (DateCode STRING, StateCode STRING, EventTime STRING, Category INT, Injuries INT,
   Fatalities INT, PropertyLoss DOUBLE, CropLoss DOUBLE, StartLatitude DOUBLE,
   StartLongitude DOUBLE, EndLatitude DOUBLE, EndLongitude DOUBLE,
   LengthMiles DOUBLE, WidthYards DOUBLE)
  PARTITIONED BY (Year INT) STORED AS SEQUENCEFILE;
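
After data has been loaded (as described in the topic that follows), these three tables can be joined to produce reports that use friendly state and month names. The following query is a sketch of the kind of aggregation the analysts plan to run; the year value is just an example, and filtering on the Year partition column allows Hive to read only the relevant partition.

-- Sketch of a typical report query over the warehouse tables.
SELECT s.StateName, d.Month, COUNT(*) AS TornadoCount, SUM(t.Fatalities) AS TotalFatalities
FROM DW.Tornadoes t
  JOIN DW.States s ON t.StateCode = s.StateCode
  JOIN DW.Dates d ON t.DateCode = d.DateCode
WHERE t.Year = 2012    -- example year; restricts the query to a single partition
GROUP BY s.StateName, d.Month;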

Next, the analysts needed to upload the data for the data warehouse. This is described in Loading data into the data warehouse.
