Unifying and stabilizing jobs with HCatalog
From: Developing big data solutions on Microsoft Azure HDInsight
HCatalog makes it easier to create complex, multi-step data processing solutions that enable you to operate on the same data by using Hive, Pig, or custom map/reduce code without having to handle storage details in each script. HCatalog doesn’t change the way scripts and queries work. It just abstracts the details of the data file location and the schema so that your code becomes less fragile because it has fewer dependencies, and the solution is also much easier to administer.
An overview of the use of HCatalog as a storage abstraction layer is shown in Figure 1.
Figure 1 - Unifying different processing mechanisms with HCatalog
To understand the benefits of using HCatalog, consider the typical scenario shown in Figure 2.
Figure 2 - An example solution that benefits from using HCatalog
In the example, Hive scripts create the metadata definition for two tables that have different schemas. The table named mydata is created over some source data uploaded as a file to HDInsight, and this defines a Hive location for the table data (step 1 in Figure 2). Next, a Pig script reads the data defined in the mydata table, summarizes it, and stores it back in the second table named mysummary (steps 2 and 3). However, in reality, the Pig script does not access the Hive tables (which are just metadata definitions). It must access the source data file in storage, and write the summarized result back into storage, as shown by the dotted arrow in Figure 2.
In Hive, the path or location of these two files is (by default) denoted by the name used when the tables were created. For example, the following HiveQL shows the definition of the two Hive tables in this scenario, and the code that loads a data file into the mydata table.
CREATE TABLE mydata (col1 STRING, col2 INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';
LOAD DATA INPATH '/mydata/data.txt' INTO TABLE mydata;
CREATE TABLE mysummary (col1 STRING, col2 BIGINT);
Hive scripts can now access the data simply by using the table names mydata and mysummary. Users can use HiveQL to query the Hive table without needing to know anything about the underlying file location or the data format of that file.
However, the Pig script that will group and aggregate the data, and store the results in the mysummary table, must know both the location and the data format of the files. Without HCatalog, the script must specify the full path to the mydata table source file, and be aware of the source schema in order to apply an appropriate schema (which must be defined in the script). In addition, after the processing is complete, the Pig script must specify the location associated with the mysummary table when storing the result back in storage, as shown in the following code sample.
A = LOAD '/mydata/data.txt'
USING PigStorage('\t') AS (col1, col2:long);
...
...
...
STORE X INTO '/mysummary/data.txt';
The file locations, the source format, and the schema are hard-coded in the script, creating some potentially problematic dependencies. For example, if an administrator moves the data files, or changes the format by adding a column, the Pig script will fail.
Using HCatalog removes these dependencies by enabling Pig to use the Hive metadata that defines the tables. To use HCatalog with Pig you must specify the -useHCatalog parameter, and the path to the HCatalog installation files must be registered as an environment variable named HCAT_HOME. For example, you could use the following Hadoop command line statements to launch the Grunt interface with HCatalog enabled.
SET HCAT_HOME=C:\apps\dist\hcatalog-0.4.1
Pig -useHCatalog
With the HCatalog support loaded you can now use the HCatLoader and HCatStorer objects in the Pig script, enabling you to access the data through the Hive metadata instead of requiring direct access to the data file storage.
A = LOAD 'mydata'
USING org.apache.hcatalog.pig.HCatLoader();
...
...
...
STORE X INTO 'mysummary'
USING org.apache.hcatalog.pig.HCatStorer();
The script stores the summarized data in the location denoted for the mysummary table defined in Hive, and so it can be queried using HiveQL as shown in the following example.
SELECT * FROM mysummary;
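The Pig samples above elide the processing steps. As a minimal sketch of what a complete HCatalog-based script might look like, assuming the summary is a simple sum of col2 for each distinct col1 value (the relation name B and the choice of SUM are illustrative assumptions, not part of the original scenario), the script could be written as follows.

```pig
-- Load rows through the Hive metadata; the schema (col1, col2) comes from the mydata table definition.
A = LOAD 'mydata'
    USING org.apache.hcatalog.pig.HCatLoader();

-- Assumed summarization: group by col1 and total col2 for each group.
B = GROUP A BY col1;
X = FOREACH B GENERATE group AS col1, SUM(A.col2) AS col2;

-- Store through the Hive metadata; the field names match the columns of the mysummary table.
STORE X INTO 'mysummary'
    USING org.apache.hcatalog.pig.HCatStorer();
```

In this sketch the output fields are aliased to col1 and col2 so that they line up with the columns of mysummary. Summing the INT column in Pig yields a long value, which corresponds to the BIGINT type declared for col2 in that table.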
HCatalog also exposes notification events that other tools, such as Oozie, can use to detect when certain storage events occur.
Note
For more information see HCatalog in the Apache Hive Confluence Spaces.