Data processing tools and techniques
From: Developing big data solutions on Microsoft Azure HDInsight
There is a huge range of tools and frameworks you can use with a big data solution based on Hadoop. This guide focuses on Microsoft big data solutions, and specifically those built on HDInsight running in Azure. The following sections of this topic describe the data processing tools you are most likely to use in an HDInsight solution, and the techniques that you are likely to apply.
- Overview of Hive
- Overview of Pig
- Custom map/reduce components
- User-defined functions
- Overview of HCatalog
- Overview of Mahout
- Overview of Storm
- Configuring and debugging solutions
Overview of Hive
Hive is an abstraction layer over the Hadoop query engine. It provides a query language called HiveQL, which is syntactically very similar to SQL, and it supports the ability to create tables of data that can be accessed remotely through an ODBC connection.
In effect, Hive enables you to create an interface to your data that can be used in a similar way to a traditional relational database. Business users can use familiar tools such as Excel and SQL Server Reporting Services to consume data from HDInsight much as they would from a database system such as SQL Server. Installing the ODBC driver for Hive on a client computer enables users to connect to an HDInsight cluster and submit HiveQL queries that return data to an Excel worksheet, or to any other client that can consume results through ODBC. HiveQL also allows you to plug in custom mappers and reducers to perform more sophisticated processing.
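For example, the following HiveQL statements sketch how a table might be defined over existing tab-delimited text files in cluster storage and then queried. The table name, column names, and storage path are purely illustrative.

```
-- Define a table over existing tab-delimited files in cluster storage
-- (the table name, columns, and path shown here are illustrative).
CREATE EXTERNAL TABLE PageViews (UserName STRING, Url STRING, EventDate STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION '/data/pageviews';

-- Summarize the data; Hive compiles the query into jobs that run on the cluster.
SELECT Url, COUNT(*) AS Hits
FROM PageViews
GROUP BY Url
ORDER BY Hits DESC;
```

Because the table is defined as EXTERNAL, dropping it removes only the table definition and leaves the underlying data files in place.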
Hive is a good choice for data processing when:
- You want to process large volumes of immutable data to perform summarization, ad hoc queries, and analysis.
- The source data has some identifiable structure, and can easily be mapped to a tabular schema.
- You want to create a layer of tables through which business users can easily query source data, and data generated by previously executed map/reduce jobs or Pig scripts.
- You want to experiment with different schemas for the table format of the output.
- You are familiar with SQL-like languages and syntax.
- The processing you need to perform can be expressed effectively as HiveQL queries.
The latest versions of HDInsight incorporate a technology called Tez, part of the Stinger initiative for Hadoop, which dramatically improves the performance of Hive. For more details, see Stinger: Interactive Query for Hive on the Hortonworks website.
Note
If you are not familiar with Hive, a basic introduction to using HiveQL can be found in the topic Processing data with Hive. You can also experiment with Hive by executing HiveQL statements in the Hive Editor page of the HDInsight management portal. See Monitoring and logging for more details.
Overview of Pig
Pig is a query interface that provides workflow-style semantics for processing data in HDInsight. Pig enables you to perform complex processing of your source data to generate output that is useful for analysis and reporting.
Pig statements are expressed in a language named Pig Latin, and generally involve defining relations that contain data, either loaded from a source file or as the result of a Pig Latin expression on an existing relation. Relations can be thought of as result sets, and can be based on a schema (which you define in the Pig Latin statement used to create the relation) or can be completely unstructured.
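As a simple illustration, the following Pig Latin statements load a tab-delimited source file into a relation with a defined schema, filter and group the data, and store the summarized result. The file paths and column names are illustrative.

```
-- Load tab-delimited source data into a relation with a defined schema.
logs = LOAD '/data/serverlogs' USING PigStorage('\t')
       AS (logdate:chararray, status:int, url:chararray);

-- Keep only error entries, group them by URL, and count each group.
errors = FILTER logs BY status >= 400;
grouped = GROUP errors BY url;
counts = FOREACH grouped GENERATE group AS url, COUNT(errors) AS errorcount;

-- Store the results back into cluster storage.
STORE counts INTO '/data/output/errorcounts';
```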
Pig is a good choice when you need to:
- Restructure source data by defining columns, grouping values, or converting columns to rows.
- Perform data transformations such as merging and filtering data sets, and applying functions to all or subsets of records.
- Use a workflow-based approach to process data as a sequence of operations, which is often a logical way to approach many data processing tasks.
Note
If you are not familiar with Pig, a basic introduction to using Pig Latin can be found in the topic Processing data with Pig.
Custom map/reduce components
Map/reduce code consists of two separate functions implemented as map and reduce components. The map component is run in parallel on multiple cluster nodes, each node applying it to its own subset of the data. The reduce component collates and summarizes the results from all of the map functions (see How do big data solutions work? for more details of these two components).
In most HDInsight processing scenarios it is simpler and more efficient to use a higher-level abstraction such as Pig or Hive, although you can create custom map and reduce components for use within Hive scripts in order to perform more sophisticated processing.
Custom map/reduce components are typically written in Java. However, Hadoop provides a streaming interface that allows you to use components developed in other languages such as C#, F#, Visual Basic, Python, and JavaScript. For more information see the section “Using Hadoop Streaming” in the topic Writing map/reduce code.
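As a minimal sketch of the streaming approach, the following mapper and reducer are written in Python and implement a simple word count; the file names are illustrative. Each component reads lines from standard input and writes tab-delimited key/value pairs to standard output, which is how the streaming interface passes data between the stages.

```
#!/usr/bin/env python
# mapper.py - emit a (word, 1) pair for each word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word.lower(), 1))
```

```
#!/usr/bin/env python
# reducer.py - sum the counts for each word. Hadoop streaming delivers the
# mapper output sorted by key, so all values for a given word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split('\t')
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
```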
You might consider creating your own map and reduce components when:
- You want to process completely unstructured data by parsing it and applying custom logic to extract structured information from it.
- You want to perform complex tasks that are difficult (or impossible) to express in Pig or Hive without resorting to creating a UDF. For example, you might need to use an external geocoding service to convert latitude and longitude coordinates or IP addresses in the source data to geographical location names.
- You want to reuse your existing .NET, Python, or JavaScript code in map/reduce components. You can do this using the Hadoop streaming interface.
Note
If you are not familiar with writing map/reduce components, a basic introduction and information about using Hadoop streaming can be found in the topic Writing map/reduce code.
User-defined functions
Developers often find that they reuse the same code in several locations, and the typical way to optimize this is to create a user-defined function (UDF) that can be imported into other projects when required. Often a series of related UDFs is packaged into a library so that the whole library can be imported into a project, and both Hive and Pig can take advantage of any of the UDFs such a library contains. For more information see User-defined functions.
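For example, Pig can execute UDFs written in Python through its Jython support. The function below is a hypothetical sketch that normalizes a string value; the outputSchema decorator that declares the return schema is supplied by Pig's Jython UDF environment, and the script and alias names are illustrative. Hive offers a comparable extensibility model based on Java UDFs.

```
# string_udfs.py - a hypothetical UDF library that Pig can call through Jython.
# Pig's Jython UDF environment supplies the outputSchema decorator, which
# declares the schema of the value returned by the function.

@outputSchema("cleaned:chararray")
def clean_text(value):
    # Trim whitespace and normalize the case of the input value.
    if value is None:
        return None
    return value.strip().lower()
```

The library is then registered and called from Pig Latin, for example:

```
-- Register the library and call the UDF in a Pig Latin statement.
REGISTER 'string_udfs.py' USING jython AS udfs;
cleaned = FOREACH logs GENERATE udfs.clean_text(url) AS url;
```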
Overview of HCatalog
Technologies such as Hive, Pig, and custom map/reduce code can be used to process data in an HDInsight cluster. In each case you use code to project a schema onto data that is stored in a particular location, and then apply the required logic to filter, transform, summarize, or otherwise process the data to generate the required results.
The code must load the source data from wherever it is stored, and convert it from its current format to the required schema. This means that each script must include assumptions about the location and format of the source data. These assumptions create dependencies that can cause your scripts to break if an administrator chooses to change the location, format, or schema of the source data.
Additionally, each processing interface (Hive, Pig, or custom map/reduce) requires its own definition of the source data, and so complex data processes that involve multiple steps in different interfaces require consistent definitions of the data to be maintained across all of the scripts.
HCatalog provides a tabular abstraction layer that helps unify the way that data is interpreted across processing interfaces, and provides a consistent way for data to be loaded and stored—regardless of the specific processing interface being used. This abstraction exposes a relational view over the data, including support for partitions.
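For example, a Pig Latin script can load and store data through table definitions held in HCatalog rather than hard-coding file locations, formats, and schemas. The sketch below uses hypothetical table names, and the loader and storer class names (which vary between HCatalog releases) follow the older org.apache.hcatalog.pig namespace.

```
-- Load a table by name through HCatalog instead of specifying a file path,
-- format, and schema in the script.
rawlogs = LOAD 'serverlogs' USING org.apache.hcatalog.pig.HCatLoader();

-- Summarize the data and store the results into another HCatalog table.
summary = FOREACH (GROUP rawlogs BY url) GENERATE group AS url, COUNT(rawlogs) AS hits;
STORE summary INTO 'logsummary' USING org.apache.hcatalog.pig.HCatStorer();
```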
The following factors will help you decide whether to incorporate HCatalog in your HDInsight solution:
- It makes it easy to abstract the data storage location, format, and schema from the code used to process it.
- It minimizes fragile dependencies between scripts in complex data processing solutions where the same data is processed by multiple tasks.
- It enables notification of data availability, making it easier to write applications that perform multiple jobs.
- It is easy to incorporate into solutions that include Hive and Pig scripts, requiring very little extra code. However, if you use only Hive scripts and queries, or you are creating a one-shot solution for experimentation purposes and do not intend to use it again, HCatalog is unlikely to provide any benefit.
- Files in JSON, SequenceFile, CSV, and RC format can be read and written by default, and a custom serializer/deserializer component (SerDe) can be used to read and write files in other formats (see SerDe on the Apache wiki for more details).
- Additional effort is required to use HCatalog in custom map/reduce components because you must create your own custom load and store functions.
Note
For more information, see Unifying and stabilizing jobs with HCatalog.
Overview of Mahout
Mahout is a machine learning and data mining library that you can use to examine data files in order to extract specific types of information. It provides implementations of several machine learning algorithms, and is typically used with source data files containing relationships between the items of interest in a data processing solution. For example, it can use a data file containing the similarities between different movies and TV shows to create a list of recommendations for customers based on items they have already viewed or purchased. The source data could be obtained from a third party, or generated and updated by your application based on purchases made by other customers.
Mahout queries are typically executed as a separate process, perhaps based on a schedule, to update the results. These results are usually stored as a file within the cluster storage, though they may be exported to a database or to visualization tools. Mahout can also be executed as part of a workflow. However, it is a batch-based process that may take some time to execute with large source datasets.
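As an illustrative sketch, an item-based recommendation job might be started from the cluster command line in a form such as the following; the input file of user, item, and preference values and the output path are hypothetical, and the exact arguments depend on the Mahout version installed.

```
mahout recommenditembased --input /data/purchases.csv --output /data/recommendations --similarityClassname SIMILARITY_COOCCURRENCE --numRecommendations 10
```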
Mahout is a good choice when you need to:
- Apply clustering algorithms to group documents or data items that contain similar content.
- Apply recommendation mining algorithms to discover users’ preferences from their behavior.
- Apply classification algorithms to assign new documents or data items to a category based on the existing categorizations.
- Perform frequent data mining operations based on the most recent data.
Note
For more information see the Apache Mahout website.
Overview of Storm
Storm is a scalable, fault-tolerant, distributed, real-time computation system for processing large, fast-moving streams of data. It allows you to build trees and directed acyclic graphs (DAGs) that asynchronously process data items using a user-defined number of parallel tasks. It can be used for real-time analytics, online machine learning, continuous computation, Extract Transform Load (ETL) tasks, and more.
Storm processes messages or stream inputs as individual data items, which it refers to as tuples. Input data is exposed by spouts that connect to an input stream such as a message queue, and pass the data as messages to one or more bolts. Each bolt is a processing task that can be configured to run as multiple instances. Bolts can pass data as messages to other bolts using a range of routing options. For example, a bolt might pass the results of all the messages it processes to several different bolts, to a specific single bolt, or to a range of bolts based on filtering a value in the message.
This flexible and configurable routing system allows you to construct complex graphs of tasks to perform real-time processing. Bolts can maintain state for aggregation operations, and output results to a range of different types of storage including relational databases. This makes it ideal for performing ETL tasks as well as providing a real-time filtering, validation, and alerting solution for streaming data. A high-level language called Trident can be used to build complex queries and processing solutions with Storm.
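The following Java sketch shows the general shape of a topology definition. SensorSpout, FilterBolt, and AlertBolt are hypothetical spout and bolt implementations, and the package names shown are those used by the Storm 0.9.x releases.

```
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;

public class AlertTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // A spout reads tuples from an input stream such as a message queue.
        builder.setSpout("sensor-readings", new SensorSpout(), 2);

        // A bolt filters out-of-band values; fieldsGrouping routes all tuples
        // with the same sensor id to the same bolt instance.
        builder.setBolt("filter", new FilterBolt(), 4)
               .fieldsGrouping("sensor-readings", new Fields("sensorId"));

        // A second bolt raises alerts based on the filtered stream.
        builder.setBolt("alert", new AlertBolt(), 2)
               .shuffleGrouping("filter");

        Config conf = new Config();
        StormSubmitter.submitTopology("alert-topology", conf, builder.createTopology());
    }
}
```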
Storm is a good choice when you need to:
- Pre-process data before loading it into a big data solution.
- Handle huge volumes of data or messages that arrive at a very high rate.
- Filter and sort incoming stream data for storing in separate files, repositories, or database tables.
- Examine the data stream in real time, perhaps to raise alerts for out-of-band values or specific combinations of events, before analyzing it later using one of the batch-oriented query mechanisms such as Hive or Pig.
Note
For more information, see the Tutorial on the Storm documentation website.