Processing, querying, and transforming data using HDInsight

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

This section of the guide explores the tools, techniques, and technologies for processing data in a big data solution. This processing may include executing queries to extract data, transformations to modify and shape data, and a range of other operations such as creating tables or executing workflows.

Microsoft big data solutions, including HDInsight on Microsoft Azure, are based on a Hadoop distribution called the Hortonworks Data Platform (HDP). It uses the YARN resource manager to implement a runtime platform for a wide range of data query, transformation, and storage tools and applications. Figure 1 shows the high-level architecture of HDP, and how it supports the tools and applications described in this guide.

Figure 1 - High-level architecture of the Hortonworks Data Platform

The three most commonly used tools for processing data by executing queries and transformations, in order of popularity, are Hive, Pig, and map/reduce.
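To make the map/reduce model more concrete, the following is a minimal word-count sketch written for Hadoop streaming, which lets you implement mappers and reducers in languages such as Python rather than Java. The file names and data are purely illustrative; a real job would be submitted to the cluster using the Hadoop streaming jar.

```python
# mapper.py - reads lines of text from stdin and emits a tab-separated
# (word, 1) pair for every word it finds.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

The matching reducer relies on the framework sorting the intermediate pairs by key before they reach it:

```python
# reducer.py - Hadoop streaming sorts the mapper output by key, so the
# counts for each word arrive together and a running total per key is
# enough to produce the final (word, total) pairs.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```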

HCatalog is a feature of Hive that, among other things, removes dependencies on literal file paths, which helps to stabilize and unify solutions that incorporate multiple steps.

Mahout is a scalable machine learning library for clustering, classification, and collaborative filtering that you can use to examine data files in order to extract specific types of information.

Storm is a distributed, real-time processing system designed to handle streaming data as it arrives.

These applications can be used for a wide variety of tasks, and many of them can be easily combined into multi-step workflows by using Oozie.

This section of the guide contains the following topics:

Note

HBase is a database management system that can provide scalability for storing vast amounts of data, support for real-time querying, consistent reads and writes, automatic and configurable sharding of tables, and high reliability with automatic failover. For more information see “Data storage” in the topic Specifying the infrastructure.

Evaluating the results

After you have used HDInsight to process the source data, you can use the results for analysis and reporting, which form the foundation for business decision making. However, before making critical business decisions based on the results, you must carefully evaluate them to ensure they are:

  • Meaningful. The values in the results, when combined and analyzed, relate to one another in a meaningful way.
  • Accurate. The results appear to be correct, or are within the expected range.
  • Useful. The results are applicable to the business decision they will support, and provide relevant metrics that help inform the decision making process.

You will often need a business user who intimately understands the business context of the data to act as a data steward and “sanity check” the results, determining whether or not they fall within expected parameters. It may not be possible to validate all of the source data for a query, especially if it is collected from external sources such as social media sites. However, depending on the complexity of the processing, you might select a number of data inputs for spot-checking and trace them through the process to confirm that they produce the expected outcomes.
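As a simple illustration of this kind of spot-checking, the sketch below takes a random sample of input records, computes an expected output for each one using an independent reference calculation, and compares that with the value produced by the big data process. The helper functions passed in (and the idea of a single expected value per record) are hypothetical placeholders for your own data access and reference logic.

```python
import random

def spot_check(inputs, expected_result, actual_result, sample_size=10):
    """Trace a random sample of inputs through the process and report any
    records whose processed output differs from an independently computed
    expected value."""
    sample = random.sample(inputs, min(sample_size, len(inputs)))
    mismatches = []
    for record in sample:
        expected = expected_result(record)  # simple reference calculation
        actual = actual_result(record)      # value produced by the big data process
        if expected != actual:
            mismatches.append((record, expected, actual))
    return mismatches

# Hypothetical usage:
# problems = spot_check(load_sample_inputs(), expected_result, actual_result)
# if problems:
#     print("Spot check failed for %d record(s)" % len(problems))
```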

When you are planning to use HDInsight to perform predictive analysis, it can be useful to evaluate the process against known values. For example, if your goal is to use demographic and historical sales data to determine the likely revenue for a proposed retail store, you can validate the processing model by using appropriate source data to predict revenue for an existing store and compare the resulting prediction to the actual revenue value. If the results of the data processing you have implemented vary significantly from the actual revenue, then it seems unlikely that the results for the proposed store will be reliable.
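A minimal sketch of that validation step might look like the following, where the figures and the agreed tolerance are purely illustrative: the model's prediction for an existing store is compared with that store's known revenue, and a large relative error suggests that predictions for the proposed store should not be trusted.

```python
def relative_error(predicted, actual):
    """Size of the prediction error as a proportion of the actual value."""
    return abs(predicted - actual) / actual

actual_revenue = 1250000.0     # known annual revenue of an existing store
predicted_revenue = 1410000.0  # revenue predicted by the processing model for that store
tolerance = 0.10               # for example, accept predictions within 10% of actual

error = relative_error(predicted_revenue, actual_revenue)
if error > tolerance:
    print("Validation failed: prediction is %.1f%% away from the actual value" % (error * 100))
else:
    print("Validation passed: prediction is %.1f%% away from the actual value" % (error * 100))
```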

Considerations

Consider the following points when designing and developing data processing solutions:

  • Big data frameworks offer a huge range of tools that you can use with the Hadoop core engine, and choosing the most appropriate can be difficult. Azure HDInsight simplifies the process because all of the tools it includes are guaranteed to be compatible and work correctly together. This doesn’t mean you can’t incorporate other tools and frameworks in your solution.
  • Of the query and transformation applications, Hive is the most popular. However, many HDInsight processing solutions are actually incremental in nature—they consist of multiple queries, each operating on the output of the previous one. These queries may use different query applications. For example, you might first use a custom map/reduce job to summarize a large volume of unstructured data, and then create a Pig script to restructure and group the data values produced by the initial map/reduce job. Finally, you might create Hive tables based on the output of the Pig script so that client applications such as Excel can easily consume the results. A skeleton of such a chained pipeline is sketched in the example after this list.
  • If you decide to use a resource-intensive application such as HBase or Storm, you should consider running it on a separate cluster from your Hadoop-based big data batch processing solution to avoid contention and consequent loss of performance for the application and your solution as a whole.
  • The challenges don’t end with simply writing and running a job. As in any data processing scenario, it’s vitally important to check that the results generated by queries are realistic, valid, and useful before you invest a lot of time and effort (and cost) in developing and extending your solution. A common use of HDInsight is simply to experiment with data to see if it can offer insights into previously undiscovered information. As with any investigational or experimental process, you need to be convinced that each stage is producing results that are both valid (otherwise you gain nothing from the answers) and useful (in order to justify the cost and effort).
  • Unless you are simply experimenting with data to find the appropriate questions to ask, you will want to automate some or all of the tasks and be able to run the solution from a remote computer. For more information see Building custom clients and Building end-to-end solutions using HDInsight.
  • Security is a fundamental concern in all computing scenarios, and big data processing is no exception. Security considerations apply during all stages of a big data process, and include securing data while in transit over the network, securing data in storage, and authenticating and authorizing users who have access to the tools and utilities you use as part of your process. For more details of how you can maximize security of your HDInsight solutions see the topic Security in the section Building end-to-end solutions using HDInsight.
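To illustrate the incremental, multi-step style of processing described earlier in this list, the following sketch chains a custom map/reduce job, a Pig script, and a Hive script by running the standard command-line tools in sequence. It assumes shell access to the cluster, and the jar, script, and storage path names are hypothetical; in practice you might define such a workflow in Oozie or drive it from a remote client instead.

```python
import subprocess

# Each step consumes the output of the previous one. The jar name, script
# names, and storage paths are placeholders for your own artifacts.
steps = [
    # 1. Custom map/reduce job summarizes the raw unstructured data.
    ["hadoop", "jar", "summarize.jar", "/data/raw", "/data/summarized"],
    # 2. Pig script restructures and groups the summarized values.
    ["pig", "-f", "restructure.pig"],
    # 3. Hive script creates tables over the Pig output for clients such as Excel.
    ["hive", "-f", "create-tables.hql"],
]

for step in steps:
    print("Running: %s" % " ".join(step))
    # check_call raises CalledProcessError if a step fails, so later steps
    # are not run against missing or incomplete intermediate output.
    subprocess.check_call(step)
```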

More information

For more information about HDInsight, see the Microsoft Azure HDInsight web page.

A central point for TechNet articles about HDInsight is HDInsight Services For Windows.

For examples of how you can use HDInsight, see the following tutorials on the HDInsight website:
