Choosing tools and technologies

04/07/2016

From: Developing big data solutions on Microsoft Azure HDInsight

Hadoop-based big data systems such as HDInsight enable data processing using a wide range of tools and technologies, many of which are described earlier in this section of the guide. This topic compares the commonly used tools and technologies to help you choose the most appropriate ones for your own scenarios. The following table shows the main advantages and considerations for each one.

Query mechanism	Advantages	Considerations
Hive using HiveQL	An excellent solution for batch processing and analysis of large amounts of immutable data, for data summarization, and for ad hoc querying. It uses a familiar SQL-like syntax. It can be used to produce persistent tables of data that can be easily partitioned and indexed. Multiple external tables and views can be created over the same data. It supports a simple data warehouse implementation that provides massive scale-out and fault tolerance capabilities for data storage and processing.	It requires the source data to have at least some identifiable structure. It is not suitable for real-time queries or row-level updates. It is best used for batch jobs over large sets of data. It might not be able to carry out some types of complex processing tasks.
Pig using Pig Latin	An excellent solution for manipulating data as sets, merging and filtering datasets, applying functions to records or groups of records, and for restructuring data by defining columns, by grouping values, or by converting columns to rows. It can use a workflow-based approach as a sequence of operations on data.	SQL users may find Pig Latin is less familiar and more difficult to use than HiveQL. The default output is usually a text file and so it is more difficult to use with visualization tools such as Excel. Typically you will layer a Hive table over the output.
Custom map/reduce	It provides full control over the map and reduce phases and execution. It allows queries to be optimized to achieve maximum performance from the cluster, or to minimize the load on the servers and the network. The components can be written in a range of widely known languages that most developers are likely to be familiar with.	It is more difficult than using Pig or Hive because you must create your own map and reduce components. Processes that require the joining of sets of data are more difficult to implement. Even though there are test frameworks available, debugging the code is more complex than debugging a normal application because the components run as a batch job under the control of the Hadoop job scheduler.
HCatalog	It abstracts the path details of storage, making administration easier and removing the need for users to know where the data is stored. It enables notification of events such as data availability, allowing other tools such as Oozie to detect when operations have occurred. It exposes a relational view of data, including partitioning by key, and makes the data easy to access.	It supports RCFile, CSV text, JSON text, SequenceFile and ORC file formats by default, but you may need to write a custom SerDe if you use other formats. HCatalog is not thread-safe. There are some restrictions on the data types for columns when using the HCatalog loader in Pig scripts. See HCatLoader Data Types in the Apache HCatalog documentation for more details.
Mahout	It can make it easier and quicker to build intelligent applications that require data mining and data classification. It contains pre-created algorithms for many common machine learning and data mining scenarios. It can be scaled across distributed nodes to maximize performance.	Accurate results are usually obtained only when there are large pre-categorized sets of reference data. Learning from scratch, such as in a “related products” (collaborative filtering) scenario where reference data is gradually accumulated, can take some time to provide accurate results. As a framework, it often requires writing considerable supporting code to use it within a solution.
Storm	It provides an easy way to implement highly scalable, fault-tolerant, and reliable real-time processing for streaming data. It makes it easier to build complex parallel processing topologies. It is useful for monitoring and raising alerts in real time while storing incoming stream data for analysis later. It supports a high-level abstraction called Trident that provides an intuitive fluent interface.	Enabling the full set of features for guaranteed processing can impose a performance penalty.
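
The table notes that joining sets of data is harder with custom map/reduce than with Hive or Pig. The following is a minimal sketch of a reduce-side join, simulated locally in plain Python rather than run on a cluster. The record layouts, the function names (`map_customers`, `map_orders`, `shuffle`, `reduce_join`, `run_join`), and the sample fields are all hypothetical and chosen only for illustration; a real job would rely on Hadoop to perform the shuffle and sort between the two phases.

```python
from collections import defaultdict


def map_customers(record):
    # Hypothetical layout: (customer_id, name). Tag the value so the
    # reducer can tell which dataset it came from.
    customer_id, name = record
    yield customer_id, ("customer", name)


def map_orders(record):
    # Hypothetical layout: (order_id, customer_id, amount).
    order_id, customer_id, amount = record
    yield customer_id, ("order", (order_id, amount))


def shuffle(pairs):
    # Stand-in for the shuffle phase: group all values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(key, values):
    # Separate the tagged values, then pair every customer with every order.
    # Keys present in only one dataset produce no output (inner join).
    names = [v for tag, v in values if tag == "customer"]
    orders = [v for tag, v in values if tag == "order"]
    for name in names:
        for order_id, amount in orders:
            yield key, name, order_id, amount


def run_join(customers, orders):
    pairs = []
    for record in customers:
        pairs.extend(map_customers(record))
    for record in orders:
        pairs.extend(map_orders(record))
    results = []
    for key, values in sorted(shuffle(pairs).items()):
        results.extend(reduce_join(key, values))
    return results
```

Even in this small form, the join needs tagging, grouping, and pairing logic that a single `JOIN` clause in HiveQL or Pig Latin would express directly, which is the trade-off the table describes.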

Typically, you will use the simplest of these approaches that can provide the results you require. For example, Hive alone may be sufficient, but for more complex scenarios you may need to use Pig or even write your own map and reduce components. You may also decide, after experimenting with Hive or Pig, that custom map and reduce components can provide better performance by allowing you to fine-tune and optimize the processing.

The following table offers general suggestions to help you choose an appropriate query technology based on the requirements of your task.

Requirement	Appropriate technologies
Table or dataset joins, or manipulating nested data.	Pig and Hive
Ad hoc data analysis.	Pig, map/reduce (including Hadoop Streaming for non-Java components)
SQL-like data analysis and data warehousing.	Hive
Working with binary data or SequenceFiles.	Hive with Avro, Java map/reduce components
Working with existing Java or map/reduce libraries.	Java map/reduce components, UDFs in Hive and Pig
Maximum performance for large or recurring jobs.	Well-designed Java map/reduce components
Using scripting or non-Java languages.	Hadoop Streaming
Abstracting storage paths for Hive and Pig to simplify administration.	HCatalog
Pre-processing or fully processing streaming data in real time.	Storm
Performing classification and data mining through collaborative filtering and machine learning.	Mahout
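
As an illustration of the Hadoop Streaming row, the sketch below shows a word-count mapper and reducer written in Python in the style Streaming expects: each step consumes text lines and emits tab-separated key/value lines, and the reducer assumes its input arrives sorted by key. The function names `mapper` and `reducer` and the word-count task are assumptions made for this example; in an actual Streaming job each function would typically live in its own script that reads from standard input and writes to standard output.

```python
from itertools import groupby


def mapper(lines):
    # Emit one "word<TAB>1" line for every word in the input.
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    # Input must already be sorted by key, as Hadoop guarantees between
    # the map and reduce phases. Sum the counts for each run of equal keys.
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

To try the two steps locally, pass the mapper output through `sorted()` before handing it to `reducer`, which stands in for the sort that Hadoop performs between the phases.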

For more information about these tools and technologies, see Data processing tools and techniques.
