How do big data solutions work?
From: Developing big data solutions on Microsoft Azure HDInsight
In the days before Structured Query Language (SQL) and relational databases, data was typically stored in flat files, often in simple text format with fixed-width columns. Application code would open and read one or more files sequentially, or jump to specific locations based on the known line width of a row, to read the text and parse it into columns to extract the values. Results would be written back by creating a new copy of the file or a separate results file.
Modern relational databases put an end to all this, giving us greater power and additional capabilities for extracting information simply by writing queries in a standard format such as SQL. The database system hides all the complexity of the underlying storage mechanism and the logic for assembling the query that extracts and formats the information. However, as the volume of data that we collect continues to increase, and the native structure of this information is less clearly defined, we are moving beyond the capabilities of even enterprise-level relational database systems.
Big data batch processing solutions essentially follow a simple process: the source files are broken up into multiple blocks, and the blocks are replicated across a distributed cluster of commodity nodes. Data processing runs in parallel on each node, and the results of the parallel processes are then combined into an aggregated result set.
At the core of many big data implementations is an open source technology named Apache Hadoop. Hadoop was developed by Yahoo and the code was then provided as open source to the Apache Software Foundation. The most recent versions of Hadoop are commonly understood to contain the following main assets:
- The Hadoop kernel, or core package, containing the Hadoop distributed file system (HDFS), the map/reduce framework, and common routines and utilities.
- A runtime resource manager that allocates tasks, and executes queries (such as map/reduce jobs) and other applications. This is usually implemented through the YARN framework, although other resource managers such as Mesos are available.
- Other resources, tools, and utilities that run under the control of the resource manager to support tasks such as managing data and running queries or other jobs on the data.
Notice that map/reduce is just one application that you can run on a Hadoop cluster. Several query, management, and other types of applications are available or under development. Examples are:
- Accumulo: A key/value NoSQL database that runs on HDFS.
- Giraph: An iterative graph processing system designed to offer high scalability.
- Lasr: An in-memory analytics processor for tasks that are not well suited to map/reduce processing.
- Reef: A query mechanism designed to implement iterative algorithms for graph analytics and machine learning.
- Storm: A distributed real-time computation system for processing fast, large streams of data.
In addition there are many other open source components and tools that can be used with Hadoop. The Apache Hadoop website lists the following:
- Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters.
- Avro: A data serialization system.
- Cassandra: A scalable multi-master database with no single points of failure.
- Chukwa: A data collection system for managing large distributed systems.
- HBase: A scalable, distributed database that supports structured data storage for large tables.
- Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout: A scalable machine learning and data mining library.
- Pig: A high-level data-flow language and execution framework for parallel computation.
- Spark: A fast, general-use compute engine with a simple and expressive programming model.
- Tez: A generalized data-flow programming framework for executing both batch and interactive tasks.
- ZooKeeper: A high-performance coordination service for distributed applications.
Note
A list of commonly used tools and frameworks for big data projects based on Hadoop can be found in Appendix A - Tools and technologies reference. Some of these tools are not supported on HDInsight—for more details see What is Microsoft HDInsight?
Figure 1 shows an overview of a typical Hadoop-based big data mechanism.
Figure 1 - The main assets of Apache Hadoop (version 2 onwards)
We'll be exploring and using some of these components in the scenarios described in subsequent sections of this guide. For more information about all of them, see the Apache Hadoop website.
The cluster
In Hadoop, a cluster of servers stores the data using HDFS, and processes it. Each member server in the cluster is called a data node, and contains an HDFS data store and a query execution engine. The cluster is managed by a server called the name node that has knowledge of all the cluster servers and the parts of the data files stored on each one. The name node server does not store any of the data to be processed, but is responsible for storing vital metadata about the cluster and the location of each block of the source data, directing clients to the other cluster members, and keeping track of the state of each one by communicating with a software agent running on each server.
To store incoming data, the name node server directs the client to the appropriate data node server. The name node also manages replication of data files across all the other cluster members, which communicate with each other to replicate the data. The data is divided into blocks, and three copies of each block are stored across the cluster servers in order to provide resilience against failure and data loss (the block size and the number of replicated copies are configurable for the cluster).
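For illustration, the following minimal sketch writes a file to HDFS using the Hadoop Java API, setting the replication factor and block size explicitly through the standard dfs.replication and dfs.blocksize properties. The cluster URI, path, and data are placeholders, and in practice these settings are normally configured cluster-wide rather than per application.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        // Standard HDFS settings: three replicas and a 128 MB block size.
        // These are usually defined cluster-wide in hdfs-site.xml.
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");
        conf.set("dfs.blocksize", "134217728");

        // Connect to the cluster's file system (the URI is illustrative).
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:8020"), conf);

        // Write a file; the name node decides which data nodes hold each block.
        try (FSDataOutputStream out = fs.create(new Path("/data/orders/part-0001.txt"))) {
            out.writeBytes("1001,Widget,5\n");
        }
        fs.close();
    }
}
```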
The data store
The data store running on each server in a cluster is a suitable distributed storage service such as HDFS or a compatible equivalent. Hadoop implementations may use a cluster where all the servers are co-located in the same datacenter in order to minimize bandwidth use and maximize query performance.
Note
HDInsight does not completely follow this principle. Instead it provides an HDFS-compatible service over blob storage, which means that the data is stored in the datacenter storage cluster that is co-located with the virtualized servers that run the Hadoop framework. For more details, see What is Microsoft HDInsight?
The data store in a Hadoop implementation is usually referred to as a NoSQL store, although this is not technically accurate because some implementations do support a structured query language (such as SQL). In fact, some people prefer to use the term “Not Only SQL” for just this reason. There are more than 120 NoSQL data store implementations available at the time of writing, but they can be divided into the following basic categories:
- Key/value stores. These are data stores that hold data as a series of key/value pairs. The value may be a single data item or a complex data structure. There is no fixed schema for the data, and so these types of data store are ideal for unstructured data. An example of a key/value store is Azure table storage, where each row has a key and a property bag containing one or more values. Key/value stores can be persistent or volatile.
- Document stores. These are data stores optimized to hold structured, semi-structured, and unstructured data items such as JSON objects, XML documents, and binary data. They are usually indexed stores.
- Block stores. These are typically non-indexed stores that hold binary data, which can be a representation of data in any format. For example, the data could represent JSON objects or it could just be a binary data stream. An example of a block store is Azure blob storage, where each item is identified by a blob name within a virtual container structure.
- Wide column or column family data stores. These are data stores that do use a schema, but the schema can contain families of columns rather than just single columns. They are ideally suited to storing semi-structured data, where some columns can be predefined but others are capable of storing differing elements of unstructured data. HBase running on HDFS is an example (see the brief sketch after this list). HBase is discussed in more detail in the topic Specifying the infrastructure in this guide.
- Graph data stores. These are data stores that hold the relationships between objects. They are less common than the other types of data store, many still being experimental, and they tend to have specialist uses.
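As an illustration of the wide column model described above, the following minimal sketch uses the HBase Java client to write two cells into a column family for one row. The table name, column family, and values are illustrative only, and the exact client API differs between HBase versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWideColumnSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("OrderDetails"))) {

            // Each row is identified by a key; columns live inside a column family
            // and can differ from row to row (names and values are illustrative).
            Put put = new Put(Bytes.toBytes("order-1001-line-01"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("ProductName"), Bytes.toBytes("Widget"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("Quantity"), Bytes.toBytes("5"));
            table.put(put);
        }
    }
}
```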
NoSQL storage is typically much cheaper than relational storage, and usually supports a write-once capability that allows data only to be appended. To update data in these stores you must drop and recreate the relevant file, or maintain delta files and implement mechanisms to conflate the data. This limitation maximizes throughput; storage implementations are usually measured by throughput rather than capacity because throughput is typically the most significant factor for both storage and query efficiency.
Modern data management approaches such as Event Sourcing, Command Query Responsibility Segregation (CQRS), and related patterns do not encourage updates to data. Instead, new data is added and milestone records are used to fix the current state of the data at intervals. This approach provides better performance and maintains the history of changes to the data. For more information about CQRS and Event Sourcing see the patterns & practices guide CQRS Journey.
The query mechanism
Big data batch processing queries are commonly based on a distributed processing mechanism called map/reduce (often written as MapReduce) that provides optimum performance across the servers in a cluster. Map and reduce are mathematical operations. A map operation applies a function to a list of data items, returning a list of transformed items. A reduce operation applies a combining function that recursively combines the items in one or more lists into a single output.
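To make these two operations concrete, the following sketch shows the same ideas using ordinary Java collections and streams rather than Hadoop itself; the order lines and their format are purely illustrative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceConcept {
    public static void main(String[] args) {
        // Illustrative order detail lines in the form "ProductName,Quantity".
        List<String> orderLines = Arrays.asList("Widget,3", "Sprocket,5", "Widget,2");

        // Map: apply a function to every item, producing a list of transformed items.
        List<Integer> quantities = orderLines.stream()
                .map(line -> Integer.parseInt(line.split(",")[1]))
                .collect(Collectors.toList());

        // Reduce: apply a combining function that folds the list into a single result.
        int total = quantities.stream().reduce(0, Integer::sum);

        System.out.println(total); // prints 10
    }
}
```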
In a big data framework such as Hadoop, a map/reduce query uses two components, usually written in Java, which implement the algorithms that perform a two-stage data extraction and rollup process. The Map component runs on each data node server in the cluster extracting data that matches the query, and optionally applying some processing or transformation to the data to acquire the required result set from the files on that server. The Reduce component runs on one or more of the data node servers, and combines the results from all of the Map components into the final results set.
Note
For a detailed description of the MapReduce framework and programming model, see MapReduce.org.
As a simplified example of a map/reduce query, assume that the input data contains a list of the detail lines from customer orders. Each detail line contains a reference (foreign key) that links it to the main order record, the name of the item ordered, and the quantity ordered. If this data was stored in a relational database, a query of the following form could be used to generate a summary of the total number of each item sold:
SELECT ProductName, SUM(Quantity) FROM OrderDetails GROUP BY ProductName
The equivalent using a big data solution requires a Map and a Reduce component. The Map component running on each node operates on a subset, or chunk, of the data. It transforms each order line into a name/value pair where the name is the product name, and the value is the quantity from that order line. Note that in this example the Map component does not sum the quantity for each product; it simply transforms the data into a list.
Next, the framework shuffles and sorts all of the lists generated by the Map component instances into a single list, and executes the Reduce component with this list as the input. The Reduce component sums the totals for each product, and outputs the results. Figure 2 shows a schematic overview of the process.
Figure 2 - A high level view of the map/reduce process for storing data and extracting information
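As a rough sketch of what these two components might look like using the Hadoop Java API, the following classes implement the order details example. The class names and the assumed OrderId,ProductName,Quantity line format are illustrative only, and a complete job would also need a driver class (a sketch of one appears later in this topic).

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ProductTotals {

    // Map: each input line is an order detail line, assumed here to be
    // comma-separated as OrderId,ProductName,Quantity (illustrative format).
    public static class ProductMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            String productName = fields[1];
            int quantity = Integer.parseInt(fields[2]);
            // Emit a name/value pair for every order line; no summing happens here.
            context.write(new Text(productName), new IntWritable(quantity));
        }
    }

    // Reduce: after the shuffle and sort, all values for one product arrive
    // together, so the reducer simply sums them.
    public static class ProductReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable quantity : values) {
                total += quantity.get();
            }
            context.write(key, new IntWritable(total));
        }
    }
}
```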
Depending on the configuration of the query job, there may be more than one Reduce component instance running. The output from each Map component instance is stored in a buffer on disk, and the component exits. The content of the buffer is then sorted, and passed to one or more Reduce component instances. Intermediate results are stored in the buffer until the final Reduce component instance combines them all.
In some cases the process might include an additional component called a Combiner that runs on each data node as part of the Map process, and performs a “reduce” type of operation on this part of the data each time the map process runs. It may also run as part of the reduce phase, and again when large datasets are being merged.
In the example shown here, a Combiner could sum the values for each product so that the output is smaller, which can reduce network load and memory requirements—with a subsequent increase in overall query efficiency. Often, as in this example, the Combiner and the Reduce components would be identical.
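The Combiner is wired into a job through its configuration. The following driver sketch, which reuses the illustrative classes from the earlier sketch, shows how the Combiner (and the number of Reduce instances) might typically be set using the Hadoop Java API; the paths and job name are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProductTotalsDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "product totals");
        job.setJarByClass(ProductTotalsDriver.class);

        job.setMapperClass(ProductTotals.ProductMapper.class);
        // Because the reduce operation is a simple sum, the same class can act
        // as the Combiner, shrinking the intermediate data emitted by each node.
        job.setCombinerClass(ProductTotals.ProductReducer.class);
        job.setReducerClass(ProductTotals.ProductReducer.class);
        job.setNumReduceTasks(2);   // more than one Reduce instance may run

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/orderdetails"));
        FileOutputFormat.setOutputPath(job, new Path("/data/product-totals"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```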
Performing a map/reduce operation involves several stages such as partitioning the input data, reading and writing data, and shuffling and sorting the intermediate results. Some of these operations are quite complex. However, they are typically the same every time—irrespective of the actual data and the query. The great thing with a map/reduce framework such as Hadoop is that you usually need to create only the Map and Reduce components. The framework does the rest.
Although the core Hadoop engine requires the Map and Reduce components it executes to be written in Java, you can use other techniques to create them in the background without writing Java code. For example you can use tools named Hive and Pig that are included in most big data frameworks to write queries in a SQL-like or a high-level language. You can also use the Hadoop streaming API to execute components written in other languages—see Hadoop Streaming on the Apache website for more details.
More information
The official site for Apache big data solutions and tools is the Apache Hadoop website.
For a detailed description of the MapReduce framework and programming model, see MapReduce.org.