Writing map/reduce code

patterns & practices Developer Center

From: Developing big data solutions on Microsoft Azure HDInsight

You can implement map/reduce code in Java (which is the native language for map/reduce jobs in all Hadoop distributions), or in a number of other supported languages including JavaScript, Python, C#, and F# through Hadoop Streaming.

Creating map and reduce components

The following code sample shows a commonly referenced JavaScript map/reduce example that counts the words in a source that consist of unstructured text data.

var map = function (key, value, context) {
  var words = value.split(/[^a-zA-Z]/);
  for (var i = 0; i < words.length; i++) {
    if (words[i] !== "") {
      context.write(words[i].toLowerCase(), 1);
    }
  }
};

var reduce = function (key, values, context) {
  var sum = 0;
  while (values.hasNext()) {
    sum += parseInt(values.next());
  }
  context.write(key, sum);
};

The map function splits the contents of the text input into an array of strings using anything that is not an alphabetic character as a word delimiter. Each string in the array is then used as the key of a new key/value pair with the value set to 1.

Each key/value pair generated by the map function is passed to the reduce function, which sums the values in key/value pairs that have the same key. Working together, the map and reduce functions determine the total number of times each unique word appeared in the source data, as shown here.

Aardvark 2
About    7
Above    12
Action   3
...

Note

For more information about writing map/reduce code, see MapReduce Tutorial on the Apache Hadoop website and Develop Java MapReduce programs for HDInsight on the Azure website.

Using Hadoop streaming

The Hadoop core within HDInsight supports a technology called Hadoop Streaming that allows you to interact with the map/reduce process and run your own code outside of the Hadoop core as a separate executable process. Figure 1 shows a high-level overview of the way that streaming works.

Figure 1 - High level overview of Hadoop Streaming

Figure 1 - High level overview of Hadoop Streaming

Note

Figure 1 shows how streaming executes the map and reduce components as separate processes. The schematic does not attempt to illustrate all of the standard map/reduce stages, such as sorting and merging the intermediate results or using multiple instances of the reduce component.

When using Hadoop Streaming, each node passes the data for the map part of the process to a separate process through the standard input (“stdin”), and accepts the results from the code through the standard output (“stdout”), instead of internally invoking a map component written in Java. In the same way, the node(s) that execute the reduce process pass the data as a stream to the specified code or component, and accept the results from the code as a stream, instead of internally invoking a Java reduce component.

Streaming has the advantage of decoupling the map/reduce functions from the Hadoop core, allowing almost any type of components to be used to implement the mapper and the reducer. The only requirement is that the components must be able to read from and write to the standard input and output.

Using the streaming interface does have a minor impact on performance. The additional movement of the data over the streaming interface can marginally increase query execution time. Streaming tends to be used mostly to enable the creation of map and reduce components in languages other than Java. It is quite popular when using Python, and also enables the use of .NET languages such as C# and F# with HDInsight.

The Azure SDK contains a series of classes that make it easier to use the streaming interface from .NET code. For more information see Microsoft .NET SDK For Hadoop on CodePlex.

Note

For more details see Hadoop Streaming on the Apache website. For information about writing HDInsight map/reduce jobs in languages other than Java, see Develop C# Hadoop streaming programs for HDInsight and Hadoop Streaming Alternatives.

Next Topic | Previous Topic | Home | Community