Hadoop Services for Microsoft Azure: Learning Part 1 - 10GB GraySort - Teragen (video)

Hadoop-based Services for Microsoft Azure includes several samples you can use for learning and testing. One sample is the 10GB GraySort, a scaled-down version of the Hadoop TeraSort benchmark. There are three jobs to run, and in this video developer Brad Sarsfield walks you through the first one, Teragen.

Transcript

Hi, my name is Brad Sarsfield and I’m a Developer on the Hadoop Services for Windows and Windows Azure team.

In this video I will show you how to use the 10GB GraySort sample to generate 10GB of data and store it in HDFS. The GraySort sample is a scaled-down version of the TeraSort I/O benchmark [see sortbenchmark.org]. This video is part 1 in the series. In part 2 I'll submit a MapReduce job to sort the data (terasort) and write it back to disk. In part 3 I'll validate that the data has been sorted correctly (teravalidate).
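For reference, all three phases are jobs in the standard hadoop-examples JAR and can also be run from the Hadoop command line on the headnode. The following is a minimal sketch only; the exact JAR file name and the HDFS paths vary by cluster, so treat them as placeholders.

    REM Generate the input data (part 1 of this series)
    hadoop jar hadoop-examples.jar teragen <number-of-rows> <input-dir>
    REM Sort the generated data (part 2)
    hadoop jar hadoop-examples.jar terasort <input-dir> <sorted-output-dir>
    REM Validate that the output is sorted (part 3)
    hadoop jar hadoop-examples.jar teravalidate <sorted-output-dir> <validation-report-dir>

In the portal, the Create Job page builds these commands for you, which is what the rest of this walkthrough uses.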

So let’s get started.

 

  1. From the Samples page I select the 10GB GraySort sample. To start the process, I deploy the sample to my cluster.

  2. On the Create Job page, the fields are pre-populated for me, but I need to make a few changes.

    1. First, I rename the job from Terasort Example to Teragen to identify this as the data generation program/job. 
    2. The first parameter I leave as is – this is the name of the program that will be run from the hadoop-examples JAR.
    3. The 2nd parameter specifies the number of map tasks to be executed. By default, the cluster will only run 2 maps, meaning the data would be generated by 2 map tasks, each running and producing 5GB of data.

    But I know that it's better to split the data I'm generating across a larger number of smaller tasks, keeping in mind that map tasks can fail for a variety of reasons and would need to be restarted. Ideally, each task should take one to one-and-a-half minutes at most, so a failed task costs little time to redo. If a task fails, Hadoop will restart it either on the same node or on a different node. In fact, if Hadoop thinks one of the tasks is running too slowly, it will start a parallel copy of that task on another node and use the results of whichever one finishes first. This is called speculative execution.

  3. Based on my environment, I change the number of map tasks to 50.

    The 3rd parameter updates to reflect the change I made to parameter 2. I'm creating 100 million records and saving them to my example/data directory in HDFS, under an output directory named 10GB-sort-input.

    Hadoop for Windows Azure constructs and displays, at the bottom of the page, the Final Command that will be executed on the headnode (a sketch of what this command might look like appears just after this procedure).

  4. Execute this job. 

    Behind the scenes, the teragen job is submitted to the headnode. I can see here the command that is being run; I could also remote into the headnode over Terminal Services and run this command there myself. As the job runs, it prints informational messages to the Output and Errors sections.

    Remember that this is a map-only job with 50 different map tasks. Each task will generate a single 200MB file in HDFS containing 2 million records. Within each record, the data is pseudo-randomly generated and hashed to create the key; the first 10 bytes of the record are the key, and the remaining 90 bytes are the data. In the next video in this series I'll show you how Hadoop sorts the data based on these keys.

    I could configure my cluster to do this work faster; for instance, with more nodes it would run faster. But for this video I have only 4 worker nodes dividing up the work.

    When I submit the job and tell it to use 50 maps, the job scheduler decides on the placement and execution of tasks based on the available map slots. Once a task finishes, the next task in the queue begins.

    Each of my medium VM nodes is configured to accept 2 map tasks. So in this case, 4 worker nodes with 2 map slots each means I am running 8 tasks at a time. The job scheduler also tries to affinitize each task with its data, placing the task on the same machine as the data when it can, or as close to it as possible.

       

  5. The job completed successfully. I scroll down and see that all maps completed. I have created 100 million records, for a total of 10 billion bytes -- or 10GB (a quick way to confirm this from the headnode is sketched below).
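To make the Final Command mentioned above concrete, here is a minimal sketch of what the constructed teragen command might look like when run from the Hadoop command prompt on the headnode, along with a size check on the output. The JAR file name and HDFS path are assumptions based on this walkthrough; the portal shows the exact command for your cluster.

    REM Generate 100,000,000 rows of 100 bytes each, using 50 map tasks (JAR name and path assumed)
    hadoop jar hadoop-examples.jar teragen -Dmapred.map.tasks=50 100000000 /example/data/10GB-sort-input
    REM Summarize the size of the output directory; expect roughly 10,000,000,000 bytes
    hadoop fs -dus /example/data/10GB-sort-input

The -Dmapred.map.tasks=50 option is the per-job hint for the number of map tasks, and 100000000 is the number of 100-byte rows to generate.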

Let’s take a look at the data generated. There are several ways to do this:

  1. I can remote into the headnode over Terminal Services and browse HDFS via the Hadoop web interface to see the output of the teragen. Here I see a list of the 50 files that represent the output from the 50 maps (equivalent command-line checks are sketched after this list). 
  2. Drilling down further, I can see what the teragen data looks like. The first 10 bytes of each record are the key, followed by 90 bytes of data.
  3. And for even more detail, I can view the data by block. Since each file is 200MB and the HDFS block size is 64MB, a single file spans 3 blocks, and each block is triple-replicated in HDFS.
  4. I can also open the JavaScript Console to see the 50 files that correspond to the output of each map task. 
  5. I can also take a look at the data by going to Job History. Here I see that the job produced 100 million 100-byte records. I've written 10GB of data.
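If you prefer the command line over the web interface and the JavaScript Console, similar checks can be approximated with the HDFS shell and fsck, as referenced from item 1 above. This is a sketch only; the directory path matches the job in this walkthrough, and the part-file name is illustrative and may differ on your cluster.

    REM List the 50 part files written by the 50 map tasks
    hadoop fs -ls /example/data/10GB-sort-input
    REM Page through the raw 100-byte records (part-file name is illustrative)
    hadoop fs -cat /example/data/10GB-sort-input/part-00000 | more
    REM Show how each file breaks into 64MB blocks and where the three replicas of each block live
    hadoop fsck /example/data/10GB-sort-input -files -blocks -locations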

Now it’s time to sort that data.  Sorting is covered in the next video in this series. 
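Looking ahead, and assuming the same JAR name and paths as above, the sort in part 2 would likely be submitted as something along these lines, reading the teragen output and writing the sorted result to a new directory (the output path here is an assumption):

    REM Sort the generated data; covered in part 2 of this series
    hadoop jar hadoop-examples.jar terasort /example/data/10GB-sort-input /example/data/10GB-sort-output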

Thank you for watching; I hope you found it helpful.