Running a map/reduce job with Windows PowerShell


From: Developing big data solutions on Microsoft Azure HDInsight

To submit a map/reduce job that uses a Java .jar file to process data you can use the New-AzureHDInsightMapReduceJobDefinition cmdlet to define the job and its parameters, and then initiate the job by using the Start-AzureHDInsightJob cmdlet. The job is run asynchronously. If you want to show the output generated by the job you must wait for it to complete by using the Wait-AzureHDInsightJob cmdlet with a suitable timeout value, and then display the job output with the Get-AzureHDInsightJobOutput cmdlet.

Note that the job output in this context does not refer to the data files generated by the job, but to the status and outcome messages generated while the job is in progress. This is the same output as displayed in the console window when the job is executed interactively at the command line.

The following code example shows a PowerShell script that uses the mymapreduceclass class in mymapreducecode.jar with arguments to indicate the location of the data to be processed and the folder where the output files should be stored. The script waits up to 3600 seconds for the job to complete, and then displays the output that was generated by the job.

$clusterName = "cluster-name"

$jobDef = New-AzureHDInsightMapReduceJobDefinition `
  -JarFile "wasb:///mydata/jars/mymapreducecode.jar" `
  -ClassName "mymapreduceclass" `
  -Arguments "wasb:///mydata/source", "wasb:///mydata/output"

$wordCountJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDef

Write-Host "Map/Reduce job submitted..."

Wait-AzureHDInsightJob -Job $wordCountJob -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $wordCountJob.JobId -StandardError

Note

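Due to page width limitations, some of the commands in the code above are broken across several lines for clarity. In your own code, each command must either be on a single, unbroken line or have every broken line end with the PowerShell line-continuation character, a backtick (`).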

The Azure PowerShell module also provides the New-AzureHDInsightStreamingMapReduceJobDefinition cmdlet, which you can use to execute map/reduce jobs that are implemented in .NET assemblies and that use the Hadoop Streaming API. This cmdlet enables you to specify discrete .NET executables for the mapper and reducer to be used in the job.
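
As a rough sketch of how a streaming job might be defined and submitted, the following script follows the same submit, wait, and display pattern as the Java example. The executable names (mymapper.exe and myreducer.exe), the storage paths, and the variable names are hypothetical placeholders, not values from this guide; substitute the locations of your own compiled mapper and reducer. Each multi-line command ends its broken lines with a backtick so that PowerShell treats it as a single command.

```powershell
# Placeholder cluster name; replace with the name of your own cluster.
$clusterName = "cluster-name"

# Define a Hadoop Streaming job. The -Files parameter lists the executables
# to make available to the job; -Mapper and -Reducer name the ones to run.
# All paths and executable names below are hypothetical.
$streamingJobDef = New-AzureHDInsightStreamingMapReduceJobDefinition `
  -Files "wasb:///mydata/apps/mymapper.exe", "wasb:///mydata/apps/myreducer.exe" `
  -Mapper "mymapper.exe" `
  -Reducer "myreducer.exe" `
  -InputPath "wasb:///mydata/source" `
  -OutputPath "wasb:///mydata/streamingoutput"

# Submit the job; it runs asynchronously on the cluster.
$streamingJob = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $streamingJobDef

Write-Host "Streaming map/reduce job submitted..."

# Wait up to 3600 seconds for completion, then show the status messages.
Wait-AzureHDInsightJob -Job $streamingJob -WaitTimeoutInSeconds 3600

Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $streamingJob.JobId -StandardError
```

As with the Java example, the output displayed here consists of the status and outcome messages for the job; the data files that the reducer generates are written to the folder given in -OutputPath.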
