Submitting Pig and Hive jobs from a .NET application


From: Developing big data solutions on Microsoft Azure HDInsight

You can define Pig and Hive jobs in a .NET client application by using the HiveJobCreateParameters and PigJobCreateParameters classes from the Microsoft Azure HDInsight NuGet package. You can then submit the jobs to an HDInsight cluster by using the CreateHiveJob and CreatePigJob methods of the IJobSubmissionClient interface, which is implemented by the object returned by the Connect method of the JobSubmissionClientFactory class.

The following code example shows how to define and submit a Hive job that executes a HiveQL statement to create a table.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;

namespace HiveClient
{
  class Program
  {
    static void Main(string[] args)
    {
      // Azure variables.
      string subscriptionID = "subscription-id";
      string certFriendlyName = "certificate-friendly-name";
      string clusterName = "cluster-name";

      string hiveQL = @"CREATE TABLE mytable (id INT, val STRING) 
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
         STORED AS TEXTFILE LOCATION '/data/mytable';";

      // Define the Hive job.
      HiveJobCreateParameters hiveJobDefinition = new HiveJobCreateParameters()
      {
        JobName = "Create Table",
        StatusFolder = "/CreateTableStatus",
        Query = hiveQL
      };

      // Get the certificate object from certificate store
      // using the friendly name to identify it.
      X509Store store = new X509Store();
      store.Open(OpenFlags.ReadOnly);
      X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>() 
        .First(item => item.FriendlyName == certFriendlyName);
      JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(
        new Guid(subscriptionID), cert, clusterName);

      // Create a Hadoop client to connect to HDInsight.
      var jobClient = JobSubmissionClientFactory.Connect(creds);

      // Run the Hive job.
      JobCreationResults jobResults = jobClient.CreateHiveJob(hiveJobDefinition);

      // Wait for the job to complete.
      Console.Write("Job running...");
      JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
      while (jobInProgress.StatusCode != JobStatusCode.Completed 
        && jobInProgress.StatusCode != JobStatusCode.Failed)
      {
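        Console.Write(".");
        // Pause before polling again so the loop condition sees the latest status.
        Thread.Sleep(TimeSpan.FromSeconds(10));
        jobInProgress = jobClient.GetJob(jobInProgress.JobId);
      }
      // The job has finished; report whether it completed or failed.
      Console.WriteLine("!");
      Console.WriteLine("Job finished with status: " + jobInProgress.StatusCode);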
      Console.WriteLine("Press a key to end.");
      Console.Read();
    }
  }
}

Notice the variables required to configure the Hadoop client. These include the unique ID of the subscription in which the cluster is defined (which you can view in the Azure management portal), the friendly name of the Azure management certificate to be loaded (which you can view in certmgr.msc), and the name of your HDInsight cluster.
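The certificate lookup in the example uses First, which throws a generic exception if no certificate has the specified friendly name, and it leaves the store open. The following is a minimal sketch of a more defensive lookup that uses only standard .NET Framework classes. The CertificateLookup class and its FindByFriendlyName method are hypothetical names introduced for illustration; they are not part of the HDInsight SDK.

```csharp
using System;
using System.Linq;
using System.Security.Cryptography.X509Certificates;

public static class CertificateLookup
{
    // Hypothetical helper (not part of the HDInsight SDK): finds a certificate
    // in the current user's personal store by its friendly name.
    public static X509Certificate2 FindByFriendlyName(string friendlyName)
    {
        // Same store that the parameterless X509Store constructor opens.
        X509Store store = new X509Store(StoreName.My, StoreLocation.CurrentUser);
        store.Open(OpenFlags.ReadOnly);
        try
        {
            X509Certificate2 cert = store.Certificates
                .Cast<X509Certificate2>()
                .FirstOrDefault(item => item.FriendlyName == friendlyName);

            if (cert == null)
            {
                // Fail with a message that names the missing certificate.
                throw new InvalidOperationException(
                    "No certificate with friendly name '" + friendlyName +
                    "' was found in the current user's personal store.");
            }
            return cert;
        }
        finally
        {
            // Release the store handle whether or not a certificate was found.
            store.Close();
        }
    }
}
```

Under these assumptions, the result of CertificateLookup.FindByFriendlyName(certFriendlyName) could be passed to the JobSubmissionCertificateCredential constructor in place of the inline query.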

In the previous example, the HiveQL command to be executed was specified as the Query property of the HiveJobCreateParameters object. A similar approach is used to specify the Pig Latin statements to be executed when using the PigJobCreateParameters class. Alternatively, you can use the File property to specify a file in Azure storage that contains the HiveQL or Pig Latin code to be executed. The following code example shows how to submit a Pig job that executes the Pig Latin code in a file that already exists in Azure storage.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Threading;
using System.IO;
using System.Security.Cryptography.X509Certificates;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using Microsoft.WindowsAzure.Management.HDInsight;
using Microsoft.Hadoop.Client;


namespace PigClient
{
  class Program
  {
    static void Main(string[] args)
    {
      // Azure variables.
      string subscriptionID = "subscription-id";
      string certFriendlyName = "certificate-friendly-name";
      string clusterName = "cluster-name";

      // Define the Pig job.
      PigJobCreateParameters pigJobDefinition = new PigJobCreateParameters()
      {
        StatusFolder = "/PigJobStatus",
        File = "/weather/scripts/SummarizeWeather.pig"
      };


      // Get the certificate object from certificate store 
      // using the friendly name to identify it.
      X509Store store = new X509Store();
      store.Open(OpenFlags.ReadOnly);
      X509Certificate2 cert = store.Certificates.Cast<X509Certificate2>()
        .First(item => item.FriendlyName == certFriendlyName);
      JobSubmissionCertificateCredential creds = new JobSubmissionCertificateCredential(
        new Guid(subscriptionID), cert, clusterName);

      // Create a Hadoop client to connect to HDInsight.
      var jobClient = JobSubmissionClientFactory.Connect(creds);

      // Run the Pig job.
      JobCreationResults jobResults = jobClient.CreatePigJob(pigJobDefinition);

      // Wait for the job to complete.
      Console.Write("Job running...");
      JobDetails jobInProgress = jobClient.GetJob(jobResults.JobId);
      while (jobInProgress.StatusCode != JobStatusCode.Completed 
        && jobInProgress.StatusCode != JobStatusCode.Failed)
      {
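        Console.Write(".");
        // Pause before polling again so the loop condition sees the latest status.
        Thread.Sleep(TimeSpan.FromSeconds(10));
        jobInProgress = jobClient.GetJob(jobInProgress.JobId);
      }
      // The job has finished; report whether it completed or failed.
      Console.WriteLine("!");
      Console.WriteLine("Job finished with status: " + jobInProgress.StatusCode);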
      Console.WriteLine("Press a key to end.");
      Console.Read();
    }
  }
}
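Both examples poll in an open-ended loop, so a job that never reaches the Completed or Failed status would keep the client waiting indefinitely. One way to bound the wait is sketched below. The Polling class and its WaitUntil method are hypothetical names introduced for illustration, and the helper is deliberately generic so that it does not depend on the HDInsight SDK; the commented usage shows how it might wrap the GetJob call from the examples, with a 30-minute timeout chosen arbitrarily.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Polling
{
    // Hypothetical helper: calls getStatus repeatedly until isFinished returns
    // true, and throws TimeoutException if that does not happen within timeout.
    // Returns the last status that was observed.
    public static T WaitUntil<T>(Func<T> getStatus, Func<T, bool> isFinished,
                                 TimeSpan pollInterval, TimeSpan timeout)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        T status = getStatus();
        while (!isFinished(status))
        {
            if (elapsed.Elapsed >= timeout)
            {
                throw new TimeoutException(
                    "The operation did not finish within " + timeout + ".");
            }
            Thread.Sleep(pollInterval);
            status = getStatus();
        }
        return status;
    }
}

// Possible usage with the job client from the examples (assumed, not compiled here):
//
// JobDetails finished = Polling.WaitUntil(
//     () => jobClient.GetJob(jobResults.JobId),
//     job => job.StatusCode == JobStatusCode.Completed
//         || job.StatusCode == JobStatusCode.Failed,
//     TimeSpan.FromSeconds(10),
//     TimeSpan.FromMinutes(30));
```

Keeping the status check in a delegate means the same helper can wait on either a Hive or a Pig job without change.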

You can combine this file-based approach with any of the data upload techniques described in Uploading data with the Microsoft .NET Framework to build a client application that uploads the source data and the Pig Latin or HiveQL script files required to process it, and then submits a job to initiate processing.
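The two examples show a Hive job defined with the Query property and a Pig job defined with the File property. The following fragment sketches the opposite combinations: a Pig job with inline Pig Latin, and a Hive job that runs a script file. The JobDefinitions class, the Pig Latin statements, the column schema, and all of the paths are hypothetical and shown only for illustration. The fragment assumes the same Microsoft Azure HDInsight NuGet package as the examples.

```csharp
using Microsoft.Hadoop.Client;

public static class JobDefinitions
{
    // Pig job defined with inline Pig Latin through the Query property.
    // The data paths and column schema are hypothetical.
    public static PigJobCreateParameters InlinePigJob()
    {
        string pigLatin = @"Source = LOAD '/weather/data' USING PigStorage(',')
                AS (day:chararray, temp:double);
            Grouped = GROUP Source BY day;
            Summary = FOREACH Grouped GENERATE group AS day, AVG(Source.temp) AS avgtemp;
            STORE Summary INTO '/weather/summary';";

        return new PigJobCreateParameters()
        {
            StatusFolder = "/PigJobStatus",
            Query = pigLatin
        };
    }

    // Hive job that runs HiveQL stored in a script file in Azure storage.
    // The script path is hypothetical.
    public static HiveJobCreateParameters FileBasedHiveJob()
    {
        return new HiveJobCreateParameters()
        {
            JobName = "Create Table From Script",
            StatusFolder = "/CreateTableStatus",
            File = "/data/scripts/CreateTable.hql"
        };
    }
}
```

The returned objects would be passed to CreatePigJob and CreateHiveJob in the same way as the job definitions in the examples.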
