Retrieving job output files with the .NET Framework

From: Developing big data solutions on Microsoft Azure HDInsight

When the HDInsight jobs you have used to process your data do not generate Hive tables, you can implement code to retrieve the results from the output files generated by the jobs. The following examples demonstrate this:

  • Using the Microsoft .NET API for Hadoop WebClient Package
  • Using the Windows Azure Storage Library

Using the Microsoft .NET API for Hadoop WebClient Package

When your application project includes a reference to the Microsoft .NET API for Hadoop WebClient package, you can use the OpenFile method of the WebHDFSClient class to open the output files and read their contents. This approach is particularly convenient if you have already used classes in this package to upload the source data and initiate the HDInsight jobs.

As an example, the following code shows a simple console application that reads and displays the contents of an output file stored as a blob named /weather/output/part-r-00000. The examples in this section are deliberately kept simple by including the credentials in the code, so that you can copy and paste them while you are experimenting with HDInsight. In a production system you must protect credentials, as described in “Securing credentials in scripts and applications” in the Security section of this guide.

using System;
using System.Text;
using System.Threading.Tasks;

using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;

namespace BlobClient
{
  class Program
  {
    static void Main(string[] args)
    {
      // Block until the file has been read and displayed, so that the
      // output appears before the prompt.
      GetResult().Wait();
      Console.WriteLine("--------------------------");
      Console.WriteLine("Press a key to end");
      Console.Read();
    }

    // Returns a Task (rather than void) so that the caller can wait for completion.
    static async Task GetResult()
    {
      var hdInsightUser = "user-name";
      var storageName = "storage-account-name";
      var storageKey = "storage-account-key";
      var containerName = "container-name";
      var outputFile = "/weather/output/part-r-00000";

      // Get the contents of the output file.
      var hdfsClient = new WebHDFSClient(hdInsightUser,
          new BlobStorageAdapter(storageName, storageKey, containerName, false));
      var response = await hdfsClient.OpenFile(outputFile);
      var content = await response.Content.ReadAsStringAsync();
      Console.WriteLine(content);
    }
  }
}

The output from this example code is shown in Figure 1.

Figure 1 - Output retrieved using the OpenFile method of the WebHDFSClient class

Note

For more information about using the .NET SDK, see HDInsight SDK Reference Documentation and the incubator projects on the CodePlex website.

Using the Windows Azure Storage Library

In some cases you may want to download the output files generated by HDInsight jobs so that they can be opened in client applications such as Excel. You can use the CloudBlockBlob class in the Windows Azure Storage package to download the contents of the blob to a file.

The following example uses the Windows Azure Storage package in a console application to download the blob weather/output/part-r-00000 to a local file named results.txt.

using System;
using System.Text;
using System.Threading.Tasks;

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Blob;
using System.IO;

namespace BlobDownloader
{
  class Program
  {
    const string AZURE_STORAGE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;"
          + "AccountName=storage-account-name;AccountKey=storage-account-key";
    static void Main(string[] args)
    {
      CloudStorageAccount storageAccount = CloudStorageAccount.Parse
                                           (AZURE_STORAGE_CONNECTION_STRING);
      CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
      CloudBlobContainer container = blobClient.GetContainerReference("container-name");
      CloudBlockBlob blob = container.GetBlockBlobReference("weather/output/part-r-00000");

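      // File.Create truncates an existing file; File.OpenWrite would leave
      // stale bytes at the end if the existing file were longer than the blob.
      string fileName;
      using (var fileStream = File.Create(@".\results.txt"))
      {
        blob.DownloadToStream(fileStream);
        fileName = fileStream.Name;
      }

      Console.WriteLine("Results downloaded to " + fileName);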
      Console.WriteLine("Press a key to end");
      Console.Read();
    }
  }
}

The output from this example code is shown in Figure 2.

Figure 2 - Downloading a blob with the CloudBlockBlob.DownloadToStream method

The Windows Azure Storage package offers a more versatile way to consume output files generated by HDInsight jobs than the WebHDFSClient class in the Microsoft .NET API for Hadoop WebClient package. In particular, you can use other classes in the Windows Azure Storage package to browse blob hierarchies in a container and to download all of the blobs in a specific path. This makes it easier to download results from HDInsight jobs that generate multiple output files, or in cases where the exact name of an output file is unknown. The library also provides asynchronous methods, which let an application download extremely large output files without blocking the calling thread.
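
As an illustration of downloading every output file in a path, the following is a minimal sketch rather than a definitive implementation. It assumes a version of the Windows Azure Storage package that includes the Task-based methods (such as DownloadToFileAsync), and it assumes that the job wrote its results to files whose names start with "part-" under weather/output/. The BlobFolderDownloader namespace, the DownloadFolderAsync and IsResultFile helpers, and the .\results target folder are hypothetical names introduced for this example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

namespace BlobFolderDownloader
{
  class Program
  {
    // Placeholder credentials, as in the other examples in this section.
    const string AZURE_STORAGE_CONNECTION_STRING = "DefaultEndpointsProtocol=https;"
          + "AccountName=storage-account-name;AccountKey=storage-account-key";

    static void Main(string[] args)
    {
      // Wait for all of the downloads to finish before showing the prompt.
      DownloadFolderAsync("container-name", "weather/output/", @".\results").Wait();
      Console.WriteLine("Press a key to end");
      Console.Read();
    }

    // Assumption: result files follow the usual Hadoop naming pattern
    // (part-r-00000, part-m-00000, ...); marker files such as _SUCCESS are skipped.
    internal static bool IsResultFile(string blobName)
    {
      return Path.GetFileName(blobName).StartsWith("part-", StringComparison.Ordinal);
    }

    static async Task DownloadFolderAsync(string containerName, string prefix,
                                          string targetFolder)
    {
      CloudStorageAccount storageAccount = CloudStorageAccount.Parse
                                           (AZURE_STORAGE_CONNECTION_STRING);
      CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
      CloudBlobContainer container = blobClient.GetContainerReference(containerName);
      Directory.CreateDirectory(targetFolder);

      // A flat listing returns every blob whose name starts with the prefix,
      // so the exact output file names do not need to be known in advance.
      var blobs = container.ListBlobs(prefix, useFlatBlobListing: true)
                           .OfType<CloudBlockBlob>()
                           .Where(b => IsResultFile(b.Name));

      foreach (CloudBlockBlob blob in blobs)
      {
        string localPath = Path.Combine(targetFolder, Path.GetFileName(blob.Name));

        // The asynchronous download does not block the calling thread.
        await blob.DownloadToFileAsync(localPath, FileMode.Create);
        Console.WriteLine("Downloaded " + blob.Name + " to " + localPath);
      }
    }
  }
}
```

The files are downloaded one at a time to keep the sketch simple. If the job produces many large files, you could start several downloads and wait for them together with Task.WhenAll.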

Note

For more information about using the classes in the Windows Azure Storage package, see How to use Blob Storage from .NET.
