Serializing data with the Microsoft .NET Library for Avro


From: Developing big data solutions on Microsoft Azure HDInsight

The .NET Library for Avro is a component of the .NET SDK for HDInsight that you can use to serialize and deserialize data using the Avro serialization format. Avro enables you to include schema metadata in a data file, and is widely used in Hadoop (including in HDInsight) as a language-neutral means of exchanging complex data structures between operations.

For example, consider a weather monitoring application that records meteorological observations. In the application, each observation can be represented as an object with properties that contain the specific data values for the observation. These properties might be simple values such as the date, the time, the wind speed, and the temperature. However, some values might be complex structures such as the geo-coded location of the monitoring station, which contains longitude and latitude coordinates.

The following code example shows how a list of weather observations in this complex data structure can be serialized in Avro format and uploaded to Azure storage. The example is deliberately kept simple by including the credentials in the code so that you can copy and paste it while you are experimenting with HDInsight. In a production system you must protect credentials, as described in “Securing credentials in scripts and applications” in the Security section of this guide.

using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Runtime.Serialization;
using System.Configuration;

using Microsoft.Hadoop.Avro.Container;
using Microsoft.Hadoop.WebHDFS;
using Microsoft.Hadoop.WebHDFS.Adapters;


namespace AvroClient
{
  // Class representing a weather observation.
  [DataContract(Name = "Observation", Namespace = "WeatherData")]
  internal class Observation
  {
    [DataMember(Name = "obs_date")]
    public DateTime Date { get; set; }

    [DataMember(Name = "obs_time")]
    public string Time { get; set; }

    [DataMember(Name = "obs_location")]
    public GeoLocation Location { get; set; }

    [DataMember(Name = "wind_speed")]
    public double WindSpeed { get; set; }

    [DataMember(Name = "temperature")]
    public double Temperature { get; set; }
  }

  // Struct for geo-location coordinates.
  [DataContract]
  internal struct GeoLocation
  {
    [DataMember]
    public double lat { get; set; }
    [DataMember]
    public double lon { get; set; }
  }

  class Program
  { 
    static void Main(string[] args)
    {
      // Get a list of Observation objects.
      List<Observation> Observations = GetData();

      // Serialize Observation objects to a file in Avro format.
      string fileName = "observations.avro";
      string filePath = Path.Combine(Directory.GetCurrentDirectory(), fileName);
      using (var dataStream = new FileStream(filePath, FileMode.Create))
      {
        // Compress the data using the Deflate codec.
        using (var avroWriter = AvroContainer.CreateWriter<Observation>(dataStream, Codec.Deflate))
        {
          // The second argument (24) is the number of objects written per block.
          using (var seqWriter = new SequentialWriter<Observation>(avroWriter, 24))
          {
            // Serialize the data to stream using the sequential writer.
            Observations.ForEach(seqWriter.Write);
          }
        }
      }

      // Upload the serialized data. The file stream has been closed by this point.
      var hdInsightUser = "user-name";
      var storageName = "storage-account-name";
      var storageKey = "storage-account-key";
      var containerName = "container-name";
      var destFolder = "/data/";

      var hdfsClient = new WebHDFSClient(hdInsightUser,
          new BlobStorageAdapter(storageName, storageKey, containerName, false));

      hdfsClient.CreateFile(filePath, destFolder + fileName).Wait();

      Console.WriteLine("The data has been uploaded in Avro format");
      Console.WriteLine("Press a key to end");
      Console.Read();

    }

    static List<Observation> GetData()
    {
      List<Observation> Observations = new List<Observation>();

      // Code to capture a list of Observation objects.

      return Observations;
    }
  }
}

The Observation class used to represent a weather observation, and the GeoLocation struct used to represent a geographical location, are decorated with DataContract and DataMember attributes that describe the schema. For example, the Date property is serialized as a field named obs_date. This schema information is included in the serialized file that is uploaded to Azure storage, enabling an HDInsight process such as a Pig job to deserialize the data into an appropriate data structure. Notice also that the data is compressed using the Deflate codec as it is serialized, reducing the size of the file to be uploaded.
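Before uploading, it can be useful to confirm that the file can be read back. The following is a minimal sketch of one way to do this, assuming the Observation and GeoLocation types from the listing above are in the same project. The ObservationReader class and its ReadObservations method are hypothetical names introduced here for illustration and are not part of the original sample; the sketch relies on the library's AvroContainer.CreateReader and SequentialReader types, which mirror the writer types used above.

```csharp
using System.Collections.Generic;
using System.IO;

using Microsoft.Hadoop.Avro.Container;

namespace AvroClient
{
  // Hypothetical helper (not part of the original sample) that reads
  // a local Avro container file back into Observation objects.
  internal static class ObservationReader
  {
    public static List<Observation> ReadObservations(string filePath)
    {
      using (var dataStream = new FileStream(filePath, FileMode.Open, FileAccess.Read))
      {
        // The schema and codec are stored in the container file itself, so no
        // codec is specified here. Passing true leaves the stream open; the
        // enclosing using statement disposes it.
        using (var avroReader = AvroContainer.CreateReader<Observation>(dataStream, true))
        {
          using (var seqReader = new SequentialReader<Observation>(avroReader))
          {
            // Copy the objects into a list before the readers are disposed.
            return new List<Observation>(seqReader.Objects);
          }
        }
      }
    }
  }
}
```

Calling ObservationReader.ReadObservations(filePath) after the writers have been disposed should return the same number of observations that were written. This is intended only as a local check; it does not replace deserialization by an HDInsight process such as a Pig job.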
