HDInsight Services For Windows

This article is the main portal for technical information about HDInsight Services for Windows and related Microsoft technologies. It provides a brief overview of Apache Hadoop, as well as information about the HDInsight Services provided by Microsoft for deployment on both Windows and Windows Azure.

It also provides links to more detailed technical content in various formats.

Note: Contributions are welcome and appreciated. Please feel free to update this and other articles on this Wiki, and to add links to relevant content from both within and outside Microsoft.

Table of Contents

  Topics

  Content Types

Hadoop Overview

Orientation

Learning Apache Hadoop

Tutorials

Getting Started with HDInsight Services on Windows

Tutorials

Getting Started with HDInsight Services for Windows Azure 

Tutorials

Samples on the HDInsight Services for Windows Azure Dashboard

Samples

Developing with Hadoop

Tutorials

Using HDInsight Services with other BI Technologies

HowTos

How To

 HowTos

Code Examples

 Samples

Videos

 Videos

Audio

 Audio

Books

 Books

Hadoop on Windows and on Windows Azure Best Practices

 Guidance

 

Hadoop Overview

Apache Hadoop is an open-source software framework for the distributed processing of large data sets across clusters of computers using a simple programming model. It consists of two primary components: the Hadoop Distributed File System (HDFS), a reliable, distributed data store, and MapReduce, a parallel, distributed processing system. A Hadoop cluster can consist of a single node or of thousands of nodes.

HDFS is the primary distributed storage used by Hadoop applications. As you load data into a Hadoop cluster, HDFS splits the data into blocks, creates multiple replicas of each block, and distributes the replicas across the nodes of the cluster. This makes storage reliable and enables extremely rapid computations.
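As a rough illustration of the splitting and replication described above, the following Python sketch cuts a byte string into fixed-size blocks and assigns each block to several nodes. The tiny block size, the node names, and the round-robin placement are assumptions made for this example only; real HDFS uses much larger blocks and its own placement policy.

```python
from typing import Dict, List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes (round-robin, illustrative only)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        # Start at a different node for each block so replicas spread over the cluster.
        placement[block_id] = [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
    return placement


# Hypothetical four-node cluster and a ten-byte "file".
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"], replication=3)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block available.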

Hadoop MapReduce is a software framework for writing applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. A MapReduce job usually splits the input data set into independent chunks. These chunks are processed by the map tasks running across the nodes of the Hadoop cluster in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
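The flow described above (independent chunks, map tasks, a sort of the map outputs, then reduce tasks) can be sketched in a single Python process. This is a minimal word-count illustration, not Hadoop code: the function names are invented for the example, and everything runs sequentially rather than on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(chunk: str) -> List[Tuple[str, int]]:
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_and_sort(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, List[int]]]:
    """Group map outputs by key and sort by key, standing in for the framework's sort step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce task: sum the counts collected for one word."""
    return key, sum(values)


def run_job(chunks: List[str]) -> Dict[str, int]:
    # Each chunk is independent, so on a cluster the map calls could run in parallel.
    map_outputs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(key, values) for key, values in shuffle_and_sort(map_outputs))


counts = run_job(["the quick brown fox", "the lazy dog", "The end"])
```

Here `counts["the"]` is 3, because the word appears once in each of the three chunks and the reduce step sums the per-chunk ones.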

Some of the main advantages of Hadoop are that it can process vast amounts of data (hundreds of terabytes or even petabytes) quickly and efficiently, process both structured and unstructured data, perform the processing where the data is located rather than moving the data to a separate processing location, and detect and handle failures by design.

Two other key Apache technologies are frequently used with Hadoop: Hive and Pig. Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad hoc queries, and the analysis of large data sets stored in Hadoop-compatible file systems such as HDFS. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At the same time, this language allows MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

For more details on Apache Hadoop, see http://hadoop.apache.org/.

Learning Apache Hadoop

This section contains links to resources useful in learning Hadoop, such as installation, configuration, and basic how-to information.

  Link

  Description

Apache Hadoop

The Apache Hadoop home page

Introduction to Apache MapReduce and HDFS [Video]

An introduction to Apache MapReduce and HDFS

Hive

A data warehouse system for Hadoop

Introduction to Apache Hive [Video]

An introduction to Apache Hive

Pig

A platform for analyzing large data sets

Introduction to Pig [Video]

An introduction to Apache Pig

Learning resources for Apache Mahout

An introduction to Apache Mahout

Mahout

A scalable machine learning library

How to Contribute

How to Contribute to Hadoop Common

Getting Started with HDInsight Services for Windows

The links in this section provide information on deploying and using the Developer Preview of HDInsight Services on Windows.

       

  Link

  Description

Installing the Developer Preview of HDInsight Services on Windows

How to install the Developer Preview of Hadoop on Windows with the Microsoft Web Platform Installer 4.0.

Getting Started with HDInsight Services for Windows

Tour through the Microsoft HDInsight dashboard and resources for getting started with the developer preview.

       

Getting Started with HDInsight Services for Windows Azure

The links in this section provide information on deploying and using Apache Hadoop on the Microsoft Windows Azure Platform. Instead of setting up and managing a Hadoop cluster on Azure yourself, you can use the HDInsight Services for Windows Azure dashboard that Microsoft has made available at hadooponazure.com. This is a preview of HDInsight Services for Windows Azure, to which you submit MapReduce jobs together with the data they process. It lets you process vast amounts of structured and unstructured data without having to set up, configure, maintain, and manage a Hadoop cluster manually.

  Link

  Description

Deployment of Hadoop-based Services on the Windows Azure Portal

 A walkthrough for provisioning and using a temporary HDFS cluster on the Hadoop on Windows Azure Portal.

Introduction to HDInsight Services for Windows Azure

A service that deploys and provisions clusters in the cloud, providing a software framework designed to manage, analyze and report on big data.

HDInsight Services for Windows Azure QuickStart: Running Hadoop Jobs

This tutorial shows two ways to run MapReduce programs in a cluster by using Apache™ Hadoop™-based Services for Windows Azure.

Working With Data in HDInsight Services for Windows Azure

Outlines several techniques for importing and storing data for use in Hadoop jobs run with Hadoop-based Services for Windows Azure.

Analyzing Twitter Movie Data with Hive in HDInsight Services for Windows Azure

In this tutorial you will query, explore, and analyze data from Twitter using Apache™ Hadoop™-based Services for Windows Azure and a Hive query in Excel. Social web sites are one of the major driving forces for Big Data adoption.

Simple recommendation engine using Apache Mahout

In this tutorial you use the Million Song Dataset to create song recommendations for users based on their past listening habits.

A Lap Around HDInsight

An end-to-end introduction to HDInsight, Map/Reduce, Pig, and Hive.

 

Samples on the HDInsight Services for Windows Azure Dashboard

This section contains links to the tutorials for the samples that are on the Hadoop on Windows Azure Portal.

  Link

Description

The Hadoop on Azure Pi Estimator Sample Tutorial 

This tutorial shows how to deploy a MapReduce program with Hadoop on Windows Azure that uses a statistical (quasi-Monte Carlo) method to estimate the value of Pi.

The Hadoop on Azure 10-GB Graysort Sample Tutorial

This tutorial shows how to run a general purpose GraySort on a 10 GB file using Hadoop on Windows Azure.

The Hadoop on Azure C# Streaming Sample Tutorial 

This tutorial shows how to use C# programs with the Hadoop streaming interface.

The Hadoop on Azure Mahout Classification Sample

This tutorial illustrates how to use Apache Mahout in Hadoop on Windows Azure to do classification.

The Hadoop on Azure Mahout Clustering Sample

This tutorial illustrates how to use Hadoop on Windows Azure to do cluster analysis with Mahout.

The Hadoop on Azure Pegasus Degree Distribution Sample Tutorial

This tutorial shows how to deploy Pegasus from the Hadoop on Windows Azure portal to compute the degree of each node and the distribution of degrees for a simple 16-node graph.

The Hadoop on Azure Pegasus Page Rank Sample Tutorial 

This tutorial shows how to deploy Pegasus from the Hadoop on Windows Azure portal to compute the page rank for a simple 16-node graph.

The Hadoop on Azure Sqoop Import Sample Tutorial

This tutorial shows how to use Sqoop to import data from a SQL database on Windows Azure to a Hadoop on Windows Azure HDFS cluster.

The Hadoop on Azure Wordcount Sample Tutorial

This tutorial shows two ways to use Hadoop on Windows Azure to run a MapReduce program that counts word occurrences in a text.
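To give a feel for the quasi-Monte Carlo method mentioned for the Pi Estimator sample above, here is a minimal Python sketch. It is not the sample's code: it assumes a Halton sequence as the quasi-random point source, and the function names and the split into four "map" ranges of 2,500 points each are choices made for this illustration.

```python
import math
from typing import List, Tuple


def halton(index: int, base: int) -> float:
    """Return element `index` of the Halton sequence in the given base (a value in [0, 1))."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        result += (index % base) * fraction
        index //= base
        fraction /= base
    return result


def map_count_inside(start: int, count: int) -> Tuple[int, int]:
    """Map step: count how many of `count` points fall inside the circle inscribed in the unit square."""
    inside = 0
    for i in range(start, start + count):
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    return inside, count


def reduce_estimate(partials: List[Tuple[int, int]]) -> float:
    """Reduce step: combine partial counts; the circle-to-square area ratio is pi / 4."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


# Four independent "map tasks", each covering its own range of point indexes.
partials = [map_count_inside(1 + n * 2500, 2500) for n in range(4)]
pi_estimate = reduce_estimate(partials)
```

Each map range is independent of the others, which is what makes this kind of estimate easy to spread across the nodes of a cluster.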

 

Developing with Hadoop

This section contains information on developing solutions using Hadoop.

  Link

  Description

Yahoo! Hadoop tutorial

A tutorial on using Hadoop 0.18.0

Map Reduce example

A tutorial on using Map/Reduce

Hadoop Streaming

Hadoop Wiki page on the Streaming utility
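Hadoop Streaming lets a mapper and a reducer be ordinary programs that read lines from standard input and write lines to standard output, with a tab separating key and value by default. The Python sketch below shows that line-oriented contract for a word count. It is only a local simulation: the function names are invented for the example, and `sorted()` stands in for the sort that the framework performs between the two steps.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; the input lines must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two input lines.
mapped = sorted(mapper(["a b a", "b a"]))
reduced = list(reducer(mapped))
```

In an actual streaming job, the mapper and reducer would typically be two separate scripts that iterate over `sys.stdin` and print their output lines.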

 

Using HDInsight Services with other BI Technologies

This section contains information on using Hadoop with other BI technologies.

  Link

  Description

How to Connect Excel to Hadoop on Azure via HiveODBC

Explains how to use Excel 2010 to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.

How to Connect Excel PowerPivot to Hive on Azure via HiveODBC

Explains how to use PowerPivot to access data in the Hive data warehouse running on Windows Azure by using the Hive ODBC Driver.

Leveraging a Hadoop cluster from SQL Server Integration Services (SSIS)

With the explosion of data, the open-source Apache™ Hadoop™ Framework is gaining traction thanks to the huge ecosystem that has grown up around its core components, the Hadoop Distributed File System (HDFS™) and Hadoop MapReduce. Having SQL Server work with Hadoop™ is becoming increasingly important because the two are complementary. For instance, while petabytes of unstructured data can be stored in Hadoop and take hours to query, terabytes of data can be stored in a structured way in the SQL Server platform and queried in seconds. This leads to the need to transfer data between Hadoop and SQL Server.

 

How To

This section contains a list of Hadoop-related how-to articles.

  Link

  Description

Hadoop-based Services on Windows Azure How-Tos and FAQs 

A collection of common How To topics along with FAQs. 

How to Contribute

How to Contribute to Hadoop Common

How to count the number of lines in a file

An example of counting the number of lines in a file using Map Reduce

How to get distinct values

An example of getting distinct values/lines using Map Reduce

Avkash Chauhan's Blog 

Information related to Hadoop-based services on Windows Azure.

How to Run a Job on a Provisioned Hadoop on Windows Azure Cluster 

Information about creating Map Reduce jobs on a cluster that has been provisioned on the Hadoop on Windows Azure Portal

Use SQL Azure database as a Hive metastore

Information about using SQL Azure database as a Hive metastore

 

Code Examples

This section contains a list of Hadoop-related examples.

  Link

  Description

Yahoo! Hadoop tutorial

A tutorial on using Hadoop 0.18.0

Map Reduce example

A tutorial on using Map/Reduce

How to count the number of lines in a file

An example of counting the number of lines in a file using Map Reduce

How to get distinct values

An example of getting distinct values/lines using Map Reduce
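The line-count and distinct-values examples listed above follow two common MapReduce patterns. The Python sketch below is not taken from the linked articles; `run_mapreduce` is a hypothetical single-process helper that only illustrates how the choice of key determines the result.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply mapper to each record, group values by key, then reduce each group in key order."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


lines = ["apple", "pear", "apple", "fig"]

# Line count: every line maps to the same key, and the reducer sums the ones.
line_count = run_mapreduce(
    lines,
    mapper=lambda line: [("lines", 1)],
    reducer=lambda key, values: sum(values),
)[0]

# Distinct values: the line itself is the key, and the reducer ignores the values.
distinct = run_mapreduce(
    lines,
    mapper=lambda line: [(line, None)],
    reducer=lambda key, values: key,
)
```

Counting uses one shared key so that all values meet in a single reduce call, while the distinct pattern relies on the grouping step to collapse duplicate keys.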

 

Videos

This section contains a list of Hadoop-related videos.

  Link

  Description

Introduction to Interactive JavaScript Console

Learn how to use the JS console with your Hadoop cluster.

Introduction to Interactive Hive Console

Learn how to use the Hive console with your Hadoop cluster.

Use Excel Hive Add-in to Access Hive on Windows Azure

Use the Add-in to import data from Hive on Windows Azure.

Use PowerPivot to Access Hive on Windows Azure 

Use Excel PowerPivot to access data from Hive on Windows Azure. 

Introduction to Apache Hive

An introduction to Apache Hive

Introduction to Pig

An introduction to Apache Pig

Uploading Data and the WordCount Sample

Upload data to an Azure cluster and then run the WordCount sample

Pi Sample

Run the Pi Estimator Sample

Import from Azure Marketplace

Import data from Marketplace into Hadoop Services for Windows Azure

10GB GraySort Sample - Generate Data

 Introduction to the GraySort benchmark and generating test data

10GB GraySort Sample - Sort Data

Running the MapReduce job to sort your data

10GB GraySort Sample - Validate Data

After sorting the data, validate that the operation worked

PowerView, PowerPivot, Hadoop, and Hive

Use PowerView to connect to a Hive sample table in PowerPivot

 

Audio

This section contains a list of Hadoop-related audio recordings.

  Link

  Description

.NET Rocks (podcast) episode discussing Hadoop on Azure

.NET Rocks episode 755 (March 2012) with general discussion of Hadoop on Azure.

Books

This section contains a list of Hadoop-related books.

  Link

  Description

Hadoop: The Definitive Guide, 3rd Edition by Tom White (May 26, 2012)

A comprehensive guide to build and maintain reliable, scalable, distributed systems with Apache Hadoop.

 

Hadoop on Windows and on Windows Azure Best Practices

Microsoft is planning on providing guidance on best practices in the future. If you have best practices guidance that you'd like to share, please feel free to provide a link to it here.

Some suggestions for best practices that would be useful to list here:

  1. How to get big data sets into Windows Azure.
  2. How the costs work, so that the process can be cost-optimized.

See Also

Another important place to find a large number of articles related to the Cortana Intelligence Suite is the TechNet Wiki itself. The best entry point is Cortana Intelligence Suite Resources on the TechNet Wiki.