Connecting R to HDinsight through HIVE

With the powerful big data platform that Microsoft provides through Azure HDinsight, and with the wide range of data scientists and statisticians utilizes R, this Post is who to bring the best of both and connect R to HDinsight through Hive connector. so that you can analyze hive tables in R where they resides on the Azure HDinsight cluster so let’s see the steps

    1. Download Microsoft Hive ODBC driver from here
    2. Install the Microsoft Hive ODBC driver use either the x86 or the x64 ( take care of the version to use the same with R)
    3. Configure your DSN in the ODBC Data Sources
    1. go to Control Panel > Administrative Tools > ODBC Data Sources (64-bit)
    2. open System DSN click add
    3. choose Microsoft Hive ODBC Driver and click Finish
    4. enter the fields
    1. Data Source Name: the data source name we’ll use in R so name it anything i’ll call it now HiveOnAzure
    2. Description: write your desription
    3. Host: get it from your Azure Manage site [yourclustername].azurehdinsight.net
    4. port: leave it 443
    5. Database: leave it “default”
    6. Hive server type: use Hive Server 2
    7. Mechanism: Windows Azure HDinsight (it automatically configures the port and Database above)
    8. HTTP Path: leave it blank
    9. username: your username that you entered while creating the cluster
    10. password: your password that you entered while creating the cluster
    11. then test the connectivity you should receive a connection successful established message
    12.  
  1. Now after establishing the ODBC driver connectivity to Azure we’ll shift to R

  • Open RStudio (make sure to use the same x64 or x86 version as you’ve configured in the ODBC drivers)
    1. install the RODBC package
  > install.packages("RODBC")
  > library(RODBC)
      1. create the ODBC connection in R
  > myconn <- odbcConnect("HiveOnAzure",uid="[YOUR_USERNAME_HERE]",pwd="[YOU_PASSWORD_HERE]" )
      1. run your HiveQL Query and return the data into a data frame
  > alldata <- sqlFetch(myconn,"Select * from hivesampletable")
      1. inspect the retrieved data
  > head(alldata,10)

Now you’ve successfully connected your R to the Hive on HDisnight on Azure to pass your HiveQL Queries and start doing the analysis you want to create.

Comments

  • Anonymous
    September 04, 2014
    Hey thanks for the instructions!! However when i try to run hiveql query "alldata <- sqlFetch(myconn,"Select * from hivesampletable")", i get an error
    Error in odbcTableExists(channel, sqtable) :
    ‘Select * from hivesampletable’: table not found on channel

    would you know if im doing something wrong or is it with the settings or something? Please help!!
    thanks,
    Sada
  • Anonymous
    September 05, 2014
    hey i did not figure out the issue however i figured out a work around. instead of sqlFetch keyword we can use sqlquery which will the results. i guess this will still work!! but the question now is why sqlquery works but not the sqlfetch?
    thanks,
    sada
  • Anonymous
    November 28, 2015
    Thanks for sharing the details and explanations..I want more information from your side..I Am working inhttp://www.xerrow.com">ERP In Dubai