Using R to perform FileSystem Operations on Azure Data Lake Store

In this article, you will learn how to use the WebHDFS REST APIs from R to perform filesystem operations on Azure Data Lake Store (ADLS). We will perform the following six filesystem operations on ADLS, using the httr package for the REST calls:

  1. Create folders
  2. List folders
  3. Upload data
  4. Read data
  5. Rename a file
  6. Delete a file

Prerequisites

1. An Azure subscription. See Get Azure free trial.
2. Create an Azure Data Lake Store Account using the following guide.
3. Create an Azure Active Directory Application. You use the Azure AD application to authenticate the Data Lake Store application with Azure AD. There are two approaches to authenticating with Azure AD: end-user authentication and service-to-service authentication. For instructions and more information on how to authenticate, see Authenticate with Data Lake Store using Azure Active Directory.
4. Obtain an authorization token from the Azure AD application created above. There are several ways to obtain this token: using the REST API, the Azure Active Directory Code Samples, or ADAL for Python. This authorization token is sent in the header of every request. The token can also be obtained from R with a POST request to the Azure AD token endpoint, supplying your client id, client secret, and tenant id.

5. Microsoft R Open (OR) Microsoft R Client (OR) Microsoft R Server.
6. An integrated development environment such as RStudio or RTVS.
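
The token request from step 4 can be sketched in R as follows. This is a minimal sketch that assumes service-to-service (client credentials) authentication against the Azure AD token endpoint; get_token is a hypothetical helper name, and the three placeholder values must be replaced with your own.

```r
# Placeholders: replace with the values of your Azure AD application.
tenant_id     <- "<tenant id>"
client_id     <- "<client id>"
client_secret <- "<client secret>"

# Azure AD token endpoint for the tenant.
token_url <- paste0("https://login.microsoftonline.com/", tenant_id, "/oauth2/token")

# Hypothetical helper: requests a token and returns the access token string.
get_token <- function() {
  res <- httr::POST(
    token_url,
    body = list(
      grant_type    = "client_credentials",
      resource      = "https://management.core.windows.net/",
      client_id     = client_id,
      client_secret = client_secret
    ),
    encode = "form"
  )
  httr::stop_for_status(res)        # fail loudly on a non-2xx response
  httr::content(res)$access_token   # parsed JSON field holding the token
}
```

Once the placeholders are filled in, token <- get_token() returns the string to use in place of <AD AUTH TOKEN>.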

Install the required package “httr” along with its dependencies.

 install.packages("httr", dependencies = TRUE)

In all the code snippets provided below,

  1. Replace <AD AUTH TOKEN>  with the authorization token you obtained in step 4 of Prerequisites.
  2. Replace <yourstorename> with the Data Lake Store name that you created in step 2 of Prerequisites. 

Create Folders

Create a directory called mytempdir under the root folder of your Data Lake Store account.
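
A minimal sketch of this call, assuming the WebHDFS MKDIRS operation; create_folder is a hypothetical helper name, and the request is wrapped in a function so that nothing is sent until you call it.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

# WebHDFS endpoint for creating mytempdir under the root folder.
mkdir_url <- paste0("https://", store,
                    ".azuredatalakestore.net/webhdfs/v1/mytempdir/?op=MKDIRS")

# Hypothetical helper: sends the PUT request and returns the parsed response.
create_folder <- function() {
  r <- httr::PUT(mkdir_url,
                 httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::content(r)
}
```

Call create_folder() after replacing the two placeholders.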

If the operation completes successfully, the service responds with a JSON body of {"boolean":true}.

List Folders

List all the folders under the root folder of your Data Lake Store account.
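
A minimal sketch of the listing call, assuming the WebHDFS LISTSTATUS operation; list_folders is a hypothetical helper name.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

# WebHDFS endpoint for listing the contents of the root folder.
list_url <- paste0("https://", store,
                   ".azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS")

# Hypothetical helper: sends the GET request and returns the parsed listing.
list_folders <- function() {
  r <- httr::GET(list_url,
                 httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::content(r)
}
```

Call list_folders() after replacing the two placeholders; the parsed result is a nested list mirroring the JSON returned by the service.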

If the operation completes successfully, the response is a JSON FileStatuses object containing one FileStatus entry for each item under the root folder, including mytempdir.

Upload Data

Upload a file to a directory in your Data Lake Store account. In this example, we save the iris data frame to a .csv file and upload that iris.csv file to mytempdir.
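
A minimal sketch of the upload, assuming the WebHDFS CREATE operation with the overwrite=true and write=true query parameters (the write=true parameter is an assumption about the ADLS endpoint); upload_iris is a hypothetical helper name.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

# WebHDFS endpoint for creating mytempdir/iris.csv.
upload_url <- paste0("https://", store,
                     ".azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv",
                     "?op=CREATE&overwrite=true&write=true")

# Hypothetical helper: writes iris to a local CSV file, uploads it,
# and returns the HTTP status code of the response.
upload_iris <- function(local_file = "iris.csv") {
  write.csv(iris, local_file, row.names = FALSE)
  r <- httr::PUT(upload_url,
                 body = httr::upload_file(local_file),
                 httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::status_code(r)
}
```

Call upload_iris() after replacing the two placeholders.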

If the operation completes successfully, the request returns HTTP status 201 (Created).

We can also upload R data frames directly to Data Lake by converting the data frame to a CSV string using textConnection() and passing that string as the "body" parameter. For example, the iris data frame can be uploaded this way without first saving it as a CSV file.
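
A minimal sketch of this approach. It assumes the same CREATE endpoint as the file upload; df_to_csv_string and upload_df are hypothetical helper names.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

upload_url <- paste0("https://", store,
                     ".azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv",
                     "?op=CREATE&overwrite=true&write=true")

# Hypothetical helper: serialises a data frame to a single CSV string
# through an in-memory text connection instead of a file on disk.
df_to_csv_string <- function(df) {
  tc <- textConnection(NULL, open = "w")
  write.csv(df, tc, row.names = FALSE)
  csv_lines <- textConnectionValue(tc)   # one element per CSV line
  close(tc)
  paste(csv_lines, collapse = "\n")
}

# Hypothetical helper: uploads the CSV string as the request body.
upload_df <- function(df) {
  r <- httr::PUT(upload_url,
                 body = df_to_csv_string(df),
                 httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::status_code(r)
}
```

Call upload_df(iris) after replacing the two placeholders.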

If the operation completes successfully, the request returns HTTP status 201 (Created).

Read Data

Let us read the iris.csv file uploaded in the previous step into an R data frame, irisDownloaded.
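
A minimal sketch of the read, assuming the WebHDFS OPEN operation with the read=true query parameter; read_iris is a hypothetical helper name.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

# WebHDFS endpoint for reading mytempdir/iris.csv.
read_url <- paste0("https://", store,
                   ".azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv",
                   "?op=OPEN&read=true")

# Hypothetical helper: downloads the file and parses it into a data frame.
read_iris <- function() {
  r <- httr::GET(read_url,
                 httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::stop_for_status(r)
  read.csv(text = httr::content(r, as = "text", encoding = "UTF-8"))
}
```

After replacing the two placeholders, irisDownloaded <- read_iris() yields the data frame.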

If the operation completes successfully, the response body contains the contents of iris.csv, which can then be parsed into the data frame.


To read large files from ADLS, pass write_disk() to the httr::GET function so that the response is saved directly to disk without being loaded into memory.
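
A minimal sketch of a disk-backed download, assuming the same OPEN endpoint as above; download_to_disk and the local file name are hypothetical.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

read_url <- paste0("https://", store,
                   ".azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv",
                   "?op=OPEN&read=true")

# Hypothetical helper: streams the response body straight to a local file
# and returns the HTTP status code.
download_to_disk <- function(dest = "iris_local.csv") {
  r <- httr::GET(read_url,
                 httr::add_headers(Authorization = paste0("Bearer ", token)),
                 httr::write_disk(dest, overwrite = TRUE))
  httr::status_code(r)
}
```

Call download_to_disk() after replacing the two placeholders; the file can then be read in pieces with the usual file-based tools.
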

Rename a File

Let's rename the file iris.csv to iris2.csv
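
A minimal sketch of the rename, assuming the WebHDFS RENAME operation with a destination query parameter; rename_file is a hypothetical helper name.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

# WebHDFS endpoint for renaming mytempdir/iris.csv to mytempdir/iris2.csv.
rename_url <- paste0("https://", store,
                     ".azuredatalakestore.net/webhdfs/v1/mytempdir/iris.csv",
                     "?op=RENAME&destination=/mytempdir/iris2.csv")

# Hypothetical helper: sends the PUT request and returns the parsed response.
rename_file <- function() {
  r <- httr::PUT(rename_url,
                 httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::content(r)
}
```

Call rename_file() after replacing the two placeholders.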

If the operation completes successfully, the service responds with a JSON body of {"boolean":true}.

Delete a File

Let's delete the file iris2.csv
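
A minimal sketch of the delete, assuming the WebHDFS DELETE operation; delete_file is a hypothetical helper name.

```r
store <- "<yourstorename>"   # placeholder: your Data Lake Store name
token <- "<AD AUTH TOKEN>"   # placeholder: token from step 4 of Prerequisites

# WebHDFS endpoint for deleting mytempdir/iris2.csv.
delete_url <- paste0("https://", store,
                     ".azuredatalakestore.net/webhdfs/v1/mytempdir/iris2.csv",
                     "?op=DELETE")

# Hypothetical helper: sends the DELETE request and returns the parsed response.
delete_file <- function() {
  r <- httr::DELETE(delete_url,
                    httr::add_headers(Authorization = paste0("Bearer ", token)))
  httr::content(r)
}
```

Call delete_file() after replacing the two placeholders.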

If the operation completes successfully, the service responds with a JSON body of {"boolean":true}.

REFERENCES

Get started with Azure Data Lake Store using REST APIs

Package httr

Comments

  • Anonymous
    July 28, 2017
    I tried this code. It works for text files, but it doesn't work on non-text files like png/jpg/pdf etc. Can you please help?
    • Anonymous
      July 28, 2017
      Hi Attrana, it works fine for me with png/jpg/pdf. Can you provide the code that you are using?
  • Anonymous
    August 18, 2017
    Hi Ramkumar. Doesn't work for me when I read my .csv file from data lake:
    taxi <- httr::GET("https://mydatalake.azuredatalakestore.net/webhdfs/v1/TAXI/taxi.csv?op=OPEN&read=true", add_headers(Authorization = paste("Bearer ", token)))
    content(taxi)
    $error
    $error$code
    [1] "AuthenticationFailed"
    $error$message
    [1] "The format of the access token in the 'Authorization' header is not supported or malformed. _____
    Could you please help? Thanks.
    • Anonymous
      August 18, 2017
      paste() function in R introduces extra space between Bearer and Token. Use paste0("Bearer ", token)
      • Anonymous
        August 19, 2017
        Thank you, I modified my code a bit, paste(res$token_type, res$access_token), and it seems to be working. I say "seems" because I've got another error :)
        $RemoteException
        $RemoteException$exception
        [1] "AccessControlException"
        $RemoteException$message
        [1] "OPEN failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.).
        I think that this time there's something wrong on Azure side. Right?
        • Anonymous
          August 19, 2017
          The comment has been removed
          • Anonymous
            August 21, 2017
            Thank you.
          • Anonymous
            August 23, 2017
            Ramukumar, thanks a lot. I granted right permissions and now it works.
          • Anonymous
            August 23, 2017
            Ramkumar. :)