Create a cluster with Data Lake Storage Gen2 using Azure CLI
To create an HDInsight cluster that uses Data Lake Storage Gen2 for storage, follow these steps.
Prerequisites
- If you're unfamiliar with Azure Data Lake Storage Gen2, check out the overview section.
- If you don't already have an Azure account, sign up for a free account before continuing.
- To run the CLI script examples, you have three options:
  - Use Azure Cloud Shell from the Azure portal (see next section).
  - Use the embedded Azure Cloud Shell via the "Try It" button, located in the top-right corner of each code block.
  - Install the latest version of the Azure CLI (2.0.13 or later) if you prefer to use a local CLI console. Sign in to Azure using `az login`, with an account that is associated with the Azure subscription under which you would like to deploy the user-assigned managed identity.
Azure Cloud Shell
Azure hosts Azure Cloud Shell, an interactive shell environment that you can use through your browser. You can use either Bash or PowerShell with Cloud Shell to work with Azure services. You can use the Cloud Shell preinstalled commands to run the code in this article, without having to install anything on your local environment.
To start Azure Cloud Shell:
- Select Try It in the upper-right corner of a code or command block. Selecting Try It doesn't automatically copy the code or command to Cloud Shell.
- Go to https://shell.azure.com, or select the Launch Cloud Shell button to open Cloud Shell in your browser.
- Select the Cloud Shell button on the menu bar at the upper right in the Azure portal.
To use Azure Cloud Shell:
1. Start Cloud Shell.
2. Select the Copy button on a code block (or command block) to copy the code or command.
3. Paste the code or command into the Cloud Shell session by selecting Ctrl+Shift+V on Windows and Linux, or by selecting Cmd+Shift+V on macOS.
4. Select Enter to run the code or command.
Warning
Billing for HDInsight clusters is prorated per minute, whether you use them or not. Be sure to delete your cluster after you finish using it. See how to delete an HDInsight cluster.
You can download a sample template file and a sample parameters file. Before using the template and the Azure CLI code snippet below, replace the following placeholders with their correct values:
Placeholder | Description |
---|---|
<SUBSCRIPTION_ID> | The ID of your Azure subscription. |
<RESOURCEGROUPNAME> | The resource group where you want the new cluster and storage account created. |
<MANAGEDIDENTITYNAME> | The name of the managed identity that will be given permissions on your storage account with Azure Data Lake Storage Gen2. |
<STORAGEACCOUNTNAME> | The new storage account with Azure Data Lake Storage Gen2 that will be created. |
<FILESYSTEMNAME> | The name of the filesystem that this cluster should use in the storage account. |
<CLUSTERNAME> | The name of your HDInsight cluster. |
<PASSWORD> | Your chosen password for signing in to the cluster using SSH and the Ambari dashboard. |
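Storage account names must be 3 to 24 characters long and contain only lowercase letters and digits, so it can help to check the value you plan to use for <STORAGEACCOUNTNAME> before running any commands. The following is a minimal sketch of such a check. The function name `is_valid_storage_account_name` and the sample value are hypothetical; they aren't part of the Azure CLI or of the sample template.

```shell
# Hypothetical helper (not part of the Azure CLI): returns 0 when the name
# is 3-24 characters long and uses only lowercase letters and digits.
is_valid_storage_account_name() {
  name=$1
  len=${#name}
  # Reject names that are too short or too long.
  [ "$len" -ge 3 ] && [ "$len" -le 24 ] || return 1
  # Reject any character outside lowercase letters and digits.
  case $name in
    *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
  esac
  return 0
}

# Sample value for illustration only; substitute your own name.
candidate="mydatalakestore01"
if is_valid_storage_account_name "$candidate"; then
  echo "ok: $candidate"
else
  echo "invalid storage account name: $candidate" >&2
fi
```

This check only covers the format of the name. It doesn't tell you whether the name is already taken, because storage account names must also be globally unique.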
The code snippet below does the following initial steps:
- Logs in to your Azure account.
- Sets the active subscription where the create operations will be done.
- Creates a new resource group for the new deployment activities.
- Creates a user-assigned managed identity.
- Adds an extension to the Azure CLI to use features for Data Lake Storage Gen2.
- Creates a new storage account with Data Lake Storage Gen2 by using the `--hierarchical-namespace true` flag.
az login
az account set --subscription <SUBSCRIPTION_ID>
# Create resource group
az group create --name <RESOURCEGROUPNAME> --location eastus
# Create managed identity
az identity create -g <RESOURCEGROUPNAME> -n <MANAGEDIDENTITYNAME>
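# Add the CLI extension for Data Lake Storage Gen2 features
az extension add --name storage-preview
# Create a storage account with the hierarchical namespace enabled
az storage account create --name <STORAGEACCOUNTNAME> \
    --resource-group <RESOURCEGROUPNAME> \
    --location eastus --sku Standard_LRS \
    --kind StorageV2 --hierarchical-namespace true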
Next, sign in to the portal. Add the new user-assigned managed identity to the Storage Blob Data Owner role on the storage account. This step is described in step 3 under Using the Azure portal.
Important
Ensure that the user-assigned identity has the Storage Blob Data Owner role on your storage account; otherwise, cluster creation will fail.
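The article performs the role assignment in the portal. As an alternative, the same assignment can likely be made from the CLI. The following is a minimal sketch, not the article's documented procedure: the function names `storage_account_scope` and `assign_blob_data_owner` are hypothetical, and the sketch assumes your signed-in account is allowed to create role assignments on the storage account. Nothing runs until you call the function.

```shell
# Hypothetical helper: builds the resource ID of a storage account, which
# is used as the scope of the role assignment.
# $1 = subscription ID, $2 = resource group, $3 = storage account name
storage_account_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s\n' \
    "$1" "$2" "$3"
}

# Hypothetical helper: grants Storage Blob Data Owner to the managed identity.
# $1 = subscription ID, $2 = resource group,
# $3 = managed identity name, $4 = storage account name
assign_blob_data_owner() {
  # Look up the principal (object) ID of the user-assigned managed identity.
  principal_id=$(az identity show \
    --resource-group "$2" \
    --name "$3" \
    --query principalId \
    --output tsv) || return 1

  # Assign the role at the scope of the storage account.
  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "Storage Blob Data Owner" \
    --scope "$(storage_account_scope "$1" "$2" "$4")"
}

# Usage (replace the placeholders first):
# assign_blob_data_owner "<SUBSCRIPTION_ID>" "<RESOURCEGROUPNAME>" \
#   "<MANAGEDIDENTITYNAME>" "<STORAGEACCOUNTNAME>"
```

Role assignments can take a short time to propagate, so you may need to wait before deploying the cluster.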
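After you've assigned the role to the user-assigned managed identity, deploy the template by using the following command:
az deployment group create --name HDInsightADLSGen2Deployment \
    --resource-group <RESOURCEGROUPNAME> \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json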
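Because a failed deployment can take a while to surface, you may want to check the inputs first. The sketch below confirms that both files exist and runs `az deployment group validate` before the actual deployment. The function names `require_files` and `validate_and_deploy` are hypothetical, and the file names are the ones used in the command above. Nothing runs until you call the function.

```shell
# Hypothetical helper: fails if any of the given files is missing.
require_files() {
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "missing file: $f" >&2
      return 1
    fi
  done
}

# Hypothetical helper: validates the template, then deploys it.
# $1 = resource group name
validate_and_deploy() {
  require_files hdinsight-adls-gen2-template.json parameters.json || return 1

  # Ask Azure Resource Manager to validate the template and parameters.
  az deployment group validate \
    --resource-group "$1" \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json || return 1

  # Deploy only if validation succeeded.
  az deployment group create --name HDInsightADLSGen2Deployment \
    --resource-group "$1" \
    --template-file hdinsight-adls-gen2-template.json \
    --parameters parameters.json
}

# Usage (replace the placeholder first):
# validate_and_deploy "<RESOURCEGROUPNAME>"
```

Validation catches template and parameter errors, but it doesn't guarantee that the deployment itself will succeed. For example, a missing role assignment on the storage account would still cause cluster creation to fail.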
Clean up resources
After you complete the article, you may want to delete the cluster. With HDInsight, your data is stored in Azure Storage, so you can safely delete a cluster when it isn't in use. You're also charged for an HDInsight cluster, even when it's not in use. Since the charges for the cluster are many times more than the charges for storage, it makes economic sense to delete clusters when they aren't in use.
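Enter all or some of the following commands to remove resources. The commands assume that the shell variables $clusterName, $resourceGroupName, $AZURE_STORAGE_ACCOUNT, and $AZURE_STORAGE_CONTAINER are set to the names you used earlier: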
# Remove cluster
az hdinsight delete \
--name $clusterName \
--resource-group $resourceGroupName
# Remove storage container
az storage container delete \
--account-name $AZURE_STORAGE_ACCOUNT \
--name $AZURE_STORAGE_CONTAINER
# Remove storage account
az storage account delete \
--name $AZURE_STORAGE_ACCOUNT \
--resource-group $resourceGroupName
# Remove resource group
az group delete \
--name $resourceGroupName
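The cleanup commands above rely on shell variables, and an unset variable expands to an empty string. A small guard can catch that before a delete command runs. The following is a minimal sketch; the function name `require_vars` is hypothetical.

```shell
# Hypothetical helper: fails if any of the named variables is unset or empty.
# Pass variable names without the leading "$".
require_vars() {
  for var_name in "$@"; do
    # Indirectly read the value of the variable whose name is in var_name.
    eval "var_value=\${$var_name:-}"
    if [ -z "$var_value" ]; then
      echo "variable not set: $var_name" >&2
      return 1
    fi
  done
}

# Usage: run the delete only when both variables are set.
# require_vars clusterName resourceGroupName && \
#   az hdinsight delete --name "$clusterName" --resource-group "$resourceGroupName"
```

Quoting the variables in the `az` commands, as in the usage comment, also keeps names with unusual characters from being split by the shell.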
Troubleshoot
If you run into issues with creating HDInsight clusters, see access control requirements.
Next steps
You've successfully created an HDInsight cluster. Now learn how to work with your cluster.
Apache Spark clusters
- Customize Linux-based HDInsight clusters by using script actions
- Create a standalone application using Scala
- Run jobs remotely on an Apache Spark cluster using Apache Livy
- Apache Spark with BI: Perform interactive data analysis using Spark in HDInsight with BI tools
- Apache Spark with Machine Learning: Use Spark in HDInsight to predict food inspection results