Use a service principal with the Spark 3 connector for Azure Cosmos DB for NoSQL

In this article, you learn how to create a Microsoft Entra application and service principal that can be used with role-based access control. You can then use this service principal to connect to an Azure Cosmos DB for NoSQL account from Spark 3.

Prerequisites

Create a secret and record credentials

In this section, you create a client secret and record the value for use later.

  1. Open the Azure portal.

  2. Go to your existing Microsoft Entra application.

  3. Go to the Certificates & secrets page. Then, create a new secret. Save the Client Secret value to use later in this article.

  4. Go to the Overview page. Locate and record the values for Application (client) ID, Object ID, and Directory (tenant) ID. You also use these values later in this article.

  5. Go to your existing Azure Cosmos DB for NoSQL account.

  6. Record the URI value on the Overview page. Also record the Subscription ID and Resource Group values. You use these values later in this article.

Create a definition and an assignment

In this section, you create a Microsoft Entra ID role definition. Then you assign that role with permissions to read and write items in the containers.

  1. Create a role by using the az role definition create command. Pass in the Azure Cosmos DB for NoSQL account name and resource group, followed by a body of JSON that defines the custom role. The role is also scoped to the account level by using /. Ensure that you provide a unique name for your role by using the RoleName property of the request body.

    az cosmosdb sql role definition create \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>" \
        --body '{
            "RoleName": "<role-definition-name>",
            "Type": "CustomRole",
            "AssignableScopes": ["/"],
            "Permissions": [{
                "DataActions": [
                    "Microsoft.DocumentDB/databaseAccounts/readMetadata",
                    "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
                    "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
                ]
            }]
        }'
    
  2. List the role definition you created to fetch its unique identifier in the JSON output. Record the id value of the JSON output.

    az cosmosdb sql role definition list \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>"
    
    [
      {
        ...,
        "id": "/subscriptions/<subscription-id>/resourceGroups/<resource-grou-name>/providers/Microsoft.DocumentDB/databaseAccounts/<account-name>/sqlRoleDefinitions/<role-definition-id>",
        ...
        "permissions": [
          {
            "dataActions": [
              "Microsoft.DocumentDB/databaseAccounts/readMetadata",
              "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/items/*",
              "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers/*"
            ],
            "notDataActions": []
          }
        ],
        ...
      }
    ]
    
  3. Use az cosmosdb sql role assignment create to create a role assignment. Replace <aad-principal-id> with the Object ID you recorded earlier in this article. Also, replace <role-definition-id> with the id value fetched from running the az cosmosdb sql role definition list command in a previous step.

    az cosmosdb sql role assignment create \
        --resource-group "<resource-group-name>" \
        --account-name "<account-name>" \
        --scope "/" \
        --principal-id "<account-name>" \
        --role-definition-id "<role-definition-id>"
    

Use a service principal

Now that you've created a Microsoft Entra application and service principal, created a custom role, and assigned that role permissions to your Azure Cosmos DB for NoSQL account, you should be able to run a notebook.

  1. Open your Azure Databricks workspace.

  2. In the workspace interface, create a new cluster. Configure the cluster with these settings, at a minimum:

    Version Value
    Runtime version 13.3 LTS (Scala 2.12, Spark 3.4.1)
  3. Use the workspace interface to search for Maven packages from Maven Central with a Group ID of com.azure.cosmos.spark. Install the package specifically for Spark 3.4 with an Artifact ID prefixed with azure-cosmos-spark_3-4 to the cluster.

  4. Finally, create a new notebook.

    Tip

    By default, the notebook is attached to the recently created cluster.

  5. Within the notebook, set Azure Cosmos DB Spark connector configuration settings for the NoSQL account endpoint, database name, and container name. Use the Subscription ID, Resource Group, Application (client) ID, Directory (tenant) ID, and Client Secret values recorded earlier in this article.

    # Set configuration settings
    config = {
      "spark.cosmos.accountEndpoint": "<nosql-account-endpoint>",
      "spark.cosmos.auth.type": "ServicePrincipal",
      "spark.cosmos.account.subscriptionId": "<subscription-id>",
      "spark.cosmos.account.resourceGroupName": "<resource-group-name>",
      "spark.cosmos.account.tenantId": "<entra-tenant-id>",
      "spark.cosmos.auth.aad.clientId": "<entra-app-client-id>",
      "spark.cosmos.auth.aad.clientSecret": "<entra-app-client-secret>",
      "spark.cosmos.database": "<database-name>",
      "spark.cosmos.container": "<container-name>"        
    }    
    
    // Set configuration settings
    val config = Map(
      "spark.cosmos.accountEndpoint" -> "<nosql-account-endpoint>",
      "spark.cosmos.auth.type" -> "ServicePrincipal",
      "spark.cosmos.account.subscriptionId" -> "<subscription-id>",
      "spark.cosmos.account.resourceGroupName" -> "<resource-group-name>",
      "spark.cosmos.account.tenantId" -> "<entra-tenant-id>",
      "spark.cosmos.auth.aad.clientId" -> "<entra-app-client-id>",
      "spark.cosmos.auth.aad.clientSecret" -> "<entra-app-client-secret>",
      "spark.cosmos.database" -> "<database-name>",
      "spark.cosmos.container" -> "<container-name>" 
    )
    
  6. Configure the Catalog API to manage API for NoSQL resources by using Spark.

    # Configure Catalog Api
    spark.conf.set("spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
    spark.conf.set("spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
    
    // Configure Catalog Api
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog", "com.azure.cosmos.spark.CosmosCatalog")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.accountEndpoint", "<nosql-account-endpoint>")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.type", "ServicePrincipal")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.account.subscriptionId", "<subscription-id>")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.account.resourceGroupName", "<resource-group-name>")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.account.tenantId", "<entra-tenant-id>")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientId", "<entra-app-client-id>")
    spark.conf.set(s"spark.sql.catalog.cosmosCatalog.spark.cosmos.auth.aad.clientSecret", "<entra-app-client-secret>")
    
  7. Create a new database by using CREATE DATABASE IF NOT EXISTS. Ensure that you provide your database name.

    # Create a database using the Catalog API
    spark.sql("CREATE DATABASE IF NOT EXISTS cosmosCatalog.{};".format("<database-name>"))
    
    // Create a database using the Catalog API
    spark.sql(s"CREATE DATABASE IF NOT EXISTS cosmosCatalog.<database-name>;")
    
  8. Create a new container by using the database name, container name, partition key path, and throughput values that you specify.

    # Create a products container using the Catalog API
    spark.sql("CREATE TABLE IF NOT EXISTS cosmosCatalog.{}.{} USING cosmos.oltp TBLPROPERTIES(partitionKeyPath = '{}', manualThroughput = '{}')".format("<database-name>", "<container-name>", "<partition-key-path>", "<throughput>"))
    
    // Create a products container using the Catalog API
    spark.sql(s"CREATE TABLE IF NOT EXISTS cosmosCatalog.<database-name>.<container-name> using cosmos.oltp TBLPROPERTIES(partitionKeyPath = '<partition-key-path>', manualThroughput = '<throughput>')")
    
  9. Create a sample dataset.

    # Create sample data    
    products = (
      ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, False),
      ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, True)
    )
    
    // Create sample data
    val products = Seq(
      ("68719518391", "gear-surf-surfboards", "Yamba Surfboard", 12, 850.00, false),
      ("68719518371", "gear-surf-surfboards", "Kiama Classic Surfboard", 25, 790.00, true)
    )
    
  10. Use spark.createDataFrame and the previously saved online transaction processing (OLTP) configuration to add sample data to the target container.

    # Ingest sample data    
    spark.createDataFrame(products) \
      .toDF("id", "category", "name", "quantity", "price", "clearance") \
      .write \
      .format("cosmos.oltp") \
      .options(config) \
      .mode("APPEND") \
      .save()
    
    // Ingest sample data
    spark.createDataFrame(products)
      .toDF("id", "category", "name", "quantity", "price", "clearance")
      .write
      .format("cosmos.oltp")
      .options(config)
      .mode("APPEND")
      .save()
    

    Tip

    In this quickstart example, credentials are assigned to variables in clear text. For security, we recommend that you use secrets. For more information on how to configure secrets, see Add secrets to your Spark configuration.