Integrating Databricks notebooks in Azure ML using SDK V2

Alexander, 2024-06-07

Hi all,

We currently have some Azure Databricks notebooks in production which we would like to integrate into Azure ML using the v2 SDK.

I found resources for integrating these notebooks with the databricks_step in the v1 SDK. The official documentation mentions that no equivalent component is provided in v2: https://video2.skills-academy.com/en-us/azure/machine-learning/migrate-to-v2-execution-pipeline?view=azureml-api-2

Does anyone have advice on how to integrate Azure Databricks notebooks with ease? I would prefer not to have to use the REST API.


1 answer

  1. Amira Bedhiafi, 2024-06-08

    Using AML Pipeline with PythonScriptStep

    1. Install Required Libraries: Make sure you have the necessary libraries installed:
      
         pip install azureml-core azureml-pipeline-steps databricks-api
      
      
    2. Set Up Authentication: Authenticate to both Azure ML and Databricks:

         from azureml.core import Workspace
         from databricks_api import DatabricksAPI

         # Azure ML authentication
         ws = Workspace.from_config()

         # Databricks authentication
         DATABRICKS_URL = 'https://<your-databricks-instance>'
         DATABRICKS_TOKEN = '<your-databricks-token>'
         db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)

    3. Create a Python Script to Trigger Databricks Notebooks: Create a Python script (run_databricks.py) that triggers the execution of your Databricks notebook. Note that runs/submit needs a cluster (an existing cluster ID or a new-cluster spec) and returns as soon as the run is created; if the step should wait for the notebook to finish, see the polling sketch after this list.

         import sys
         from databricks_api import DatabricksAPI

         DATABRICKS_URL = 'https://<your-databricks-instance>'
         DATABRICKS_TOKEN = '<your-databricks-token>'
         db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)

         def run_databricks_notebook(notebook_path):
             # submit_run returns the API response; the run ID is in the "run_id" field
             response = db.jobs.submit_run(
                 existing_cluster_id='<your-cluster-id>',
                 notebook_task={"notebook_path": notebook_path}
             )
             return response["run_id"]

         if __name__ == "__main__":
             notebook_path = sys.argv[1]
             run_id = run_databricks_notebook(notebook_path)
             print(f"Triggered Databricks notebook run with ID: {run_id}")

    4. Define the AML Pipeline: Define a pipeline in Azure ML that uses PythonScriptStep to run the Databricks notebook.

         from azureml.core import Experiment
         from azureml.pipeline.core import Pipeline
         from azureml.pipeline.steps import PythonScriptStep

         notebook_path = '/Users/<your-user>/notebook'

         step = PythonScriptStep(
             script_name="run_databricks.py",
             arguments=[notebook_path],
             compute_target='your-compute-cluster',
             source_directory='.',
             allow_reuse=False
         )

         pipeline = Pipeline(workspace=ws, steps=[step])
         experiment = Experiment(workspace=ws, name='databricks-integration')
         pipeline_run = experiment.submit(pipeline)
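
    Because runs/submit returns as soon as the run is created, the PythonScriptStep above finishes before the notebook does. A minimal polling sketch, assuming the same databricks-api client and the run_id returned in run_databricks.py (the state names come from the Databricks Jobs API; the 30-second interval is arbitrary):

         import time

         def wait_for_run(db, run_id, poll_seconds=30):
             # Poll until Databricks reports a terminal life-cycle state
             while True:
                 run = db.jobs.get_run(run_id=run_id)
                 state = run["state"]
                 if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
                     # result_state (SUCCESS / FAILED / ...) is only set once the run terminates
                     return state.get("result_state", state["life_cycle_state"])
                 time.sleep(poll_seconds)

    Calling wait_for_run(db, run_id) at the end of run_databricks.py and raising an exception when the result state is not SUCCESS makes the AML step reflect the actual outcome of the notebook.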
      
      

    Using Azure ML Databricks Linked Service (Preview Feature)

    1. Attach the Databricks Workspace: Create a Databricks linked service (attached compute target) in Azure Machine Learning so the pipeline can reach your Databricks workspace. This feature is in preview, and the API might change in the future; a sketch of attaching the workspace from code follows this list.
    2. Define the Databricks Job: Define a pipeline step in AML that points to the Databricks notebook. Note that DatabricksStep comes from the v1 SDK (azureml-pipeline-steps); as the migration guide linked in the question states, there is no direct v2 equivalent.

         from azureml.core import Experiment, Workspace
         from azureml.pipeline.core import Pipeline
         from azureml.pipeline.steps import DatabricksStep

         ws = Workspace.from_config()

         databricks_step = DatabricksStep(
             name="run-notebook",
             notebook_path="/Users/<your-user>/notebook",
             run_name="DatabricksNotebookRun",
             existing_cluster_id="cluster-id",
             compute_target="your-databricks-compute"
         )

         pipeline = Pipeline(workspace=ws, steps=[databricks_step])
         experiment = Experiment(workspace=ws, name="databricks-integration")
         pipeline_run = experiment.submit(pipeline)
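
    For step 1, the attachment can be done with the v1 DatabricksCompute target; a minimal sketch, assuming the resource group and workspace name placeholders are replaced with those of your Databricks workspace and reusing DATABRICKS_TOKEN from above:

         from azureml.core.compute import ComputeTarget, DatabricksCompute

         # Attach the Databricks workspace as an AML compute target named "your-databricks-compute"
         attach_config = DatabricksCompute.attach_configuration(
             resource_group='<databricks-resource-group>',
             workspace_name='<databricks-workspace-name>',
             access_token=DATABRICKS_TOKEN
         )
         databricks_compute = ComputeTarget.attach(ws, 'your-databricks-compute', attach_config)
         databricks_compute.wait_for_completion(show_output=True)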
      
      

    Using Azure ML and Databricks Jobs API

    1. Submit a Databricks Job via Azure ML: Use Azure ML to submit a one-time Databricks run through the Jobs API (runs/submit). The request can be made from a script that runs inside an AML step, in the same way as run_databricks.py above.

         import requests

         DATABRICKS_URL = 'https://<your-databricks-instance>'
         DATABRICKS_TOKEN = '<your-databricks-token>'

         job_payload = {
             "run_name": "My Databricks Job",
             "existing_cluster_id": "cluster-id",
             "notebook_task": {
                 "notebook_path": "/Users/<your-user>/notebook"
             }
         }

         response = requests.post(
             f"{DATABRICKS_URL}/api/2.0/jobs/runs/submit",
             headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
             json=job_payload
         )
         response.raise_for_status()
         run_id = response.json().get("run_id")
         print(f"Databricks job submitted with run ID: {run_id}")
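
    Since the question asks about the v2 SDK specifically: there is no DatabricksStep equivalent in v2, but the trigger-script pattern from the first option carries over as a plain command job. A minimal sketch using the azure-ai-ml package, assuming run_databricks.py sits in the current folder and that the environment and compute names below are placeholders for your own:

         from azure.ai.ml import MLClient, command
         from azure.identity import DefaultAzureCredential

         # Connect to the AML workspace via config.json in the current directory
         ml_client = MLClient.from_config(credential=DefaultAzureCredential())

         # Wrap the Databricks trigger script in a v2 command job
         job = command(
             code=".",  # folder containing run_databricks.py
             command="python run_databricks.py ${{inputs.notebook_path}}",
             inputs={"notebook_path": "/Users/<your-user>/notebook"},
             environment="<your-environment>@latest",
             compute="your-compute-cluster",
             display_name="databricks-integration",
         )
         ml_client.jobs.create_or_update(job)

    The same command() can also be used as a step inside a v2 @pipeline definition if the notebook run has to be chained with other steps.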