Integrating Databricks notebooks in Azure ML using SDK V2

Alexander, 2024-06-07

Hi all,

We currently have some Azure Databricks notebooks in production which we would like to integrate into Azure ML using the v2 SDK.

I found resources for integrating these notebooks with the databricks_step in the v1 SDK. The official documentation mentions that no equivalent component is provided in v2: https://video2.skills-academy.com/en-us/azure/machine-learning/migrate-to-v2-execution-pipeline?view=azureml-api-2

Does anyone have advice on how to integrate Azure Databricks notebooks with ease? I would prefer not to have to use the REST API.


1 answer

  1. Amira Bedhiafi, 2024-06-08

    Using AML Pipeline with PythonScriptStep

    1. Install Required Libraries: Make sure you have the necessary libraries installed:
      
         pip install azureml-core azureml-pipeline-steps databricks-api
      
      
    2. Set Up Authentication: Authenticate to both Azure ML and Databricks:

         from azureml.core import Workspace
         from databricks_api import DatabricksAPI

         # Azure ML authentication
         ws = Workspace.from_config()

         # Databricks authentication
         DATABRICKS_URL = 'https://<your-databricks-instance>'
         DATABRICKS_TOKEN = '<your-databricks-token>'
         db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)

    3. Create a Python Script to Trigger Databricks Notebooks: Create a Python script (run_databricks.py) that triggers the execution of your Databricks notebook. Note that runs/submit needs a cluster (an existing cluster ID or a new-cluster spec) and returns as soon as the run is created; if the step should wait for the notebook to finish, see the polling sketch after this list.

         import sys
         from databricks_api import DatabricksAPI

         DATABRICKS_URL = 'https://<your-databricks-instance>'
         DATABRICKS_TOKEN = '<your-databricks-token>'
         db = DatabricksAPI(host=DATABRICKS_URL, token=DATABRICKS_TOKEN)

         def run_databricks_notebook(notebook_path):
             # submit_run returns the API response; the run ID is in the "run_id" field
             response = db.jobs.submit_run(
                 existing_cluster_id='<your-cluster-id>',
                 notebook_task={"notebook_path": notebook_path}
             )
             return response["run_id"]

         if __name__ == "__main__":
             notebook_path = sys.argv[1]
             run_id = run_databricks_notebook(notebook_path)
             print(f"Triggered Databricks notebook run with ID: {run_id}")

    4. Define the AML Pipeline: Define a pipeline in Azure ML that uses PythonScriptStep to run the Databricks notebook.

         from azureml.core import Experiment
         from azureml.pipeline.core import Pipeline
         from azureml.pipeline.steps import PythonScriptStep

         notebook_path = '/Users/<your-user>/notebook'

         step = PythonScriptStep(
             script_name="run_databricks.py",
             arguments=[notebook_path],
             compute_target='your-compute-cluster',
             source_directory='.',
             allow_reuse=False
         )

         pipeline = Pipeline(workspace=ws, steps=[step])
         experiment = Experiment(workspace=ws, name='databricks-integration')
         pipeline_run = experiment.submit(pipeline)
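
    Because runs/submit returns as soon as the run is created, the PythonScriptStep above finishes before the notebook does. A minimal polling sketch, assuming the same databricks-api client and the run_id returned in run_databricks.py (the state names come from the Databricks Jobs API; the 30-second interval is arbitrary):

         import time

         def wait_for_run(db, run_id, poll_seconds=30):
             # Poll until Databricks reports a terminal life-cycle state
             while True:
                 run = db.jobs.get_run(run_id=run_id)
                 state = run["state"]
                 if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
                     # result_state (SUCCESS / FAILED / ...) is only set once the run terminates
                     return state.get("result_state", state["life_cycle_state"])
                 time.sleep(poll_seconds)

    Calling wait_for_run(db, run_id) at the end of run_databricks.py and raising an exception when the result state is not SUCCESS makes the AML step reflect the actual outcome of the notebook.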
      
      

    Using Azure ML Databricks Linked Service (Preview Feature)

    1. Attach the Databricks Workspace: Create a Databricks linked service (attached compute target) in Azure Machine Learning so the pipeline can reach your Databricks workspace. This feature is in preview, and the API might change in the future; a sketch of attaching the workspace from code follows this list.
    2. Define the Databricks Job: Define a pipeline step in AML that points to the Databricks notebook. Note that DatabricksStep comes from the v1 SDK (azureml-pipeline-steps); as the migration guide linked in the question states, there is no direct v2 equivalent.

         from azureml.core import Experiment, Workspace
         from azureml.pipeline.core import Pipeline
         from azureml.pipeline.steps import DatabricksStep

         ws = Workspace.from_config()

         databricks_step = DatabricksStep(
             name="run-notebook",
             notebook_path="/Users/<your-user>/notebook",
             run_name="DatabricksNotebookRun",
             existing_cluster_id="cluster-id",
             compute_target="your-databricks-compute"
         )

         pipeline = Pipeline(workspace=ws, steps=[databricks_step])
         experiment = Experiment(workspace=ws, name="databricks-integration")
         pipeline_run = experiment.submit(pipeline)
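
    For step 1, the attachment can be done with the v1 DatabricksCompute target; a minimal sketch, assuming the resource group and workspace name placeholders are replaced with those of your Databricks workspace and reusing DATABRICKS_TOKEN from above:

         from azureml.core.compute import ComputeTarget, DatabricksCompute

         # Attach the Databricks workspace as an AML compute target named "your-databricks-compute"
         attach_config = DatabricksCompute.attach_configuration(
             resource_group='<databricks-resource-group>',
             workspace_name='<databricks-workspace-name>',
             access_token=DATABRICKS_TOKEN
         )
         databricks_compute = ComputeTarget.attach(ws, 'your-databricks-compute', attach_config)
         databricks_compute.wait_for_completion(show_output=True)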
      
      

    Using Azure ML and Databricks Jobs API

    1. Submit a Databricks Job via Azure ML: Use Azure ML to submit a one-time Databricks run through the Jobs API (runs/submit). The request can be made from a script that runs inside an AML step, in the same way as run_databricks.py above.

         import requests

         DATABRICKS_URL = 'https://<your-databricks-instance>'
         DATABRICKS_TOKEN = '<your-databricks-token>'

         job_payload = {
             "run_name": "My Databricks Job",
             "existing_cluster_id": "cluster-id",
             "notebook_task": {
                 "notebook_path": "/Users/<your-user>/notebook"
             }
         }

         response = requests.post(
             f"{DATABRICKS_URL}/api/2.0/jobs/runs/submit",
             headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
             json=job_payload
         )
         response.raise_for_status()
         run_id = response.json().get("run_id")
         print(f"Databricks job submitted with run ID: {run_id}")
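
    Since the question asks about the v2 SDK specifically: there is no DatabricksStep equivalent in v2, but the trigger-script pattern from the first option carries over as a plain command job. A minimal sketch using the azure-ai-ml package, assuming run_databricks.py sits in the current folder and that the environment and compute names below are placeholders for your own:

         from azure.ai.ml import MLClient, command
         from azure.identity import DefaultAzureCredential

         # Connect to the AML workspace via config.json in the current directory
         ml_client = MLClient.from_config(credential=DefaultAzureCredential())

         # Wrap the Databricks trigger script in a v2 command job
         job = command(
             code=".",  # folder containing run_databricks.py
             command="python run_databricks.py ${{inputs.notebook_path}}",
             inputs={"notebook_path": "/Users/<your-user>/notebook"},
             environment="<your-environment>@latest",
             compute="your-compute-cluster",
             display_name="databricks-integration",
         )
         ml_client.jobs.create_or_update(job)

    The same command() can also be used as a step inside a v2 @pipeline definition if the notebook run has to be chained with other steps.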