Azure Databricks workflow that runs multiple jobs sequentially using the same job cluster

Aravind Raam 20 Reputation points
2024-06-01T14:30:01.2+00:00

Newbie here as far as Azure Databricks and workflows are concerned.

TL;DR version: Is there a way to configure the same job cluster for multiple jobs that are part of the same workflow?

Long version:

We have an ELT process that we have broken down into multiple stages.

  1. Stage 1 - build staging tables (about 25 or so tables) - Extract from source and Load into staging
  2. Stage 2 - build final tables (about 12 tables) - Transform data from staging and load into Final
  3. Stage 3 - run multiple (over 100 rules - across the 12 final tables) business rule validations
  4. Stage 4 - create enterprise specific delimited file (not csv) from the 12 final tables

Each of these stages was built as multiple notebooks. Some of the notebooks depend on another notebook having run first. To test the individual stages, we built a driver notebook for each stage that orchestrates the child notebooks in the correct order. So far so good.

Now we are trying to bring these 4 stages into a single workflow. Initially, we repeated the orchestration by running one driver notebook at a time (the 2nd stage driver notebook depends on the 1st, the 3rd on the 2nd, and so on). This works, but the workflow takes a long time to run. We tried to parallelize some of the notebooks by declaring only the dependencies that actually matter, but we ran into concurrency errors (even though no two notebooks write to the same table; we are still researching this).
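
For illustration (the notebook and table names below are made up, not our real ones), the kind of dependency fan-out we tried looks roughly like this in the workflow's task list:

{
  "tasks": [
    {
      "task_key": "stage_customers",
      "notebook_task": { "notebook_path": "/ELT/staging/customers" }
    },
    {
      "task_key": "stage_orders",
      "notebook_task": { "notebook_path": "/ELT/staging/orders" }
    },
    {
      "task_key": "final_orders",
      "depends_on": [
        { "task_key": "stage_customers" },
        { "task_key": "stage_orders" }
      ],
      "notebook_task": { "notebook_path": "/ELT/final/orders" }
    }
  ]
}

Here the two staging tasks have no dependencies and can run in parallel, and the final-table task starts only once both of them complete.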

As a second attempt, we created a new workflow that daisy-chains 4 individual jobs, each doing 1 stage. With this approach, though, each job appears to spin up its own job cluster when it starts.

Example:

Task1 (Notebook to set some parameters) ==> Task2 (Run Job1) ==> Task3 (Run Job2) ==> Task4 (Run Job3) ==> Task5 (Run Job4) ==> Task6 (Notebook to provide notifications as well as updates)

Task1 and Task6 appear to use the same job cluster, whereas Task2, 3, 4, and 5 each spin up their own cluster, which adds time to the whole workflow. Is there a way to set the job cluster for the overall workflow and "share" or pass that cluster as a parameter to the individual jobs?
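
For reference, the parent workflow looks roughly like this (heavily simplified; cluster settings, notebook paths, and job IDs are placeholders, and Task3 through Task5 follow the same Run Job pattern as Task2):

{
  "name": "elt_parent_workflow",
  "job_clusters": [
    {
      "job_cluster_key": "parent_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4
      }
    }
  ],
  "tasks": [
    {
      "task_key": "Task1_set_parameters",
      "job_cluster_key": "parent_cluster",
      "notebook_task": { "notebook_path": "/ELT/set_parameters" }
    },
    {
      "task_key": "Task2_run_stage1",
      "depends_on": [ { "task_key": "Task1_set_parameters" } ],
      "run_job_task": { "job_id": 11111 }
    },
    {
      "task_key": "Task6_notify",
      "depends_on": [ { "task_key": "Task5_run_stage4" } ],
      "job_cluster_key": "parent_cluster",
      "notebook_task": { "notebook_path": "/ELT/notify" }
    }
  ]
}

Only Task1 and Task6 reference the parent job's cluster; the Run Job tasks just trigger the child jobs, and each child job starts whatever compute is defined in its own configuration, which matches what we are seeing.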


Accepted answer
  1. PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee
    2024-06-03T05:19:25.2866667+00:00

    @Aravind Raam - Thanks for the question and using MS Q&A platform.

    To optimize resource usage with jobs that orchestrate multiple tasks, use shared job clusters. A shared job cluster allows multiple tasks in the same job run to reuse the cluster. You can use a single job cluster to run all tasks that are part of the job, or multiple job clusters optimized for specific workloads. To use a shared job cluster:

    1. Select New Job Clusters when you create a task and complete the cluster configuration.
    2. Select the new cluster when adding a task to the job, or create a new job cluster. Any cluster you configure when you select New Job Clusters is available to any task in the job.

    For more details, refer to Use shared job clusters.
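
    As a rough sketch (the cluster settings and notebook paths below are placeholders, not recommendations), a job that defines one shared job cluster and points each task at it through job_cluster_key could look like this:

    {
      "name": "elt_workflow",
      "job_clusters": [
        {
          "job_cluster_key": "shared_elt_cluster",
          "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 4
          }
        }
      ],
      "tasks": [
        {
          "task_key": "stage1",
          "job_cluster_key": "shared_elt_cluster",
          "notebook_task": { "notebook_path": "/ELT/stage1_driver" }
        },
        {
          "task_key": "stage2",
          "depends_on": [ { "task_key": "stage1" } ],
          "job_cluster_key": "shared_elt_cluster",
          "notebook_task": { "notebook_path": "/ELT/stage2_driver" }
        }
      ]
    }

    A shared job cluster is scoped to a single job run, so the stages need to be tasks of the same job (as above) rather than separate jobs chained through Run Job tasks; that is why Task2 through Task5 in your current setup each start their own compute.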

    OR

    There is also a way to run multiple jobs that are part of the same workflow on one cluster: create a Databricks cluster up front and point every job in the workflow at that cluster.

    For more details, refer to https://stackoverflow.com/questions/76716693/how-to-run-same-databricks-notebook-from-different-job-concurrently.

    OR
    To do this, create a Databricks cluster with the required configuration and then specify that cluster for all the jobs in your workflow by setting the existing_cluster_id parameter in each job's configuration to the ID of the cluster you created.

    Here's an example of how you can set the existing_cluster_id parameter in the job configuration:

    {
      "name": "job1",
      "new_cluster": false,
      "existing_cluster_id": "1234-567890-abcdefg",
      "notebook_task": {
        "notebook_path": "/path/to/notebook",
        "base_parameters": {
          "param1": "value1",
          "param2": "value2"
        }
      }
    }
    
    

    In this example, the existing_cluster_id parameter is set to the ID of the cluster you created. You can use this same configuration for all the jobs in your workflow, and they will all use the same cluster.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further queries, do let us know.

    1 person found this answer helpful.

0 additional answers
