Is there a way to restrict multiple instances of an ADF pipeline on the same path of an event-based trigger?

Divya Sharma 1 Reputation point
2024-05-28T11:16:32.32+00:00

I'm running Spark notebooks in Synapse from an ADF pipeline with an event-based trigger. If a notebook gets triggered twice, I encounter a conflict error. I understand that I can avoid concurrent execution of the pipeline by setting its concurrency to 1, but I don't want to do this because it would unnecessarily queue all the files.

I need a solution that conditionally waits only if there is an active pipeline processing the same path.

Suggestions for both the notebook and ADF are appreciated.

Tags: Azure Synapse Analytics, Azure Databricks, Azure Data Factory

1 answer

  1. Harishga 5,590 Reputation points Microsoft Vendor
    2024-05-28T12:41:23.77+00:00

    Hi @Divya Sharma

    Welcome to the Microsoft Q&A platform, and thanks for posting your question here.

    To manage concurrent execution of Spark notebooks in Synapse and of ADF pipelines without setting the pipeline concurrency to 1, you can implement a conditional wait mechanism.

    Here are some suggestions for both environments:
    For Spark Notebooks in Synapse:

    • Use awaitTermination: calling awaitTermination on a streaming query blocks the notebook until that query has finished, so a downstream step never starts while a stream is still writing to the path.
    • Check active streams: if several streams may be running, loop through spark.streams.active and call awaitTermination on each one, so the notebook proceeds only after every stream has completed (see the sketch after this list). Note that spark.streams.awaitAnyTermination returns as soon as any one stream stops, so on its own it is not enough to wait for all of them.
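
    A minimal sketch of that wait, assuming the notebook runs in a Synapse Spark session where spark is the session object provided by the runtime:

    ```python
    # Minimal sketch: block until every active structured-streaming query
    # on this session has finished before moving on.
    for query in spark.streams.active:
        print(f"Waiting for streaming query {query.name or query.id} to finish...")
        # Blocks until the query stops; raises if the query ended with an error.
        query.awaitTermination()

    print("No active streams remain; safe to proceed.")
    ```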

    For Azure Data Factory (ADF):

    • Web activity for a pipeline-run check: add a web activity at the start of your pipeline that calls the ADF REST API to query for in-progress runs of the same pipeline. If an active run is found, the pipeline can cancel itself or wait (for example, in an Until loop with a Wait activity) until the active run completes; see the sketch after this list.
    • Global parameter lock: implement a parameter that acts as a lock. At the start of the pipeline, check the lock status; if it is 'unlocked', set it to 'locked' and proceed, then set it back to 'unlocked' on completion. This prevents concurrent runs while the lock is held.
    • Custom activity: create a custom activity within your pipeline that performs the active-run check itself, using the same ADF REST API to decide whether the pipeline should proceed.
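
    As an illustration of that REST check, here is a sketch in Python of the Query Pipeline Runs call the web or custom activity would make. The subscription, resource group, factory, pipeline name, and bearer token are all placeholders; in a web activity you would put the same URL and JSON body directly into the activity settings and authenticate with the factory's managed identity.

    ```python
    # Sketch (placeholders, not a drop-in implementation): ask the ADF REST API
    # whether another run of the same pipeline is currently in progress.
    import requests
    from datetime import datetime, timedelta, timezone

    SUBSCRIPTION = "<subscription-id>"   # placeholder
    RESOURCE_GROUP = "<resource-group>"  # placeholder
    FACTORY = "<factory-name>"           # placeholder
    PIPELINE = "<pipeline-name>"         # placeholder
    TOKEN = "<bearer-token>"             # e.g. obtained via azure-identity

    url = (
        f"https://management.azure.com/subscriptions/{SUBSCRIPTION}"
        f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.DataFactory"
        f"/factories/{FACTORY}/queryPipelineRuns?api-version=2018-06-01"
    )

    now = datetime.now(timezone.utc)
    body = {
        # The API requires a time window; look back far enough to cover long runs.
        "lastUpdatedAfter": (now - timedelta(hours=24)).isoformat(),
        "lastUpdatedBefore": now.isoformat(),
        "filters": [
            {"operand": "PipelineName", "operator": "Equals", "values": [PIPELINE]},
            {"operand": "Status", "operator": "Equals", "values": ["InProgress"]},
        ],
    }

    resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {TOKEN}"})
    resp.raise_for_status()
    runs = resp.json().get("value", [])
    print(f"{len(runs)} in-progress run(s) of {PIPELINE}")
    ```

    Each run object in the response also carries the parameters the run was invoked with, so you can compare the folder-path parameter and wait only when another in-progress run is processing the same path, which gives you the path-scoped conditional wait you asked about instead of a blanket concurrency limit.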

    Reference:
    https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries

    https://www.moderndata.ai/2021/12/how-to-prevent-concurrent-pipeline-execution-in-azure-data-factory-or-azure-synapse-analytics-design-1/

    https://video2.skills-academy.com/en-us/answers/questions/1486855/avoid-concurrent-execution-of-the-same-pipeline

    I hope this information helps you. Let me know if you have any further questions or concerns.
