How to run Spark streaming on Azure Synapse (which activity to choose for a continuous run)

Mila 0 Reputation points
2024-03-14T11:41:10.04+00:00

Hello community!

I'm looking for the best way to run Spark streaming on Synapse (the data is coming from Event Hubs).

I was considering a Spark job definition, but the Spark job definition activity limits how long it can run (the default is 12 hours, and the maximum allowed is 7 days).

Another possible solution is the HDInsight Spark activity, but even in this case, how can I make the pipeline run continuously?

Is there any recommendation on the best way to do this, based on real-time production projects?

Thank you :)

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,621 questions
Azure HDInsight
An Azure managed cluster service for open-source analytics.
204 questions

1 answer

  1. phemanth 8,080 Reputation points Microsoft Vendor
    2024-03-14T15:31:05.38+00:00

    @Mila

    Thanks for reaching out to Microsoft Q&A.

    Here are some recommendations on how to run Spark Streaming on Synapse for real-time production projects, considering the limitations of Spark job definitions:

    • Structured Streaming in Synapse Spark: Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. It can ingest real-time data from various sources, including Event Hubs. Create a new notebook in Synapse, initialize the variables, and set up the source by providing the name of your Event Hub along with the connection string and consumer group. If the destination is Data Lake, also supply the container and folder path where the streaming data should land.
      Reference: https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/structured-streaming-in-synapse-spark/ba-p/3692836
    • Azure Synapse Analytics with a dedicated Spark pool: You can configure a dedicated Spark pool in your Azure Synapse workspace. This allows you to run Spark jobs, including streaming jobs, for an indefinite period. Here are some resources to get you started:
      1. Create a dedicated Spark pool (version 3.2 or above) for Apache Spark in Azure Synapse Analytics workspace: https://video2.skills-academy.com/en-us/azure/templates/microsoft.synapse/workspaces/bigdatapools
      2. Ingest and process real-time data streams with Azure Synapse Analytics: https://video2.skills-academy.com/en-us/sql/big-data-cluster/spark-streaming-guide?view=sql-server-ver15
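
As a rough illustration of the notebook setup described in the first option, here is a minimal PySpark sketch. It assumes the open-source azure-event-hubs-spark connector is attached to the Spark pool; the helper names (`build_eventhub_conf`, `start_stream`), the output and checkpoint paths, and the choice of Parquet as the sink format are hypothetical and not taken from the linked blog post.

```python
from typing import Dict


def build_eventhub_conf(connection_string: str,
                        event_hub_name: str,
                        consumer_group: str = "$Default") -> Dict[str, str]:
    """Build reader options for the azure-event-hubs-spark connector (assumed installed)."""
    # A namespace-level connection string has no EntityPath, so append the hub name.
    if "EntityPath=" not in connection_string:
        connection_string = f"{connection_string.rstrip(';')};EntityPath={event_hub_name}"
    return {
        "eventhubs.connectionString": connection_string,
        "eventhubs.consumerGroup": consumer_group,
    }


def start_stream(spark, conf: Dict[str, str], output_path: str, checkpoint_path: str):
    """Start a streaming query that lands Event Hubs messages in the data lake."""
    conf = dict(conf)
    # Recent connector versions expect the connection string to be encrypted.
    utils = spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils
    conf["eventhubs.connectionString"] = utils.encrypt(conf["eventhubs.connectionString"])

    events = spark.readStream.format("eventhubs").options(**conf).load()
    # The message payload arrives as binary in the "body" column.
    decoded = events.selectExpr("CAST(body AS STRING) AS body", "enqueuedTime")

    return (decoded.writeStream
            .format("parquet")  # hypothetical sink format
            .option("path", output_path)
            .option("checkpointLocation", checkpoint_path)
            .outputMode("append")
            .start())
```

In a notebook where `spark` is already defined, this could be used as `query = start_stream(spark, build_eventhub_conf(conn_str, "my-hub", "my-consumer-group"), out_path, chk_path)` followed by `query.awaitTermination()`, which blocks until the query stops or fails. The checkpoint location is what lets a restarted query resume from where the previous run stopped.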

    Remember, the choice between these options depends on your specific use case and requirements. It’s also important to note that managing streaming data effectively often involves a combination of these techniques.

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".