Exit multiple values from Azure Synapse Analytics notebook

DeGuy 40 Reputation points
2023-08-02T04:21:42.4333333+00:00

Hello,

I want an Azure Synapse notebook to exit multiple values, and these values must be used as the deciding factor for executing the Synapse ADF pipeline. For example:

I have a dataframe with 10 columns in a Synapse notebook. How do I pass the entire dataframe, along with its 10 columns, and use the column values in the pipeline?

mssparkutils.notebook.exit(d)

I see that mssparkutils.notebook.exit exits a single value, not an entire dataframe. Can someone please help me with how to proceed?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

Accepted answer
  1. RevelinoB 2,780 Reputation points
    2023-08-02T04:37:42.9433333+00:00

    Hi DeGuy,

    Please be aware that an Azure Synapse notebook allows you to pass only a single value out, using the mssparkutils.notebook.exit() function. There is no direct way to pass an entire dataframe, but you can work around it.

    One common method is to save your dataframe to a file (like Parquet, CSV, or JSON) in a known location in your Azure Data Lake or Blob Storage, and then read that file in your Azure Data Factory (ADF) pipeline. Here is an example of how you can save a dataframe as a CSV file:

    df.coalesce(1).write.format('csv').option('header', 'true').save('abfss://your-data-lake@your-storage-account.dfs.core.windows.net/folder/file.csv')

    In this example, df is your dataframe, 'csv' is the format you want to write in (Spark's built-in CSV writer; the older 'com.databricks.spark.csv' name still works as an alias), and the last argument is the path where you want to save the file.

    In your ADF pipeline, you can read the CSV file using a Copy Activity or another appropriate activity.

    The key to this approach is that the notebook and the pipeline have a shared understanding of where the data is being written and read.
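    One way to make that shared location explicit is to have the notebook exit the very path it wrote, so the pipeline picks it up from exitValue instead of hard-coding it. This is only a sketch that assumes it runs inside a Synapse notebook session (df, mssparkutils, and the storage path are placeholders):

```python
# Sketch (runs only inside a Synapse notebook): write the dataframe to an
# agreed location, then exit that path so the pipeline can read it back.
output_path = 'abfss://your-data-lake@your-storage-account.dfs.core.windows.net/exchange/run1'
df.coalesce(1).write.mode('overwrite').format('csv').option('header', 'true').save(output_path)
mssparkutils.notebook.exit(output_path)  # pipeline sees this as exitValue
```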

    If you need to pass only a few fields or some aggregate information about your dataframe, you can create a dictionary or a JSON object, serialize it into a string, and pass that string using mssparkutils.notebook.exit().

    import json

    # Let's assume df_summary contains some summary information about your dataframe.
    df_summary = {"column1_avg": avg1, "column2_max": max2, "column3_min": min3}
    summary_str = json.dumps(df_summary)

    mssparkutils.notebook.exit(summary_str)

    Then, in your ADF pipeline, you can use the @json() function to parse the JSON string back into an object.
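    For example, assuming the notebook activity is named Notebook1 (a placeholder name; the exact output path can vary by activity type and service version), a later activity could read one of the summary fields with an expression along these lines:

```
@json(activity('Notebook1').output.status.Output.result.exitValue).column1_avg
```

    Note that json() has to wrap the exitValue string before a property is selected; exitValue arrives as a plain string, so selecting a property on it directly will fail.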

    Remember, the approach to use depends on your specific needs and the size and complexity of your data.
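    The serialize/parse round trip above can be sanity-checked in plain Python (the values here are made up; in the pipeline the parsing is done by the @json() expression rather than json.loads):

```python
import json

# Notebook side: pack summary values into one string for notebook.exit().
df_summary = {"column1_avg": 12.5, "column2_max": 99, "column3_min": 1}
summary_str = json.dumps(df_summary)

# Pipeline side: @json() performs the equivalent of json.loads on exitValue.
parsed = json.loads(summary_str)
print(parsed["column1_avg"])  # 12.5
```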

    I hope this helps!


1 additional answer

  1. Johan Tokarskij 0 Reputation points
    2024-06-19T01:13:31.87+00:00

    I have a similar problem that I am trying to solve, though in Microsoft Fabric. Here is a simplified version of the problem:

    import json
    
    summa = 250
    snitt = 100
    
    result = {
        'sum': summa,
        'snitt': snitt
    }
    
    result_json = json.dumps(result)
    
    mssparkutils.notebook.exit(result_json)
    

    I then want to pass this JSON to an Outlook 365 activity. I am trying to capture one of the keys with

    @json(activity('exit').output.result.exitValue.snitt)
    

    but I'm getting the following error:

    The expression 'json(activity('exit').output.result.exitValue.snitt)' cannot be evaluated because property 'snitt' cannot be selected. Property selection is not supported on values of type 'String'.

    Any workaround for this? My goal is to get some aggregations from notebook and pass them to outlook365.
