Flatten a JSON Nested Array in Azure Synapse pipeline

sachin gupta 376 Reputation points
2024-08-29T18:01:41.31+00:00

I am trying to flatten a JSON response that has nested array objects with a 1:M mapping. Please refer to the JSON response below.

{
    "result": [
        {
            "data": [
                {
                    "dimensions": [
                        "SYNTHETIC-1"
                    ],
                    "timestamps": [
                        17064756000,
                        17064752000,
                        17064751000,
                        17064750000
                    ],
                    "values": [
                        null,
                        null,
                        100,
                        99.6354
                    ]
                },
                {
                    "dimensions": [
                        "SYNTHETIC-2"
                    ],
                    "timestamps": [
                        17064755000,
                        17064754000,
                        17064753000,
                        17064752000
                    ],
                    "values": [
                        101,
                        null,
                        100,
                        99.6354
                    ]
                }
            ]
        }
    ]
}

The above JSON response is the output of a web activity in Azure Synapse. I used a Data Flow to flatten this JSON, unrolling first by 'result' and then by 'data'. This gives me the dimensions[], timestamps[] and values[] arrays, but it is not producing the right output. The output I am getting is below:

dimensions    timestamps    values
SYNTHETIC-1   17064756000   null
SYNTHETIC-1   17064756000   null
SYNTHETIC-1   17064756000   100
SYNTHETIC-1   17064756000   99.6354
SYNTHETIC-1   17064752000   null
SYNTHETIC-1   17064752000   null
SYNTHETIC-1   17064752000   100
SYNTHETIC-1   17064752000   99.6354
SYNTHETIC-1   17064751000   null
SYNTHETIC-1   17064751000   null
SYNTHETIC-1   17064751000   100
SYNTHETIC-1   17064751000   99.6354
SYNTHETIC-1   17064750000   null
SYNTHETIC-1   17064750000   null
SYNTHETIC-1   17064750000   100
SYNTHETIC-1   17064750000   99.6354
SYNTHETIC-2   17064755000   101
SYNTHETIC-2   17064755000   null
SYNTHETIC-2   17064755000   100
SYNTHETIC-2   17064755000   99.6354
SYNTHETIC-2   17064754000   101
SYNTHETIC-2   17064754000   null
SYNTHETIC-2   17064754000   100
SYNTHETIC-2   17064754000   99.6354
SYNTHETIC-2   17064753000   101
SYNTHETIC-2   17064753000   null
SYNTHETIC-2   17064753000   100
SYNTHETIC-2   17064753000   99.6354
SYNTHETIC-2   17064752000   101
SYNTHETIC-2   17064752000   null
SYNTHETIC-2   17064752000   100
SYNTHETIC-2   17064752000   99.6354
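
In effect, the two unrolls behave like a cross join of timestamps and values, which is why I get 4 x 4 = 16 rows per dimension instead of 4. In Python terms (illustrative only):

from itertools import product

timestamps = [17064756000, 17064752000, 17064751000, 17064750000]
values = [None, None, 100, 99.6354]

# pairing every timestamp with every value reproduces the output above
for ts, val in product(timestamps, values):
    print("SYNTHETIC-1", ts, val)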

The output I am expecting is like below:

dimensions    timestamps    values
SYNTHETIC-1   17064756000   null
SYNTHETIC-1   17064752000   null
SYNTHETIC-1   17064751000   100
SYNTHETIC-1   17064750000   99.6354
SYNTHETIC-2   17064755000   101
SYNTHETIC-2   17064754000   null
SYNTHETIC-2   17064753000   100
SYNTHETIC-2   17064752000   99.6354
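
What I need is for timestamps[i] to pair with values[i] within each dimension. A quick Python sketch of the pairing I am after (illustrative only, not part of the pipeline):

# zip the parallel arrays by index instead of cross-joining them
response = {"result": [{"data": [
    {"dimensions": ["SYNTHETIC-1"],
     "timestamps": [17064756000, 17064752000, 17064751000, 17064750000],
     "values": [None, None, 100, 99.6354]},
]}]}

for item in response["result"]:
    for d in item["data"]:
        dim = d["dimensions"][0]
        for ts, val in zip(d["timestamps"], d["values"]):
            print(dim, ts, val)   # 4 rows per dimension, not 16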

Can someone please help me to achieve this?

TIA.

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. Smaran Thoomu 14,715 Reputation points Microsoft Vendor
    2024-08-30T12:11:16.8466667+00:00

    Hi @sachin gupta
    Thanks for the question and for using the MS Q&A platform.
    The same question was asked in the Stack Overflow thread https://stackoverflow.com/questions/78929093/flatten-a-json-nested-array-in-azure-synapse-pipeline, where one of our support engineers has already answered it.

    Answer Summary (from SO):

    Currently, Data Flow doesn't have a feature to flatten all of these nested arrays in a single step. In your case, you can try the workaround below, which uses a combination of flatten and join transformations.

    For some reason, the timestamps array returned only null values when typed as an integer array, so I have set it to a string array. Change this type in the source dataset schema as per your requirement.

    First, use a flatten transformation as below to flatten the data array and get the inner arrays as columns. Use rule-based mapping for this.

    [screenshot: flatten1 settings with rule-based mapping]

    It will give a result like the one below.

    [screenshot: data preview after flatten1, with the full transformation chain]

    Then follow the transformations as shown in the above image.

    Next, use a flatten2 transformation to flatten the dimensions array. In this flatten, give the dimensions array for both the Unroll by and Unroll root options.

    [screenshot: flatten2 settings unrolling the dimensions array]

    Then take a select2 transformation and remove the timestamps column.

    Take a new branch from flatten2 and add another select transformation, select1, where you remove the values column.

    Now, add another flatten transformation, flatten3, after select2 and flatten the values array column. For both the Unroll by and Unroll root options, give the values column only.

    Similarly, add flatten4 after select1 and do the same for the timestamps array column.

    After that, add a surrogate key transformation, surrogateKey1, after flatten3; create a key column key1 and give both the start value and the step as 1.

    [screenshot: surrogateKey1 settings with key column key1]

    Similarly, do the same for the flatten4 transformation (surrogateKey2), giving the same key column name.

    Now, inner join both surrogate key streams using a join transformation based on the common column key1; a sketch of the idea follows the screenshot below.

    [screenshot: join1 settings, inner join on key1]
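
    This works because both branches emit rows in the same order, so the generated keys line up. A minimal Python sketch of the re-pairing (illustrative only, not Data Flow syntax; sample rows taken from the JSON above):

    # Two streams after flatten4 and flatten3: one row per timestamp and
    # one row per value, produced in the same order within each dimension.
    ts_stream  = [("SYNTHETIC-1", 17064756000), ("SYNTHETIC-1", 17064752000)]
    val_stream = [("SYNTHETIC-1", None), ("SYNTHETIC-1", 100)]

    # surrogateKey1: number the values stream from 1, step 1 (the key1 column).
    keyed_val = {k: row for k, row in enumerate(val_stream, start=1)}

    # surrogateKey2 + join1: number the timestamps stream the same way and
    # inner join on key1, restoring the index-wise pairing of the arrays.
    for k, (dim, ts) in enumerate(ts_stream, start=1):
        _, val = keyed_val[k]
        print(dim, ts, val)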

    Now it will have the required pairs. Use another select transformation to remove the duplicate columns and the extra key1 column.

    Now, you will get the desired result.

    [screenshot: final data preview showing the expected rows]

    Use the below dataflow script to build your dataflow.

    source(output(
            result as (data as (dimensions as string[], timestamps as string[], values as string[])[])[]
        ),
        allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false,
        documentForm: 'singleDocument') ~> source1
    source1 foldDown(unroll(result.data, result),
        mapColumn(
            each(result.data,match(true()))
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten1
    flatten1 foldDown(unroll(dimensions, dimensions),
        mapColumn(
            dimensions,
            timestamps,
            values
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten2
    flatten2 select(mapColumn(
            dimensions,
            timestamps
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select1
    flatten2 select(mapColumn(
            dimensions,
            values
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select2
    select2 foldDown(unroll(values, values),
        mapColumn(
            dimensions,
            values
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten3
    select1 foldDown(unroll(timestamps, timestamps),
        mapColumn(
            dimensions,
            timestamps
        ),
        skipDuplicateMapInputs: false,
        skipDuplicateMapOutputs: false) ~> flatten4
    flatten3 keyGenerate(output(key1 as long),
        startAt: 1L,
        stepValue: 1L) ~> surrogateKey1
    flatten4 keyGenerate(output(key1 as long),
        startAt: 1L,
        stepValue: 1L) ~> surrogateKey2
    surrogateKey1, surrogateKey2 join(surrogateKey1@key1 == surrogateKey2@key1,
        joinType:'inner',
        matchType:'exact',
        ignoreSpaces: false,
        broadcast: 'auto')~> join1
    join1 select(mapColumn(
            dimensions = flatten3@dimensions,
            timestamps,
            values
        ),
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> select3
    select3 sink(allowSchemaDrift: true,
        validateSchema: false,
        input(
            result as (data as (dimensions as string[], timestamps as string[], values as string[])[])[]
        ),
        umask: 0022,
        preCommands: [],
        postCommands: [],
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> sink1
    
    
    

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful". And if you have any further query, do let us know.

