Preserving Nested Zip Structure in ADF Copy Activity

vikranth-0706 265 Reputation points
2024-09-04T18:35:48.8033333+00:00

I'm working with Azure Data Factory (ADF) and using a Copy Data activity to unzip compressed files in Blob storage. It functions flawlessly for simple text files within basic zip archives. However, I encounter an issue when dealing with nested zip files.

Desired Outcome:

My goal is to unzip the main archive and preserve the nested zip files within a designated folder named after the main archive. The ideal structure would resemble this:

main_archive_folder/

  • nested1.zip
  • nested2.zip

Current Behavior:

The Copy Data activity extracts nested zip files individually, creating folders named after each nested archive and placing the uncompressed content inside them. This results in a structure like this:

main_archive.zip/

  • nest1.zip/
    • f1.txt
  • nest2.zip/
    • f2.txt

Why It's Confusing:

I've come across information suggesting ADF can't handle nested zip extraction in one go. However, the behavior I'm seeing suggests it does at least partially unzip them. I'm unsure whether there's a way to achieve the desired outcome of keeping the nested zip files intact within a folder.

Seeking Guidance:

Can you offer any suggestions or alternative methods within ADF to achieve my desired structure for nested zip files?

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Smaran Thoomu 14,870 Reputation points Microsoft Vendor
    2024-09-04T23:28:12.67+00:00

    Hi @vikranth-0706
    Thank you for using Microsoft Q&A platform and thanks for posting your query here.
    It seems that you are trying to use Azure Data Factory (ADF) to unzip compressed files in Blob storage, but you are encountering an issue when dealing with nested zip files. The current behavior of the Copy Data activity is to extract nested zip files individually, creating folders named after each nested archive and placing the uncompressed content inside them. However, you want to preserve the nested zip files within a designated folder named after the main archive.
    Unfortunately, ADF does not have a built-in feature to handle nested zip extraction in one go.

    Since your root zip file contains zip files only in a sub-folder, you can try the following workaround. This approach creates the needed zip files from the unzipped folders and deletes those folders.

    After the Copy activity, add a 'Get Metadata' activity that uses a Binary dataset, and add 'Child items' to its field list. Point the Binary dataset at the target unzipped folder (in my case, 'zipsoutout/mainzip.zip') and don't specify a compression type.

    This lists all the folder and file names. Use a Filter activity to keep only the unzipped folders; these are the folders whose names end in '.zip'.

    Use the expressions below for the items and condition in the filter activity.

    Items: @activity('Get Metadata1').output.childItems
    Condition: @endswith(item().name, '.zip')
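
To make the effect of the Filter step concrete, here is a minimal Python sketch that mimics the condition on a sample list. The sample entries are hypothetical; they only assume the usual shape of a Get Metadata `childItems` array, where each entry carries a `name` and a `type`. This is an illustration of the logic, not ADF code.

```python
# Hypothetical sample of a Get Metadata 'childItems' output (names are illustrative).
child_items = [
    {"name": "nest1.zip", "type": "Folder"},
    {"name": "nest2.zip", "type": "Folder"},
    {"name": "readme.txt", "type": "File"},
]


def filter_unzipped_folders(items):
    """Mimic the Filter condition @endswith(item().name, '.zip')."""
    return [item for item in items if item["name"].endswith(".zip")]


filtered = filter_unzipped_folders(child_items)
print([item["name"] for item in filtered])
```

Only the entries whose names end in '.zip' survive, which is the array the For-Each activity then iterates over.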

    Now pass the output array from the Filter activity, @activity('Filter1').output.value, to the For-Each activity.

    Inside the For-Each, use a Copy activity to zip each folder. Use the same dataset from the previous Get Metadata activity as the source, and point it at the current folder with the expression below.

    @concat('mainzip.zip/',item().name)
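
As a rough illustration of what that expression evaluates to on each iteration, the following Python sketch builds the same path for every filtered item. The function name and the sample item names are hypothetical; only the 'mainzip.zip/' prefix and the concatenation come from the expression above.

```python
def source_folder_path(item_name, root="mainzip.zip"):
    """Mimic @concat('mainzip.zip/', item().name) for one For-Each item."""
    return root + "/" + item_name


# Hypothetical filtered item names, as they would arrive from the Filter activity.
for name in ["nest1.zip", "nest2.zip"]:
    print(source_folder_path(name))
```

Each iteration therefore reads from one unzipped folder under the main archive's folder.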

    Create a new Binary dataset with the same folder path. For the file name, create a dataset parameter and reference it in the file name field. Set the required compression type as well.

    Use this dataset as the sink in the Copy activity and set its dataset parameter to @item().name.

    This copy activity will create the zip file you need. To remove the existing unzipped folders, use a Delete activity, which will need a Binary dataset.

    Create a dataset parameter on that dataset and use it as the folder name in the dataset's path.

    In the Delete activity, pass the expression below as the value of that parameter.

    @concat('mainzip.zip/',item().name)

    This deletes everything inside the unzipped folders. Since you're using Blob storage, where folders are virtual, the emptied folders are removed automatically.

    Now, run the pipeline in debug mode, and it will create the needed inner zip files.
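
For readers who want to reason about the workaround outside ADF, here is a minimal local-filesystem sketch in Python of the same idea: find folders whose names end in '.zip', re-create an archive from each, and delete the folder. It is not ADF code and not a substitute for the pipeline; the function name is hypothetical. One difference is an assumption forced by the local filesystem: a file and a folder cannot share a name there, so the sketch writes to a temporary name and renames it after the folder is removed.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def rezip_unzipped_folders(root):
    """Re-create '<name>.zip' archives from folders named '<name>.zip' under root."""
    root = Path(root)
    created = []
    # Materialize the listing first so later changes do not affect iteration.
    folders = sorted(p for p in root.iterdir() if p.is_dir() and p.name.endswith(".zip"))
    for folder in folders:
        tmp_archive = root / (folder.name + ".tmp")
        with zipfile.ZipFile(tmp_archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for file in sorted(folder.rglob("*")):
                if file.is_file():
                    # Store paths relative to the folder being zipped.
                    zf.write(file, file.relative_to(folder).as_posix())
        shutil.rmtree(folder)        # counterpart of the Delete activity
        tmp_archive.rename(folder)   # the archive takes over the folder's name
        created.append(folder.name)
    return created


if __name__ == "__main__":
    # Build a throwaway layout that resembles the unzipped output.
    with tempfile.TemporaryDirectory() as tmp:
        main = Path(tmp) / "mainzip.zip"
        (main / "nest1.zip").mkdir(parents=True)
        (main / "nest1.zip" / "f1.txt").write_text("one")
        (main / "nest2.zip").mkdir()
        (main / "nest2.zip" / "f2.txt").write_text("two")
        print(rezip_unzipped_folders(main))
```

After the call, the main folder contains the nested archives as files rather than as extracted folders, which is the structure the question asks for.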

    Hope this helps. Do let us know if you have any further queries.

