Staging of Source Data in Parquet vs SQL DB for DWH with ADF for Delta-Loads

Jörg Lang 120 Reputation points
2024-07-23T06:58:39.9333333+00:00

Hi there,
I have a "conceptual" question I would like to discuss.

A new DWH is being built on Azure using ADF and SQL for dimensions/facts. Today, the landing zone also runs on a SQL DB.

We are also only loading delta datasets from the source DB (on-prem) with simple query logic like
select max(last_modified) from landing_table
which runs quite smoothly.

Now the idea has come up to use Azure Data Lake Storage Gen2 in combination with Parquet files to increase our processing speed and reduce the space used in the SQL DB.

  • Does it make sense to switch?
  • How do we get the max(last_modified) from a Parquet file?
  • Is it right that we should create a new Parquet file for each run, to ensure that we are not losing any staged data from before? How can we then get the last_modified?

Thanks for your thoughts on that!
Regards

Joerg

Azure Data Factory

1 answer

  1. Bhargava-MSFT 30,496 Reputation points Microsoft Employee
    2024-07-23T19:16:36.9066667+00:00

    Hello Jörg Lang,

    Switching to Azure Data Lake Storage Gen2 and using Parquet files can indeed offer benefits in terms of processing speed and storage optimization. Parquet is a columnar storage file format that works well with Azure Data Lake Storage and is known for its efficiency and performance, especially with large datasets.

    To get the max(last_modified) from a Parquet file, you can use a tool like Apache Spark or Azure Databricks to read the Parquet file and perform the necessary aggregation. Both of these tools have built-in support for reading and processing Parquet files.
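
    For illustration, here is a minimal PySpark sketch, assuming the staged files live under a hypothetical ADLS Gen2 path and the watermark column is called last_modified, as in your landing table:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical ADLS Gen2 location -- replace with your account/container.
    landing_path = "abfss://landing@yourlake.dfs.core.windows.net/source_table/*/*.parquet"

    # Read all staged Parquet files at once and aggregate the watermark column.
    df = spark.read.parquet(landing_path)
    watermark = df.agg(F.max("last_modified").alias("wm")).first()["wm"]

    print(f"Delta watermark: {watermark}")
    ```

    The returned value can then drive the same kind of filter you use today on the source query (last_modified > watermark).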

    Regarding creating new Parquet files for each run, it's a common practice to create new Parquet files for each data ingestion run, especially when dealing with delta datasets. This helps in managing data versions and ensures that you have a historical record of data changes.
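
    As a sketch of that pattern, each run can write into its own timestamped folder so that earlier runs are never overwritten (the folder naming is just an assumed convention here, and delta_df stands in for the rows extracted in this run):

    ```python
    from datetime import datetime, timezone
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for the delta rows extracted in this run.
    delta_df = spark.createDataFrame(
        [(1, "2024-07-22T10:15:00")], ["id", "last_modified"]
    )

    # Hypothetical base path; one subfolder per run, named by UTC timestamp.
    base_path = "abfss://landing@yourlake.dfs.core.windows.net/source_table"
    run_id = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

    # mode("error") (the default) fails rather than overwriting,
    # so previously staged data is preserved.
    delta_df.write.mode("error").parquet(f"{base_path}/run={run_id}")
    ```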

    When you create a new Parquet file for each run, you can maintain a metadata file or a manifest that tracks the last_modified timestamps for each file. This way, you can easily retrieve the most recent last_modified value across all your Parquet files.
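
    A minimal sketch of such a manifest, assuming it is a small JSON file kept next to the data (plain local file I/O here for brevity; against ADLS Gen2 you would go through the storage SDK or an ADF activity):

    ```python
    import json
    from datetime import datetime, timezone

    def append_run(manifest_path: str, file_path: str, max_last_modified: str) -> None:
        """Record one ingestion run: which file was written and its watermark."""
        try:
            with open(manifest_path) as f:
                runs = json.load(f)
        except FileNotFoundError:
            runs = []
        runs.append({
            "file": file_path,
            "max_last_modified": max_last_modified,  # ISO-8601 string
            "written_at": datetime.now(timezone.utc).isoformat(),
        })
        with open(manifest_path, "w") as f:
            json.dump(runs, f, indent=2)

    def latest_watermark(manifest_path: str) -> str:
        """Overall watermark = max over all per-run watermarks.
        ISO-8601 strings sort chronologically, so a string max is safe here."""
        with open(manifest_path) as f:
            runs = json.load(f)
        return max(r["max_last_modified"] for r in runs)
    ```

    The next delta load then only has to read this one small file instead of scanning every Parquet file in the lake.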

    I hope this helps. Please let me know if you have any further questions.

