Best practice to copy constantly updating parquet files into a SQL database

Jani Hämäläinen 80 Reputation points
2024-09-05T08:52:05.1166667+00:00

I have to copy over 300 million rows from parquet files into an Azure SQL Database. (The parquet files come from Synapse Link for Dataverse.) I have tried using a copy activity, but because the parquet files are constantly being updated, it feels funky. I have also tried using data flows, but they have been failing on the insert side. What is the best practice to use in this case?

Azure SQL Database
Azure Synapse Analytics

Accepted answer
  AnnuKumari-MSFT 32,816 Reputation points Microsoft Employee
    2024-09-06T06:19:18.79+00:00

    Hi Jani Hämäläinen,

    Thank you for using the Microsoft Q&A platform and for posting your query here.

    As per my understanding, you are trying to copy data from large-volume parquet files to Azure SQL Database using ADF/Synapse pipelines.

    I would like to understand what you mean when you say the parquet files are updated constantly. My understanding is that the files are being replaced with new, updated versions.

    To perform an incremental load with parquet files as the source, the copy activity alone would not suffice. You can either use a data flow or write PySpark code in a Synapse notebook to identify the incremental data and load it into Azure SQL Database, as sketched below.
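
    A minimal PySpark sketch of that approach, assuming the Synapse Link for Dataverse output carries a SinkModifiedOn column that can serve as a watermark; the storage path, table names, watermark value and SQL credentials are all placeholders:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Read the parquet folder exported by Synapse Link for Dataverse.
    source_df = spark.read.parquet(
        "abfss://dataverse@<storageaccount>.dfs.core.windows.net/<table>/"
    )

    # Keep only rows changed since the last successful load; the watermark would
    # normally come from a control table or a pipeline parameter.
    last_watermark = "2024-09-01T00:00:00Z"
    incremental_df = source_df.filter(F.col("SinkModifiedOn") > F.lit(last_watermark))

    # Append the delta into a staging table in Azure SQL Database over JDBC;
    # a stored procedure or MERGE can then upsert it into the final table.
    jdbc_url = (
        "jdbc:sqlserver://<server>.database.windows.net:1433;"
        "database=<database>;encrypt=true;loginTimeout=30"
    )
    (incremental_df.write
        .format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", "staging.DataverseTable")
        .option("user", "<user>")
        .option("password", "<password>")
        .option("batchsize", 10000)
        .mode("append")
        .save())
    ```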

    If there is any possibility of treating a Synapse SQL pool as the target database, you can use external tables to load the data from parquet into the Synapse database, or use external tables in the Synapse database as an intermediate staging layer before Azure SQL (see the sketch below).
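
    A rough sketch of that staging variant, assuming a dedicated SQL pool where an external data source and a parquet file format already exist; every object name, the column list and the connection string are placeholders, and the T-SQL is simply driven from Python via pyodbc here:

    ```python
    import pyodbc

    # Connect to the dedicated SQL pool; autocommit avoids wrapping DDL in a transaction.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<workspace>.sql.azuresynapse.net;DATABASE=<pool>;"
        "UID=<user>;PWD=<password>;Encrypt=yes",
        autocommit=True,
    )
    cursor = conn.cursor()

    # External table pointing straight at the parquet folder (no data copied yet).
    cursor.execute("""
        CREATE EXTERNAL TABLE ext.DataverseTable (
            Id UNIQUEIDENTIFIER,
            Name NVARCHAR(200),
            SinkModifiedOn DATETIME2
        )
        WITH (
            LOCATION = '/dataverse/<table>/',
            DATA_SOURCE = MyDataLakeSource,
            FILE_FORMAT = ParquetFileFormat
        );
    """)

    # Materialize the data into an internal staging table, from which it can be
    # pushed on to Azure SQL Database (for example with a copy activity).
    cursor.execute("""
        CREATE TABLE stg.DataverseTable
        WITH (DISTRIBUTION = ROUND_ROBIN)
        AS SELECT * FROM ext.DataverseTable;
    """)

    cursor.close()
    conn.close()
    ```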

    Hope it helps. Kindly let us know how it goes. Thank you.


0 additional answers
