How do Azure Data Factory's Copy Data REST API pagination rules work, is there any parallelisation, and can it be disabled to copy only one page at a time?

Kit MacInnes-Manby 20 Reputation points
2024-08-29T16:56:09.3266667+00:00

Hello,

I'm using the copy data activity in an Azure Data Factory pipeline to copy data from a REST API data source (the pipeline will soon be migrated to a Synapse one).

A recent update to the source API has stumped me: using the built-in pagination rules now causes data transfer errors that make the pipelines unusable, yet if I manually iterate through the pages I receive the data as expected.

The system I'm connecting to limits the number of rows per page to 2,000, so prior to the update I'd been using the pagination rules shown in the picture below, where the offset parameter cycles through the page numbers until it hits an end condition contained in the response.

[Image: pagination rules configured on the copy activity's REST source]
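
(For anyone who can't see the image: pagination rules are key/value pairs on the REST source. Below is a minimal sketch of their general shape, written as a Python dict for readability; the parameter name, end-condition path, and values are illustrative rather than my exact settings.)

    # Rough shape of REST-source pagination rules (illustrative names only).
    # "RANGE:0::1" starts the query parameter at 0 and increments it by 1
    # with no fixed upper bound; the end condition stops paging once the
    # response field it points at comes back empty.
    pagination_rules = {
        "QueryParameters.offset": "RANGE:0::1",
        "EndCondition:$.content": "Empty",
    }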

The system owner provided this explanation of the update:

"The improvement relies on consecutive incremental calls using the "page" parameter. In order to benefit from this you need to ensure that each request to the API is done in page order, and only triggers a subsequent call once the response has been returned form the API.

Example: to return all events with a date after 01/01/2024:

  • https://example/reports/api/TableName?dateFrom=01/08/2024&page=0
  • wait for response
  • https://example/reports/api/TableName?dateFrom=01/08/2024&page=1
  • wait for response
  • https://example/reports/api/TableName?dateFrom=01/08/2024&page=2
  • etc."

Since this update I've encountered issues where I'm missing data and/or getting duplicate data, alongside some other weird behaviour.

My question is: does the pagination in the copy activity use some parallelisation and/or request the pages out of sequential order? And if so, is there any way to disable that, or force it to go sequentially and wait for a response each time?

Below are examples of the response I get using the "preview data" function in the copy activity. The first uses the pagination rules as above and reports 12,000 records (totalElements), which I know is incorrect. The second is what I get if I hard-code the page number to zero (thereby removing the pagination); it shows the correct count of 10,211 elements. I can then manually cycle through the six pages and extract the correct data.
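
(The page count follows directly from those totals; a quick check, assuming the 2,000-row page size above:)

    import math

    page_size = 2_000
    correct_total = 10_211  # totalElements with the page number hard-coded to 0
    pages = math.ceil(correct_total / page_size)  # = 6 pages (0 through 5)

    # The 12,000 figure reported with pagination rules enabled is exactly
    # six full pages of 2,000 rows.
    assert pages * page_size == 12_000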

I could set up an Until activity to force it to iterate through each page one at a time, but this feels inelegant, and I'm also genuinely curious to find out what is going on with how ADF uses the pagination rules to request the different pages.

Any suggestions welcome!

(NB: I've tried extending the "Request interval" in the source settings and also setting "Degree of copy parallelism" to 1 in the copy settings, but neither has made a difference.)

[Images: preview data results, first with the pagination rules enabled, second with the page number hard-coded to 0]


1 answer

  1. Sergio Andrés Vargas Acosta 75 Reputation points
    2024-08-29T17:05:10.7333333+00:00

    Hi

    Disable parallelization by setting "Degree of copy parallelism" to 1 in Azure Data Factory and use an "Until" activity to manually iterate through each page, ensuring requests are made sequentially.

