Slow performance reading parquet files from Data Lake Gen2

Jorge Lopez 36

I'm using Azure Data Lake Gen2 to store data in parquet file format. I have split the data using partitions by year, month and day to benefit from the filtering functionality. I'm reading the data from python directly (no Spark or Synapse here) using the pyarrofs_adlgen2 library as suggested in other Q&A of this forum. However, the performance is much worse compared to what I get locally (storing and reading the data from my local file system). The following query takes 0.2s in local Vs 11.7s from Azure Data Lake Gen2:

import azure.identity
import pandas as pd
import pyarrow.fs
import pyarrowfs_adlgen2

handler=pyarrowfs_adlgen2.AccountHandler.from_account_name(account_name, azure.identity.DefaultAzureCredential()
fs = pyarrow.fs.PyFileSystem(handler)

filters = [('year', '=', 2020), ('month', '=', 1), ('day', 'in', (9, 10, 11))]
df = pd.read_parquet('container/data', filesystem=fs, engine='pyarrow', dtype_backend='pyarrow', filters=filters)

There is something I'm missing? How can I improve the performance using Azure Storage?

KarishmaTiwari-MSFT 18,872 Reputation points Microsoft Employee

2024-02-16T01:21:00.5333333+00:00

@Jorge Lopez Thanks for posting your query on Microsoft Q&A.
There are a few things you can try to improve performance:

1.If your workloads require a low consistent latency and/or require a high number of input output operations per second (IOP), consider using a premium block blob storage account. This type of account makes data available via high-performance hardware. Data is stored on solid-state drives (SSDs) which are optimized for low latency. SSDs provide higher throughput compared to traditional hard drives. The storage costs of premium performance are higher, but transaction costs are lower. Therefore, if your workloads execute a large number of transactions, a premium performance block blob account can be economical.
https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#consider-premium

2.To achieve the best performance, use all available throughput by performing as many reads and writes in parallel as possible.

https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-best-practices#configure-data-ingestion-tools-for-maximum-parallelization

3.Larger files lead to better performance and reduced costs.

4.File format, file size, and directory structure can all impact performance and cost.

Refer to detailed documentation here: https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-best-practices
KarishmaTiwari-MSFT 18,872 Reputation points Microsoft Employee

2024-02-21T00:44:19.39+00:00

@Jorge Lopez Checking in to see if the comment above helped. Let me know. Thanks.
Jorge Lopez 36 Reputation points

2024-02-21T07:25:22.32+00:00
Thank you for your answer @KarishmaTiwari-MSFT

The premium storage slightly improve the times, but they are still far from what I got in local.

I'm not sure if this apply on my use case

I'm exploring it, but don't seems to have a true impact on performance for my scenario
KarishmaTiwari-MSFT 18,872 Reputation points Microsoft Employee

2024-02-27T15:59:46.84+00:00

Thanks for sharing the update. Were you able to try with the Premium option?
Peter Stevens 0 Reputation points

2024-07-05T22:21:50.9733333+00:00

I have the same problem, with similar times locally and reading from blob. Did you end up finding a solution?

Share via

Slow performance reading parquet files from Data Lake Gen2