Querying partitioned data with SQL on-demand?

Question

It is common to partition data within data lakes and this is reflected in the form of a folder hierarchy.

What is the proper way to query partitioned data with SQL on-demand within Azure Synapse Analytics, as according to my understanding e.g. OPENROWSET does not process subfolders at all?

Although not strictly an example of Hadoop-style partitioning, Azure Event Hubs Capture outputs the Avro files in a deeply nested structure. How would you process this with SQL on-demand (ignoring for a moment that Avro is not supported as file format either)?

Answer

Hello VeliJussiRaitila-8710 and thank you for your question.

To my understanding OPENROWSET can process multiple files from multiple folders.

Suppose we have files stored in path like mydata/year/month/day/file :

mydata/2020/02/03/file1.csv
mydata/2020/02/03/file2.csv
mydata/2020/02/15/file1.csv
mydata/2020/02/15/file2.csv
mydata/2020/03/20/file1.csv

If you want to read all the above files, it would look like

   SELECT *  
   FROM OPENROWSET(  
        BULK 'mydata/2020/*/*/file*.csv' ,  
        DATA_SOURCE = 'myExternalDataSource' ,  
        FORMAT = 'CSV', PARSER_VERSION = '2.0',  
        FIRSTROW = 2  
   )

If I wanted to select data from all months, but only days after 14, it could look something like

   SELECT r.filepath(1) as [MONTH], r.filepath(2) as [DAY], data  
   FROM OPENROWSET(  
        BULK 'mydata/2020/*/*/file*.csv' ,  
        DATA_SOURCE = 'myExternalDataSource' ,  
        FORMAT = 'CSV', PARSER_VERSION = '2.0',  
        FIRSTROW = 2  
   )  
   AS [r]  
   WHERE r.filepath(2) > '14'

References:
Use filename and filepath functions to target specific partitions
Use file metadata in queries
Query folders and multiple files

Please let me know if this helps, or if I misunderstood your ask.
Thanks,
Martin

Share via

Querying partitioned data with SQL on-demand?

1 answer