Querying partitioned data with SQL on-demand?

Veli-Jussi Raitila 381 Reputation points
2020-08-18T11:25:07.39+00:00

It is common to partition data within data lakes and this is reflected in the form of a folder hierarchy.

What is the proper way to query partitioned data with SQL on-demand within Azure Synapse Analytics, as according to my understanding e.g. OPENROWSET does not process subfolders at all?

Although not strictly an example of Hadoop-style partitioning, Azure Event Hubs Capture outputs the Avro files in a deeply nested structure. How would you process this with SQL on-demand (ignoring for a moment that Avro is not supported as file format either)?

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,631 questions
{count} votes

1 answer

Sort by: Most helpful
  1. MartinJaffer-MSFT 26,051 Reputation points
    2020-08-19T22:20:52.603+00:00

    Hello VeliJussiRaitila-8710 and thank you for your question.

    To my understanding OPENROWSET can process multiple files from multiple folders.

    Suppose we have files stored in path like mydata/year/month/day/file :

    mydata/2020/02/03/file1.csv
    mydata/2020/02/03/file2.csv
    mydata/2020/02/15/file1.csv
    mydata/2020/02/15/file2.csv
    mydata/2020/03/20/file1.csv

    If you want to read all the above files, it would look like

       SELECT *  
       FROM OPENROWSET(  
            BULK 'mydata/2020/*/*/file*.csv' ,  
            DATA_SOURCE = 'myExternalDataSource' ,  
            FORMAT = 'CSV', PARSER_VERSION = '2.0',  
            FIRSTROW = 2  
       )  
    

    If I wanted to select data from all months, but only days after 14, it could look something like

       SELECT r.filepath(1) as [MONTH], r.filepath(2) as [DAY], data  
       FROM OPENROWSET(  
            BULK 'mydata/2020/*/*/file*.csv' ,  
            DATA_SOURCE = 'myExternalDataSource' ,  
            FORMAT = 'CSV', PARSER_VERSION = '2.0',  
            FIRSTROW = 2  
       )  
       AS [r]  
       WHERE r.filepath(2) > '14'  
    

    References:
    Use filename and filepath functions to target specific partitions
    Use file metadata in queries
    Query folders and multiple files

    Please let me know if this helps, or if I misunderstood your ask.
    Thanks,
    Martin

    1 person found this answer helpful.
    0 comments No comments