How to move a compressed Parquet file using ADF or Databricks

reddy 41 Reputation points
2020-08-11T14:40:54.08+00:00

Hi,
I have a requirement to move Parquet files from AWS S3 into Azure and then convert them to CSV using ADF.
I downloaded a few of the files to my local file system and tried to copy them via the Copy activity in ADF.
The files are in this format: part-00000-bdo894h-fkji-8766-jjab-988f8d8b9877-c000.snappy.parquet

I'm getting the errors below for two different files that I tried:

File is not a valid parquet file.

Parquet file contained column 'XXX', which is of a non-primitive, unsupported type.

Can anyone suggest a way to achieve this? Can we implement it using just ADF, or do I need a Databricks activity so we can use Spark to transform those files?

thanks


1 answer

  1. PRADEEPCHEEKATLA-MSFT 84,531 Reputation points Microsoft Employee
    2020-08-12T09:25:48.383+00:00

    Hello @reddy ,

    Welcome to the Microsoft Q&A platform.

    You can simply move the data from AWS S3 to an Azure Storage account, then mount the storage account in Databricks and convert the Parquet files to CSV using Scala or Python.
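As a rough sketch of that flow in a Databricks notebook (the container, account, key, and paths below are placeholders, not values from the question, and `dbutils`/`spark` only exist inside Databricks):

```python
# Minimal sketch, assuming Azure Blob Storage holds the Parquet files.
# All names below are hypothetical; dbutils and spark are Databricks globals.
def mount_and_convert(container, account, access_key,
                      mount_point="/mnt/parquet-data"):
    # Mount the storage container so its files appear under DBFS
    dbutils.fs.mount(
        source=f"wasbs://{container}@{account}.blob.core.windows.net",
        mount_point=mount_point,
        extra_configs={
            f"fs.azure.account.key.{account}.blob.core.windows.net": access_key
        },
    )
    # Read the Snappy-compressed Parquet files and rewrite them as CSV
    df = spark.read.parquet(f"{mount_point}/input")
    df.write.option("header", True).csv(f"{mount_point}/output-csv")
```

After the mount, the files can be addressed with ordinary DBFS paths, so the conversion itself is just a read followed by a write.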

    Why are the files in this format: part-00000-bdo894h-fkji-8766-jjab-988f8d8b9877-c000.snappy.parquet?

    By default, Spark compresses the underlying data files of a Parquet table with Snappy. Its combination of fast compression and decompression makes it a good choice for many data sets.

    Using Spark, you can convert Parquet files to CSV format as shown below.

    # Read the Snappy-compressed Parquet file(s); compression is handled automatically
    df = spark.read.parquet("/path/to/infile.parquet")
    # Write out as CSV; keep the column names as a header row
    df.write.option("header", True).csv("/path/to/outfile.csv")
    

    For more details, refer “Spark Parquet file to CSV format”.

    File is not a valid parquet file.

    I would suggest checking the file format and making sure you are passing a valid Parquet file.
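One quick sanity check (a heuristic of mine, not an official validator): every Parquet file begins and ends with the 4-byte magic marker `PAR1`, so a file missing either marker is certainly not valid Parquet.

```python
def looks_like_parquet(path):
    """Heuristic check: a Parquet file starts and ends with b'PAR1'."""
    with open(path, "rb") as f:
        header = f.read(4)
        f.seek(-4, 2)  # jump to 4 bytes before the end of the file
        footer = f.read(4)
    return header == b"PAR1" and footer == b"PAR1"
```

A file that fails this check was likely corrupted in transit or is actually some other format (e.g. CSV) with a `.parquet` extension; a file that passes can still be invalid internally, so this only rules files out.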

    Parquet file contained column 'XXX', which is of a non-primitive, unsupported type.

    You may see this error message when the Parquet file contains columns whose data types are not supported by ADF's Parquet format.

    Check the data type of column 'XXX' in the file and make sure it maps to one of the supported types.

    These are the supported data type mappings for parquet files.

    [17212-image.png: table of supported Parquet data type mappings]

    For more details, refer “ADF – Supported file formats - Parquet”.

    Hope this helps. Do let us know if you have any further queries.

    ----------------------------------------------------------------------------------------

    Do click on "Accept Answer" and Upvote on the post that helps you, this can be beneficial to other community members.