PySpark merge DataFrame issue

Admin (KK) 136 Reputation points
2024-06-25T20:43:46.5233333+00:00

I am trying to read multiple Parquet files with the mergeSchema option in PySpark, and I get the error below. The pipeline runs every hour. The read with mergeSchema works on some runs and fails on others. I am using the following command:

df = spark.read.option("mergeSchema", "True").parquet(lc_raw)

and I get the following error:

Caused by: java.io.EOFException
    at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:88)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:548)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:528)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:522)
    at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:498)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetMetadataCacheReader$.$anonfun$getFooter$2(ParquetMetadataCacheReader.scala:98)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetMetadataCacheReaderSource.runWithTimerAndRecord(ParquetMetadataCacheReaderSource.scala:64)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetMetadataCacheReader$.getFooter(ParquetMetadataCacheReader.scala:91)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:505)

Has anyone faced this issue, or does anyone know how to fix it?

Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.

1 answer

  1. Admin (KK) 136 Reputation points
    2024-06-26T11:50:40.3+00:00

    I have tried adding some logging and increasing the number of executors, but I still get the same issue.
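
The stack trace in the question fails while reading a Parquet footer, so one thing that might be worth checking is whether any file under the input path is empty, truncated, or still being written when the hourly run starts. That cause is only an assumption; nothing in this thread confirms it. The sketch below checks the Parquet framing of a single file: a leading `PAR1` marker, and a trailing 4-byte little-endian footer length followed by `PAR1`. The function name `check_parquet_footer` is made up for illustration, and it reads a local path, so files in a data lake would first need to be copied locally or the reads adapted to the storage client in use.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"
# 4-byte leading magic + 4-byte footer length + 4-byte trailing magic
MIN_PARQUET_SIZE = 12


def check_parquet_footer(path):
    """Hypothetical helper: return (ok, reason) for a local file path.

    Only inspects the file framing; it does not parse the footer metadata.
    """
    size = os.path.getsize(path)
    if size < MIN_PARQUET_SIZE:
        return False, f"too small ({size} bytes)"

    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)  # last 8 bytes: footer length + magic
        tail = f.read(8)

    if head != PARQUET_MAGIC:
        return False, "missing leading magic"
    if tail[4:] != PARQUET_MAGIC:
        return False, "missing trailing magic"

    # Footer length is stored as an unsigned little-endian 32-bit integer.
    (footer_len,) = struct.unpack("<I", tail[:4])
    if footer_len + MIN_PARQUET_SIZE > size:
        return False, "footer length exceeds file size"

    return True, "ok"
```

Running a check like this over the files present at the time of a failed run, and logging any path that does not return `(True, "ok")`, could help show whether the intermittent failures line up with a writer that has not finished yet.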

