pyspark merge dataframe issue
Admin (KK)
136 Reputation points
I am trying to merge multiple Parquet files using the mergeSchema option in PySpark, and I get the following error. The pipeline runs every hour; the schema merge works fine sometimes and fails other times. I have tried adding some logging and increasing the number of executors, but the issue persists. I am using the following command:
df = spark.read.option("mergeSchema", "True").parquet(lc_raw)
and I get the following error:
Caused by: java.io.EOFException
at org.apache.parquet.bytes.BytesUtils.readIntLittleEndian(BytesUtils.java:88)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:548)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:528)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:522)
at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:498)
at org.apache.spark.sql.execution.datasources.parquet.ParquetMetadataCacheReader$.$anonfun$getFooter$2(ParquetMetadataCacheReader.scala:98)
at org.apache.spark.sql.execution.datasources.parquet.ParquetMetadataCacheReaderSource.runWithTimerAndRecord(ParquetMetadataCacheReaderSource.scala:64)
at org.apache.spark.sql.execution.datasources.parquet.ParquetMetadataCacheReader$.getFooter(ParquetMetadataCacheReader.scala:91)
at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readParquetFootersInParallel$1(ParquetFileFormat.scala:505)
Has anyone faced this issue, or does anyone know how we can fix it?
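For reference, my understanding is that an EOFException thrown from ParquetFileReader.readFooter can occur when the reader hits a file whose footer is missing, e.g. a zero-byte placeholder or a file that was still being written when the hourly job started. Below is a minimal sketch of one possible workaround I am considering: pre-filtering obviously incomplete files before the read. The helper name and the assumption that lc_raw is a locally accessible directory of .parquet files are mine, not from any Spark API.

```python
import os
from glob import glob

def complete_parquet_files(directory):
    """Return parquet file paths that are plausibly complete.

    A valid Parquet file starts and ends with the 4-byte "PAR1" magic,
    so anything shorter than 8 bytes cannot have a readable footer and
    would trigger the EOFException above. This is a heuristic sketch
    only; it does not detect larger files that are still being written.
    """
    paths = []
    for path in glob(os.path.join(directory, "*.parquet")):
        if os.path.getsize(path) >= 8:  # room for head and tail magic
            paths.append(path)
    return sorted(paths)

# The filtered list could then be passed to the reader, e.g.:
# df = spark.read.option("mergeSchema", "true") \
#          .parquet(*complete_parquet_files(lc_raw))
```

I have not confirmed this addresses the intermittent failures; whether it helps presumably depends on whether the writer produces such partial files between runs.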