Spark unable to write file onto Blob storage

Question

We use Spark on HDInsight v3.6. Our code had been working as expected until last night, when our job started failing with the error "output directory already exists". Looking at the blob storage, the directories appear to have been created as 'block blob' objects rather than as directories.

Are there any suggestions on how to overcome this error?

User class threw exception: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory wasbs://payment-file-outbound@xxx.blob.core.windows.net/output/275DPN45922 already exists  
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:131)  
    at org.apache.spark.internal.io.HadoopMapRedWriteConfigUtil.assertConf(SparkHadoopWriter.scala:287)  
    at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1096)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1.apply(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:1094)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply$mcV$sp(PairRDDFunctions.scala:1067)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$4.apply(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:1032)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply$mcV$sp(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopFile$1.apply(PairRDDFunctions.scala:958)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:957)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply$mcV$sp(RDD.scala:1499)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)  
    at org.apache.spark.rdd.RDD$$anonfun$saveAsTextFile$1.apply(RDD.scala:1478)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)  
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)  
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)  
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1478)  
    at org.apache.spark.api.java.JavaRDDLike$class.saveAsTextFile(JavaRDDLike.scala:550)  
    at org.apache.spark.api.java.AbstractJavaRDDLike.saveAsTextFile(JavaRDDLike.scala:45)  
    at com.rm.integration.etl.generator.FixedWidthGenerator.genFlatFile(FixedWidthGenerator.java:114)  
    at com.rm.integration.etl.generator.PaymentOutboundGenerator.generate(PaymentOutboundGenerator.java:43)  
    at com.rm.integration.main.pipeline.PaymentPipeline.run(PaymentPipeline.java:115)  
    at com.rm.integration.main.PaymentOutboundApp.runApp(PaymentOutboundApp.java:35)  
    at com.rm.integration.app.DefaultSparkApplication.run(DefaultSparkApplication.java:40)  
    at com.rm.integration.main.Main.main(Main.java:16)  
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)  
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)  
    at java.lang.reflect.Method.invoke(Method.java:498)  
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)

It appears that there was an HDInsight update on September 28, which may have reached our region only now. However, the release notes don't mention any regressions or known issues.

EDIT: link to release notes: https://video2.skills-academy.com/en-us/azure/hdinsight/hdinsight-release-notes#release-date-09282020

Accepted Answer

Hi @PRADEEPCHEEKATLA-MSFT,

It turned out to be our issue after all. Our code was throwing a NullPointerException, but Spark was swallowing that error and reporting its own FileAlreadyExistsException instead. Once we changed some settings, we were able to see the NPE and correct it.

Bad data somehow made it through the front-end validation checks (causing the error), and it happened to coincide perfectly with the upgrade. Once we were able to reproduce it reliably, we found the root cause.
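
For anyone hitting something similar, here is a minimal sketch of the kind of guard that makes bad input fail with a clear message instead of a bare NPE. It assumes the null value reaches a fixed-width formatting step, which this thread does not confirm. The class `FixedWidthSketch`, its methods, and the record ID parameter are hypothetical names for illustration, not code from our job or from Spark.

```java
// Hypothetical helper for illustration only; not the code from the failing job.
final class FixedWidthSketch {

    // Pads with spaces, or truncates, so the result is exactly `width` characters.
    static String pad(String value, int width) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.substring(0, width);
    }

    // Formats one record as a fixed-width line.
    // Throws IllegalArgumentException naming the record and field index
    // when a field is null or the field count does not match the layout.
    static String formatRecord(String recordId, String[] fields, int[] widths) {
        if (fields == null || fields.length != widths.length) {
            throw new IllegalArgumentException(
                "Record " + recordId + ": expected " + widths.length + " fields");
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i] == null) {
                throw new IllegalArgumentException(
                    "Record " + recordId + ": field " + i + " is null");
            }
            line.append(pad(fields[i], widths[i]));
        }
        return line.toString();
    }
}
```

Calling `FixedWidthSketch.formatRecord("r1", new String[] {"ab", "c"}, new int[] {4, 2})` returns `"ab  c "`, while a null field raises an exception that names the record. The task logs then point at the offending data even if the driver reports a different error.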

Sorry to have bothered you.
