Issues while writing into bad_records path in Databricks

Alok Thampi 111

Hello All,

I would like your input on a scenario that I am seeing while writing to the bad_records file.

I am reading a ‘Ԓ’ delimited CSV file based on a schema that I have already defined. I have enabled error handling while reading the file to write the error rows into a badRecordsPath if I have a schema mismatch.

The source file contains newline characters, which push a few columns onto the next line. Since the resulting rows do not align with the defined schema, Spark writes them to a file in the bad_records path that I have specified.

This works well for almost all scenarios EXCEPT when I define the schema with DateType(). If a non-date value lands in this column, then instead of writing the whole row to the bad_records path, Spark creates blank files in the bad_records folder. It also creates another folder named bad_files containing a file that shows this error:

"reason":"org.apache.spark.SparkUpgradeException: [INCONSISTENT_BEHAVIOR_CROSS_VERSION.PARSE_DATETIME_BY_NEW_PARSER] You may get a different result due to the upgrading to Spark >= 3.0:\nFail to parse '009-7-4-23 ' in the new parser. You can set "spark.sql.legacy.timeParserPolicy" to "LEGACY" to restore the behavior before Spark 3.0, or set to "CORRECTED" and treat it as an invalid datetime string."

I get this error only when the datatype is defined as DateType(). For testing purposes, I tried replacing it with IntegerType, TimestampType, DoubleType, etc., and all of them write to the bad_records file as expected with the error data.

Any leads on why this happens?

from pyspark.sql.types import StructType, StructField, StringType, DateType, IntegerType

modified_schema = StructType(
    [
        StructField(".....", StringType(), True),
        .....
        StructField("ENTRYDATE", DateType(), True),
        .....
        StructField(".....", IntegerType(), True)
    ]
)

df = spark.read.format("csv").option("header", "true").option("sep", "Ԓ").schema(modified_schema).option("badRecordsPath", badRecordsPath).load(filepath)

Two folders, bad_files and bad_records, are generated inside my badRecordsPath.

The files generated inside the bad_records folder contain no information on the erroneous rows.

PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee

2024-06-05T05:49:33.88+00:00
@Alok Thampi - Thanks for the question and for using the MS Q&A platform.

It seems that you are facing an issue while writing to the bad_records path in Databricks. Based on the error message you provided, it looks like the issue is related to the parsing of the date value in the "ENTRYDATE" column. The error message suggests that the value "009-7-4-23" cannot be parsed as a valid date value.

When you define the schema with DateType(), Spark expects the values in that column to be in a specific date format. If the values do not match the expected format, Spark will not be able to parse them and will throw an error.

In your case, it seems that the value "009-7-4-23" is not a valid date value and cannot be parsed by Spark. This is why Spark is creating blank files in the bad_records folder instead of writing the whole row to the bad_records path.

To resolve this issue, you can either modify the data in the source file to ensure that the date values are in the expected format or you can change the data type of the "ENTRYDATE" column to a different data type that can handle the non-date values.

For example, you can change the data type of the "ENTRYDATE" column to StringType() to handle the non-date values. This way, even if the values in the "ENTRYDATE" column do not match the expected date format, Spark will still be able to write the whole row to the bad_records path.

Here's an example of how you can modify the schema to use StringType() instead of DateType():

modified_schema = StructType(
    [
        StructField(".....", StringType(), True),
        .....
        StructField("ENTRYDATE", StringType(), True),
        .....
        StructField(".....", IntegerType(), True)
    ]
)
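If ENTRYDATE is read as StringType(), the date check has to happen after the load instead of inside the CSV parser. Here is a minimal sketch of such a check in plain Python, assuming that valid values start with a zero-padded yyyy-MM-dd date. The function name parse_entry_date is hypothetical and not part of Spark.

```python
import re
from datetime import date
from typing import Optional

# Assumption: a valid ENTRYDATE starts with a zero-padded yyyy-MM-dd date.
_DATE_PREFIX = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def parse_entry_date(value: Optional[str]) -> Optional[date]:
    """Return the leading date of `value`, or None if it is not a valid date."""
    if value is None:
        return None
    match = _DATE_PREFIX.match(value)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The pattern matched but the calendar date does not exist (e.g. Feb 30).
        return None


print(parse_entry_date("2023-07-03 00:00:00.0000000"))  # 2023-07-03
print(parse_entry_date("009-7-4-23 "))                  # None
```

Rows for which the check returns None could then be routed to a reject location of your choice, while the remaining rows are converted to a date column.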

I hope this helps! Let me know if you have any further questions.
PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee

2024-06-06T07:22:19.6933333+00:00

@Alok Thampi - We haven’t heard from you on the last response and were just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, respond with more details and we will try to help.
Alok Thampi 111 Reputation points

2024-06-06T15:37:02.62+00:00
Hello @PRADEEPCHEEKATLA-MSFT ,

Thanks for the quick response on this one.

The issue happens whenever I have a new line character in my source file which pushes the rest of the records to the next line, eventually breaking the schema. The normal records that come from the source have values in the DateType() format and they are parsing as expected.

I understand the suggestion to change the datatype, but unfortunately we are not able to do so, as many records conform to the original datatype that was defined.

Below is a sample dataset that I tried out. I copied it into a .txt file, which was my source.

FYI, ‘Ԓ’ is the delimiter.

ENTRYDATEԒStringColumn_1ԒIntegerColumn_1
2023-07-03 00:00:00.0000000ԒThis is a sample1 string 009-7-4-23 Ԓ123
2023-07-04 00:00:00.0000000ԒThis is a sample2 string
009-7-4-23 Ԓ123
2023-07-05 00:00:00.0000000ԒThis is a sample3 string 009-7-4-23 Ԓ123
2023-07-06 00:00:00.0000000ԒThis is a sample4 string 009-7-4-23 Ԓ123

Rows 1, 3, and 4 are expected rows and would parse without issues.

Row 2 has a newline character near its end, so ‘009-7-4-23’ jumps to the next line, where it is not a valid DateType() value, which causes the issue.

Test 1

I read the file without the bad_records path option and it throws an error. The dataframe df does not show any results.

Test 2

The file was read with the bad_records path option. The dataframe df displays only the first record and rows 3 and 4 are skipped even though they are valid records.

The same bad_files and bad_records folders are generated.

The bad_files folder contains a file that shows the same error message.

The bad_records folder contains files, but with no data.

Could you please advise on the following:

1. Why is Spark throwing an error while parsing ‘009-7-4-23’? Since it is not in the typical DateType format and does not align with the schema, shouldn't it be written to the bad records file?

2. Why are rows 3 and 4, which follow this error row, not being loaded into the dataframe or the bad records folder?

Thanks in advance @PRADEEPCHEEKATLA-MSFT

Thanks,

Alok
PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee

2024-06-10T05:43:44.21+00:00

@Alok Thampi - Thank you for providing more details and the sample dataset. Based on the information you provided, it looks like the issue is related to the way Spark handles new line characters in CSV files.

When Spark encounters a new line character in a CSV file, it assumes that it is the end of the current row and starts parsing the next row. In your sample dataset, the second row has a new line character at the end, which causes Spark to parse it as two separate rows. The first row is parsed correctly, but the second row contains a value that does not conform to the DateType() format, which causes Spark to throw an exception.
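The line splitting described above can be reproduced outside Spark. The sketch below uses plain Python rather than Spark's CSV reader and assumes that the stray newline in row 2 sits just before '009-7-4-23'. The helper name misaligned_lines is hypothetical.

```python
SEP = "Ԓ"

# Sample data from this thread. Assumption: the stray newline in row 2 sits
# just before '009-7-4-23'.
SAMPLE = "\n".join(
    [
        "ENTRYDATEԒStringColumn_1ԒIntegerColumn_1",
        "2023-07-03 00:00:00.0000000ԒThis is a sample1 string 009-7-4-23 Ԓ123",
        "2023-07-04 00:00:00.0000000ԒThis is a sample2 string",
        "009-7-4-23 Ԓ123",
        "2023-07-05 00:00:00.0000000ԒThis is a sample3 string 009-7-4-23 Ԓ123",
        "2023-07-06 00:00:00.0000000ԒThis is a sample4 string 009-7-4-23 Ԓ123",
    ]
)


def misaligned_lines(text, sep=SEP):
    """Return (line_number, fields) for lines whose field count differs from the header."""
    rows = [line.split(sep) for line in text.splitlines()]
    expected = len(rows[0])
    return [
        (number, fields)
        for number, fields in enumerate(rows, start=1)
        if len(fields) != expected
    ]


for number, fields in misaligned_lines(SAMPLE):
    print(number, fields)
```

Under that assumption, the second fragment begins with '009-7-4-23 ' (including the trailing space), which is the exact string quoted in the SparkUpgradeException message. That text is what ends up in the ENTRYDATE position.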

Regarding your questions:

Spark is throwing an error while parsing '009-7-4-23' because it is not a valid date format. Since you have defined the schema with DateType() for the ENTRYDATE column, Spark expects all values in that column to be in a specific date format. When it encounters a value that does not conform to that format, it throws an exception.

The reason why rows 3 and 4 are not being loaded into the dataframe or the bad records folder is because they are part of the second row in your sample dataset, which contains the invalid value. When Spark encounters an error while parsing a row, it skips that row and moves on to the next row. In this case, since the second row contains an invalid value, Spark skips that row and moves on to the third row, which is part of the next row in the file.

To resolve this issue, you can try removing the new line characters from your CSV file before reading it into Spark. Alternatively, you can try using a different delimiter that does not conflict with any values in your dataset. If neither of these options is feasible, you may need to consider using a different data type for the ENTRYDATE column that can handle non-conforming values.
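One way to remove the stray newlines before the read is to merge physical lines until each row has the expected number of fields. This is a minimal sketch in plain Python, assuming that the delimiter never appears inside a field value and that a stray newline can be replaced by a single space. The function name repair_broken_rows and its parameters are hypothetical.

```python
from typing import Iterable, Iterator


def repair_broken_rows(
    lines: Iterable[str], sep: str, n_fields: int, joiner: str = " "
) -> Iterator[str]:
    """Merge consecutive physical lines until a row has n_fields fields."""
    buffer = None
    for raw in lines:
        line = raw.rstrip("\r\n")
        if buffer is None and not line:
            continue  # skip blank lines between rows
        buffer = line if buffer is None else buffer + joiner + line
        if buffer.count(sep) >= n_fields - 1:
            yield buffer
            buffer = None
    if buffer is not None:
        yield buffer  # incomplete trailing row, left for downstream handling


broken = [
    "ENTRYDATEԒStringColumn_1ԒIntegerColumn_1\n",
    "2023-07-04 00:00:00.0000000ԒThis is a sample2 string\n",
    "009-7-4-23 Ԓ123\n",
    "2023-07-05 00:00:00.0000000ԒThis is a sample3 string 009-7-4-23 Ԓ123\n",
]
for row in repair_broken_rows(broken, sep="Ԓ", n_fields=3):
    print(row)
```

The repaired lines could be written to a new file that is then passed to spark.read. Note that this sketch only detects rows with too few delimiters. A stray newline inside the last column would not be caught.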

I hope this helps! Let me know if you have any further questions.
Alok Thampi 111 Reputation points

2024-06-10T12:16:55.65+00:00

Hello @PRADEEPCHEEKATLA-MSFT ,

Thank you for your quick response.

I do understand that it throws an exception because ‘009-7-4-23 ’ does not conform to the DateType() format. But the whole intention of having the badRecordsPath option in the code was to identify all such erroneous values and log them.

I have been seeing this logging issue only with the DateType() datatype. When I changed the datatype of the ENTRYDATE column to IntegerType(), it correctly logged all the data into the bad_records folder, even though the value ‘009-7-4-23 ’ is not an integer. I tried other datatypes (Short, Double, etc.) and all of them logged the bad records, but DateType did not.

Also, regarding the skipping of rows 3 and 4, I am a little confused.

If Spark treats every newline character as the start of a new line, then there is a newline character at the end of row 2 (which is the error record), row 3, and row 4.

So shouldn't rows 3 and 4 be treated as separate lines?

Another test that I did to check this was to change the ENTRYDATE column to TimestampType(). It loaded all the correct records (rows 1, 3, and 4) into the dataframe and pushed the one error row to the bad_records folder, which is the expected behavior.

Am I missing anything else here?

Thanks again, appreciate all your support so far!

Thanks,

Alok

Accepted answer

PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee

2024-06-12T04:57:42.9366667+00:00

@Alok Thampi - I understand your concern about the badRecordsPath option not logging the erroneous values for the DateType() column. It is possible that this is a bug or limitation in Spark's CSV reader implementation.

Regarding your question about rows 3 and 4, when Spark encounters a new line character in a CSV file, it treats it as a new record. In your case, row 2 is being treated as a separate record because of the new line character at the end of the line. Rows 3 and 4 are also being treated as separate records because they are on separate lines. However, since rows 3 and 4 do not conform to the schema, they are being skipped by Spark.

Regarding your test with the TimestampType() column, it is possible that this data type is more forgiving than the DateType() column when it comes to parsing invalid date formats. This could be why it is able to load the correct records and push the error row to the bad_records folder.

In summary, it seems that the issue is related to the DateType() column and how Spark's CSV reader handles invalid date formats. Changing the data type to a more forgiving type, such as TimestampType(), or removing the new line characters from the source file may help in this case.
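The SparkUpgradeException message quoted earlier in this thread names one more setting to try: setting spark.sql.legacy.timeParserPolicy to CORRECTED, so that the value is treated as an invalid datetime string. The sketch below assumes an existing SparkSession is passed in as spark. The function name and parameters are hypothetical, and badRecordsPath is a Databricks-specific option. Whether this makes the row land in bad_records has not been confirmed in this thread.

```python
def read_csv_with_corrected_dates(spark, filepath, schema, bad_records_path, sep="Ԓ"):
    """Read the CSV with timeParserPolicy=CORRECTED and a badRecordsPath.

    `spark` is an existing SparkSession; this helper is illustrative and
    not part of any Spark API.
    """
    # Session-level setting named in the SparkUpgradeException message.
    spark.conf.set("spark.sql.legacy.timeParserPolicy", "CORRECTED")
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("sep", sep)
        .option("badRecordsPath", bad_records_path)  # Databricks-specific option
        .schema(schema)
        .load(filepath)
    )
```

In a notebook this could be called as read_csv_with_corrected_dates(spark, filepath, modified_schema, badRecordsPath), followed by a check of the bad_records folder to see whether the error row is now logged.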

I hope this helps! Let me know if you have any further questions.
1 person found this answer helpful.
Alok Thampi 111 Reputation points

2024-06-12T12:06:38.5633333+00:00

Thanks @PRADEEPCHEEKATLA-MSFT ,

I understand that this could be a limitation of the Spark parser, so I will look for alternatives.

Also, regarding rows 3 and 4 (lines 4 and 5 in the sample dataset that I had provided), they have the same datatype as row/line 1, so in an ideal scenario those two should also have parsed.

But anyhow, I believe this is how Spark behaves. Maybe this will be fixed in a future release.

Once again, thanks for all the support so far!

PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee

2024-06-14T06:29:21.74+00:00

@Alok Thampi - Regarding rows 3 and 4, you're right that they have the same datatype as row 1, so in an ideal scenario they should have parsed correctly. It's possible that this behavior is related to the issue with the DateType() column, or it could be a separate issue.

In any case, I hope you're able to find a suitable workaround or solution for your use case. If you have any further questions or concerns, feel free to ask.

Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.