Databricks out of memory issue on driver causing it to restart

John Aherne 141 Reputation points
2024-08-30T15:57:43.52+00:00

I am having a strange issue with a relatively simple job that is causing the driver to restart.

Before anyone suggests it: there are no displays, collects, or conversions to pandas, and the data size is tiny (a little over 3,000 rows). The cluster nodes, including the driver, have 28 GB of memory.

The basic premise is that I am trying to join two dataframes in PySpark on different sets of keys; there are 10 potential key combinations that I am trying.

Dataframe 1 has a little over 2,500 rows and the other has about 500.

Each key combination follows the same pattern (PySpark):

Join df1 to df2 on 2 to 4 columns.

The result is unioned into a joined dataframe (joined_df).

The joined results are removed from df1 and df2 using an anti join (I tried both a left anti join and an outer join with a null filter).
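
A minimal sketch of that pattern (the key sets and the column names key1..key4 and df2_value are placeholders for illustration, not the actual job):

from functools import reduce
from pyspark.sql import DataFrame

# Illustrative key combinations -- the real job tries 10 of them.
key_sets = [
    ["key1", "key2"],
    ["key1", "key3", "key4"],
    ["key2", "key4"],
]

joined_parts = []
for keys in key_sets:
    # Join df1 to df2 on this key combination; keep only df2's payload column
    # so every joined result has the same column names and can be unioned.
    matched = df1.join(df2.select(*keys, "df2_value"), on=keys, how="inner")
    joined_parts.append(matched)

    # Remove the rows that just matched from both sides before the next attempt
    # (the anti join step).
    df1 = df1.join(matched.select(*keys).distinct(), on=keys, how="left_anti")
    df2 = df2.join(matched.select(*keys).distinct(), on=keys, how="left_anti")

joined_df = reduce(DataFrame.unionByName, joined_parts)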

The problem is that, by the end of the job, the notebook crashes with a driver-restarting error, and there is a heap space error in the driver logs.

I tried materializing after each cell, both by caching and by running counts on the dataframes, to see what was going on.

The first block would take about 15 seconds, and each subsequent block took longer, eventually 10+ minutes, before erroring out on the second-to-last or last block with the error:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1492)

I am at a loss for what could be going on. Anyone got any ideas?

Thanks!

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

2 answers

  1. John Aherne 141 Reputation points
    2024-08-30T21:17:45.7433333+00:00

    I did find a solution, though I'm not overly happy with it: converting the dataframes to an RDD and back to dataframes. The entire job took 2 minutes after the change.

    It seems that Databricks was holding on to the query plan the whole way through, adding to it with each operation, and nothing would reset it until I converted to an RDD and back. (We also tried checkpointing, with no luck.)
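
    Roughly, the workaround looks like this (a sketch of the technique, not the exact code from the job; spark is the active SparkSession):

    # Rebuilding each dataframe from its RDD and schema drops the accumulated
    # query plan / lineage, so later operations start from a fresh, small plan.
    df1 = spark.createDataFrame(df1.rdd, schema=df1.schema)
    df2 = spark.createDataFrame(df2.rdd, schema=df2.schema)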


  2. Smaran Thoomu 14,870 Reputation points Microsoft Vendor
    2024-09-04T18:32:15.27+00:00

    @John Aherne I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others," I'll repost your solution in case you'd like to accept this answer.
    Issue: A relatively simple PySpark job joins two small dataframes (roughly 2,500 and 500 rows) on 10 different key combinations, unioning each result and anti-joining the matched rows out of both sides. By the end of the job, the driver runs out of heap space and restarts, even though the data is tiny and the nodes have 28 GB of memory. (See the full description in the original question above.)

    Solution: Convert the dataframes to an RDD and back to dataframes; the entire job took 2 minutes after the change. Databricks appeared to be holding on to the query plan the whole way through, adding to it with each operation, and nothing would reset it until the dataframes were converted to an RDD and back. (Checkpointing was also tried, with no luck.)
    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    I hope this helps!

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.


    Please don't forget to Accept Answer and select Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.

