Databricks out of memory issue on driver causing it to restart

John Aherne 141 Reputation points
2024-08-30T15:57:43.52+00:00

I am having a strange issue with a relatively simple job that is causing the driver to restart.

Before anyone suggests it: there are no displays, collects, or conversions to pandas, and the data size is tiny (a little over 3,000 rows). The cluster nodes, including the driver, have 28 GB of memory.

The basic premise is that I am trying to join two dataframes in PySpark on different sets of keys; there are 10 potential key combinations that I am trying.

Dataframe 1 has a little over 2,500 rows and the other has about 500.

Each key combination follows the same pattern (PySpark):

Join df1 to df2 on 2 to 4 columns.

The result is unioned into a joined dataframe (joined_df).

The joined results are removed from df1 and df2 using an anti join (I tried both a left anti join and an outer join with a null filter).
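
A minimal sketch of that pattern (the key sets and the column names key1..key4 and df2_value are placeholders for illustration, not the actual job):

from functools import reduce
from pyspark.sql import DataFrame

# Illustrative key combinations -- the real job tries 10 of them.
key_sets = [
    ["key1", "key2"],
    ["key1", "key3", "key4"],
    ["key2", "key4"],
]

joined_parts = []
for keys in key_sets:
    # Join df1 to df2 on this key combination; keep only df2's payload column
    # so every joined result has the same column names and can be unioned.
    matched = df1.join(df2.select(*keys, "df2_value"), on=keys, how="inner")
    joined_parts.append(matched)

    # Remove the rows that just matched from both sides before the next attempt
    # (the anti join step).
    df1 = df1.join(matched.select(*keys).distinct(), on=keys, how="left_anti")
    df2 = df2.join(matched.select(*keys).distinct(), on=keys, how="left_anti")

joined_df = reduce(DataFrame.unionByName, joined_parts)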

The problem is that, by the end of the job, the notebook crashes with a driver-restarting error, and there is a heap space error in the driver logs.

I tried materializing after each cell, both by caching and by running counts on the dataframes, to see what was going on.

The first block would take about 15 seconds, and each subsequent block took longer, eventually 10+ minutes, before erroring out on the second-to-last or last block with the error:
The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached.

at com.databricks.spark.chauffeur.Chauffeur.onDriverStateChange(Chauffeur.scala:1492)

I am at a loss for what could be going on. Anyone got any ideas?

Thanks!

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

2 answers

  1. John Aherne 141 Reputation points
    2024-08-30T21:17:45.7433333+00:00

    I did find a solution, though I'm not overly happy with it: converting the dataframes to an RDD and back to dataframes. The entire job took 2 minutes after the change.

    It seems that Databricks was holding on to the query plan the whole way through, adding to it with each operation, and nothing would reset it until I converted to an RDD and back. (We also tried checkpointing, with no luck.)
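
    Roughly, the workaround looks like this (a sketch of the technique, not the exact code from the job; spark is the active SparkSession):

    # Rebuilding each dataframe from its RDD and schema drops the accumulated
    # query plan / lineage, so later operations start from a fresh, small plan.
    df1 = spark.createDataFrame(df1.rdd, schema=df1.schema)
    df2 = spark.createDataFrame(df2.rdd, schema=df2.schema)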


  2. Smaran Thoomu 14,870 Reputation points Microsoft Vendor
    2024-09-04T18:32:15.27+00:00

    @John Aherne I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it! Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others," I'll repost your solution in case you'd like to accept this answer.
    Issue: A relatively simple PySpark job joins two small dataframes (roughly 2,500 and 500 rows) on 10 different key combinations, unioning each result and anti-joining the matched rows out of both sides. By the end of the job, the driver runs out of heap space and restarts, even though the data is tiny and the nodes have 28 GB of memory. (See the full description in the original question above.)

    Solution: Convert the dataframes to an RDD and back to dataframes; the entire job took 2 minutes after the change. Databricks appeared to be holding on to the query plan the whole way through, adding to it with each operation, and nothing would reset it until the dataframes were converted to an RDD and back. (Checkpointing was also tried, with no luck.)
    If I missed anything please let me know and I'd be happy to add it to my answer, or feel free to comment below with any additional information.

    I hope this helps!

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.


    Please don't forget to Accept Answer and select Yes for "was this answer helpful" wherever the information provided helps you, as this can be beneficial to other community members.

