How to fix the "The specified path already exists" error raised by Databricks Delta Live Tables pipeline executions
Hello,
I have several DLT pipelines that intermittently fail with the exception "The specified path already exists". The exception points to an issue with the internal checkpoint files of the DLT streaming tables.
org.apache.hadoop.fs.FileAlreadyExistsException: Operation failed: "The specified path already exists.", 409, PUT, https://xxx.dfs.core.windows.net/managed/__unitystorage/schemas/43a-ae09-4e97-a2eb-324fd2e84f2e/tables/ee9d67c-2043-4250-af53-c1951aad6/_dlt_metadata/checkpoints/configuration__ids/24?resource=directory&timeout=90&st=2024-08-28T07:27:49Z&sv=2020-02-10&ske=2024-08-28T09:27:49Z&sig=XXXXX&sktid=0652c929-6106-451-ba96-0ebb59a37670&se=2024-08-28T08:47:09Z&sdd=5&skoid=5e4b8638-3c04-45fXXXXXXXXXXXXXXXXX&spr=https&sks=b&skt=2024-08-28T07:27:49Z&sp=racwdm&skv=2021-08-06&sr=d, PathAlreadyExists, "The specified path already exists.
Moreover, the problem is random (it doesn't always occur at the same stage of the workflows) and affects all three of my Databricks environments.
Could you please help me?
Azure Databricks
-
PRADEEPCHEEKATLA-MSFT 88,716 Reputation points • Microsoft Employee
2024-08-28T11:19:02.3866667+00:00 @Slim MISSAOUI - Thanks for the question and for using the MS Q&A platform.
This error indicates that the internal checkpoint directory of a DLT streaming table already exists. It can happen when the pipeline tries to create a new checkpoint directory but the old one is still present.
To fix this issue, you can try the following steps:
- Stop the pipeline execution.
- Delete the existing checkpoint files from the storage account.
- Restart the pipeline execution.
You can delete the checkpoint files using the Azure portal: navigate to the storage account where the checkpoint files are stored and delete the files under the _dlt_metadata/checkpoints/<dlt_table_name> directory.
Where do you find the checkpoint location for Delta Live Tables? Checkpoints are stored under the storage location specified in the DLT pipeline settings, and each table gets a dedicated directory under <storage_location>/checkpoints/<dlt_table_name>. For more details, refer to https://stackoverflow.com/questions/75692260/how-to-get-the-checkpoint-location-of-delta-live-table
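If you prefer to do this from a notebook instead of the portal, here is a minimal sketch using dbutils (the path is a placeholder; substitute your pipeline's actual storage location and table name):

# Placeholder path -- substitute your pipeline's storage location and table name.
checkpoint_dir = "abfss://<container>@<storage_account>.dfs.core.windows.net/<storage_location>/_dlt_metadata/checkpoints/<dlt_table_name>"

# Recursively delete the checkpoint directory (the True flag enables recursion).
dbutils.fs.rm(checkpoint_dir, True)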
If the issue persists, you can try the following:
- Stop the pipeline execution.
- Delete the existing checkpoint files from the storage account.
- Clear the cached table state by running CLEAR CACHE (or REFRESH TABLE <table_name>) in a Databricks notebook, as sketched after this list.
- Restart the pipeline execution.
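A minimal sketch of that cache-clearing step, using the standard Spark catalog API from a Python notebook cell (the table name is a placeholder):

# Drop all cached tables and DataFrames from Spark's in-memory cache.
spark.catalog.clearCache()

# Or refresh the cached metadata and data for one specific table (placeholder name).
spark.catalog.refreshTable("my_catalog.my_schema.my_dlt_table")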
If the issue still persists, try upgrading to the latest Databricks Runtime version. If none of these steps work, please share the code used and the full stack trace of the error message for further assistance.
I hope this helps! Let me know if you have any other questions.
-
Slim MISSAOUI 10 Reputation points
2024-08-29T10:30:05.66+00:00 @PRADEEPCHEEKATLA-MSFT Thank you for your response.
Unfortunately, I've tried all the suggested solutions, but I'm still encountering the same error in my pipelines.
Moreover, the error occurs randomly—sometimes the pipelines run without any issues.
Please find the error stack trace below:
24/08/29 07:40:17 ERROR TriggeredFlowExecution: Unhandled exception while starting flow:__materialization_mat_silver__table_1 org.apache.hadoop.fs.FileAlreadyExistsException: Operation failed: "The specified path already exists.", 409, PUT, https://stdpfmes.dfs.core.windows.net/managed/__unitystorage/schemas/4e17-8351-4403-b3fc-802dc6e0/tables/c75c034-409f-bb3d-8d39cbd66d8f/_dlt_metadata/checkpoints/silver__table/3?resource=directory&timeout=90&st=2024-08-29T07:28:38Z&sv=2020-02-10&ske=2024-08-29T09:28:38Z&sig=XXXXX&sktid=0652c929-6106-40ed-ba959a37670&se=2024-08-29T08:40:17Z&sdd=5&skoid=eddb3d-b94f-477cXXXXXXXXXXXXXXXXXX&spr=https&sks=b&skt=2024-08-29T07:28:38Z&sp=racwdxlm&skv=2020-10-02&sr=d, PathAlreadyExists, "The specified path already exists. RequestId:a611a15e-401f-0043-3ee6-000000 Time:2024-08-29T07:40:17.8296656Z" at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.checkException(AzureBlobFileSystem.java:1699) at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.mkdirs(AzureBlobFileSystem.java:906) at com.databricks.common.filesystem.LokiFileSystem.mkdirs(LokiFileSystem.scala:308) at com.databricks.sql.acl.fs.CredentialScopeFileSystem.mkdirs(CredentialScopeFileSystem.scala:264) at com.databricks.spark.sql.streaming.AzureCheckpointFileManager.createCheckpointDirectory(DatabricksCheckpointFileManager.scala:316) at com.databricks.spark.sql.streaming.DatabricksCheckpointFileManager.createCheckpointDirectory(DatabricksCheckpointFileManager.scala:88) at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.resolveCheckpointLocation(ResolveWriteToStream.scala:145) at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:45) at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$$anonfun$apply$1.applyOrElse(ResolveWriteToStream.scala:41) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$2(AnalysisHelper.scala:219) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:83) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDownWithPruning$1(AnalysisHelper.scala:219) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:436) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning(AnalysisHelper.scala:217) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDownWithPruning$(AnalysisHelper.scala:213) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDownWithPruning(LogicalPlan.scala:40) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning(AnalysisHelper.scala:102) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsWithPruning$(AnalysisHelper.scala:99) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsWithPruning(LogicalPlan.scala:40) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:79) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:78) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:40) at org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.apply(ResolveWriteToStream.scala:41) at 
org.apache.spark.sql.execution.streaming.ResolveWriteToStream$.apply(ResolveWriteToStream.scala:40) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$4(RuleExecutor.scala:327) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$3(RuleExecutor.scala:327) at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126) at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) at scala.collection.immutable.List.foldLeft(List.scala:91) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:324) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:307) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$9(RuleExecutor.scala:409) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$9$adapted(RuleExecutor.scala:409) at scala.collection.immutable.List.foreach(List.scala:431) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:409) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:270) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeSameContext(Analyzer.scala:423) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:416) at org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:329) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:416) at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:348) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:262) at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:167) at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:262) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:401) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:443) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:400) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:260) at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:426) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:625) at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1176) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:625) at com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:621) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1175) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:621) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:254) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:253) at 
org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:328) at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:422) at org.apache.spark.sql.streaming.DataStreamWriter.startQuery(DataStreamWriter.scala:537) at org.apache.spark.sql.streaming.DataStreamWriter.startInternal(DataStreamWriter.scala:514) at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:276) at com.databricks.pipelines.execution.core.StreamingNonChangeFlow.startStream(PhysicalFlow.scala:449) at com.databricks.pipelines.execution.core.StreamingNonChangeFlow.startStream$(PhysicalFlow.scala:443) at com.databricks.pipelines.execution.core.StreamingTableWrite.startStream(PhysicalFlow.scala:559) at com.databricks.pipelines.execution.core.StreamingPhysicalFlow.$anonfun$executeInternal$2(PhysicalFlow.scala:413) at com.databricks.pipelines.util.SparkSessionUtils$.withSQLConf(SparkSessionUtils.scala:19) at com.databricks.pipelines.execution.core.StreamingPhysicalFlow.executeInternal(PhysicalFlow.scala:413) at com.databricks.pipelines.execution.core.StreamingPhysicalFlow.executeInternal$(PhysicalFlow.scala:394) at com.databricks.pipelines.execution.core.StreamingTableWrite.executeInternal(PhysicalFlow.scala:559) at com.databricks.pipelines.execution.core.PhysicalFlow.executeAsync(PhysicalFlow.scala:190) at com.databricks.pipelines.execution.core.PhysicalFlow.executeAsync$(PhysicalFlow.scala:181) at com.databricks.pipelines.execution.core.StreamingTableWrite.executeAsync(PhysicalFlow.scala:559) at com.databricks.pipelines.execution.core.FlowExecution.startFlow(FlowExecution.scala:403) at com.databricks.pipelines.execution.core.TriggeredFlowExecution.$anonfun$topologicalExecution$7(TriggeredFlowExecution.scala:259) at com.databricks.pipelines.execution.core.CommandContextUtils$.withCommandContext(CommandContextUtils.scala:99) at com.databricks.pipelines.execution.core.TriggeredFlowExecution.startFlowWithPlanningMode$1(TriggeredFlowExecution.scala:246) at com.databricks.pipelines.execution.core.TriggeredFlowExecution.$anonfun$topologicalExecution$9(TriggeredFlowExecution.scala:284) at com.databricks.pipelines.execution.core.TriggeredFlowExecution.$anonfun$topologicalExecution$9$adapted(TriggeredFlowExecution.scala:284) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at com.databricks.pipelines.execution.core.TriggeredFlowExecution.com$databricks$pipelines$execution$core$TriggeredFlowExecution$$topologicalExecution(TriggeredFlowExecution.scala:284) at com.databricks.pipelines.execution.core.TriggeredFlowExecution$$anon$1.$anonfun$run$2(TriggeredFlowExecution.scala:98) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at com.databricks.unity.UCSEphemeralState$Handle.runWith(UCSEphemeralState.scala:45) at com.databricks.unity.HandleImpl.runWith(UCSHandle.scala:103) at com.databricks.pipelines.execution.core.BaseUCContext.runWithExecutionUCS(BaseUCContext.scala:572) at com.databricks.pipelines.execution.core.UCContextCompanion$OptionUCContextHelper.runWithExecutionUCSIfAvailable(BaseUCContext.scala:1469) at com.databricks.pipelines.execution.core.TriggeredFlowExecution$$anon$1.$anonfun$run$1(TriggeredFlowExecution.scala:98) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
com.databricks.pipelines.execution.core.CommandContextUtils$.withCommandContext(CommandContextUtils.scala:99) at com.databricks.pipelines.execution.core.TriggeredFlowExecution$$anon$1.run(TriggeredFlowExecution.scala:93) Caused by: Operation failed: "The specified path already exists.", 409, PUT, https://stdpfmes.dfs.core.windows.net/managed/__unitystorage/schemas/4e17-8351-4403-b3fc-802dc6e0/tables/c75c034-409f-bb3d-8d39cbd66d8f/_dlt_metadata/checkpoints/silver__table/3?resource=directory&timeout=90&st=2024-08-29T07:28:38Z&sv=2020-02-10&ske=2024-08-29T09:28:38Z&sig=XXXXX&sktid=0652c929-6106-40ed-ba959a37670&se=2024-08-29T08:40:17Z&sdd=5&skoid=eddb3d-b94f-477cXXXXXXXXXXXXXXXXXX&spr=https&sks=b&skt=2024-08-29T07:28:38Z&sp=racwdxlm&skv=2020-10-02&sr=d, PathAlreadyExists, "The specified path already exists. RequestId:a611a15e-401f-0043-3ee6-000000 Time:2024-08-29T07:40:17.8296656Z" at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:265) at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:212) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:494) at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:465) at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:210) at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.AbfsClient.createPath(AbfsClient.java:477) at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.createDirectory(AzureBlobFileSystemStore.java:829) at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.mkdirs(AzureBlobFileSystem.java:900) ... 97 more
-
PRADEEPCHEEKATLA-MSFT 88,716 Reputation points • Microsoft Employee
2024-08-30T12:14:46.6833333+00:00 @Slim MISSAOUI - I'm sorry to hear that the previous solutions did not work for you. Since the issue is still occurring randomly, it could be related to the specific data being processed or the environment itself. Here are a few more suggestions that you can try:
Check if there are any other processes or jobs running at the same time that could be interfering with the checkpoint directory. If possible, try running the DLT pipelines during off-peak hours to reduce the likelihood of interference.
Check if there are any network issues or latency in the storage account that could be causing the issue. You can try monitoring the storage account for any spikes in latency or errors.
Try increasing the timeout value for checkpoint directory operations. This can be done by setting the "fs.azure.account.request.timeout.ms" configuration property to a higher value, as sketched below.
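For illustration, a minimal sketch of setting that property from a notebook (for a DLT pipeline, the equivalent is adding the key under the pipeline's configuration settings; the property name is the one suggested above and the value is an assumed example -- verify both against the ABFS driver documentation for your runtime):

# Assumed example value in milliseconds; property name as suggested above.
spark.conf.set("fs.azure.account.request.timeout.ms", "120000")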
Check if there are any issues with the storage account itself. You can try creating a new storage account and using it for the checkpoint directory to see if the issue persists.
If you are using a managed identity to access the storage account, make sure that the identity has the necessary permissions to create directories in the storage account; a quick permission probe is sketched below.
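A quick probe to verify that the running identity can create directories in the container (the path is a placeholder; point it at a scratch location in the same container as the checkpoints):

# Placeholder path in the same container as the checkpoint directories.
probe_dir = "abfss://<container>@<storage_account>.dfs.core.windows.net/_permission_probe"

dbutils.fs.mkdirs(probe_dir)    # raises an exception if the identity cannot create directories
dbutils.fs.rm(probe_dir, True)  # clean up the probe directory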
If none of these solutions work, I would recommend opening a support ticket for further assistance. The support team may be able to provide more specific guidance based on your environment and data.
-
PRADEEPCHEEKATLA-MSFT 88,716 Reputation points • Microsoft Employee
2024-09-02T08:55:16.5166667+00:00 @Slim MISSAOUI - We haven't heard from you since the last response and I was just checking back to see if you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, reply with more details and we will try to help.
-
Slim MISSAOUI 10 Reputation points
2024-09-03T18:34:46.1633333+00:00 @PRADEEPCHEEKATLA-MSFT We tried the various suggestions, but without success. Unfortunately, the issue is still ongoing. A support ticket has been opened for further assistance.
-
PRADEEPCHEEKATLA-MSFT 88,716 Reputation points • Microsoft Employee
2024-09-04T04:00:27.9666667+00:00 @Slim MISSAOUI - Could you please share the support request number so we can track it internally?
-
Slim MISSAOUI 10 Reputation points
2024-09-05T11:33:18.93+00:00 The support request number is 2408290050003755.
-
PRADEEPCHEEKATLA-MSFT 88,716 Reputation points • Microsoft Employee
2024-09-10T03:36:17.0366667+00:00 @Slim MISSAOUI - Thanks for sharing the support request number. Once the issue has been solved with the help of support, please do share the resolution, which might be beneficial to other community members reading this thread.
-
Yoann Boyere 0 Reputation points
2024-09-11T06:42:04.28+00:00 Hi!
I have had exactly the same problem since yesterday morning and haven't managed to fix it either.
-
Boris 0 Reputation points
2024-09-12T08:47:43.45+00:00 We are also intermittently seeing this error in our Databricks jobs. It seems random, although on days when it occurs in one job, subsequent unrelated jobs are more likely to have their tasks fail with the same issue, which seems to point to a problem with either the storage or the network on those days.
-
Dwight 5 Reputation points
2024-09-12T12:31:14.9966667+00:00 We've been consistently getting the same error since Tuesday when restarting our structured streams that use a checkpoint location.
The issue can be reproduced with a very basic stream on both of our Databricks environments: stream_checkpoint_issue_simplified.pdf
For now, we've resorted to setting the stream starting position based on what's already available in the target, effectively bypassing the checkpoint directory whenever the stream is restarted (sketched below). It's a workaround we'd like to remove as soon as this issue gets resolved.
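For anyone needing the same stopgap, a minimal sketch of that workaround (all table and column names are placeholders; it assumes a Delta source with an event-time column comparable against the target, and a non-empty target):

from pyspark.sql import functions as F

# Find where the target currently ends, so the restarted stream can begin
# from there instead of relying on the problematic checkpoint.
last_ts = spark.read.table("target_table").agg(F.max("event_time")).collect()[0][0]

# Start a fresh stream and drop rows the target already has; the restarted
# query should then write to a new checkpoint location.
stream = (
    spark.readStream.table("source_table")
    .where(F.col("event_time") > F.lit(last_ts))
)
-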
Christophe Preaud 0 Reputation points
2024-09-13T09:41:27.1566667+00:00 Hi,
We've been hitting the same problem randomly since last week, and it has been failing constantly since Wednesday, September 11.