Deploying Azure Databricks with a data lake

Sujeet 0 Reputation points
2024-06-05T08:40:13.1866667+00:00

Deploying Azure Databricks creates an additional resource group in the background, which includes a data lake. Is it possible to use the data lake that I have already deployed in Azure instead of the one provisioned by Azure Databricks?

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.

1 answer

  1. PRADEEPCHEEKATLA-MSFT 84,381 Reputation points Microsoft Employee
    2024-06-05T11:12:19.88+00:00

    @Sujeet - Thanks for the question and for using the MS Q&A platform.

    Deploying Azure Databricks creates an additional resource group in the background, which includes a data lake. Is it possible to use the data lake that I have already deployed in Azure instead of the one provisioned by Azure Databricks?

    The answer is no. Reason: do not store any production data in the default DBFS folders.

    Why should you not store production data in the default DBFS folders?

    Azure Databricks uses the DBFS root directory as a default location for some workspace actions. Databricks recommends against storing any production data or sensitive information in the DBFS root, and its documentation focuses on how to avoid accidental exposure of sensitive data there.
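    As a rough illustration, this minimal sketch (run in a Databricks notebook cell, where `dbutils` is predefined) lists the DBFS root; every path it prints is browsable by all workspace users:

    ```python
    # List the DBFS root from a notebook; dbutils is available by default.
    # Everything under dbfs:/ is visible to every user in the workspace.
    for entry in dbutils.fs.ls("/"):
        print(entry.path)

    # Typical entries include dbfs:/FileStore/, dbfs:/databricks-datasets/,
    # and dbfs:/user/hive/warehouse/ (the default home of managed Hive tables).
    ```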


    Educate users not to store data on DBFS root

    Because the DBFS root is accessible to all users in a workspace, anyone can read data stored there, so instruct users to avoid this location for sensitive data. The default location for managed tables in the Hive metastore on Azure Databricks is the DBFS root; to prevent end users who create managed tables from writing to it, declare a location on external storage when creating databases in the Hive metastore (see the sketch below).

    For more details, refer to Recommendations for working with DBFS root.
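    For example, here is a minimal sketch of declaring an external location when creating a database; the storage account (`mydatalake`), container (`data`), and paths below are placeholders for your own ADLS Gen2 setup, assuming the cluster already has access to that account:

    ```python
    # Create a Hive metastore database whose managed tables land on external
    # ADLS Gen2 storage instead of the DBFS root. All names are placeholders.
    spark.sql("""
        CREATE DATABASE IF NOT EXISTS sales_db
        LOCATION 'abfss://data@mydatalake.dfs.core.windows.net/warehouse/sales_db'
    """)

    # Managed tables created in this database now write under the external
    # path rather than dbfs:/user/hive/warehouse/.
    spark.sql("CREATE TABLE IF NOT EXISTS sales_db.orders (id INT, amount DOUBLE)")
    ```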

    From Azure Databricks Best Practices: Do not Store any Production Data in Default DBFS Folders


    Reasons for storing data in your own storage account (ADLS Gen2) rather than in the storage account associated with the Azure Databricks workspace:

    Reason 1: You do not have write permission when you access the workspace's storage account externally, for example via Azure Storage Explorer.

    Reason 2: You cannot reuse that storage account for another Azure Databricks workspace, or as a linked service for Azure Data Factory or an Azure Synapse workspace.

    Reason 3: You may later decide to move to Azure Synapse workspaces or Microsoft Fabric instead of Azure Databricks; data kept in your own storage account carries over.

    Reason 4: If you delete the workspace, its managed resource group and the storage account inside it are deleted too, and you lose your data.
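    To make the pattern concrete, here is a sketch of reading and writing your own ADLS Gen2 account directly from Databricks using the standard ABFS OAuth settings for a service principal; the account name, container, secret scope, and Azure AD application details are all placeholders:

    ```python
    # Configure direct access to an external ADLS Gen2 account with a service
    # principal. The spark.conf keys are the standard ABFS OAuth settings;
    # every concrete value below is a placeholder.
    account = "mydatalake"
    suffix = f"{account}.dfs.core.windows.net"
    spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{suffix}",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", "<application-id>")
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}",
                   dbutils.secrets.get(scope="my-scope", key="sp-secret"))
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{suffix}",
                   "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

    # Read and write against the external lake; this data outlives any
    # single Databricks workspace.
    df = spark.read.parquet(f"abfss://data@{suffix}/raw/events")
    df.write.mode("overwrite").parquet(f"abfss://data@{suffix}/curated/events")
    ```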

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "Was this answer helpful".