Design strategy for Data lake

Question

Hi friends, we need to design an Azure data lake from scratch for a solution with multiple complex data sources (databases, REST APIs, and custom applications). The data lake should be scalable and perform well for data retrieval. We are in the process of preparing the HLD (high-level design) and LLD (low-level design). What are the ideal templates for a data lake HLD and LLD? Please advise which key components they should address from a data lake perspective and which critical parts we should focus on.

Accepted Answer

Hello Anshal,

Greetings! Welcome to Microsoft Q&A Platform.

A data lake is a storage repository that holds a large amount of data in its native, raw format. Data lake stores are optimized for scaling to terabytes and petabytes of data. The data typically comes from multiple heterogeneous sources, and may be structured, semi-structured, or unstructured. The idea with a data lake is to store everything in its original, untransformed state.

Azure Data Lake Storage Gen2 isn't a dedicated service or account type. It's a set of capabilities that support high throughput analytic workloads. The Data Lake Storage Gen2 documentation provides best practices and guidance for using these capabilities. For all other aspects of account management such as setting up network security, designing for high availability, and disaster recovery, see the Blob storage documentation content.

Below are some recommendations for the high-level design (HLD):

Data Sources and Ingestion: List all data sources (databases, REST APIs, custom applications) that will feed into the data lake and decide on the ingestion frequency (batch or real-time) for each source.
Architecture and Storage: Choose between a centralized or federated architecture and consider whether to use Azure Data Lake Storage Gen2 or other storage solutions.
Data Organization and Schema: Decide on a folder structure for organizing data within the data lake and determine how to handle schema-on-read (no predefined schema) vs. schema-on-write (predefined schema).
Security and Access Control: Define access controls for different user roles (read, write, execute). Implement encryption, authentication, and authorization mechanisms.
Scalability and Performance: Plan for scalability by distributing data across multiple storage accounts or containers. Optimize for high throughput and low latency.
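
As a starting point for the HLD, the source inventory and folder layout can be captured in a small, reviewable structure. The following is only a minimal sketch: the source names, the `raw/` zone name, and the helper functions are hypothetical and should be replaced with whatever fits your solution.

```python
from dataclasses import dataclass
from enum import Enum


class Frequency(Enum):
    BATCH = "batch"
    REAL_TIME = "real-time"


@dataclass(frozen=True)
class Source:
    name: str              # hypothetical source name
    kind: str              # "database", "rest_api", or "custom_app"
    frequency: Frequency   # how often the source is ingested
    schema_on_read: bool   # True: land raw data, apply the schema at query time


# Hypothetical inventory; replace with the real sources of the solution.
SOURCES = [
    Source("orders_db", "database", Frequency.BATCH, schema_on_read=False),
    Source("partner_api", "rest_api", Frequency.BATCH, schema_on_read=True),
    Source("telemetry_app", "custom_app", Frequency.REAL_TIME, schema_on_read=True),
]


def raw_folder(source: Source) -> str:
    """Return the landing folder for a source inside an assumed 'raw' zone."""
    return f"raw/{source.kind}/{source.name}"


def by_frequency(sources, frequency):
    """List the names of the sources ingested at the given frequency."""
    return [s.name for s in sources if s.frequency is frequency]
```

Keeping this inventory in one place makes it easier to review which sources need a streaming path and which can be loaded in batches.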

References: https://video2.skills-academy.com/en-us/azure/architecture/data-guide/scenarios/data-lake and https://video2.skills-academy.com/en-us/azure/databricks/lakehouse-architecture/

For the low-level design (LLD), consider the following:

Data Ingestion Pipelines: Design ETL (Extract, Transform, Load) pipelines for ingesting data from various sources and choose appropriate Azure services (Azure Data Factory, Azure Logic Apps, etc.).
Data Transformation and Processing: Define transformation logic (if needed) to convert raw data into usable formats. Consider using Azure Databricks, Azure HDInsight, or Azure Synapse Analytics for data processing.
Metadata Management: Establish metadata management practices to document data lineage, data quality, and the data catalog. Use a tool such as Azure Purview (now Microsoft Purview) for metadata management.
Data Partitioning and Indexing: Partition data based on relevant attributes (e.g., date, region) to improve query performance. Create appropriate indexes for efficient data retrieval.
Monitoring and Logging: Set up monitoring for data lake health, performance, and usage, using tools such as Azure Monitor and Azure Log Analytics.
Backup and Disaster Recovery: Implement backup strategies to protect against data loss. Plan for disaster recovery scenarios.
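
To illustrate the partitioning point, here is a minimal sketch of a date- and region-based folder layout. The `curated/` zone, the `key=value` folder naming, and both function names are assumptions for illustration, not something ADLS Gen2 prescribes.

```python
from datetime import date


def partition_path(dataset: str, region: str, day: date) -> str:
    """Build a folder path partitioned by region and date (assumed layout)."""
    return (
        f"curated/{dataset}/region={region}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    )


def matches_partition(path: str, **filters: str) -> bool:
    """Return True if the path satisfies every key=value partition filter."""
    # Collect the key=value segments of the path into a dictionary.
    segments = dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)
    return all(segments.get(key) == value for key, value in filters.items())
```

With a layout like this, a query that filters on region or date only has to list and read the matching folders instead of scanning the whole dataset.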

The following links describe the design templates: https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/synapse-analytics/database-designer/create-lake-database-from-lake-database-templates.md and https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-best-practices. Please review them, design according to your business requirements, and refine continuously based on usage patterns.

Below are some key components to address from a Data Lake Storage Gen2 perspective. One critical part to focus on is data consistency: in ADLS Gen2, this means that data is accurately written, read, and listed without anomalies or errors, even in the presence of concurrent operations.

Strong Consistency

ADLS Gen2 provides strong consistency guarantees, meaning that once a write operation is acknowledged, the written data is immediately visible to subsequent read and listing operations. This eliminates the uncertainties associated with eventual consistency, where data might not immediately reflect recent writes, making ADLS Gen2 reliable for analytical and transactional workloads that require immediate consistency.

Atomic File Operations

File operations in ADLS Gen2, such as creation, deletion, and renaming, are atomic at the file level. This means that these operations either fully succeed or fail, without leaving the system in an intermediate state. For example, if you rename a file, the change is instantly visible to all clients, and there's never a moment when the file is accessible by both the old and new names.
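
A common way to take advantage of atomic renames is the write-then-rename pattern: write a file under a temporary name, then rename it to its final name so readers never see a partial file. The sketch below shows the pattern on a local filesystem using only the Python standard library, as a stand-in; against ADLS Gen2 the same steps would go through the storage SDK or file system driver. The function name and the `_tmp_` prefix are hypothetical.

```python
import os
import tempfile


def publish_atomically(final_path: str, data: bytes) -> None:
    """Write data to a temporary file, then rename it into place."""
    directory = os.path.dirname(final_path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temporary file lives in the target directory so the rename
    # stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix="_tmp_")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        # The rename is the commit step: readers see either no file
        # or the complete file, never a half-written one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Clean up the temporary file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```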

Concurrent Append

ADLS Gen2 supports concurrent append operations, allowing multiple clients to append data to the same file simultaneously without data corruption. This feature is particularly useful for logging scenarios where data from multiple sources needs to be aggregated into a single file.

Directory and File Atomicity

The hierarchical namespace enables atomicity at the directory level for certain operations. For instance, moving a directory within the same file system is an atomic operation. This ensures that the directory move is immediately visible and consistent across all clients.

Transactional Support

While ADLS Gen2 itself does not provide transactional support akin to traditional database systems, it integrates well with services like Azure Data Factory, which can orchestrate transaction-like behavior across multiple steps in a data processing pipeline. This is achieved through careful planning and the implementation of idempotent operations, ensuring that data flows remain consistent even in complex processing scenarios.
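
As a rough illustration of an idempotent pipeline step, the sketch below records processed batch IDs in a small ledger file and writes each batch to a deterministic output name, so a retried run does not duplicate data. It uses the local filesystem and standard library only; the ledger format and the function names are assumptions, not an Azure Data Factory feature.

```python
import json
import os


def load_ledger(ledger_path: str) -> set:
    """Return the set of batch IDs that were already processed."""
    if not os.path.exists(ledger_path):
        return set()
    with open(ledger_path, "r", encoding="utf-8") as handle:
        return set(json.load(handle))


def process_batch(batch_id: str, rows: list, output_dir: str, ledger_path: str) -> bool:
    """Write a batch once; return False if it was already processed."""
    done = load_ledger(ledger_path)
    if batch_id in done:
        return False  # retry of a completed batch: nothing to do
    os.makedirs(output_dir, exist_ok=True)
    # Deterministic output name: a rerun after a crash overwrites the
    # same file instead of creating a duplicate.
    out_path = os.path.join(output_dir, f"{batch_id}.json")
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(rows, handle)
    # Record the batch only after its output has been written.
    done.add(batch_id)
    with open(ledger_path, "w", encoding="utf-8") as handle:
        json.dump(sorted(done), handle)
    return True
```

Because the ledger is updated only after the output is written, a failure between the two steps leads to a harmless rewrite of the same file on the next run.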

How does it work in practice?

In practice, maintaining data consistency in ADLS Gen2 involves leveraging these features and understanding their implications. For instance, when designing a data ingestion pipeline, knowing that file operations are atomic can inform how you structure data ingestion batches or handle errors. Similarly, the strong consistency model simplifies data processing logic, as you can rely on the immediate visibility of written data.

Hope this answer helps! Please let us know if you have any further queries. I’m happy to assist you further.

Please "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

Answer

Hi Anshal - Thanks for reaching out.

You can start by reviewing the links below to gain insight into ADLS Gen2 accounts, the associated terminology, authentication mechanisms, and best practices, along with known issues and limitations.

https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-introduction

https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-namespace

https://azure.github.io/Storage/docs/analytics/hitchhikers-guide-to-the-datalake/

Please let us know if you have any further queries. I’m happy to assist you further.

Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.
