Consistent data in data lake gen2

Anshal 2,246 Reputation points
2024-04-04T12:24:04.8466667+00:00

Hi friends, I need to understand how data consistency works in ADLS. I found this old link:

https://stackoverflow.com/questions/41727633/consistency-of-azure-data-lake-store#:~:text=In%20Azure%20Data%20Lake%20Store,files%20to%20a%20different%20parent.

A detailed explanation of it will be helpful.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. Amira Bedhiafi 24,531 Reputation points
    2024-04-04T21:39:09.0533333+00:00

    First I invite you to read https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-best-practices.

    Data consistency in ADLS Gen2 means that data is written, read, and listed accurately, without anomalies or errors, even in the presence of concurrent operations.

    Strong Consistency

    ADLS Gen2 provides strong consistency guarantees, meaning that once a write operation is acknowledged, the written data is immediately visible to subsequent read and listing operations. This eliminates the uncertainties associated with eventual consistency, where data might not immediately reflect recent writes, making ADLS Gen2 reliable for analytical and transactional workloads that require immediate consistency.
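A minimal sketch of what read-after-write visibility means for client code, assuming the `azure-storage-file-datalake` Python SDK. The parameter `fs` is expected to behave like a `FileSystemClient`; the helper name `write_then_verify` and the paths are hypothetical.

```python
def write_then_verify(fs, path: str, payload: bytes) -> bool:
    """Upload a payload, then immediately read it back and list it.

    `fs` is assumed to be an azure.storage.filedatalake.FileSystemClient,
    obtained for example via
    DataLakeServiceClient(account_url, credential=...).get_file_system_client(name).
    """
    file_client = fs.get_file_client(path)
    # upload_data with overwrite=True replaces any existing file at this path.
    file_client.upload_data(payload, overwrite=True)

    # With strong consistency, no sleep or retry loop is expected to be needed
    # between the acknowledged write and the read or listing that follows.
    read_back = file_client.download_file().readall()

    # List the parent directory (or the whole file system for a top-level file).
    parent = path.rsplit("/", 1)[0] if "/" in path else None
    listed = any(p.name == path for p in fs.get_paths(path=parent))
    return read_back == payload and listed
```

If the function returns `False` against a real account, that would point to a bug in the calling code or a failed write rather than to propagation delay.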

    Atomic File Operations

    File operations in ADLS Gen2, such as creation, deletion, and renaming, are atomic at the file level. This means that these operations either fully succeed or fail, without leaving the system in an intermediate state. For example, if you rename a file, the change is instantly visible to all clients, and there's never a moment when the file is accessible by both the old and new names.
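One common way to take advantage of atomic rename is to write under a temporary name and rename into place, so readers never observe a half-written file at the final path. The sketch below assumes the `azure-storage-file-datalake` Python SDK and an account with hierarchical namespace enabled; `publish_atomically` and the `.tmp` suffix are hypothetical choices.

```python
def publish_atomically(fs, final_path: str, payload: bytes) -> None:
    """Write to a temporary path, then rename it to the final path.

    `fs` is assumed to be an azure.storage.filedatalake.FileSystemClient.
    """
    tmp_path = final_path + ".tmp"  # hypothetical naming convention
    tmp_client = fs.get_file_client(tmp_path)
    tmp_client.upload_data(payload, overwrite=True)

    # rename_file expects the new name as "<file system name>/<new path>".
    tmp_client.rename_file(f"{fs.file_system_name}/{final_path}")
```

If the upload fails partway, only the temporary file is affected and the final path is left untouched.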

    Concurrent Append

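    Concurrent append to a single file was a feature of ADLS Gen1 (the ConcurrentAppend operation); Microsoft's Gen1-to-Gen2 migration guidance lists it as not supported in Gen2. In Gen2, a client appends data at an explicit offset and then commits it with a flush, so multiple writers targeting the same file must coordinate offsets themselves. For logging scenarios where data from multiple sources needs to be aggregated, a common alternative is to have each source write its own file and merge the files downstream.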
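For reference, this is roughly what an append looks like from a single writer with the `azure-storage-file-datalake` Python SDK: data is appended at explicit offsets and becomes part of the file only after a flush. The helper `append_lines` is hypothetical, and the sketch does not coordinate multiple writers.

```python
def append_lines(file_client, lines, start_offset: int = 0) -> int:
    """Append each line at an explicit offset, then commit with one flush.

    `file_client` is assumed to be an azure.storage.filedatalake.DataLakeFileClient
    for a file that already exists (for a new file, call create_file() first).
    `start_offset` must equal the current committed length of the file.
    Returns the new file length in bytes.
    """
    offset = start_offset
    for line in lines:
        data = line.encode("utf-8")
        # Each append states where in the file the bytes belong.
        file_client.append_data(data, offset=offset, length=len(data))
        offset += len(data)

    # Appended data stays uncommitted until flush_data is called with the
    # total length the file should have after the flush.
    file_client.flush_data(offset)
    return offset
```

Because every append carries its own offset, two clients writing to the same file need to agree on who owns which byte range.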

    Directory and File Atomicity

    The hierarchical namespace enables atomicity at the directory level for certain operations. For instance, moving a directory within the same file system is an atomic operation. This ensures that the directory move is immediately visible and consistent across all clients.
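A directory-level rename can be used to publish a whole batch in one step: write all files into a staging directory, then rename the directory. This is a sketch under the assumption that hierarchical namespace is enabled and the `azure-storage-file-datalake` Python SDK is used; `publish_batch` and the directory names are hypothetical.

```python
def publish_batch(fs, staging_dir: str, final_dir: str, files: dict) -> None:
    """Stage several files, then expose them together with one directory rename.

    `fs` is assumed to be an azure.storage.filedatalake.FileSystemClient.
    `files` maps a file name (relative to the directory) to its bytes.
    """
    for name, payload in files.items():
        staged = fs.get_file_client(f"{staging_dir}/{name}")
        staged.upload_data(payload, overwrite=True)

    # rename_directory expects "<file system name>/<new directory path>".
    dir_client = fs.get_directory_client(staging_dir)
    dir_client.rename_directory(f"{fs.file_system_name}/{final_dir}")
```

Consumers that only read from the final directory see either none of the batch or all of it.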

    Transactional Support

    While ADLS Gen2 itself does not provide transactional support akin to traditional database systems, it integrates well with services like Azure Data Factory, which can orchestrate transaction-like behavior across multiple steps in a data processing pipeline. This is achieved through careful planning and the implementation of idempotent operations, ensuring that data flows remain consistent even in complex processing scenarios.
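To illustrate what an idempotent step can look like, here is a minimal sketch in which a pipeline activity writes its output and then a completion marker, and skips the work if the marker already exists. It assumes the `azure-storage-file-datalake` Python SDK; the function `run_step_once`, the `curated/<batch_id>/` layout, and the `_SUCCESS` marker name are hypothetical and not something Azure Data Factory prescribes.

```python
def run_step_once(fs, batch_id: str, produce) -> bool:
    """Run a pipeline step so that re-running it after success is a no-op.

    `fs` is assumed to be an azure.storage.filedatalake.FileSystemClient.
    `produce` is a zero-argument callable returning the bytes to store.
    Returns True if the step ran, False if it was skipped.
    """
    marker = fs.get_file_client(f"curated/{batch_id}/_SUCCESS")
    if marker.exists():
        return False  # a previous run already completed this batch

    data_client = fs.get_file_client(f"curated/{batch_id}/data.bin")
    # overwrite=True makes a retry after a partial failure replace the
    # earlier output instead of failing or duplicating it.
    data_client.upload_data(produce(), overwrite=True)

    # The marker is created last, so its presence implies the data is complete.
    marker.create_file()
    return True
```

A retried activity can then call the same function with the same `batch_id` without producing duplicate output.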

    How does it work in practice?

    In practice, maintaining data consistency in ADLS Gen2 involves leveraging these features and understanding their implications. For instance, when designing a data ingestion pipeline, knowing that file operations are atomic can inform how you structure data ingestion batches or handle errors. Similarly, the strong consistency model simplifies data processing logic, as you can rely on the immediate visibility of written data.


1 additional answer

  1. Anand Prakash Yadav 7,785 Reputation points Microsoft Vendor
    2024-04-05T12:18:04.86+00:00

    Hello Anshal,

    Thank you for posting your query here!

    The consistency model of a distributed storage system like ADLS Gen2 is a critical aspect as it impacts how applications interact with the data.

    Azure Data Lake Storage Gen2 (ADLS Gen2) provides strong consistency for read and write operations. This means that once a write operation is acknowledged, subsequent read operations will return the most recently written data. https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-best-practices

    Here are some key points about data consistency in ADLS Gen2:

    · Namespace functionality is available through both the Azure Data Lake Storage Gen2 and Blob APIs, allowing consistent usage across both sets of APIs. The same data is accessible through both the Blob and Azure Data Lake Storage Gen2 APIs with full coherence.

    · Processing is executed at near-constant per-request latencies that are measured at the service, account, and file levels. This design means that Azure Data Lake Storage Gen2 can easily and quickly scale up to meet the most demanding workloads. It can also just as easily scale back down when demand drops. https://video2.skills-academy.com/en-us/azure/storage/blobs/data-lake-storage-introduction

    · With the evolution of the Common Data Model metadata system, the model brings the same structural consistency and semantic meaning to the data stored in Microsoft Azure Data Lake Storage Gen2 with hierarchical namespaces and folders that contain schematized data in standard Common Data Model format. The standardized metadata and self-describing data in an Azure Data Lake facilitates metadata discovery and interoperability between data producers and data consumers. https://video2.skills-academy.com/en-us/common-data-model/data-lake

    I hope this helps! Please let me know if the issue persists or if you have any other questions.

    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

