Enterprise Data Lake for multiple legacy systems

Arishtat67 1 Reputation point
2021-08-24T09:04:37.347+00:00

(Guidance level questions, mostly)

Suppose you have a government agency responsible for motor vehicle and driver license registration (much like DMV in the US). They have separate legacy systems for vehicle registration and driver license registration and driver infractions, which may or may not use a central customer data system. Other organizations, both government and private, require read access to this data as well as reporting and analysis, both curated and ad hoc.

I've been learning data lake technologies, trying to assess whether dumping all of the OLTP data into a data lake to form an enterprise data lake could fulfill these various needs. What I've learned so far doesn't make me confident enough to recommend this as a general solution.

Quick Read Access
Suppose the police need quick read access to check to whom a vehicle is registered (including its history), or how many infractions a driver has. Both cases require row-level access, suggesting a raw area using a row-oriented format such as Avro. We could possibly also construct separate areas for current and historical data on top of the raw area, since the current data is used most often. I haven't yet built a proof-of-concept solution that would let me test whether this would give us good enough performance (assuming the competing solution would be e.g. Elasticsearch).
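To make the "current vs. historical areas" idea concrete, here is a minimal sketch in plain Python. The record fields (`plate`, `owner`, `valid_to`) are invented for illustration, not from any real DMV schema; in a real lake the zones would be folders of Avro files rather than in-memory structures.

```python
from datetime import date

# Hypothetical vehicle-registration records; field names are illustrative.
records = [
    {"plate": "ABC-123", "owner": "Alice", "valid_to": date(2030, 1, 1)},
    {"plate": "ABC-123", "owner": "Bob",   "valid_to": date(2019, 6, 30)},
    {"plate": "XYZ-789", "owner": "Carol", "valid_to": date(2028, 3, 15)},
]

def split_zones(rows, today):
    """Route each row to a 'current' or 'historical' zone, and key the
    current zone by plate so point lookups are O(1)."""
    current, historical = {}, []
    for row in rows:
        if row["valid_to"] >= today:
            current[row["plate"]] = row   # one active registration per plate
        else:
            historical.append(row)       # full history kept separately
    return current, historical

current, historical = split_zones(records, date(2024, 1, 1))
print(current["ABC-123"]["owner"])   # the fast "who owns this plate?" lookup
print(len(historical))               # expired registrations stay queryable
```

The point of the split is that the latency-sensitive police query only ever touches the small, keyed current zone, while the historical zone can stay append-only and cheap.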

Analytics and Reporting
Other actors, such as the department of transportation and the transportation industry, want curated reports and want to run their own analytics to support decision making. For that, as far as I know, we should construct separate areas using a column-oriented file format such as Parquet, which would then open the door to various analytics and reporting technologies.
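The row-oriented vs. column-oriented distinction driving the Avro/Parquet choice can be sketched in plain Python (the field names and figures are made up for illustration):

```python
# Row-oriented layout (Avro-like): good for "fetch one complete record".
rows = [
    {"driver_id": 1, "infractions": 2, "region": "North"},
    {"driver_id": 2, "infractions": 0, "region": "South"},
    {"driver_id": 3, "infractions": 4, "region": "North"},
]

# Column-oriented layout (Parquet-like): good for scans and aggregates
# that touch only a few columns.
columns = {
    "driver_id":   [1, 2, 3],
    "infractions": [2, 0, 4],
    "region":      ["North", "South", "North"],
}

# An analytics query ("average infractions per driver") reads exactly one
# column from the columnar layout, instead of deserializing every row.
avg = sum(columns["infractions"]) / len(columns["infractions"])
print(avg)
```

This is why the same data often lives twice in a lake: a row-oriented raw zone serving point lookups, and a columnar curated zone serving reporting and analytics.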

There is a lot in this solution that appeals to me, but I have concerns as well. All OLTP systems are on-premises and use a variety of database technologies, so transferring data to Azure Data Lake is a bit of a challenge, but Data Factory should be able to handle that. Data security is another concern: read access requires fine-grained access control, as it deals with sensitive data. On the upside, instead of building a service for the police that returns vehicle or infraction data, we could just grant the police limited access to the appropriate areas of the data lake and let them build their own app. Similarly, we could grant the department of transportation's data scientists access to the curated data area so they can do their own analytics without involving us. My main motivation for even researching this solution is that it brings all the data together, past and future, and opens new possibilities for interaction between different actors without setting up a new project every time new data is needed. Instead, we could just say, "here's the data, knock yourself out".
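The "give each actor access to its own area" model boils down to a grant matrix mapping roles to lake zones. A toy sketch, with role and zone names invented for illustration (in ADLS Gen2 the real mechanism would be POSIX-style ACLs and Azure RBAC on directories, not application code):

```python
# Hypothetical grant matrix: which roles may read which lake zones.
GRANTS = {
    "police":      {"vehicle_raw", "infractions_raw"},
    "dot_analyst": {"curated"},
}

def can_read(role: str, zone: str) -> bool:
    """Coarse, zone-level access check; unknown roles get nothing."""
    return zone in GRANTS.get(role, set())

print(can_read("police", "infractions_raw"))   # police can read raw infractions
print(can_read("dot_analyst", "vehicle_raw"))  # analysts see only curated data
```

Note that this zone-level granularity is what directory ACLs give you; anything finer (row- or column-level) needs a query layer on top of the lake, which is exactly the limitation the answer below touches on.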

What I'm looking for is some guidance on the viability of this solution, so that I can decide whether to pursue it or abandon it altogether. I realize this is very high-level, but I'm sure others have struggled with the same questions. Whitepapers, references, and case studies would be greatly appreciated.

Azure Data Lake Storage
An Azure service that provides an enterprise-wide hyper-scale repository for big data analytic workloads and is integrated with Azure Blob Storage.
Azure Data Lake Analytics

1 answer

Sort by: Most helpful
  1. HimanshuSinha-msft 19,476 Reputation points Microsoft Employee
    2021-08-26T19:45:18.607+00:00

    Hello @Arishtat67 ,
    Thanks for the question, and for using the Microsoft Q&A platform.
    As you mentioned, the data is coming from different sources, and I agree that Azure Data Factory should do the trick. You could also consider using Azure Synapse, which likewise supports the self-hosted integration runtime (SHIR); in your on-premises scenario this is vital.

    You also asked about security. Row-level security is not available in Azure Data Lake itself; it is only possible when you funnel this data into a different system such as a data warehouse or Synapse (which supports RLS); read here.

    Regarding the great scenario you portrayed: let's say the police are looking up my driving history. The question is whether they need it in 2 minutes (when they pull someone over) or within 2 hours. If they need it in two minutes, you will have to make sure the data is stored in a system that is indexed and partitioned properly. If the turnaround time is 2 hours, I think you can move the records to a different data lake using ADF (or maybe something else) and the police can access the data there.
    Not sure if you are considering Azure Databricks, as it also supports RLS.
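    To illustrate the "indexed / partitioned properly" point for the two-minute case, here is a toy Python sketch (record fields and the plate-prefix partitioning scheme are invented for illustration; in a real lake this would be the folder partitioning of the files):

```python
from collections import defaultdict

def build_partitions(rows):
    """Bucket records by the first character of the plate, so a lookup
    scans one small bucket instead of the full dataset."""
    parts = defaultdict(list)
    for row in rows:
        parts[row["plate"][0]].append(row)
    return parts

rows = [
    {"plate": "A-1", "owner": "Alice"},
    {"plate": "B-2", "owner": "Bob"},
    {"plate": "A-3", "owner": "Ann"},
]

parts = build_partitions(rows)
# Only the "A" partition is scanned for this query.
hits = [r for r in parts["A"] if r["plate"] == "A-3"]
print(hits[0]["owner"])
```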

    Thanks
    Himanshu
    Please do consider clicking on "Accept Answer" and "Up-vote" on the post that helps you, as it can be beneficial to other community members

