50 million records - use case is blob JSON to blob JSON - need audit help

Birajdar, Sujata 61 Reputation points
2021-11-11T07:11:56.13+00:00

Hi All,

We have the requirement given below. Please help us with the technical details as soon as possible. A diagram of our flow is attached.

  1. Read multiple JSON files and convert them to an output JSON format matching the Draft 4 schema given in Databricks.
  2. Transform all input files into the given schema using Databricks (we are using PySpark); a minimal sketch of this flow follows the list.
  3. The client wants only Blob Storage: all files must be read from and written to Blob Storage.
  4. How can we manage data persistence in DataFrames?
  5. How will we manage audits?
  6. As DataFrames are temporary, how will we recover if something goes wrong?
  7. For 50 million records, how much time is needed to read and write the JSON files?
  8. Should we chunk the JSON files and read the data in pieces?
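For context, here is a minimal PySpark sketch of the read-transform-write flow we have in mind. The storage account, container names, schema fields, and the transformation itself are placeholders, and the storage credentials would still need to be configured on the cluster.

```python
# Minimal sketch of the blob-JSON-to-blob-JSON flow (all names below are placeholders).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("blob-json-to-blob-json").getOrCreate()

# An explicit schema (derived from the Draft 4 JSON schema) avoids a costly
# schema-inference pass over 50 million records. Fields here are illustrative.
input_schema = StructType([
    StructField("id", LongType(), False),
    StructField("name", StringType(), True),
    StructField("payload", StringType(), True),
])

# Read all input JSON files from a Blob Storage container.
source_path = "wasbs://input-container@mystorageaccount.blob.core.windows.net/incoming/"
df = spark.read.schema(input_schema).json(source_path)

# Placeholder transformation into the target layout.
out_df = df.selectExpr("id", "upper(name) AS name", "payload")

# Write the result back to another Blob Storage container as JSON.
target_path = "wasbs://output-container@mystorageaccount.blob.core.windows.net/output/"
out_df.write.mode("overwrite").json(target_path)
```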

[Flow diagram attached: 148477-image.png]

Thanks & Regards,
Sujata

[Attachment: 148367-image.png]

Tags: Azure Data Lake Storage, Azure Databricks, Azure Data Lake Analytics

Accepted answer
HimanshuSinha-msft 19,476 Reputation points · Microsoft Employee
2021-11-11T21:30:55.52+00:00

Hello @Birajdar, Sujata,
Thanks for the ask and for using the Microsoft Q&A platform.
DataFrames not being persistent is by design, and I think that is one of the key reasons why operations on Azure Databricks are faster than on Hadoop. But if you want, you can always write some audit data to Blob Storage to keep an eye on the status. If you are concerned about a compute issue while the processing is happening, the driver/worker design of the cluster should take care of that. I do understand that since you are transferring 50 million records, you definitely do not want to hit an error late in the processing and have to redo the whole thing again.
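As an illustration of the audit idea, a small per-run record appended to an audit container is often enough to see how far a run got. This is only a sketch; the helper name, the path, and the fields below are assumptions, not a prescribed format.

```python
# Hedged sketch: append a one-row audit record (run id, row count, status, timestamp)
# as JSON to a Blob Storage path. The function name, fields, and paths are illustrative.
from datetime import datetime, timezone
from pyspark.sql import Row

def write_audit_record(spark, df, run_id, status, audit_path):
    record = Row(
        run_id=run_id,
        row_count=df.count(),
        status=status,
        logged_at=datetime.now(timezone.utc).isoformat(),
    )
    # Append mode keeps a history of every run in the audit container.
    spark.createDataFrame([record]).write.mode("append").json(audit_path)

# Example usage (paths are placeholders):
# write_audit_record(spark, out_df, "2021-11-11-run-01", "completed",
#                    "wasbs://audit@mystorageaccount.blob.core.windows.net/runs/")
```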
If I were you, I would invest in a validation script to validate/transform the data and find the odd records (if any). You can also take advantage of the many workers running on Azure Databricks when you partition the data. Please look at the input blobs and see whether you can partition on container, date, etc.
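A rough sketch of what such a validation pass plus a partitioned write could look like is below. The schema, the paths, and the ingest_date partition column are assumptions for illustration only.

```python
# Hedged sketch: isolate malformed JSON records during the read and write the good
# rows partitioned by date so slices can be processed and re-run independently.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

validation_schema = StructType([
    StructField("id", LongType(), True),
    StructField("name", StringType(), True),
    StructField("ingest_date", StringType(), True),
    StructField("_corrupt_record", StringType(), True),  # captures unparseable rows
])

raw = (spark.read
       .schema(validation_schema)
       .option("mode", "PERMISSIVE")
       .option("columnNameOfCorruptRecord", "_corrupt_record")
       .json("wasbs://input-container@mystorageaccount.blob.core.windows.net/incoming/")
       .cache())  # cache before filtering on the corrupt-record column

# Split good rows from odd/corrupt rows so one bad record does not fail the whole run.
bad_rows = raw.filter(F.col("_corrupt_record").isNotNull())
good_rows = raw.filter(F.col("_corrupt_record").isNull()).drop("_corrupt_record")

# Quarantine bad rows for inspection; write good rows partitioned by date.
bad_rows.write.mode("append").json(
    "wasbs://audit@mystorageaccount.blob.core.windows.net/rejected/")
(good_rows.write.mode("overwrite")
 .partitionBy("ingest_date")
 .json("wasbs://output-container@mystorageaccount.blob.core.windows.net/validated/"))
```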

Also, I see that you are writing the data to a different region (the end goal). I suggest you start with the region with the least data; it can work as a trial for your validation scripts / runbook, and you will also establish a baseline for how long the full load will take.

Please do let me know how it goes.
Thanks
Himanshu


0 additional answers
