Optimizing training data for Azure Document Intelligence Custom Extraction Model

Question

I'm curious about the Azure Document Intelligence Custom Extraction Model. What are the best practices for training a custom neural extraction model? Is it true that a single neural model performs better when trained on a variety of documents, or should I separate the document types into several models?

I'm working on a resume-related model that will primarily act as a resume parser. Since it involves real-world resumes, and everyone's resume looks different, how should I classify the documents if a composed model is a better option than throwing all the resumes into a single model (by job description, by manually classifying them, or by another method)? Moreover, I have thousands of CVs on hand, and not every CV contains all the required fields, which leaves a lot of empty fields. The required fields also include table fields, and the content I'm extracting isn't always laid out as a clean table. Will this affect the accuracy of the model, and how many CVs are enough for the model to be decently accurate?

Lastly, I have a lot of fields to label and thousands of resumes to train on. Using Azure Document Intelligence's no-code environment is frustrating and repetitive, and it feels like a data entry job. I want to avoid a situation where future changes force me to manually enter values into fields across thousands or tens of thousands of training documents. Are there any Azure features that I might not have discovered yet that I can use to optimize my resume parsing model?

To summarize my main concerns: how can I optimize my model for resume parsing? I want to know the limitations of using a DI custom extraction model for resume parsing and what I should do beforehand to maximize accuracy within those limitations. My model API version is 2023-07-31.

Accepted Answer

Hi @Tom Chow,

Thank you for reaching out to the Microsoft Q&A forum!

When training a custom neural extraction model in Azure Document Intelligence, it is recommended to use a diverse set of training data that represents the range of document types and layouts the model will encounter in production. It is not necessary to separate the document types into several models, because a neural model can perform well on a variety of document types. However, if you have a large number of documents with very different layouts or structures, it may be beneficial to train a separate model for each type of document.

To classify the type of documents in a resume-related model, it is recommended to use a composed model based on job descriptions or manual classification. To handle empty fields, use techniques such as data augmentation or synthetic data generation. When extracting table fields, ensure that the table structure is consistent. Use several labelled examples for each field and evaluate the model's performance on a validation set of documents. If the model is not performing well, adjust the training data or model parameters and retrain the model.
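As an illustration of the validation step mentioned above, here is a minimal sketch in Python of scoring predictions per field on a held-out set of resumes. The function names, the dictionary layout (document ID mapped to field name mapped to value), and the exact-match comparison after whitespace and case normalization are all assumptions made for this example, not part of the Document Intelligence service. Fields that are legitimately empty in a CV are counted as correct when the model also returns nothing.

```python
from collections import defaultdict


def normalize(value):
    """Treat None and whitespace-only strings as an empty field; ignore case and spacing."""
    if value is None:
        return ""
    return " ".join(str(value).split()).lower()


def per_field_accuracy(predictions, ground_truth):
    """Return {field: fraction of validation documents whose prediction matches the label}.

    Both arguments are hypothetical dicts shaped as {doc_id: {field_name: value}}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for doc_id, expected_fields in ground_truth.items():
        predicted_fields = predictions.get(doc_id, {})
        for field, expected in expected_fields.items():
            total[field] += 1
            if normalize(predicted_fields.get(field)) == normalize(expected):
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Fields with a low score are the ones to look at first when deciding which labels to review or which kinds of resumes to add before retraining.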

To avoid manual data entry, you can use the Azure Document Intelligence service to automatically extract data from your resumes. You can use Azure Functions to automate the process of uploading resumes to Azure Blob Storage and extracting data from them using the Document Intelligence REST API. This can help you optimize your model related to resume parsing and avoid repetitive manual data entry.
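The extraction call described above could look roughly like the following sketch, which uses only the Python standard library against the 2023-07-31 REST API. The endpoint, key, model ID, and document URL are placeholders you would supply, and the helper names (`build_analyze_request`, `flatten_fields`, `analyze`) are hypothetical. The sketch assumes the resume is already in Blob Storage and reachable through a URL (for example, a SAS URL); wiring this into an Azure Function trigger is left out.

```python
import json
import time
import urllib.request

API_VERSION = "2023-07-31"  # the API version stated in the question


def build_analyze_request(endpoint, key, model_id, document_url):
    """Build the POST request that starts analysis of one document with a custom model."""
    url = (f"{endpoint.rstrip('/')}/formrecognizer/documentModels/"
           f"{model_id}:analyze?api-version={API_VERSION}")
    body = json.dumps({"urlSource": document_url}).encode("utf-8")
    headers = {"Ocp-Apim-Subscription-Key": key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def flatten_fields(analyze_result):
    """Map each field name to (content, confidence); fields with no value yield None."""
    rows = {}
    for document in analyze_result.get("documents", []):
        for name, field in document.get("fields", {}).items():
            rows[name] = (field.get("content"), field.get("confidence"))
    return rows


def analyze(endpoint, key, model_id, document_url, poll_seconds=2.0):
    """Start the analysis, then poll the Operation-Location URL until it finishes."""
    request = build_analyze_request(endpoint, key, model_id, document_url)
    with urllib.request.urlopen(request) as response:
        operation_url = response.headers["Operation-Location"]
    poll = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key})
    while True:
        with urllib.request.urlopen(poll) as response:
            payload = json.load(response)
        if payload["status"] == "succeeded":
            return flatten_fields(payload["analyzeResult"])
        if payload["status"] == "failed":
            raise RuntimeError(payload.get("error"))
        time.sleep(poll_seconds)  # status is still "notStarted" or "running"
```

Note that this automates extraction with an already trained model; it does not label training documents for you.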

To optimize your model, choose between a custom template model and a custom neural model, make sure your training data is representative of the resumes you expect in production, and label it accurately and consistently. Accuracy also depends on having a sufficient number of labelled samples.

I hope this information helps. Do let us know if you have any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful.
