How to extract table text from PDF documents using the Form Recognizer service?

MachineLearning 6 Reputation points
2020-09-08T09:30:28.807+00:00

I'm looking for a way to extract the text of tables in a PDF document using Form Recognizer. I tried training a custom model with labels defined in the OCR labeling tool. However, the accuracy I get is only ~30%, which is very low. A sample image of the table is attached (please ignore the red ovals).

23291-table.png

Is there a possibility to extract the content of the above table image into a .csv file?

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
Azure AI services
A group of Azure services, SDKs, and APIs designed to make apps more intelligent, engaging, and discoverable.

2 answers

  1. GiftA-MSFT 11,161 Reputation points
    2020-09-09T22:21:57.157+00:00

    Hi, thanks for reaching out. Currently, the only supported output format is JSON. However, you can convert the output to a pandas DataFrame and export it to CSV, as shown in this example, or check out other resources available online. Hope this helps.
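As a rough illustration of the JSON-to-CSV step, here is a minimal sketch. It assumes the analysis result lists tables under `analyzeResult.pageResults[].tables[]`, with each cell carrying `rowIndex`, `columnIndex`, and `text`; check this against the JSON your API version actually returns. The helper names `table_to_grid` and `tables_to_csv` and the sample data are hypothetical, and the sketch uses Python's standard `csv` module rather than pandas so that it runs without extra packages.

```python
import csv
import io


def table_to_grid(table):
    """Arrange one table's cells into a 2-D list of strings (rows of columns)."""
    grid = [["" for _ in range(table["columns"])] for _ in range(table["rows"])]
    for cell in table["cells"]:
        # Cells that span several rows or columns are written to their
        # top-left position only; the remaining positions stay empty.
        grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
    return grid


def tables_to_csv(result):
    """Return one CSV string per table found in the analysis result."""
    outputs = []
    for page in result["analyzeResult"].get("pageResults", []):
        for table in page.get("tables", []):
            buffer = io.StringIO()
            writer = csv.writer(buffer, lineterminator="\n")
            writer.writerows(table_to_grid(table))
            outputs.append(buffer.getvalue())
    return outputs


# Hypothetical sample shaped like the assumed response; not real service output.
sample = {
    "analyzeResult": {
        "pageResults": [
            {
                "page": 1,
                "tables": [
                    {
                        "rows": 2,
                        "columns": 2,
                        "cells": [
                            {"rowIndex": 0, "columnIndex": 0, "text": "Service"},
                            {"rowIndex": 0, "columnIndex": 1, "text": "Amount"},
                            {"rowIndex": 1, "columnIndex": 0, "text": "Consulting, on-site"},
                            {"rowIndex": 1, "columnIndex": 1, "text": "120.00"},
                        ],
                    }
                ],
            }
        ]
    }
}

csv_tables = tables_to_csv(sample)
print(csv_tables[0])
```

If pandas is available, `pandas.DataFrame(table_to_grid(table)).to_csv(path, index=False, header=False)` should write the same grid to a file.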


  2. Nick Hill 1 Reputation point
    2020-09-16T15:32:11.103+00:00

    That image should be fairly easy to process with Form Recognizer. You need a minimum of 5 images to train the model (more will give better accuracy). Tag only the data elements you need (not the field labels). If you have repeating groups such as services, give each one a separate tag (service1, service2, service3, etc.), and do the same for events and other repeating items. You will then see the accuracy grow as you add more tags across more invoices. I would create a simple Azure Function that monitors Blob Storage and, on receipt of a new file, submits it to Form Recognizer, takes the returned JSON output, and writes a CSV back to Blob Storage (you may be able to write less code using a Logic App).
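The JSON-to-CSV step of such a function might look like the sketch below. It covers only the conversion, not the blob trigger or the call to the service. It assumes the custom model's result exposes the tagged values under `analyzeResult.documentResults[0].fields`, keyed by tag name, each with a `text` property, and that a tag the model could not find comes back as null; verify both against your own output. The helper names and the sample data are hypothetical.

```python
import csv
import io
import re


def natural_key(name):
    """Sort key that orders numbered tags naturally (service2 before service10)."""
    return [int(part) if part.isdigit() else part for part in re.split(r"(\d+)", name)]


def fields_to_csv(result):
    """Write the tagged fields of the first document as a header row and a value row."""
    fields = result["analyzeResult"]["documentResults"][0]["fields"]
    names = sorted(fields, key=natural_key)
    # Assumption: a tag with no detected value is returned as null (None).
    values = [(fields[name] or {}).get("text", "") for name in names]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(names)
    writer.writerow(values)
    return buffer.getvalue()


# Hypothetical sample shaped like the assumed response; not real service output.
sample_result = {
    "analyzeResult": {
        "documentResults": [
            {
                "fields": {
                    "service2": {"text": "Repair"},
                    "service10": {"text": "Delivery"},
                    "service1": {"text": "Consulting"},
                    "event1": None,
                }
            }
        ]
    }
}

csv_text = fields_to_csv(sample_result)
print(csv_text)
```

Inside an Azure Function or Logic App, the returned string would be the content written back to Blob Storage as the CSV file.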
