How to extract text from tables present in a pdf document using any of the Cognitive Services?

Question

Hi,

I'm looking for a way to extract tables from a PDF document, similar to AWS Textract. The accuracy should be high enough that the extracted text is correct regardless of the kind of table. I have multiple PDFs, each containing a few tables.

Answer

Hello,

Thanks for reaching out to us. For your scenario, I think Form Recognizer meets your requirement. Azure Form Recognizer is a cognitive service that uses machine learning to identify and extract text, key/value pairs, and table data from form documents. It ingests text from forms and outputs structured data that preserves the relationships in the original file. You get accurate results tailored to your specific content without heavy manual intervention or extensive data science expertise. Form Recognizer consists of custom models, prebuilt models, and the Layout API. You can call Form Recognizer models through a REST API to reduce complexity and integrate them into your workflow or application.

Form Recognizer is made up of the following services:

Custom models - Extract key/value pairs and table data from forms. These models are trained with your own data, so they're tailored to your forms.
Prebuilt models - Extract data from unique form types using prebuilt models. Currently available are prebuilt models for sales receipts and business cards in English.
Layout API - Extract text and table structures, along with their bounding box coordinates, from documents.
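
For your table scenario, the Layout API is the one to look at. As a rough illustration of what you would do with its output, here is a minimal Python sketch that turns the tables in a Layout-style result into plain row/column grids. It assumes the JSON shape I recall from the v2.x REST API (an `analyzeResult` object containing `pageResults`, each with `tables` that list `rows`, `columns`, and `cells` carrying `rowIndex`, `columnIndex`, and `text`); please verify the field names against the API version you use. The function name `tables_to_grids` and the sample data are made up for this example and are not real service output.

```python
from typing import Any, Dict, List


def tables_to_grids(analyze_result: Dict[str, Any]) -> List[List[List[str]]]:
    """Convert each table in a Layout-style result into a list of rows of cell text.

    Assumes the v2.x-style shape: analyze_result["pageResults"][i]["tables"][j]
    with "rows", "columns", and "cells" (rowIndex, columnIndex, text).
    """
    grids = []
    for page in analyze_result.get("pageResults", []):
        for table in page.get("tables", []):
            # Start from an empty rows x columns grid, then fill in each cell.
            grid = [["" for _ in range(table["columns"])]
                    for _ in range(table["rows"])]
            for cell in table["cells"]:
                grid[cell["rowIndex"]][cell["columnIndex"]] = cell["text"]
            grids.append(grid)
    return grids


# Hand-written sample for illustration only (not real service output).
sample_result = {
    "pageResults": [
        {
            "page": 1,
            "tables": [
                {
                    "rows": 2,
                    "columns": 2,
                    "cells": [
                        {"rowIndex": 0, "columnIndex": 0, "text": "Item"},
                        {"rowIndex": 0, "columnIndex": 1, "text": "Price"},
                        {"rowIndex": 1, "columnIndex": 0, "text": "Widget"},
                        {"rowIndex": 1, "columnIndex": 1, "text": "4.50"},
                    ],
                }
            ],
        }
    ]
}

for grid in tables_to_grids(sample_result):
    for row in grid:
        print(" | ".join(row))
```

In practice you would pass in the `analyzeResult` object from the service response for each PDF; cells that span several rows or columns would need extra handling that this sketch leaves out.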

For more information please check: https://video2.skills-academy.com/en-us/azure/cognitive-services/form-recognizer
For more samples please check: https://video2.skills-academy.com/en-us/samples/browse/?products=azure&term=vision&terms=%22form%20recognizer%22

Please let me know if you have more questions.

Regards,
Yutong
