Document Intelligence Studio read does not correctly read a PDF

Javier Alonso Gutiérrez 0

Hi. I was trying to train a custom extraction model in Document Intelligence Studio, but when it analyzes the PDF files containing the data, it does not read the text paragraphs correctly.

To isolate the problem, I switched to a plain Read analysis, but it reads only some portions of the text in the PDF document and fails to recognize most of the words correctly. Even language detection fails.

If I convert the PDF to JPG it works fine. What could be the problem?

Some samples:

Screenshot 2024-06-05 at 17.48.06

Screenshot 2024-06-05 at 17.39.47

YutongTie-MSFT 48,001 Reputation points

2024-06-05T23:58:34.1466667+00:00

Thanks for reporting this issue. If the original PDF is not confidential, could you please share it with us so that we can reproduce and investigate the problem?

Regards,

Yutong

Javier Alonso Gutiérrez 0 Reputation points

2024-06-06T08:25:56.95+00:00

Sure.

The files are larger than 3 MB, so they can't be attached. I created a WeTransfer link: https://we.tl/t-FeyyF1g1am

FYI: the PDF files were created from a web page using "Export as PDF" in Safari on macOS.
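For anyone who wants to check whether the PDF-versus-JPG difference described in the question also appears outside Studio, here is a minimal sketch that runs the prebuilt Read model on both files and compares the results. It assumes the `azure-ai-formrecognizer` Python package. The environment variable names `DOCINTEL_ENDPOINT` and `DOCINTEL_KEY`, the file names, and the `word_overlap` helper are hypothetical and only for illustration; they do not come from this thread.

```python
import os
import re


def analyze_read(path):
    """Run the prebuilt Read model on a local file.

    Returns the extracted text and the list of detected language locales.
    """
    # Imported lazily so word_overlap() can be used without the SDK installed.
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Assumption: endpoint and key live in these hypothetical environment variables.
    client = DocumentAnalysisClient(
        os.environ["DOCINTEL_ENDPOINT"],
        AzureKeyCredential(os.environ["DOCINTEL_KEY"]),
    )
    with open(path, "rb") as f:
        poller = client.begin_analyze_document("prebuilt-read", document=f)
        result = poller.result()

    locales = [lang.locale for lang in (result.languages or [])]
    return result.content, locales


def word_overlap(text_a, text_b):
    """Fraction of distinct words in text_b that also occur in text_a (case-insensitive)."""
    words_a = set(re.findall(r"\w+", text_a.lower()))
    words_b = set(re.findall(r"\w+", text_b.lower()))
    if not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_b)


# Example usage (hypothetical file names; requires a Document Intelligence resource):
#   pdf_text, pdf_langs = analyze_read("sample.pdf")
#   jpg_text, jpg_langs = analyze_read("sample.jpg")
#   print("PDF languages:", pdf_langs, "JPG languages:", jpg_langs)
#   print("Share of JPG words also found in PDF result:", word_overlap(pdf_text, jpg_text))
```

A low overlap together with missing or wrong locales for the PDF run would suggest the problem lies in how the service handles that PDF rather than in Studio itself, although this is only a rough indicator and not a diagnosis.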
