Form Recognizer: PDFs with own temporary fonts are not recognized correctly

Benedikt Schmaler 6 Reputation points

Form Recognizer does not appear to recognize text correctly in PDF files that were created with custom temporary fonts.

For example, I have a file that was created with a custom font. In the PDF file, the text looks like this:

But the recognition returns this result:
?hZkd]’ Jej[dj_WbWki]b[Y^ kdZ iedij][ MY^kjpcWydW^c[d

I get the same result when I copy this text from the PDF file and paste it into a text editor.
As far as I can tell, Form Recognizer in this case does not run recognition on the rendered image of the page but reads the plain text embedded in the PDF file, which cannot be decoded correctly because of the font.

Do you know if there are plans for a future release to recognize text set in unknown fonts when those fonts are included in the PDF file?

romungi-MSFT 45,961 Reputation points Microsoft Employee

2021-05-25T13:49:22.983+00:00

@Benedikt Schmaler Thanks for reporting. This is an interesting observation. It looks like there is no documentation about limitations when using a custom font with Form Recognizer.
Does it work if you change the font to a standard font? Would it be possible to share this document so that we can pass it on to the team for review? Thanks!
NetaH-MSFT 6 Reputation points

2021-05-26T05:28:01.667+00:00

Can you please try copying the text from the PDF in a PDF viewer and pasting it into a text editor? You will probably get the same garbled text.

We usually see this type of PDF in one of two situations: either the originator produced the PDF incorrectly, so the important information about the font's character mapping is missing from the file, or, in most cases, the originator obfuscated the text deliberately as a protection mechanism to prevent readers from copying and pasting the text data.
Benedikt Schmaler 6 Reputation points

2021-05-26T13:49:21.607+00:00

Unfortunately, the file is a client document, so I can't share it here publicly. But I could make it available in a closed area.
If I use other files with a standard font, the service works fine.
Benedikt Schmaler 6 Reputation points

2021-05-26T13:55:12.37+00:00

Copy and paste results in the same outcome.

But in this particular case, where the PDF file contains temporary fonts, wouldn't it be advantageous for the service to convert the pages to an image format, for example, and then run the recognition on the images? That would prevent garbled text from being returned as the recognition result.
NetaH-MSFT 6 Reputation points

2021-05-26T22:32:55.21+00:00

Form Recognizer adheres to the PDF's restrictions and originator settings. You can convert these pages to images and then send the images to Form Recognizer, but when it processes a PDF directly it will adhere to the PDF's security settings and restrictions.
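
If you want to decide automatically which PDFs to convert to images first, one option is to check whether the embedded text layer looks garbled. The following is only a rough sketch in Python, assuming you have already extracted the text layer with a PDF library of your choice. The function names and the 0.3 threshold are illustrative assumptions and are not part of Form Recognizer; the heuristic will also flag legitimate text that contains many URLs, file paths, or formulas.

```python
import string

# Punctuation that normally sits at the edge of a word and should be ignored.
EDGE_PUNCT = string.punctuation + "\u2018\u2019\u201c\u201d"

# Characters that may legitimately appear inside a word (hyphen, apostrophes).
INNER_OK = "-'\u2019"


def garbled_ratio(text: str) -> float:
    """Return the fraction of words that contain a symbol inside the word."""
    tokens = [t.strip(EDGE_PUNCT) for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    suspicious = sum(
        1
        for t in tokens
        if any(not (ch.isalnum() or ch in INNER_OK) for ch in t)
    )
    return suspicious / len(tokens)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Heuristic only: True if enough words have symbols in their interior."""
    return garbled_ratio(text) >= threshold
```

With the text from this thread, `looks_garbled` returns `True` for the garbled string in the question and `False` for an ordinary English sentence. Pages flagged this way could then be rendered to images and submitted as images instead of as a PDF.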
