Duplicated words returned from computer vision API

Question

Hi, I'm using the Read API to extract typed and handwritten text from PDFs. When the PDF is a plain scan, everything works as expected. However, if the PDF has already been OCRed, the JSON response contains duplicated words and phrases, and some of the duplicates contain typos (example attached). The duplicates appear on the same line. If I convert such a PDF to an image first, the problem doesn't occur. Is there a way to avoid the PDF-to-image conversion step, for example by passing an additional argument, or is there some other solution? We can't control the type of PDF being sent to us.

Attached is an example screenshot of output with duplications.

Answer

@Julia Sizova Thanks for the question. Could you please share the sample PDF that has already been OCRed, and add more details about which Read API and OCR API versions you are using?
The Computer Vision Read API is Azure's latest OCR technology. It extracts printed text (in several languages), handwritten text (English only), digits, and currency symbols from images and multi-page PDF documents. It is optimized for text-heavy images and multi-page PDF documents with mixed languages, and it can detect both printed and handwritten text in the same image or document.

Please refer to the Read API v3.2 reference: https://centraluseuap.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-2/operations/5d986960601faab4bf452005
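
In case it helps with putting together a reproduction, here is a minimal Python sketch of submitting a PDF to Read v3.2 over REST and listing the recognized lines per page, so the duplicated lines can be copied out of the response. It is only a sketch under assumptions: the function names (`submit_read`, `wait_for_result`, `extract_lines`) are made up for illustration, the endpoint and key are whatever your own resource uses, and it does not change how the service treats PDFs that already carry a text layer.

```python
import json
import time
import urllib.request


def submit_read(endpoint: str, key: str, pdf_bytes: bytes) -> str:
    """Submit a document to Read v3.2 and return the Operation-Location URL.

    `endpoint` and `key` are placeholders for your own resource's values.
    """
    request = urllib.request.Request(
        url=f"{endpoint.rstrip('/')}/vision/v3.2/read/analyze",
        data=pdf_bytes,
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    # The service accepts the job and returns the URL to poll in a header.
    with urllib.request.urlopen(request) as response:
        return response.headers["Operation-Location"]


def wait_for_result(operation_url: str, key: str,
                    poll_seconds: float = 1.0, max_polls: int = 60) -> dict:
    """Poll the operation URL until the status is 'succeeded' or 'failed'."""
    for _ in range(max_polls):
        request = urllib.request.Request(
            operation_url, headers={"Ocp-Apim-Subscription-Key": key}
        )
        with urllib.request.urlopen(request) as response:
            result = json.load(response)
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(poll_seconds)
    raise TimeoutError("Read operation did not finish within the polling limit")


def extract_lines(result: dict) -> list:
    """Return (page number, line text) pairs from a Read v3.2 result."""
    pages = result.get("analyzeResult", {}).get("readResults", [])
    return [
        (page.get("page"), line["text"])
        for page in pages
        for line in page.get("lines", [])
    ]
```

Calling `submit_read` with the bytes of the OCRed PDF, passing the returned URL to `wait_for_result`, and printing `extract_lines(result)` gives the text of each line with its page number, which makes it easy to paste the affected lines into this thread alongside the API version used.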
