Duplicated words returned from computer vision API

Julia Sizova 1 Reputation point
2021-05-11T11:20:33.47+00:00

Hi, I'm using the Read API to extract typed and handwritten text from PDFs. When the PDF is a plain scan, everything works as expected. However, if the PDF has already been OCRed, the JSON response contains duplicated words and phrases (some of the duplicates contain typos; example attached). The duplicates appear on the same line. If I convert such a PDF to an image first, the problem doesn't occur. Is there a way to avoid the conversion step, either by passing an additional argument or through some other solution? We can't control the type of PDF being sent to us.

Attached is an example screenshot of the output with duplications: 95564-output-example.jpg

Azure Computer Vision
An Azure artificial intelligence service that analyzes content in images and video.

1 answer

  1. Ramr-msft 17,736 Reputation points
    2021-05-12T13:28:10.777+00:00

    @Julia Sizova Thanks for the question. Can you please share the already-OCRed sample PDF that you are trying, and add more details about which Read API and OCR API versions you are using?
    The Computer Vision Read API is Azure's latest OCR technology that extracts printed text (in several languages), handwritten text (English only), digits, and currency symbols from images and multi-page PDF documents. It's optimized to extract text from text-heavy images and multi-page PDF documents with mixed languages. It supports detecting both printed and handwritten text in the same image or document.

    Please follow the Read API v3.2 reference: https://centraluseuap.dev.cognitive.microsoft.com/docs/services/computer-vision-v3-2/operations/5d986960601faab4bf452005

