Bad OCR results

I am using

ocrmypdf -l deu in.pdf out.pdf

but the OCR results are quite disappointing. The text is very clear and should be easy to detect. In this case “Dienstleistung” seems to be detected as “Diensttleistuung” (at least that’s what copy and paste reveals, and it is consistent with searching inside the PDF).

I am using:

ocrmypdf 7.0.4
tesseract 3.05.02
on macOS 10.13.6

I looked at https://github.com/tesseract-ocr/tessdata_best but I guess those are too new?

Any advice?

Top GitHub Comments

jbarlow83 commented, Sep 14, 2018

Do you have Ghostscript 9.24? (gs --version)

9.24 seems to have broken many things 😦. Using Ghostscript 9.24, the sidecar file contains only the first page, but in 9.23 the sidecar file contains all text.
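The version check suggested above could be scripted roughly as follows. This is only a sketch: the helper names `parse_gs_version` and `installed_gs_version` are made up for illustration, and it assumes `gs --version` prints a plain dotted version number such as `9.24`.

```python
import shutil
import subprocess

# Version that the comment above reports as producing an incomplete sidecar file.
AFFECTED_VERSION = (9, 24)


def parse_gs_version(text):
    """Turn output such as '9.24' into a tuple of ints, e.g. (9, 24)."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_gs_version():
    """Return the installed Ghostscript version, or None if gs is not on PATH."""
    if shutil.which("gs") is None:
        return None
    result = subprocess.run(
        ["gs", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gs_version(result.stdout)


if __name__ == "__main__":
    version = installed_gs_version()
    if version is None:
        print("gs not found on PATH")
    elif version[:2] == AFFECTED_VERSION:
        print("Ghostscript 9.24 detected; 9.23 reportedly writes a complete sidecar file")
    else:
        print("Ghostscript version:", ".".join(str(part) for part in version))
```
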

jbarlow83 commented, Sep 11, 2018

The text file confirms that Tesseract is working.
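One way to repeat that check, assuming the text file in question is OCRmyPDF's `--sidecar` output, is to count the correct and the garbled spelling in it. The function names and file names below are illustrative only, not part of OCRmyPDF.

```python
def sidecar_command(pdf_in, pdf_out, sidecar_txt, lang="deu"):
    """Build an ocrmypdf command that also writes the OCR text to sidecar_txt."""
    return ["ocrmypdf", "-l", lang, "--sidecar", sidecar_txt, pdf_in, pdf_out]


def word_counts(text, words):
    """Count case-insensitive occurrences of each word in text."""
    lowered = text.lower()
    return {word: lowered.count(word.lower()) for word in words}
```

If the sidecar text contains “Dienstleistung” and never “Diensttleistuung”, Tesseract recognised the word correctly and the garbling happens later, when a viewer extracts text from the PDF.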

It is quite likely a Preview problem. Text extraction from PDF is necessarily heuristic because PDF (as a type of print media) has no concept of “words”, just objects that are printed at specific positions. Apple seems to have little interest in fixing PDF display issues in Preview. Preview and Evince are bad and worse at text extraction, respectively. If you can check in Acrobat, it does a better job.

You can also use the Ghostscript txtwrite device to extract text, as another way to view the output: https://www.ghostscript.com/doc/9.21/VectorDevices.htm#TXT
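A minimal sketch of the txtwrite route, wrapped in Python so the command can be reused. The helper names and the file names in the usage note are placeholders; `-sDEVICE=txtwrite` and `-o` are standard Ghostscript options.

```python
import subprocess


def txtwrite_command(pdf_path, txt_path):
    """Build a Ghostscript command that writes the PDF's text to txt_path."""
    return [
        "gs",
        "-q",                 # suppress the startup banner
        "-sDEVICE=txtwrite",  # text-extraction output device
        "-o", txt_path,       # output file; -o also implies -dBATCH -dNOPAUSE
        pdf_path,
    ]


def extract_text(pdf_path, txt_path):
    """Run Ghostscript; raises CalledProcessError if gs exits with an error."""
    subprocess.run(txtwrite_command(pdf_path, txt_path), check=True)
```

Calling `extract_text("out.pdf", "out.txt")` and opening `out.txt` gives a view of the text layer that does not depend on Preview's extraction heuristics.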

That being said, it may be helpful if I can view the PDF. That would be a way to check whether anything can be done to improve the output. If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki
