I am using

ocrmypdf -l deu in.pdf out.pdf

but the OCR results are quite disappointing. The text is very clear and should be easy to detect. In this case “Dienstleistung” seems to be detected as “Diensttleistuung” (at least that is what copying and pasting reveals, and it is consistent with searching inside the PDF).

I am using:

ocrmypdf 7.0.4
tesseract 3.05.02
on macOS 10.13.6

I looked at https://github.com/tesseract-ocr/tessdata_best but I guess those are too new?

Any advice?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 16

Top GitHub Comments

jbarlow83 commented, Sep 14, 2018 (1 reaction)

Do you have Ghostscript 9.24? (gs --version)

9.24 seems to have broken many things 😦. Using Ghostscript 9.24, the sidecar file contains only the first page, but in 9.23 the sidecar file contains all text.
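
A quick check for the affected release might look like the sketch below. The helper name `warn_if_gs_924` is made up for illustration; the only facts taken from this thread are the `gs --version` command and the report that 9.24 truncates the sidecar file while 9.23 does not.

```shell
# Hypothetical helper: takes a Ghostscript version string and warns when
# it is 9.24, the release reported above to truncate the sidecar file.
# Returns 1 for 9.24 and 0 for any other version.
warn_if_gs_924() {
    case "$1" in
        9.24)
            echo "Ghostscript $1: sidecar may contain only the first page"
            return 1
            ;;
        *)
            echo "Ghostscript $1: not the version reported as broken"
            return 0
            ;;
    esac
}
```

It can be called with the installed version, for example `warn_if_gs_924 "$(gs --version)"`.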

jbarlow83 commented, Sep 11, 2018 (1 reaction)

The text file confirms that Tesseract is working.
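
One way to look at what Tesseract recognized, independent of any PDF viewer, is to have OCRmyPDF write a sidecar text file and search that file directly. The sketch below assumes the installed OCRmyPDF supports the `--sidecar` option; the file name `out.txt` and the helper `count_word` are illustrative and not taken from the thread.

```shell
# Write the recognized text to a sidecar file next to the output PDF.
# Run this manually; in.pdf, out.pdf and out.txt are placeholder names:
#   ocrmypdf -l deu --sidecar out.txt in.pdf out.pdf

# Hypothetical helper: print how many lines of a text file contain a
# fixed string. Usage: count_word WORD FILE
count_word() {
    grep -cF -- "$1" "$2"
}
```

If `count_word Dienstleistung out.txt` reports matches and `count_word Diensttleistuung out.txt` reports none, the recognition itself is probably fine and the viewer's text extraction becomes the more likely suspect.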

It is quite likely a Preview problem. Text extraction from PDF is necessarily a heuristic because (as a type of print media) PDF does not have a concept of “words”, just objects that are printed at specific positions. Apple seems to have little interest in fixing PDF display issues in Preview. Preview and Evince are bad and worse at text extraction, respectively. If you can check in Acrobat, it does a better job.

You can also use Ghostscript txtwrite to extract text, as another way to view the output: https://www.ghostscript.com/doc/9.21/VectorDevices.htm#TXT
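
A minimal sketch of a txtwrite invocation follows. The wrapper name `pdf_text` is hypothetical; the flags are standard Ghostscript switches, but the exact output layout may differ between Ghostscript versions.

```shell
# Hypothetical wrapper: dump the text layer of a PDF to stdout using
# Ghostscript's txtwrite device.
#   -q               suppress startup messages
#   -dNOPAUSE        do not pause between pages
#   -dBATCH          exit after the last file is processed
#   -sOutputFile=-   write the extracted text to stdout
pdf_text() {
    gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile=- "$1"
}
```

For example, `pdf_text out.pdf | grep Dienstleistung` would show whether the word is present in the text layer as Ghostscript reads it.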

That being said, it may be helpful if I can view the PDF. That would be a way to check whether anything can be done to improve the output. If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki
