OCRmyPDF not correctly working with tesseract 4

ocrmypdf is not working as expected on my machine (Intel NUC i7, Debian 8 Jessie).

Tesseract itself (current git version, including updated traineddata) works and is quick.

My command line:

$ ocrmypdf -l deu+eng -g -c --force-ocr x.pdf x.out.pdf
....
WARNING -    2: [tesseract]  took too long to OCR - skipping
WARNING -    3: [tesseract]  took too long to OCR - skipping
WARNING -    1: [tesseract]  took too long to OCR - skipping
INFO - Output file is a PDF/A-2B (as expected)

No text is extracted.

When I use

$ ocrmypdf -l deu+eng -g -c --force-ocr --pdf-renderer tesseract x.pdf x.out.pdf

it takes a long time and then generates a badly OCRed PDF.

I have tesseract installed from github:

$ tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.1
  libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE

and tesseract as such works perfectly, especially with its new LSTM algorithm.

Can you please assist? I suspect a problem with the preprocessing or with how Tesseract is detected. I also tried the -v verbose option, but could not find an actual problem.

Issue Analytics

Created 7 years ago
Comments: 12

Top GitHub Comments

jbarlow83 commented, Jan 16, 2017

After I finally got tessdata set up correctly (had a corrupt osd.traineddata, thanks for listing the file size, that tipped me off), I got ocrmypdf 4.3.5 to use Tesseract 4.00 without any changes to ocrmypdf.

My best guess is that your ocrmypdf isn’t finding tesseract 4.0. Bad OCR means it’s probably finding an older tesseract 😃. It should just be searching the system PATH.

You could confirm this by checking the Producer tag in the output file, e.g. with pdfinfo (from poppler-utils), because I embed the Tesseract version in output files:

Creator:        ocrmypdf 4.3.5 / Tesseract OCR 4.00.00alpha
Producer:       GPL Ghostscript 9.19
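
The check described above can be sketched as a small shell helper. In the sample output the Tesseract version appears on the Creator line, so this sketch reads that line. The function name tesseract_version_from_pdfinfo is hypothetical and is not part of ocrmypdf or poppler; the sketch assumes the Creator line has the same layout as the sample.

```shell
#!/bin/sh
# Hypothetical helper (not part of ocrmypdf or poppler): read pdfinfo output
# on stdin and print the Tesseract version found on the Creator line.
# Prints nothing if no matching Creator line is present.
tesseract_version_from_pdfinfo() {
    sed -n 's/^Creator:.*Tesseract OCR \([^ ]*\).*$/\1/p'
}

# Demonstration using the two metadata lines quoted in the comment above.
printf '%s\n' \
    'Creator:        ocrmypdf 4.3.5 / Tesseract OCR 4.00.00alpha' \
    'Producer:       GPL Ghostscript 9.19' |
    tesseract_version_from_pdfinfo

# Against a real output file (needs pdfinfo from poppler-utils):
#   pdfinfo x.out.pdf | tesseract_version_from_pdfinfo
# To see which tesseract binary comes first on PATH:
#   command -v tesseract && tesseract --version
```

If the printed version is older than the one reported by `tesseract --version` in your shell, ocrmypdf is most likely picking up a different binary.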

Try setting the environment variable OCRMYPDF_TESSERACT to point at the 4.0 binary to see if that changes the results. You might also need to set TESSDATA_PREFIX to the parent of the 4.0 tessdata/ directory.
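
A minimal sketch of that suggestion, for the ocrmypdf version discussed in this comment. Both paths are assumptions for a Tesseract build installed under /usr/local; substitute the locations on your own system.

```shell
# Assumed install locations; adjust to wherever your Tesseract 4.0 build lives.
export OCRMYPDF_TESSERACT=/usr/local/bin/tesseract
# Parent of the tessdata/ directory (here assumed: /usr/local/share/tessdata).
export TESSDATA_PREFIX=/usr/local/share

# Then rerun the original command in the same shell:
#   ocrmypdf -l deu+eng -g -c --force-ocr x.pdf x.out.pdf
```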

hongyi-zhao commented, May 2, 2021

I ran into the same problem with the git master versions of OCRmyPDF, tesseract, and tessdata_best. Setting the number of concurrent jobs lower than the number of physical cores, say 22 on a 44-core machine, seems to solve the problem, as shown below:

$ ocrmypdf -j22 --output-type pdf -l eng Dirac-Principles\ of\ Quantum\ Mechanics.pdf out.pdf
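
The job count for another machine could be derived rather than hard-coded. The sketch below halves the detected processor count, mirroring the 22-of-44 example. The helper name half_jobs and the file names input.pdf and out.pdf are placeholders, and nproc and getconf report logical processors, which may be higher than the number of physical cores.

```shell
#!/bin/sh
# Hypothetical helper: half of the given core count, but never less than 1.
half_jobs() {
    n=$(( $1 / 2 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    echo "$n"
}

# Detect the processor count; this is the logical count, which can exceed
# the number of physical cores when hyper-threading is enabled.
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null)
cores=${cores:-2}   # fall back to 2 if detection fails

# Print the command to run; input.pdf and out.pdf are placeholder names.
echo "ocrmypdf -j$(half_jobs "$cores") --output-type pdf -l eng input.pdf out.pdf"
```

On hyper-threaded machines, half the logical count is often close to the physical core count, so a lower value may be needed to stay below it.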
