OCRmyPDF not correctly working with tesseract 4


ocrmypdf is not working as expected on my machine (Intel NUC i7, Debian 8 Jessie).

Tesseract itself (current git version, including updated traineddata) works and is fast.

My command line:

$ ocrmypdf -l deu+eng -g -c --force-ocr x.pdf x.out.pdf
....
WARNING -    2: [tesseract]  took too long to OCR - skipping
WARNING -    3: [tesseract]  took too long to OCR - skipping
WARNING -    1: [tesseract]  took too long to OCR - skipping
INFO - Output file is a PDF/A-2B (as expected)

No text is extracted.

When I use

$ ocrmypdf -l deu+eng -g -c --force-ocr --pdf-renderer tesseract x.pdf x.out.pdf

it takes a long time and then generates a badly OCRed PDF.

I have Tesseract installed from GitHub:

$ tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.1
  libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE

and Tesseract on its own works perfectly, especially with its new LSTM algorithm.

Can you please assist? I suppose there is a problem with the preprocessing or with how Tesseract is detected. I also tried the -v verbose option but could not find an actual problem.

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 12

Top GitHub Comments

1 reaction
jbarlow83 commented, Jan 16, 2017

After I finally got tessdata set up correctly (had a corrupt osd.traineddata, thanks for listing the file size, that tipped me off), I got ocrmypdf 4.3.5 to use Tesseract 4.00 without any changes to ocrmypdf.

My best guess is that your ocrmypdf isn’t finding tesseract 4.0. Bad OCR means it’s probably finding an older tesseract 😃. It should just be searching the system PATH.

You could confirm this by checking the Producer tag in the output file, e.g. with pdfinfo (from poppler-utils), because I embed the Tesseract version in output files:

Creator:        ocrmypdf 4.3.5 / Tesseract OCR 4.00.00alpha
Producer:       GPL Ghostscript 9.19
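
The check described above can be scripted. The following is a minimal sketch, assuming the Creator line has the format shown in the sample; the helper name creator_tesseract_version is hypothetical and not part of ocrmypdf or poppler-utils. It reads pdfinfo output on stdin and prints only the Tesseract version that was embedded in the file.

```shell
# Hypothetical helper: print the Tesseract version embedded in the
# "Creator:" line of pdfinfo output (read from stdin).
# Assumes the "ocrmypdf X / Tesseract OCR Y" layout shown in the sample.
creator_tesseract_version() {
    sed -n 's/^Creator:.*Tesseract OCR \([^ ]*\).*$/\1/p'
}

# Typical use (requires pdfinfo from poppler-utils):
#   pdfinfo x.out.pdf | creator_tesseract_version
```

If the printed version is older than 4.00, ocrmypdf most likely picked up a different tesseract binary than the one tested by hand.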

Try setting the environment variable OCRMYPDF_TESSERACT to point at the 4.0 binary to see if that changes the results. You might also need to set TESSDATA_PREFIX to the parent of the 4.0 tessdata/ directory.
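
As a rough sketch of that suggestion: the install prefix below (/usr/local, held in the hypothetical variable TESS4_PREFIX) and the location of tessdata/ under share/ are assumptions about a typical source build, not something reported in this issue. Adjust both to wherever the 4.0 build actually lives.

```shell
# Assumed install prefix of the Tesseract 4.0 build (hypothetical default).
TESS4_PREFIX="${TESS4_PREFIX:-/usr/local}"

# Show which tesseract the shell would find first on PATH, if any.
command -v tesseract || echo "no tesseract found on PATH"

# Point ocrmypdf at the 4.0 binary explicitly.
export OCRMYPDF_TESSERACT="$TESS4_PREFIX/bin/tesseract"

# TESSDATA_PREFIX is the parent of tessdata/; assumed here to be share/.
export TESSDATA_PREFIX="$TESS4_PREFIX/share"

echo "OCRMYPDF_TESSERACT=$OCRMYPDF_TESSERACT"
echo "TESSDATA_PREFIX=$TESSDATA_PREFIX"

# Then rerun the original command in the same shell:
#   ocrmypdf -l deu+eng -g -c --force-ocr x.pdf x.out.pdf
```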

0 reactions
hongyi-zhao commented, May 2, 2021

I hit the same problem with the git master versions of OCRmyPDF, tesseract, and tessdata_best. Setting the number of concurrent jobs below the number of physical cores, for example 22 on a 44-core machine, seems to solve it, as shown below:

$ ocrmypdf -j22 --output-type pdf -l eng Dirac-Principles\ of\ Quantum\ Mechanics.pdf out.pdf
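
The job count can also be derived instead of hard-coded. This is only a sketch of the workaround above: the helper half_jobs is hypothetical, and getconf _NPROCESSORS_ONLN reports online logical CPUs, which may be higher than the number of physical cores on machines with hyper-threading.

```shell
# Hypothetical helper: half of the given core count, but never less than 1.
half_jobs() {
    cores=$1
    jobs=$((cores / 2))
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# Online logical CPUs; falls back to 1 if getconf cannot report them.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

echo "suggested option: -j$(half_jobs "$cores")"
```

The printed value can then be passed to ocrmypdf in place of the fixed -j22 from the command above.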
