OCRmyPDF not correctly working with tesseract 4
ocrmypdf is not working as expected on my machine (Intel NUC i7, Debian 8 Jessie).
Tesseract itself (current git version, including updated traineddata) works and is quick.
My command line:
$ ocrmypdf -l deu+eng -g -c --force-ocr x.pdf x.out.pdf
....
WARNING - 2: [tesseract] took too long to OCR - skipping
WARNING - 3: [tesseract] took too long to OCR - skipping
WARNING - 1: [tesseract] took too long to OCR - skipping
INFO - Output file is a PDF/A-2B (as expected)
No text is extracted.
When I use
$ ocrmypdf -l deu+eng -g -c --force-ocr --pdf-renderer tesseract x.pdf x.out.pdf
it takes a long time and then produces a badly OCRed PDF.
I have tesseract installed from github:
$ tesseract --version
tesseract 4.00.00alpha
leptonica-1.74.1
libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
Found AVX
Found SSE
Tesseract by itself works perfectly, especially with its new LSTM algorithm.
Can you please assist? I suppose there's a problem with the preprocessing or with the detection of Tesseract.
I also tried the -v verbose option, but I cannot find a real problem in the output.
Issue Analytics
- Created 7 years ago
- Comments:12
After I finally got tessdata set up correctly (had a corrupt osd.traineddata, thanks for listing the file size, that tipped me off), I got ocrmypdf 4.3.5 to use Tesseract 4.00 without any changes to ocrmypdf.
My best guess is that your ocrmypdf isn't finding tesseract 4.0. Bad OCR means it's probably finding an older tesseract 😃. It should just be searching the system PATH.
You could confirm this by checking the Producer tag in the output file, e.g. with pdfinfo (from poppler-utils), because I embed the Tesseract version in output files.
Try setting the environment variable OCRMYPDF_TESSERACT to point at the 4.0 binary to see if that changes the results. You might also need to set TESSDATA_PREFIX to the parent of the 4.0 tessdata/ directory.
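The checks suggested in this comment can be sketched in Python. This is only a rough sketch, not part of OCRmyPDF: the helper names (tesseract_on_path, producer_from_pdfinfo, read_producer, tesseract_env) are hypothetical, and any paths you pass in are your own. It assumes pdfinfo prints one "Key: value" pair per line. OCRMYPDF_TESSERACT and TESSDATA_PREFIX are the two variables named above.

```python
import os
import shutil
import subprocess


def tesseract_on_path():
    """Return the tesseract binary a plain PATH search finds, or None."""
    return shutil.which("tesseract")


def producer_from_pdfinfo(pdfinfo_output):
    """Return the Producer field from pdfinfo's text output, or None if absent."""
    for line in pdfinfo_output.splitlines():
        # Split at the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Producer":
            return value.strip()
    return None


def read_producer(pdf_path):
    """Run pdfinfo (from poppler-utils) on a file and extract its Producer field."""
    result = subprocess.run(
        ["pdfinfo", pdf_path], capture_output=True, text=True, check=True
    )
    return producer_from_pdfinfo(result.stdout)


def tesseract_env(tesseract_binary, tessdata_parent, base_env=None):
    """Return a copy of the environment with both variables from the comment set.

    tesseract_binary: path to the Tesseract 4.0 executable (your own path).
    tessdata_parent:  parent directory of the 4.0 tessdata/ directory.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OCRMYPDF_TESSERACT"] = tesseract_binary
    env["TESSDATA_PREFIX"] = tessdata_parent
    return env
```

The dictionary returned by tesseract_env can be passed as env= to subprocess.run when launching ocrmypdf, which keeps the override local to that one run instead of changing the shell's environment.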
I hit the same problem with the git master versions of OCRmyPDF, tesseract and tessdata_best. Setting the number of concurrent jobs below the number of physical cores, say 22 on a 44-core machine, seems to solve it, as shown below:
$ ocrmypdf -j22 --output-type pdf -l eng Dirac-Principles\ of\ Quantum\ Mechanics.pdf out.pdf
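The job count in the command above can also be computed instead of hard-coded. The following is a rough sketch under assumptions: the helper names half_the_cores and ocrmypdf_command are hypothetical, and "half the reported cores" is just one reading of the rule of thumb in this comment, not a documented OCRmyPDF recommendation. Note that os.cpu_count() reports logical CPUs, which on machines with hyper-threading is more than the number of physical cores.

```python
import os


def half_the_cores(core_count=None):
    """Pick a job count of half the given (or detected) cores, never below 1."""
    if core_count is None:
        core_count = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, core_count // 2)


def ocrmypdf_command(input_pdf, output_pdf, jobs, language="eng"):
    """Build the argument list for the invocation shown above; nothing is executed."""
    return [
        "ocrmypdf",
        f"-j{jobs}",
        "--output-type", "pdf",
        "-l", language,
        input_pdf,
        output_pdf,
    ]
```

For the 44-core machine mentioned here, half_the_cores(44) gives 22, so ocrmypdf_command("in.pdf", "out.pdf", half_the_cores(44)) reproduces the -j22 setting; the resulting list can be handed to subprocess.run.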