Bad OCR results
I am using
ocrmypdf -l deu in.pdf out.pdf
but the OCR results are quite disappointing. The text is very clear and should be easy to detect. In this case “Dienstleistung” seems to be detected as “Diensttleistuung” (at least that’s what copy and paste reveals, and it is consistent with searching inside the PDF).
I am using:
ocrmypdf 7.0.4
tesseract 3.05.02
on macOS 10.13.6
I looked at https://github.com/tesseract-ocr/tessdata_best but I guess those are too new?
Any advice?
Issue Analytics
- State:
- Created 5 years ago
- Comments:16
Do you have Ghostscript 9.24 (check with gs --version)? 9.24 seems to have broken many things 😦. Using Ghostscript 9.24, the sidecar file contains only the first page, but in 9.23 the sidecar file contains all text.
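A minimal sketch of that version check, assuming gs is on the PATH and prints a plain version string such as 9.24. The helper name is_gs_924 is hypothetical and only illustrates the comparison:

```shell
# Hypothetical helper: succeeds (exit status 0) when the given version
# string is 9.24 or a 9.24.x patch release, the version reported above
# as problematic.
is_gs_924() {
    case "$1" in
        9.24|9.24.*) return 0 ;;
        *) return 1 ;;
    esac
}

# Only query Ghostscript if it is actually installed.
if command -v gs >/dev/null 2>&1; then
    if is_gs_924 "$(gs --version)"; then
        echo "Ghostscript 9.24 detected: the sidecar file may contain only the first page"
    fi
fi

# To get the sidecar text file for comparison (out.txt is an assumed name):
#   ocrmypdf -l deu --sidecar out.txt in.pdf out.pdf
```

If the sidecar file contains the correct words, Tesseract itself recognized them and the problem lies elsewhere.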
The text file confirms that Tesseract is working.
It is quite likely a Preview problem. Text extraction from PDF is necessarily a heuristic because (as a type of print media) PDF does not have a concept of “words”, just objects that are printed at specific positions. Apple seems to have little interest in fixing PDF display issues in Preview. Preview and Evince are bad and worse at text extraction respectively. If you can check Acrobat, it does a better job.
You can also use Ghostscript txtwrite to extract text, as another way to view the output: https://www.ghostscript.com/doc/9.21/VectorDevices.htm#TXT
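As a rough sketch, a txtwrite invocation could be wrapped like this. The function name extract_text and the file names are assumptions for illustration; the Ghostscript switches are the standard batch-mode ones:

```shell
# Hypothetical wrapper around Ghostscript's txtwrite device.
# Usage: extract_text INPUT.pdf OUTPUT.txt
extract_text() {
    # -q          suppress startup messages
    # -dNOPAUSE   do not pause after each page
    # -dBATCH     exit once the input file has been processed
    gs -q -dNOPAUSE -dBATCH -sDEVICE=txtwrite -sOutputFile="$2" "$1"
}

# Example (assumed file names):
#   extract_text out.pdf extracted.txt
```

Comparing the resulting text file with what Preview copies out should show whether the doubled letters come from the PDF itself or from the viewer.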
That being said, it may be helpful if I can view the PDF. That would be a way to check whether anything can be done to improve the output. If you are concerned about sharing the file publicly, you can encrypt it with my public key as described here: https://github.com/jbarlow83/OCRmyPDF/wiki