Unable to perform Chinese language OCR using ocrmypdf-polyglot
Using the latest ocrmypdf-polyglot Docker image, I ran
docker run -v "$(pwd):/home/docker" jbarlow83/ocrmypdf-polyglot -v -l chi_sim textbook.pdf textbook-ocr.pdf
according to the instructions, on a single-page A4 Simplified Chinese PDF with ~1.5 line spacing. The result was mainly solid black square characters (■■■■■).
Is there a reference PDF that I can test with for Simplified Chinese (or any Asian language)?
Issue Analytics
- Created 7 years ago
- Comments: 9
Top GitHub Comments
存轻乡圣王里人的领导力
v4.2 offers a partial solution if you specify the tesseract renderer and regular PDF output as opposed to PDF/A. This bypasses two of the bugs described above.
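For anyone retrying with that workaround, here is a rough sketch of the original command with the two options added. It assumes that "the tesseract renderer" maps to the --pdf-renderer tesseract option and "regular PDF output" to --output-type pdf, and that the image in use ships v4.2 or later. The function name ocr_chinese_plain_pdf is made up for illustration.

```shell
#!/bin/sh
# Sketch only: rerun the OCR with the Tesseract renderer and plain PDF output.
# Assumption: --pdf-renderer and --output-type are the options the comment
# above refers to; check `ocrmypdf --help` inside the image to confirm.
# ocr_chinese_plain_pdf is a hypothetical helper name, not part of ocrmypdf.
ocr_chinese_plain_pdf() {
    # $1 = input PDF, $2 = output PDF, both relative to the current directory,
    # which is mounted into the container at /home/docker as in the original run.
    docker run -v "$(pwd):/home/docker" jbarlow83/ocrmypdf-polyglot \
        -v -l chi_sim \
        --pdf-renderer tesseract \
        --output-type pdf \
        "$1" "$2"
}

# Example call (not executed here):
#   ocr_chinese_plain_pdf textbook.pdf textbook-ocr.pdf
```

The only difference from the command in the report is the pair of extra options; the volume mount, language, and file names are unchanged.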
You might need to rescan that image if the OCR was not what you hoped for. There is some warping in the text baseline. (I can't help with OCR quality; that's an issue for tesseract-ocr.)
I tried manually retriggering the Docker build. (It’s usually automatic.) Hopefully that will do it.
On Thu, 4 Aug 2016 at 23:56 wallclock notifications@github.com wrote: