Unable to perform Chinese language OCR using ocrmypdf-polyglot

See original GitHub issue: https://github.com/jbarlow83/OCRmyPDF/issues/81

Using the latest ocrmypdf-polyglot Docker image, I followed the instructions and ran docker run -v "$(pwd):/home/docker" jbarlow83/ocrmypdf-polyglot -v -l chi_sim textbook.pdf textbook-ocr.pdf on a single-page A4 Simplified Chinese PDF with ~1.5 line spacing. The result was mostly solid black square characters (■■■■■).

Is there a reference PDF I can test with for Simplified Chinese (or any Asian language)?

Issue Analytics

  • State: closed
  • Created 7 years ago
  • Comments: 9

Top GitHub Comments

1 reaction
jbarlow83 commented, Aug 3, 2016

存轻乡圣王里人的领导力

v4.2 offers a partial solution if you specify the tesseract renderer and regular PDF output as opposed to PDF/A. This bypasses two of the bugs described above.

ocrmypdf -f -l chi_sim --pdf-renderer tesseract --output-type pdf input.pdf output.pdf

You might need to rescan that image if the OCR was not what you hoped for. There is some warping in the text baseline. (I can’t help with OCR quality; that’s an issue for tesseract-ocr.)
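The workaround command in this comment can be wrapped in a small shell helper so that it is easy to preview before running. The following is only a sketch: the function names and the OCR_DRY_RUN_CMD variable are hypothetical, the flags are the ones quoted above for v4.2 (other releases may have changed or removed them, so check ocrmypdf --help), and the Docker variant simply combines those flags with the image and volume mount from the original report. The --rm flag is an addition that removes the container after it exits.

```shell
#!/bin/sh
# Sketch only: helper names and OCR_DRY_RUN_CMD are hypothetical, not part of OCRmyPDF.
# Set OCR_DRY_RUN_CMD=echo to print the command instead of running it.

# Local install. $1 = Tesseract language code, $2 = input PDF, $3 = output PDF.
ocr_workaround() {
    ${OCR_DRY_RUN_CMD:-} ocrmypdf -f -l "$1" \
        --pdf-renderer tesseract --output-type pdf "$2" "$3"
}

# Same flags through the polyglot Docker image from the issue.
# The current directory is mounted at /home/docker, as in the original command.
ocr_workaround_docker() {
    ${OCR_DRY_RUN_CMD:-} docker run --rm -v "$(pwd):/home/docker" \
        jbarlow83/ocrmypdf-polyglot \
        -f -l "$1" --pdf-renderer tesseract --output-type pdf "$2" "$3"
}
```

For example, OCR_DRY_RUN_CMD=echo ocr_workaround chi_sim input.pdf output.pdf prints the command line without invoking ocrmypdf, which is a cheap way to confirm the arguments before a long OCR run.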

0 reactions
jbarlow83 commented, Aug 5, 2016

I tried manually retriggering the Docker build. (It’s usually automatic.) Hopefully that will do it.

On Thu, 4 Aug 2016 at 23:56 wallclock notifications@github.com wrote:

Great! I will be happy to try out the new release. Unfortunately, the latest polyglot docker image released 2 days ago seems to be version 4.1.3.
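Since the workaround needs v4.2 and the published image was still reporting 4.1.3, it may help to check the image version before retrying. The sketch below is a rough illustration rather than an official procedure: the helper names are hypothetical, and it assumes that the image's entrypoint is ocrmypdf (as the command in the original report suggests) and that --version prints a bare version number.

```shell
#!/bin/sh
# Sketch only: IMAGE comes from the issue; the helper names are hypothetical.
IMAGE="jbarlow83/ocrmypdf-polyglot"

# Succeed if dotted version $1 is >= dotted version $2 (up to three numeric parts).
# Caveat: trailing ".0" parts are not normalized, so "4.2" is not treated as >= "4.2.0".
version_at_least() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -t. -k1,1n -k2,2n -k3,3n | head -n 1)
    [ "$lowest" = "$2" ]
}

# Pull the newest image and print the version it reports.
# Assumes the entrypoint is ocrmypdf and that --version prints only the version.
image_version() {
    docker pull "$IMAGE" >/dev/null && docker run --rm "$IMAGE" --version
}

# Usage (commented out so that sourcing this file does not invoke Docker):
# if version_at_least "$(image_version)" 4.2; then
#     echo "image is new enough for the workaround"
# else
#     echo "image is still older than 4.2; wait for the rebuild"
# fi
```

The comparison sorts the two versions numerically field by field and checks which one comes first, so 4.1.3 is correctly treated as older than 4.2 even though it has more components.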



Top Results From Across the Web

Installation - ocrmypdf 8.0.0 documentation - Read the Docs
Latest ocrmypdf with Tesseract 4.0.0-beta1 on Ubuntu 18.04. Includes English, French, German, Spanish, Portuguese and Simplified Chinese. ocrmypdf-polyglot ...

ocrmypdf 4.0.5 - PyPI
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be ... If you need all language packs docker pull...

ocrmypdf Documentation - manualzz
OCRmyPDF uses Tesseract for OCR, and relies on its language packs for languages ... tesseract-ocr-chi-sim # Example: Install Chinese Simplified language.

ocrmypdf Documentation - UserManual.wiki
It uses Ghostscript to rasterize the page, and then performs on ... In some cases, ocrmypdf [-c|--clean] failed to exit with an error...
