Issues when trying to OCR files with Arabic script

Describe the issue

I am having issues when trying to OCR documents that contain Arabic script. I modified the original Docker image to add support for Arabic (installed the extra Tesseract language pack). The OCR itself seems to be fine, but I appear to have a rendering problem: when I add --sidecar output.txt to my command, I get a txt file with decent output. However, when I open the PDF and try to copy text, it looks like no spaces are taken into account and every line is copied as one huge word…

To Reproduce

What command line were you trying to run?

ocrmypdf --verbose 1 --force-ocr -l ara --sidecar output.txt -k input.pdf test-ocr.pdf

Example file

Please include an example input PDF (or image). The input file is more helpful.

http://arp.tn/site/servlet/Fichier?code_obj=106593&code_exp=1&langue=1

Please check any or all that apply about the test file:

This is the input file
The file contains no personal or confidential information
I am the copyright holder for this file
I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
I am not the copyright holder, but this file is available under a free software license

Files that are not free for inclusion in this project are quite welcome, but we like to collect free files for our test suite when possible. Please do not submit files with confidential information. At your option you may encrypt files for OCRmyPDF’s author only.

Expected behavior

When I try to copy a paragraph from the output PDF, I would like it to be copied properly, with spaces between words etc.
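To make the symptom easier to report, the sidecar text can be compared line by line against the text copied out of the PDF. The sketch below is only an illustration, not part of OCRmyPDF: `find_merged_lines` and `_squash` are hypothetical helper names, and it assumes you have already obtained the PDF text as a string (by copy and paste, or with any text extraction tool) and that both sources give the characters of a line in the same order.

```python
import re
import unicodedata
from typing import Dict, List, Tuple


def _squash(line: str) -> str:
    """Normalize a line and drop all whitespace so two renderings can be compared."""
    # NFKC folds compatibility forms (e.g. Arabic presentation forms) to base letters.
    return re.sub(r"\s+", "", unicodedata.normalize("NFKC", line))


def find_merged_lines(sidecar_text: str, pdf_text: str) -> List[Tuple[str, str]]:
    """Return (sidecar_line, pdf_line) pairs that contain the same characters
    but where the PDF version has fewer whitespace-separated words."""
    # Index the PDF lines by their whitespace-free form.
    pdf_by_key: Dict[str, str] = {}
    for line in pdf_text.splitlines():
        key = _squash(line)
        if key:
            pdf_by_key.setdefault(key, line)

    merged = []
    for line in sidecar_text.splitlines():
        pdf_line = pdf_by_key.get(_squash(line))
        # Same characters, fewer words: the spaces were lost in the PDF text layer.
        if pdf_line is not None and len(pdf_line.split()) < len(line.split()):
            merged.append((line, pdf_line))
    return merged
```

Calling `find_merged_lines(open("output.txt", encoding="utf-8").read(), copied_text)` returns the lines whose spaces were lost. A non-empty result points to the text layer in the PDF, not the recognition step.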

System:

OS: macOS, using the Alpine docker image
OCRmyPDF Version: 8.2.2.post10+g6e49bb3.d20190403

Additional context

Using Preview Version 10.1 (944.6.16.1), but the output is even crappier in pdf.js (misplaced lines … etc)

Could this be related to #225? Could it be that the feature implemented there only works for Latin scripts?

Top GitHub Comments

jbarlow83 commented, Jun 5, 2019

Tesseract 4.1 accepted a fix based on my work on the hocr renderer that should improve issues with words being stuck together. If you’re able to compile Tesseract master and you still get the same issue, please re-open and we’ll take a look.
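Before re-testing, it may help to confirm which Tesseract the Docker image actually runs. The following is a rough sketch with hypothetical helper names (`parse_tesseract_version`, `has_spacing_fix`). It assumes the first line of the `tesseract --version` output looks like `tesseract 4.1.0`, and it treats any version from 4.1 onwards as containing the fix, which may not hold for pre-release builds.

```python
import re
from typing import Optional, Tuple


def parse_tesseract_version(version_output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the first line of `tesseract --version` output.

    Assumes a first line such as "tesseract 4.1.0"; returns None otherwise.
    """
    lines = version_output.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_spacing_fix(version_output: str) -> bool:
    """True if the reported version is 4.1 or newer (assumed to include the fix)."""
    version = parse_tesseract_version(version_output)
    return version is not None and version >= (4, 1)
```

The version text can be captured with `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`. Checking both `stdout` and `stderr` is the safer choice, since the stream used may differ between releases.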

yregaieg commented, Apr 23, 2019

The text output is definitely correct; the issue I have seems to be related only to the rendering. I already found a couple of active issues on tesseract-*, so I guess there is no need to spam them with the same issue again.