Issues when trying to OCR files with Arabic script

Describe the issue: I am having issues when trying to OCR documents written in Arabic script. I modified the original Docker image to add support for Arabic (installed the extra Tesseract language pack). The OCR itself seems to be fine, but I appear to have a rendering problem: when I add --sidecar output.txt to my command, I get a txt file with decent output. However, when I open the PDF and try to copy text, no spaces seem to be taken into account and every line is copied as one huge word…

To Reproduce: What command line were you trying to run?

ocrmypdf --verbose 1 --force-ocr -l ara --sidecar output.txt -k input.pdf  test-ocr.pdf

Example file: Please include an example input PDF (or image). The input file is more helpful.

http://arp.tn/site/servlet/Fichier?code_obj=106593&code_exp=1&langue=1

Please check any or all that apply about the test file:

  • This is the input file
  • The file contains no personal or confidential information
  • I am the copyright holder for this file
  • I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
  • I am not the copyright holder, but this file is available under a free software license

Files that are not free for inclusion in this project are quite welcome, but we like to collect free files for our test suite when possible. Please do not submit files with confidential information. At your option you may encrypt files for OCRmyPDF’s author only.

Expected behavior: When I copy a paragraph from the output PDF, I would like it to be copied properly, with spaces between words and so on.

System:

  • OS: macOS, using the Alpine docker image
  • OCRmyPDF Version: 8.2.2.post10+g6e49bb3.d20190403

Additional context: Using Preview Version 10.1 (944.6.16.1), but the output is even crappier in pdf.js (misplaced lines, etc.)

Could this be related to #225? Could it be that the feature implemented there only works for Latin scripts?
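The symptom described above (a good sidecar file, but copied PDF text with no word breaks) can be checked without relying on a particular viewer. The following is only a minimal sketch: the helper names `count_words_per_line` and `looks_merged` and the 0.5 threshold are hypothetical, and it assumes the PDF text layer has already been extracted to a string by some other tool (for example Poppler's pdftotext).

```python
from typing import List


def count_words_per_line(text: str) -> List[int]:
    """Return the number of whitespace-separated tokens on each non-empty line."""
    return [len(line.split()) for line in text.splitlines() if line.strip()]


def looks_merged(sidecar_text: str, pdf_text: str, threshold: float = 0.5) -> bool:
    """Heuristic (hypothetical): True when the PDF text layer contains far
    fewer word breaks than the sidecar text produced by --sidecar."""
    sidecar_words = sum(count_words_per_line(sidecar_text))
    pdf_words = sum(count_words_per_line(pdf_text))
    if sidecar_words == 0:
        # Nothing to compare against, so do not report a problem.
        return False
    return pdf_words / sidecar_words < threshold


# Illustrative strings only, not taken from the example file.
sidecar = "هذا نص تجريبي\nسطر ثان هنا"
copied = "هذانصتجريبي\nسطرثانهنا"
print(looks_merged(sidecar, copied))
```

A ratio well below 1 suggests the words were recognized but the spaces were lost when the text layer was rendered, which matches the behavior reported here.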

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5

Top GitHub Comments

1 reaction
jbarlow83 commented, Jun 5, 2019

Tesseract 4.1 accepted a fix based on my work on the hOCR renderer that should improve issues with words being stuck together. If you’re able to compile Tesseract master and you still get the same issue, please re-open and we’ll take a look.
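Before retesting, it may help to confirm which Tesseract version the Docker image actually runs. The sketch below is an illustration, not part of OCRmyPDF: the helper names are hypothetical, and it assumes the first line of `tesseract --version` looks like `tesseract 4.1.0` and that 4.1 or later contains the fix mentioned in this comment.

```python
import re
import subprocess
from typing import Optional, Tuple


def parse_tesseract_version(output: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from the first line of `tesseract --version` output."""
    stripped = output.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", first_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_tesseract_version() -> Optional[Tuple[int, int]]:
    """Run `tesseract --version`. Some older builds print to stderr, so both
    streams are read. Raises FileNotFoundError if tesseract is not on PATH."""
    result = subprocess.run(
        ["tesseract", "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    return parse_tesseract_version(result.stdout + result.stderr)


def has_spacing_fix(version: Optional[Tuple[int, int]]) -> bool:
    """Assumption: the word-spacing fix is present from Tesseract 4.1 onward."""
    return version is not None and version >= (4, 1)
```

Calling `has_spacing_fix(installed_tesseract_version())` inside the container gives a quick yes/no before re-running the OCR command.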

0 reactions
yregaieg commented, Apr 23, 2019

The text output is definitely correct; the issue I have seems to be related only to the rendering. I already found a couple of active issues on tesseract-*, so I guess there is no need to spam them with the same issue again.


