Issues when trying to OCR files with Arabic script
Describe the issue
I am having issues when trying to OCR documents containing Arabic script.
I modified the original Docker image to add support for Arabic (installed the extra Tesseract language pack).
The OCR itself seems to be fine, but I am experiencing an issue with rendering: when I add --sidecar output.txt to my command, I get a txt file with decent output. However, when I open the PDF and try to copy text, spaces are apparently ignored and every line is copied as one huge word…
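One way to quantify the symptom described above is to compare word statistics between the sidecar file and the text layer extracted from the PDF. The following is only a minimal sketch, not part of OCRmyPDF: the helper names (`longest_token`, `spacing_report`, `extract_pdf_text`) and the "fewer than half as many words" threshold are assumptions made for illustration, and it assumes Poppler's `pdftotext` is installed. A different extractor (Preview, pdf.js) may segment words differently.

```python
import subprocess


def longest_token(text: str) -> int:
    """Length of the longest whitespace-delimited token in text (0 if empty)."""
    return max((len(tok) for tok in text.split()), default=0)


def spacing_report(sidecar_text: str, pdf_text: str) -> dict:
    """Compare word statistics of the sidecar text and the PDF text layer."""
    sidecar_words = len(sidecar_text.split())
    pdf_words = len(pdf_text.split())
    return {
        "sidecar_words": sidecar_words,
        "pdf_words": pdf_words,
        "pdf_longest_token": longest_token(pdf_text),
        # Heuristic threshold (an assumption): flag the PDF when it contains
        # fewer than half as many words as the sidecar file.
        "spaces_lost": pdf_words * 2 < sidecar_words,
    }


def extract_pdf_text(pdf_path: str) -> str:
    """Extract the text layer with Poppler's pdftotext ("-" writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

With the files from this report, one would read `output.txt` as UTF-8 and pass its contents together with `extract_pdf_text("test-ocr.pdf")` to `spacing_report`. A `pdf_words` count far below `sidecar_words`, or a very large `pdf_longest_token`, would be consistent with words being glued together in the text layer.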
To Reproduce
What command line were you trying to run?
ocrmypdf --verbose 1 --force-ocr -l ara --sidecar output.txt -k input.pdf test-ocr.pdf
Example file
Please include an example input PDF (or image). The input file is more helpful.
http://arp.tn/site/servlet/Fichier?code_obj=106593&code_exp=1&langue=1
Please check any or all that apply about the test file:
- This is the input file
- The file contains no personal or confidential information
- I am the copyright holder for this file
- I permit this file to be included in the OCRmyPDF test suite under the CC-BY-SA 4.0 license
- I am not the copyright holder, but this file is available under a free software license
Files that are not free for inclusion in this project are quite welcome, but we like to collect free files for our test suite when possible. Please do not submit files with confidential information. At your option you may encrypt files for OCRmyPDF’s author only.
Expected behavior
When I copy a paragraph from the output PDF, I would like it to be copied properly, with spaces between words, etc.
System:
- OS: macOS, using the Alpine docker image
- OCRmyPDF Version: 8.2.2.post10+g6e49bb3.d20190403
Additional context
Using Preview Version 10.1 (944.6.16.1), but the output is even worse in pdf.js (misplaced lines, etc.)
Could this be related to #225? Could it be that the feature implemented there only works for Latin scripts?
Issue Analytics
- Created 4 years ago
- Comments: 5
Top GitHub Comments
Tesseract 4.1 accepted a fix based on my work on the hocr renderer that should improve issues with words being stuck together. If you’re able to compile Tesseract master and you still get the same issue, please re-open and we’ll take a look.
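To see whether an installed Tesseract is at least version 4.1, the banner printed by `tesseract --version` can be parsed. This is a rough sketch only: the function names are hypothetical, and it assumes the first line of the banner looks like `tesseract 4.1.0`. A build compiled from master may report a version string that does not reliably indicate whether the fix is present.

```python
import re
import subprocess


def parse_tesseract_version(version_output: str) -> tuple:
    """Return (major, minor) parsed from the first line of the version banner."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    # Assumed banner shape: "tesseract 4.1.0" or "tesseract v5.0.0-alpha".
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError(f"unrecognized version string: {first_line!r}")
    return int(match.group(1)), int(match.group(2))


def is_at_least_4_1(version_output: str) -> bool:
    """True when the reported version is 4.1 or newer."""
    return parse_tesseract_version(version_output) >= (4, 1)


def installed_tesseract_banner() -> str:
    """Run `tesseract --version` and return whatever it printed."""
    result = subprocess.run(
        ["tesseract", "--version"],
        check=True,
        capture_output=True,
    )
    # Read both streams in case a build prints the banner on stderr.
    return (result.stdout + result.stderr).decode("utf-8", errors="replace")
```

Calling `is_at_least_4_1(installed_tesseract_banner())` inside the Docker image gives a quick indication of whether rebuilding Tesseract is worth trying before re-testing the copy-and-paste behavior.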
The text output is definitely correct; the issue I have seems to be related only to the rendering. I already found a couple of active issues on tesseract-*, so I guess there is no need to spam them with the same issue again.