French OCR fails when running after "Importer" directory

Hi there,

I’ve been playing around with papermerge lately, great work!

Unfortunately, I can’t seem to run French OCR when using the “Importer” directory, although it works fine when I upload the file directly.

I’m using the linuxserver Docker image of papermerge.

Here are the relevant lines from papermerge.conf.py:

OCR_DEFAULT_LANGUAGE = "fra"

OCR_LANGUAGES = {
    "fra": "French",
    "eng": "English",
}
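
As a sanity check on a configuration like this, the short sketch below reports which configured language codes have no matching data file. It assumes a language is usable when a `<code>.traineddata` file exists in the tessdata directory, which is what the Tesseract error message further down points at. The function name `missing_traineddata` is hypothetical and not part of papermerge; the directory path is taken from the log in this report and may differ on other installs.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, codes):
    """Return the codes that have no <code>.traineddata file in tessdata_dir.

    Hypothetical helper for illustration; not part of papermerge.
    """
    root = Path(tessdata_dir)
    return [code for code in codes if not (root / f"{code}.traineddata").is_file()]


if __name__ == "__main__":
    # Same keys as the OCR_LANGUAGES setting shown above.
    OCR_LANGUAGES = {"fra": "French", "eng": "English"}
    # Path copied from the error message in the log; adjust for your install.
    print(missing_traineddata("/usr/share/tesseract-ocr/4.00/tessdata", OCR_LANGUAGES))
```

An empty list would mean every configured code has a data file, so a failure would have to come from a code that is not in this configuration.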

When I upload a file directly to the inbox, everything works fine. Here are the first lines of the log when grepping tesseract:

papermerge         | 2020-09-17T16:37:38.627164022Z [2020-09-17 16:37:38,626: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1
papermerge         | 2020-09-17T16:37:40.044259676Z [2020-09-17 16:37:40,043: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/125/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/125/page-1|hocr
papermerge         | 2020-09-17T16:37:41.431137020Z [2020-09-17 16:37:41,431: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/100/page-1|hocr
papermerge         | 2020-09-17T16:37:42.874047624Z [2020-09-17 16:37:42,873: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/75/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/75/page-1|hocr
papermerge         | 2020-09-17T16:37:43.865581275Z [2020-09-17 16:37:43,865: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/50/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/50/page-1|hocr
papermerge         | 2020-09-17T16:37:44.969770457Z [2020-09-17 16:37:44,969: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_2/100/page-2.jpg|/data/media/results/user_1/document_17/pages/page_2

But when I move a file to the Importer directory, this happens (complete log this time):

papermerge         | 2020-09-17T16:47:18.601873673Z [2020-09-17 16:47:18,601: INFO/ForkPoolWorker-2] Importing file /importer/e-RESA.pdf...
papermerge         | 2020-09-17T16:47:18.603650417Z [2020-09-17 16:47:18,603: INFO/ForkPoolWorker-2] Same as temp_file_name=/tmp/tmp5vrjfa8p/e-RESA.pdf...
papermerge         | 2020-09-17T16:47:18.607349962Z [2020-09-17 16:47:18,607: DEBUG/ForkPoolWorker-2] Importing file /tmp/tmp5vrjfa8p/e-RESA.pdf.
papermerge         | 2020-09-17T16:47:18.659031911Z [2020-09-17 16:47:18,658: DEBUG/ForkPoolWorker-2] Post save doc => normalize_pages
papermerge         | 2020-09-17T16:47:18.659268128Z [2020-09-17 16:47:18,659: DEBUG/ForkPoolWorker-2] Normalizing document 18
papermerge         | 2020-09-17T16:47:18.666408263Z [2020-09-17 16:47:18,666: DEBUG/ForkPoolWorker-2] Uploading file /tmp/tmp5vrjfa8p/e-RESA.pdf to docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.666610584Z [2020-09-17 16:47:18,666: DEBUG/ForkPoolWorker-2] copy_doc: /tmp/tmp5vrjfa8p/e-RESA.pdf to docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.667201448Z [2020-09-17 16:47:18,667: DEBUG/ForkPoolWorker-2] Document 18 has 1 pages
papermerge         | 2020-09-17T16:47:18.672169866Z [2020-09-17 16:47:18,671: DEBUG/ForkPoolWorker-2]  ocr_page user_id=1 doc_id=18 page_num=1
papermerge         | 2020-09-17T16:47:18.672289228Z [2020-09-17 16:47:18,672: DEBUG/ForkPoolWorker-2] subprocess: /usr/bin/file --mime-type -b /data/media/docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.677201972Z [2020-09-17 16:47:18,676: DEBUG/ForkPoolWorker-2] Mime Type = Mime(/data/media/docs/user_1/document_18/e-RESA.pdf, application/pdf)
papermerge         | 2020-09-17T16:47:18.677314130Z [2020-09-17 16:47:18,677: DEBUG/ForkPoolWorker-2] subprocess: /usr/bin/file --mime-type -b /data/media/docs/user_1/document_18/e-RESA.pdf
papermerge         | 2020-09-17T16:47:18.681960690Z [2020-09-17 16:47:18,681: DEBUG/ForkPoolWorker-2] OCR PDF document
papermerge         | 2020-09-17T16:47:18.690268186Z [2020-09-17 16:47:18,689: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/125/page-1.jpg
papermerge         | 2020-09-17T16:47:18.690354276Z [2020-09-17 16:47:18,690: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/125 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.690635824Z [2020-09-17 16:47:18,690: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|1550|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/125/page
papermerge         | 2020-09-17T16:47:18.742333188Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/100/page-1.jpg
papermerge         | 2020-09-17T16:47:18.742464233Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/100 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.742940080Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|1240|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/100/page
papermerge         | 2020-09-17T16:47:18.784122351Z [2020-09-17 16:47:18,783: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/75/page-1.jpg
papermerge         | 2020-09-17T16:47:18.784215354Z [2020-09-17 16:47:18,784: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/75 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.784377875Z [2020-09-17 16:47:18,784: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|930|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/75/page
papermerge         | 2020-09-17T16:47:18.816375434Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/50/page-1.jpg
papermerge         | 2020-09-17T16:47:18.816455236Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/50 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.816635157Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|620|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/50/page
papermerge         | 2020-09-17T16:47:18.841902783Z [2020-09-17 16:47:18,841: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/10/page-1.jpg
papermerge         | 2020-09-17T16:47:18.841992456Z [2020-09-17 16:47:18,841: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/10 does not exists. Creating.
papermerge         | 2020-09-17T16:47:18.842162119Z [2020-09-17 16:47:18,842: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|124|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/10/page
papermerge         | 2020-09-17T16:47:18.861551288Z [2020-09-17 16:47:18,861: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1
papermerge         | 2020-09-17T16:47:18.869939701Z [2020-09-17 16:47:18,869: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.869955678Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.869959187Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.869961665Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.869964273Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.869966753Z
papermerge         | 2020-09-17T16:47:18.870152882Z [2020-09-17 16:47:18,870: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/125/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/125/page-1|hocr
papermerge         | 2020-09-17T16:47:18.878382858Z [2020-09-17 16:47:18,878: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.878400899Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.878405734Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.878409389Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.878412582Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.878416071Z
papermerge         | 2020-09-17T16:47:18.878464860Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/125/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.878761200Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/125/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.878786629Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/100/page-1|hocr
papermerge         | 2020-09-17T16:47:18.886793257Z [2020-09-17 16:47:18,886: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.886810945Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.886814693Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.886817268Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.886819822Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.886822245Z
papermerge         | 2020-09-17T16:47:18.886874136Z [2020-09-17 16:47:18,886: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/100/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.886954935Z [2020-09-17 16:47:18,886: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/100/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.887103486Z [2020-09-17 16:47:18,887: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/75/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/75/page-1|hocr
papermerge         | 2020-09-17T16:47:18.895445974Z [2020-09-17 16:47:18,895: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.895466845Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.895473193Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.895477902Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.895481946Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.895485920Z
papermerge         | 2020-09-17T16:47:18.895522453Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/75/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.895572202Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/75/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.895736885Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/50/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/50/page-1|hocr
papermerge         | 2020-09-17T16:47:18.903811183Z [2020-09-17 16:47:18,903: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge         | 2020-09-17T16:47:18.903828233Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge         | 2020-09-17T16:47:18.903832541Z Failed loading language 'fre'
papermerge         | 2020-09-17T16:47:18.903835551Z Tesseract couldn't load any languages!
papermerge         | 2020-09-17T16:47:18.903838084Z Could not initialize tesseract.
papermerge         | 2020-09-17T16:47:18.903840557Z
papermerge         | 2020-09-17T16:47:18.903922423Z [2020-09-17 16:47:18,903: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/50/page-1.jpg - Complete.
papermerge         | 2020-09-17T16:47:18.904556111Z [2020-09-17 16:47:18,903: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/50/page-1.hocr.
papermerge         | 2020-09-17T16:47:18.904570069Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2]  user_id=1 doc_id=18 page_num=1 page_type=pdf total_exec_time=0.23
papermerge         | 2020-09-17T16:47:18.904586242Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] Page hocr ready: document_id=18 page_num=1
papermerge         | 2020-09-17T16:47:18.904591287Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] apply_automates: Begin.
papermerge         | 2020-09-17T16:47:18.916041990Z [2020-09-17 16:47:18,915: ERROR/ForkPoolWorker-2] Task papermerge.core.management.commands.worker.import_from_local_folder[4800dff1-ea3b-4730-9c05-0c24fc23ff10] raised unexpected: FileNotFoundError(2, 'No such file or directory')
papermerge         | 2020-09-17T16:47:18.916059667Z Traceback (most recent call last):
papermerge         | 2020-09-17T16:47:18.916064337Z   File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 385, in trace_task
papermerge         | 2020-09-17T16:47:18.916068437Z     R = retval = fun(*args, **kwargs)
papermerge         | 2020-09-17T16:47:18.916071555Z   File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 650, in __protected_call__
papermerge         | 2020-09-17T16:47:18.916075118Z     return self.run(*args, **kwargs)
papermerge         | 2020-09-17T16:47:18.916078443Z   File "/app/papermerge/papermerge/core/management/commands/worker.py", line 53, in import_from_local_folder
papermerge         | 2020-09-17T16:47:18.916082225Z     import_documents(settings.PAPERMERGE_IMPORTER_DIR)
papermerge         | 2020-09-17T16:47:18.916085667Z   File "/app/papermerge/papermerge/core/importers/local.py", line 45, in import_documents
papermerge         | 2020-09-17T16:47:18.916089370Z     imp.import_file()
papermerge         | 2020-09-17T16:47:18.916093094Z   File "/app/papermerge/papermerge/core/document_importer.py", line 106, in import_file
papermerge         | 2020-09-17T16:47:18.916097201Z     DocumentImporter.ocr_document(
papermerge         | 2020-09-17T16:47:18.916100930Z   File "/app/papermerge/papermerge/core/document_importer.py", line 156, in ocr_document
papermerge         | 2020-09-17T16:47:18.916104262Z     signals.page_ocr.send(
papermerge         | 2020-09-17T16:47:18.916107280Z   File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 173, in send
papermerge         | 2020-09-17T16:47:18.916111292Z     return [
papermerge         | 2020-09-17T16:47:18.916114621Z   File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 174, in <listcomp>
papermerge         | 2020-09-17T16:47:18.916118202Z     (receiver, receiver(signal=self, sender=sender, **named))
papermerge         | 2020-09-17T16:47:18.916121347Z   File "/app/papermerge/papermerge/core/signals.py", line 35, in apply_automates_handler
papermerge         | 2020-09-17T16:47:18.916124812Z     apply_automates(
papermerge         | 2020-09-17T16:47:18.916127941Z   File "/app/papermerge/papermerge/core/automate.py", line 45, in apply_automates
papermerge         | 2020-09-17T16:47:18.916131187Z     with open(text_path, "r") as f:
papermerge         | 2020-09-17T16:47:18.916134714Z FileNotFoundError: [Errno 2] No such file or directory: '/data/media/results/user_1/document_18/pages/page_1.txt'

Obviously, this does not work because the correct code for French is fra, not fre. But I can’t figure out why it uses fre instead of fra only when I use the Importer directory rather than a direct upload. I have double-checked the config files, and they use the correct code.
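
For context, ISO 639-2 defines two three-letter codes for French: `fre` (bibliographic) and `fra` (terminologic). The Tesseract data file for French is named `fra.traineddata`. The log suggests, but does not confirm, that the importer path picks up a bibliographic code from somewhere other than papermerge.conf.py. The sketch below only illustrates the mismatch. The function `to_tesseract_code` and its small mapping table are hypothetical and are not papermerge's actual fix.

```python
# Hypothetical helper for illustration; not part of papermerge.
# Maps a few ISO 639-2 bibliographic codes to the codes Tesseract's
# data files are named after. The table is deliberately incomplete.
BIBLIOGRAPHIC_TO_TESSERACT = {
    "fre": "fra",  # French
    "ger": "deu",  # German
    "dut": "nld",  # Dutch
}


def to_tesseract_code(code):
    """Normalize a language code; unknown codes pass through unchanged."""
    code = code.strip().lower()
    return BIBLIOGRAPHIC_TO_TESSERACT.get(code, code)
```

With this in place, `to_tesseract_code("fre")` gives `"fra"`, while codes that are already valid, such as `"fra"` or `"eng"`, are returned unchanged.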

Any idea about how we could fix this?

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
guim31 commented, Oct 22, 2020

Hi @gaalcaras! Could you tell me how you managed to change the language in the database directly?

1 reaction
ciur commented, Sep 18, 2020

Hi @gaalcaras,

great that you figured it out!

I have just checked linuxserver image 😮 …
Those guys from Linuxserver did amazing work! 🌟 First of all, they did indeed manage to wrap everything in one single Docker image! They use a different configuration (sqlite3 instead of postgresql, and uwsgi instead of apache mod_wsgi). And yes, they followed the “bare metal” approach, but again, as I mentioned, they managed to wrap the worker and the main app in a single Docker image 🎉
