French OCR fails when running after "Importer" directory
Hi there,
I’ve been playing around with papermerge lately, great work!
Unfortunately, I can’t seem to run French OCR when using the “Importer” directory, although it works fine when I upload the file directly.
I’m using the linuxserver Docker image of papermerge.
Here are the relevant lines from papermerge.conf.py:
OCR_DEFAULT_LANGUAGE = "fra"
OCR_LANGUAGES = {
"fra": "French",
"eng": "English",
}
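As a sanity check on a config like this, the configured codes can be compared against what Tesseract reports as installed. The following is only a sketch and not part of papermerge: the helper names are hypothetical, and it assumes that `tesseract --list-langs` prints one header line followed by one language code per line.

```python
import subprocess

# Same mapping as in papermerge.conf.py above.
OCR_LANGUAGES = {"fra": "French", "eng": "English"}


def parse_list_langs(output):
    """Parse `tesseract --list-langs` output into a set of language codes.

    Assumes the first non-empty line is a header and the rest are codes.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return set(lines[1:])


def missing_languages(configured, installed):
    """Return the configured codes that Tesseract does not have installed."""
    return sorted(set(configured) - set(installed))


def installed_languages():
    """Ask the local tesseract binary which languages it can load."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some older versions print the list on stderr, so fall back to it.
    return parse_list_langs(result.stdout or result.stderr)
```

Running `missing_languages(OCR_LANGUAGES, installed_languages())` inside the container should return an empty list if every configured code has matching language data.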
When I upload a file directly to the inbox, everything works fine. Here are the first lines of the log when grepping tesseract:
papermerge | 2020-09-17T16:37:38.627164022Z [2020-09-17 16:37:38,626: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1
papermerge | 2020-09-17T16:37:40.044259676Z [2020-09-17 16:37:40,043: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/125/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/125/page-1|hocr
papermerge | 2020-09-17T16:37:41.431137020Z [2020-09-17 16:37:41,431: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/100/page-1|hocr
papermerge | 2020-09-17T16:37:42.874047624Z [2020-09-17 16:37:42,873: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/75/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/75/page-1|hocr
papermerge | 2020-09-17T16:37:43.865581275Z [2020-09-17 16:37:43,865: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_1/50/page-1.jpg|/data/media/results/user_1/document_17/pages/page_1/50/page-1|hocr
papermerge | 2020-09-17T16:37:44.969770457Z [2020-09-17 16:37:44,969: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fra|/data/media/results/user_1/document_17/pages/page_2/100/page-2.jpg|/data/media/results/user_1/document_17/pages/page_2
But when I move a file to the Importer directory, this happens (complete log this time):
papermerge | 2020-09-17T16:47:18.601873673Z [2020-09-17 16:47:18,601: INFO/ForkPoolWorker-2] Importing file /importer/e-RESA.pdf...
papermerge | 2020-09-17T16:47:18.603650417Z [2020-09-17 16:47:18,603: INFO/ForkPoolWorker-2] Same as temp_file_name=/tmp/tmp5vrjfa8p/e-RESA.pdf...
papermerge | 2020-09-17T16:47:18.607349962Z [2020-09-17 16:47:18,607: DEBUG/ForkPoolWorker-2] Importing file /tmp/tmp5vrjfa8p/e-RESA.pdf.
papermerge | 2020-09-17T16:47:18.659031911Z [2020-09-17 16:47:18,658: DEBUG/ForkPoolWorker-2] Post save doc => normalize_pages
papermerge | 2020-09-17T16:47:18.659268128Z [2020-09-17 16:47:18,659: DEBUG/ForkPoolWorker-2] Normalizing document 18
papermerge | 2020-09-17T16:47:18.666408263Z [2020-09-17 16:47:18,666: DEBUG/ForkPoolWorker-2] Uploading file /tmp/tmp5vrjfa8p/e-RESA.pdf to docs/user_1/document_18/e-RESA.pdf
papermerge | 2020-09-17T16:47:18.666610584Z [2020-09-17 16:47:18,666: DEBUG/ForkPoolWorker-2] copy_doc: /tmp/tmp5vrjfa8p/e-RESA.pdf to docs/user_1/document_18/e-RESA.pdf
papermerge | 2020-09-17T16:47:18.667201448Z [2020-09-17 16:47:18,667: DEBUG/ForkPoolWorker-2] Document 18 has 1 pages
papermerge | 2020-09-17T16:47:18.672169866Z [2020-09-17 16:47:18,671: DEBUG/ForkPoolWorker-2] ocr_page user_id=1 doc_id=18 page_num=1
papermerge | 2020-09-17T16:47:18.672289228Z [2020-09-17 16:47:18,672: DEBUG/ForkPoolWorker-2] subprocess: /usr/bin/file --mime-type -b /data/media/docs/user_1/document_18/e-RESA.pdf
papermerge | 2020-09-17T16:47:18.677201972Z [2020-09-17 16:47:18,676: DEBUG/ForkPoolWorker-2] Mime Type = Mime(/data/media/docs/user_1/document_18/e-RESA.pdf, application/pdf)
papermerge | 2020-09-17T16:47:18.677314130Z [2020-09-17 16:47:18,677: DEBUG/ForkPoolWorker-2] subprocess: /usr/bin/file --mime-type -b /data/media/docs/user_1/document_18/e-RESA.pdf
papermerge | 2020-09-17T16:47:18.681960690Z [2020-09-17 16:47:18,681: DEBUG/ForkPoolWorker-2] OCR PDF document
papermerge | 2020-09-17T16:47:18.690268186Z [2020-09-17 16:47:18,689: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/125/page-1.jpg
papermerge | 2020-09-17T16:47:18.690354276Z [2020-09-17 16:47:18,690: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/125 does not exists. Creating.
papermerge | 2020-09-17T16:47:18.690635824Z [2020-09-17 16:47:18,690: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|1550|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/125/page
papermerge | 2020-09-17T16:47:18.742333188Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/100/page-1.jpg
papermerge | 2020-09-17T16:47:18.742464233Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/100 does not exists. Creating.
papermerge | 2020-09-17T16:47:18.742940080Z [2020-09-17 16:47:18,742: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|1240|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/100/page
papermerge | 2020-09-17T16:47:18.784122351Z [2020-09-17 16:47:18,783: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/75/page-1.jpg
papermerge | 2020-09-17T16:47:18.784215354Z [2020-09-17 16:47:18,784: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/75 does not exists. Creating.
papermerge | 2020-09-17T16:47:18.784377875Z [2020-09-17 16:47:18,784: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|930|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/75/page
papermerge | 2020-09-17T16:47:18.816375434Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/50/page-1.jpg
papermerge | 2020-09-17T16:47:18.816455236Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/50 does not exists. Creating.
papermerge | 2020-09-17T16:47:18.816635157Z [2020-09-17 16:47:18,816: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|620|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/50/page
papermerge | 2020-09-17T16:47:18.841902783Z [2020-09-17 16:47:18,841: DEBUG/ForkPoolWorker-2] Extracing image for results/user_1/document_18/pages/page_1/10/page-1.jpg
papermerge | 2020-09-17T16:47:18.841992456Z [2020-09-17 16:47:18,841: DEBUG/ForkPoolWorker-2] PPMROOT /data/media/results/user_1/document_18/pages/page_1/10 does not exists. Creating.
papermerge | 2020-09-17T16:47:18.842162119Z [2020-09-17 16:47:18,842: DEBUG/ForkPoolWorker-2] Run:/usr/bin/pdftoppm|-jpeg|-f|1|-l|1|-scale-to-x|124|-scale-to-y|-1|/data/media/docs/user_1/document_18/e-RESA.pdf|/data/media/results/user_1/document_18/pages/page_1/10/page
papermerge | 2020-09-17T16:47:18.861551288Z [2020-09-17 16:47:18,861: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1
papermerge | 2020-09-17T16:47:18.869939701Z [2020-09-17 16:47:18,869: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge | 2020-09-17T16:47:18.869955678Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge | 2020-09-17T16:47:18.869959187Z Failed loading language 'fre'
papermerge | 2020-09-17T16:47:18.869961665Z Tesseract couldn't load any languages!
papermerge | 2020-09-17T16:47:18.869964273Z Could not initialize tesseract.
papermerge | 2020-09-17T16:47:18.869966753Z
papermerge | 2020-09-17T16:47:18.870152882Z [2020-09-17 16:47:18,870: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/125/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/125/page-1|hocr
papermerge | 2020-09-17T16:47:18.878382858Z [2020-09-17 16:47:18,878: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge | 2020-09-17T16:47:18.878400899Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge | 2020-09-17T16:47:18.878405734Z Failed loading language 'fre'
papermerge | 2020-09-17T16:47:18.878409389Z Tesseract couldn't load any languages!
papermerge | 2020-09-17T16:47:18.878412582Z Could not initialize tesseract.
papermerge | 2020-09-17T16:47:18.878416071Z
papermerge | 2020-09-17T16:47:18.878464860Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/125/page-1.jpg - Complete.
papermerge | 2020-09-17T16:47:18.878761200Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/125/page-1.hocr.
papermerge | 2020-09-17T16:47:18.878786629Z [2020-09-17 16:47:18,878: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/100/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/100/page-1|hocr
papermerge | 2020-09-17T16:47:18.886793257Z [2020-09-17 16:47:18,886: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge | 2020-09-17T16:47:18.886810945Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge | 2020-09-17T16:47:18.886814693Z Failed loading language 'fre'
papermerge | 2020-09-17T16:47:18.886817268Z Tesseract couldn't load any languages!
papermerge | 2020-09-17T16:47:18.886819822Z Could not initialize tesseract.
papermerge | 2020-09-17T16:47:18.886822245Z
papermerge | 2020-09-17T16:47:18.886874136Z [2020-09-17 16:47:18,886: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/100/page-1.jpg - Complete.
papermerge | 2020-09-17T16:47:18.886954935Z [2020-09-17 16:47:18,886: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/100/page-1.hocr.
papermerge | 2020-09-17T16:47:18.887103486Z [2020-09-17 16:47:18,887: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/75/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/75/page-1|hocr
papermerge | 2020-09-17T16:47:18.895445974Z [2020-09-17 16:47:18,895: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge | 2020-09-17T16:47:18.895466845Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge | 2020-09-17T16:47:18.895473193Z Failed loading language 'fre'
papermerge | 2020-09-17T16:47:18.895477902Z Tesseract couldn't load any languages!
papermerge | 2020-09-17T16:47:18.895481946Z Could not initialize tesseract.
papermerge | 2020-09-17T16:47:18.895485920Z
papermerge | 2020-09-17T16:47:18.895522453Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/75/page-1.jpg - Complete.
papermerge | 2020-09-17T16:47:18.895572202Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/75/page-1.hocr.
papermerge | 2020-09-17T16:47:18.895736885Z [2020-09-17 16:47:18,895: DEBUG/ForkPoolWorker-2] Run:/usr/bin/tesseract|-l|fre|/data/media/results/user_1/document_18/pages/page_1/50/page-1.jpg|/data/media/results/user_1/document_18/pages/page_1/50/page-1|hocr
papermerge | 2020-09-17T16:47:18.903811183Z [2020-09-17 16:47:18,903: ERROR/ForkPoolWorker-2] returncode=1 stdout= stderr=Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/fre.traineddata
papermerge | 2020-09-17T16:47:18.903828233Z Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
papermerge | 2020-09-17T16:47:18.903832541Z Failed loading language 'fre'
papermerge | 2020-09-17T16:47:18.903835551Z Tesseract couldn't load any languages!
papermerge | 2020-09-17T16:47:18.903838084Z Could not initialize tesseract.
papermerge | 2020-09-17T16:47:18.903840557Z
papermerge | 2020-09-17T16:47:18.903922423Z [2020-09-17 16:47:18,903: DEBUG/ForkPoolWorker-2] OCR for results/user_1/document_18/pages/page_1/50/page-1.jpg - Complete.
papermerge | 2020-09-17T16:47:18.904556111Z [2020-09-17 16:47:18,903: DEBUG/ForkPoolWorker-2] OCR Result results/user_1/document_18/pages/page_1/50/page-1.hocr.
papermerge | 2020-09-17T16:47:18.904570069Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] user_id=1 doc_id=18 page_num=1 page_type=pdf total_exec_time=0.23
papermerge | 2020-09-17T16:47:18.904586242Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] Page hocr ready: document_id=18 page_num=1
papermerge | 2020-09-17T16:47:18.904591287Z [2020-09-17 16:47:18,904: DEBUG/ForkPoolWorker-2] apply_automates: Begin.
papermerge | 2020-09-17T16:47:18.916041990Z [2020-09-17 16:47:18,915: ERROR/ForkPoolWorker-2] Task papermerge.core.management.commands.worker.import_from_local_folder[4800dff1-ea3b-4730-9c05-0c24fc23ff10] raised unexpected: FileNotFoundError(2, 'No such file or directory')
papermerge | 2020-09-17T16:47:18.916059667Z Traceback (most recent call last):
papermerge | 2020-09-17T16:47:18.916064337Z File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 385, in trace_task
papermerge | 2020-09-17T16:47:18.916068437Z R = retval = fun(*args, **kwargs)
papermerge | 2020-09-17T16:47:18.916071555Z File "/usr/local/lib/python3.8/dist-packages/celery/app/trace.py", line 650, in __protected_call__
papermerge | 2020-09-17T16:47:18.916075118Z return self.run(*args, **kwargs)
papermerge | 2020-09-17T16:47:18.916078443Z File "/app/papermerge/papermerge/core/management/commands/worker.py", line 53, in import_from_local_folder
papermerge | 2020-09-17T16:47:18.916082225Z import_documents(settings.PAPERMERGE_IMPORTER_DIR)
papermerge | 2020-09-17T16:47:18.916085667Z File "/app/papermerge/papermerge/core/importers/local.py", line 45, in import_documents
papermerge | 2020-09-17T16:47:18.916089370Z imp.import_file()
papermerge | 2020-09-17T16:47:18.916093094Z File "/app/papermerge/papermerge/core/document_importer.py", line 106, in import_file
papermerge | 2020-09-17T16:47:18.916097201Z DocumentImporter.ocr_document(
papermerge | 2020-09-17T16:47:18.916100930Z File "/app/papermerge/papermerge/core/document_importer.py", line 156, in ocr_document
papermerge | 2020-09-17T16:47:18.916104262Z signals.page_ocr.send(
papermerge | 2020-09-17T16:47:18.916107280Z File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 173, in send
papermerge | 2020-09-17T16:47:18.916111292Z return [
papermerge | 2020-09-17T16:47:18.916114621Z File "/usr/local/lib/python3.8/dist-packages/django/dispatch/dispatcher.py", line 174, in <listcomp>
papermerge | 2020-09-17T16:47:18.916118202Z (receiver, receiver(signal=self, sender=sender, **named))
papermerge | 2020-09-17T16:47:18.916121347Z File "/app/papermerge/papermerge/core/signals.py", line 35, in apply_automates_handler
papermerge | 2020-09-17T16:47:18.916124812Z apply_automates(
papermerge | 2020-09-17T16:47:18.916127941Z File "/app/papermerge/papermerge/core/automate.py", line 45, in apply_automates
papermerge | 2020-09-17T16:47:18.916131187Z with open(text_path, "r") as f:
papermerge | 2020-09-17T16:47:18.916134714Z FileNotFoundError: [Errno 2] No such file or directory: '/data/media/results/user_1/document_18/pages/page_1.txt'
Obviously, this does not work because the correct code for French is fra, not fre. But I can’t figure out why it uses fre instead of fra only when I use the Importer directory rather than a direct upload. I have double-checked the config files, and I have used the correct code there.
Any idea about how we could fix this?
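For context, "fre" is the ISO 639-2 bibliographic code for French and "fra" is the terminologic one. Tesseract names its language data after the latter (fra.traineddata), which is why the lookup for fre.traineddata fails. The log does not show where the importer gets "fre" from, so the following is only a diagnostic sketch with hypothetical helper names. It maps a few bibliographic codes to the ones Tesseract uses and checks that the data file exists in the tessdata directory named in the error message.

```python
from pathlib import Path

# ISO 639-2 bibliographic codes mapped to the terminologic codes that
# Tesseract's traineddata files are named after. Illustrative subset only.
BIBLIOGRAPHIC_TO_TESSERACT = {"fre": "fra", "ger": "deu", "dut": "nld"}

# Directory taken from the error message in the log above.
TESSDATA_DIR = Path("/usr/share/tesseract-ocr/4.00/tessdata")


def normalize_lang(code):
    """Return the Tesseract-style code for a possibly bibliographic code."""
    return BIBLIOGRAPHIC_TO_TESSERACT.get(code, code)


def has_traineddata(code, tessdata_dir=TESSDATA_DIR):
    """Return True if <code>.traineddata exists in the tessdata directory."""
    return (Path(tessdata_dir) / f"{code}.traineddata").is_file()


if __name__ == "__main__":
    for lang in ("fra", "fre"):
        fixed = normalize_lang(lang)
        print(lang, "->", fixed, "installed:", has_traineddata(fixed))
```

The FileNotFoundError at the end of the log looks like a follow-on effect: the first tesseract call, whose output base is .../pages/page_1, fails, so page_1.txt is presumably never written before apply_automates tries to open it.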
Top GitHub Comments
Hi @gaalcaras! Could you tell me how you managed to change the language in the database directly?
Hi @gaalcaras,
great that you figured it out!
I have just checked the linuxserver image 😮 …
Those guys from Linuxserver did amazing work! 🌟 First of all, they did indeed manage to wrap everything in one single Docker image! They use a different configuration (sqlite3 instead of postgresql, and uwsgi instead of apache mod_wsgi). And yes, they followed the “bare metal” approach, but again, as I mentioned, they managed to wrap the worker and the main app in a single Docker image 🎉