ocrd process fails due to processor's logging mixed with JSON dump
Problem Description
We use the Docker image ocrd/all:minimum for testing purposes. We are running the ocrd process example from the documentation in a slightly modified form:
ocrd process -l DEBUG --overwrite \
'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE' \
'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \
'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \
'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR'
(Please note that we removed -p param-tess-fraktur.json from the original example.)
When running each of these processors separately, everything runs without error.
However, the ocrd process pipeline produces the following log output:
2020-07-15 14:13:32,208.208 INFO root - Overriding log level globally to DEBUG
2020-07-15 14:13:32,209.209 DEBUG ocrd.resolver - Deriving dst_dir /data/images from /data/images/mets.xml
2020-07-15 14:13:32,209.209 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 14:13:32,210.210 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 14:13:32,210.210 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
Traceback (most recent call last):
File "/usr/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd/cli/process.py", line 27, in process_cli
run_tasks(mets, log_level, page_id, tasks, overwrite)
File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 125, in run_tasks
validate_tasks(tasks, workspace, page_id, overwrite)
File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 94, in validate_tasks
first_task.validate()
File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 71, in validate
param_validator = ParameterValidator(self.ocrd_tool_json)
File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 49, in ocrd_tool_json
self._ocrd_tool_json = json.loads(result.stdout)
File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)
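The position reported in this error is consistent with a log line ending up in front of the JSON on stdout. The following is a minimal sketch, assuming the captured stdout starts with a timestamped log line like the ones above; the JSON payload and the variable names are made up for illustration and are not taken from OCR-D core. The JSON parser reads the leading "2020" as a number and then stops at the "-" that follows it, at character 4.

```python
import json

# Hypothetical captured stdout: a log line followed by the JSON dump.
mixed_stdout = (
    "2020-07-15 14:13:32,208.208 INFO root - Overriding log level globally to DEBUG\n"
    '{"executable": "ocrd-example"}\n'
)

try:
    json.loads(mixed_stdout)
    error_message = None
except json.JSONDecodeError as err:
    # The parser accepted "2020" as a complete JSON value and rejects the rest.
    error_message = str(err)

print(error_message)  # Extra data: line 1 column 5 (char 4)
```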
Running the same script again, without changing anything, gives this log:
2020-07-15 13:30:10,292.292 INFO root - Overriding log level globally to DEBUG
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver - Deriving dst_dir /data/images from /data/images/mets.xml
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 13:30:10,294.294 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
2020-07-15 13:30:12,289.289 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=[]
2020-07-15 13:30:15,959.959 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK'
2020-07-15 13:30:15,960.960 DEBUG ocrd.processor - Running subprocess 'ocrd-tesserocr-segment-region --working-dir /data/images --mets mets.xml --log-level DEBUG --input-file-grp OCR-D-SEG-PAGE --output-file-grp OCR-D-SEG-BLOCK --overwrite'
2020-07-15 13:30:17,160.160 INFO root - Overriding log level globally to DEBUG
2020-07-15 13:30:17,162.162 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 13:30:17,162.162 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 13:30:17,163.163 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
2020-07-15 13:30:17,164.164 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-PAGE'] output_file_grp=[]
Traceback (most recent call last):
File "/usr/bin/ocrd-tesserocr-segment-region", line 8, in <module>
sys.exit(ocrd_tesserocr_segment_region())
File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 81, in ocrd_cli_wrap_processor
raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
Input fileGrp[@USE='OCR-D-SEG-PAGE'] not in METS!
Traceback (most recent call last):
File "/usr/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/ocrd/cli/process.py", line 27, in process_cli
run_tasks(mets, log_level, page_id, tasks, overwrite)
File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 148, in run_tasks
raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-tesserocr-segment-region exited with non-zero return value 1
So the second attempt gets further than the first one, which failed before any task was started.
However, although the first processor in the pipeline (cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE) should generate a fileGrp[@USE='OCR-D-SEG-PAGE'] in mets.xml, it does not seem to run in the second attempt, and consequently the next processor cannot find that file group.
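One way to confirm which file groups are actually present is to list the USE attributes of all fileGrp elements in mets.xml. This is a rough sketch using only the Python standard library; the helper name list_file_groups and the inline sample document are hypothetical, and in practice the content of images/mets.xml would be passed in instead.

```python
import xml.etree.ElementTree as ET

# The METS namespace used by mets.xml files.
METS_NS = {"mets": "http://www.loc.gov/METS/"}


def list_file_groups(mets_xml):
    """Return the USE attribute of every mets:fileGrp in the given METS string."""
    root = ET.fromstring(mets_xml)
    return [grp.get("USE") for grp in root.iterfind(".//mets:fileGrp", METS_NS)]


# Hypothetical METS content as it might look right after ocrd-import.
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG"/>
  </mets:fileSec>
</mets:mets>"""

groups = list_file_groups(SAMPLE_METS)
print(groups)
print("OCR-D-SEG-PAGE" in groups)
```

If OCR-D-SEG-PAGE is missing from the list after the pipeline has run, the binarization step did not write its output group.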
Reproduction
On the host, I am in a folder that contains an images folder with two digitized page images, Bild1.jpg and Bild2.jpg, and a shell script problem.sh:
> ls
images/ problem.sh
> ls images/
Bild1.jpg Bild2.jpg
The content of problem.sh is:
ocrd-import -P -i -r 300 images/
cd images
ocrd process -l DEBUG --overwrite \
'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE' \
'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \
'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \
'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR'
Then I run:
docker run -v $PWD:/data:Z -w /data -it ocrd/all:minimum problem.sh
# The first error occurs
docker run -v $PWD:/data:Z -w /data -it ocrd/all:minimum problem.sh
# The second error occurs
Note that we skip the -u $(id -u) parameter because we have podman running in the background, and this parameter causes issues with file permissions.
Top GitHub Comments
Here we set up logging even when all we are asked to do is dump JSON or version or help:
https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/decorators.py#L53-L59
But moving the getLogger into the branches might still not be enough. All processors inherit from ocrd.processor.Processor, which necessarily imports ocrd.processor, which has a module-level logging setup:
https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/processor/base.py#L9
Maybe the only thing we can do is to try to disable all logging as soon as we know the job is to only dump the JSON:
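A minimal sketch of that idea, not the actual change to OCR-D core: the function run_cli, the logger name, and TOOL_JSON are hypothetical stand-ins, and the handler writing to the output stream only simulates logging that was set up before the dump. The standard library call logging.disable(logging.CRITICAL) suppresses all log records at CRITICAL and below, so only the JSON reaches the stream.

```python
import io
import json
import logging

# Hypothetical stand-in for a processor's tool description.
TOOL_JSON = {"executable": "ocrd-example", "steps": ["preprocessing"]}


def run_cli(dump_json, stream):
    """Hypothetical entry point; logs to the same stream the JSON is written to."""
    logger = logging.getLogger("example.processor")
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    if dump_json:
        # Suppress all logging as soon as we know the job is only to dump JSON.
        logging.disable(logging.CRITICAL)
    try:
        logger.debug("setting up processor")  # would otherwise pollute the dump
        if dump_json:
            stream.write(json.dumps(TOOL_JSON) + "\n")
    finally:
        logging.disable(logging.NOTSET)  # restore normal logging
        logger.removeHandler(handler)


buffer = io.StringIO()
run_cli(dump_json=True, stream=buffer)
tool = json.loads(buffer.getvalue())  # parses cleanly, no log lines in front
print(tool["executable"])
```

Calling run_cli(dump_json=False, stream=buffer) instead leaves logging enabled, so the debug message is written to the stream as usual.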
I am fine with that (if this referred to me). 😃