question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

ocrd process fails due to processor's logging mixed with JSON dump

See original GitHub issue

Problem Description We use the docker image ocrd/all:minimum for testing purposes. We are running the ocrd process example from the documentation in a slightly derived form:

ocrd process -l DEBUG --overwrite  
'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE'   
'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK'  
 'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE'  
 'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR'

(Please not that we removed -p param-tess-fraktur.json from the original example) When running these processes each seperately, everything is fine and runs without error.

However, the ocrd process pipeline causes the following log output:

2020-07-15 14:13:32,208.208 INFO root - Overriding log level globally to DEBUG
2020-07-15 14:13:32,209.209 DEBUG ocrd.resolver - Deriving dst_dir /data/images from /data/images/mets.xml
2020-07-15 14:13:32,209.209 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 14:13:32,210.210 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 14:13:32,210.210 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/cli/process.py", line 27, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 125, in run_tasks
    validate_tasks(tasks, workspace, page_id, overwrite)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 94, in validate_tasks
    first_task.validate()
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 71, in validate
    param_validator = ParameterValidator(self.ocrd_tool_json)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 49, in ocrd_tool_json
    self._ocrd_tool_json = json.loads(result.stdout)
  File "/usr/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 5 (char 4)

Then, running the same script again, without changing anything. The log gives:

2020-07-15 13:30:10,292.292 INFO root - Overriding log level globally to DEBUG
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver - Deriving dst_dir /data/images from /data/images/mets.xml
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 13:30:10,293.293 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 13:30:10,294.294 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
2020-07-15 13:30:12,289.289 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-IMG'] output_file_grp=[]
2020-07-15 13:30:15,959.959 INFO ocrd.task_sequence.run_tasks - Start processing task 'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK'
2020-07-15 13:30:15,960.960 DEBUG ocrd.processor - Running subprocess 'ocrd-tesserocr-segment-region --working-dir /data/images --mets mets.xml --log-level DEBUG --input-file-grp OCR-D-SEG-PAGE --output-file-grp OCR-D-SEG-BLOCK --overwrite'
2020-07-15 13:30:17,160.160 INFO root - Overriding log level globally to DEBUG
2020-07-15 13:30:17,162.162 DEBUG ocrd.resolver - workspace_from_url
mets_basename='mets.xml'
mets_url='/data/images/mets.xml'
src_baseurl='/data/images'
dst_dir='/data/images'
2020-07-15 13:30:17,162.162 DEBUG ocrd.resolver.download_to_directory - directory=|/data/images| url=|/data/images/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
2020-07-15 13:30:17,163.163 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/data/images/mets.xml' (url: '/data/images/mets.xml')
2020-07-15 13:30:17,164.164 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-PAGE'] output_file_grp=[]
Traceback (most recent call last):
  File "/usr/bin/ocrd-tesserocr-segment-region", line 8, in <module>
    sys.exit(ocrd_tesserocr_segment_region())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd_tesserocr/cli.py", line 16, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/decorators.py", line 81, in ocrd_cli_wrap_processor
    raise Exception("Invalid input/output file grps:\n\t%s" % '\n\t'.join(report.errors))
Exception: Invalid input/output file grps:
	Input fileGrp[@USE='OCR-D-SEG-PAGE'] not in METS!
Traceback (most recent call last):
  File "/usr/bin/ocrd", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/ocrd/cli/process.py", line 27, in process_cli
    run_tasks(mets, log_level, page_id, tasks, overwrite)
  File "/usr/lib/python3.6/site-packages/ocrd/task_sequence.py", line 148, in run_tasks
    raise Exception("%s exited with non-zero return value %s" % (task.executable, returncode))
Exception: ocrd-tesserocr-segment-region exited with non-zero return value 1

So, in the second attempt, the script goes on, when in the first try it had an error.

Also, in the second attempt, although the first process in the given pipeline (cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE) should generate a fileGrp[@USE='OCR-D-SEG-PAGE'] in the mets.xml, this process seems to not run and consequently the next process cannot access this information.

Reproduction On the host, I am in a folder with an images folder, containing two digitized page images Bild1.jpg and Bild2.jpg and a Shell script problem.sh:

> ls
images/    problem.sh
> ls images/
Bild1.jpg   Bild2.jpg

The content of problem.sh is:

ocrd-import -P -i -r 300 images/
cd images
ocrd process -l DEBUG --overwrite \
  'cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-SEG-PAGE' \
  'tesserocr-segment-region -I OCR-D-SEG-PAGE -O OCR-D-SEG-BLOCK' \
  'tesserocr-segment-line -I OCR-D-SEG-BLOCK -O OCR-D-SEG-LINE' \
  'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR-TESSEROCR'

Then I run:

docker run -v $PWD:/data:Z -w /data -it ocrd/all:minimum problem.sh
# The first error occurs
docker run -v $PWD:/data:Z -w /data -it ocrd/all:minimum problem.sh
# The second error occurs

Note that we use skip the -u $(id -u) parameter, because we have podman running in the background and this parameter causes issues with file permissions.

Issue Analytics

  • State:closed
  • Created 3 years ago
  • Comments:11 (2 by maintainers)

github_iconTop GitHub Comments

1reaction
bertskycommented, Jul 17, 2020

clearly the tool itself is misbehaving (due to matplotlib’s bad design choices). But how do we tackle this danger generally?

Disabling logging [on the command line] does not help:

Here we set up logging even when all we are asked to do is dump JSON or version or help:

https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/decorators.py#L53-L59

But moving the getLogger into the branches might still not be enough. All processors inherit from ocrd.processor.Processor which necessarily imports ocrd.processor which has a module-level logging setup:

https://github.com/OCR-D/core/blob/903ac6cba493ef450a4730ede84fcd5ee81b9ddd/ocrd/ocrd/processor/base.py#L9

Maybe the only thing we can do is to try to disable all logging as soon as we know the job is to only dump the JSON:

logging.disable(logging.CRITICAL)
1reaction
GrazingScientistcommented, Jul 16, 2020

I am fine with that (if this refered to me). 😃

Read more comments on GitHub >

github_iconTop Results From Across the Web

OCR-D/Lobby - Gitter
In order to export the ocrd-tool.json file for a specific ocr-d processor is there a shorter way other than manually finding it inside...
Read more >
ocrd-anybaseocr-tiseg not applying default wiring · Issue #54 · OCR ...
The --help of ocrd-anybaseocr-tiseg states a default wiring of ['OCR-D-IMG-CROP'] ... [OFF|ERROR|WARN|INFO|DEBUG|TRACE] Log level -J, --dump-json Dump tool ...
Read more >
Dump to JSON adds additional double quotes and escaping ...
I noticed an unintended escaping of the entire data-string for a tweet being enclosed in double quotes. Furthermore, all double quotes of the...
Read more >
json — JSON encoder and decoder — Python 3.11.1 ...
A malicious JSON string may cause the decoder to consume considerable CPU and memory resources. Limiting the size of data to be parsed...
Read more >
json.dumps() in Python - GeeksforGeeks
json.dumps() function will convert a subset of Python objects into a json string. Not all objects are convertible and you may need to...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found