parse fails to validate result of to_xml


I get a regression with 1.0.0b11: the call to page_from_file fails in ocrd_models.ocrd_page_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (validate_ConfSimpleType complains that the value is a str instead of a number.)

This is what I did:

ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP

where all the OCR file groups come from a previous recognize processor in a long chain that ran through without errors. See here for what the processor does.

This is what happens:

16:05:16.373 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0001
16:05:16.375 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.378 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.381 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.383 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.385 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.387 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0002
16:05:16.389 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.391 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.393 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.396 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.399 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.401 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0003
16:05:16.402 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.405 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.407 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.410 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.412 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.415 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0004
16:05:16.417 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.419 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.422 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.424 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.427 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.430 INFO processor.EvaluateLines - processing page phys_0001
16:05:16.431 INFO processor.EvaluateLines - INPUT FILE for OCR-D-GT-SEG-LINE: OCR-D-GT-SEG-LINE_0001
16:05:16.465 INFO processor.EvaluateLines - INPUT FILE for OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP: OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001
Traceback (most recent call last):
  File "/home/xbert/unsortiert/arbeit/heyer/tools/ocrd_tesserocr/env3/bin/ocrd-asv-ann-evaluate", line 11, in <module>
    load_entry_point('ocrd-cor-asv-ann', 'console_scripts', 'ocrd-asv-ann-evaluate')()
  File "click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/cli.py", line 16, in ocrd_cor_asv_ann_evaluate
    return ocrd_cli_wrap_processor(EvaluateLines, *args, **kwargs)
  File "ocrd/decorators.py", line 38, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "ocrd/processor/base.py", line 65, in run_processor
    processor.process()
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/evaluate.py", line 71, in process
    pcgts = page_from_file(self.workspace.download_file(input_file))
  File "ocrd_modelfactory/__init__.py", line 71, in page_from_file
    return parse(input_file.local_filename, silence=True)
  File "ocrd_models/ocrd_page_generateds.py", line 11222, in parse
    rootObj.build(rootNode)
  File "ocrd_models/ocrd_page_generateds.py", line 1069, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 1084, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 2406, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 2544, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 11073, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 11155, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3057, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3122, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3446, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3499, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3776, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3837, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 4013, in build
    self.buildAttributes(node, node.attrib, already_processed)
  File "ocrd_models/ocrd_page_generateds.py", line 4030, in buildAttributes
    self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType
  File "ocrd_models/ocrd_page_generateds.py", line 3934, in validate_ConfSimpleType
    if value < 0:
TypeError: '<' not supported between instances of 'str' and 'int'

The offending PAGE-XML file is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine against the schema at http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.
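The last frames of the traceback can be reproduced in isolation. The following is only a minimal sketch: `validate_conf` is a hypothetical stand-in that mimics nothing but the `if value < 0` comparison shown in the traceback, and the `TextEquiv` snippet is an illustrative fragment, not the file from this report. It shows why a schema-valid file can still fail here: attribute values come out of the XML parser as strings, so the comparison fails unless something casts them first.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the check seen in the traceback;
# the real validate_ConfSimpleType in the generated code may do more.
def validate_conf(value):
    if value < 0:
        raise ValueError("conf must not be negative")

# Illustrative fragment only (not taken from the offending file).
node = ET.fromstring('<TextEquiv conf="0.87"/>')
raw = node.attrib["conf"]  # attribute values are always str after parsing

try:
    validate_conf(raw)
    error = None
except TypeError as exc:
    # Comparing str with int is not supported in Python 3.
    error = exc

print(type(raw).__name__, "->", error)
```

The XML itself is fine in this sketch; the problem only appears once the uncast string reaches the numeric comparison.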

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 17 (8 by maintainers)

Top GitHub Comments

2 reactions
kba commented, Aug 5, 2019

The pertinent diff in the generated code:

-            try:
-                self.conf = float(value)
-            except ValueError as exp:
-                raise ValueError('Bad float/double attribute (conf): %s' % exp)
+            self.conf = value
+            self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType

There is no more casting to float in the current code. Hence all of

set_conf("1")
set_conf(int(1))
set_conf(1.0)

are accepted and stored as-is as str, int and float respectively, but only the third one is valid. Investigating in which generateDS version between 2.30.11 and 2.33.1 this changed and whether the cast can be re-enabled.
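For illustration, the effect of the removed cast can be sketched outside the generated code. `parse_conf` is a hypothetical helper, not part of ocrd_models or generateDS; it mirrors the removed lines of the diff (cast first, then check) and adds only the `< 0` comparison known from the traceback.

```python
# Hypothetical helper mirroring the removed lines of the diff:
# cast to float first, then run the check on a real number.
def parse_conf(raw):
    try:
        conf = float(raw)
    except ValueError as exp:
        raise ValueError('Bad float/double attribute (conf): %s' % exp)
    if conf < 0:
        raise ValueError('conf must not be negative: %r' % conf)
    return conf

# The three inputs from the comment above all end up as float.
results = [parse_conf(v) for v in ("1", int(1), 1.0)]
print(results)
```

With the cast in place, a string read from an XML attribute and a number set programmatically take the same path, so the later comparison never sees a str.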

1 reaction
kba commented, Aug 2, 2019

Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed…
