parse fails to validate result of to_xml
I get a regression with 1.0.0b11: The call to page_from_file fails at ocrd_models.ocrd_page_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)
This is what I did:
ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP
where all the OCR file groups were produced by a preceding recognition processor in a long chain that ran through without errors. See here for what the processor does.
This is what happens:
16:05:16.373 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0001
16:05:16.375 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.378 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.381 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.383 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.385 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.387 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0002
16:05:16.389 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.391 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.393 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.396 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.399 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.401 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0003
16:05:16.402 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.405 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.407 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.410 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.412 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.415 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0004
16:05:16.417 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.419 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.422 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.424 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.427 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.430 INFO processor.EvaluateLines - processing page phys_0001
16:05:16.431 INFO processor.EvaluateLines - INPUT FILE for OCR-D-GT-SEG-LINE: OCR-D-GT-SEG-LINE_0001
16:05:16.465 INFO processor.EvaluateLines - INPUT FILE for OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP: OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001
Traceback (most recent call last):
File "/home/xbert/unsortiert/arbeit/heyer/tools/ocrd_tesserocr/env3/bin/ocrd-asv-ann-evaluate", line 11, in <module>
load_entry_point('ocrd-cor-asv-ann', 'console_scripts', 'ocrd-asv-ann-evaluate')()
File "click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "click/core.py", line 717, in main
rv = self.invoke(ctx)
File "click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/cli.py", line 16, in ocrd_cor_asv_ann_evaluate
return ocrd_cli_wrap_processor(EvaluateLines, *args, **kwargs)
File "ocrd/decorators.py", line 38, in ocrd_cli_wrap_processor
run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
File "ocrd/processor/base.py", line 65, in run_processor
processor.process()
File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/evaluate.py", line 71, in process
pcgts = page_from_file(self.workspace.download_file(input_file))
File "ocrd_modelfactory/__init__.py", line 71, in page_from_file
return parse(input_file.local_filename, silence=True)
File "ocrd_models/ocrd_page_generateds.py", line 11222, in parse
rootObj.build(rootNode)
File "ocrd_models/ocrd_page_generateds.py", line 1069, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 1084, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 2406, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 2544, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 11073, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 11155, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 3057, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 3122, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 3446, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 3499, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 3776, in build
self.buildChildren(child, node, nodeName_)
File "ocrd_models/ocrd_page_generateds.py", line 3837, in buildChildren
obj_.build(child_)
File "ocrd_models/ocrd_page_generateds.py", line 4013, in build
self.buildAttributes(node, node.attrib, already_processed)
File "ocrd_models/ocrd_page_generateds.py", line 4030, in buildAttributes
self.validate_ConfSimpleType(self.conf) # validate type ConfSimpleType
File "ocrd_models/ocrd_page_generateds.py", line 3934, in validate_ConfSimpleType
if value < 0:
TypeError: '<' not supported between instances of 'str' and 'int'
The offending PAGE-XML is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine against http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.
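To rule out the file itself, one could list every conf attribute and check that it reads as a number between 0 and 1, which is what the PAGE schema expects of ConfSimpleType. The following is only a sketch using the standard library: the helper names conf_values and out_of_range and the SAMPLE document are hypothetical and not part of ocrd_models, and the sample is a stripped-down fragment, not the real file.

```python
import xml.etree.ElementTree as ET


def conf_values(xml_text):
    """Yield (local tag name, raw conf string) for each element carrying @conf."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        raw = elem.get("conf")
        if raw is not None:
            # Drop the "{namespace}" prefix that ElementTree puts in front of tags.
            yield elem.tag.rsplit("}", 1)[-1], raw


def out_of_range(xml_text):
    """Return the (tag, raw) pairs whose conf is not a float within [0, 1]."""
    bad = []
    for tag, raw in conf_values(xml_text):
        try:
            value = float(raw)
        except ValueError:
            bad.append((tag, raw))
            continue
        if not 0.0 <= value <= 1.0:
            bad.append((tag, raw))
    return bad


# Hypothetical, heavily reduced PAGE fragment for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion id="r1">
      <TextLine id="l1">
        <TextEquiv conf="0.93"><Unicode>foo</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(list(conf_values(SAMPLE)))
    print(out_of_range(SAMPLE))
```

If out_of_range returns an empty list for the real file, the attribute values are fine as text, and the problem lies in how the parser converts them.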
Top GitHub Comments
The pertinent diff in the generated code:
There is no more casting to float in the current code. Hence str, int and float values are all accepted and stored as-is, but only the third kind is valid. Investigating at which version between 2.30.11 and 2.33.1 this changed and whether it can be re-enabled.
Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed…
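The effect of the missing cast can be reproduced without OCR-D. The sketch below is not the generated code: validate_conf_uncast and validate_conf_cast are hypothetical stand-ins that only mirror the "if value < 0" check seen in the traceback, assuming the attribute arrives from the XML parser as a string.

```python
def validate_conf_uncast(value):
    """Range check on the raw value, as in the failing traceback."""
    if value < 0:  # raises TypeError when value is a str
        raise ValueError("conf below 0: %r" % (value,))
    if value > 1:
        raise ValueError("conf above 1: %r" % (value,))
    return value


def validate_conf_cast(value):
    """Same check, but with the attribute converted to float first."""
    value = float(value)  # assumption: this is the cast that went missing
    if value < 0:
        raise ValueError("conf below 0: %r" % (value,))
    if value > 1:
        raise ValueError("conf above 1: %r" % (value,))
    return value


if __name__ == "__main__":
    raw = "0.93"  # XML attribute values always arrive as strings
    print(validate_conf_cast(raw))
    try:
        validate_conf_uncast(raw)
    except TypeError as err:
        print("uncast check fails:", err)
```

With the cast in place, a string, an int and a float all end up as a float before the comparison, so the range check behaves the same regardless of how the value was passed in.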