content of PageXML overwritten
I have a problem that has repeatedly led to considerable data loss. (I have the impression that I have seen an issue or discussion on this before, so I apologize in advance.)
In a set of legacy PageXMLs, some run through smoothly in v0.6; others, however, get overwritten. Here is a diff of the first 7 or 8 lines of two such files. The file on the left is processed without problems; the file on the right is overwritten as soon as it is saved.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>        | <?xml version="1.0"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pageco   <PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pageco
<Metadata>                                                      <Metadata>
<Creator/>                                                    | <Creator>User123</Creator>
<Created>2022-01-19T20:02:07</Created>                        | <Created>2021-06-16T20:13:22</Created>
<LastChange>1970-01-01T00:00:00</LastChange>                  | <LastChange>2021-06-16T20:13:22</LastChange>
<Comments/>                                                   <
</Metadata>                                                     </Metadata>
Here is a file that gets overwritten when saved with the current OCR4all docker version: https://cloud.uni-halle.de/s/cgzRExPB0xPRP3I
After saving, the file on the right is reduced to the following:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
Do you happen to know how to avoid this behaviour?
Top GitHub Comments
Fantastic help, thank you very much!
Just for the record (and as an aide-memoire for myself), I followed @bertsky's advice:
Only pages with negative coordinates didn’t validate.
Then I used the XSLT mentioned above to transform my PageXMLs.
The PageXMLs in tmp_output work perfectly fine.
Many thanks to both of you!
At least for the file provided in #301, the negative coordinates are indeed the only thing that makes the PAGE XML invalid. I just set all negative coordinate points to 0 and it loaded just fine afterwards.
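For anyone who wants to apply the "set negative coordinates to 0" workaround to many files, here is a minimal Python sketch. It is not the XSLT mentioned in this thread and not part of OCR4all or LAREX. It assumes the coordinates are stored in double-quoted points="x,y x,y ..." attributes and that the files are UTF-8 encoded. The function names (clamp_negative_coords, fix_file) are made up for illustration. It writes to a separate output path so the original file is never touched.

```python
import re
from pathlib import Path

# Assumption: coordinates live in double-quoted points="x1,y1 x2,y2 ..." attributes.
POINTS_ATTR = re.compile(r'points="([^"]*)"')
# A negative number, optionally with a fractional part.
NEGATIVE = re.compile(r"-\d+(?:\.\d+)?")


def clamp_negative_coords(xml_text: str) -> str:
    """Return xml_text with every negative value inside a points attribute set to 0."""
    def fix(match: re.Match) -> str:
        return 'points="' + NEGATIVE.sub("0", match.group(1)) + '"'
    return POINTS_ATTR.sub(fix, xml_text)


def fix_file(src: Path, dst: Path) -> bool:
    """Write a cleaned copy of src to dst; return True if anything was changed."""
    original = src.read_text(encoding="utf-8")  # assumption: UTF-8 input
    cleaned = clamp_negative_coords(original)
    dst.write_text(cleaned, encoding="utf-8")
    return cleaned != original
```

A call such as fix_file(Path("0001.xml"), Path("fixed/0001.xml")) (hypothetical file names; the output folder must already exist) leaves the source file as it was. Validating the result against the PAGE schema before loading it is still advisable, since this only addresses negative coordinates.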