content of PageXML overwritten

I have a problem that has repeatedly led to considerable data loss (and I have the impression that I have seen some Issue/discussion on that, so I apologize in advance):

In a set of legacy PageXMLs, some run through smoothly in v.0.6; others, however, get overwritten. Here is a diff of the first 7 and 8 lines, respectively. The file on the left is processed without problems; the file on the right is overwritten as soon as it is saved.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>	      |	<?xml version="1.0"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pageco	<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pageco
  <Metadata>							  <Metadata>
    <Creator/>						      |	    <Creator>User123</Creator>
    <Created>2022-01-19T20:02:07</Created>		      |	    <Created>2021-06-16T20:13:22</Created>
    <LastChange>1970-01-01T00:00:00</LastChange>	      |	    <LastChange>2021-06-16T20:13:22</LastChange>
    <Comments/>						      <
  </Metadata>							  </Metadata>

Here is a file that gets overwritten when saved with the current OCR4all docker version: https://cloud.uni-halle.de/s/cgzRExPB0xPRP3I

After saving, the file on the right is reduced to the following:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>

Do you happen to know how to avoid this behaviour?

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 7 (2 by maintainers)

Top GitHub Comments

alexander-winkler commented, Jan 20, 2022

Fantastic help, thank you very much!

Just for the record (and as an aide-memoire for myself), I followed @bertsky's advice:

wget "https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd"
xmllint --schema pagecontent.xsd --noout BOOK/processing/*.xml

Only pages with negative coordinates didn’t validate.

Then I used the XSLT mentioned above to transform my PageXMLs.

wget "https://raw.githubusercontent.com/bertsky/workflow-configuration/master/page-fix-coords.xsl"
mkdir tmp_output
for i in BOOK/processing/*.xml; do xsltproc -o "tmp_output/$(basename "$i")" page-fix-coords.xsl "$i"; done

The PageXMLs in tmp_output work perfectly fine.

Many thanks to both of you!

maxnth commented, Jan 20, 2022

Is it just the negative coordinates that cause the problem?

At least for the file provided in #301, the negative coordinates are indeed the only thing that makes the PAGE XML invalid. I just set all negative coordinate points to 0 and it loaded just fine afterwards.
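
For illustration, setting negative coordinates to 0 could be sketched in Python roughly as follows. This is not the page-fix-coords.xsl stylesheet and not what was actually run in this thread; it is a minimal text-level sketch that assumes coordinates live in double-quoted points="x,y x,y ..." attributes with integer values. The names clamp_points and fix_page_xml are hypothetical.

```python
import re

# Assumption: coordinates are stored in double-quoted points="..." attributes,
# as on PAGE XML elements such as <Coords> and <Baseline>.
POINTS_ATTR = re.compile(r'points="([^"]*)"')


def clamp_points(points: str) -> str:
    """Clamp every negative x or y value in a PAGE points string to 0."""
    clamped = []
    for pair in points.split():
        x, y = pair.split(",")
        clamped.append(f"{max(0, int(x))},{max(0, int(y))}")
    return " ".join(clamped)


def fix_page_xml(text: str) -> str:
    """Return the XML text with all points attributes clamped to >= 0."""
    return POINTS_ATTR.sub(
        lambda m: f'points="{clamp_points(m.group(1))}"', text
    )


sample = '<Coords points="-3,10 120,-1 120,80 -3,80"/>'
print(fix_page_xml(sample))
# prints: <Coords points="0,10 120,0 120,80 0,80"/>
```

Because this works on the raw text rather than on a parsed tree, it leaves the rest of the file untouched; it is still worth re-validating the result against the schema afterwards.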
