alto2hocr: Content in BottomMargin is not considered (PrintSpace node is missing in this example)

cf #95

I am targeting hOCR and trying to convert to it from the latest form of ALTO produced by ABBYY. The header of the ALTO file is:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto-v2.0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<OCRProcessing ID="IdOcr"><ocrProcessingStep><processingDateTime>2019-08-29</processingDateTime><processingSoftware><softwareCreator>ABBYY</softwareCreator><softwareName>ABBYY FineReader Engine</softwareName><softwareVersion>12</softwareVersion></processingSoftware></ocrProcessingStep></OCRProcessing>
</Description>
<Styles>
</Styles>
...

But when I run

ocr-transform alto2.0 hocr in.alto out.hocr

I only get a header and no content:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="" lang=""><head><title>Image: </title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><meta name="ocr-system" content="ABBYY FineReader Engine 12"/><meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word"/></head><body><div class="ocr_page" id="Page1" title="image ; bbox 0 0 2480 3507; ppageno 0"/><div class="ocr_page" id="Page2" title="image ; bbox 0 0 2480 3507; ppageno 0"/></body></html>

@zuphilip Any ideas on how to proceed?

Thanks!
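
A quick way to check where the recognized text actually sits in such an ALTO file is to count the `String` elements under each direct child of `Page` (`PrintSpace`, `TopMargin`, `BottomMargin`, and so on). The following is only a minimal diagnostic sketch using the Python standard library; the function name `count_strings_per_region` and the `SAMPLE` document are made up for illustration and are not part of ocr-fileformat. It assumes the ALTO v2 namespace shown in the header above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the ALTO header quoted in the issue.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def count_strings_per_region(alto_xml):
    """For each Page, map every direct child element name to the
    number of String elements found anywhere below it."""
    root = ET.fromstring(alto_xml)
    pages = []
    for page in root.iter("{%s}Page" % ALTO_NS):
        counts = {}
        for region in page:
            # Strip the "{namespace}" prefix from the tag name.
            name = region.tag.split("}", 1)[-1]
            counts[name] = sum(1 for _ in region.iter("{%s}String" % ALTO_NS))
        pages.append(counts)
    return pages


# Hypothetical sample: no PrintSpace, all text inside BottomMargin.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="Page1" WIDTH="2480" HEIGHT="3507">
      <TopMargin/>
      <BottomMargin>
        <TextBlock ID="b1">
          <TextLine>
            <String CONTENT="hello"/>
            <String CONTENT="world"/>
          </TextLine>
        </TextBlock>
      </BottomMargin>
    </Page>
  </Layout>
</alto>"""

for number, counts in enumerate(count_strings_per_region(SAMPLE), start=1):
    print("Page", number, counts)
```

If a page reports no `PrintSpace` entry at all and every string is counted under a margin element, that matches the situation in the issue title.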

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments:15 (14 by maintainers)

Top GitHub Comments

filak commented, Jan 3, 2020

@kba I do not see any content in the margin elements, so the transformation will produce no output for them.

I think the Top and Bottom margins have been fixed now.

What to do with the Left and Right margins? There are no respective float elements specified in the hOCR spec (like ocr_header and ocr_footer).

If there are no real-life examples with Left/Right margins, I suggest closing this issue and creating another one at https://github.com/filak/hOCR-to-ALTO if one pops up someday. We can then discuss how to implement it.
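
To make the open question concrete, here is a small sketch of the mapping problem. It is an assumption for illustration only, not the logic of the actual stylesheets: it supposes that `TopMargin` and `BottomMargin` are emitted as the hOCR float classes `ocr_header` and `ocr_footer`, and it leaves `LeftMargin` and `RightMargin` unmapped because the hOCR spec names no corresponding class. The names `MARGIN_TO_HOCR_CLASS`, `hocr_class_for`, and `regions_without_float_class` are hypothetical.

```python
# Hypothetical mapping from ALTO margin elements to hOCR float classes.
# Assumption: top/bottom margins become header/footer; the real
# stylesheets may handle this differently.
MARGIN_TO_HOCR_CLASS = {
    "TopMargin": "ocr_header",
    "BottomMargin": "ocr_footer",
    "LeftMargin": None,   # no matching float class in the hOCR spec
    "RightMargin": None,  # no matching float class in the hOCR spec
}


def hocr_class_for(alto_region):
    """Return the assumed hOCR class for an ALTO margin element,
    or None when no class is mapped."""
    return MARGIN_TO_HOCR_CLASS.get(alto_region)


def regions_without_float_class(alto_regions):
    """List the margin elements that would still need a design decision."""
    return [
        name for name in alto_regions
        if name in MARGIN_TO_HOCR_CLASS and MARGIN_TO_HOCR_CLASS[name] is None
    ]


margins = ["TopMargin", "LeftMargin", "RightMargin", "BottomMargin"]
print(regions_without_float_class(margins))
```

Under these assumptions only the left and right margins remain undecided, which is why a real-life example would help before choosing an output class for them.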

filak commented, Mar 24, 2020

I updated master a while ago and just forgot to let you know:

https://github.com/filak/hOCR-to-ALTO/commit/61bb10e6f36a6b9c65776013e2dd22a52db3575c
