alto2hocr: Content in BottomMargin is not considered (PrintSpace node is missing in this example)
See original GitHub issue #95
I am targeting hOCR and trying to convert from the latest ALTO output produced by ABBYY. The header of the ALTO file is:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto-v2.0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<OCRProcessing ID="IdOcr"><ocrProcessingStep><processingDateTime>2019-08-29</processingDateTime><processingSoftware><softwareCreator>ABBYY</softwareCreator><softwareName>ABBYY FineReader Engine</softwareName><softwareVersion>12</softwareVersion></processingSoftware></ocrProcessingStep></OCRProcessing>
</Description>
<Styles>
</Styles>
...
But when I run
ocr-transform alto2.0 hocr in.alto out.hocr
I only get a header and no content:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="" lang=""><head><title>Image: </title><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/><meta name="ocr-system" content="ABBYY FineReader Engine 12"/><meta name="ocr-capabilities" content="ocr_page ocr_carea ocr_par ocr_line ocrx_word"/></head><body><div class="ocr_page" id="Page1" title="image ; bbox 0 0 2480 3507; ppageno 0"/><div class="ocr_page" id="Page2" title="image ; bbox 0 0 2480 3507; ppageno 0"/></body></html>
@zuphilip Any ideas on how to proceed?
Thanks!
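A quick way to narrow down a case like this is to check which child of each `Page` element actually holds the text. The Python sketch below counts `String` elements per page region (`PrintSpace`, `TopMargin`, `BottomMargin`, and so on). It uses only the standard library; the function names are made up for illustration and are not part of ocr-fileformat.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def count_strings_per_region(root):
    """Return {page_id: {region_name: number of String elements}}.

    `root` is the parsed <alto> element. Matching is done on local
    names, so the ALTO namespace version does not matter.
    """
    result = {}
    for page in root.iter():
        if local_name(page.tag) != "Page":
            continue
        counts = {}
        for region in page:  # direct children: PrintSpace, *Margin
            counts[local_name(region.tag)] = sum(
                1 for el in region.iter() if local_name(el.tag) == "String"
            )
        result[page.get("ID")] = counts
    return result
```

For a file on disk this could be called as `count_strings_per_region(ET.parse("in.alto").getroot())`. If `PrintSpace` is absent from the result and all `String` elements are reported under a margin region, a converter that only walks `PrintSpace` would emit empty `ocr_page` elements like the ones shown above.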
Issue Analytics
- State:
- Created 4 years ago
- Comments:15 (14 by maintainers)
@kba I do not see any content in the margin elements, so the transformation will not produce any output for them.
I think the Top and Bottom margins have been fixed now.
What to do with the Left and Right margins? There are no respective float elements specified in the hOCR spec (like ocr_header and ocr_footer).
If there are no real-life examples with Left/Right margins, I suggest closing this issue and creating another one at https://github.com/filak/hOCR-to-ALTO if it pops up someday. We can then discuss how to implement it.
I updated master a while ago, I just forgot to let you know…
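To make the open question concrete, here is a small Python sketch of the region-to-class mapping implied by the comment above. The pairing of `TopMargin` with `ocr_header` and `BottomMargin` with `ocr_footer` is an assumption read from this discussion, not taken from the actual stylesheet, and the names `ALTO_REGION_TO_HOCR` and `hocr_class_for` are hypothetical.

```python
# Assumed mapping from ALTO page regions to hOCR float classes.
# Left/Right margins have no obvious hOCR counterpart, so they map to None.
ALTO_REGION_TO_HOCR = {
    "TopMargin": "ocr_header",
    "BottomMargin": "ocr_footer",
    "LeftMargin": None,
    "RightMargin": None,
}


def hocr_class_for(region_name):
    """Return the assumed hOCR class for an ALTO region, or None if unmapped."""
    return ALTO_REGION_TO_HOCR.get(region_name)
```

Any region that returns `None` here is one for which a decision would still be needed, which is exactly the Left/Right margin case raised above.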
https://github.com/filak/hOCR-to-ALTO/commit/61bb10e6f36a6b9c65776013e2dd22a52db3575c