Converting hOCR to Alto
Hi, first of all, thanks for making this tool.
I have a question about using the GUI to convert hOCR to ALTO XML.
My hOCR file looks as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="unknown" lang="unknown">
<head>
<title>None</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name='ocr-system' content='gcv2hocr.py' />
<meta name='ocr-langs' content='unknown' />
<meta name='ocr-number-of-pages' content='1' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_line ocrx_word ocrp_lang'/>
</head>
<body>
<div class='ocr_page' lang='unknown' title='bbox 0 0 1420 2068'>
<div class='ocr_carea' lang='unknown' title='bbox 176 121 1420 2068'>
<span class='ocr_line' id='line_0' title='bbox 678 121 747 168; baseline 0 -5'>
<span class='ocrx_word' id='word_0_0' title='bbox 678 121 747 168'>2T</span>
</span>
<span class='ocr_line' id='line_1' title='bbox 383 184 572 218; baseline 0 -5'>
<span class='ocrx_word' id='word_1_0' title='bbox 383 184 572 218'>Especially</span>
</span>
<span class='ocr_line' id='line_2' title='bbox 583 184 697 218; baseline 0 -5'>
<span class='ocrx_word' id='word_2_0' title='bbox 583 184 697 218'>during</span>
</span>
<span class='ocr_line' id='line_3' title='bbox 722 188 775 215; baseline 0 -5'>
<span class='ocrx_word' id='word_3_0' title='bbox 722 188 775 215'>the</span>
</span>
<span class='ocr_line' id='line_4' title='bbox 796 186 888 218; baseline 0 -5'>
<span class='ocrx_word' id='word_4_0' title='bbox 796 186 888 218'>years</span>
</span>
<span class='ocr_line' id='line_5' title='bbox 904 184 977 218; baseline 0 -5'>
<span class='ocrx_word' id='word_5_0' title='bbox 904 184 977 218'>1933</span>
</span>
<span class='ocr_line' id='line_6' title='bbox 1040 187 1110 218; baseline 0 -5'>
<span class='ocrx_word' id='word_6_0' title='bbox 1040 187 1110 218'>1938</span>
</span>
But the GUI gives me two ALTO XML files as output, which look like this:
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName/>
</sourceImageInformation>
<OCRProcessing ID="IdOcr">
<ocrProcessingStep>
<processingSoftware>
<softwareName>gcv2hocr.py</softwareName>
<softwareVersion>gcv2hocr.py</softwareVersion>
</processingSoftware>
</ocrProcessingStep>
</OCRProcessing>
</Description>
<Layout>
<Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
<PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
<ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176"/>
</PrintSpace>
</Page>
</Layout>
</alto>
and
<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto.xsd">None2TEspeciallyduringtheyears19331938theGermanun-employmentwasfullyremoved.LikemanyothershealsothoughtthatNationlasocialismvouldcauseaneconomicrisejoiningtheSAinApril1937Inforeigncountriestoo,Nationalsocialismwasnotrecognizedinitslasterfectsinthosedays.Imayremindyouofthefactthate.g.LordRothermeredevotedaspecialcopyofthe"DailyMailtotheNSDAPandaman1iaeMrWinstonChurchillwritesinhisreminiscences:"AtthattimeIhadnonationalprejudicesagainstHitler.Iknewbut1ittleofhisopinionoflifeandpastandhisoharacter.TomymindHitlerwasrighttobeaGerman1ovinghiscountry"Nodoubt,thatevenmoresuchorsimilarutterancesofstatesmenareknown.Atthattimemyhusbandcouldnotforeseethatbyhisjoininghewouldpromoteorsupportacriminalaffair.In1937hewasbusyasanassistantfortheknow-ledgeofkinsattheAnthropologicInstituteoftheUnivezaityofVienna.InSept.1937hepassedtothegeneralSS,becausehecouldbebusyasanivestigatorofkins.WaenAustriauasannexed,hecouldjointheGermanPolice.Afberyearsoftroublesanddistressnowhegotasafepoşitionasanofficial.Whenhewascalledouttothefrontier-guard(controlofpassports)onApril1st,1938hismembershiptothegeneralSSwasextinguished.HislatertransfertotheSDandtotheWafen-SS"wasnotvoluntary.DhusmyhusbanddoesnotbelongtotheciroleofthosemembersoftheSSwhomustbecosideredasCriminalsaccordingtothejudgementsofuremberg,becauseonlythosecounttothemwhoweremembersofthe3SstillfterSept.1st,1939.ThelatercompulsoryassimilationofranksintheSDandthe"Waffen-s"isotconsideredasamembershipothe3Saspertherulingpracticeofall"SpruchkammerInthecourseofageneraltraining-planinin1944myhusbandcametotheKRIPOforthreemonthstobeemployedthereforinformetionpurposes.ThenBourmonthsfollowedat.theSIAPOtobetrained1ateroninother1inesotheGeImanPolice.AstherewasalackofmenattheSTAPO,theycausedthepro-longationofhiscommendandinFebr.1945histransfertotheSTAPO.MyhusbandhasseveraltimestriedtoleavetheSTAFOandf1nallyappliedforbeingemployedasavoluateeratthefront.A
1lhisapplicationswererefused.FurthertrialsWouldbeperhapspunishedasadenialofobedienceoradecompo-sitionof,themilitgry.ref.3)InFebr.andMarcha945asamemberoftheArmedForoesofthethenGermanymyhusbandshotdownanalliedterror-flyereachi.e.anenemyeirforce-manwhohadfiredabwomenandchildrenatBensheim/Germanyinalowflight,andthisonaccouatofadirectmilitaryandthereforebindingorderofhisdirectsuperior.Hewasorderedtodosobytheleaderofhisunit,SS-SourmbannführerandcouscillortothegovernmentGIRKEorbyhesdeputySS-sturmbannführerandcouncillortotheKRIPOHELLENBROICHresp.InFébr.1945Girkeaskedbyphonethecom-petentCommanderoftheSIPOSS-OberführerTRUMMLER,whethertheorderissuedfromBerlinbesti1lvalidbywhichterror-flyersweretobelki1led.TrummleransweredintheaffirmativeandP.t.o.</alto>
I’ve not worked with ALTO formats before, but I’m thinking it shouldn’t look like this? Please let me know what you think, any help would be greatly appreciated!
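For comparison, ALTO normally stores recognized text in the CONTENT attribute of String elements nested inside TextLine elements, not as loose text directly under the root alto element. The sketch below is one rough way to check a converted file for that. The helper name summarize_alto and the two sample strings are made up for illustration, and the namespace is the ALTO v2 one from the output above; other ALTO versions use a different namespace URI.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the output shown above (ALTO v2); adjust for other versions.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"


def summarize_alto(xml_text):
    """Return counts that hint at whether an ALTO document carries usable text."""
    root = ET.fromstring(xml_text)
    ns = {"a": ALTO_NS}
    strings = root.findall(".//a:String", ns)
    return {
        "text_lines": len(root.findall(".//a:TextLine", ns)),
        "strings": len(strings),
        "words": [s.get("CONTENT", "") for s in strings],
        # Text sitting directly under <alto> likely means the stylesheet only
        # copied raw text nodes instead of building ALTO elements.
        "stray_root_text": (root.text or "").strip(),
    }


# Hypothetical miniature samples: one shaped like the second output above,
# one shaped like a structurally plausible ALTO fragment.
broken = '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">None2TEspeciallyduring</alto>'
good = (
    '<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"><Layout><Page><PrintSpace>'
    '<TextBlock><TextLine><String CONTENT="Especially"/><String CONTENT="during"/>'
    '</TextLine></TextBlock></PrintSpace></Page></Layout></alto>'
)

print(summarize_alto(broken))
print(summarize_alto(good))
```

A file with zero String elements and a non-empty stray_root_text, like the second output above, suggests the conversion did not produce a real ALTO layout.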
Issue Analytics
- State:
- Created 4 years ago
- Comments:21 (14 by maintainers)
Well, you found another bug. 😃 The patch mentioned above by @zuphilip lacks opening <xsl:choose> tags here and here. If you are in a hurry, you can modify the file by hand inside the docker container (in /usr/local/share/ocr-fileformat/xslt/).
Closing this issue because of inactivity. If the problem remains, feel free to reopen it.
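If a stylesheet is missing an opening <xsl:choose> tag while the closing tag is still present, it is no longer well-formed XML, so a plain XML parse should point at the offending line. The sketch below assumes that is the situation here. The helper name check_well_formed is made up for illustration, and the stylesheet path is whichever file under the xslt/ directory you are editing.

```python
import xml.etree.ElementTree as ET


def check_well_formed(path):
    """Return None if the file parses as XML, else a message with the error position."""
    try:
        ET.parse(path)
    except ET.ParseError as err:
        line, column = err.position
        return f"{path}: line {line}, column {column}: {err}"
    return None


# Usage (the file name is a placeholder, not a real path from the project):
# print(check_well_formed("/path/to/stylesheet.xsl"))
```

This only checks well-formedness. A stylesheet can parse cleanly and still be invalid XSLT, for example if both the opening and closing tags are missing, so an XSLT processor's own error output remains the better source of truth.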