Converting hOCR to Alto

Hi, first of all, thanks for making this tool.

I have a question about using the GUI to convert hOCR to ALTO XML.

My hOCR file looks as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="unknown" lang="unknown">
  <head>
    <title>None</title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
    <meta name='ocr-system' content='gcv2hocr.py' />
    <meta name='ocr-langs' content='unknown' />
    <meta name='ocr-number-of-pages' content='1' />
    <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_line ocrx_word ocrp_lang'/>
  </head>
  <body>
    <div class='ocr_page' lang='unknown' title='bbox 0 0 1420 2068'>
        <div class='ocr_carea' lang='unknown' title='bbox 176 121 1420 2068'>
            <span class='ocr_line' id='line_0' title='bbox 678 121 747 168; baseline 0 -5'>
                <span class='ocrx_word' id='word_0_0' title='bbox 678 121 747 168'>2T</span>
            </span>
            <span class='ocr_line' id='line_1' title='bbox 383 184 572 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_1_0' title='bbox 383 184 572 218'>Especially</span>
            </span>
            <span class='ocr_line' id='line_2' title='bbox 583 184 697 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_2_0' title='bbox 583 184 697 218'>during</span>
            </span>
            <span class='ocr_line' id='line_3' title='bbox 722 188 775 215; baseline 0 -5'>
                <span class='ocrx_word' id='word_3_0' title='bbox 722 188 775 215'>the</span>
            </span>
            <span class='ocr_line' id='line_4' title='bbox 796 186 888 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_4_0' title='bbox 796 186 888 218'>years</span>
            </span>
            <span class='ocr_line' id='line_5' title='bbox 904 184 977 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_5_0' title='bbox 904 184 977 218'>1933</span>
            </span>
            <span class='ocr_line' id='line_6' title='bbox 1040 187 1110 218; baseline 0 -5'>
                <span class='ocrx_word' id='word_6_0' title='bbox 1040 187 1110 218'>1938</span>
            </span>

But the GUI gives me two ALTO XML files as output, which look like this:

<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/v2/alto-2-0.xsd">
   <Description>
      <MeasurementUnit>pixel</MeasurementUnit>
      <sourceImageInformation>
         <fileName/>
      </sourceImageInformation>
      <OCRProcessing ID="IdOcr">
         <ocrProcessingStep>
            <processingSoftware>
               <softwareName>gcv2hocr.py</softwareName>
               <softwareVersion>gcv2hocr.py</softwareVersion>
            </processingSoftware>
         </ocrProcessingStep>
      </OCRProcessing>
   </Description>
   <Layout>
      <Page ID="" PHYSICAL_IMG_NR="1" HEIGHT="" WIDTH="">
         <PrintSpace HEIGHT="" WIDTH="" VPOS="0" HPOS="0">
            <ComposedBlock ID="" HEIGHT="1947" WIDTH="1244" VPOS="121" HPOS="176"/>
         </PrintSpace>
      </Page>
   </Layout>
</alto>

and

<?xml version="1.0" encoding="utf-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v2# http://www.loc.gov/standards/alto/alto.xsd">None2TEspeciallyduringtheyears19331938theGermanun-employmentwasfullyremoved.LikemanyothershealsothoughtthatNationlasocialismvouldcauseaneconomicrisejoiningtheSAinApril1937Inforeigncountriestoo,Nationalsocialismwasnotrecognizedinitslasterfectsinthosedays.Imayremindyouofthefactthate.g.LordRothermeredevotedaspecialcopyofthe"DailyMailtotheNSDAPandaman1iaeMrWinstonChurchillwritesinhisreminiscences:"AtthattimeIhadnonationalprejudicesagainstHitler.Iknewbut1ittleofhisopinionoflifeandpastandhisoharacter.TomymindHitlerwasrighttobeaGerman1ovinghiscountry"Nodoubt,thatevenmoresuchorsimilarutterancesofstatesmenareknown.Atthattimemyhusbandcouldnotforeseethatbyhisjoininghewouldpromoteorsupportacriminalaffair.In1937hewasbusyasanassistantfortheknow-ledgeofkinsattheAnthropologicInstituteoftheUnivezaityofVienna.InSept.1937hepassedtothegeneralSS,becausehecouldbebusyasanivestigatorofkins.WaenAustriauasannexed,hecouldjointheGermanPolice.Afberyearsoftroublesanddistressnowhegotasafepoşitionasanofficial.Whenhewascalledouttothefrontier-guard(controlofpassports)onApril1st,1938hismembershiptothegeneralSSwasextinguished.HislatertransfertotheSDandtotheWafen-SS"wasnotvoluntary.DhusmyhusbanddoesnotbelongtotheciroleofthosemembersoftheSSwhomustbecosideredasCriminalsaccordingtothejudgementsofuremberg,becauseonlythosecounttothemwhoweremembersofthe3SstillfterSept.1st,1939.ThelatercompulsoryassimilationofranksintheSDandthe"Waffen-s"isotconsideredasamembershipothe3Saspertherulingpracticeofall"SpruchkammerInthecourseofageneraltraining-planinin1944myhusbandcametotheKRIPOforthreemonthstobeemployedthereforinformetionpurposes.ThenBourmonthsfollowedat.theSIAPOtobetrained1ateroninother1inesotheGeImanPolice.AstherewasalackofmenattheSTAPO,theycausedthepro-longationofhiscommendandinFebr.1945histransfertotheSTAPO.MyhusbandhasseveraltimestriedtoleavetheSTAFOandf1nallyappliedforbeingemployedasavoluateeratthef
ront.A1lhisapplicationswererefused.FurthertrialsWouldbeperhapspunishedasadenialofobedienceoradecompo-sitionof,themilitgry.ref.3)InFebr.andMarcha945asamemberoftheArmedForoesofthethenGermanymyhusbandshotdownanalliedterror-flyereachi.e.anenemyeirforce-manwhohadfiredabwomenandchildrenatBensheim/Germanyinalowflight,andthisonaccouatofadirectmilitaryandthereforebindingorderofhisdirectsuperior.Hewasorderedtodosobytheleaderofhisunit,SS-SourmbannführerandcouscillortothegovernmentGIRKEorbyhesdeputySS-sturmbannführerandcouncillortotheKRIPOHELLENBROICHresp.InFébr.1945Girkeaskedbyphonethecom-petentCommanderoftheSIPOSS-OberführerTRUMMLER,whethertheorderissuedfromBerlinbesti1lvalidbywhichterror-flyersweretobelki1led.TrummleransweredintheaffirmativeandP.t.o.</alto>

I’ve not worked with ALTO formats before, but I’m thinking it shouldn’t look like this? Please let me know what you think, any help would be greatly appreciated!
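One quick way to sanity-check a converted file is to count which elements it contains. The sketch below is only an illustration, not part of the tool: `summarize_alto` and `looks_like_text_alto` are hypothetical helper names, and it assumes that a usable ALTO page holds recognized words in `String` elements nested inside `TextLine` elements, as the ALTO schema defines them. Neither output above would pass this check: the first contains only an empty `ComposedBlock`, and the second has all the text directly under `<alto>`.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def summarize_alto(xml_text: str) -> Counter:
    """Count elements by local name, ignoring the ALTO namespace version."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        # Tags look like '{namespace}Name'; keep only 'Name'.
        counts[element.tag.rsplit("}", 1)[-1]] += 1
    return counts


def looks_like_text_alto(xml_text: str) -> bool:
    """Heuristic (an assumption): text-bearing ALTO has TextLine and String."""
    counts = summarize_alto(xml_text)
    return counts["TextLine"] > 0 and counts["String"] > 0
```

Running `summarize_alto` on each output file and comparing the counts with the number of `ocr_line` and `ocrx_word` spans in the hOCR input gives a rough idea of how much was lost in conversion.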

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 21 (14 by maintainers)

Top GitHub Comments

1 reaction
jmechnich commented, Sep 18, 2019

Well, you found another bug. 😃 The patch mentioned above by @zuphilip lacks opening <xsl:choose> tags here and here. If you are in a hurry, you can modify the file by hand inside the Docker container (in /usr/local/share/ocr-fileformat/xslt/).
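Before or after editing a stylesheet by hand, it can help to confirm that it still parses as XML. The following is a minimal sketch under one assumption: that the missing opening tags leave an unmatched `</xsl:choose>` behind, so the file is no longer well-formed. If the closing tags are missing as well, the file would parse and the problem would only show up as an XSLT error at transform time. The helper name `well_formedness_error` is hypothetical.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xslt_text: str):
    """Return None if the text parses as XML, else the parser's message."""
    try:
        ET.fromstring(xslt_text)
    except ET.ParseError as exc:
        # The message includes the line and column of the first problem.
        return str(exc)
    return None
```

Reading each `.xsl` file in the stylesheet directory and passing its contents to this function would point to the line where an unmatched tag first appears.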

0 reactions
zuphilip commented, Dec 30, 2019

Closing this issue because of inactivity. If the problem remains, then feel free to reopen it.
