Pdfalto XML parser does not check the size of the spaces (which could be zero)

As far as I can tell, I followed the documented procedure. After spending many hours training the models, I still find that words in the extracted sentences are broken up by spaces. My training steps were:

./gradlew train_fulltext
java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.6.0-SNAPSHOT-onejar.jar 0 fulltext -gH grobid-home
java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.6.0-SNAPSHOT-onejar.jar 1 fulltext -gH grobid-home

After running these commands, I get the following result:

 <div
                xmlns="http://www.tei-c.org/ns/1.0">
                <head>Uni ka lūs eks po na tai</head>
                <p>Mo kan tis ke ra mi kos stu di jo je ir vė liau, vi są 4ąjį de šimt me tį, me ni nin ko dar bų sti lis ti ka la biau siai at spin dė jo liau dies dir bi nių tra di ci jų plė to tę. Dai li nin kas kū rė kai me pa pli tu sius švil pu kus, de ko ra ty vi nes skulp tū rė les, bet vis dėl to daž niau siai -in dus.</p>
                <p>Ser vi zuo se ir va zo se va ri ja vo liau dies in dų for mas, de ko rui pa si telk da vo krai ti nių skry nių raš tus, gy vy bės me džio, gė lių mo ty vus. Dau gu ma šių in dų pa si žy mė jo at ski rų da lių tar pu sa vio dar na, lanks čio mis, iš baig to mis de ta lė mis, at li ki mo pro fe sio na lu mu. Pa laips niui kū ri niuo se pra dė jo ryš kė ti ir tar pu ka rio art de co sti liui bū din gi me ni niai ypa tu mai -įstri ži ar ban guo ti ran ke nė lių, dang te lių ele men tai, konst ruk ty ves nė de ko ro trak tuo tė.</p>
                <p>De rė tų pa ste bė ti, kad A.Žmui dzi na vi čiaus kū ri nių ir rin ki nių mu zie ju je su reng to je pa ro do je eks po nuo ja mi anks čiau ne de monst ruo ti šio anks ty vo jo me ni nin ko kū ry bos pe rio do dar bai, sau go mi jo šei mos ko lek ci jo je.</p>
            </div>

For example, instead of <head>Uni ka lūs eks po na tai</head> the output should be <head>Unikalūs eksponatai</head>

Has anyone had similar issues?

Top GitHub Comments

lfoppiano commented, Jul 7, 2020

OK. So far we have #48 and #564 related to the same issue.

kermitt2 commented, Jul 6, 2020

I think this is an issue for pdfalto. We probably don’t want to look at the <SP> elements in PDFALtoSaxHandler, because that would introduce a hack and a dependency on a pdfalto “error”.
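To make the issue title concrete, here is a minimal Python sketch of what "checking the size of the spaces" could look like on the consumer side. It is only an illustration of the workaround discussed above, not GROBID's or pdfalto's code. The ALTO-like fragment is made up, the namespace is omitted for brevity, and the `line_text` helper and its `min_space_width` threshold are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO-like fragment (namespace omitted). Some <SP> elements have
# WIDTH="0.0", which is the situation described in the issue title.
SAMPLE_LINE = """
<TextLine>
  <String CONTENT="Uni" HPOS="10.0" WIDTH="15.0"/>
  <SP HPOS="25.0" WIDTH="0.0"/>
  <String CONTENT="ka" HPOS="25.0" WIDTH="10.0"/>
  <SP HPOS="35.0" WIDTH="0.0"/>
  <String CONTENT="lūs" HPOS="35.0" WIDTH="14.0"/>
  <SP HPOS="49.0" WIDTH="3.2"/>
  <String CONTENT="eks" HPOS="52.2" WIDTH="15.0"/>
  <SP HPOS="67.2" WIDTH="0.0"/>
  <String CONTENT="po" HPOS="67.2" WIDTH="10.0"/>
</TextLine>
"""


def line_text(text_line_xml, min_space_width=0.5):
    """Rebuild the text of one line, skipping <SP> elements narrower than
    min_space_width (a hypothetical threshold, in the file's layout units)."""
    line = ET.fromstring(text_line_xml)
    parts = []
    for element in line:
        if element.tag == "String":
            parts.append(element.get("CONTENT", ""))
        elif element.tag == "SP":
            width = float(element.get("WIDTH", "0"))
            if width >= min_space_width:
                parts.append(" ")
    return "".join(parts)


print(line_text(SAMPLE_LINE))
print(line_text(SAMPLE_LINE, min_space_width=0.0))
```

The second call treats every <SP> as a real space and reproduces the broken output from the question. As noted above, filtering on the consumer side only hides the problem, so this is a way to reason about the symptom rather than a recommended fix.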

In pdfalto, a space character always introduces such a space and breaks words, but in practice there are almost no space characters in a PDF stream, so spaces have to be inferred from the positions of the characters/tokens. One issue is diacritics (apparently lots of them in this PDF): when a diacritic occurs, we have to recompose the characters and join the separated tokens to create actual words.
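The two ideas in this paragraph, inferring spaces from positions and recomposing diacritics, can be sketched in Python as follows. This is a simplified illustration and not the heuristic pdfalto actually uses. The `Glyph` class, the `join_glyphs` function and the `gap_threshold` value are hypothetical, and the sketch assumes the diacritic arrives as a Unicode combining mark, which is not always the case in real PDFs.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str     # a single character taken from the PDF stream
    x: float      # left edge of the glyph
    width: float  # advance width of the glyph


def join_glyphs(glyphs, gap_threshold=1.0):
    """Insert a space only where the horizontal gap between two base glyphs
    exceeds gap_threshold; combining marks never start a new token."""
    out = []
    prev_end = None
    for glyph in glyphs:
        is_mark = unicodedata.combining(glyph.char) != 0
        if prev_end is not None and not is_mark and glyph.x - prev_end > gap_threshold:
            out.append(" ")
        out.append(glyph.char)
        if not is_mark:
            # Only base glyphs move the reference position forward.
            prev_end = glyph.x + glyph.width
    # NFC recomposes sequences such as "u" + U+0304 into the single letter "ū".
    return unicodedata.normalize("NFC", "".join(out))


glyphs = [
    Glyph("l", 0.0, 3.0),
    Glyph("u", 3.0, 5.0),
    Glyph("\u0304", 4.0, 0.0),  # combining macron drawn over the "u"
    Glyph("s", 8.0, 4.0),
    Glyph("e", 15.0, 5.0),      # real gap of 3 units before this glyph
]
print(join_glyphs(glyphs))
```

A fixed threshold is the weak point of any such heuristic: a value that works for one font size or one PDF producer can split or merge words in another document.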

The relevant code in pdfalto is here: https://github.com/kermitt2/pdfalto/blob/master/src/XmlAltoOutputDev.cc#L2669. The challenge is that fixing the heuristics for one particular PDF will likely create errors in other PDFs 😄 So the first step is maybe to create a set of PDFs with a variety of space issues to improve the robustness of pdfalto.
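For such a test set, one possible way to score a change to the heuristics is to compare extracted text against a hand-corrected reference and count the space errors. The Python sketch below is an assumption about how that check could be written, not an existing pdfalto or GROBID tool. The function names are hypothetical.

```python
def space_positions(text):
    """Return the set of offsets (counted in non-space characters) at which
    a space occurs in text."""
    positions = set()
    seen = 0
    for ch in text:
        if ch == " ":
            positions.add(seen)
        else:
            seen += 1
    return positions


def space_errors(expected, extracted):
    """Count spaces that were wrongly inserted or wrongly dropped.
    Both strings must be identical once spaces are removed."""
    if expected.replace(" ", "") != extracted.replace(" ", ""):
        raise ValueError("texts differ in more than spacing")
    wanted = space_positions(expected)
    found = space_positions(extracted)
    return {"spurious": len(found - wanted), "missing": len(wanted - found)}


# The heading from the question, reference first and extracted text second.
print(space_errors("Unikalūs eksponatai", "Uni ka lūs eks po na tai"))
```

Running this over every document in the set before and after a change would show whether a fix for one PDF makes the others worse.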
