Pdfalto XML parser does not check the size of the spaces (which could be zero)
It looks like I did everything as described in the instructions. After spending many hours training the models, I still find that the words in my extracted sentences are broken up by spaces. My training steps were:
- ./gradlew train_fulltext
- java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.6.0-SNAPSHOT-onejar.jar 0 fulltext -gH grobid-home
- java -Xmx1024m -jar grobid-trainer/build/libs/grobid-trainer-0.6.0-SNAPSHOT-onejar.jar 1 fulltext -gH grobid-home
After running these, I get the following result:
<div xmlns="http://www.tei-c.org/ns/1.0">
<head>Uni ka lūs eks po na tai</head>
<p>Mo kan tis ke ra mi kos stu di jo je ir vė liau, vi są 4ąjį de šimt me tį, me ni nin ko dar bų sti lis ti ka la biau siai at spin dė jo liau dies dir bi nių tra di ci jų plė to tę. Dai li nin kas kū rė kai me pa pli tu sius švil pu kus, de ko ra ty vi nes skulp tū rė les, bet vis dėl to daž niau siai -in dus.</p>
<p>Ser vi zuo se ir va zo se va ri ja vo liau dies in dų for mas, de ko rui pa si telk da vo krai ti nių skry nių raš tus, gy vy bės me džio, gė lių mo ty vus. Dau gu ma šių in dų pa si žy mė jo at ski rų da lių tar pu sa vio dar na, lanks čio mis, iš baig to mis de ta lė mis, at li ki mo pro fe sio na lu mu. Pa laips niui kū ri niuo se pra dė jo ryš kė ti ir tar pu ka rio art de co sti liui bū din gi me ni niai ypa tu mai -įstri ži ar ban guo ti ran ke nė lių, dang te lių ele men tai, konst ruk ty ves nė de ko ro trak tuo tė.</p>
<p>De rė tų pa ste bė ti, kad A.Žmui dzi na vi čiaus kū ri nių ir rin ki nių mu zie ju je su reng to je pa ro do je eks po nuo ja mi anks čiau ne de monst ruo ti šio anks ty vo jo me ni nin ko kū ry bos pe rio do dar bai, sau go mi jo šei mos ko lek ci jo je.</p>
</div>
For example, instead of:
<head>Uni ka lūs eks po na tai</head>
it should be:
<head>Unikalūs eksponatai</head>
Has anyone had similar issues?
Issue Analytics
- State:
- Created 3 years ago
- Comments:7 (1 by maintainers)
Top GitHub Comments
OK. So far we have #48 and #564 related to the same issue.
I think this is an issue for pdfalto. We probably don't want to look at the <SP> elements in PDFALtoSaxHandler, because that would introduce a hack and a dependency on a pdfalto "error".
In pdfalto, a space character always introduces such a space and breaks words, but in practice there are almost no space characters in a PDF stream, so spaces have to be inferred from the positions of the characters/tokens. One issue is diacritics (apparently there are lots in this PDF): when a diacritic occurs, we have to recompose the characters and join the separated tokens to create actual words.
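To make the check named in the issue title concrete, here is a minimal diagnostic sketch in Python. It is not Grobid or pdfalto code, and as noted above it is the kind of downstream workaround the maintainers would rather avoid. It assumes the ALTO line contains <String> and <SP> children and that the spurious spaces carry a WIDTH of zero; the sample fragment, the WIDTH values, the helper name line_text and the min_space_width threshold are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragment: the WIDTH values are invented for illustration,
# not taken from real pdfalto output.
ALTO_LINE = """
<TextLine>
  <String CONTENT="Uni"/><SP WIDTH="0.0"/>
  <String CONTENT="ka"/><SP WIDTH="0.0"/>
  <String CONTENT="lūs"/><SP WIDTH="2.8"/>
  <String CONTENT="eks"/><SP WIDTH="0.0"/>
  <String CONTENT="po"/><SP WIDTH="0.0"/>
  <String CONTENT="na"/><SP WIDTH="0.0"/>
  <String CONTENT="tai"/>
</TextLine>
"""


def line_text(text_line, min_space_width=0.5):
    """Rebuild a line, ignoring <SP> elements narrower than min_space_width."""
    parts = []
    for elem in text_line:
        # Drop an XML namespace prefix such as "{...}String" if one is present.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "String":
            parts.append(elem.get("CONTENT", ""))
        elif tag == "SP":
            # Only keep a space that has a real horizontal extent.
            if float(elem.get("WIDTH", "0")) >= min_space_width:
                parts.append(" ")
    return "".join(parts)


root = ET.fromstring(ALTO_LINE.strip())
print(line_text(root))  # Unikalūs eksponatai
```

Running this against the ALTO output of an affected PDF would at least show whether the broken words are separated by zero-width spaces or by spaces with a real width, which are two different problems.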
The relevant code in pdfalto is here: https://github.com/kermitt2/pdfalto/blob/master/src/XmlAltoOutputDev.cc#L2669 but the challenge is that fixing the heuristics for a particular PDF will likely create errors in other PDFs 😄 So the first step is maybe to create a set of PDFs with a variety of space issues to improve the robustness of pdfalto.
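When collecting such a set of PDFs, a rough score can help flag candidates whose extracted text looks over-segmented. The sketch below is only an assumed heuristic, not something pdfalto or Grobid provides: it reports the share of alphabetic tokens that are at most max_len characters long. The function name short_token_ratio and the threshold of 3 are hypothetical choices, and the right cutoff will depend on the language.

```python
def short_token_ratio(text, max_len=3):
    """Share of alphabetic tokens with at most max_len characters."""
    # Ignore tokens without letters (numbers, stray punctuation).
    tokens = [t for t in text.split() if any(ch.isalpha() for ch in t)]
    if not tokens:
        return 0.0
    short = sum(1 for t in tokens if len(t) <= max_len)
    return short / len(tokens)


broken = "Uni ka lūs eks po na tai"
fixed = "Unikalūs eksponatai"
print(short_token_ratio(broken))  # 1.0
print(short_token_ratio(fixed))   # 0.0
```

A document scoring far above the typical value for its language is a reasonable candidate for the test set, but the score says nothing about why the spaces were inserted.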