Header not returning reference
I am running Grobid on Ubuntu 22.04.1 LTS using Windows Subsystem for Linux. Java version:
For some reason I can’t manage to output reference data from the Header Model. I have tried running it in batch mode and by using the Grobid service. I tried both processHeader and processFullText. Neither worked.
I used the createTraining command to generate data. Then I edited the header.tei.xml files and retrained the model. When I use the new model to generate training data, the generated training data (training.header file) is tagged correctly, but the output tei.xml file doesn’t show that information. The reference within the header is also parsed correctly in the training.header.reference file.
Why aren’t the tags present in the output file even though they are in the training data? I am using the <reference> tag.
Top GitHub Comments
Hi @kuubikus !
Indeed, in the current version, the metadata of the parsed reference present in the header section is not injected into the result (only the metadata coming from the “consolidation” mechanism might be injected). More precisely, this part is present but commented out, because it requires some review. So far I didn’t find the time to work on this part again 😕
If you want the header reference information injected in the resulting header TEI, you simply need to uncomment the following lines:
https://github.com/kermitt2/grobid/blob/master/grobid-core/src/main/java/org/grobid/core/engines/HeaderParser.java#L246
The issue as it stands is that it might overwrite some “original” fields extracted from the header with the values from the reference, and some decision logic needs to be added to keep the most reliable values in case of conflicting metadata (coming both from the header itself and from the reference in the header).
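To make the conflict concrete, here is a minimal sketch of one possible merge policy. It is not Grobid code: the class name HeaderMergeSketch, the merge method, and the use of plain string maps in place of BiblioItem are all hypothetical, and "header values win, reference values only fill empty fields" is just one assumed rule, not necessarily the most reliable one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only: plain maps stand in for Grobid's BiblioItem.
public class HeaderMergeSketch {

    /**
     * Merge fields parsed from the in-header reference into the fields
     * extracted from the header itself.
     * Assumed policy: a non-blank header value is never overwritten;
     * reference values are used only for missing or blank fields.
     */
    public static Map<String, String> merge(Map<String, String> header,
                                            Map<String, String> reference) {
        // Copy so the caller's header map is left untouched.
        Map<String, String> merged = new LinkedHashMap<>(header);
        for (Map.Entry<String, String> entry : reference.entrySet()) {
            String current = merged.get(entry.getKey());
            if (current == null || current.trim().isEmpty()) {
                merged.put(entry.getKey(), entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> header = new LinkedHashMap<>();
        header.put("title", "Title from header");
        header.put("doi", ""); // nothing extracted from the header

        Map<String, String> reference = new LinkedHashMap<>();
        reference.put("title", "Title from reference");
        reference.put("doi", "10.1234/example"); // made-up value
        reference.put("journal", "Example Journal");

        // The title stays as extracted from the header; doi and journal
        // are filled in from the reference.
        System.out.println(merge(header, reference));
    }
}
```

A real implementation would probably need a per-field rule (for example, trusting an identifier found in the reference more than a title found there) rather than a single blanket policy.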
I managed to make it work. First I changed the BiblioItem.java file to “correct” idno numbers even if it didn’t have a global level of acceptance: https://github.com/kermitt2/grobid/blob/0326b2872304a8de1be1e3583ae5811ec406c9f5/grobid-core/src/main/java/org/grobid/core/data/BiblioItem.java#L3963
I then added these lines to the TEIFormatter.java file: https://github.com/kermitt2/grobid/blob/0326b2872304a8de1be1e3583ae5811ec406c9f5/grobid-core/src/main/java/org/grobid/core/document/TEIFormatter.java#L312
Now the output file displays idno, but only if I use the Grobid service. For some reason it doesn’t work in batch mode.