Adding new labels Issue
Hi - I have the following questions regarding Grobid:
- My current task is to add new labels to sections of the PDF (e.g., features of product, warranty, expiration date, etc.). Basically, I am manually going through both the training and evaluation documents and adding my own labels, such as '<features>', to the appropriate sections. I was able to add new labels to my environment by editing the startElement() and writeData() methods in grobid-trainer/src/main/java/org/grobid/trainer/sax/TEIFulltextSaxParser.java, and also by adding labels in /org/grobid/core/engines/label/TaggingLabels.java.
I am using batch testing to execute this. Here is my process: first I run
java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-SNAPSHOT-onejar.jar -dIn ./trainingdata/test_in -dOut ./trainingdata/test_out2 -exe createTraining
to create the training data. I then manually go through the fulltext.tei.xml documents to add new labels, as explained above, for both training and evaluation, and I move the appropriate fulltext documents to the corpus and evaluation folders in /grobid-trainer/resources/dataset/fulltext.
I then train the model by executing ./gradlew train_fulltext, and I see that the labels that I manually added to the documents in the evaluation folder show up in the evaluation results.
However, my issue is that when I access the live Grobid web API server (http://myhostname:8085) and submit a PDF to processFullText through the GUI, the labels that I manually added do not show up at all. Only the regular labels, like paragraph and header, show up. Also, the format of this XML file is different from the myfile.fulltext.tei.xml files that I processed initially (for example, tables and images do not show up). How do I fix this issue? Am I testing the model correctly? If not, how would you recommend testing it?
- What is the difference between token and field level results? I understand a field is a compilation of tokens, but what exactly is a token in this case?
Thank you so much for your help!
Issue Analytics
- State:
- Created 5 years ago
- Reactions:2
- Comments:27 (12 by maintainers)
Top GitHub Comments
Hi @aishwaryabh !
Look at the class org.grobid.core.document.TEIFormatter: in the method toTEITextPiece(), there is an iteration over all the TaggingTokenCluster objects, which are sequences of LayoutToken with the same label. You need to add to the conditions one corresponding to your new label, and create the corresponding XML element to encode it. The new element should then be output in the final resulting TEI.
(Note: if the annotated text is dehyphenized when it is encoded, you might want to remove the dehyphenization and keep the text untouched, depending on the nature of the information you want to annotate.)
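The idea above can be sketched in isolation. The classes below are simplified, hypothetical stand-ins, not GROBID's actual LayoutToken, TaggingTokenCluster, or TEIFormatter API, and the XML element chosen for '<features>' is only an assumption. The sketch groups consecutive tokens with the same label into clusters, then shows that a label only reaches the output once a condition for it exists in the serialization loop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ClusterToTeiSketch {

    // Hypothetical stand-in for a labeled token (not GROBID's LayoutToken).
    static final class LabeledToken {
        final String text;
        final String label;
        LabeledToken(String text, String label) { this.text = text; this.label = label; }
    }

    // Hypothetical stand-in for a run of consecutive tokens sharing one label.
    static final class Cluster {
        final String label;
        final List<String> words = new ArrayList<>();
        Cluster(String label) { this.label = label; }
    }

    // Group consecutive tokens that carry the same label into clusters.
    static List<Cluster> cluster(List<LabeledToken> tokens) {
        List<Cluster> clusters = new ArrayList<>();
        for (LabeledToken t : tokens) {
            if (clusters.isEmpty()
                    || !clusters.get(clusters.size() - 1).label.equals(t.label)) {
                clusters.add(new Cluster(t.label));
            }
            clusters.get(clusters.size() - 1).words.add(t.text);
        }
        return clusters;
    }

    // Minimal XML escaping for element content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Serialize clusters; a label without a matching condition is silently dropped.
    static String toTei(List<Cluster> clusters) {
        StringBuilder sb = new StringBuilder();
        for (Cluster c : clusters) {
            String text = escape(String.join(" ", c.words)); // text kept untouched
            if (c.label.equals("<paragraph>")) {
                sb.append("<p>").append(text).append("</p>\n");
            } else if (c.label.equals("<features>")) {
                // New condition for the custom label (element name is an assumption).
                sb.append("<div type=\"features\">").append(text).append("</div>\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<LabeledToken> tokens = Arrays.asList(
                new LabeledToken("Long", "<features>"),
                new LabeledToken("battery", "<features>"),
                new LabeledToken("life", "<features>"),
                new LabeledToken("Some", "<paragraph>"),
                new LabeledToken("text", "<paragraph>"));
        System.out.print(toTei(cluster(tokens)));
    }
}
```

Removing the '<features>' branch from toTei() reproduces the symptom described in the question: the model may predict the label, but nothing in the serialization step writes it out.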
Tables and figures are positioned at the end of the final TEI. The final resulting TEI is a logical representation of the input document, independent of any particular presentation.
LayoutToken is a representation of a token with all its PDF layout information (coordinates, font, size, etc.). A field represents a complete sequence with the same label, i.e., an annotated entity comprising several words and thus several tokens.
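To make the token/field distinction concrete, here is a minimal sketch of how the two counts can diverge. It is not GROBID's evaluation code. It assumes that a field is a maximal run of tokens with the same label, and that a field counts as correct only when the predicted run has the same label and the same boundaries; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenVsFieldSketch {

    // Describe each maximal run of identical labels as "label:start-end".
    static List<String> fields(List<String> labels) {
        List<String> out = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= labels.size(); i++) {
            if (i == labels.size() || !labels.get(i).equals(labels.get(start))) {
                out.add(labels.get(start) + ":" + start + "-" + (i - 1));
                start = i;
            }
        }
        return out;
    }

    // Token level: count positions where the predicted label equals the expected one.
    static int correctTokens(List<String> expected, List<String> predicted) {
        int n = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) n++;
        }
        return n;
    }

    // Field level: count expected fields reproduced with identical label and boundaries.
    static int correctFields(List<String> expected, List<String> predicted) {
        List<String> predictedFields = fields(predicted);
        int n = 0;
        for (String f : fields(expected)) {
            if (predictedFields.contains(f)) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList(
                "<features>", "<features>", "<features>", "<paragraph>", "<paragraph>");
        List<String> predicted = Arrays.asList(
                "<features>", "<features>", "<paragraph>", "<paragraph>", "<paragraph>");
        System.out.println("correct tokens: " + correctTokens(expected, predicted)
                + " of " + expected.size());
        System.out.println("correct fields: " + correctFields(expected, predicted)
                + " of " + fields(expected).size());
    }
}
```

In this example a single mislabeled token leaves four of five tokens correct, yet neither of the two expected fields is matched, because the boundary between them has moved. This is why field-level scores are usually lower than token-level scores.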
Hope this is helpful !
Hi, my task is to retrain the header part alone, and I have started training. I have a doubt about whether I am on the right path: by retraining only the header part, will my overall score be reduced? Please reply.