Adding new labels Issue

Hi - I have the following questions regarding Grobid:

My current task is to add new labels to sections of the PDF (e.g. product features, warranty, expiration date). Basically, I am manually going through both the training and evaluation documents and adding my own labels, such as '<features>', to the appropriate sections. I was able to add the new labels to my environment by editing the startElement() and writeData() methods in grobid-trainer/src/main/java/org/grobid/trainer/sax/TEIFulltextSaxParser.java, and by adding the labels in /org/grobid/core/engines/label/TaggingLabels.java.

I am using batch mode to execute this. Here is my process. First, I create the training data by running:

    java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-SNAPSHOT-onejar.jar -dIn ./trainingdata/test_in -dOut ./trainingdata/test_out2 -exe createTraining

I then manually go through the fulltext.tei.xml documents to add the new labels, as explained above, for both training and evaluation, and I move the appropriate fulltext documents to the corpus and evaluation folders in /grobid-trainer/resources/dataset/fulltext. I then train the model by executing ./gradlew train_fulltext, and I see that the labels I manually added to the evaluation folder show up in the evaluation results.

However, my issue is that when I access the live Grobid web API server (http://myhostname:8085) and submit a PDF to processFullText through the GUI, the labels I manually added do not show up at all. Only the regular labels, like paragraph and header, show up. Also, the format of this XML file is different from the myfile.fulltext.tei.xml files that I processed initially (e.g. tables and images do not show up). How do I fix this? Am I testing the model correctly? If not, how would you recommend testing it?

What is the difference between token and field level results? I understand a field is a compilation of tokens, but what exactly is a token in this case?

Thank you so much for your help!

Issue Analytics

State:
Created 5 years ago
Reactions:2
Comments:27 (12 by maintainers)

Top GitHub Comments

kermitt2 commented, Sep 7, 2018

Hi @aishwaryabh !

I think you did the hardest part. You still need to indicate how you want to serialize the new labels in the final TEI result.

Look at the class org.grobid.core.document.TEIFormatter: in the method toTEITextPiece(), there is an iteration over all the TaggingTokenCluster objects, which are sequences of LayoutToken with the same label. You need to add a condition for your new label and create the corresponding XML element to encode it, for example:

        else if (clusterLabel.equals(TaggingLabels.MY_NEW_LABEL)) {
            // text of the cluster, with words hyphenated at line breaks rejoined
            String clusterContent = LayoutTokensUtil.normalizeDehyphenizeText(cluster.concatTokens());
            // encode the labeled text as <rs type="new_label">...</rs>
            Element newElement = teiElement("rs", clusterContent);
            newElement.addAttribute(new Attribute("type", "new_label"));
            curDiv.appendChild(newElement);
        }

and you should get the following in the final TEI output:

     bla bla bla <rs type="new_label">text tagged</rs> bla bla bla

(Note: this example dehyphenizes the annotated text. Depending on the nature of the information you want to annotate, you might want to remove the dehyphenization and keep the text untouched.)
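To make the dehyphenization trade-off concrete, here is a minimal, self-contained sketch of what such a step does to text. It is an illustration only and not Grobid's implementation: the class DehyphenizeSketch, its dehyphenize() method and the regular expression are assumptions made up for this example.

```java
public class DehyphenizeSketch {

    // Rejoins a word that was split by a hyphen at a line break,
    // e.g. "manu-\nfacturing" becomes "manufacturing".
    // Hypothetical helper for illustration; Grobid's own logic may differ.
    public static String dehyphenize(String text) {
        // Match a hyphen that sits between two lowercase letters and is
        // followed by a line break (with optional spaces or tabs around it).
        return text.replaceAll("(?<=\\p{Ll})-[ \\t]*\\r?\\n[ \\t]*(?=\\p{Ll})", "");
    }

    public static void main(String[] args) {
        String raw = "The warranty covers manu-\nfacturing defects for a two-year period.";
        // The line-break hyphen is removed, the hyphen in "two-year" is kept.
        System.out.println(dehyphenize(raw));
    }
}
```

If the labeled text can contain hyphens that carry meaning across a line break (a serial number, for instance), skipping this step and keeping the raw cluster text is probably the safer choice.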

Tables and figures are positioned at the end of the final TEI. The final TEI is a logical representation of the input document, independent of any particular presentation.

A token is the result of word-level tokenization, with punctuation marks as tokens. It is called a token because it does not correspond to a word as a linguist would define it (the tokenization here is not linguistically motivated; it is just there to serve our NLP tasks). The class LayoutToken represents a token with all its PDF layout information (coordinates, font, size, etc.).

A field represents a complete sequence with the same label, that is, an annotated entity, typically comprising several words and so several tokens.
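The difference between token-level and field-level results can be sketched with a small, self-contained example. It is a simplified illustration and not Grobid's evaluation code: the class TokenVsFieldSketch, its methods and the label names are assumptions invented here, and the sketch only counts exact matches instead of computing per-label precision, recall and F1.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenVsFieldSketch {

    // Groups consecutive tokens with the same label into fields,
    // encoded as "label[start,end)" using token offsets.
    public static List<String> toFields(String[] labels) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= labels.length; i++) {
            if (i == labels.length || !labels[i].equals(labels[start])) {
                fields.add(labels[start] + "[" + start + "," + i + ")");
                start = i;
            }
        }
        return fields;
    }

    // Token level: positions where the predicted label equals the expected one.
    public static int correctTokens(String[] expected, String[] predicted) {
        int correct = 0;
        for (int i = 0; i < expected.length; i++) {
            if (expected[i].equals(predicted[i])) {
                correct++;
            }
        }
        return correct;
    }

    // Field level: expected fields reproduced with the same label and the same boundaries.
    public static int correctFields(String[] expected, String[] predicted) {
        List<String> predictedFields = toFields(predicted);
        int correct = 0;
        for (String field : toFields(expected)) {
            if (predictedFields.contains(field)) {
                correct++;
            }
        }
        return correct;
    }

    public static void main(String[] args) {
        // Tokens: "Two" "year" "limited" "warranty" "." "Expires" "2025"
        String[] expected  = {"warranty", "warranty", "warranty", "warranty", "other", "expiration", "expiration"};
        String[] predicted = {"warranty", "warranty", "warranty", "other", "other", "expiration", "expiration"};

        System.out.println("tokens: " + correctTokens(expected, predicted) + "/" + expected.length);
        System.out.println("fields: " + correctFields(expected, predicted) + "/" + toFields(expected).size());
    }
}
```

In this made-up example a single mislabeled token leaves 6 of 7 tokens correct, but only 1 of 3 fields, because the error shifts the boundary shared by two fields. This is why field-level scores are usually lower and stricter than token-level ones.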

Hope this is helpful !

LeelaMani commented, Sep 18, 2019

Hi. My task is to retrain the header part alone, and I have started training. I am not sure whether I am on the right path: will retraining only the header part reduce my overall score? Please reply.