Adding new labels Issue

Hi - I have the following questions regarding Grobid:

  1. My current task is to add new labels to sections of a PDF (e.g. product features, warranty, expiration date). I am manually going through both the training and evaluation documents and adding my own labels, such as '<features>', to the appropriate sections. I was able to add the new labels to my environment by editing the startElement() and writeData() methods in grobid-trainer/src/main/java/org/grobid/trainer/sax/TEIFulltextSaxParser.java, and by adding the labels in /org/grobid/core/engines/label/TaggingLabels.java.

I am using batch mode to execute this. Here is my process. First, I run

    java -Xmx4G -jar grobid-core/build/libs/grobid-core-0.6.0-SNAPSHOT-onejar.jar -dIn ./trainingdata/test_in -dOut ./trainingdata/test_out2 -exe createTraining

to create the training data. I then manually go through the fulltext.tei.xml documents to add the new labels described above, for both training and evaluation, and move the appropriate fulltext documents to the corpus and evaluation folders in /grobid-trainer/resources/dataset/fulltext. I then train the model by executing ./gradlew train_fulltext, and I see that the labels I manually added to the evaluation folder show up in the evaluation results.

However, when I access the live Grobid web API server (http://myhostname:8085) and submit a PDF to processFullText through the GUI, the labels I manually added do not show up at all. Only the regular labels, such as paragraph and header, show up. Also, the format of this XML file is different from the myfile.fulltext.tei.xml files that I processed initially (e.g. tables and images do not show up). How do I fix this? Am I testing the model correctly? If not, how would you recommend testing it?

  2. What is the difference between token and field level results? I understand a field is a compilation of tokens, but what exactly is a token in this case?

Thank you so much for your help!

Issue Analytics

  • State: open
  • Created: 5 years ago
  • Reactions: 2
  • Comments: 27 (12 by maintainers)

Top GitHub Comments

2 reactions
kermitt2 commented, Sep 7, 2018

Hi @aishwaryabh !

  1. I think you did the hardest part. You still need to indicate how you want to serialize the new labels in the final TEI result.

Look at the class org.grobid.core.document.TEIFormatter: in the method toTEITextPiece(), there is an iteration over all TaggingTokenCluster objects, which are sequences of LayoutToken with the same label. You need to add a condition for your new label and create the corresponding XML element to encode it, for example:

        else if (clusterLabel.equals(TaggingLabels.MY_NEW_LABEL)) {
            // Concatenate the cluster's tokens and dehyphenize the resulting text
            String clusterContent = LayoutTokensUtil.normalizeDehyphenizeText(cluster.concatTokens());
            // Encode the cluster as <rs type="new_label">...</rs>
            Element newElement = teiElement("rs", clusterContent);
            newElement.addAttribute(new Attribute("type", "new_label"));
            curDiv.appendChild(newElement);
        }

and you should get the following in the final resulting TEI:

     bla bla bla <rs type="new_label">text tagged</rs> bla bla bla

(Note: this example dehyphenizes the annotated text. Depending on the nature of the information you want to annotate, you might want to remove the dehyphenization and keep the text untouched.)
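To make the cluster-then-serialize idea concrete outside of Grobid, here is a minimal, self-contained sketch. It does not use the real Grobid classes: LabelClusterSketch, its Cluster type, and the "paragraph" and "new_label" label strings are hypothetical stand-ins, and the XML is built as a plain string rather than with the element API used in the snippet above. It assumes whitespace is carried as separate tokens and that labels are plain identifiers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the cluster iteration in
// TEIFormatter.toTEITextPiece(); none of these types are Grobid classes.
class LabelClusterSketch {

    /** A run of consecutive tokens that share one label. */
    static final class Cluster {
        final String label;
        final StringBuilder text = new StringBuilder();

        Cluster(String label) {
            this.label = label;
        }
    }

    /** Groups parallel token/label lists into runs of identical consecutive labels. */
    static List<Cluster> cluster(List<String> tokens, List<String> labels) {
        List<Cluster> clusters = new ArrayList<>();
        Cluster current = null;
        for (int i = 0; i < tokens.size(); i++) {
            String label = labels.get(i);
            if (current == null || !current.label.equals(label)) {
                current = new Cluster(label);
                clusters.add(current);
            }
            current.text.append(tokens.get(i));
        }
        return clusters;
    }

    /** Escapes the characters that are unsafe in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits "paragraph" clusters as plain text and any other label as an rs element. */
    static String serialize(List<Cluster> clusters) {
        StringBuilder out = new StringBuilder();
        for (Cluster c : clusters) {
            String content = escape(c.text.toString());
            if (c.label.equals("paragraph")) {
                out.append(content);
            } else {
                // Assumption: the label is a plain identifier, safe as an attribute value.
                out.append("<rs type=\"").append(c.label).append("\">")
                   .append(content).append("</rs>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("bla", " ", "text", " ", "tagged", " ", "bla");
        List<String> labels = List.of("paragraph", "paragraph", "new_label",
                "new_label", "new_label", "paragraph", "paragraph");
        System.out.println(serialize(cluster(tokens, labels)));
    }
}
```

With the sample input in main, the three tokens labeled new_label end up in a single rs element surrounded by the untouched paragraph text, which mirrors the shape of the TEI line shown above.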

Tables and figures are positioned at the end of the final TEI. The final resulting TEI is a logical representation of the input document, independent of any particular presentation.

  2. A token is the result of word-level tokenization, with punctuation marks as tokens. It is called a token because it does not correspond to a word as a linguist would define it (tokenization here is not linguistically motivated, it just serves our NLP tasks). The class LayoutToken is a representation of a token with all its PDF layout information (coordinates, font, size, etc.).

A field represents a complete sequence with the same label, that is, an annotated entity comprising several words and therefore several tokens.
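The difference between the two levels can be illustrated with a small, self-contained sketch. This is not Grobid's evaluation code: TokenVsFieldSketch and its methods are hypothetical names, and the field score here is a simplified exact-match recall (same label and same boundaries), chosen only to show why field-level results are usually lower than token-level ones.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of token-level versus field-level scoring;
// a simplification, not the evaluation implemented in Grobid.
class TokenVsFieldSketch {

    /** Fraction of tokens whose predicted label equals the expected label. */
    static double tokenAccuracy(List<String> expected, List<String> predicted) {
        int correct = 0;
        for (int i = 0; i < expected.size(); i++) {
            if (expected.get(i).equals(predicted.get(i))) {
                correct++;
            }
        }
        return (double) correct / expected.size();
    }

    /** Splits a label sequence into fields, encoded as "label:start-end" (end exclusive). */
    static List<String> fields(List<String> labels) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= labels.size(); i++) {
            // Close the current field at the end of the sequence or when the label changes.
            if (i == labels.size() || !labels.get(i).equals(labels.get(start))) {
                result.add(labels.get(start) + ":" + start + "-" + i);
                start = i;
            }
        }
        return result;
    }

    /** Fraction of expected fields reproduced exactly (same label, same boundaries). */
    static double fieldRecall(List<String> expected, List<String> predicted) {
        List<String> expectedFields = fields(expected);
        List<String> predictedFields = fields(predicted);
        int matched = 0;
        for (String field : expectedFields) {
            if (predictedFields.contains(field)) {
                matched++;
            }
        }
        return (double) matched / expectedFields.size();
    }

    public static void main(String[] args) {
        List<String> expected = List.of("features", "features", "features", "paragraph", "paragraph");
        List<String> predicted = List.of("features", "features", "paragraph", "paragraph", "paragraph");
        System.out.println("token accuracy: " + tokenAccuracy(expected, predicted));
        System.out.println("field recall:   " + fieldRecall(expected, predicted));
    }
}
```

In this example a single mislabeled token leaves four of five tokens correct, yet it shifts the boundary between the two fields, so neither expected field is matched exactly and the field-level score drops to zero.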

Hope this is helpful !

0 reactions
LeelaMani commented, Sep 18, 2019

Hi. My task is to retrain the header part alone, and I have started training. I am not sure whether I am on the right path. Will retraining only the header part reduce my overall score? Please reply.
