
run_tf_ner.py doesn't work with unlabelled test data


When running run_tf_ner.py in predict mode, if all the labels in the test data are O, the script errors out with:

  File "/home/himanshu/.local/lib/python3.7/site-packages/numpy/lib/function_base.py", line 423, in average
    "Weights sum to zero, can't be normalized")
  ZeroDivisionError: Weights sum to zero, can't be normalized

This happens because pad_token_label_id (https://github.com/huggingface/transformers/blob/cae334c43c49aa770d9dac1ee48319679ee8c72c/examples/ner/run_tf_ner.py#L511) and the label id for O are both zero, resulting in an empty y_pred (https://github.com/huggingface/transformers/blob/cae334c43c49aa770d9dac1ee48319679ee8c72c/examples/ner/run_tf_ner.py#L364-L367). Shouldn't the pad_token_label_id be different?
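The collision described above can be illustrated with a minimal sketch (the label list and prediction values here are made up for illustration; only the filtering logic mirrors the linked lines of run_tf_ner.py). When the gold label id for a real O token equals pad_token_label_id, every such token is treated as padding and dropped, so an all-O test sentence produces an empty y_pred:

```python
# Hypothetical label setup: "O" gets id 0, the same value as pad_token_label_id.
labels = ["O", "B-PER", "I-PER"]
label_map = {i: label for i, label in enumerate(labels)}
pad_token_label_id = 0

# Gold label ids for an all-"O" test sentence, plus some model predictions.
y_true_ids = [0, 0, 0, 0, 0, 0]
pred_ids = [0, 1, 0, 0, 0, 0]

# Mirrors the filtering in run_tf_ner.py: positions whose gold id equals
# pad_token_label_id are assumed to be padding and are skipped entirely.
y_pred = [label_map[p] for t, p in zip(y_true_ids, pred_ids)
          if t != pad_token_label_id]

print(y_pred)  # [] -- every real "O" token was mistaken for padding
```

With y_pred empty, the downstream weighted average in numpy has nothing to weight, which is exactly the "Weights sum to zero" error in the traceback.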

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 9 (4 by maintainers)

Top GitHub Comments

2 reactions
VDCN12593 commented, Mar 20, 2020

I have noticed the same issue and posted a question here: https://stackoverflow.com/questions/60732509/label-handling-confusion-in-run-tf-ner-example

I think pad_token_label_id should definitely not fall into the range of actual labels. Maybe we can make it -1, or num(labels), or something similar. Also, as shown in convert_examples_to_features(), pad_token_label_id is not only used for pad tokens at the end of the sequence, but also for the non-first tokens inside a word when the word is split into multiple tokens. Accordingly, during prediction, only the label of the first token in each word is used. So I am wondering if we should modify input_mask so that the loss does not take into account the non-first tokens in a word.

I tried to set pad_token_label_id = -1, mask out the non-first tokens in each word by changing input_mask, and change num_labels to len(labels) instead of len(labels) + 1. Training and evaluation run, but the F1-score on the test set becomes much lower (on both CoNLL-03 English and OntoNotes English). I am still confused about this.
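The masking idea in this comment can be sketched roughly as follows (a toy illustration with made-up label and prediction arrays, not the actual run_tf_ner.py code): with pad_token_label_id = -1, padding and non-first sub-word positions can no longer collide with the real "O" label (id 0), so metrics and loss can simply skip every position labelled -1.

```python
import numpy as np

# Assumed convention: -1 marks pad tokens and non-first sub-word tokens,
# so it cannot be confused with any real label id (which start at 0).
pad_token_label_id = -1
label_ids = np.array([0, -1, 2, -1, 0, -1])  # gold ids; -1 = ignore
pred_ids = np.array([0, 1, 2, 0, 0, 1])      # hypothetical model output

# Boolean mask of the positions that should actually be scored.
active = label_ids != pad_token_label_id

# Only first-sub-word, non-padding tokens contribute to the metric.
accuracy = np.mean(pred_ids[active] == label_ids[active])
print(accuracy)  # 1.0 -- the ignored positions never affect the score
```

The same boolean mask can be applied to the per-token loss, which is what changing input_mask would accomplish inside the training loop.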

0 reactions
stale[bot] commented, May 29, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.


