
TFBartForConditionalGeneration with labels padded with -100 gives Nan loss.

See original GitHub issue

I am pretraining T5 and Bart. I noticed that the labels for these models should be padded with -100 rather than with the pad token (the models build decoder_input_ids from the labels internally).

I changed the label padding token to -100 for T5 (PyTorch, TensorFlow) and Bart (PyTorch), and it works well. But Bart (TensorFlow) gives a NaN loss.

Because of this, I also get an error message during pretraining: tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of -100 which is outside the valid range of [0, 50265). Label values: 0 2387 2335 16 11962 2 -100 -100 -100 -100 -100 ...........
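
For reference, the reason -100 works on the PyTorch side is that torch.nn.CrossEntropyLoss ignores targets equal to -100 by default (ignore_index=-100). A minimal illustration, with arbitrary values:

import torch

# CrossEntropyLoss defaults to ignore_index=-100, so label positions padded
# with -100 simply don't contribute to the loss in the PyTorch models.
loss_fct = torch.nn.CrossEntropyLoss()
logits = torch.randn(4, 10)                # (num_tokens, vocab_size)
labels = torch.tensor([3, 7, -100, -100])  # last two positions are padding
print(loss_fct(logits, labels))            # finite; -100 positions are skipped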

Environment info

  • transformers version: 4.2.2
  • Platform: ubuntu 18.04
  • Python version: 3.6
  • PyTorch version (GPU?):
  • Tensorflow version (GPU?): 2.4.0
  • Using GPU in script?: yes (colab)
  • Using distributed or parallel set-up in script?: no

Bart: @patrickvonplaten

Information

Model I am using (Bert, XLNet …): TFBartForConditionalGeneration

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

import tensorflow as tf
from transformers import BartTokenizer, TFBartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("My dog is <mask>", return_tensors='tf', truncation=True, max_length=16, padding="max_length")
labels_ids = tokenizer("My dog is cute", return_tensors='tf', truncation=True, max_length=16, padding="max_length").input_ids

## labels padded with the tokenizer's pad token (1)
loss = model(inputs, labels=labels_ids)[0]
print(labels_ids)
print(loss)

## replace the pad token (1) in the labels with -100
labels_ids = tf.where(
    labels_ids == 1, tf.fill(tf.shape(labels_ids), tf.constant(-100, dtype='int32')), labels_ids
)

loss = model(inputs, labels=labels_ids)[0]
print(labels_ids)
print(loss)

Results:

tf.Tensor(
[[    0  2387  2335    16 11962     2     1     1     1     1     1     1
      1     1     1     1]], shape=(1, 16), dtype=int32)
tf.Tensor(
[2.2291888e-05 4.8874615e-05 3.7073401e-05 7.9230859e-04 6.1941872e+00
 1.1058841e+00], shape=(6,), dtype=float32)
tf.Tensor(
[[    0  2387  2335    16 11962     2  -100  -100  -100  -100  -100  -100
   -100  -100  -100  -100]], shape=(1, 16), dtype=int32)
tf.Tensor(
[2.2291888e-05 4.8755410e-05 3.7073401e-05 7.9242775e-04 6.1941872e+00
 1.1058841e+00           nan           nan           nan           nan
           nan           nan           nan           nan           nan
           nan], shape=(16,), dtype=float32)

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

2 reactions
patrickvonplaten commented, Jan 25, 2021

Great catch @kiyoungkim1!

What we are doing here isn’t very consistent… TFBart should never have ignored the pad_token_id by default; it should ignore -100, as all the other models do.

To fix the problem, I think we should add a couple of lines that check whether -100 appears in the labels and, if so, replace it with the pad_token_id, for consistency with PyTorch’s Bart. It would be a pretty big breaking change to just replace pad_token_id with -100, so I think the first option is the better one. @kiyoungkim1, if you feel like opening a PR to correct this behavior, we would be more than happy 😃
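
A minimal sketch of the check described above, assuming it runs right before the loss is computed (the helper name is illustrative, not the actual transformers API):

import tensorflow as tf

def replace_ignore_index_with_pad(labels, pad_token_id):
    # Illustrative helper: wherever the labels contain -100, substitute
    # pad_token_id so the existing TFBart loss (which, as the first result
    # above shows, already skips pad_token_id positions) ignores them,
    # matching the behaviour of PyTorch's Bart.
    return tf.where(
        tf.equal(labels, -100),
        tf.fill(tf.shape(labels), tf.cast(pad_token_id, labels.dtype)),
        labels,
    )

Applied to the -100-padded labels_ids from the reproduction above (with pad_token_id=1), this gives back the pad-token-padded tensor, for which the loss is finite.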

1 reaction
jplu commented, Jan 25, 2021

We also plan to turn all the loss computation into a layer rather than a method, so it will be much easier to use and configure, and compliant with standard TensorFlow workflows.
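
For context, "loss as a layer" refers to the standard Keras pattern of computing the loss inside a layer and registering it with add_loss, so it plugs into fit() and can be configured like any other layer. A rough sketch of that pattern, illustrative only and not the transformers implementation:

import tensorflow as tf

class MaskedSparseCategoricalLoss(tf.keras.layers.Layer):
    # Sketch of the "loss as a layer" idea: compute token-level
    # cross-entropy, skip positions labelled -100, and register the result
    # with add_loss so Keras collects it during training.
    def call(self, labels, logits):
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction=tf.keras.losses.Reduction.NONE
        )
        active = tf.not_equal(tf.reshape(labels, [-1]), -100)
        flat_labels = tf.boolean_mask(tf.reshape(labels, [-1]), active)
        flat_logits = tf.boolean_mask(
            tf.reshape(logits, [-1, tf.shape(logits)[-1]]), active
        )
        loss = tf.reduce_mean(loss_fn(flat_labels, flat_logits))
        self.add_loss(loss)
        return loss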
