TFBartForConditionalGeneration with labels padded with -100 gives NaN loss.
I am pretraining T5 and Bart.
I noticed that the padding token for the labels of these models should be -100 (unlike decoder_input_ids, which keep the regular pad token).
I changed the padding token for the labels for T5 (PyTorch, TensorFlow) and Bart (PyTorch), and it works well. But Bart (TensorFlow) gives a NaN loss.
Because of this, I also get an error message during pretraining:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Received a label value of -100 which is outside the valid range of [0, 50265). Label values: 0 2387 2335 16 11962 2 -100 -100 -100 -100 -100 ...........
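For comparison, the equivalent PyTorch Bart call gives a finite loss (a minimal sketch of what I mean; the exact snippet here is illustrative):
from transformers import BartTokenizer, BartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
inputs = tokenizer("My dog is <mask>", return_tensors='pt', truncation=True, max_length=16, padding="max_length")
labels = tokenizer("My dog is cute", return_tensors='pt', truncation=True, max_length=16, padding="max_length").input_ids
## replace pad tokens with -100 so the loss ignores them
labels[labels == tokenizer.pad_token_id] = -100
loss = model(**inputs, labels=labels).loss  ## finite loss, no NaN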
Environment info
- transformers version: 4.2.2
- Platform: Ubuntu 18.04
- Python version: 3.6
- PyTorch version (GPU?):
- Tensorflow version (GPU?): 2.4.0
- Using GPU in script?: yes (colab)
- Using distributed or parallel set-up in script?: no
Bart: @patrickvonplaten
Information
Model I am using (Bert, XLNet …): TFBartForConditionalGeneration
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
import tensorflow as tf
from transformers import BartTokenizer, TFBartForConditionalGeneration
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = TFBartForConditionalGeneration.from_pretrained("facebook/bart-base")
inputs = tokenizer("My dog is <mask>", return_tensors='tf', truncation=True, max_length=16, padding="max_length")
labels_ids = tokenizer("My dog is cute", return_tensors='tf', truncation=True, max_length=16, padding="max_length").input_ids
## labels padding_token = 1
loss = model(inputs, labels=labels_ids)[0]
print(labels_ids)
print(loss)
## labels padding_token = -100
labels_ids = tf.where(
    labels_ids == 1, tf.fill(tf.shape(labels_ids), tf.constant(-100, dtype='int32')), labels_ids
)
loss = model(inputs, labels=labels_ids)[0]
print(labels_ids)
print(loss)
Results:
tf.Tensor(
[[ 0 2387 2335 16 11962 2 1 1 1 1 1 1
1 1 1 1]], shape=(1, 16), dtype=int32)
tf.Tensor(
[2.2291888e-05 4.8874615e-05 3.7073401e-05 7.9230859e-04 6.1941872e+00
1.1058841e+00], shape=(6,), dtype=float32)
tf.Tensor(
[[ 0 2387 2335 16 11962 2 -100 -100 -100 -100 -100 -100
-100 -100 -100 -100]], shape=(1, 16), dtype=int32)
tf.Tensor(
[2.2291888e-05 4.8755410e-05 3.7073401e-05 7.9242775e-04 6.1941872e+00
1.1058841e+00 nan nan nan nan
nan nan nan nan nan
nan], shape=(16,), dtype=float32)
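As a workaround for now (my own sketch, not an official API), one can keep the regular pad token in the labels passed to the model and mask the padded positions out of a hand-rolled loss on the returned logits:
## workaround sketch: labels keep pad_token_id so decoder_input_ids are still built correctly
labels_padded = tokenizer("My dog is cute", return_tensors='tf', truncation=True, max_length=16, padding="max_length").input_ids
outputs = model(inputs, labels=labels_padded)
logits = outputs.logits  ## shape (1, 16, vocab_size)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
per_token_loss = loss_fn(labels_padded, logits)  ## shape (1, 16)
mask = tf.cast(labels_padded != tokenizer.pad_token_id, per_token_loss.dtype)
loss = tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)
print(loss)  ## finite scalar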
Great catch @kiyoungkim1!
It’s not very consistent what we are doing here… TFBart should never have ignored the pad_token_id as a default setting, but -100, as all other models do.
To fix the problem, I think we should add a couple of lines that check whether -100 appears in the labels and, if so, replace it with the pad_token_id, to stay consistent with PyTorch’s Bart. It would be a pretty big breaking change to just replace pad_token_id with -100, so I think the first option is the better one. @kiyoungkim1, if you feel like opening a PR to correct this behavior, we would be more than happy 😃
We also plan to turn all the loss computation into a layer rather than a method, so it will be much easier to use, easier to configure, and compliant with TensorFlow workflows.
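Roughly, the proposed check could look something like this (an illustrative sketch with made-up names, not the actual implementation):
import tensorflow as tf

def _replace_ignore_index(labels, pad_token_id):
    ## illustrative: if -100 shows up in the labels, swap it for pad_token_id
    ## so the existing TF loss masking still applies
    return tf.where(
        labels == -100,
        tf.fill(tf.shape(labels), tf.constant(pad_token_id, dtype=labels.dtype)),
        labels,
    )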