inf/nan in generate (beam_sample) with small temperature values
Environment info
- `transformers` version: 4.6.0.dev0
- Platform: Linux
- Python version: 3.6.9
- PyTorch version (GPU?): 1.8.0 (yes)
Information
The `generate` function (`beam_sample`) throws an error when passing small temperature values.
To reproduce
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "sshleifer/distilbart-xsum-12-3"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "New York City (NYC), often simply called New York, is the most populous city in the United States"
input_ids = tokenizer.encode(text, return_tensors='pt')

sample_outputs = model.generate(
    input_ids,
    num_beams=3,
    do_sample=True,
    temperature=0.2,
)
```
```
Traceback (most recent call last):
  File "test.py", line 16, in <module>
    temperature=0.2
  File "/opt/anaconda3/envs/tensorflow2/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/anaconda3/envs/tensorflow2/lib/python3.6/site-packages/transformers/generation_utils.py", line 1113, in generate
    **model_kwargs,
  File "/opt/anaconda3/envs/tensorflow2/lib/python3.6/site-packages/transformers/generation_utils.py", line 2134, in beam_sample
    next_tokens = torch.multinomial(probs, num_samples=2 * num_beams)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
```
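For what it's worth, the message appears to come from `torch.multinomial` itself rejecting non-finite probabilities; a tensor with `nan` entries should reproduce the same error in isolation (a minimal sketch of my own, not part of the original report):

```python
import torch

probs = torch.full((1, 5), float("nan"))
# expected to raise:
# RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
next_tokens = torch.multinomial(probs, num_samples=2)
```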
Another way to reproduce this error is to use higher temperatures and more iterations (i.e. generate a longer output).
It looks like this error is caused by `next_token_scores` growing to `-inf` and `probs` becoming `nan`. Apparently, large absolute values accumulate over iterations because `next_token_scores` is no longer normalized after the unnormalized `beam_scores` is added to it. `beam_scores` is calculated from the output of `logits_warper(input_ids, next_token_scores)` and can grow fast with low temperatures (the warper does `scores = scores / self.temperature`).
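To make the accumulation concrete, here is a minimal, self-contained sketch of the arithmetic (my own toy code, not the actual transformers implementation). It mimics the order of operations described above for a single beam, so the running score grows roughly like `(1 / temperature) ** step` until it overflows to `-inf` and the softmax would turn into `nan`:

```python
import torch
import torch.nn.functional as F

temperature = 0.2
vocab_size = 50
beam_scores = torch.zeros(1)  # running, unnormalized score of the single "beam"

for step in range(100):
    logits = torch.randn(1, vocab_size)                            # stand-in for model logits
    next_token_scores = F.log_softmax(logits, dim=-1)              # normalized log-probs
    next_token_scores = next_token_scores + beam_scores[:, None]   # add unnormalized beam score
    next_token_scores = next_token_scores / temperature            # what TemperatureLogitsWarper does
    probs = F.softmax(next_token_scores, dim=-1)
    if torch.isinf(next_token_scores).any() or torch.isnan(probs).any():
        print(f"step {step}: scores hit -inf / probs became nan (beam score ~ {beam_scores.item():.2e})")
        break
    next_token = torch.multinomial(probs, num_samples=1)
    beam_scores = next_token_scores.gather(1, next_token).squeeze(1)
```

With `temperature=0.2` the per-step growth factor is 5, so float32 overflows after a few dozen steps; with temperatures closer to 1 the growth is slower, which would match the observation that higher temperatures need longer outputs to trigger the error.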
Expected behavior
Is this growth of unscaled values the desired behaviour, and should one just implement their own `logits_warper` that handles the float overflow?
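In case it helps the discussion, such a custom warper could be as simple as clamping the scores before sampling. This is only a sketch under my own assumptions (the class name and threshold are invented, and how it gets hooked into `generate` depends on the transformers version, e.g. overriding `_get_logits_warper` or passing a `LogitsProcessorList`):

```python
import torch
from transformers import LogitsWarper  # in 4.6 also importable from transformers.generation_logits_process


class ClampLogitsWarper(LogitsWarper):
    """Clamp scores to a finite range so softmax never sees overflowed -inf values."""

    def __init__(self, min_score: float = -1.0e4):
        self.min_score = min_score

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        # softmax is shift-invariant, so clipping the far-negative tail mainly
        # affects tokens whose probability has already underflowed to ~0 anyway
        return torch.clamp(scores, min=self.min_score)
```

That would only paper over the overflow, though; the unnormalized beam scores would still keep growing underneath.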
If not, a quick fix, just for demonstration, is to scale the `beam_scores` values that are added to `next_token_scores`, replacing:

```python
next_token_scores = next_token_scores + beam_scores[:, None].expand_as(next_token_scores)
```

with:

```python
beam_scores_softmax = F.softmax(beam_scores, dim=-1)
next_token_scores = next_token_scores + beam_scores_softmax[:, None].expand_as(next_token_scores)
```
This works fine, but it changes the absolute values of the scores, which users may rely on.
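A toy illustration of that trade-off (made-up numbers, not transformers code): the softmaxed scores stay in [0, 1] instead of growing without bound, but they are no longer log-probabilities, which is exactly the change in absolute values mentioned above:

```python
import torch
import torch.nn.functional as F

# hypothetical running beam scores after many low-temperature steps
beam_scores = torch.tensor([-3.2e30, -4.1e30, -5.0e30])

print(beam_scores)                     # huge magnitudes that keep growing each step
print(F.softmax(beam_scores, dim=-1))  # tensor([1., 0., 0.]) -> bounded, but no longer log-probs
```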
Top GitHub Comments
Hi @patrickvonplaten,
Thank you for the detailed answer.
I noticed this behaviour while testing various decoding methods, and I don't recall seeing a significant advantage of `beam_sample` in any particular use case. Since the new approach would be a breaking change, keeping it the way it is for now seems like the right solution.
Thanks again for your answer.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.