Logit explosion in MobileBertForNextSentencePrediction example from documentation (and all others tried)
Environment info
- transformers version: 4.11.3
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.6.8
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (Bert, XLNet …): MobileBertForNextSentencePrediction
The problem arises when using:
- the official example scripts: the example code provided in https://huggingface.co/transformers/model_doc/mobilebert.html#mobilebertfornextsentenceprediction
The task I am working on is:
- an official GLUE/SQuAD task: Next Sentence Prediction
To reproduce
Steps to reproduce the behavior:
Run the code from the official example script in the documentation:
>>> from transformers import MobileBertTokenizer, MobileBertForNextSentencePrediction
>>> import torch
>>> tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
>>> model = MobileBertForNextSentencePrediction.from_pretrained('google/mobilebert-uncased')
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')
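>>> # label 1 means next_sentence is NOT the actual continuation of prompt (0 would mean it is)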
>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> loss = outputs.loss
>>> logits = outputs.logits
Printing logits gives tensor([[2.7888e+08, 2.7884e+08]], grad_fn=<AddmmBackward>): strangely huge values for both classes, which push the softmax score to 1 for the "is next sentence" class. That is the opposite of the correct answer, which is surprising for an example taken straight from the documentation.
I ran the model on a handful of related prompt/next-sentence pairs, then on a larger set from my own NSP dataset, and got the same behavior every time: logits of about 2e+08 for both classes, with the first class higher only in the 3rd or 4th significant figure, regardless of the pair. Given those magnitudes, the softmax always gives a score of 1 for "is the next sentence" (the first class) and 0 for the other, no matter how unrelated the second sentence is, as the quick check below shows.
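To make the collapse concrete, a minimal check (reusing the model outputs from the snippet above) looks roughly like this:
>>> probs = outputs.logits.softmax(dim=-1)
>>> print(outputs.logits)  # tensor([[2.7888e+08, 2.7884e+08]], grad_fn=<AddmmBackward>)
>>> print(probs)           # effectively tensor([[1., 0.]]): class 0 ("is next sentence") always wins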
Expected behavior
For comparison, the logits produced on the same example using BertForNextSentencePrediction with bert-base-uncased are tensor([[-3.0729, 5.9056]], grad_fn=<AddmmBackward>). For an example like this, where the 'next sentence' clearly does not follow the prompt, I would expect MobileBertForNextSentencePrediction with the default pretrained checkpoint to get it right and to produce logits in a similar ballpark, not the huge positive values shown above.
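For reference, the BERT numbers came from running the same example with the BERT classes swapped in; a rough sketch (reusing the same prompt, next_sentence, and label as before):
>>> from transformers import BertTokenizer, BertForNextSentencePrediction
>>> bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> bert_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
>>> bert_encoding = bert_tokenizer(prompt, next_sentence, return_tensors='pt')
>>> bert_outputs = bert_model(**bert_encoding, labels=torch.LongTensor([1]))
>>> print(bert_outputs.logits)  # tensor([[-3.0729, 5.9056]], grad_fn=<AddmmBackward>): "not next" clearly wins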
I posted about this on the Hugging Face discussion board, but the post was immediately taken down by the bot for some reason. Linking it here in case the admins approve it: https://discuss.huggingface.co/t/next-sentence-prediction-with-google-mobilebert-uncased-producing-massive-near-identical-logits-10-8-for-its-documentation-example-and-2k-others-tried/10750/1.
Checked the code, and yes: MobileBertForPreTraining and MobileBertForNextSentencePrediction are crafted in such a way, state_dict-wise, that the PreTraining checkpoint is loadable into the NextSentencePrediction model; the LM head won't be loaded, but it shouldn't get used anyway. My theory doesn't explain the current state of affairs, then: the example should be working, since the NSP head from pretraining should have been transferred into the NSP-specific model. Will try to investigate further.
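One way to check which checkpoint weights actually transfer is from_pretrained's output_loading_info flag; a quick sketch (the comments describe what I would expect to see, not verified output):
>>> from transformers import MobileBertForNextSentencePrediction
>>> model, loading_info = MobileBertForNextSentencePrediction.from_pretrained(
...     'google/mobilebert-uncased', output_loading_info=True)
>>> print(loading_info['missing_keys'])     # weights the NSP model had to initialize from scratch
>>> print(loading_info['unexpected_keys'])  # checkpoint weights that were skipped (e.g. the LM head)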
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.