Logit explosion in MobileBertForNextSentencePrediction example from documentation (and all others tried)
Environment info
- transformers version: 4.11.3
- Platform: Darwin-19.6.0-x86_64-i386-64bit
- Python version: 3.6.8
- PyTorch version (GPU?): 1.9.1 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Information
Model I am using (Bert, XLNet …): MobileBertForNextSentencePrediction
The problem arises when using:
- the official example scripts: the example code provided in https://huggingface.co/transformers/model_doc/mobilebert.html#mobilebertfornextsentenceprediction
The task I am working on is:
- an official GLUE/SQuAD task: Next Sentence Prediction
To reproduce
Steps to reproduce the behavior:
Run the code from the official example script in the documentation:
>>> from transformers import MobileBertTokenizer, MobileBertForNextSentencePrediction
>>> import torch
>>> tokenizer = MobileBertTokenizer.from_pretrained('google/mobilebert-uncased')
>>> model = MobileBertForNextSentencePrediction.from_pretrained('google/mobilebert-uncased')
>>> prompt = "In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced."
>>> next_sentence = "The sky is blue due to the shorter wavelength of blue light."
>>> encoding = tokenizer(prompt, next_sentence, return_tensors='pt')
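>>> # label 1 means next_sentence is NOT the actual continuation of prompt (0 would mean it is)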
>>> outputs = model(**encoding, labels=torch.LongTensor([1]))
>>> loss = outputs.loss
>>> logits = outputs.logits
Printing logits gives tensor([[2.7888e+08, 2.7884e+08]], grad_fn=<AddmmBackward>): strangely huge values for both classes, which push the softmax score to 1 for the "is next sentence" class. That is the opposite of the correct answer, which is surprising for an example taken straight from the documentation.
I ran the model on a handful of related prompt/next-sentence pairs, then on a larger set from my own NSP dataset, and got the same behavior every time: logits of about 2e+08 for both classes, with the first class higher only in the 3rd or 4th significant figure, regardless of the pair. Given those magnitudes, the softmax always gives a score of 1 for "is the next sentence" (the first class) and 0 for the other, no matter how unrelated the second sentence is, as the quick check below shows.
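To make the collapse concrete, a minimal check (reusing the model outputs from the snippet above) looks roughly like this:
>>> probs = outputs.logits.softmax(dim=-1)
>>> print(outputs.logits)  # tensor([[2.7888e+08, 2.7884e+08]], grad_fn=<AddmmBackward>)
>>> print(probs)           # effectively tensor([[1., 0.]]): class 0 ("is next sentence") always wins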
Expected behavior
For comparison, the logits produced on the same example using BertForNextSentencePrediction with bert-base-uncased are tensor([[-3.0729, 5.9056]], grad_fn=<AddmmBackward>). For an example like this, where the 'next sentence' clearly does not follow the prompt, I would expect MobileBertForNextSentencePrediction with the default pretrained checkpoint to get it right and to produce logits in a similar ballpark, not the huge positive values shown above.
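For reference, the BERT numbers came from running the same example with the BERT classes swapped in; a rough sketch (reusing the same prompt, next_sentence, and label as before):
>>> from transformers import BertTokenizer, BertForNextSentencePrediction
>>> bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
>>> bert_model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')
>>> bert_encoding = bert_tokenizer(prompt, next_sentence, return_tensors='pt')
>>> bert_outputs = bert_model(**bert_encoding, labels=torch.LongTensor([1]))
>>> print(bert_outputs.logits)  # tensor([[-3.0729, 5.9056]], grad_fn=<AddmmBackward>): "not next" clearly wins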
I posted about this on the Hugging Face discussion board, but the post was immediately taken down by the bot for some reason. Linking it here in case the admins approve it: https://discuss.huggingface.co/t/next-sentence-prediction-with-google-mobilebert-uncased-producing-massive-near-identical-logits-10-8-for-its-documentation-example-and-2k-others-tried/10750/1.
Checked the code, and yes: MobileBertForPreTraining and MobileBertForNextSentencePrediction are crafted in such a way, state_dict-wise, that the PreTraining checkpoint is loadable into the NextSentencePrediction model; the LM head won't be loaded, but it shouldn't get used anyway. My theory doesn't explain the current state of affairs, then: the example should be working, since the NSP head from pretraining should have been transferred into the NSP-specific model. Will try to investigate further.
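One way to check which checkpoint weights actually transfer is from_pretrained's output_loading_info flag; a quick sketch (the comments describe what I would expect to see, not verified output):
>>> from transformers import MobileBertForNextSentencePrediction
>>> model, loading_info = MobileBertForNextSentencePrediction.from_pretrained(
...     'google/mobilebert-uncased', output_loading_info=True)
>>> print(loading_info['missing_keys'])     # weights the NSP model had to initialize from scratch
>>> print(loading_info['unexpected_keys'])  # checkpoint weights that were skipped (e.g. the LM head)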
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.