[Causal Language Modeling] seems not as expected
Problem
Causal models attend only to the left context, so they should not depend on tokens to the right. For example, in GPT-2 the representation of "Ich" should stay unchanged no matter what appears to its right, because a causal language model uses uni-directional self-attention.
```python
from transformers import AutoModel, AutoTokenizer
import torch

# GPT-2: a causal (uni-directional) language model
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# create ids of the decoder input
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# pass decoder input_ids to the model
# (AutoModel has no LM head, so `last_hidden_state` holds the final hidden states, not logits)
lm_logits = gpt_model(decoder_input_ids).last_hidden_state

# change the decoder input slightly: only the last token differs
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state

# compare the hidden state of the first position for the original and perturbed inputs
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))
```
Result
Is encoding for `Ich` equal to its perturbed version?: True
However, other models do not follow this assumption: their outputs change when the tokens on the right are changed. What is the reason? Is it a bug? I really want to know the answer, thank you!
BERT
Is encoding for `Ich` equal to its perturbed version?: False
BART
Is encoding for `Ich` equal to its perturbed version?: False
RoBERTa
Is encoding for `Ich` equal to its perturbed version?: False
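One likely explanation (not stated in the report itself, so treat it as an assumption): `AutoModel` loads BERT, RoBERTa, and the BART encoder as bidirectional models, so every position attends to both sides; a causal mask is only applied when the model is configured as a decoder. A minimal sketch of how one might check this with BERT, assuming the standard `is_decoder=True` behavior of `BertLMHeadModel`:

```python
import torch
from transformers import AutoTokenizer, BertLMHeadModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# is_decoder=True should make BERT apply a causal (lower-triangular) attention mask
model = BertLMHeadModel.from_pretrained('bert-base-uncased', is_decoder=True)
model.eval()  # disable dropout so the comparison is deterministic

ids = tokenizer("Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
ids_perturbed = tokenizer("Ich will das", return_tensors="pt", add_special_tokens=False).input_ids

with torch.no_grad():
    logits = model(ids).logits
    logits_perturbed = model(ids_perturbed).logits

# With a causal mask, the first position cannot see the changed last token,
# so its logits should match for both inputs.
print(torch.allclose(logits[0, 0], logits_perturbed[0, 0], atol=1e-3))
```

If this prints True while the plain `AutoModel.from_pretrained('bert-base-uncased')` comparison prints False, the difference comes from the attention mask configuration rather than a bug.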
Experiment notebook: Colab
Environment info
- `transformers` version: 4.3.3
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.7.1+cu101 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help
Information
Model I am using (GPT-2, BERT, RoBERTa, BART ForCausalLM):
The problem arises when using:
- [x] the official example scripts: https://huggingface.co/blog/encoder-decoder#decoder
To reproduce
Experiment notebook: Colab
Expected behavior
Causal models should not be affected by the right context.
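One direct way to verify this premise for GPT-2 (this snippet is not part of the original report, just a sanity check) is to inspect the attention weights the model returns; for a causal model, every position should put essentially zero weight on tokens to its right:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModel.from_pretrained('gpt2')
model.eval()

ids = tokenizer("Ich will ein", return_tensors="pt").input_ids
with torch.no_grad():
    # one attention tensor per layer, each of shape (batch, heads, query_pos, key_pos)
    attentions = model(ids, output_attentions=True).attentions

seq_len = ids.shape[1]
# boolean mask selecting the strictly upper-triangular part (keys to the right of the query)
upper = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# every layer and head should assign (near-)zero attention to future tokens
print(all((layer[0, :, upper].abs() < 1e-6).all() for layer in attentions))  # expect: True
```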
Top GitHub Comments
I don't know why I said `bias` 😂, it should be dropout. `from_config()` is more likely for training, so it should be fine not to add `model.eval()` by default. Thanks for your reply~

`model.eval()` does not disable the bias in the model as far as I know. `model.eval()` simply puts the model into "non training" mode, meaning that dropout layers are not applied, etc. I don't think we need to add a `model.eval()` to the `from_config()` function.
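On the `model.eval()` / dropout point, here is a minimal sketch (hypothetical, not from the thread; the config and token ids are arbitrary) of why a model built with `from_config()` gives non-deterministic outputs until `eval()` is called:

```python
import torch
from transformers import AutoModelForCausalLM, BertConfig

# from_config() gives a randomly initialized model that starts in training mode,
# so dropout is active and repeated forward passes differ.
config = BertConfig(is_decoder=True)
model = AutoModelForCausalLM.from_config(config)

ids = torch.tensor([[101, 2023, 2003, 102]])  # arbitrary valid token ids, for illustration only

out_a = model(ids).logits
out_b = model(ids).logits
print(torch.allclose(out_a, out_b))  # usually False: dropout randomizes activations

model.eval()  # switches dropout (and similar layers) to inference behavior
with torch.no_grad():
    out_a = model(ids).logits
    out_b = model(ids).logits
print(torch.allclose(out_a, out_b))  # True: outputs are now deterministic
```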