[Causal Language Modeling] seems not as expected

See original GitHub issue

Problem

A causal model attends only to the left context, so its output should not depend on tokens to its right. Because causal language models use uni-directional self-attention, the hidden state of the first token in GPT-2 (“Ich” in the example below) should be unchanged no matter what appears after it; a small sketch of the mask follows.
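
For intuition, the causal mask is just a lower-triangular matrix: position i may attend only to positions 0..i. A minimal standalone sketch in plain PyTorch (not the transformers internals):

import torch

seq_len = 4
# row i is True only for columns 0..i, i.e. each position sees itself and its left context
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])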

from transformers import AutoModel, AutoTokenizer, AutoConfig
import torch

# GPT-2 (from_pretrained puts the model in eval mode by default, so dropout is off)
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')
embeddings = gpt_model.get_input_embeddings()

# tokenize the input
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# forward pass (despite the variable name, last_hidden_state holds hidden states, not LM logits)
lm_logits = gpt_model(decoder_input_ids).last_hidden_state

# change only the last token of the input
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state

# compare the first position's hidden state for the original and perturbed inputs
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))

Result

Is encoding for `Ich` equal to its perturbed version?:  True
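
As a sanity check (not in the original snippet), one can continue the code above and compare every position; with causal attention, only positions at or after the first differing token should change:

# continuing the snippet above; assumes both inputs tokenize to the same length
for i in range(lm_logits.shape[1]):
    same = torch.allclose(lm_logits[0, i], lm_logits_perturbed[0, i], atol=1e-3)
    print(f"position {i}: {'equal' if same else 'different'}")
# expected: every position prints "equal" except the last one, where the inputs differ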

However, for the models below the result does not follow this assumption: the outputs change when the right-side input changes. What is the reason? Is it a bug? I would really like to know the answer, thank you! (A hedged re-run with a BERT decoder follows the results below.)

BERT

Is encoding for `Ich` equal to its perturbed version?:  False

BART

Is encoding for `Ich` equal to its perturbed version?:  False

RoBERTa

Is encoding for `Ich` equal to its perturbed version?:  False
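
Loaded through AutoModel, BERT and RoBERTa are bidirectional encoders (and BART's encoder is bidirectional too), so attending to the right context is expected there, not a bug. To actually test causal behavior, the decoder-style classes with is_decoder=True are needed, and dropout has to be off (see the model.eval() discussion in the comments below). A minimal sketch of such a re-run; the checkpoint and example sentence are illustrative assumptions, not taken from the original notebook:

from transformers import BertLMHeadModel, BertTokenizer
import torch

# is_decoder=True switches BERT to uni-directional (causal) self-attention
model = BertLMHeadModel.from_pretrained('bert-base-uncased', is_decoder=True)
model.eval()  # disable dropout so the two forward passes are deterministic
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

input_ids = tokenizer("I will buy a", return_tensors="pt", add_special_tokens=False).input_ids
input_ids_perturbed = tokenizer("I will buy the", return_tensors="pt", add_special_tokens=False).input_ids

with torch.no_grad():
    logits = model(input_ids).logits
    logits_perturbed = model(input_ids_perturbed).logits

# with causal masking and dropout disabled, the first position should be unaffected
print(torch.allclose(logits[0, 0], logits_perturbed[0, 0], atol=1e-3))  # expected: True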

Experiment notebook (Colab)

Environment info

  • transformers version: 4.3.3
  • Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
  • Python version: 3.7.10
  • PyTorch version (GPU?): 1.7.1+cu101 (False)
  • Tensorflow version (GPU?): 2.4.1 (False)
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

Who can help

Information

Model I am using (GPT, BERT, RoBERTa, BART ForCausalLM):

The problem arises when using:

To reproduce

Experiment notebook (Colab)

Expected behavior

Causal models should not be affected by the right context.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:9 (8 by maintainers)

Top GitHub Comments

1 reaction
voidful commented, Mar 12, 2021

(Quoting @patrickvonplaten’s comment below:) model.eval() does not disable the bias in the model as far as I know. model.eval() simply puts the model into “non-training” mode, meaning that dropout layers are not applied, etc. I don’t think we need to add a model.eval() to the from_config() function.

I don’t know why I said bias 😂; it should be dropout.

from_config() is mostly used for training, so it should be fine not to add model.eval() by default.

Thanks for your reply~

0 reactions
patrickvonplaten commented, Mar 12, 2021

model.eval() does not disable the bias in the model as far as I know. model.eval() simply puts the model into “non-training” mode, meaning that dropout layers are not applied, etc. I don’t think we need to add a model.eval() to the from_config() function.
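
For context, a minimal standalone illustration (plain PyTorch, not tied to this issue's code) of what eval() changes: dropout is stochastic in training mode and an identity in eval mode:

import torch

drop = torch.nn.Dropout(p=0.5)
x = torch.ones(1, 4)

drop.train()    # training mode: dropout is active
print(drop(x))  # some elements are randomly zeroed, the rest scaled by 1/(1-p)

drop.eval()     # eval mode: dropout becomes a no-op
print(drop(x))  # tensor([[1., 1., 1., 1.]])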
