[Causal Language Modeling] seems not as expected
Problem
Causal models attend only to the left context, so they should not depend on tokens to the right. For example, in GPT-2 the representation of "Ich" should stay unchanged no matter what appears to its right, because a causal language model uses uni-directional self-attention.
```python
from transformers import AutoModel, AutoTokenizer
import torch

# GPT-2: a causal (uni-directional) language model
gpt_model = AutoModel.from_pretrained('gpt2')
gpt_tokenizer = AutoTokenizer.from_pretrained('gpt2')

# create ids of the decoder input
decoder_input_ids = gpt_tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids

# pass decoder input_ids to the model
# (AutoModel has no LM head, so `last_hidden_state` holds the final hidden states, not logits)
lm_logits = gpt_model(decoder_input_ids).last_hidden_state

# change the decoder input slightly: only the last token differs
decoder_input_ids_perturbed = gpt_tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
lm_logits_perturbed = gpt_model(decoder_input_ids_perturbed).last_hidden_state

# compare the hidden state of the first position for the original and perturbed inputs
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))
```
Result
Is encoding for `Ich` equal to its perturbed version?: True
However, other models do not follow this assumption: their outputs change when the tokens on the right are changed. What is the reason? Is it a bug? I really want to know the answer, thank you!
BERT
Is encoding for `Ich` equal to its perturbed version?: False
BART
Is encoding for `Ich` equal to its perturbed version?: False
RoBERTa
Is encoding for `Ich` equal to its perturbed version?: False
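One likely explanation (not stated in the report itself, so treat it as an assumption): `AutoModel` loads BERT, RoBERTa, and the BART encoder as bidirectional models, so every position attends to both sides; a causal mask is only applied when the model is configured as a decoder. A minimal sketch of how one might check this with BERT, assuming the standard `is_decoder=True` behavior of `BertLMHeadModel`:

```python
import torch
from transformers import AutoTokenizer, BertLMHeadModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# is_decoder=True should make BERT apply a causal (lower-triangular) attention mask
model = BertLMHeadModel.from_pretrained('bert-base-uncased', is_decoder=True)
model.eval()  # disable dropout so the comparison is deterministic

ids = tokenizer("Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
ids_perturbed = tokenizer("Ich will das", return_tensors="pt", add_special_tokens=False).input_ids

with torch.no_grad():
    logits = model(ids).logits
    logits_perturbed = model(ids_perturbed).logits

# With a causal mask, the first position cannot see the changed last token,
# so its logits should match for both inputs.
print(torch.allclose(logits[0, 0], logits_perturbed[0, 0], atol=1e-3))
```

If this prints True while the plain `AutoModel.from_pretrained('bert-base-uncased')` comparison prints False, the difference comes from the attention mask configuration rather than a bug.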
Experiment notebook: Colab
Environment info
- `transformers` version: 4.3.3
- Platform: Linux-4.19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.7.1+cu101 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Who can help
Information
Model I am using (GPT-2, BERT, RoBERTa, BART ForCausalLM):
The problem arises when using:
- [x] the official example scripts: https://huggingface.co/blog/encoder-decoder#decoder
To reproduce
Experiment notebook: Colab
Expected behavior
Causal models should not be affected by the right context.
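One direct way to verify this premise for GPT-2 (this snippet is not part of the original report, just a sanity check) is to inspect the attention weights the model returns; for a causal model, every position should put essentially zero weight on tokens to its right:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModel.from_pretrained('gpt2')
model.eval()

ids = tokenizer("Ich will ein", return_tensors="pt").input_ids
with torch.no_grad():
    # one attention tensor per layer, each of shape (batch, heads, query_pos, key_pos)
    attentions = model(ids, output_attentions=True).attentions

seq_len = ids.shape[1]
# boolean mask selecting the strictly upper-triangular part (keys to the right of the query)
upper = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

# every layer and head should assign (near-)zero attention to future tokens
print(all((layer[0, :, upper].abs() < 1e-6).all() for layer in attentions))  # expect: True
```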
Top GitHub Comments
I don't know why I said `bias` 😂, it should be dropout. `from_config()` is more likely for training, so it should be fine not to add `model.eval()` by default. Thanks for your reply~

`model.eval()` does not disable the bias in the model as far as I know. `model.eval()` simply puts the model into "non training" mode, meaning that dropout layers are not applied, etc. I don't think we need to add a `model.eval()` to the `from_config()` function.
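On the `model.eval()` / dropout point, here is a minimal sketch (hypothetical, not from the thread; the config and token ids are arbitrary) of why a model built with `from_config()` gives non-deterministic outputs until `eval()` is called:

```python
import torch
from transformers import AutoModelForCausalLM, BertConfig

# from_config() gives a randomly initialized model that starts in training mode,
# so dropout is active and repeated forward passes differ.
config = BertConfig(is_decoder=True)
model = AutoModelForCausalLM.from_config(config)

ids = torch.tensor([[101, 2023, 2003, 102]])  # arbitrary valid token ids, for illustration only

out_a = model(ids).logits
out_b = model(ids).logits
print(torch.allclose(out_a, out_b))  # usually False: dropout randomizes activations

model.eval()  # switches dropout (and similar layers) to inference behavior
with torch.no_grad():
    out_a = model(ids).logits
    out_b = model(ids).logits
print(torch.allclose(out_a, out_b))  # True: outputs are now deterministic
```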