PAD symbols change the output
Adding [PAD] symbols to an input sentence changes the output of the model. I put together a small example here:
https://gist.github.com/juditacs/8be068d5f9063ad68e3098a473b497bd
I also noticed that the seed state affects the output as well. Resetting it in every run ensures that the output is always the same. Is this because of layernorm?
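For reference, a minimal sketch of that comparison (not the linked gist itself; it assumes the current transformers API and bert-base-uncased): run the same sentence once unpadded and once padded with an attention_mask, then compare the hidden states at the real token positions. Calling model.eval() disables dropout, which is the usual reason the output depends on the random seed rather than layernorm.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()  # eval() turns off dropout

text = "The cat sat on the mat."

# Encode once without padding and once padded out to 16 tokens.
plain = tokenizer(text, return_tensors="pt")
padded = tokenizer(text, padding="max_length", max_length=16, return_tensors="pt")

with torch.no_grad():
    h_plain = model(**plain).last_hidden_state
    h_padded = model(**padded).last_hidden_state

# Compare only the positions that hold real tokens in both runs.
n = plain["input_ids"].shape[1]
print((h_plain[0, :n] - h_padded[0, :n]).abs().max().item())
```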
Due to position embeddings, the same token yields a different vector at every position. You might want to google “How the Embedding Layers in BERT Were Implemented”.
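A quick way to see this (a sketch, assuming bert-base-uncased, where [PAD] has token id 0): feed a sequence made entirely of [PAD] tokens through the embedding layer alone. The word embedding is identical at every position, but the position embedding added to it is not, so the resulting vectors differ.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval()

# Ten [PAD] tokens (id 0 in bert-base-uncased).
input_ids = torch.zeros(1, 10, dtype=torch.long)

with torch.no_grad():
    # BertEmbeddings = word + position + token-type embeddings, then LayerNorm.
    vectors = model.embeddings(input_ids)[0]

# Same token everywhere, yet each position yields a different vector.
print((vectors[7] - vectors[8]).abs().max().item())  # non-zero
```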
@thomwolf Despite the attention_mask, the values are slightly different. Is it normal that the [PAD] vectors have different values? Here is the output; the [PAD] vectors differ, is that normal?

7 0.28312715888023376
8 0.08457585424184799
9 -0.3077544569969177
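This is expected: the attention_mask only stops the real tokens from attending to [PAD]; the [PAD] positions themselves still pass through the network and, because of their position embeddings, end up with different hidden states. A small sketch along the lines of the output above (hypothetical sentence and max_length; the printed numbers will not match the ones posted):

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

enc = tokenizer("Hello world", padding="max_length", max_length=10, return_tensors="pt")

with torch.no_grad():
    hidden = model(**enc).last_hidden_state[0]

# Print the first hidden-state dimension at every [PAD] position:
# the values differ from position to position, which is expected.
for i in (enc["attention_mask"][0] == 0).nonzero().flatten().tolist():
    print(i, hidden[i, 0].item())
```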