
xlm-mlm-17-1280: running the model returns embeddings of shape 200000

See original GitHub issue

I want to get embeddings for Russian text with xlm-mlm-17-1280, but the embeddings I end up with have the vocabulary dimension (200k) instead of the hidden size.

Example code (using the latest versions of transformers and torch on Ubuntu):

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

xlm_mlm = 'xlm-mlm-17-1280'
tokenizer_xlm_mlm = AutoTokenizer.from_pretrained(xlm_mlm)
model_xlm_mlm = AutoModelWithLMHead.from_pretrained(xlm_mlm)

# my_input_text is a Russian input string
input_ids = torch.tensor([tokenizer_xlm_mlm.encode(my_input_text)])  # batch size of 1
print(f'{input_ids.shape=}')
# input_ids.shape=torch.Size([1, 373])

lang_id_ru = tokenizer_xlm_mlm.lang2id['ru']

langs_ru = torch.tensor([lang_id_ru] * input_ids.shape[1])  # one language id per token
print(f'{langs_ru.shape=}')
# langs_ru.shape=torch.Size([373])

langs_ru = langs_ru.view(1, -1)  # now of shape [1, sequence_length]
print(f'{langs_ru.shape=}')
# langs_ru.shape=torch.Size([1, 373])

outputs = model_xlm_mlm(input_ids, langs=langs_ru)
outputs[0].shape
# torch.Size([1, 373, 200000])

So is this a bug, or my mistake?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jul 7, 2020

That’s probably because you were using the AutoModelWithLMHead factory instead of AutoModel. The former returns the embeddings projected onto the vocabulary, of dimension vocab_size (200 000 in your case), while the latter returns the transformer embeddings of dimension hidden_size (1280 in your case).

Change the two lines:

from transformers import AutoTokenizer, AutoModelWithLMHead

model_xlm_mlm = AutoModelWithLMHead.from_pretrained(xlm_mlm)

to

from transformers import AutoTokenizer, AutoModel

model_xlm_mlm = AutoModel.from_pretrained(xlm_mlm)
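The difference between the two factories can be sketched in plain PyTorch with toy dimensions (the sizes and layer here are illustrative stand-ins, not the real xlm-mlm-17-1280 weights): the base model yields hidden states of width hidden_size, while an LM head is essentially a linear projection of those hidden states onto the vocabulary.

```python
import torch
import torch.nn as nn

# Toy dimensions standing in for the real model
# (xlm-mlm-17-1280 has hidden_size=1280, vocab_size=200000).
hidden_size, vocab_size, seq_len = 16, 50, 7

# What AutoModel returns: per-token hidden states.
hidden_states = torch.randn(1, seq_len, hidden_size)

# What AutoModelWithLMHead adds: a projection onto the vocabulary.
lm_head = nn.Linear(hidden_size, vocab_size)
logits = lm_head(hidden_states)

print(hidden_states.shape)  # torch.Size([1, 7, 16])
print(logits.shape)         # torch.Size([1, 7, 50])
```

The last dimension of the output is what tells the two apart: hidden_size for the base model, vocab_size once the LM head is applied.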
0 reactions
vvssttkk commented, Jul 7, 2020

@LysandreJik maybe you can help: how can I get an embedding for the sentence after running outputs = model_xlm_mlm(input_ids, langs=langs_ru)? In the older version 2.3.0 the output’s last dimension was <emb_dim>, but now it’s <vocab_size>.
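Going from per-token embeddings to one sentence vector, once outputs[0] comes from AutoModel with shape [1, seq_len, hidden_size], is commonly done by mean-pooling over the token dimension. This is a sketch of that approach with a random stand-in tensor, not a method prescribed in the thread:

```python
import torch

# Stand-in for outputs[0] from AutoModel: [batch, seq_len, hidden_size].
token_embeddings = torch.randn(1, 373, 1280)

# Mean-pool over the token dimension to get one vector per sentence.
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1280])
```

Other pooling choices (e.g. taking the first token's embedding) exist; mean pooling is just a simple, common default.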

Read more comments on GitHub >
