
xlm-mlm-17-1280: running the model returns embeddings of shape 200000

See original GitHub issue

I want to get embeddings for Russian text with xlm-mlm-17-1280, but the embeddings I end up with have the vocabulary dimension (200k) instead of the hidden size.

Example code (using the latest versions of transformers and torch on Ubuntu):

import torch
from transformers import AutoTokenizer, AutoModelWithLMHead

xlm_mlm = 'xlm-mlm-17-1280'
tokenizer_xlm_mlm = AutoTokenizer.from_pretrained(xlm_mlm)
model_xlm_mlm = AutoModelWithLMHead.from_pretrained(xlm_mlm)

# my_input_text is a Russian input string
input_ids = torch.tensor([tokenizer_xlm_mlm.encode(my_input_text)])  # batch size of 1
print(f'{input_ids.shape=}')
# input_ids.shape=torch.Size([1, 373])

lang_id_ru = tokenizer_xlm_mlm.lang2id['ru']

langs_ru = torch.tensor([lang_id_ru] * input_ids.shape[1])  # one language id per token
print(f'{langs_ru.shape=}')
# langs_ru.shape=torch.Size([373])

langs_ru = langs_ru.view(1, -1)  # now of shape [1, sequence_length]
print(f'{langs_ru.shape=}')
# langs_ru.shape=torch.Size([1, 373])

outputs = model_xlm_mlm(input_ids, langs=langs_ru)
outputs[0].shape
# torch.Size([1, 373, 200000])

So is this a bug, or my mistake?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jul 7, 2020

That’s probably because you were using the AutoModelWithLMHead factory instead of AutoModel. The former returns the embeddings projected onto the vocabulary, of dimension vocab_size (200 000 in your case), while the latter returns the transformer embeddings of dimension hidden_size (1280 in your case).

Change the two lines:

from transformers import AutoTokenizer, AutoModelWithLMHead

model_xlm_mlm = AutoModelWithLMHead.from_pretrained(xlm_mlm)

to

from transformers import AutoTokenizer, AutoModel

model_xlm_mlm = AutoModel.from_pretrained(xlm_mlm)
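The difference between the two factories can be sketched in plain PyTorch with toy dimensions (the sizes and layer here are illustrative stand-ins, not the real xlm-mlm-17-1280 weights): the base model yields hidden states of width hidden_size, while an LM head is essentially a linear projection of those hidden states onto the vocabulary.

```python
import torch
import torch.nn as nn

# Toy dimensions standing in for the real model
# (xlm-mlm-17-1280 has hidden_size=1280, vocab_size=200000).
hidden_size, vocab_size, seq_len = 16, 50, 7

# What AutoModel returns: per-token hidden states.
hidden_states = torch.randn(1, seq_len, hidden_size)

# What AutoModelWithLMHead adds: a projection onto the vocabulary.
lm_head = nn.Linear(hidden_size, vocab_size)
logits = lm_head(hidden_states)

print(hidden_states.shape)  # torch.Size([1, 7, 16])
print(logits.shape)         # torch.Size([1, 7, 50])
```

The last dimension of the output is what tells the two apart: hidden_size for the base model, vocab_size once the LM head is applied.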
0 reactions
vvssttkk commented, Jul 7, 2020

@LysandreJik maybe you can help: how can I get an embedding for the sentence after running outputs = model_xlm_mlm(input_ids, langs=langs_ru)? In the older version 2.3.0 the output’s last dimension was <emb_dim>, but now it’s <vocab_size>.
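Going from per-token embeddings to one sentence vector, once outputs[0] comes from AutoModel with shape [1, seq_len, hidden_size], is commonly done by mean-pooling over the token dimension. This is a sketch of that approach with a random stand-in tensor, not a method prescribed in the thread:

```python
import torch

# Stand-in for outputs[0] from AutoModel: [batch, seq_len, hidden_size].
token_embeddings = torch.randn(1, 373, 1280)

# Mean-pool over the token dimension to get one vector per sentence.
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1280])
```

Other pooling choices (e.g. taking the first token's embedding) exist; mean pooling is just a simple, common default.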

Read more comments on GitHub >
