
What does .encode() actually do to extract the embeddings?

See original GitHub issue

I looked at the code, and it reads as though .encode() obtains the embeddings by looping through all the hidden states (i.e. the sequence outputs from all layers).

Question: But the shape of the output embeddings is 1D, just (hidden dim size,), i.e. (512,) or (768,), as opposed to 2D, (number of input_ids, hidden dim size) for text models and (number of pixel_values, hidden dim size) for vision models. So are the sentence embeddings obtained by adding an MLP head on top of the sequence output of all layers, or by adding an MLP head on top of the pooler output of the sequence output of all layers? Please clarify.

Also, is the MLP head the classic one, i.e. a simple linear layer (with in and out features) plus the classic tanh activation, and no dropout of any sort?

Please advise …
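
For context on the shapes being asked about, here is a minimal sketch of .encode()'s behaviour (the model name is only an example, not something specified in the thread):

```python
from sentence_transformers import SentenceTransformer

# Example checkpoint; any sentence-transformers model behaves the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A single input string yields a single 1D vector: (hidden dim size,).
single = model.encode("How does encode work?")
print(single.shape)  # (384,) for this particular model

# A list of strings yields a 2D array: (num sentences, hidden dim size).
batch = model.encode(["first sentence", "second sentence"])
print(batch.shape)  # (2, 384)
```

So the 1D shape in the question corresponds to encoding a single sentence; a batch of inputs still produces one fixed-size vector per sentence.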

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
nreimers commented, Apr 27, 2022

Just the mean of the last layer
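
In sentence-transformers, that mean is implemented as an explicit pooling module rather than an MLP head. A minimal sketch of composing such a model by hand from the library's models.Transformer and models.Pooling building blocks (bert-base-uncased is just an example backbone):

```python
from sentence_transformers import SentenceTransformer, models

# Transformer backbone producing per-token embeddings from the last layer.
word_embedding = models.Transformer("bert-base-uncased", max_seq_length=128)

# Mean pooling over the token embeddings: no MLP head, no tanh, no dropout.
pooling = models.Pooling(
    word_embedding.get_word_embedding_dimension(),
    pooling_mode="mean",
)

model = SentenceTransformer(modules=[word_embedding, pooling])
```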

1 reaction
nreimers commented, Apr 27, 2022

In most cases mean pooling is performed: All output embeddings are averaged to give the fixed representation
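
The same pooling step can be reproduced with plain Hugging Face transformers. A minimal sketch, assuming a masked mean over last_hidden_state so padding tokens are excluded (the checkpoint name is just an example; some checkpoints append an extra normalization module, so .encode() output may additionally be L2-normalized):

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

encoded = tokenizer(["Sentence embeddings via mean pooling"],
                    padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# Average the last layer's token embeddings, ignoring padding positions.
token_embeddings = output.last_hidden_state             # (batch, seq_len, hidden)
mask = encoded["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(sentence_embedding.shape)                         # (1, hidden dim size)
```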


Top Results From Across the Web

  • Word Embedding in NLP: One-Hot Encoding and Skip-Gram ...
    The simplest method is called one-hot encoding, also known as “1-of-N” encoding (meaning the vector is composed of a single one and a...
  • NLP: Everything about Embeddings - Medium
    Embedding methods (alternatively referred to as “encoding”, “vectorising”, etc.) convert symbolic representations (i.e. words, emojis, ...
  • Embed, encode, attend, predict: The new deep learning ...
    The new approach can be summarised as a simple four-step formula: embed, encode, attend, predict. This post explains the components of this ...
  • Top 4 Sentence Embedding Techniques using Python!
    Just like SentenceBERT, we take a pair of sentences and encode them to generate the actual sentence embeddings. Then, extract the relations ...
  • How to Encode Text Data for Machine Learning with scikit-learn
    The text must be parsed to remove words, called tokenization. Then the words need to be encoded as integers or floating point values...
