
How to generate BERT/Roberta word/sentence embedding?

See original GitHub issue

I know the standard procedure:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

last_hidden_states = outputs[0]  # (batch_size, input_len, embedding_size), but I need a single vector per sentence

However, I am working on improving an RNN by incorporating embeddings from a BERT-like pretrained model. How do I get a sentence embedding in this case (one vector for the entire sentence)? By averaging, or some other transformation of last_hidden_states? Is add_special_tokens necessary? Any suggested papers to read?
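One common way to collapse last_hidden_states into a single vector per sentence is masked mean pooling: average the token vectors while ignoring padding positions. This is a minimal sketch in plain PyTorch; the dummy tensors stand in for real model outputs, and mean pooling is only one of several options (taking the first token's vector is another):

```python
import torch

def mean_pool(last_hidden_states, attention_mask):
    # Expand the mask so padding tokens contribute nothing to the average
    mask = attention_mask.unsqueeze(-1).float()       # (batch, seq_len, 1)
    summed = (last_hidden_states * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)          # (batch, 1), avoid div by zero
    return summed / counts                            # (batch, hidden)

# Dummy tensors standing in for model outputs: batch=1, seq_len=7, hidden=4.
# With a real model you would pass outputs[0] and the tokenizer's attention_mask.
hidden = torch.randn(1, 7, 4)
mask = torch.ones(1, 7, dtype=torch.long)
sentence_vec = mean_pool(hidden, mask)  # shape (1, 4)
```

The masking matters once you batch sentences of different lengths: without it, padding vectors would be averaged in and skew short sentences' embeddings.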

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Reactions: 6
  • Comments: 5

Top GitHub Comments

23 reactions
cformosa commented, Feb 24, 2020

Hey @zjplab, for sentence embeddings I'd recommend this library: https://github.com/UKPLab/sentence-transformers, along with their paper. They explain how they get their sentence embeddings, as well as the pros and cons of several different ways of doing it. They have embeddings for BERT/RoBERTa and many more.

16 reactions
BramVanroy commented, Feb 24, 2020

Hi there. A few weeks or months ago, I wrote this notebook to introduce my colleagues to doing inference on LMs, in other words: how to get a sentence representation out of them. You can have a look here. It should be self-explanatory.

Read more comments on GitHub >

Top Results From Across the Web

sentence-transformers/nli-roberta-large
This model is deprecated. Please don't use it as it produces sentence embeddings of low quality. You can find recommended sentence embedding models...
How can I get RoBERTa word embeddings?
Given a sentence of the type 'Roberta is a heavily optimized version of BERT.', I need to get the embeddings for each of...
BERT Word Embeddings Tutorial
Creating word and sentence vectors from hidden states ... we will use BERT to extract features, namely word and sentence embedding vectors, ...
CLRP / How to get Text Embedding from RoBERTa
get CLS token; pool RoBERTa output (RoBERTa output = word embeddings) ... introduces how to perform sentence vectorization (text embedding) using models such as BERT. RoBERTa ...
An Intuitive Explanation of Sentence-BERT
After the sentences were inputted to BERT, because of BERT's word-level embeddings, the most common way to generate a sentence embedding was by...
