Getting word embeddings for sentences using the Longformer model?
I am new to Hugging Face and have a few basic queries. This post might also be helpful to others who are starting to use the Longformer model from Hugging Face.
Objective:
Create sentence/document embeddings using the Longformer model. We don't have labels in our dataset, so we want to cluster the generated embeddings (see the clustering sketch after the code below). Please let me know if the code below is correct.
Environment info
- transformers version: 3.0.2
- Platform:
- Python version: Python 3.6.12 :: Anaconda, Inc.
- PyTorch version (GPU?): 1.7.1
- Tensorflow version (GPU?): 2.3.0
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: parallel
Who can help
Models:
- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
Library:
- benchmarks: @patrickvonplaten
- text generation: @patrickvonplaten
- tokenizers: @LysandreJik
- trainer: @sgugger
Information
Model I am using: Longformer
The problem arises when using:
- my own modified scripts: (give details below)
The task I am working on is:
- my own task or dataset: (give details below)
Code:
import pandas as pd
import torch
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in "evaluation" mode (feed-forward only, dropout disabled).
model.eval()

df = pd.read_csv("inshort_news_data-1.csv")
df.head(5)

# The news_article column is used to generate the embeddings.
all_content = list(df['news_article'])

def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # long input document
        print("length of string: ", len(SAMPLE_TEXT.split()))
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)
        # How do I include a batch of size > 1 here?

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long,
                                    device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on the first and last token

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            hidden_states = outputs[2]

        # Stack the hidden states of all layers: (layers, batch, tokens, hidden_size).
        token_embeddings = torch.stack(hidden_states, dim=0)
        # Remove dimension 1, the "batches".
        token_embeddings = torch.squeeze(token_embeddings, dim=1)
        # Swap dimensions 0 and 1: (tokens, layers, hidden_size).
        token_embeddings = token_embeddings.permute(1, 0, 2)

        token_vecs_sum = []
        # For each token in the sentence, sum the vectors from the last four layers.
        for token in token_embeddings:
            sum_vec = torch.sum(token[-4:], dim=0)
            # Use `sum_vec` to represent `token`.
            token_vecs_sum.append(sum_vec)

        # Sum all token vectors to get a single document embedding.
        h = 0
        for i in range(len(token_vecs_sum)):
            h += token_vecs_sum[i]
        list_of_emb.append(h)
    return list_of_emb

f = sentence_bert()
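Since we have no labels, the plan is to cluster the document embeddings returned above. This is a minimal sketch of what I have in mind, assuming scikit-learn is available; the number of clusters is just a placeholder:

```python
# Minimal clustering sketch (n_clusters=5 is an arbitrary placeholder).
import torch
from sklearn.cluster import KMeans

embeddings = torch.stack(f).numpy()       # (num_documents, 768) matrix built from the list returned above
kmeans = KMeans(n_clusters=5, random_state=0)
labels = kmeans.fit_predict(embeddings)   # cluster id for each document
print(labels)
```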
Doubts/Questions:
- If we want to get embeddings in batches, what changes do I need to make in the above code? (See the batching sketch after this list.)
- If the sentence is "I am learning longformer model.", will the tokenizer return the IDs of the following tokens: ['I', 'am', 'learning', 'longformer', 'model.']? Is my understanding correct? Can you explain it with a minimal reproducible example? (See the tokenization sketch after this list.)
- Similarly, will the attention mask hold the attention values for those same tokens? The part I didn't understand: is it necessary to replace the attention value of the last token of the sentence with 2 (as in the code above)?
- outputs[0] gives us sequence_output: torch.Size([768]); outputs[1] gives us pooled_output: torch.Size([1, 512, 768]); outputs[2] gives us hidden_states: torch.Size([13, 512, 768]). Can you say more about what each dimension in the outputs depicts? For example, what does the hidden_states shape [13, 512, 768] mean? Where do the 13, 512 and 768 come from in terms of number of layers, sequence length and embedding dimension? (See the hidden-states sketch after this list.)
- From which token do we get the sentence embedding in Longformer? Can you explain it with a minimal reproducible example? (See the <s>-token sketch after this list.)
- If I am running the model on a Linux system, where does the pre-trained model get downloaded or stored? Can you list the complete path? (See the cache sketch after this list.)
- length of string: 15; input_ids: tensor([[ 0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884, 20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056, 1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694, 1437, 7424, 4, 2]]); input_ids.shape: torch.Size([1, 34]). My sentence has 15 words, so why are input_ids and the attention mask of length 34?
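For the batching question, this is roughly what I would try, assuming the tokenizer's batch call with padding and truncation behaves the same way in transformers 3.0.2 (please correct me if the arguments differ in this version):

```python
# Sketch of batched encoding; batch of 8 is arbitrary.
batch_texts = all_content[:8]
encoded = tokenizer(batch_texts, padding=True, truncation=True,
                    max_length=4096, return_tensors="pt")
input_ids = encoded["input_ids"]           # shape: (batch_size, longest_seq_len_in_batch)
attention_mask = encoded["attention_mask"] # 1 for real tokens, 0 for padding
# Unlike my per-document loop, I only mark the first token as global here,
# because the last position of a padded sequence may be a pad token.
attention_mask[:, 0] = 2
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
```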
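To make the tokenization question (and the length-34 observation) concrete, this is the kind of minimal example I mean. My understanding is that the tokenizer splits words into sub-word pieces and adds the special <s> and </s> tokens, which is why the number of IDs can exceed the whitespace word count, but please confirm:

```python
# Minimal tokenization example; `text` is just an illustrative sentence.
text = "I am learning longformer model."
tokens = tokenizer.tokenize(text)            # sub-word pieces; a word may be split into several pieces
ids = tokenizer.encode(text)                 # adds the special <s> (id 0) and </s> (id 2) tokens
print(tokens)
print(ids)
print(tokenizer.convert_ids_to_tokens(ids))  # map the IDs back to tokens to see exactly what was encoded
```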
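For the shape question, my current understanding (which I would like confirmed) is that hidden_states is a tuple with one entry per encoder layer plus the initial embedding output, so for longformer-base-4096 that is 1 + 12 = 13 entries, each of shape (batch_size, sequence_length, hidden_size=768); the 512 would then be the (internally padded) sequence length of my input. A small check, reusing the model and tokenizer defined above:

```python
# Inspecting the hidden states for a single document (same pattern as inside the loop above).
enc = torch.tensor(tokenizer.encode(all_content[0])).unsqueeze(0)
mask = torch.ones(enc.shape, dtype=torch.long)
mask[:, [0, -1]] = 2
with torch.no_grad():
    out = model(enc, attention_mask=mask)
hidden_states = out[2]          # tuple: embedding output + one tensor per encoder layer
print(len(hidden_states))       # 13 = 1 embedding layer + 12 encoder layers for the base model
print(hidden_states[-1].shape)  # (batch_size, sequence_length, 768) -- hidden size of longformer-base is 768
```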
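On which token to use for the sentence embedding: the alternative I am weighing against my sum-of-all-tokens approach is to take the vector of the first (<s>) token, which as far as I understand is a common choice for sentence-level representations. Building on the snippet above:

```python
# Candidate sentence representation: the first (<s>) token's vector from the last hidden layer.
last_layer = hidden_states[-1]        # (batch_size, sequence_length, 768), from the snippet above
cls_embedding = last_layer[:, 0, :]   # the <s> token's vector, one per document
print(cls_embedding.shape)            # (1, 768)
```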
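For the question about where the pre-trained weights end up on Linux, rather than guessing a hard-coded path, I assume the cache directory can be read from the library itself (this is what I would try; the exact default location may differ between transformers versions):

```python
# Print the cache directory used for downloaded pre-trained models
# (assumes transformers 3.x exposes this constant in file_utils).
from transformers.file_utils import TRANSFORMERS_CACHE
print(TRANSFORMERS_CACHE)
```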
Expected behavior
Document1: embeddings
Document2: embeddings
Top GitHub Comments
Ah saw your post and approved it.
Thanks for pointing it out. I have posted my question on https://discuss.huggingface.co/. Sorry for the confusion.