
Getting Word Embeddings for Sentences Using the Longformer Model?


I am new to Hugging Face and have a few basic questions. This post might also be helpful to others who are starting to use the Longformer model from Hugging Face.

Objective:

Create sentence/document embeddings using the Longformer model. We don't have labels in our dataset, so we want to run clustering on the generated embeddings. Please let me know if the code below is correct.
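For reference, the clustering step I have in mind would look roughly like the sketch below (assuming scikit-learn is available and that the embedding code further down returns one fixed-size vector per document; the number of clusters is only a placeholder):

import torch
from sklearn.cluster import KMeans

def cluster_documents(doc_embeddings, n_clusters=5):
    # doc_embeddings: tensor or array of shape [num_docs, hidden_size]
    # n_clusters=5 is just a placeholder value, not a recommendation.
    X = doc_embeddings.cpu().numpy() if torch.is_tensor(doc_embeddings) else doc_embeddings
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    return kmeans.labels_  # one cluster id per document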

Environment info

  • transformers version: 3.0.2
  • Platform:
  • Python version: 3.6.12 (Anaconda)
  • PyTorch version (GPU?): 1.7.1
  • TensorFlow version (GPU?): 2.3.0
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: parallel

Who can help

@patrickvonplaten


Information

Model I am using: Longformer (allenai/longformer-base-4096)

The problem arises when using:

  • my own modified scripts: (give details below)

The task I am working on is:

  • my own task or dataset: (give details below)

Code:

import torch
import pandas as pd
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in "evaluation" mode (disables dropout) for feed-forward operation.
model.eval()

df = pd.read_csv("inshort_news_data-1.csv")
df.head(5)
# The **news_article** column is used to generate the embeddings.
all_content = list(df['news_article'])

def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # long input document
        print("length of string: ", len(SAMPLE_TEXT.split()))
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)

        # How can I include a batch size here? (see the batched sketch after this code)

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on the first and last token

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            hidden_states = outputs[2]
            # Stack the 13 hidden-state tensors into one tensor: [layers, batch, tokens, hidden].
            token_embeddings = torch.stack(hidden_states, dim=0)
            # Remove dimension 1, the "batches".
            token_embeddings = torch.squeeze(token_embeddings, dim=1)
            # Swap dimensions 0 and 1: [tokens, layers, hidden].
            token_embeddings = token_embeddings.permute(1, 0, 2)

            token_vecs_sum = []
            # For each token in the sentence, sum the last four hidden layers.
            for token in token_embeddings:
                sum_vec = torch.sum(token[-4:], dim=0)
                # Use `sum_vec` to represent `token`.
                token_vecs_sum.append(sum_vec)

            # Sum the token vectors to get a single document vector.
            h = 0
            for j in range(len(token_vecs_sum)):
                h += token_vecs_sum[j]
            list_of_emb.append(h)

    return list_of_emb

f = sentence_bert()
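To address the "How can I include a batch size here?" question in the code (and question 1 below), here is a minimal sketch of a batched variant. It reuses the model and tokenizer loaded above; the batch size, max_length and the mean-pooling choice are my own assumptions, not something prescribed by the library.

# Hedged sketch: batched embedding extraction (assumes the model and tokenizer defined above).
import torch

def embed_in_batches(texts, batch_size=8, max_length=4096):
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # The tokenizer pads every sequence in the batch to the same length and
        # returns input_ids plus a 0/1 attention mask.
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=max_length, return_tensors='pt')
        attention_mask = enc['attention_mask'].clone()
        attention_mask[:, 0] = 2  # global attention on the first token of every document
        with torch.no_grad():
            outputs = model(enc['input_ids'], attention_mask=attention_mask)
        last_hidden = outputs[0]                             # [batch, seq_len, 768]
        # Mean-pool over real (non-padding) tokens to get one vector per document.
        mask = enc['attention_mask'].unsqueeze(-1).float()   # [batch, seq_len, 1]
        summed = (last_hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1)
        embeddings.append(summed / counts)
    return torch.cat(embeddings, dim=0)                      # [num_docs, 768]

The resulting [num_docs, 768] tensor could be fed directly to the clustering sketch near the top of this issue.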

Doubts/Questions:

  1. If we want to get embeddings in batches, what changes do I need to make to the above code? (A batched sketch appears right after the code above.)
  2. If the sentence is "I am learning longformer model.", will the tokenizer return the IDs of the following tokens: ['I', 'am', 'learning', 'longformer', 'model.']? Is my understanding correct? Can you explain it with a minimal reproducible example? (See the tokenizer sketch after this list.)
  3. Similarly, will the attention mask hold attention values for those same tokens? The part I didn't understand: is it necessary to set the attention value of the last token of the sentence to 2, as the code above does?
  4. outputs[0] gives us sequence_output: torch.Size([768]); outputs[1] gives us pooled_output: torch.Size([1, 512, 768]); outputs[2] gives us hidden_states: torch.Size([13, 512, 768]). Can you say more about what each dimension of the outputs represents? For example, what does the hidden_states shape [13, 512, 768] mean? Where do the 13, 512 and 768 come from in terms of number of layers, sequence length and embedding dimension?
  5. From which token do we get the sentence embedding in Longformer? Can you explain it with a minimal reproducible example? (See the pooling sketch below.)
  6. If I am running the model on a Linux system, where does the pre-trained model get downloaded or stored? Can you list the complete path?
  7. length of string: 15; input_ids: tensor([[ 0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884, 20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056, 1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694, 1437, 7424, 4, 2]]); input_ids.shape: torch.Size([1, 34]). My sentence has 15 words, so why are input_ids and attention_mask of length 34?
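On questions 2 and 7: the Longformer tokenizer is a byte-level BPE tokenizer, so it does not simply split on whitespace; words can be broken into sub-word pieces, and the special tokens <s> (id 0) and </s> (id 2) are added, which is why 15 words can become 34 IDs. A small sketch to inspect this (my own addition, not from the original issue; the exact pieces depend on the vocabulary):

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

sentence = "I am learning longformer model."
tokens = tokenizer.tokenize(sentence)        # sub-word pieces, no special tokens
ids = tokenizer.encode(sentence)             # same pieces as IDs, wrapped in <s> ... </s>

print(tokens)                                # pieces, not necessarily whole words
print(ids)                                   # starts with 0 (<s>) and ends with 2 (</s>)
print(tokenizer.convert_ids_to_tokens(ids))  # maps the IDs back to readable tokens
print(len(sentence.split()), len(ids))       # word count vs. token count

As for question 6: downloaded weights go to the transformers cache directory; on transformers 3.x this is typically ~/.cache/torch/transformers unless the TRANSFORMERS_CACHE environment variable points elsewhere, and printing transformers.file_utils.default_cache_path should confirm the exact location on your system.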

Expected behavior

Document 1: embeddings
Document 2: embeddings
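On questions 4 and 5: for this model the hidden-state shape [13, 512, 768] corresponds to 13 layers (the embedding layer plus 12 transformer layers), 512 positions in the sequence after Longformer's internal padding to its attention window, and a hidden size of 768. To get one embedding per document as in the expected behavior above, two common choices are the hidden state of the first <s> token (which carries global attention here) or a mean over the real token vectors; the snippet below is my own sketch, not an official recipe.

# Hedged sketch: reducing token-level hidden states to one vector per document.
import torch

def pool_outputs(outputs, attention_mask):
    last_hidden = outputs[0]                      # [batch, padded_seq_len, 768]

    # Longformer pads the sequence internally to a multiple of its attention
    # window (which is why a 34-token input shows up as 512 positions); trim
    # the output back to the length of the original attention mask.
    seq_len = attention_mask.shape[1]
    last_hidden = last_hidden[:, :seq_len, :]

    # Option 1: the first token (<s>), which was given global attention above.
    cls_embedding = last_hidden[:, 0, :]          # [batch, 768]

    # Option 2: mean over the real (non-padding) tokens.
    mask = (attention_mask > 0).unsqueeze(-1).float()
    mean_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    return cls_embedding, mean_embedding

Either vector can serve as the document embedding for clustering; there is no single "correct" token for sentence embeddings in Longformer, so it is worth trying both.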

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Jul 17, 2021

Ah saw your post and approved it.

1 reaction
pratikchhapolika commented, Jul 16, 2021

> Please use the forums for this kind of question; we keep the issues for bugs and feature requests only.

Thanks for pointing it out. I have posted my question on https://discuss.huggingface.co/. Sorry for the confusion.
