
Getting Word Embeddings for Sentences Using the Longformer Model?


I am new to Hugging Face and have a few basic questions. This post might also be helpful to others who are starting to use the Longformer model from Hugging Face.

Objective:

Create sentence/document embeddings using the Longformer model. We don't have labels in our dataset, so we want to run clustering on the generated embeddings. Please let me know if the code below is correct.
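For reference, the clustering step I have in mind would look roughly like the sketch below (assuming scikit-learn is available and that the embedding code further down returns one fixed-size vector per document; the number of clusters is only a placeholder):

import torch
from sklearn.cluster import KMeans

def cluster_documents(doc_embeddings, n_clusters=5):
    # doc_embeddings: tensor or array of shape [num_docs, hidden_size]
    # n_clusters=5 is just a placeholder value, not a recommendation.
    X = doc_embeddings.cpu().numpy() if torch.is_tensor(doc_embeddings) else doc_embeddings
    kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(X)
    return kmeans.labels_  # one cluster id per document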

Environment info

  • transformers version: 3.0.2
  • Platform:
  • Python version: 3.6.12 (Anaconda)
  • PyTorch version (GPU?): 1.7.1
  • TensorFlow version (GPU?): 2.3.0
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: parallel

Who can help

@patrickvonplaten


Information

Model I am using: Longformer (allenai/longformer-base-4096)

The problem arises when using:

  • my own modified scripts: (give details below)

The task I am working on is:

  • my own task or dataset: (give details below)

Code:

import torch
import pandas as pd
from transformers import LongformerModel, LongformerTokenizer

model = LongformerModel.from_pretrained('allenai/longformer-base-4096', output_hidden_states=True)
tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

# Put the model in "evaluation" mode (disables dropout) for feed-forward operation.
model.eval()

df = pd.read_csv("inshort_news_data-1.csv")
df.head(5)
# The **news_article** column is used to generate the embeddings.
all_content = list(df['news_article'])

def sentence_bert():
    list_of_emb = []
    for i in range(len(all_content)):
        SAMPLE_TEXT = all_content[i]  # long input document
        print("length of string: ", len(SAMPLE_TEXT.split()))
        input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)

        # How can I include a batch size here? (see the batched sketch after this code)

        # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
        attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device)  # initialize to local attention
        attention_mask[:, [0, -1]] = 2  # global attention on the first and last token

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)
            hidden_states = outputs[2]
            # Stack the 13 hidden-state tensors into one tensor: [layers, batch, tokens, hidden].
            token_embeddings = torch.stack(hidden_states, dim=0)
            # Remove dimension 1, the "batches".
            token_embeddings = torch.squeeze(token_embeddings, dim=1)
            # Swap dimensions 0 and 1: [tokens, layers, hidden].
            token_embeddings = token_embeddings.permute(1, 0, 2)

            token_vecs_sum = []
            # For each token in the sentence, sum the last four hidden layers.
            for token in token_embeddings:
                sum_vec = torch.sum(token[-4:], dim=0)
                # Use `sum_vec` to represent `token`.
                token_vecs_sum.append(sum_vec)

            # Sum the token vectors to get a single document vector.
            h = 0
            for j in range(len(token_vecs_sum)):
                h += token_vecs_sum[j]
            list_of_emb.append(h)

    return list_of_emb

f = sentence_bert()
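To address the "How can I include a batch size here?" question in the code (and question 1 below), here is a minimal sketch of a batched variant. It reuses the model and tokenizer loaded above; the batch size, max_length and the mean-pooling choice are my own assumptions, not something prescribed by the library.

# Hedged sketch: batched embedding extraction (assumes the model and tokenizer defined above).
import torch

def embed_in_batches(texts, batch_size=8, max_length=4096):
    embeddings = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # The tokenizer pads every sequence in the batch to the same length and
        # returns input_ids plus a 0/1 attention mask.
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=max_length, return_tensors='pt')
        attention_mask = enc['attention_mask'].clone()
        attention_mask[:, 0] = 2  # global attention on the first token of every document
        with torch.no_grad():
            outputs = model(enc['input_ids'], attention_mask=attention_mask)
        last_hidden = outputs[0]                             # [batch, seq_len, 768]
        # Mean-pool over real (non-padding) tokens to get one vector per document.
        mask = enc['attention_mask'].unsqueeze(-1).float()   # [batch, seq_len, 1]
        summed = (last_hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1)
        embeddings.append(summed / counts)
    return torch.cat(embeddings, dim=0)                      # [num_docs, 768]

The resulting [num_docs, 768] tensor could be fed directly to the clustering sketch near the top of this issue.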

Doubts/Questions:

  1. If we want to get embeddings in batches, what changes do I need to make to the above code? (A batched sketch appears right after the code above.)
  2. If the sentence is "I am learning longformer model.", will the tokenizer return the IDs of the following tokens: ['I', 'am', 'learning', 'longformer', 'model.']? Is my understanding correct? Can you explain it with a minimal reproducible example? (See the tokenizer sketch after this list.)
  3. Similarly, will the attention mask hold attention values for those same tokens? The part I didn't understand: is it necessary to set the attention value of the last token of the sentence to 2, as the code above does?
  4. outputs[0] gives us sequence_output: torch.Size([768]); outputs[1] gives us pooled_output: torch.Size([1, 512, 768]); outputs[2] gives us hidden_states: torch.Size([13, 512, 768]). Can you say more about what each dimension of the outputs represents? For example, what does the hidden_states shape [13, 512, 768] mean? Where do the 13, 512 and 768 come from in terms of number of layers, sequence length and embedding dimension?
  5. From which token do we get the sentence embedding in Longformer? Can you explain it with a minimal reproducible example? (See the pooling sketch below.)
  6. If I am running the model on a Linux system, where does the pre-trained model get downloaded or stored? Can you list the complete path?
  7. length of string: 15; input_ids: tensor([[ 0, 35702, 1437, 3743, 1437, 560, 1437, 48317, 1437, 28884, 20042, 1437, 6968, 241, 1437, 16402, 1437, 463, 1437, 3056, 1437, 48317, 1437, 281, 1437, 16752, 1437, 281, 1437, 1694, 1437, 7424, 4, 2]]); input_ids.shape: torch.Size([1, 34]). My sentence has 15 words, so why are input_ids and attention_mask of length 34?
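On questions 2 and 7: the Longformer tokenizer is a byte-level BPE tokenizer, so it does not simply split on whitespace; words can be broken into sub-word pieces, and the special tokens <s> (id 0) and </s> (id 2) are added, which is why 15 words can become 34 IDs. A small sketch to inspect this (my own addition, not from the original issue; the exact pieces depend on the vocabulary):

from transformers import LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

sentence = "I am learning longformer model."
tokens = tokenizer.tokenize(sentence)        # sub-word pieces, no special tokens
ids = tokenizer.encode(sentence)             # same pieces as IDs, wrapped in <s> ... </s>

print(tokens)                                # pieces, not necessarily whole words
print(ids)                                   # starts with 0 (<s>) and ends with 2 (</s>)
print(tokenizer.convert_ids_to_tokens(ids))  # maps the IDs back to readable tokens
print(len(sentence.split()), len(ids))       # word count vs. token count

As for question 6: downloaded weights go to the transformers cache directory; on transformers 3.x this is typically ~/.cache/torch/transformers unless the TRANSFORMERS_CACHE environment variable points elsewhere, and printing transformers.file_utils.default_cache_path should confirm the exact location on your system.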

Expected behavior

Document 1: embeddings
Document 2: embeddings
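On questions 4 and 5: for this model the hidden-state shape [13, 512, 768] corresponds to 13 layers (the embedding layer plus 12 transformer layers), 512 positions in the sequence after Longformer's internal padding to its attention window, and a hidden size of 768. To get one embedding per document as in the expected behavior above, two common choices are the hidden state of the first <s> token (which carries global attention here) or a mean over the real token vectors; the snippet below is my own sketch, not an official recipe.

# Hedged sketch: reducing token-level hidden states to one vector per document.
import torch

def pool_outputs(outputs, attention_mask):
    last_hidden = outputs[0]                      # [batch, padded_seq_len, 768]

    # Longformer pads the sequence internally to a multiple of its attention
    # window (which is why a 34-token input shows up as 512 positions); trim
    # the output back to the length of the original attention mask.
    seq_len = attention_mask.shape[1]
    last_hidden = last_hidden[:, :seq_len, :]

    # Option 1: the first token (<s>), which was given global attention above.
    cls_embedding = last_hidden[:, 0, :]          # [batch, 768]

    # Option 2: mean over the real (non-padding) tokens.
    mask = (attention_mask > 0).unsqueeze(-1).float()
    mean_embedding = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    return cls_embedding, mean_embedding

Either vector can serve as the document embedding for clustering; there is no single "correct" token for sentence embeddings in Longformer, so it is worth trying both.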

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
sgugger commented, Jul 17, 2021

Ah saw your post and approved it.

1 reaction
pratikchhapolika commented, Jul 16, 2021

> Please use the forums for this kind of question; we keep the issues for bugs and feature requests only.

Thanks for pointing it out. I have posted my question on https://discuss.huggingface.co/. Sorry for the confusion.
