Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Error in roberta.extract_features_aligned_to_words()

See original GitHub issue

Running the following commands to extract features aligned to words threw an error:

import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()

ss = 'There were 28 apples in the house.  There are 54 apples in the garden.'
roberta.extract_features_aligned_to_words(ss)

The error message is as follows:

~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/hub_interface.py in extract_features_aligned_to_words(self, sentence, return_all_hiddens)
    125         features = self.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    126         features = features.squeeze(0)
--> 127         aligned_feats = alignment_utils.align_features_to_words(self, features, alignment)
    128 
    129         # wrap in spaCy Doc

~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
     92         output.append(weighted_features[j])
     93     output = torch.stack(output)
---> 94     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
     95     return output

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

3 reactions
ihungalexhsu commented, Dec 19, 2019

I ran into the same problem. I think the assertion is there to make sure the weighted sum works correctly, but some numerical error can accumulate over all the calculations, and 1e-4 is just a threshold to ensure that error is not too big.

In my case, I just enlarged the threshold from 1e-4 to 1e-3, and that fixed the problem.
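
Concretely, that change amounts to relaxing the tolerance in the assertion shown in the tracebacks above. A minimal sketch of the edited line in fairseq/models/roberta/alignment_utils.py (the exact line number may differ between fairseq versions):

# fairseq/models/roberta/alignment_utils.py, inside align_features_to_words()
# original check, which trips on longer inputs due to accumulated floating-point error:
#     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
# relaxed tolerance, as suggested in the comment above:
assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-3)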

3 reactions
sidsvash26 commented, Oct 29, 2019

I have the same issue, and it’s really hard to figure out which spaces need to be removed. In my case I do care about the alignments, since I’m looking to extract embeddings for some specific tokens.

I created a custom function because I don’t want to use spaCy tokens and I already have gold tokens available.

Consider the code below:

import torch
from fairseq.models.roberta import alignment_utils
from typing import List

def extract_aligned_roberta(roberta, sentence: str,
                            tokens: List[str],
                            return_all_hiddens=False):
    '''Code inspired from:
       https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py

    Aligns RoBERTa embeddings to a given word tokenization of a sentence.

    Inputs:
    1. roberta: the fairseq RoBERTa hub interface
    2. sentence: the sentence as a string
    3. tokens: the word tokens of the sentence to which the features should be aligned

    Outputs: aligned RoBERTa features
    '''

    # tokenize with the GPT-2 BPE and get the alignment to the given tokens
    bpe_toks = roberta.encode(sentence)
    alignment = alignment_utils.align_bpe_to_words(roberta, bpe_toks, tokens)

    # extract features and align them
    features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    features = features.squeeze(0)  # batch size = 1
    aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)

    return aligned_feats[1:-1]  # exclude <s> and </s> tokens

This code works for simple sentences:

sentence = 'There were 28 apples in the house. There are 54 apples in the garden.'
tokens = ['There','were', '28', 'apples', 'in', 'the', 'house', '.',  
              'There','are','54','apples','in','the','garden', '.']
print(extract_aligned_roberta(roberta, sentence, tokens).shape)

Outputs: torch.Size([16, 1024])

But when I use another sentence such as:

sentence1 = "DPA : Iraqi authorities announced that they had busted up 3 terrorist cells operating in Baghdad. Two of them were being run by 2 officials of the Ministry of the Interior! The MoI in Iraq is equivalent to the US FBI, so this would be like having J. Edgar Hoover unwittingly employ at a high level members of the Weathermen bombers back in the 1960s."

tokens1 = ['DPA', ':', 'Iraqi', 'authorities', 'announced', 'that', 'they', 'had', 'busted', 'up', '3', 'terrorist', 'cells', 'operating', 'in', 'Baghdad', '.', 'Two', 'of', 'them', 'were', 'being', 'run', 'by', '2', 'officials', 'of', 'the', 'Ministry', 'of', 'the', 'Interior', '!', 'The', 'MoI', 'in', 'Iraq', 'is', 'equivalent', 'to', 'the', 'US', 'FBI', ',', 'so', 'this', 'would', 'be', 'like', 'having', 'J.', 'Edgar', 'Hoover', 'unwittingly', 'employ', 'at', 'a', 'high', 'level', 'members', 'of', 'the', 'Weathermen', 'bombers', 'back', 'in', 'the', '1960s', '.']

print(extract_aligned_roberta(roberta, sentence1, tokens1).shape)

Then I get the same error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-23-a1b5dbdbcc62> in <module>
----> 1 extract_aligned_roberta(roberta, sentence, tokens).shape

<ipython-input-1-093c6979ba7b> in extract_aligned_roberta(roberta, sentence, tokens, return_all_hiddens)
     28     features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
     29     features = features.squeeze(0)   #Batch-size = 1
---> 30     aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)
     31 
     32     return aligned_feats[1:-1]  #exclude <s> and </s> tokens

~/anaconda3/envs/allennlp/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
     92         output.append(weighted_features[j])
     93     output = torch.stack(output)
---> 94     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
     95     return output
     96 

AssertionError: 

And it’s not clear to me if there are any extra spaces in the sentence.
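
One quick way to rule out stray whitespace (an illustrative check, not something proposed in the thread) is to collapse repeated spaces before encoding and see whether anything changes:

import re

sentence1_clean = re.sub(r'\s+', ' ', sentence1).strip()
print(sentence1_clean == sentence1)  # False would mean there were extra spaces to remove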

Any help here?

Read more comments on GitHub.

Top Results From Across the Web

Error in roberta.extract_features_aligned_to_words() #1106
The problem is that we assert that the sum of the "aligned" version matches the sum of the original BPE version. Since the...

RoBERTa - Hugging Face
RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer ... Check out the from_pretrained() method to load...

error received after loading Roberta and XLM_Roberta ...
My Python code seems to work just fine with bert-base and bert-large models, so I want to understand how I might need...

RoBERTa - PyTorch
An open source machine learning framework that accelerates the path from research prototyping to production deployment.
