Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Error in roberta.extract_features_aligned_to_words()

See original GitHub issue

Running the following commands to extract features aligned to words threw an error:

import torch

roberta = torch.hub.load('pytorch/fairseq', 'roberta.large')
roberta.eval()

ss = 'There were 28 apples in the house.  There are 54 apples in the garden.'
roberta.extract_features_aligned_to_words(ss)

The error message is as follows:

~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/hub_interface.py in extract_features_aligned_to_words(self, sentence, return_all_hiddens)
    125         features = self.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    126         features = features.squeeze(0)
--> 127         aligned_feats = alignment_utils.align_features_to_words(self, features, alignment)
    128 
    129         # wrap in spaCy Doc

~/.cache/torch/hub/pytorch_fairseq_master/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
     92         output.append(weighted_features[j])
     93     output = torch.stack(output)
---> 94     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
     95     return output

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Reactions: 3
  • Comments: 7 (1 by maintainers)

Top GitHub Comments

3 reactions
ihungalexhsu commented, Dec 19, 2019

I ran into the same problem. I think the assertion is there to make sure the weighted sum works correctly, but some numerical error can accumulate over all the calculations, and 1e-4 is just a threshold to ensure that error is not too big.

In my case, I just enlarged the threshold from 1e-4 to 1e-3, and that fixed the problem.
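
Concretely, that change amounts to relaxing the tolerance in the assertion shown in the tracebacks above. A minimal sketch of the edited line in fairseq/models/roberta/alignment_utils.py (the exact line number may differ between fairseq versions):

# fairseq/models/roberta/alignment_utils.py, inside align_features_to_words()
# original check, which trips on longer inputs due to accumulated floating-point error:
#     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
# relaxed tolerance, as suggested in the comment above:
assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-3)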

3 reactions
sidsvash26 commented, Oct 29, 2019

I have the same issue, and it’s really hard to figure out which spaces need to be removed. In my case I do care about the alignments, since I’m looking to extract embeddings for some specific tokens.

I created a custom function because I don’t want to use spaCy tokens and I already have gold tokens available.

Consider the code below:

import torch
from fairseq.models.roberta import alignment_utils
from typing import List

def extract_aligned_roberta(roberta, sentence: str,
                            tokens: List[str],
                            return_all_hiddens=False):
    '''Code inspired from:
       https://github.com/pytorch/fairseq/blob/master/fairseq/models/roberta/hub_interface.py

    Aligns RoBERTa embeddings to a given word tokenization of a sentence.

    Inputs:
    1. roberta: the fairseq RoBERTa hub interface
    2. sentence: the sentence as a string
    3. tokens: the word tokens of the sentence to which the features should be aligned

    Outputs: aligned RoBERTa features
    '''

    # tokenize with the GPT-2 BPE and get the alignment to the given tokens
    bpe_toks = roberta.encode(sentence)
    alignment = alignment_utils.align_bpe_to_words(roberta, bpe_toks, tokens)

    # extract features and align them
    features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
    features = features.squeeze(0)  # batch size = 1
    aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)

    return aligned_feats[1:-1]  # exclude <s> and </s> tokens

This code works for simple sentences:

sentence = 'There were 28 apples in the house. There are 54 apples in the garden.'
tokens = ['There','were', '28', 'apples', 'in', 'the', 'house', '.',  
              'There','are','54','apples','in','the','garden', '.']
print(extract_aligned_roberta(roberta, sentence, tokens).shape)

Outputs: torch.Size([16, 1024])

But when I use another sentence such as:

sentence1 = "DPA : Iraqi authorities announced that they had busted up 3 terrorist cells operating in Baghdad. Two of them were being run by 2 officials of the Ministry of the Interior! The MoI in Iraq is equivalent to the US FBI, so this would be like having J. Edgar Hoover unwittingly employ at a high level members of the Weathermen bombers back in the 1960s."

tokens1 = ['DPA', ':', 'Iraqi', 'authorities', 'announced', 'that', 'they', 'had', 'busted', 'up', '3', 'terrorist', 'cells', 'operating', 'in', 'Baghdad', '.', 'Two', 'of', 'them', 'were', 'being', 'run', 'by', '2', 'officials', 'of', 'the', 'Ministry', 'of', 'the', 'Interior', '!', 'The', 'MoI', 'in', 'Iraq', 'is', 'equivalent', 'to', 'the', 'US', 'FBI', ',', 'so', 'this', 'would', 'be', 'like', 'having', 'J.', 'Edgar', 'Hoover', 'unwittingly', 'employ', 'at', 'a', 'high', 'level', 'members', 'of', 'the', 'Weathermen', 'bombers', 'back', 'in', 'the', '1960s', '.']

print(extract_aligned_roberta(roberta, sentence1, tokens1).shape)

Then I get the same error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-23-a1b5dbdbcc62> in <module>
----> 1 extract_aligned_roberta(roberta, sentence, tokens).shape

<ipython-input-1-093c6979ba7b> in extract_aligned_roberta(roberta, sentence, tokens, return_all_hiddens)
     28     features = roberta.extract_features(bpe_toks, return_all_hiddens=return_all_hiddens)
     29     features = features.squeeze(0)   #Batch-size = 1
---> 30     aligned_feats = alignment_utils.align_features_to_words(roberta, features, alignment)
     31 
     32     return aligned_feats[1:-1]  #exclude <s> and </s> tokens

~/anaconda3/envs/allennlp/lib/python3.6/site-packages/fairseq/models/roberta/alignment_utils.py in align_features_to_words(roberta, features, alignment)
     92         output.append(weighted_features[j])
     93     output = torch.stack(output)
---> 94     assert torch.all(torch.abs(output.sum(dim=0) - features.sum(dim=0)) < 1e-4)
     95     return output
     96 

AssertionError: 

And it’s not clear to me if there are any extra spaces in the sentence.
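
One quick way to rule out stray whitespace (an illustrative check, not something proposed in the thread) is to collapse repeated spaces before encoding and see whether anything changes:

import re

sentence1_clean = re.sub(r'\s+', ' ', sentence1).strip()
print(sentence1_clean == sentence1)  # False would mean there were extra spaces to remove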

Any help here?

Read more comments on GitHub.

Top Results From Across the Web

Error in roberta.extract_features_aligned_to_words() #1106
The problem is that we assert that the sum of the "aligned" version matches the sum of the original BPE version. Since the...

RoBERTa - Hugging Face
RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer ... Check out the from_pretrained() method to load...

error received after loading Roberta and XLM_Roberta ...
My Python code seems to work just fine with bert-base and bert-large models, so I want to understand how I might need...

RoBERTa - PyTorch
An open source machine learning framework that accelerates the path from research prototyping to production deployment.
