
Can the BART model fill <mask> with variable length?


Hi, is it possible for BART in Hugging Face to fill <mask> with a variable number of tokens?

For example,

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained(config["bart_path"])
TXT = "My dog is so <mask>."
model = BartForConditionalGeneration.from_pretrained(config["bart_path"])
input_ids = tokenizer.encode_plus(TXT, return_tensors='pt')['input_ids']
logits = model(input_ids)[0]  # LM logits, shape (batch, seq_len, vocab_size)
# position of the single <mask> token in the input
masked_index = (input_ids[0] == tokenizer.mask_token_id).nonzero().item()
probs = logits[0, masked_index].softmax(dim=0)
values, predictions = probs.topk(5)  # top-5 candidate tokens for the mask position
print(tokenizer.decode(predictions).split())

This code will output ['.', 'cute', 'sweet', 'funny', 'awesome']. Is BART able to fill <mask> with more than one word, like “cute and smart”? If so, what should I do? Is there an example?

Thank you.
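
Since BART is pretrained as a denoiser with text infilling (a single <mask> can stand in for several tokens), one way to get multi-token fills is to let the seq2seq decoder regenerate the whole sentence instead of reading the logits at the mask position. A minimal sketch, assuming the public facebook/bart-large checkpoint rather than the config["bart_path"] above:

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

TXT = "My dog is so <mask>."
input_ids = tokenizer(TXT, return_tensors="pt")["input_ids"]

# The decoder rewrites the full sentence, so the <mask> span can expand
# to any number of tokens during generation.
generated = model.generate(input_ids, num_beams=5, max_length=32, num_return_sequences=5)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))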

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 10 (3 by maintainers)

Top GitHub Comments

1 reaction
fabrahman commented, Jun 24, 2022

Hi,

I realized that while this approach works, the model doesn’t just fill the <mask> span; it can also go beyond it and change other parts of the given text. For example, with the code below:

masked_text = "Police said in the first four months of the project, they laid more than 100 charges against 10 people, in connection with the illegal towing industry. “Once we started our investigation, we found that the people involved were not only breaking the law, but they were also  <mask> said Sgt. Sean Cassidy of the Toronto Police Service. “They were breaking the laws surrounding the storage of the vehicles, the fees that they were charging and the manner in which they were charging,” he added. "

from transformers import BartTokenizer, BartForConditionalGeneration

max_length = 20  # extra tokens allowed beyond the input length
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large").cuda()
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

tokenizer.src_lang = "en_XX"  # leftover from an mBART example; has no effect on BartTokenizer
encoded_en = tokenizer(masked_text, return_tensors="pt")  # was `text`, which is undefined
for key, value in encoded_en.items():
    encoded_en[key] = value.cuda()

max_length += encoded_en.input_ids.shape[1]
generated_tokens = model.generate(
    encoded_en['input_ids'],
    forced_bos_token_id=tokenizer.bos_token_id,
    max_length=max_length,
    do_sample=True,
    top_p=0.96,
    num_return_sequences=5,
)
result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

And below is one of the generated sequences; note where the model unnecessarily changed the text (e.g., inserting “Article Continued Below” and rewriting the ending) versus the correctly filled masked span:

"Police said in the first four months of the project, they laid more than 100 charges against 10 people, in connection with the illegal towing industry. “Once we started our investigation, we found that the people involved were not only breaking the law, but they were also breaking a lot of other laws as well,” said Sgt. Sean Cassidy of the Toronto Police Service.Article Continued Below“They were breaking the laws surrounding the storage of the vehicles, the fees that they were charging and the manner in which they are charging for the services that they’re providing to the public,', "

I wonder if this is expected behavior and BART cannot effectively be used for this use case, or whether we should fine-tune a separate model to do this infilling task only?
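
If fine-tuning is not an option, a possible workaround is to splice only the generated span back into the untouched context, discarding whatever the model rewrote outside the mask. A rough sketch; splice_fill is a hypothetical helper, and it assumes the model reproduces the prefix before <mask> verbatim:

def splice_fill(masked_text, generated_text, mask_token="<mask>"):
    # Split the source text around the mask.
    prefix, suffix = masked_text.split(mask_token, 1)
    fill = generated_text
    # Strip the unchanged context. If the model drifted (as in the
    # sequence above), the suffix match fails and the drift is kept.
    if fill.startswith(prefix):
        fill = fill[len(prefix):]
    if fill.endswith(suffix):
        fill = fill[: len(fill) - len(suffix)]
    return prefix + fill + suffix

filled = splice_fill(masked_text, result[0])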

1 reaction
xsway commented, Sep 22, 2021

Just FYI, this code in the documentation doesn’t work. I’m on the latest transformers (4.10.2). The error is:

TypeError: __init__() got an unexpected keyword argument 'force_bos_token_to_be_generated'

And if I remove the keyword argument and run it, I get ['UNALSO SEE'] as the result, not the expected one.

UPDATE: I see it was already reported in https://github.com/huggingface/transformers/issues/12296. Please update the docs 😃
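
For what it’s worth, force_bos_token_to_be_generated appears to have been dropped from BartConfig in transformers 4.x; its replacement is forced_bos_token_id. A sketch of the docs example adjusted for the newer API (the exact output may vary across versions):

from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large", forced_bos_token_id=0)
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

TXT = "UN Chief Says There Is No <mask> in Syria"
batch = tokenizer(TXT, return_tensors="pt")
generated_ids = model.generate(batch["input_ids"])
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))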


