
Using the T5 model with huggingface's mask-fill pipeline

See original GitHub issue

Does anyone know if it is possible to use the T5 model with Hugging Face’s fill-mask pipeline? The snippet below shows how to do it with the default model, but I can’t seem to figure out how to do it with the T5 model specifically.

from transformers import pipeline
nlp_fill = pipeline('fill-mask')
nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)

Trying this with T5, for example, raises the error “TypeError: must be str, not NoneType”, because nlp_fill.tokenizer.mask_token is None:

nlp_fill = pipeline('fill-mask',model="t5-base", tokenizer="t5-base")
nlp_fill('Hugging Face is a French company based in ' + nlp_fill.tokenizer.mask_token)
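
T5 was pre-trained with span-corruption sentinel tokens (<extra_id_0> through <extra_id_99>) rather than a BERT-style [MASK] token, which is why the tokenizer’s mask_token is None and the fill-mask pipeline has nothing to substitute. A quick check (a minimal sketch, assuming the standard transformers T5Tokenizer API; t5_tok is just a local name):

from transformers import T5Tokenizer

# T5's tokenizer defines no mask token, only sentinel tokens such as <extra_id_0>
t5_tok = T5Tokenizer.from_pretrained('t5-base')
print(t5_tok.mask_token)                             # None (some versions also log a warning)
print(t5_tok.convert_tokens_to_ids('<extra_id_0>'))  # 32099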

Stack Overflow question

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 18 (8 by maintainers)

Top GitHub Comments

31 reactions
girishponkiya commented, May 2, 2020

Could we use the following workaround?

  • <extra_id_0> can be treated as the mask token
  • Candidate sequences for the masked span can then be generated with code like the following:
import torch
from transformers import T5Tokenizer, T5Config, T5ForConditionalGeneration

T5_PATH = 't5-base' # "t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # my environment uses CPU

t5_tokenizer = T5Tokenizer.from_pretrained(T5_PATH)
t5_config = T5Config.from_pretrained(T5_PATH)
t5_mlm = T5ForConditionalGeneration.from_pretrained(T5_PATH, config=t5_config).to(DEVICE)

# Input text
text = 'India is a <extra_id_0> of the world. </s>'

encoded = t5_tokenizer.encode_plus(text, add_special_tokens=True, return_tensors='pt')
input_ids = encoded['input_ids'].to(DEVICE)

# Generating 20 candidate sequences with maximum length set to 5
outputs = t5_mlm.generate(input_ids=input_ids, 
                          num_beams=200, num_return_sequences=20,
                          max_length=5)

_0_index = text.index('<extra_id_0>')
_result_prefix = text[:_0_index]
_result_suffix = text[_0_index+12:]  # 12 is the length of <extra_id_0>

def _filter(output, end_token='<extra_id_1>'):
    # The first token is the decoder start token <pad> (id 0) and the second is <extra_id_0> (id 32099)
    _txt = t5_tokenizer.decode(output[2:], skip_special_tokens=False, clean_up_tokenization_spaces=False)
    if end_token in _txt:
        _end_token_index = _txt.index(end_token)
        return _result_prefix + _txt[:_end_token_index] + _result_suffix
    else:
        return _result_prefix + _txt + _result_suffix

results = list(map(_filter, outputs))
results

Output:

['India is a cornerstone of the world. </s>',
 'India is a part of the world. </s>',
 'India is a huge part of the world. </s>',
 'India is a big part of the world. </s>',
 'India is a beautiful part of the world. </s>',
 'India is a very important part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a unique part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a beautiful country in of the world. </s>',
 'India is a part of the of the world. </s>',
 'India is a small part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a country in the of the world. </s>',
 'India is a large part of the world. </s>',
 'India is a part of the world. </s>',
 'India is a significant part of the world. </s>',
 'India is a part of the world. </s>']
3 reactions
klimentij commented, May 13, 2020

@girishponkiya Thanks for your example! Unfortunately, I can’t reproduce your results. I get

['India is a _0> of the world. </s>',
 'India is a  ⁇ extra of the world. </s>',
 'India is a India is  of the world. </s>',
 'India is a  ⁇ extra_ of the world. </s>',
 'India is a a  of the world. </s>',
 'India is a [extra_ of the world. </s>',
 'India is a India is an of the world. </s>',
 'India is a of the world of the world. </s>',
 'India is a India. of the world. </s>',
 'India is a is a of the world. </s>',
 'India is a India  ⁇  of the world. </s>',
 'India is a Inde is  of the world. </s>',
 'India is a ] of the of the world. </s>',
 'India is a . of the world. </s>',
 'India is a _0 of the world. </s>',
 'India is a is  ⁇  of the world. </s>',
 'India is a india is  of the world. </s>',
 'India is a India is the of the world. </s>',
 'India is a -0> of the world. </s>',
 'India is a  ⁇ _ of the world. </s>']

Tried on CPU, GPU, ‘t5-base’ and ‘t5-3b’ — same thing.
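
One possible cause of garbled output like this is that the position of the sentinel tokens in the generated ids differs across transformers versions, so the hard-coded output[2:] slice cuts the sequence in the wrong place. A more defensive variant is to look up the sentinel token ids and decode only the span between them. This is a sketch, not a tested fix: it reuses t5_tokenizer, t5_mlm and DEVICE from the snippet above, and the helper name fill_sentinel is made up for illustration.

def fill_sentinel(text, tokenizer, model, device, num_beams=50, max_length=8):
    # Locate the sentinel ids instead of slicing the output at fixed positions
    start_id = tokenizer.convert_tokens_to_ids('<extra_id_0>')
    end_id = tokenizer.convert_tokens_to_ids('<extra_id_1>')
    input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
    output = model.generate(input_ids=input_ids, num_beams=num_beams, max_length=max_length)[0].tolist()
    # Keep only the ids strictly between <extra_id_0> and <extra_id_1>
    if start_id in output:
        output = output[output.index(start_id) + 1:]
    if end_id in output:
        output = output[:output.index(end_id)]
    filled = tokenizer.decode(output, skip_special_tokens=True).strip()
    return text.replace('<extra_id_0>', filled)

fill_sentinel('India is a <extra_id_0> of the world.', t5_tokenizer, t5_mlm, DEVICE)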


Top Results From Across the Web

T5 - Hugging Face
T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format. It is trained using teacher forcing. This means that...

Using the T5 model with huggingface's mask-fill pipeline
Does anyone know if it is possible to use the T5 model with hugging face's mask-fill pipeline? The below is how you can...

Deploy T5 11B for inference for less than $500 - philschmid
This blog will teach you how to deploy T5 11B for inference using Hugging Face Inference Endpoints. The T5 model was presented in...

Fine-Tuning T5 for Question Answering using HuggingFace ...
Prepare for the Machine Learning interview: https://mlexpert.io Subscribe: http://bit.ly/venelin-subscribe Get SH*T Done with PyTorch ...

Abstractive Summarization with Hugging Face Transformers
If you are using one of the five T5 checkpoints we have to prefix the inputs with "summarize:" (the model can also translate...
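
As an illustration of the task-prefix convention mentioned in the last result, here is a minimal summarization sketch; it assumes only the standard transformers generate API, and the article text is made up:

from transformers import T5Tokenizer, T5ForConditionalGeneration

# The original T5 checkpoints select the task through a text prefix such as "summarize:"
tokenizer = T5Tokenizer.from_pretrained('t5-small')
model = T5ForConditionalGeneration.from_pretrained('t5-small')

article = ("Hugging Face maintains the Transformers library, which provides "
           "pretrained models for many natural language processing tasks.")
input_ids = tokenizer.encode('summarize: ' + article, return_tensors='pt')
summary_ids = model.generate(input_ids, num_beams=4, max_length=30, early_stopping=True)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))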
