Use fill-mask pipeline to get probability of specific token
Hi, I am trying to use the fill-mask pipeline:
nlp_fm = pipeline('fill-mask')
nlp_fm('Hugging Face is a French company based in <mask>')
And get the output:
[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',
'score': 0.23106734454631805,
'token': 2201},
{'sequence': '<s> Hugging Face is a French company based in Lyon</s>',
'score': 0.08198195695877075,
'token': 12790},
{'sequence': '<s> Hugging Face is a French company based in Geneva</s>',
'score': 0.04769458621740341,
'token': 11559},
{'sequence': '<s> Hugging Face is a French company based in Brussels</s>',
'score': 0.04762236401438713,
'token': 6497},
{'sequence': '<s> Hugging Face is a French company based in France</s>',
'score': 0.041305914521217346,
'token': 1470}]
But let’s say I want to get the score & rank of another word, such as London. Is this possible?
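(Editor's note: in transformers releases newer than this issue, the fill-mask pipeline accepts a `targets` argument that does exactly this, restricting scoring to the candidate words you pass in. A minimal sketch, assuming the default fill-mask model and that the target words exist in its vocabulary:)

```python
from transformers import pipeline

nlp_fm = pipeline("fill-mask")

# Score only the candidate words we care about; targets that are not a
# single token in the model's vocabulary are handled with a warning.
out = nlp_fm(
    "Hugging Face is a French company based in <mask>",
    targets=["Paris", "London"],
)
for r in out:
    # Each result carries the filled word and its softmax probability.
    print(r["token_str"], r["score"])
```

The results come back sorted by score, so the position of "London" in the list is its rank among the supplied targets.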
Issue Analytics
- State:
- Created 3 years ago
- Comments:6 (2 by maintainers)
Top Results From Across the Web
Source code for transformers.pipelines.fill_mask - Hugging Face
This mask filling pipeline can currently be loaded from :func:`~transformers.pipeline` using the following task identifier: :obj:`"fill-mask"`.
Read more >
How to get a probability distribution over tokens in a ...
So to get token probabilities you can use a softmax over this, i.e. probs = torch.nn.functional.softmax(last_hidden_state[mask_index]). You can ...
Read more >
How to get a probability distribution over tokens in a ... - Reddit
from transformers import pipeline # Initialize MLM pipeline mlm = pipeline('fill-mask') # Get mask token mask = mlm.tokenizer.mask_token ...
Read more >
Create a Tokenizer and Train a Huggingface RoBERTa Model ...
The special tokens depend on the model, for RoBERTa we include a shortlist ... We can use the 'fill-mask' pipeline where we input...
Read more >
HOW TO USE TRANSFORMER FOR REAL LIFE PROBLEMS ...
In this article, I have discussed some use cases of transformer, ... A special mask token with a probability of 0.8; A random...
Read more >
Read more >Top Related Medium Post
No results found
Top Related StackOverflow Question
No results found
Troubleshoot Live Code
Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start FreeTop Related Reddit Thread
No results found
Top Related Hackernoon Post
No results found
Top Related Tweet
No results found
Top Related Dev.to Post
No results found
Top Related Hashnode Post
No results found
Top GitHub Comments
Hi, the pipeline doesn’t offer such a functionality yet. You’re better off using the model directly. Here’s an example of how you would replicate the pipeline’s behavior, and get a token score at the end:
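(Editor's note: the maintainer's original snippet was lost in the scrape. Below is a sketch that replicates the pipeline's behavior with the model directly and reads off the probability and rank of a chosen token such as "London"; `distilroberta-base` is assumed as the default fill-mask model.)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "distilroberta-base"  # default model behind pipeline('fill-mask')
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

sequence = f"Hugging Face is a French company based in {tokenizer.mask_token}"
inputs = tokenizer(sequence, return_tensors="pt")

# Position of the mask token in the input sequence
mask_index = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the vocabulary at the masked position gives token probabilities
probs = logits[0, mask_index].softmax(dim=-1)

# Probability of a specific word; note the leading space, since RoBERTa's
# BPE vocabulary distinguishes " London" from "London"
token_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" London"))[0]
print(f"P(London) = {probs[0, token_id].item():.4f}")

# Rank of that token among all vocabulary entries (1 = most probable)
rank = (probs[0] > probs[0, token_id]).sum().item() + 1
print(f"rank = {rank}")
```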
Outputs:
Let me know if it helps.
@LysandreJik I also get the error:
for this code. I have torch version 1.7.1. Any idea what the problem is? Might it be version-related? If so, what changes should be made to the code? Or what version should I downgrade to?