
Use fill-mask pipeline to get probability of specific token

See original GitHub issue

Hi, I am trying to use the fill-mask pipeline:

from transformers import pipeline

nlp_fm = pipeline('fill-mask')
nlp_fm('Hugging Face is a French company based in <mask>')

And get the output:

[{'sequence': '<s> Hugging Face is a French company based in Paris</s>',
  'score': 0.23106734454631805,
  'token': 2201},
 {'sequence': '<s> Hugging Face is a French company based in Lyon</s>',
  'score': 0.08198195695877075,
  'token': 12790},
 {'sequence': '<s> Hugging Face is a French company based in Geneva</s>',
  'score': 0.04769458621740341,
  'token': 11559},
 {'sequence': '<s> Hugging Face is a French company based in Brussels</s>',
  'score': 0.04762236401438713,
  'token': 6497},
 {'sequence': '<s> Hugging Face is a French company based in France</s>',
  'score': 0.041305914521217346,
  'token': 1470}]

But let’s say I want to get the score and rank of another word, such as London. Is this possible?

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

13 reactions
LysandreJik commented, May 27, 2020

Hi, the pipeline doesn’t offer such functionality yet. You’re better off using the model directly. Here’s an example of how you would replicate the pipeline’s behavior and get a token score at the end:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelWithLMHead.from_pretrained("distilroberta-base")

sequence = f"Hugging Face is a French company based in {tokenizer.mask_token}"

input_ids = tokenizer.encode(sequence, return_tensors="pt")
# Locate the position of the mask token in the encoded sequence
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Forward pass, then turn the logits at the masked position into probabilities
token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]
mask_token_logits = torch.softmax(mask_token_logits, dim=1)

top_5 = torch.topk(mask_token_logits, 5, dim=1)
top_5_tokens = zip(top_5.indices[0].tolist(), top_5.values[0].tolist())

for token, score in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])), f"(score: {score})")

# Get the score of token_id
sought_after_token = "London"
sought_after_token_id = tokenizer.encode(sought_after_token, add_special_tokens=False, add_prefix_space=True)[0]  # 928

token_score = mask_token_logits[:, sought_after_token_id]
print(f"Score of {sought_after_token}: {token_score}")

Outputs:

Hugging Face is a French company based in  Paris (score: 0.2310674488544464)
Hugging Face is a French company based in  Lyon (score: 0.08198253810405731)
Hugging Face is a French company based in  Geneva (score: 0.04769456014037132)
Hugging Face is a French company based in  Brussels (score: 0.047622524201869965)
Hugging Face is a French company based in  France (score: 0.04130581393837929)
Score of London: tensor([0.0343], grad_fn=<SelectBackward>)

Let me know if it helps.
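The question also asked for the rank of a token such as London, which the snippet above doesn’t compute. One way to derive it from the same probabilities, reusing mask_token_logits, sought_after_token and sought_after_token_id from the code above (a minimal sketch, not part of the original comment):

# Sort all vocabulary ids by their probability at the masked position
sorted_token_ids = torch.argsort(mask_token_logits[0], descending=True)
# Position of the sought-after token id in that ordering, 1-based
rank = (sorted_token_ids == sought_after_token_id).nonzero(as_tuple=True)[0].item() + 1
print(f"Rank of {sought_after_token}: {rank}")

As a side note, newer transformers releases add a targets argument to the fill-mask pipeline (e.g. nlp_fm(sequence, targets=["London"])), which restricts the returned scores to the given words; whether a leading space is needed for the target depends on the tokenizer.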

0 reactions
orenschonlab commented, Feb 11, 2021

@LysandreJik I also get the error:

    mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
TypeError: where(): argument 'condition' (position 1) must be Tensor, not bool

for this code. I have torch version 1.7.1. Any idea what the problem is? Might it be version-related? If so, what changes should be made in the code, or what version should I downgrade to?
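The thread doesn’t show a resolution, but a likely cause (an assumption, since the full code isn’t shown) is that input_ids is a plain Python list rather than a tensor, so input_ids == bert_tokenizer.mask_token_id evaluates to a single Python bool, which torch.where rejects. That typically happens when return_tensors="pt" is left out of the encode call; it is also worth checking that bert_tokenizer.mask_token_id is not None, since a tokenizer without a mask token would also break this lookup. A minimal sketch of the tensor-based version:

from transformers import AutoTokenizer
import torch

# Assumption: a BERT tokenizer, since the variable is named bert_tokenizer in the comment above
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sequence = f"Hugging Face is a French company based in {bert_tokenizer.mask_token}."

# return_tensors="pt" makes input_ids a tensor, so the comparison below yields
# a boolean tensor that torch.where accepts (instead of a plain Python bool)
input_ids = bert_tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == bert_tokenizer.mask_token_id)[1]
print(mask_token_index)  # position of [MASK] inside input_ids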

Read more comments on GitHub >

Top Results From Across the Web

Source code for transformers.pipelines.fill_mask - Hugging Face
This mask filling pipeline can currently be loaded from :func:`~transformers.pipeline` using the following task identifier: :obj:`"fill-mask"`.
Read more >
How to get a probability distribution over tokens in a ...
So to get token probabilities you can use a softmax over this, i.e. probs = torch.nn.functional.softmax(last_hidden_state[mask_index]). You can ...
Read more >
How to get a probability distribution over tokens in a ... - Reddit
from transformers import pipeline # Initialize MLM pipeline mlm = pipeline('fill-mask') # Get mask token mask = mlm.tokenizer.mask_token ...
Read more >
Create a Tokenizer and Train a Huggingface RoBERTa Model ...
The special tokens depend on the model, for RoBERTa we include a shortlist ... We can use the 'fill-mask' pipeline where we input...
Read more >
HOW TO USE TRANSFORMER FOR REAL LIFE PROBLEMS ...
In this article, I have discussed some use cases of transformer, ... A special mask token with a probability of 0.8; A random...
Read more >
