
convert_tokens_to_string does not conform to its signature

See original GitHub issue

Environment info

  • transformers version: 4.17.0
  • Platform: macOS-11.6.4-x86_64-i386-64bit
  • Python version: 3.9.10
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): 2.7.0 (False)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: False
  • Using distributed or parallel set-up in script?: False

Who can help

@SaulLu

Information

Model I am using (Bert, XLNet …): AutoModelForQuestionAnswering

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: Question Answering
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

Using the official example script (omitted here; I will just post the result):

Question: How many pretrained models are available in 🤗 Transformers?
Answer: ['over', ' 32', ' +']
Question: What does 🤗 Transformers provide?
Answer: ['general', ' -', ' purpose', ' architecture', 's']
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch']

Using the model in our context:

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = "Hello my browser is not working, I need help."
questions = [
    "What is the issue?",
    "What is the request?",
]


def extract_answer_idxs(start_logits, end_logits):
    answer_start = torch.argmax(start_logits)
    answer_end = torch.argmax(end_logits) + 1
    return answer_start, answer_end

text = [text] * len(questions)
inputs = tokenizer(questions, text, add_special_tokens=True, return_tensors="pt", max_length=512, truncation=True)
input_ids = inputs["input_ids"].tolist()
outputs = model(**inputs)
idxs = map(
        lambda x, y: extract_answer_idxs(x, y),
        outputs.start_logits,
        outputs.end_logits,
)
answers = list(
    map(
        lambda x, y: tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(x[y[0]:y[1]])),
        input_ids,
        (idx for idx in idxs),
    )
)

print(f"Questions: {questions}")
print(f"Answers: {answers}")

Result:

Questions: ['What is the issue?', 'What is the request?']
Answers: [['my', ' browser', ' is', ' not', ' working'], ['help']]

(I also tried it in a loop and got the identical result.)
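As a stopgap while on the affected tokenizers release, the list-shaped output can be normalized back into the expected string. This is a hypothetical helper, not part of the library:

```python
def ensure_string(result):
    # With tokenizers 0.12.0, convert_tokens_to_string can return a list
    # of pieces (with leading spaces baked in) instead of a str.
    # Concatenating the pieces and stripping recovers the expected string;
    # a str input is passed through unchanged.
    if isinstance(result, list):
        return "".join(result).strip()
    return result


print(ensure_string(["my", " browser", " is", " not", " working"]))
# my browser is not working
print(ensure_string("help"))
# help
```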

Expected behavior

Questions: ['What is the issue?', 'What is the request?']
Answers: ['my browser is not working', 'help']

As the docs show, I expect a string, not a list of tokens. Notice how whitespace is somehow introduced into some of the tokens, and how some tokens are split, e.g. ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch'].

I expect convert_tokens_to_string to return a str, as it did previously.
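For reference, the expected behavior can be sketched as a plain WordPiece join, where a "##" prefix marks a subword continuation. This is a simplified illustration of what the method should return, not the transformers implementation, and it ignores punctuation spacing:

```python
def wordpiece_tokens_to_string(tokens):
    # Join WordPiece tokens into one string: a token starting with "##"
    # continues the previous word; other tokens get a separating space.
    out = ""
    for tok in tokens:
        if tok.startswith("##"):
            out += tok[2:]
        elif out:
            out += " " + tok
        else:
            out = tok
    return out


print(wordpiece_tokens_to_string(["my", "browser", "is", "not", "working"]))
# my browser is not working
```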

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

2 reactions
SaulLu commented, Apr 1, 2022

Thank you both for sharing your issues 🤗 !

You are indeed right, your problem is related to the same issue: a change in the format of the output given by the decode method of the decoder objects in the tokenizers library.

We have for the moment yanked version 0.12.0 of tokenizers and are in the process of releasing a new version, 0.12.1, which reverts this change. Using a version earlier than 0.12.0, or the upcoming 0.12.1, should solve this issue.

Sorry again for any problems this may have caused you 😊
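The suggested fix can be made explicit with a version pin. A minimal sketch of a requirements fragment, assuming a pip-style environment (the exact specifiers are an assumption; 0.12.1 is the announced fixed release):

```
# requirements.txt fragment: avoid the yanked tokenizers release
transformers==4.17.0
tokenizers!=0.12.0
```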

1 reaction
inspiralpatterns commented, Apr 1, 2022

Thank you @SaulLu 🤗
