convert_tokens_to_string does not conform to its signature
See original GitHub issue

Environment info
- transformers version: 4.17.0
- Platform: macOS-11.6.4-x86_64-i386-64bit
- Python version: 3.9.10
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): 2.7.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: False
- Using distributed or parallel set-up in script?: False
Who can help
Information
Model I am using (Bert, XLNet …): AutoModelForQuestionAnswering
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQUaD task: Question Answering
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Using the official example script (script omitted; only the results are shown):
Question: How many pretrained models are available in 🤗 Transformers?
Answer: ['over', ' 32', ' +']
Question: What does 🤗 Transformers provide?
Answer: ['general', ' -', ' purpose', ' architecture', 's']
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch']
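Note that the fragments in these lists look like the expected answer string pre-split, with each piece already carrying its leading space: concatenating them with an empty separator recovers the intended output. A minimal check in pure Python, using the list printed above:

```python
# The list of pieces returned instead of a plain str; joining them with ""
# reconstructs the string the example used to print.
pieces = ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch']
answer = "".join(pieces)
print(answer)  # tensorflow 2. 0 and pytorch
```

This suggests the regression is in how the decoded string is assembled, not in the tokenization itself.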
Using the model in our context:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = "Hello my browser is not working, I need help."
questions = [
"What is the issue?",
"What is the request?",
]
def extract_answer_idxs(start_logits, end_logits):
    answer_start = torch.argmax(start_logits)
    answer_end = torch.argmax(end_logits) + 1
    return answer_start, answer_end
text = [text] * len(questions)
inputs = tokenizer(questions, text, add_special_tokens=True, return_tensors="pt", max_length=512, truncation=True)
input_ids = inputs["input_ids"].tolist()
outputs = model(**inputs)
idxs = map(
    lambda x, y: extract_answer_idxs(x, y),
    outputs.start_logits,
    outputs.end_logits,
)
answers = list(
    map(
        lambda x, y: tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(x[y[0]:y[1]])),
        input_ids,
        idxs,
    )
)
print(f"Questions: {questions}")
print(f"Answers: {answers}")
Result:
Questions: ['What is the issue?', 'What is the request?']
Answers: [['my', ' browser', ' is', ' not', ' working'], ['help']]
(I also tried it in a loop and got the exact same result.)
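As a stopgap until the underlying regression is fixed, the returned list can be collapsed back into a string, since each fragment already carries its leading whitespace. A hedged sketch (`to_string` is a hypothetical helper, not a transformers API):

```python
def to_string(pieces):
    # If convert_tokens_to_string already returned a str, pass it through;
    # otherwise join the fragments, which carry their own leading spaces.
    if isinstance(pieces, str):
        return pieces
    return "".join(pieces)

print(to_string(['my', ' browser', ' is', ' not', ' working']))  # my browser is not working
print(to_string('help'))  # help
```

Wrapping the `convert_tokens_to_string` call site with this helper keeps the script working on both the broken and the fixed library versions.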
Expected behavior
Questions: ['What is the issue?', 'What is the request?']
Answers: ['my browser is not working', 'help']
As the docs show, I expect a string, not a list of tokens.
Please notice how whitespace is somehow introduced into some of the tokens.
Furthermore, some tokens are split, e.g. ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch'].
I expect convert_tokens_to_string to return a str, as it did previously.
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (4 by maintainers)
Thank you both for sharing your issues 🤗!
You are indeed right, your problem is related to the same issue: a change in the format of the output given by the decode method of the decoders objects of the tokenizers library. For the moment we have yanked version 0.12.0 of tokenizers and are in the process of releasing a new version, 0.12.1, which reverts this change. Using a version prior to 0.12.0 or the upcoming 0.12.1 should solve this issue.
Sorry again for any problems this may have caused you 😊
Thank you @SaulLu 🤗