convert_tokens_to_string does not conform to its signature
See original GitHub issue

Environment info
- transformers version: 4.17.0
- Platform: macOS-11.6.4-x86_64-i386-64bit
- Python version: 3.9.10
- PyTorch version (GPU?): 1.11.0 (False)
- Tensorflow version (GPU?): 2.7.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: False
- Using distributed or parallel set-up in script?: False
Who can help
Information
Model I am using (Bert, XLNet …): AutoModelForQuestionAnswering
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The tasks I am working on is:
- an official GLUE/SQUaD task: Question Answering
- my own task or dataset: (give details below)
To reproduce
Steps to reproduce the behavior:
Using the official example script (script omitted; only the results are shown):
Question: How many pretrained models are available in 🤗 Transformers?
Answer: ['over', ' 32', ' +']
Question: What does 🤗 Transformers provide?
Answer: ['general', ' -', ' purpose', ' architecture', 's']
Question: 🤗 Transformers provides interoperability between which frameworks?
Answer: ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch']
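Note that the fragments in these lists look like the expected answer string pre-split, with each piece already carrying its leading space: concatenating them with an empty separator recovers the intended output. A minimal check in pure Python, using the list printed above:

```python
# The list of pieces returned instead of a plain str; joining them with ""
# reconstructs the string the example used to print.
pieces = ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch']
answer = "".join(pieces)
print(answer)  # tensorflow 2. 0 and pytorch
```

This suggests the regression is in how the decoded string is assembled, not in the tokenization itself.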
Using the model in our context:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = "Hello my browser is not working, I need help."
questions = [
"What is the issue?",
"What is the request?",
]
def extract_answer_idxs(start_logits, end_logits):
    answer_start = torch.argmax(start_logits)
    answer_end = torch.argmax(end_logits) + 1
    return answer_start, answer_end
text = [text] * len(questions)
inputs = tokenizer(questions, text, add_special_tokens=True, return_tensors="pt", max_length=512, truncation=True)
input_ids = inputs["input_ids"].tolist()
outputs = model(**inputs)
idxs = map(
    lambda x, y: extract_answer_idxs(x, y),
    outputs.start_logits,
    outputs.end_logits,
)
answers = list(
    map(
        lambda x, y: tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(x[y[0]:y[1]])),
        input_ids,
        idxs,
    )
)
print(f"Questions: {questions}")
print(f"Answers: {answers}")
Result:
Questions: ['What is the issue?', 'What is the request?']
Answers: [['my', ' browser', ' is', ' not', ' working'], ['help']]
(I also tried it in a loop and got the exact same result.)
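As a stopgap until the underlying regression is fixed, the returned list can be collapsed back into a string, since each fragment already carries its leading whitespace. A hedged sketch (`to_string` is a hypothetical helper, not a transformers API):

```python
def to_string(pieces):
    # If convert_tokens_to_string already returned a str, pass it through;
    # otherwise join the fragments, which carry their own leading spaces.
    if isinstance(pieces, str):
        return pieces
    return "".join(pieces)

print(to_string(['my', ' browser', ' is', ' not', ' working']))  # my browser is not working
print(to_string('help'))  # help
```

Wrapping the `convert_tokens_to_string` call site with this helper keeps the script working on both the broken and the fixed library versions.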
Expected behavior
Questions: ['What is the issue?', 'What is the request?']
Answers: ['my browser is not working', 'help']
As the docs show, I expect a string, not a list of tokens.
Please notice how whitespace is somehow introduced into some of the tokens.
Furthermore, some tokens are split, e.g. ['tensor', 'flow', ' 2', '.', ' 0', ' and', ' p', 'yt', 'or', 'ch'].
I expect convert_tokens_to_string to return a str, as it did previously.
Issue Analytics
- State:
- Created a year ago
- Comments: 7 (4 by maintainers)
Thank you both for sharing your issues 🤗!
You are indeed right, your problem is related to the same issue: a change in the format of the output given by the decode method of the decoders objects of the tokenizers library. For the moment we have yanked version 0.12.0 of tokenizers and are in the process of releasing a new version, 0.12.1, which reverts this change. Using a version prior to 0.12.0 or the upcoming 0.12.1 should solve this issue.
Sorry again for any problems this may have caused you 😊
Thank you @SaulLu 🤗