Tokenizer Fast bug: ValueError: TextInputSequence must be str
See original GitHub issue
Environment info
- transformers version:
- Platform: a Colab environment as well as my local Windows machine
- Python version: 3.7.4
- PyTorch version (GPU?): yes and no
- Tensorflow version (GPU?): I didn't try with TensorFlow, but I suspect it has nothing to do with it
- Using GPU in script?: I used the auto-modeling classes on a GPU session in Colab
- Using distributed or parallel set-up in script?: no
Who can help
Information
Model I am using: initially Electra, but I also tested with BERT, DistilBERT and RoBERTa.
I'm using your scripts, but again, I believe it would fail the same way if I wrote the code myself. The model is trained on SQuAD.
Error traceback
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 165, in squad_convert_example_to_features
return_token_type_ids=True,
File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2050, in encode_plus
**kwargs,
File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 473, in _encode_plus
**kwargs,
File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 376, in _batch_encode_plus
is_pretokenized=is_split_into_words,
File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 212, in encode
return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
ValueError: TextInputSequence must be str
"""
To reproduce
Steps to reproduce the behavior:
- Download model and tokenizer (fast)
- Test it out with the transformers pipeline for a question answering task
I've also made a small notebook so you can test it out for yourself: here
Expected behavior
Instead of raising an error, I would expect the tokenizer to just work.
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)
Hi, thanks for opening such a detailed issue with a notebook!
Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will in the second pipeline version which is expected in a few weeks to a few months, but right now please use the slow tokenizers for the QA pipeline.
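The suggested workaround can be sketched as follows. This is a minimal example, assuming the `distilbert-base-cased-distilled-squad` checkpoint (any SQuAD-trained model should behave the same); the key part is `use_fast=False`, which loads the slow (pure-Python) tokenizer instead of the fast (Rust) one:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Assumed checkpoint for illustration; substitute your own SQuAD-trained model.
model_name = "distilbert-base-cased-distilled-squad"

# use_fast=False loads the slow tokenizer, which the QA pipeline currently
# supports; the fast tokenizer raises "TextInputSequence must be str" here.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa(question="Where do I live?",
            context="My name is Sarah and I live in London.")
print(result["answer"])
```

Once the second pipeline version mentioned above ships, the `use_fast=False` argument should no longer be necessary.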
Thanks!
I also ran into this problem when using transformers. I checked my data and found that this error is returned when the csv file contains null values or zero-length strings. After filtering out those rows, my code ran successfully.
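The cleanup described above can be sketched like this. The helper name `filter_valid_rows` and the column names are hypothetical; adapt them to your dataset. The idea is simply to drop any row whose text fields are not non-empty strings before handing the data to the tokenizer:

```python
def filter_valid_rows(rows, text_fields=("question", "context")):
    """Keep only rows whose text fields are non-empty strings.

    Rows containing None (null cells) or empty/whitespace-only strings
    are the ones that trigger 'TextInputSequence must be str' in the
    fast tokenizer, so they are dropped here.
    """
    clean = []
    for row in rows:
        values = [row.get(field) for field in text_fields]
        if all(isinstance(v, str) and v.strip() for v in values):
            clean.append(row)
    return clean

rows = [
    {"question": "Where do I live?", "context": "I live in London."},
    {"question": None, "context": "Null question."},   # dropped
    {"question": "", "context": "Empty question."},    # dropped
]
print(len(filter_valid_rows(rows)))  # 1
```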