
Tokenizer Fast bug: ValueError: TextInputSequence must be str

See original GitHub issue

Environment info

  • transformers version:
  • Platform: a Colab environment as well as my local Windows installation
  • Python version: 3.7.4
  • PyTorch version (GPU?): Yes and No
  • Tensorflow version (GPU?): I didn’t try with TensorFlow, but I suspect it’s unrelated
  • Using GPU in script?: Yes; I used the AutoModel classes in a GPU session in Colab
  • Using distributed or parallel set-up in script?: Nope

Who can help

@mfuntowicz

Information

Model I am using: initially ELECTRA, but I tested it with BERT, DistilBERT, and RoBERTa as well

It’s using your scripts, but again, I believe it would fail the same way if I wrote the code myself. The model is trained on SQuAD.

Error traceback

"""
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 165, in squad_convert_example_to_features
    return_token_type_ids=True,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2050, in encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 473, in _encode_plus
    **kwargs,
  File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 376, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
  File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 212, in encode
    return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
ValueError: TextInputSequence must be str
"""

To reproduce

Steps to reproduce the behavior:

  1. Download model and tokenizer (fast)
  2. Test it out with the transformers pipeline for a question answering task

I’ve also made a small notebook to test it out for yourself (link in the original issue).
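
For readers without the notebook, here is a minimal sketch of the failing setup; the checkpoint name distilbert-base-cased-distilled-squad is my stand-in assumption, not the exact model from the issue:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "distilbert-base-cased-distilled-squad"
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)  # fast tokenizer triggers the bug

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

# On the transformers version from this issue (v3.x), this raises:
# ValueError: TextInputSequence must be str
qa(question="What is the model trained on?", context="The model is trained on SQuAD.")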

Expected behavior

Instead of raising an error, I would expect the tokenizer to just work.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:5 (2 by maintainers)

Top GitHub Comments

3 reactions
LysandreJik commented, Oct 13, 2020

Hi, thanks for opening such a detailed issue with a notebook!

Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will be supported in the second version of the pipelines, expected in a few weeks to a few months; until then, please use the slow tokenizers for the QA pipeline (see the sketch just below).

Thanks!
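
In other words, the workaround at the time was to load the slow (pure-Python) tokenizer explicitly. A minimal sketch, again using distilbert-base-cased-distilled-squad as a stand-in checkpoint:

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)  # slow tokenizer: no ValueError
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
print(qa(question="What is the model trained on?", context="The model is trained on SQuAD."))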

1 reaction
ICSE-2023 commented, May 13, 2022

I also ran into this problem when using transformers. I checked my data and found that this error is raised if the CSV file contains Null values or strings of length 0. After filtering out those rows, my code ran successfully.
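
A sketch of that cleaning step, assuming the text lives in a pandas DataFrame; the file name "data.csv" and the column name "text" are hypothetical placeholders:

import pandas as pd

df = pd.read_csv("data.csv")
df = df.dropna(subset=["text"])           # drop rows where the text is Null/NaN
df = df[df["text"].str.len() > 0]         # drop rows whose text is an empty string
texts = df["text"].astype(str).tolist()   # everything handed to the tokenizer is now a str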


