Tokenizer Fast bug: ValueError: TextInputSequence must be str
See original GitHub issue
Environment info
- transformers version:
- Platform: a Colab environment as well as my local Windows machine
- Python version: 3.7.4
- PyTorch version (GPU?): yes and no
- Tensorflow version (GPU?): I didn't try with TensorFlow, but I suspect it has nothing to do with it
- Using GPU in script?: I used the auto-modeling classes on a GPU session in Colab
- Using distributed or parallel set-up in script?: no
Who can help
Information
Model I am using: initially Electra, but I also tested with BERT, DistilBERT and RoBERTa.
I'm using your scripts, but again, I believe it would fail the same way if I wrote the code myself. The model is trained on SQuAD.
Error traceback
"""
Traceback (most recent call last):
File "/usr/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/usr/local/lib/python3.6/dist-packages/transformers/data/processors/squad.py", line 165, in squad_convert_example_to_features
return_token_type_ids=True,
File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py", line 2050, in encode_plus
**kwargs,
File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 473, in _encode_plus
**kwargs,
File "/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_fast.py", line 376, in _batch_encode_plus
is_pretokenized=is_split_into_words,
File "/usr/local/lib/python3.6/dist-packages/tokenizers/implementations/base_tokenizer.py", line 212, in encode
return self._tokenizer.encode(sequence, pair, is_pretokenized, add_special_tokens)
ValueError: TextInputSequence must be str
"""
To reproduce
Steps to reproduce the behavior:
- Download model and tokenizer (fast)
- Test it out with the transformers pipeline for a question answering task
I've also made a small notebook so you can test it out for yourself: here
Expected behavior
Instead of raising an error, I would expect the tokenizer to just work.
Issue Analytics
- State:
- Created 3 years ago
- Comments:5 (2 by maintainers)
Hi, thanks for opening such a detailed issue with a notebook!
Unfortunately, fast tokenizers don’t currently work with the QA pipeline. They will in the second pipeline version which is expected in a few weeks to a few months, but right now please use the slow tokenizers for the QA pipeline.
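The suggested workaround can be sketched as follows. This is a minimal example, assuming the `distilbert-base-cased-distilled-squad` checkpoint (any SQuAD-trained model should behave the same); the key part is `use_fast=False`, which loads the slow (pure-Python) tokenizer instead of the fast (Rust) one:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Assumed checkpoint for illustration; substitute your own SQuAD-trained model.
model_name = "distilbert-base-cased-distilled-squad"

# use_fast=False loads the slow tokenizer, which the QA pipeline currently
# supports; the fast tokenizer raises "TextInputSequence must be str" here.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)

qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
result = qa(question="Where do I live?",
            context="My name is Sarah and I live in London.")
print(result["answer"])
```

Once the second pipeline version mentioned above ships, the `use_fast=False` argument should no longer be necessary.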
Thanks!
I also ran into this problem when using transformers. I checked my data and found that this error is returned when the csv file contains null values or zero-length strings. After filtering out those rows, my code ran successfully.
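The cleanup described above can be sketched like this. The helper name `filter_valid_rows` and the column names are hypothetical; adapt them to your dataset. The idea is simply to drop any row whose text fields are not non-empty strings before handing the data to the tokenizer:

```python
def filter_valid_rows(rows, text_fields=("question", "context")):
    """Keep only rows whose text fields are non-empty strings.

    Rows containing None (null cells) or empty/whitespace-only strings
    are the ones that trigger 'TextInputSequence must be str' in the
    fast tokenizer, so they are dropped here.
    """
    clean = []
    for row in rows:
        values = [row.get(field) for field in text_fields]
        if all(isinstance(v, str) and v.strip() for v in values):
            clean.append(row)
    return clean

rows = [
    {"question": "Where do I live?", "context": "I live in London."},
    {"question": None, "context": "Null question."},   # dropped
    {"question": "", "context": "Empty question."},    # dropped
]
print(len(filter_valid_rows(rows)))  # 1
```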