AutoTokenizer from pretrained BERT throws TypeError when encoding certain input
Environment info
- transformers version: 4.3.2
- Platform: Arch Linux
- Python version: 3.9.1
- PyTorch version (GPU?): 1.7.1, no
- Tensorflow version (GPU?): Not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Guess from git blame: @LysandreJik, @thomwolf, @n1t0
Information
Model I am using (Bert, XLNet …): BERT
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
When I use a pretrained BERT tokenizer, encode throws a TypeError on a single-element list input or on a list input containing ø/æ/å.
I discovered this when using the pretrained Maltehb/danish-bert-botxo, which fails in the way shown below on any input containing Danish characters (ø/æ/å), but I then realized that it also happens with the standard bert-base-uncased, as shown below.
Steps to reproduce the behavior:
- Run these lines:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode(["hello", "world"]) # <--- This works
tokenizer.encode(["hello"]) # <--- This throws the below shown stack trace
tokenizer.encode(["dette", "er", "en", "sø"]) # <--- This throws the same error
Stack trace
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-ef056deb5f59> in <module>
----> 1 tokenizer.encode(["hello"])
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
2102 ``convert_tokens_to_ids`` method).
2103 """
-> 2104 encoded_inputs = self.encode_plus(
2105 text,
2106 text_pair=text_pair,
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2418 )
2419
-> 2420 return self._encode_plus(
2421 text=text,
2422 text_pair=text_pair,
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
453
454 batched_input = [(text, text_pair)] if text_pair else [text]
--> 455 batched_output = self._batch_encode_plus(
456 batched_input,
457 is_split_into_words=is_split_into_words,
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
380 )
381
--> 382 encodings = self._tokenizer.encode_batch(
383 batch_text_or_text_pairs,
384 add_special_tokens=add_special_tokens,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Expected behavior
I expect the tokenizer not to throw a TypeError when the input types are the same. I also expected the tokenization to produce token IDs.
This issue is caused by the above. I am grateful for the software, and thank you in advance for the help!
Top GitHub Comments
I believe the encode method never accepted batches as inputs. We introduced encode_plus and batch_encode_plus down the road, the latter being the first to handle batching. While these two methods are deprecated, they're still tested and working, and they're used under the hood when calling __call__.
What is happening here is that v3.5.1 is treating your input as individual words (but by all means it shouldn't, as the is_split_into_words argument is False by default) rather than as different batches; I was mistaken in my first analysis. Something did change between v3.5.1 and v4.0.0, and all the breaking changes are documented in the migration guide.
If you want to get back to the previous behavior, you have two ways of handling it.
The first is that AutoTokenizer returns a fast tokenizer (implemented in Rust) by default rather than the Python-based tokenizer. You can change that behavior with the following:
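A minimal sketch of that first option, reusing the bert-base-uncased checkpoint from the reproduction above; use_fast=False is the from_pretrained argument that selects the Python-based (slow) tokenizer:

from transformers import AutoTokenizer

# use_fast=False selects the Python-based (slow) tokenizer instead of the Rust-backed fast one
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
slow_tokenizer.encode(["hello"])                    # no longer raises TypeError
slow_tokenizer.encode(["dette", "er", "en", "sø"])  # the list is treated as pre-split words, not a batch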
The second is to use the is_split_into_words parameter: you're passing it a list of words rather than a sequence of words. That it worked in previous versions seems like a bug to me; here's how you would handle it now (works with a fast tokenizer):
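And a minimal sketch of the second option, keeping the default fast tokenizer and passing is_split_into_words=True so the list is read as one sequence that has already been split into words:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# is_split_into_words=True marks the list as a single pre-split sequence, not a batch
tokenizer.encode(["hello"], is_split_into_words=True)
tokenizer.encode(["dette", "er", "en", "sø"], is_split_into_words=True)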
Hello! Thank you for opening an issue with a reproducible example, it helps a lot.
The issue here is that you're using the encode method to encode a batch, which it can't do. encode only encodes single sequences; it can accept a "batch" of two because it processes them as two independent sequences that should be joined together, for example for text classification, where you would want to classify the relationship between two sequences (tasks like Next Sentence Prediction from BERT or Sentence Ordering Prediction from ALBERT).
The method you're looking for is the __call__ method of the tokenizer, which handles exactly all the use cases you've mentioned and is the recommended API for tokenizers; a sketch follows below. Here is the documentation for that method, hope that helps!
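A minimal sketch of that recommended usage, again with the bert-base-uncased checkpoint from the reproduction above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer("hello")                         # a single sequence
tokenizer("dette er en sø")                # plain text containing ø/æ/å is fine
tokenizer(["hello", "world"])              # a batch of two independent sequences
tokenizer("hello", "world")                # a pair of sequences (e.g. NSP-style tasks)
tokenizer(["dette", "er", "en", "sø"], is_split_into_words=True)  # one pre-split sequence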