AutoTokenizer from pretrained BERT throws TypeError when encoding certain input
Environment info
- transformers version: 4.3.2
- Platform: Arch Linux
- Python version: 3.9.1
- PyTorch version (GPU?): 1.7.1, no
- Tensorflow version (GPU?): Not installed
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No
Who can help
Guess from git blame: @LysandreJik, @thomwolf, @n1t0
Information
Model I am using (Bert, XLNet …): BERT
The problem arises when using:
- the official example scripts: (give details below)
- my own modified scripts: (give details below)
The task I am working on is:
- an official GLUE/SQuAD task: (give the name)
- my own task or dataset: (give details below)
To reproduce
When I use a pretrained BERT tokenizer, encode throws a TypeError on a single-element list input or on a list input containing ø/æ/å.
I discovered this when using the pretrained Maltehb/danish-bert-botxo, which fails in the way shown below on any input containing Danish characters (ø/æ/å), but I then realized that it also happens with the standard bert-base-uncased, as shown below.
Steps to reproduce the behavior:
- Run these lines:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode(["hello", "world"]) # <--- This works
tokenizer.encode(["hello"]) # <--- This throws the below shown stack trace
tokenizer.encode(["dette", "er", "en", "sø"]) # <--- This throws the same error
Stack trace
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-13-ef056deb5f59> in <module>
----> 1 tokenizer.encode(["hello"])
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
2102 ``convert_tokens_to_ids`` method).
2103 """
-> 2104 encoded_inputs = self.encode_plus(
2105 text,
2106 text_pair=text_pair,
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2418 )
2419
-> 2420 return self._encode_plus(
2421 text=text,
2422 text_pair=text_pair,
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
453
454 batched_input = [(text, text_pair)] if text_pair else [text]
--> 455 batched_output = self._batch_encode_plus(
456 batched_input,
457 is_split_into_words=is_split_into_words,
~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
380 )
381
--> 382 encodings = self._tokenizer.encode_batch(
383 batch_text_or_text_pairs,
384 add_special_tokens=add_special_tokens,
TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]
Expected behavior
I expect the tokenizer not to throw a TypeError when the input types are the same. I also expected the tokenization to produce token IDs.
This issue is caused by the above. I am grateful for the software, and thank you in advance for the help!
Top GitHub Comments
I believe the encode method never accepted batches as inputs. We introduced encode_plus and batch_encode_plus down the road, the latter being the first to handle batching. While these two methods are deprecated, they're still tested and working, and they're used under the hood when calling __call__.
What is happening here is that v3.5.1 is treating your input as individual words (but by all means it shouldn't, as the is_split_into_words argument is False by default) rather than as different batches; I was mistaken in my first analysis. Something did change between v3.5.1 and v4.0.0, and all the breaking changes are documented in the migration guide.
If you want to get back to the previous behavior, you have two ways of handling it.
The first is that AutoTokenizer returns a fast tokenizer (implemented in Rust) by default rather than the Python-based tokenizer. You can change that behavior with the following:
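A minimal sketch of that first option, reusing the bert-base-uncased checkpoint from the reproduction above; use_fast=False is the from_pretrained argument that selects the Python-based (slow) tokenizer:

from transformers import AutoTokenizer

# use_fast=False selects the Python-based (slow) tokenizer instead of the Rust-backed fast one
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
slow_tokenizer.encode(["hello"])                    # no longer raises TypeError
slow_tokenizer.encode(["dette", "er", "en", "sø"])  # the list is treated as pre-split words, not a batch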
The second is to use the is_split_into_words parameter: you're passing it a list of words rather than a sequence of words. That it worked in previous versions seems like a bug to me; here's how you would handle it now (works with a fast tokenizer):
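And a minimal sketch of the second option, keeping the default fast tokenizer and passing is_split_into_words=True so the list is read as one sequence that has already been split into words:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# is_split_into_words=True marks the list as a single pre-split sequence, not a batch
tokenizer.encode(["hello"], is_split_into_words=True)
tokenizer.encode(["dette", "er", "en", "sø"], is_split_into_words=True)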
Hello! Thank you for opening an issue with a reproducible example, it helps a lot.
The issue here is that you're using the encode method to encode a batch, which it can't do. encode only encodes single sequences; it can accept a "batch" of two because it processes them as two independent sequences that should be joined together, for example for text classification, where you would want to classify the relationship between two sequences (tasks like Next Sentence Prediction from BERT or Sentence Ordering Prediction from ALBERT).
The method you're looking for is the __call__ method of the tokenizer, which handles exactly all the use cases you've mentioned and is the recommended API for tokenizers; a sketch follows below. Here is the documentation for that method, hope that helps!
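A minimal sketch of that recommended usage, again with the bert-base-uncased checkpoint from the reproduction above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer("hello")                         # a single sequence
tokenizer("dette er en sø")                # plain text containing ø/æ/å is fine
tokenizer(["hello", "world"])              # a batch of two independent sequences
tokenizer("hello", "world")                # a pair of sequences (e.g. NSP-style tasks)
tokenizer(["dette", "er", "en", "sø"], is_split_into_words=True)  # one pre-split sequence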