
AutoTokenizer from pretrained BERT throws TypeError when encoding certain input

See original GitHub issue

Environment info

  • transformers version: 4.3.2
  • Platform: Arch Linux
  • Python version: 3.9.1
  • PyTorch version (GPU?): 1.7.1, no
  • Tensorflow version (GPU?): Not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

Guess from git blame: @LysandreJik, @thomwolf, @n1t0

Information

Model I am using (Bert, XLNet …): BERT

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

When I use a pretrained BERT tokenizer, it throws a TypeError on single-element list input, or on list input containing ø/æ/å.

I discovered it when using the pretrained Maltehb/danish-bert-botxo, which fails on any input containing the Danish characters ø/æ/å, but I then realized that it also happens with the standard bert-base-uncased, as shown below.

Steps to reproduce the behavior:

  1. Run these lines:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode(["hello", "world"])                          # <--- This works
tokenizer.encode(["hello"])                                   # <--- This throws the below shown stack trace
tokenizer.encode(["dette", "er", "en", "sø"])                 # <--- This throws the same error

Stack trace

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-13-ef056deb5f59> in <module>
----> 1 tokenizer.encode(["hello"])

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, return_tensors, **kwargs)
   2102                 ``convert_tokens_to_ids`` method).
   2103         """
-> 2104         encoded_inputs = self.encode_plus(
   2105             text,
   2106             text_pair=text_pair,

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2418         )
   2419 
-> 2420         return self._encode_plus(
   2421             text=text,
   2422             text_pair=text_pair,

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
    453 
    454         batched_input = [(text, text_pair)] if text_pair else [text]
--> 455         batched_output = self._batch_encode_plus(
    456             batched_input,
    457             is_split_into_words=is_split_into_words,

~/.venv/lib/python3.9/site-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    380         )
    381 
--> 382         encodings = self._tokenizer.encode_batch(
    383             batch_text_or_text_pairs,
    384             add_special_tokens=add_special_tokens,

TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]]

Expected behavior

I expect the tokenizer not to throw a TypeError when the input types are the same. I also expected the tokenization to produce token ids.

Whatever this issue is caused by, I am grateful for the software, and thank you in advance for the help!


Top GitHub Comments

LysandreJik commented on Feb 22, 2021 (8 reactions)

I believe the encode method never accepted batches as inputs. We introduced encode_plus and batch_encode_plus down the road, the latter being the first to handle batching.

While these two methods are deprecated, they’re still tested and working, and they’re used under the hood when calling __call__.
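
To make the distinction concrete, here is a small sketch of the different entry points (illustrative only, not taken from the original thread):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.encode("hello world")                    # single sequence -> list of token ids
tokenizer.encode_plus("hello world")               # single sequence -> dict with ids and masks (deprecated)
tokenizer.batch_encode_plus(["hello", "world"])    # batch of sequences -> dict of lists (deprecated)
tokenizer(["hello", "world"])                      # __call__: recommended, handles single inputs and batches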

What is happening here is that v3.5.1 was treating your input as individual words (though by all means it shouldn't, as the is_split_into_words argument is False by default), rather than as different batches; I was mistaken in my first analysis. Something did change between v3.5.1 and v4.0.0; all the breaking changes are documented in the migration guide.

If you want to get back to the previous behavior, you have two ways of handling it:

  • Specify that you don’t want a fast tokenizer. The main change affecting you here is that AutoTokenizer now returns a fast (Rust-based) tokenizer by default rather than the Python-based one. You can change that behavior with the following:
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
  • The behavior you’re relying on here is the is_split_into_words parameter: you’re passing it a list of words rather than a sequence of words. That it worked in previous versions seems like a bug to me; here’s how you would handle it now (works with a fast tokenizer; a combined sketch of both options follows this list):
tokenizer(["hello", "world"], is_split_into_words=True)
tokenizer(["hello"], is_split_into_words=True)
tokenizer(["dette", "er", "en", "sø"], is_split_into_words=True)
LysandreJik commented on Feb 22, 2021 (8 reactions)

Hello! Thank you for opening an issue with a reproducible example, it helps a lot.

The issue here is that you’re using the encode method to encode a batch, which it can’t do. encode only encodes single sequences; it can accept a “batch” of two because it processes them as two independent sequences that should be joined together, for example for text classification where you want to classify the relationship between two sequences (tasks like Next Sentence Prediction from BERT or Sentence Order Prediction from ALBERT).
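
For illustration (my own sketch, not from the thread), this is what a two-element “batch” actually does: the two elements are treated as a sentence pair and joined with separator tokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer.encode("hello", "world")    # text and text_pair, as in NSP-style tasks
print(tokenizer.decode(ids))                # "[CLS] hello [SEP] world [SEP]"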

The method you’re looking for is the tokenizer’s __call__ method, which handles all the use cases you’ve mentioned and is the recommended API for tokenizers:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer(["hello", "world"])                          # <--- This works
tokenizer(["hello"])                                   # <--- This works too :)
tokenizer(["dette", "er", "en", "sø"])                 # <--- This works as well!

Here is the documentation for that method, hope that helps!
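
For completeness, a short sketch (mine, assuming the standard v4 API) of the same call on a real batch, with padding and tensor output:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["hello world", "dette er en sø"], padding=True, return_tensors="pt")
print(batch["input_ids"].shape)      # (2, length of the longest sequence in the batch)
print(batch["attention_mask"])       # 1 for real tokens, 0 for padding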
