Tokenizer not working

See original GitHub issue

Environment info

  • transformers version: 4.3.2
  • Platform: Ubuntu 16.04.6 LTS
  • Python version: 3.8.8

To reproduce:

conda create --name=env1 python=3.8 jupyter transformers tokenizers -y -c conda-forge
conda activate env1
~$ python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False, strip_accents=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 395, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/models/bert/tokenization_bert_fast.py", line 199, in __init__
    self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)
TypeError: PyBertNormalizer.__new__() got an unexpected keyword argument: do_lower_case

The weirdest thing is that running the exact same command in a Jupyter notebook does not raise any error. Also, AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False) works on its own, so the problem seems to be related to strip_accents.
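A possible workaround, sketched under the assumption that the failure is specific to how the fast (Rust-backed) tokenizer forwards normalizer options, and not verified against this exact conda-forge environment:

from transformers import AutoTokenizer
from tokenizers.normalizers import BertNormalizer

# Option 1: the "slow" pure-Python tokenizer handles do_lower_case and
# strip_accents itself instead of forwarding them to the Rust normalizer.
slow_tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-cased", do_lower_case=False, strip_accents=False, use_fast=False
)

# Option 2: load the fast tokenizer with default options, then replace its
# normalizer directly through the tokenizers API.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
fast_tokenizer.backend_tokenizer.normalizer = BertNormalizer(
    lowercase=False, strip_accents=False
)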

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

3 reactions
LysandreJik commented, Mar 10, 2021

Re-opening this as the issue isn’t solved.

1 reaction
LysandreJik commented, Feb 25, 2021

Hi! We do not maintain the conda-forge versions of transformers and tokenizers. We maintain the versions that are on the huggingface channel.

I just tried with the huggingface channel and I get no such errors:

conda create --name=env1 python=3.8 jupyter transformers tokenizers -y -c huggingface && conda activate env1

See the result:

~ (🌟) 🤗 python                                                                     (env1) 10:00:33 ~
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False, strip_accents=False)
Downloading: 100%|█████████████████████████████████████████████████| 213k/213k [00:00<00:00, 2.71MB/s]
Downloading: 100%|█████████████████████████████████████████████████| 436k/436k [00:00<00:00, 4.67MB/s]
Ignored unknown kwargs option do_lower_case
>>>
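Note the "Ignored unknown kwargs option do_lower_case" line above: here the fast tokenizer silently drops the option rather than raising. One way to check which normalizer settings actually took effect is to dump the backend tokenizer's serialized state (a sketch using tokenizers' to_str() method; the exact JSON layout may vary between versions):

import json

config = json.loads(tokenizer.backend_tokenizer.to_str())
print(config["normalizer"])
# e.g. {"type": "BertNormalizer", "lowercase": false, "strip_accents": false, ...}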