Tokenizer not working

See original GitHub issue

Environment info

  • transformers version: 4.3.2
  • Platform: Ubuntu 16.04.6 LTS
  • Python version: 3.8.8

To reproduce:

conda create --name=env1 python=3.8 jupyter transformers tokenizers -y -c conda-forge
conda activate env1
~$ python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False, strip_accents=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 395, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/models/bert/tokenization_bert_fast.py", line 199, in __init__
    self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)
TypeError: PyBertNormalizer.__new__() got an unexpected keyword argument: do_lower_case

The weirdest thing is that running the exact same command in a Jupyter notebook does not raise any error. Also, AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False) works on its own, so the problem seems to be related to strip_accents.
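A possible workaround, sketched under the assumption that the failure is specific to how the fast (Rust-backed) tokenizer forwards normalizer options, and not verified against this exact conda-forge environment:

from transformers import AutoTokenizer
from tokenizers.normalizers import BertNormalizer

# Option 1: the "slow" pure-Python tokenizer handles do_lower_case and
# strip_accents itself instead of forwarding them to the Rust normalizer.
slow_tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-cased", do_lower_case=False, strip_accents=False, use_fast=False
)

# Option 2: load the fast tokenizer with default options, then replace its
# normalizer directly through the tokenizers API.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
fast_tokenizer.backend_tokenizer.normalizer = BertNormalizer(
    lowercase=False, strip_accents=False
)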

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

3 reactions
LysandreJik commented, Mar 10, 2021

Re-opening this as the issue isn’t solved.

1 reaction
LysandreJik commented, Feb 25, 2021

Hi! We do not maintain the conda-forge versions of transformers and tokenizers. We maintain the versions that are on the huggingface channel.

I just tried with the huggingface channel and I get no such errors:

conda create --name=env1 python=3.8 jupyter transformers tokenizers -y -c huggingface && conda activate env1

See the result:

~ (🌟) 🤗 python                                                                     (env1) 10:00:33 ~
Python 3.8.5 (default, Sep  4 2020, 07:30:14)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False, strip_accents=False)
Downloading: 100%|█████████████████████████████████████████████████| 213k/213k [00:00<00:00, 2.71MB/s]
Downloading: 100%|█████████████████████████████████████████████████| 436k/436k [00:00<00:00, 4.67MB/s]
Ignored unknown kwargs option do_lower_case
>>>
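Note the "Ignored unknown kwargs option do_lower_case" line above: here the fast tokenizer silently drops the option rather than raising. One way to check which normalizer settings actually took effect is to dump the backend tokenizer's serialized state (a sketch using tokenizers' to_str() method; the exact JSON layout may vary between versions):

import json

config = json.loads(tokenizer.backend_tokenizer.to_str())
print(config["normalizer"])
# e.g. {"type": "BertNormalizer", "lowercase": false, "strip_accents": false, ...}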