Tokenizer not working
See original GitHub issue.

Environment info
- `transformers` version: 4.3.2
- Platform: Ubuntu 16.04.6 LTS
- Python version: 3.8.8
Who can help
- tokenizers: @n1t0, @LysandreJik
Information
To reproduce:
```
$ conda create --name=env1 python=3.8 jupyter transformers tokenizers -y -c conda-forge
$ conda activate env1
$ python
Python 3.8.8 | packaged by conda-forge | (default, Feb 20 2021, 16:22:27)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoTokenizer
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False, strip_accents=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 395, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1860, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/guillem.garcia/.conda/envs/cosas/lib/python3.8/site-packages/transformers/models/bert/tokenization_bert_fast.py", line 199, in __init__
    self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)
TypeError: PyBertNormalizer.__new__() got an unexpected keyword argument: do_lower_case
```
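The traceback shows `transformers` rebuilding the backend normalizer with keyword names the installed `tokenizers` build does not accept, which usually points to the two conda-forge packages being out of step with each other. A quick sanity check (a minimal sketch; both packages expose `__version__`):

```python
import tokenizers
import transformers

# transformers 4.3.2 expects a compatible tokenizers release; if these two
# versions are out of step, the unexpected-keyword TypeError above follows.
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)
```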
The weirdest thing is that running that exact command in a Jupyter notebook does not raise any error. Also, `AutoTokenizer.from_pretrained("bert-base-cased", do_lower_case=False)` works, so it seems to be something related to `strip_accents`.
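Until the packaging is fixed, one possible workaround (a sketch, not an official fix) is to load the tokenizer without the keyword arguments that trigger the failing normalizer rebuild, then set the normalizer on the Rust backend directly; note that `tokenizers.normalizers.BertNormalizer` takes `lowercase` rather than `do_lower_case`:

```python
from transformers import AutoTokenizer
from tokenizers import normalizers

# Loading without do_lower_case/strip_accents avoids the code path that
# rebuilds the normalizer with mismatched keyword names.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Replace the backend normalizer with the desired settings.
tokenizer.backend_tokenizer.normalizer = normalizers.BertNormalizer(
    lowercase=False,
    strip_accents=False,
)
```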
Top GitHub Comments
Re-opening this as the issue isn't solved.
Hi! We do not maintain the conda-forge versions of transformers and tokenizers. We maintain the versions that are on the `huggingface` channel. I just tried with the `huggingface` channel and I get no such errors. See the result:
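For reference, a minimal sketch of installing both packages from the maintained channel instead of conda-forge (the exact command is an assumption based on the channel name mentioned above; version pins omitted):

```
conda install -c huggingface transformers tokenizers
```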