RAG Tokenizer erroring out
Environment info
- `transformers` version: 3.3.1
- Platform: Linux-5.4.0-48-generic-x86_64-with-debian-buster-sid
- Python version: 3.7.9
- PyTorch version (GPU?): 1.6.0 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
Information
Hi, I am trying to get RAG running, but I am getting an error when I follow the instructions here: https://huggingface.co/facebook/rag-token-nq
Specifically, the error message is as follows:
```
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-35cd6a2213c0> in <module>
1 from transformers import AutoTokenizer, AutoModelWithLMHead
2
----> 3 tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")
~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
258 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
259 else:
--> 260 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
261
262 raise ValueError(
~/src/transformers/src/transformers/tokenization_rag.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
61 print(config.generator)
62 print("***")
---> 63 generator = AutoTokenizer.from_pretrained(generator_path, config=config.generator)
64 return cls(question_encoder=question_encoder, generator=generator)
65
~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
258 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
259 else:
--> 260 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
261
262 raise ValueError(
~/src/transformers/src/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1557
1558 return cls._from_pretrained(
-> 1559 resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
1560 )
1561
~/src/transformers/src/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
1648
1649 # Add supplementary tokens.
-> 1650 special_tokens = tokenizer.all_special_tokens
1651 if added_tokens_file is not None:
1652 with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:
~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens(self)
1026 Convert tokens of :obj:`tokenizers.AddedToken` type to string.
1027 """
-> 1028 all_toks = [str(s) for s in self.all_special_tokens_extended]
1029 return all_toks
1030
~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens_extended(self)
1046 logger.info(all_toks)
1047 print(all_toks)
-> 1048 all_toks = list(OrderedDict.fromkeys(all_toks))
1049 return all_toks
1050
TypeError: unhashable type: 'dict'
```
The `all_toks` variable looks as follows. It is a list of dictionaries, and `OrderedDict.fromkeys` does not accept dicts because they are unhashable.
```
[{'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<unk>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<pad>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True}, {'content': '<mask>', 'single_word': False, 'lstrip': True, 'rstrip': False, 'normalized': True}]
```
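To see the failure in isolation (a minimal sketch, independent of transformers): `OrderedDict.fromkeys` deduplicates by using each element as a dictionary key, so every element must be hashable, and plain dicts are not. Deduplicating on the hashable 'content' string, by contrast, works:

```python
from collections import OrderedDict

# Plain dicts are unhashable, so they cannot be used as keys.
toks = [{"content": "<s>"}, {"content": "</s>"}, {"content": "<s>"}]
try:
    OrderedDict.fromkeys(toks)
except TypeError as e:
    print(e)  # unhashable type: 'dict'

# Deduplicating on the hashable 'content' string works instead.
print(list(OrderedDict.fromkeys(t["content"] for t in toks)))
# ['<s>', '</s>']
```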
I will be digging deeper, hoping that I am making an obvious mistake.
To reproduce
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")
```
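For context, the fuller snippet on the model card uses the dedicated RAG classes; note that `RagTokenizer.from_pretrained` goes through the same code path shown in the traceback, so on 3.3.1 it fails the same way. A sketch of the model-card loading code:

```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

# Same underlying tokenizer-loading path as AutoTokenizer above.
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")

# use_dummy_dataset=True loads a small dummy index instead of the
# full wiki_dpr index, which is convenient for a quick smoke test.
retriever = RagRetriever.from_pretrained(
    "facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True
)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)
```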
Expected behavior
It should load the tokenizer!
Thank you.
Should be solved now - let me know if you still experience problems @dzorlu
Thank you!
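For anyone landing here later, a quick sanity check (a sketch, assuming an install that includes the fix, e.g. from source): the property that raised the TypeError above should now return plain strings for both sub-tokenizers.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

# RagTokenizer wraps a DPR question-encoder tokenizer and a BART
# generator tokenizer; both should expose string special tokens now.
print(tokenizer.question_encoder.all_special_tokens)  # e.g. ['[SEP]', '[PAD]', ...]
print(tokenizer.generator.all_special_tokens)         # e.g. ['<s>', '</s>', ...]
```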