RAG Tokenizer erroring out


Environment info

  • transformers version: 3.3.1
  • Platform: Linux-5.4.0-48-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

@ola13 @mfuntowicz

Information

Hi, I am trying to get RAG running; however, I get an error when I follow the instructions here: https://huggingface.co/facebook/rag-token-nq

Particularly, the error message is as follows:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-35cd6a2213c0> in <module>
      1 from transformers import AutoTokenizer, AutoModelWithLMHead
      2 
----> 3 tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    258                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    259             else:
--> 260                 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    261 
    262         raise ValueError(

~/src/transformers/src/transformers/tokenization_rag.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
     61         print(config.generator)
     62         print("***")
---> 63         generator = AutoTokenizer.from_pretrained(generator_path, config=config.generator)
     64         return cls(question_encoder=question_encoder, generator=generator)
     65 

~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    258                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    259             else:
--> 260                 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    261 
    262         raise ValueError(

~/src/transformers/src/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1557 
   1558         return cls._from_pretrained(
-> 1559             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1560         )
   1561 

~/src/transformers/src/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1648 
   1649         # Add supplementary tokens.
-> 1650         special_tokens = tokenizer.all_special_tokens
   1651         if added_tokens_file is not None:
   1652             with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:

~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens(self)
   1026         Convert tokens of :obj:`tokenizers.AddedToken` type to string.
   1027         """
-> 1028         all_toks = [str(s) for s in self.all_special_tokens_extended]
   1029         return all_toks
   1030 

~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens_extended(self)
   1046         logger.info(all_toks)
   1047         print(all_toks)
-> 1048         all_toks = list(OrderedDict.fromkeys(all_toks))
   1049         return all_toks
   1050 

TypeError: unhashable type: 'dict'

The all_toks variable looks as follows. It is a list of dictionaries, and since dicts are unhashable, OrderedDict.fromkeys cannot accept them as keys.

[{'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<unk>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<pad>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<mask>', 'single_word': False, 'lstrip': True, 'rstrip': False, 'normalized': True}]
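The failure is easy to reproduce in isolation: OrderedDict.fromkeys requires hashable keys, and plain dicts are not hashable, so the serialized-token form above blows up exactly as in the traceback.

```python
from collections import OrderedDict

# Plain string tokens deduplicate fine:
toks = ["<s>", "</s>", "<s>"]
print(list(OrderedDict.fromkeys(toks)))  # ['<s>', '</s>']

# But dicts (the serialized AddedToken form shown above) are unhashable:
toks = [{"content": "<s>"}, {"content": "</s>"}, {"content": "<s>"}]
try:
    OrderedDict.fromkeys(toks)
except TypeError as e:
    print(e)  # unhashable type: 'dict'
```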

I will keep digging, hoping that I am making an obvious mistake.
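For what it's worth, converting each entry to its token text before deduplicating sidesteps the error. This is only a sketch of the idea (not the actual upstream fix): it assumes the dict entries carry the token string under the 'content' key, as in the dump above.

```python
from collections import OrderedDict

# Serialized special tokens as they appear in the traceback (trimmed fields):
all_toks = [
    {"content": "<s>", "lstrip": False},
    {"content": "</s>", "lstrip": False},
    {"content": "</s>", "lstrip": False},
    {"content": "<mask>", "lstrip": True},
]

# Deduplicate on the token text instead of the (unhashable) dict itself:
unique = list(OrderedDict.fromkeys(
    t["content"] if isinstance(t, dict) else str(t) for t in all_toks
))
print(unique)  # ['<s>', '</s>', '<mask>']
```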

To reproduce

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

Expected behavior

It should load the tokenizer!

Thank you.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

patrickvonplaten commented on Oct 13, 2020 (3 reactions)

Should be solved now - let me know if you still experience problems @dzorlu

dzorlu commented on Oct 14, 2020 (0 reactions)

Thank you!
