RAG Tokenizer erroring out


Environment info

  • transformers version: 3.3.1
  • Platform: Linux-5.4.0-48-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.9
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: <fill in>

@ola13 @mfuntowicz

Information

Hi, I am trying to get RAG running; however, I get an error when I follow the instructions here: https://huggingface.co/facebook/rag-token-nq

Particularly, the error message is as follows:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-35cd6a2213c0> in <module>
      1 from transformers import AutoTokenizer, AutoModelWithLMHead
      2 
----> 3 tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    258                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    259             else:
--> 260                 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    261 
    262         raise ValueError(

~/src/transformers/src/transformers/tokenization_rag.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
     61         print(config.generator)
     62         print("***")
---> 63         generator = AutoTokenizer.from_pretrained(generator_path, config=config.generator)
     64         return cls(question_encoder=question_encoder, generator=generator)
     65 

~/src/transformers/src/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
    258                 return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    259             else:
--> 260                 return tokenizer_class_py.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
    261 
    262         raise ValueError(

~/src/transformers/src/transformers/tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1557 
   1558         return cls._from_pretrained(
-> 1559             resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs
   1560         )
   1561 

~/src/transformers/src/transformers/tokenization_utils_base.py in _from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, *init_inputs, **kwargs)
   1648 
   1649         # Add supplementary tokens.
-> 1650         special_tokens = tokenizer.all_special_tokens
   1651         if added_tokens_file is not None:
   1652             with open(added_tokens_file, encoding="utf-8") as added_tokens_handle:

~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens(self)
   1026         Convert tokens of :obj:`tokenizers.AddedToken` type to string.
   1027         """
-> 1028         all_toks = [str(s) for s in self.all_special_tokens_extended]
   1029         return all_toks
   1030 

~/src/transformers/src/transformers/tokenization_utils_base.py in all_special_tokens_extended(self)
   1046         logger.info(all_toks)
   1047         print(all_toks)
-> 1048         all_toks = list(OrderedDict.fromkeys(all_toks))
   1049         return all_toks
   1050 

TypeError: unhashable type: 'dict'

The all_toks variable looks as follows. It is a list of dictionaries, and since dicts are unhashable, OrderedDict.fromkeys cannot accept them as keys.

[{'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<unk>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '</s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<pad>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<s>', 'single_word': False, 'lstrip': False, 'rstrip': False, 'normalized': True},
 {'content': '<mask>', 'single_word': False, 'lstrip': True, 'rstrip': False, 'normalized': True}]
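The failure is easy to reproduce in isolation: OrderedDict.fromkeys requires hashable keys, and plain dicts are not hashable, so the serialized-token form above blows up exactly as in the traceback.

```python
from collections import OrderedDict

# Plain string tokens deduplicate fine:
toks = ["<s>", "</s>", "<s>"]
print(list(OrderedDict.fromkeys(toks)))  # ['<s>', '</s>']

# But dicts (the serialized AddedToken form shown above) are unhashable:
toks = [{"content": "<s>"}, {"content": "</s>"}, {"content": "<s>"}]
try:
    OrderedDict.fromkeys(toks)
except TypeError as e:
    print(e)  # unhashable type: 'dict'
```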

I will keep digging, hoping that I am making an obvious mistake.
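For what it's worth, converting each entry to its token text before deduplicating sidesteps the error. This is only a sketch of the idea (not the actual upstream fix): it assumes the dict entries carry the token string under the 'content' key, as in the dump above.

```python
from collections import OrderedDict

# Serialized special tokens as they appear in the traceback (trimmed fields):
all_toks = [
    {"content": "<s>", "lstrip": False},
    {"content": "</s>", "lstrip": False},
    {"content": "</s>", "lstrip": False},
    {"content": "<mask>", "lstrip": True},
]

# Deduplicate on the token text instead of the (unhashable) dict itself:
unique = list(OrderedDict.fromkeys(
    t["content"] if isinstance(t, dict) else str(t) for t in all_toks
))
print(unique)  # ['<s>', '</s>', '<mask>']
```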

To reproduce

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-token-nq")

Expected behavior

It should load the tokenizer!

Thank you.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

patrickvonplaten commented on Oct 13, 2020 (3 reactions)

Should be solved now - let me know if you still experience problems @dzorlu

dzorlu commented on Oct 14, 2020 (0 reactions)

Thank you!
