
Improve handling of special tokens in Dictionary

See original GitHub issue

https://github.com/pytorch/fairseq/blob/eb68afca0208a040d4e91eceae86f5f22ca24b04/fairseq/data/dictionary.py#L178-L190

When loading a dict.txt that already contains special tokens such as <s> or <pad> (which sentencepiece adds by default), these tokens appear twice in the fairseq dictionary: once added by Dictionary.__init__(), and a second time read from the dict.txt file by Dictionary.add_from_file(). This causes weird behaviours, e.g. when using the model in https://github.com/huggingface/transformers.

Ideally, Dictionary would not add the special tokens manually when loading an external dict.txt that already contains them (such as the ones used in https://github.com/huggingface/transformers). But I am afraid that this could break backward compatibility for people who already trained models with this “duplicated special tokens bug”.

For instance:

>>> print([fairseq_model.task.dictionary[i] for i in range(15)])
['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the', ...]
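
The duplication can be reproduced without fairseq. The sketch below mirrors the logic of Dictionary.__init__() and Dictionary.add_from_file() at the commit linked above; the ToyDictionary class and the dict.txt contents are made up for illustration and are not fairseq's actual code:

class ToyDictionary:
    def __init__(self):
        # like fairseq's Dictionary, __init__ always registers the
        # four special tokens first (indices 0-3)
        self.symbols, self.indices = [], {}
        for tok in ("<s>", "<pad>", "</s>", "<unk>"):
            self.indices[tok] = len(self.symbols)
            self.symbols.append(tok)

    def add_from_file(self, lines):
        # add_from_file appends every word unconditionally, so special
        # tokens already present in dict.txt are appended a second time,
        # and self.indices is overwritten to point at the later index
        for line in lines:
            word, _count = line.rstrip().rsplit(" ", 1)
            self.indices[word] = len(self.symbols)
            self.symbols.append(word)

# a dict.txt written by sentencepiece typically starts with the special
# tokens, one "<symbol> <count>" pair per line
dict_txt = ["<unk> 0", "<s> 0", "</s> 0", ", 999", "▁the 998"]

d = ToyDictionary()
d.add_from_file(dict_txt)
print(d.symbols)
# ['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the']
print(d.indices["<s>"])  # 5: string lookup now hits the duplicate

In fairseq, string lookups then resolve to the duplicated entries, while the hard-coded indices returned by Dictionary.bos(), pad(), eos() and unk() still point at the originals.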

In the fill_mask() method of RoBERTa, this is what happens:

>>> tokens = self.task.source_dictionary.encode_line(
...     '<s> ' + text_spans_bpe,
...     append_eos=True,
...     add_if_not_exist=False,
... )
>>> print(tokens)
tensor([[    5,  1285, 32004,     2]])

The first token, 5, is the <s> that was passed in as a string and therefore matched against the duplicated entry from dict.txt, while the last token, 2, corresponds to dictionary.eos().
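
A possible direction for the fix suggested above (a sketch against the ToyDictionary from the earlier snippet, under the assumption that silently skipping already-registered symbols is acceptable; this is not fairseq's actual patch) is to ignore entries that the constructor has already added:

def add_from_file_dedup(d, lines):
    # hypothetical variant of add_from_file: words that are already
    # registered (e.g. the special tokens added by __init__) are
    # skipped, so they keep their original indices
    for line in lines:
        word, _count = line.rstrip().rsplit(" ", 1)
        if word in d.indices:
            continue
        d.indices[word] = len(d.symbols)
        d.symbols.append(word)

As noted above, any dedup-on-load behaviour like this would have to be opt-in (for example via a flag, or a header line marking new dict.txt files), so that checkpoints trained with the duplicated entries keep their original token-to-index mapping.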

Issue Analytics

  • State: open
  • Created: 4 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

1 reaction
louismartin commented, Dec 6, 2019

Cross-referencing related bugs in HuggingFace Transformers: https://github.com/huggingface/transformers/pull/2065

0 reactions
stale[bot] commented, Jun 28, 2021

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, “bump”), and we’ll keep it open. We are sorry that we haven’t been able to prioritize it yet. If you have any new additional information, please include it with your comment!

Read more comments on GitHub >

