Improve handling of special tokens in Dictionary
When loading a dict.txt that already contains special tokens such as <s> or <pad> (which are added by default by sentencepiece), these tokens appear twice in the fairseq dictionary: they are added once in Dictionary.__init__() and a second time from the dict.txt file in Dictionary.add_from_file().
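A quick way to confirm whether a loaded dictionary is affected is to scan its symbol list for repeated entries. This is only a diagnostic sketch for fairseq versions that exhibit the behaviour described above (the dict.txt path is a placeholder), not part of fairseq itself:

from collections import Counter
from fairseq.data import Dictionary

# Placeholder path to a sentencepiece-produced dict.txt that lists the special tokens.
d = Dictionary.load("dict.txt")
duplicates = [sym for sym, n in Counter(d.symbols).items() if n > 1]
print(duplicates)  # with the bug present, the special tokens show up here, e.g. ['<s>', '</s>', '<unk>']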
This causes weird behaviour, e.g. when using the model in https://github.com/huggingface/transformers. Ideally, Dictionary would not add the special tokens manually when loading an external dict.txt that already contains them (such as in https://github.com/huggingface/transformers), but I am afraid that this could break backward compatibility for people who have already trained models with this "duplicated special tokens" bug.
For instance:
>> print([fairseq_model.task.dictionary[i] for i in range(15)])
['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the', ...]
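For models trained from scratch, one possible workaround is to filter the special-token lines out of dict.txt before fairseq reads it, so that the copies added by Dictionary.__init__() remain the only ones. This is only a sketch; strip_special_tokens is a hypothetical helper, not a fairseq function, and because it shifts every token index it is not safe for checkpoints already trained against the duplicated dictionary:

def strip_special_tokens(in_path, out_path, specials=("<s>", "<pad>", "</s>", "<unk>")):
    # Copy dict.txt, dropping lines whose symbol is a special token that
    # Dictionary.__init__() will add on its own anyway.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            symbol = line.rsplit(" ", 1)[0]
            if symbol not in specials:
                fout.write(line)

strip_special_tokens("dict.txt", "dict.filtered.txt")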
In the fill_mask() method for RoBERTa, this is what happens:
>> tokens = self.task.source_dictionary.encode_line(
       '<s> ' + text_spans_bpe,
       append_eos=True,
       add_if_not_exist=False,
   )
>> print(tokens)
tensor([[ 5, 1285, 32004, 2]])
Here the first token, 5, is the <s> that was added as a string and matched to the duplicate entry coming from dict.txt, while the last token, 2, corresponds to dictionary.eos().
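One way to avoid hitting the duplicated entry, sketched below in the same fill_mask() context as the snippet above (so self.task.source_dictionary and text_spans_bpe are assumed to be available), is to prepend the dictionary's canonical bos index instead of the literal '<s> ' string:

import torch

tokens = self.task.source_dictionary.encode_line(
    text_spans_bpe,  # no '<s> ' prefix here
    append_eos=True,
    add_if_not_exist=False,
)
bos = torch.tensor([self.task.source_dictionary.bos()], dtype=tokens.dtype)
tokens = torch.cat([bos, tokens])
# The first id is now dictionary.bos() (0 in a default fairseq Dictionary), mirroring how the
# trailing 2 above comes from dictionary.eos(), rather than the duplicate index picked up from dict.txt.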
Issue Analytics
- Created 4 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Cross-referencing related bugs in HuggingFace Transformers: https://github.com/huggingface/transformers/pull/2065
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, “bump”), and we’ll keep it open. We are sorry that we haven’t been able to prioritize it yet. If you have any new additional information, please include it with your comment!