Improve handling of special tokens in Dictionary
When loading a dict.txt that already contains special tokens such as <s> or <pad> (which are added by default by sentencepiece), these tokens appear twice in the fairseq dictionary: they are added once in Dictionary.__init__() and a second time from the dict.txt file in Dictionary.add_from_file().
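A quick way to confirm whether a loaded dictionary is affected is to scan its symbol list for repeated entries. This is only a diagnostic sketch for fairseq versions that exhibit the behaviour described above (the dict.txt path is a placeholder), not part of fairseq itself:

from collections import Counter
from fairseq.data import Dictionary

# Placeholder path to a sentencepiece-produced dict.txt that lists the special tokens.
d = Dictionary.load("dict.txt")
duplicates = [sym for sym, n in Counter(d.symbols).items() if n > 1]
print(duplicates)  # with the bug present, the special tokens show up here, e.g. ['<s>', '</s>', '<unk>']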
This causes weird behaviour, e.g. when using the model in https://github.com/huggingface/transformers. Ideally, Dictionary would not add the special tokens manually when loading an external dict.txt that already contains them (such as in https://github.com/huggingface/transformers), but I am afraid that this could break backward compatibility for people who have already trained models with this "duplicated special tokens" bug.
For instance:
>> print([fairseq_model.task.dictionary[i] for i in range(15)])
['<s>', '<pad>', '</s>', '<unk>', '<unk>', '<s>', '</s>', ',', '▁the', ...]
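For models trained from scratch, one possible workaround is to filter the special-token lines out of dict.txt before fairseq reads it, so that the copies added by Dictionary.__init__() remain the only ones. This is only a sketch; strip_special_tokens is a hypothetical helper, not a fairseq function, and because it shifts every token index it is not safe for checkpoints already trained against the duplicated dictionary:

def strip_special_tokens(in_path, out_path, specials=("<s>", "<pad>", "</s>", "<unk>")):
    # Copy dict.txt, dropping lines whose symbol is a special token that
    # Dictionary.__init__() will add on its own anyway.
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            symbol = line.rsplit(" ", 1)[0]
            if symbol not in specials:
                fout.write(line)

strip_special_tokens("dict.txt", "dict.filtered.txt")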
In the fill_mask() method for RoBERTa, this is what happens:
>> tokens = self.task.source_dictionary.encode_line(
       '<s> ' + text_spans_bpe,
       append_eos=True,
       add_if_not_exist=False,
   )
>> print(tokens)
tensor([[ 5, 1285, 32004, 2]])
Here the first token, 5, is the <s> that was added as a string and matched to the duplicate entry coming from dict.txt, while the last token, 2, corresponds to dictionary.eos().
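One way to avoid hitting the duplicated entry, sketched below in the same fill_mask() context as the snippet above (so self.task.source_dictionary and text_spans_bpe are assumed to be available), is to prepend the dictionary's canonical bos index instead of the literal '<s> ' string:

import torch

tokens = self.task.source_dictionary.encode_line(
    text_spans_bpe,  # no '<s> ' prefix here
    append_eos=True,
    add_if_not_exist=False,
)
bos = torch.tensor([self.task.source_dictionary.bos()], dtype=tokens.dtype)
tokens = torch.cat([bos, tokens])
# The first id is now dictionary.bos() (0 in a default fairseq Dictionary), mirroring how the
# trailing 2 above comes from dictionary.eos(), rather than the duplicate index picked up from dict.txt.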
Issue Analytics
- Created 4 years ago
- Comments: 5 (3 by maintainers)
Top GitHub Comments
Cross-referencing related bugs in HuggingFace Transformers: https://github.com/huggingface/transformers/pull/2065
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, “bump”), and we’ll keep it open. We are sorry that we haven’t been able to prioritize it yet. If you have any new additional information, please include it with your comment!