Slow BERT Tokenizer adds UNK when calling tokenize()
Hi! I’ve run into an inconsistency between the base tokenizer docstring and the slow BERT tokenizer. Specifically, when calling tokenizer.tokenize(), the [UNK] token is inserted for unknown tokens, even though the docstring says that such tokens should be left unchanged. Here’s how I’m calling the tokenizer:
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained(
    save_dir, do_lower_case=False, strip_accents=False, tokenize_chinese_chars=True
)
sentence = "RINDIRIZZA Ġwann Marija Vianney"
print(tokenizer.tokenize(sentence))
and the output is
['RI', '##ND', '##IR', '##I', '##Z', '##ZA', '[UNK]', 'Marija', 'Via', '##nne', '##y']
(notice the [UNK] in the middle).
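For reference, the substitution seems to happen at the wordpiece stage: the slow tokenizer first splits the text into words, then wordpiece-tokenizes each word, and any word it cannot decompose into in-vocabulary pieces is replaced wholesale by the unk_token. A quick check along these lines (just a sketch poking at the slow tokenizer’s internal basic_tokenizer and wordpiece_tokenizer attributes) appears to confirm this:

# Stage 1: the basic tokenizer only splits the text into words.
words = tokenizer.basic_tokenizer.tokenize(sentence, never_split=tokenizer.all_special_tokens)
print(words)  # 'Ġwann' survives this stage intact

# Stage 2: the wordpiece tokenizer replaces any word it cannot break into
# in-vocabulary pieces with the unk_token.
print("Ġwann" in tokenizer.vocab)                       # expected: False
print(tokenizer.wordpiece_tokenizer.tokenize("Ġwann"))  # expected: ['[UNK]']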
So, it seems that this particular slow tokenizer isn’t following the docstring. Is this expected?
If not, is there a way to prevent replacement of unknown tokens? I wanted to use the slow BERT tokenizer over the fast one for exactly this reason, and it’d be great if there’s a way to make this work.
I’m using transformers v4.0.1, but it looks like this docstring hasn’t changed between master and 4.0.1.
Thanks!

Sure – would it just involve fixing the docstrings that say this in the Python code and then building the docs as specified here? Or is there more to it?

Actually just fixing the docstrings and committing! As soon as we merge the PR, the documentation of the master branch will be updated.
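In case it helps anyone who still needs the original words: since it is the docstring that is being fixed rather than the tokenizer’s behaviour, a possible workaround is to run the two tokenization stages by hand and fall back to the raw word whenever the wordpiece stage would emit only [UNK]. This is just a sketch built on the slow tokenizer’s internals; tokenize_keep_unknown is a hypothetical helper, not part of the transformers API:

def tokenize_keep_unknown(tokenizer, text):
    # Split into words the same way the slow tokenizer does internally.
    tokens = []
    for word in tokenizer.basic_tokenizer.tokenize(
        text, never_split=tokenizer.all_special_tokens
    ):
        pieces = tokenizer.wordpiece_tokenizer.tokenize(word)
        # Keep the original out-of-vocabulary word instead of [UNK].
        tokens.extend([word] if pieces == [tokenizer.unk_token] else pieces)
    return tokens

print(tokenize_keep_unknown(tokenizer, sentence))
# expected: ['RI', '##ND', '##IR', '##I', '##Z', '##ZA', 'Ġwann', 'Marija', 'Via', '##nne', '##y']

Note that the kept words obviously cannot be mapped to input ids, so this only helps if you need the token strings themselves rather than model inputs.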