
Slow BERT Tokenizer adds UNK when calling tokenize()

See original GitHub issue

Hi! I’ve run into an inconsistency between the base tokenizer docstring and the slow BERT tokenizer. Specifically, when calling tokenizer.tokenize(), the [UNK] token is inserted for unknown tokens, even though the docstring says that such tokens should be left unchanged. Here’s how I’m calling the tokenizer:

from transformers import BertTokenizer

# save_dir points to a locally saved BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(
    save_dir, do_lower_case=False, strip_accents=False, tokenize_chinese_chars=True
)

sentence = "RINDIRIZZA Ġwann Marija Vianney"
print(tokenizer.tokenize(sentence))

and the output is

['RI', '##ND', '##IR', '##I', '##Z', '##ZA', '[UNK]', 'Marija', 'Via', '##nne', '##y']

(notice the [UNK] in the middle).
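
For what it’s worth, the [UNK] seems to correspond to "Ġwann", since every other word is accounted for in the output. Continuing from the snippet above, and assuming the slow tokenizer’s vocab and wordpiece_tokenizer attributes behave the way I expect, a quick check suggests WordPiece simply can’t decompose that word with this vocabulary:

word = "Ġwann"
print(word in tokenizer.vocab)                       # presumably False for this vocab
print(tokenizer.wordpiece_tokenizer.tokenize(word))  # presumably ['[UNK]']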

So, it seems that this particular slow tokenizer isn’t following the docstring. Is this expected?

If not, is there a way to prevent replacement of unknown tokens? I wanted to use the slow BERT tokenizer over the fast one for exactly this reason, and it’d be great if there’s a way to make this work.
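
In case it clarifies what I’m after, below is the kind of workaround I had in mind. It’s only a sketch that goes through the slow tokenizer’s basic_tokenizer / wordpiece_tokenizer internals, it ignores added and special tokens, and tokenize_keep_unknown is just a name I made up; it continues from the snippet above:

def tokenize_keep_unknown(text):
    # Basic-tokenize into words, then WordPiece each word, but keep the
    # original word whenever WordPiece would have produced the unknown token.
    pieces = []
    for word in tokenizer.basic_tokenizer.tokenize(text):
        subtokens = tokenizer.wordpiece_tokenizer.tokenize(word)
        if tokenizer.unk_token in subtokens:
            pieces.append(word)  # keep e.g. 'Ġwann' instead of '[UNK]'
        else:
            pieces.extend(subtokens)
    return pieces

print(tokenize_keep_unknown(sentence))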

I’m using transformers v4.0.1, but it looks like this docstring hasn’t changed between master and 4.0.1.

Thanks!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Jan 28, 2021

Actually just fixing the docstrings and committing! As soon as we merge the PR, the documentation of the master branch will be updated.

0 reactions
ethch18 commented, Jan 28, 2021

Sure – would it just involve fixing the docstrings that say this in the Python code and then building the docs as specified here? Or is there more to it?

Read more comments on GitHub >
