
Slow BERT Tokenizer adds UNK when calling tokenize()

See original GitHub issue

Hi! I’ve run into an inconsistency between the base tokenizer docstring and the slow BERT tokenizer. Specifically, when calling tokenizer.tokenize(), the [UNK] token is inserted for unknown tokens, even though the docstring says that such tokens should be left unchanged. Here’s how I’m calling the tokenizer:

from transformers import BertTokenizer

# save_dir points to a locally saved BERT tokenizer
tokenizer = BertTokenizer.from_pretrained(
    save_dir, do_lower_case=False, strip_accents=False, tokenize_chinese_chars=True
)

sentence = "RINDIRIZZA Ġwann Marija Vianney"
print(tokenizer.tokenize(sentence))

and the output is

['RI', '##ND', '##IR', '##I', '##Z', '##ZA', '[UNK]', 'Marija', 'Via', '##nne', '##y']

(notice the [UNK] in the middle).
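
For what it’s worth, the [UNK] seems to correspond to "Ġwann", since every other word is accounted for in the output. Continuing from the snippet above, and assuming the slow tokenizer’s vocab and wordpiece_tokenizer attributes behave the way I expect, a quick check suggests WordPiece simply can’t decompose that word with this vocabulary:

word = "Ġwann"
print(word in tokenizer.vocab)                       # presumably False for this vocab
print(tokenizer.wordpiece_tokenizer.tokenize(word))  # presumably ['[UNK]']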

So, it seems that this particular slow tokenizer isn’t following the docstring. Is this expected?

If not, is there a way to prevent replacement of unknown tokens? I wanted to use the slow BERT tokenizer over the fast one for exactly this reason, and it’d be great if there’s a way to make this work.
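
In case it clarifies what I’m after, below is the kind of workaround I had in mind. It’s only a sketch that goes through the slow tokenizer’s basic_tokenizer / wordpiece_tokenizer internals, it ignores added and special tokens, and tokenize_keep_unknown is just a name I made up; it continues from the snippet above:

def tokenize_keep_unknown(text):
    # Basic-tokenize into words, then WordPiece each word, but keep the
    # original word whenever WordPiece would have produced the unknown token.
    pieces = []
    for word in tokenizer.basic_tokenizer.tokenize(text):
        subtokens = tokenizer.wordpiece_tokenizer.tokenize(word)
        if tokenizer.unk_token in subtokens:
            pieces.append(word)  # keep e.g. 'Ġwann' instead of '[UNK]'
        else:
            pieces.extend(subtokens)
    return pieces

print(tokenize_keep_unknown(sentence))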

I’m using transformers v4.0.1, but it looks like this docstring hasn’t changed between master and 4.0.1.

Thanks!

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (7 by maintainers)

Top GitHub Comments

2 reactions
LysandreJik commented, Jan 28, 2021

Actually just fixing the docstrings and committing! As soon as we merge the PR, the documentation of the master branch will be updated.

0 reactions
ethch18 commented, Jan 28, 2021

Sure – would it just involve fixing the docstrings that say this in the Python code and then building the docs as specified here? Or is there more to it?

Read more comments on GitHub >
