Fast tokenizers calculate wrong offsets when special characters are present
Example:
>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True, add_special_tokens=False)
>>> sentence = "A, naïve [MASK] AllenNLP sentence."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in tokenized['offset_mapping']:
... print(repr(sentence[start:end]))
'A'
','
'naïve'
' [MASK'
' Alle'
'nN'
'L'
' sentenc'
'e'
As you can see, after the word “naïve” the offsets go off the rails: every subsequent span is shifted one character to the left (e.g. ' [MASK' instead of '[MASK]'), which is consistent with the non-ASCII character “ï” being counted incorrectly.
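For reference, the following snippet (not part of the original report; it assumes the same transformers install and the same "bert-base-cased" checkpoint) makes the drift visible by checking each offset span against its token's surface form:

import transformers

t_fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
sentence = "A, naïve [MASK] AllenNLP sentence."
enc = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
tokens = t_fast.convert_ids_to_tokens(enc["input_ids"])
for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    span = sentence[start:end]
    # WordPiece continuation tokens carry a "##" prefix that never appears in the text.
    expected = token[2:] if token.startswith("##") else token
    print(f"{token!r:>12} -> {span!r:>12} {'ok' if span == expected else 'MISMATCH'}")

With correct offsets every line prints "ok"; with the buggy ones, each token after "naïve" should print "MISMATCH".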
@mfuntowicz, would it make sense for me to integrate our tokenizer tests into your code, so you can see these things immediately? I’d be happy to do so.
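A hypothetical sketch of what such an integrated test could look like (the test name and sentences are illustrative, not taken from either project's test suite):

import transformers

def test_fast_tokenizer_offsets_match_text():
    tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
    for sentence in ["A, naïve [MASK] AllenNLP sentence.", "Crème brûlée, s'il vous plaît!"]:
        enc = tokenizer.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])
        for token, (start, end) in zip(tokens, enc["offset_mapping"]):
            if token == tokenizer.unk_token:
                continue  # an unknown token's surface text never equals "[UNK]"
            # Strip the WordPiece continuation prefix before comparing.
            expected = token[2:] if token.startswith("##") else token
            assert sentence[start:end] == expected, (token, start, end)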