
Fast tokenizers calculate wrong offsets when special characters are present

See original GitHub issue

Example:

>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True, add_special_tokens=False)
>>> sentence = "A, naïve [MASK] AllenNLP sentence."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in tokenized['offset_mapping']:
...     print(repr(sentence[start:end]))
'A'
','
'naïve'
' [MASK'
' Alle'
'nN'
'L'
' sentenc'
'e'

As you can see, after the word “naïve”, which contains the non-ASCII character “ï”, the offsets drift and the spans no longer line up with the original text.
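The drift is easy to detect mechanically: a correct offset span should never begin or end in whitespace. Here is a minimal, tokenizer-free sketch of such a check; the spans are hardcoded to reproduce the buggy output above, so the snippet runs without transformers installed.

```python
# Detect offset spans that start or end inside whitespace -- a telltale
# sign of a misaligned offset mapping. The spans below reproduce the
# buggy output printed above (hardcoded, so no tokenizer is needed).
sentence = "A, naïve [MASK] AllenNLP sentence."
buggy_spans = [(0, 1), (1, 2), (3, 8), (8, 14), (15, 20),
               (20, 22), (22, 23), (24, 32), (32, 33)]

def misaligned(text, spans):
    """Return the spans whose substring has leading or trailing whitespace."""
    return [(s, e) for s, e in spans if text[s:e] != text[s:e].strip()]

print(misaligned(sentence, buggy_spans))  # [(8, 14), (15, 20), (24, 32)]
```

The three flagged spans correspond to ' [MASK', ' Alle', and ' sentenc' in the output above.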

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

2 reactions
dirkgr commented, Feb 28, 2020

@mfuntowicz, would it make sense for me to integrate our tokenizer tests into your code, so you can see these things immediately? I’d be happy to do so.

0 reactions
n1t0 commented, Apr 22, 2020
>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True, add_special_tokens=False)
>>> sentence = "A, naïve [MASK] AllenNLP sentence."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in tokenized['offset_mapping']:
...     print(repr(sentence[start:end]))
'A'
','
'naïve'
'[MASK]'
'Allen'
'NL'
'P'
'sentence'
'.'

and

>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_special_tokens=False)
>>> sentence = "I went to the zoo yesterday, but they had only one animal."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in (t for t in tokenized['offset_mapping'] if t is not None):
...     print(repr(sentence[start:end]))
'I'
'went'
'to'
'the'
'zoo'
'yesterday'
','
'but'
'they'
'had'
'only'
'one'
'animal'
'.' 
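Note that the generator expression above skips offsets that are None: older versions of the fast tokenizers marked special tokens with None in offset_mapping, while current versions use (0, 0). A small helper that tolerates both conventions might look like this (a pure-Python sketch with no transformers dependency; the sample mapping is hypothetical):

```python
# Keep only real character spans from an offset_mapping, dropping the
# special-token placeholders used by different versions: None (older)
# or the empty span (0, 0) (newer).
def real_spans(offset_mapping):
    return [s for s in offset_mapping if s is not None and s[0] != s[1]]

# Hypothetical mapping mixing both conventions:
mapping = [None, (0, 1), (2, 5), (0, 0)]
print(real_spans(mapping))  # [(0, 1), (2, 5)]
```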

and the last one:

from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['[test]'])
text = "A [test] C"

print(tokenizer.encode(text, add_special_tokens=True))

results = tokenizer.encode_plus(
    text,
    return_offsets_mapping=True,
    pad_to_max_length=False,
    max_length=128,
    return_overflowing_tokens=False,
    add_special_tokens=True,
)

for se in results['offset_mapping']:
  if se:
    print(text[se[0]:se[1]], se)

gives

[101, 1037, 30522, 1039, 102]
 (0, 0)
A (0, 1)
[test] (2, 8)
C (9, 10)
 (0, 0)
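Using the ids and spans printed above, the empty (0, 0) spans for [CLS] and [SEP] can be filtered out when pairing token ids with the characters they cover. The sketch below hardcodes that output so it runs standalone:

```python
# Pair token ids with the text they cover, skipping the (0, 0)
# placeholder spans of special tokens. Ids and spans are hardcoded
# from the output above.
text = "A [test] C"
ids = [101, 1037, 30522, 1039, 102]                # [CLS] a [test] c [SEP]
spans = [(0, 0), (0, 1), (2, 8), (9, 10), (0, 0)]

aligned = [(i, text[s:e]) for i, (s, e) in zip(ids, spans) if (s, e) != (0, 0)]
print(aligned)  # [(1037, 'A'), (30522, '[test]'), (1039, 'C')]
```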

