
Fast tokenizers calculate wrong offsets when special characters are present

See original GitHub issue

Example:

>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True, add_special_tokens=False)
>>> sentence = "A, naïve [MASK] AllenNLP sentence."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in tokenized['offset_mapping']:
...     print(repr(sentence[start:end]))
'A'
','
'naïve'
' [MASK'
' Alle'
'nN'
'L'
' sentenc'
'e'

As you can see, after the word “naïve”, which contains the non-ASCII character “ï”, the offsets drift and the spans no longer line up with the original text.
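The drift is easy to detect mechanically: a correct offset span should never begin or end in whitespace. Here is a minimal, tokenizer-free sketch of such a check; the spans are hardcoded to reproduce the buggy output above, so the snippet runs without transformers installed.

```python
# Detect offset spans that start or end inside whitespace -- a telltale
# sign of a misaligned offset mapping. The spans below reproduce the
# buggy output printed above (hardcoded, so no tokenizer is needed).
sentence = "A, naïve [MASK] AllenNLP sentence."
buggy_spans = [(0, 1), (1, 2), (3, 8), (8, 14), (15, 20),
               (20, 22), (22, 23), (24, 32), (32, 33)]

def misaligned(text, spans):
    """Return the spans whose substring has leading or trailing whitespace."""
    return [(s, e) for s, e in spans if text[s:e] != text[s:e].strip()]

print(misaligned(sentence, buggy_spans))  # [(8, 14), (15, 20), (24, 32)]
```

The three flagged spans correspond to ' [MASK', ' Alle', and ' sentenc' in the output above.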

Issue Analytics

  • State: closed
  • Created: 4 years ago
  • Comments: 10 (9 by maintainers)

Top GitHub Comments

2 reactions
dirkgr commented, Feb 28, 2020

@mfuntowicz, would it make sense for me to integrate our tokenizer tests into your code, so you can see these things immediately? I’d be happy to do so.

0 reactions
n1t0 commented, Apr 22, 2020
>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True, add_special_tokens=False)
>>> sentence = "A, naïve [MASK] AllenNLP sentence."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in tokenized['offset_mapping']:
...     print(repr(sentence[start:end]))
'A'
','
'naïve'
'[MASK]'
'Allen'
'NL'
'P'
'sentence'
'.'

and

>>> import transformers
>>> t_fast = transformers.AutoTokenizer.from_pretrained("roberta-base", use_fast=True, add_special_tokens=False)
>>> sentence = "I went to the zoo yesterday, but they had only one animal."
>>> tokenized = t_fast.encode_plus(sentence, add_special_tokens=False, return_offsets_mapping=True)
>>> for start, end in (t for t in tokenized['offset_mapping'] if t is not None):
...     print(repr(sentence[start:end]))
'I'
'went'
'to'
'the'
'zoo'
'yesterday'
','
'but'
'they'
'had'
'only'
'one'
'animal'
'.' 
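Note that the generator expression above skips offsets that are None: older versions of the fast tokenizers marked special tokens with None in offset_mapping, while current versions use (0, 0). A small helper that tolerates both conventions might look like this (a pure-Python sketch with no transformers dependency; the sample mapping is hypothetical):

```python
# Keep only real character spans from an offset_mapping, dropping the
# special-token placeholders used by different versions: None (older)
# or the empty span (0, 0) (newer).
def real_spans(offset_mapping):
    return [s for s in offset_mapping if s is not None and s[0] != s[1]]

# Hypothetical mapping mixing both conventions:
mapping = [None, (0, 1), (2, 5), (0, 0)]
print(real_spans(mapping))  # [(0, 1), (2, 5)]
```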

and the last one:

from transformers import BertTokenizerFast
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
tokenizer.add_tokens(['[test]'])
text = "A [test] C"

print(tokenizer.encode(text, add_special_tokens=True))

results = tokenizer.encode_plus(
    text,
    return_offsets_mapping=True,
    pad_to_max_length=False,
    max_length=128,
    return_overflowing_tokens=False,
    add_special_tokens=True,
)

for se in results['offset_mapping']:
  if se:
    print(text[se[0]:se[1]], se)

gives

[101, 1037, 30522, 1039, 102]
 (0, 0)
A (0, 1)
[test] (2, 8)
C (9, 10)
 (0, 0)
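Using the ids and spans printed above, the empty (0, 0) spans for [CLS] and [SEP] can be filtered out when pairing token ids with the characters they cover. The sketch below hardcodes that output so it runs standalone:

```python
# Pair token ids with the text they cover, skipping the (0, 0)
# placeholder spans of special tokens. Ids and spans are hardcoded
# from the output above.
text = "A [test] C"
ids = [101, 1037, 30522, 1039, 102]                # [CLS] a [test] c [SEP]
spans = [(0, 0), (0, 1), (2, 8), (9, 10), (0, 0)]

aligned = [(i, text[s:e]) for i, (s, e) in zip(ids, spans) if (s, e) != (0, 0)]
print(aligned)  # [(1037, 'A'), (30522, '[test]'), (1039, 'C')]
```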

