
Predictions for pre-tokenized tokens with Roberta have strange offset_mapping

See original GitHub issue

Environment info

  • transformers version: 4.12.3
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.9.2
  • PyTorch version (GPU?): 1.9.0+cu111 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Error/Issue is in fast Roberta tokenizers

@LysandreJik

Information

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: POS tagging with Roberta-based models

I am trying to do POS tagging with a Roberta-based transformer; I base my code on this. The issue arises when I want to map the subword-tokenized predictions back to my tokens.

I followed this guide and it works for BERT-based models, but I do not know exactly how to check whether something is a subword token when add_prefix_space is used, because both offsets start with 1 when a token of length 1 is followed by a subword token:

distilbert-base-cased:

(0, 1)	I
(1, 3)	##KE
(3, 4)	##A

distilroberta-base:

(1, 1)	ĠI
(1, 3)	KE
(3, 4)	A

I do not know whether this is intended, but it makes it hard to align the predictions back to the original tokens, because the rule that, for subwords, the end index of one token equals the start index of the next is broken in fast Roberta tokenizers.

The WNUT example says: “That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100”, which means that we do not keep it. If the first position is now 1 instead, because a space is added for every token, this rule breaks.
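As an illustration, here is a minimal sketch of that rule (the helper name and signature are my own, not from the guide): a label is kept only for tokens whose offset starts at 0, which matches BERT-style offsets but, once the prefix space is added, would treat every real Roberta token as a continuation.

# Hypothetical helper sketching the WNUT-style alignment rule described above:
# keep a label only for tokens whose offset starts at 0 (treated as the first
# subword of a word); special tokens and continuation subwords get -100.
def wnut_style_labels(offsets, word_labels):
    labels, word_idx = [], 0
    for start, end in offsets:
        if (start, end) == (0, 0):      # special token ([CLS]/[SEP], <s>/</s>)
            labels.append(-100)
        elif start == 0:                # first subword of a word
            labels.append(word_labels[word_idx])
            word_idx += 1
        else:                           # continuation subword
            labels.append(-100)
    return labels

# With the Roberta offsets produced by add_prefix_space, every word start
# begins at 1, so this rule would assign -100 to every real token.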

To reproduce

Steps to reproduce the behavior:

  1. Tokenize pre-tokenized sequences (e.g. for POS tagging) with a fast Roberta tokenizer, using add_prefix_space together with is_split_into_words
  2. See that the offset_mapping looks strange
from collections import defaultdict

from transformers import AutoTokenizer

s = ['I', 'love', 'IKEA', 'very', 'much', '.']


keeps = defaultdict(list)

names = ["distilbert-base-cased", "distilroberta-base"]

for name in names:
    is_roberta = "roberta" in name
    tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True, add_prefix_space=is_roberta)

    encoding = tokenizer(
        s, truncation=True, padding=True, is_split_into_words=True, return_offsets_mapping=True
    )

    offsets = encoding.offset_mapping
    input_ids = encoding.input_ids

    decoded_tokens = tokenizer.convert_ids_to_tokens(input_ids)

    print(name)
    for idx in range(len(input_ids)):
        offset = offsets[idx]
        token_id = input_ids[idx]

        if is_roberta:
            # Roberta: word starts carry the "Ġ" space marker on the decoded token
            keep = decoded_tokens[idx][0] == "Ġ"
        else:
            # BERT: a word start has an offset that begins at 0 and is not the
            # special-token offset (0, 0)
            keep = offset != (0, 0) and offset[0] == 0

        print(f"{offset}\t{decoded_tokens[idx]}")

        keeps[name].append(keep)

    print()

for name in names:
    print(f"{name:25}\t{keeps[name]}")

Output

distilbert-base-cased
(0, 0)	[CLS]
(0, 1)	I
(0, 4)	love
(0, 1)	I
(1, 3)	##KE
(3, 4)	##A
(0, 4)	very
(0, 4)	much
(0, 1)	.
(0, 0)	[SEP]

distilroberta-base
(0, 0)	<s>
(1, 1)	ĠI
(1, 4)	Ġlove
(1, 1)	ĠI
(1, 3)	KE
(3, 4)	A
(1, 4)	Ġvery
(1, 4)	Ġmuch
(1, 1)	Ġ.
(0, 0)	</s>

distilbert-base-cased    	[False, True, True, True, False, False, True, True, True, False]
distilroberta-base       	[False, True, True, True, False, False, True, True, True, False]

Expected behavior

I would expect the offsets to behave similarly to when add_prefix_space is not used, i.e. the automatically added space should not influence the offsets. Is there a better way to align tokens and predictions for Roberta tokenizers than checking whether the first character is a space?
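For what it is worth, one way to side-step the offsets entirely (a sketch of a possible workaround, not necessarily the recommended solution) is the word_ids() mapping that fast tokenizers expose, which reports for every token the index of the pre-tokenized word it came from:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilroberta-base", use_fast=True, add_prefix_space=True
)
s = ['I', 'love', 'IKEA', 'very', 'much', '.']
encoding = tokenizer(s, is_split_into_words=True)

# word_ids() maps every token to the index of the original word it came from
# (None for special tokens), independent of the offset convention.
keeps, previous = [], None
for word_id in encoding.word_ids():
    keeps.append(word_id is not None and word_id != previous)
    previous = word_id

print(keeps)  # should mark exactly the first subword of each word as True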

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

1 reaction
ohmeow commented, Mar 23, 2022

Yup … my version of tokenizers was outdated! Sorry to bother you 😃

Thanks for the follow-up.

1 reaction
SaulLu commented, Dec 13, 2021

First of all, thank you very much for the detailed issue, which makes it very easy to understand your problem. 🤗

To put it in context, the offsets feature comes from the (Rust) Tokenizers library, and I must unfortunately admit that I need a little more information about the behavior of this library before I can provide you with a solution to your problem (see the question I asked here).

That being said, I strongly suspect there was also an oversight on our part in adapting the tokenizer stored in the backend_tokenizer of the transformers library (see this PR). I propose to wait a little longer for additional information on the behavior of the Rust library (which would confirm the necessity of this PR).
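For anyone who wants to check how their backend tokenizer is configured, the following sketch dumps the serialized state of the Rust tokenizer; which fields show up there (e.g. add_prefix_space on the ByteLevel pre-tokenizer, trim_offsets on the post-processor) depends on the model and the tokenizers version.

import json

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "distilroberta-base", use_fast=True, add_prefix_space=True
)

# The fast tokenizer wraps a Rust tokenizers.Tokenizer; its JSON serialization
# contains the pre-tokenizer and post-processor configuration.
state = json.loads(tokenizer.backend_tokenizer.to_str())
print(json.dumps(state["pre_tokenizer"], indent=2))
print(json.dumps(state["post_processor"], indent=2))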
