
Word offsets of some fast tokenizers are not compatible with token classification pipeline label aggregation


System Info

  • transformers version: 4.21.0.dev0
  • Platform: macOS-12.4-x86_64-i386-64bit
  • Python version: 3.9.13
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.11.0 (False)
  • Tensorflow version (GPU?): 2.9.1 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.5.2 (cpu)
  • Jax version: 0.3.6
  • JaxLib version: 0.3.5
  • Using GPU in script?: N
  • Using distributed or parallel set-up in script?: N

Who can help?

Tagging @Narsil for pipelines and @SaulLu for tokenization. Let me know if I should tag anyone for specific models, but it’s not really a model issue, except in terms of tokenization.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, …)
  • My own task or dataset (give details below)

Reproduction

I noticed this issue with a DeBERTa model, but it also affects some others. The high-level issue is that some tokenizers include leading spaces in their offset indices, some exclude them, and some are configurable via trim_offsets. When offsets include leading spaces (equivalent to trim_offsets==False), the pipeline’s word heuristic fails and all tokens in the sequence get aggregated into a single label. Simple example:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_name = "brandon25/deberta-base-finetuned-ner"
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

ner_aggregate = pipeline("ner", model=model, tokenizer=tokenizer, ignore_labels=[], aggregation_strategy="max")
ner_aggregate("We're from New York")

Result:

[{'entity_group': 'O', 'score': 0.9999778, 'word': " We're from New York", 'start': 0, 'end': 19}]

Expected behavior

Expected result, something like:

[{'entity_group': 'O', 'score': 0.9999778, 'word': " We're from", 'start': 0, 'end': 10}, {'entity_group': 'O', 'score': 0.9xxx, 'word': "New York", 'start': 11, 'end': 19}]

If you’d like to see actual output, here’s a colab notebook with relevant models for comparison.

This affects at least these:

  • DeBERTa V1
  • DeBERTa V2/3
  • GPT2 (tested because DebertaTokenizerFast is a subclass of GPT2TokenizerFast)
  • Depending on config, Roberta (and any other tokenizer that honors trim_offsets==False)
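
To make the offset difference visible for the tokenizers listed above, a quick check like the one below prints each token’s offset span for the example sentence (the bert-base-cased checkpoint is only a contrasting reference I added; it is not part of the original report). For the affected tokenizers the spans start at the preceding space, while for BERT-style tokenizers they start at the word’s first character.

from transformers import AutoTokenizer

text = "We're from New York"
for name in ["brandon25/deberta-base-finetuned-ner", "bert-base-cased"]:
    tok = AutoTokenizer.from_pretrained(name)  # fast tokenizer by default
    enc = tok(text, return_offsets_mapping=True, add_special_tokens=False)
    print(name)
    for start, end in enc["offset_mapping"]:
        # repr() makes any leading space captured by the offsets visible
        print(f"  ({start:2d}, {end:2d}) {text[start:end]!r}")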

The easiest solution would be to update the heuristic. Here is a change that works for a preceding space in the sequence (like the current heuristic) or a leading space in the token. I can turn it into a PR if desired.
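
For illustration, here is a minimal sketch of that idea (the helper names are hypothetical and the space-marker characters are my assumption; this is not the pipeline’s actual code): a token counts as the start of a word if it sits at offset 0, if the character before its start offset is a space (the current heuristic), or if the token’s own text carries a leading space or space marker, which is what tokenizers that keep leading spaces in their offsets produce.

def looks_like_word_start(sentence: str, start: int, token_text: str) -> bool:
    # Hypothetical helper, not the pipeline's actual code.
    if start == 0:
        return True
    preceded_by_space = sentence[start - 1] == " "
    # Tokenizers that keep the leading space in offsets/token text typically
    # expose it as a plain space, a byte-level BPE "Ġ", or a SentencePiece "▁".
    has_leading_space_marker = token_text.startswith((" ", "\u0120", "\u2581"))
    return preceded_by_space or has_leading_space_marker

# Anything that is not a word start is treated as a subword continuation
# and fused with the previous token during aggregation.
def looks_like_subword(sentence: str, start: int, token_text: str) -> bool:
    return not looks_like_word_start(sentence, start, token_text)

The key point is only that a leading space captured inside the token should count the same as a space that precedes it in the sentence.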

I know a lot of the default configuration matches reference implementations or published research, so I’m not sure whether the inconsistencies between tokenizers are desired behavior. I did notice, for example, that some SentencePiece tokenizers include leading spaces in offset indices (DeBERTa V2/3) and some don’t (ALBERT, XLNet). I looked at the converter config and the Rust code (which is pretty opaque to me), but it’s not obvious to me why the offsets differ. Do you know, @SaulLu? Is that expected?

I am comparing different architectures to replace a production BERT model and was evaluating models fine-tuned on an internal dataset when I ran into this. I have my manager’s blessing to spend some time on this (and already have! 😂), so I’m happy to work on a PR or help out however I can.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 16 (16 by maintainers)

Top GitHub Comments

1 reaction
Narsil commented, Aug 1, 2022

@davidbenton what’s your environment? I can’t seem to reproduce on my local env.

Do you mind creating a new issue for this? Report it like a regular bug; there should be tools to print your exact env. https://github.com/huggingface/transformers/issues/new?assignees=&labels=bug&template=bug-report.yml

As I said, slow tests can sometimes be a little flakier than fast tests, but usually within acceptable bounds (PyTorch will modify kernels, which affects values ever so slightly but can pile up; the Python version can break dictionary order, etc.)

1 reaction
Narsil commented, Aug 1, 2022

Thanks for flagging, I am looking into it right now 😃
