
Padding offset_mapping via `tokenizer.pad`


Feature request

While preparing a dataset for a Named Entity Recognition (NER) task, I noticed that `tokenizer.pad` does not pad `offset_mapping`, even though padded offsets are needed for NER and for other token-level tasks.

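(The original screenshot is not preserved. Below is a minimal sketch of the behavior being described, assuming a fast tokenizer such as `bert-base-uncased`: after `tokenizer.pad`, the `input_ids` are aligned to a common length while `offset_mapping` is left ragged.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode two sentences of different lengths, asking the fast tokenizer
# for the character offsets of each token.
encodings = tokenizer(
    ["Hello world", "A noticeably longer example sentence"],
    return_offsets_mapping=True,
)

# `pad` aligns input_ids and attention_mask to the longest sequence,
# but leaves offset_mapping untouched.
batch = tokenizer.pad(encodings, padding=True)

print([len(ids) for ids in batch["input_ids"]])       # equal lengths
print([len(off) for off in batch["offset_mapping"]])  # still ragged
```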

Motivation

To obtain a padded `offset_mapping` I have to write extra code by hand, which feels clumsy next to the otherwise convenient HuggingFace API.

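(The screenshots of the workaround are not preserved. A minimal sketch of the kind of manual padding being described, assuming right-side padding, might look like this; `pad_offset_mapping` is a hypothetical helper, not part of the library.)

```python
from typing import Dict, List, Tuple


def pad_offset_mapping(
    encodings: Dict[str, list],
) -> List[List[Tuple[int, int]]]:
    """Pad each example's offset_mapping to the batch's longest sequence
    with (0, 0) tuples, the offsets used for special tokens."""
    max_length = max(len(ids) for ids in encodings["input_ids"])
    return [
        list(offsets) + [(0, 0)] * (max_length - len(offsets))
        for offsets in encodings["offset_mapping"]
    ]


# Usage, continuing from the example above:
# batch["offset_mapping"] = pad_offset_mapping(batch)
```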

Your contribution

Only a few lines of code would be needed to pad `offset_mapping` with `(0, 0)` tuples, the same offsets already used for special tokens.

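(The screenshot of the proposed change is not preserved. Conceptually, the suggestion is that the tokenizer's internal per-example padding step should treat `offset_mapping` like the other per-token fields. The sketch below illustrates that logic as a standalone function; the name and structure are assumed for illustration and are not taken from an actual patch.)

```python
from typing import Dict, List, Tuple


def pad_example_offsets(
    example: Dict[str, list],
    difference: int,
    padding_side: str = "right",
) -> Dict[str, list]:
    """Sketch: alongside input_ids / attention_mask, extend an example's
    offset_mapping by `difference` (0, 0) entries, mirroring the offsets
    assigned to special tokens."""
    if "offset_mapping" in example:
        pad: List[Tuple[int, int]] = [(0, 0)] * difference
        if padding_side == "right":
            example["offset_mapping"] = list(example["offset_mapping"]) + pad
        else:
            example["offset_mapping"] = pad + list(example["offset_mapping"])
    return example
```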

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
vad13irt commented, Aug 19, 2022

Ok, I will write it as soon as possible.

1 reaction
SaulLu commented, Aug 19, 2022

I apologize; on rereading your feature request I understand it better now. It concerns, as you say, the `pad` method common to slow and fast tokenizers, not the `__call__` method.

This feature makes sense; we could absolutely add the offsets. Would you be interested in working on it?

