
Padding offset_mapping via `tokenizer.pad`


Feature request

While preparing a dataset for a Named Entity Recognition (NER) task, I noticed that `tokenizer.pad` does not pad `offset_mapping`, even though padded offsets are needed for NER and for other token-level tasks.

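(The original screenshot is not preserved. Below is a minimal sketch of the behavior being described, assuming a fast tokenizer such as `bert-base-uncased`: after `tokenizer.pad`, the `input_ids` are aligned to a common length while `offset_mapping` is left ragged.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode two sentences of different lengths, asking the fast tokenizer
# for the character offsets of each token.
encodings = tokenizer(
    ["Hello world", "A noticeably longer example sentence"],
    return_offsets_mapping=True,
)

# `pad` aligns input_ids and attention_mask to the longest sequence,
# but leaves offset_mapping untouched.
batch = tokenizer.pad(encodings, padding=True)

print([len(ids) for ids in batch["input_ids"]])       # equal lengths
print([len(off) for off in batch["offset_mapping"]])  # still ragged
```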

Motivation

To obtain a padded `offset_mapping` I have to write extra code by hand, which feels clumsy next to the otherwise convenient HuggingFace API.

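(The screenshots of the workaround are not preserved. A minimal sketch of the kind of manual padding being described, assuming right-side padding, might look like this; `pad_offset_mapping` is a hypothetical helper, not part of the library.)

```python
from typing import Dict, List, Tuple


def pad_offset_mapping(
    encodings: Dict[str, list],
) -> List[List[Tuple[int, int]]]:
    """Pad each example's offset_mapping to the batch's longest sequence
    with (0, 0) tuples, the offsets used for special tokens."""
    max_length = max(len(ids) for ids in encodings["input_ids"])
    return [
        list(offsets) + [(0, 0)] * (max_length - len(offsets))
        for offsets in encodings["offset_mapping"]
    ]


# Usage, continuing from the example above:
# batch["offset_mapping"] = pad_offset_mapping(batch)
```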

Your contribution

Only a few lines of code would be needed to pad `offset_mapping` with `(0, 0)` tuples, the same offsets already used for special tokens.

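(The screenshot of the proposed change is not preserved. Conceptually, the suggestion is that the tokenizer's internal per-example padding step should treat `offset_mapping` like the other per-token fields. The sketch below illustrates that logic as a standalone function; the name and structure are assumed for illustration and are not taken from an actual patch.)

```python
from typing import Dict, List, Tuple


def pad_example_offsets(
    example: Dict[str, list],
    difference: int,
    padding_side: str = "right",
) -> Dict[str, list]:
    """Sketch: alongside input_ids / attention_mask, extend an example's
    offset_mapping by `difference` (0, 0) entries, mirroring the offsets
    assigned to special tokens."""
    if "offset_mapping" in example:
        pad: List[Tuple[int, int]] = [(0, 0)] * difference
        if padding_side == "right":
            example["offset_mapping"] = list(example["offset_mapping"]) + pad
        else:
            example["offset_mapping"] = pad + list(example["offset_mapping"])
    return example
```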

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 8 (4 by maintainers)

Top GitHub Comments

1 reaction
vad13irt commented, Aug 19, 2022

Ok, I will write it as soon as possible.

1 reaction
SaulLu commented, Aug 19, 2022

I apologize; on rereading your feature request I understand it better now. It concerns, as you say, the `pad` method common to slow and fast tokenizers, not the `__call__` method.

This feature makes sense; we could absolutely add the offsets. Would you be interested in working on it?

