Padding `offset_mapping` via `tokenizer.pad`
Feature request
While preparing a dataset for a Named Entity Recognition task, I noticed that `tokenizer.pad` does not apply padding to `offset_mapping`, which is needed not only for Named Entity Recognition.
Motivation
To get a padded `offset_mapping`, I have to write several extra lines myself, which is frustrating next to the otherwise great Hugging Face library API.
Your contribution
Only a few lines of code are needed: pad `offset_mapping` with `(0, 0)` tuples, the same offsets used for special tokens. A sketch of the current manual workaround follows.
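A minimal sketch of that workaround, assuming a fast tokenizer (the model name and sentences are illustrative): `tokenizer.pad` brings `input_ids` and `attention_mask` to a common length but leaves `offset_mapping` untouched, so the offsets are extended by hand.

```python
from transformers import AutoTokenizer

# Requires a fast tokenizer; return_offsets_mapping is not
# supported by slow (pure-Python) tokenizers.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

batch = tokenizer(
    ["Hugging Face is based in NYC", "Hi"],
    return_offsets_mapping=True,
)

# pad() lengthens input_ids and attention_mask, but offset_mapping
# passes through at its original, unpadded length.
padded = tokenizer.pad(batch, padding=True)

# Manually extend each offset list with (0, 0), the same offsets
# already used for special tokens such as [CLS] and [SEP].
for offsets, ids in zip(padded["offset_mapping"], padded["input_ids"]):
    offsets.extend([(0, 0)] * (len(ids) - len(offsets)))
```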
Issue Analytics
- Created: a year ago
- Comments: 8 (4 by maintainers)
Top GitHub Comments
Ok, I will write it as soon as possible.
I apologize; rereading your feature request, I now understand it better. It concerns, as you say, the `pad` method common to slow and fast tokenizers, not the `__call__` method. This feature makes sense; we could absolutely add the offsets. Would you be interested in working on it?