
Tokenizer encode to have an option to overflow from left

See original GitHub issue

🚀 Feature request

The current tokenizer encode variants (encode, batch_encode, batch_encode_plus) handle sequences longer than max_length by overflowing tokens from the right-hand side, thus restricting the length to max_length. This feature request is for an option that lets the tokenizer encode methods overflow tokens from the left-hand side as well.
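
For reference, a minimal sketch of the current right-side truncation and a manual left-side workaround, assuming the transformers library and a standard bert-base-uncased checkpoint (the 32-token max_length is just for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "turn one " * 300  # a long, dialog-like input

# Default behaviour: tokens beyond max_length are dropped from the right.
right_truncated = tokenizer.encode(text, max_length=32, truncation=True)

# Workaround for left-side truncation: encode without limits, keep the tail,
# then re-attach the special tokens manually.
ids = tokenizer.encode(text, add_special_tokens=False)
keep = 32 - tokenizer.num_special_tokens_to_add()
left_truncated = tokenizer.build_inputs_with_special_tokens(ids[-keep:])

print(len(right_truncated), len(left_truncated))  # both 32
```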

Motivation

For problems dealing with dialog, if one were to train an intent classification or next sentence prediction model and the dialog were longer than max_length, one would like to throw away tokens from the beginning of the conversation, as they are less relevant than the more recent messages.

This motivates the need for an encoder that works well with dialog data, where more recent tokens are more valuable.

Your contribution

I could change the truncate_sequences function by adding a new truncation_strategy option that truncates from the left, but I want to get feedback from the Hugging Face team about this proposal. A hypothetical standalone sketch of the proposed behaviour is shown below.
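
The following is only an illustration of the idea, not the actual transformers truncate_sequences implementation; the function name and signature are made up for the sketch:

```python
from typing import List, Tuple

def truncate_left(ids: List[int], num_tokens_to_remove: int) -> Tuple[List[int], List[int]]:
    """Return (kept_ids, overflowing_ids), removing tokens from the left."""
    if num_tokens_to_remove <= 0:
        return ids, []
    return ids[num_tokens_to_remove:], ids[:num_tokens_to_remove]

# Example: keep the 4 most recent tokens of a 6-token sequence.
kept, overflow = truncate_left([10, 11, 12, 13, 14, 15], num_tokens_to_remove=2)
assert kept == [12, 13, 14, 15] and overflow == [10, 11]
```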

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 10
  • Comments: 9 (6 by maintainers)

Top GitHub Comments

9 reactions
thomwolf commented, Aug 20, 2020

Indeed, we should add an option to truncate on the left! cc @n1t0 for our sprint of September.

3 reactions
ldong87 commented, Aug 26, 2020

Perhaps add a truncation_side option to https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer, to be consistent with padding_side.
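
A minimal usage sketch of the suggested API, mirroring padding_side; later transformers releases did add a truncation_side attribute along these lines, and the snippet assumes such a version:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.truncation_side = "left"  # analogous to tokenizer.padding_side = "left"

encoded = tokenizer(
    "a long dialog history that exceeds the model's max_length ...",
    max_length=32,
    truncation=True,
)
# With truncation_side="left", the oldest (leftmost) tokens are the ones dropped.
```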


