Tokenizer encode to have an option to overflow from left
🚀 Feature request
The current tokenizer encode variants (encode, batch_encode, batch_encode_plus) handle sequences longer than max_length by overflowing tokens from the right-hand side, thus restricting the length to max_length. This feature request is to add an option for the tokenizer encode methods to overflow tokens from the left-hand side as well.
Motivation
For problems dealing with dialog, e.g. training an intent classification or next-sentence-prediction model, if the dialog is longer than max_length, one would like to discard tokens from the beginning of the conversation, since they are less relevant than the more recent messages.
This motivates the need for an encoder that works well with dialog data, where more recent tokens are more valuable.
Your contribution
I could change the function truncate_sequences by adding a new truncation_strategy option that truncates from the left, but I want to get feedback from the Hugging Face team about this proposal.
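The proposed change could look roughly like the following. This is a minimal, self-contained sketch rather than the actual truncate_sequences implementation in transformers; the strategy name "only_first_left" is an assumption invented here for illustration.

```python
def truncate_sequences(ids, max_length, truncation_strategy="only_first"):
    """Truncate a list of token ids to max_length.

    "only_first" mimics the existing behaviour (drop tokens from the
    right); "only_first_left" is the hypothetical new option proposed
    here, which drops tokens from the beginning of the sequence so that
    the most recent tokens are kept.

    Returns (truncated_ids, overflowing_ids).
    """
    overflow = len(ids) - max_length
    if overflow <= 0:
        return ids, []
    if truncation_strategy == "only_first_left":
        # Overflow from the left: keep the most recent tokens.
        return ids[overflow:], ids[:overflow]
    # Default: overflow from the right (current behaviour).
    return ids[:-overflow], ids[-overflow:]
```

For a 10-token dialog with max_length=6, the left strategy keeps the last 6 tokens instead of the first 6, which is exactly what the dialog use case above needs.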
Issue Analytics
- State:
- Created 3 years ago
- Reactions: 10
- Comments: 9 (6 by maintainers)
Top GitHub Comments
Indeed, we should add an option to truncate on the left! cc @n1t0 for our sprint of September.
perhaps add a truncation_side to https://huggingface.co/transformers/main_classes/tokenizer.html#transformers.PreTrainedTokenizer to be consistent with padding_side.
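To illustrate the symmetry with padding_side suggested above, here is a toy sketch (not the real PreTrainedTokenizer API) of how a truncation_side attribute could behave alongside the existing padding_side:

```python
class ToyTokenizer:
    """Toy model of the proposed API: a truncation_side attribute that
    mirrors padding_side. All names here are illustrative assumptions,
    not the actual transformers implementation."""

    def __init__(self, padding_side="right", truncation_side="right",
                 pad_token_id=0):
        self.padding_side = padding_side
        self.truncation_side = truncation_side
        self.pad_token_id = pad_token_id

    def encode(self, ids, max_length):
        # Truncate on the configured side.
        if len(ids) > max_length:
            if self.truncation_side == "left":
                ids = ids[-max_length:]  # keep the most recent tokens
            else:
                ids = ids[:max_length]
        # Pad on the configured side.
        if len(ids) < max_length:
            pad = [self.pad_token_id] * (max_length - len(ids))
            ids = pad + ids if self.padding_side == "left" else ids + pad
        return ids
```

With truncation_side="left", encoding an 8-token dialog to max_length=5 keeps the last 5 tokens, matching the behaviour requested in this issue.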