Consider adding "middle" option for tokenizer truncation_side argument
Feature request
With the merge of https://github.com/huggingface/transformers/pull/14947, tokenizers can now truncate text from the left as well as from the right. However, for some NLP tasks, such as summarization of long documents, it may also be advantageous to truncate the middle of the document instead. For example, if the maximum sequence length is 512 tokens and a document exceeds that length, we might want to keep the first 256 and the last 256 tokens of the document and truncate everything in between. This issue requests implementation of that option.
Motivation
This feature could be helpful in particular when dealing with long documents (for example, Longformer summarization tasks). Depending on the document's domain, the start of the document often sets out relevant information, and the end may contain a useful recap of the main points discussed, so both can be very relevant and valuable to keep, while the text in the middle may not be as important. Adding an option truncation_side="middle", which would retain the first 256 and the last 256 tokens, could therefore be very helpful for certain use cases.
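As a point of reference, the requested behavior can be approximated outside the tokenizer today. Below is a minimal sketch; `truncate_middle` is a hypothetical helper and not part of the `transformers` API, and it operates on a plain list of token IDs (such as the `input_ids` returned by a tokenizer called with `add_special_tokens=False`).

```python
def truncate_middle(token_ids, max_length=512):
    """Keep the first and last portions of the token budget, dropping the middle.

    Hypothetical helper illustrating the requested truncation_side="middle"
    behavior. In practice, token_ids would come from a tokenizer call such as
    tokenizer(text, add_special_tokens=False)["input_ids"], and max_length
    should be reduced to leave room for any special tokens added afterwards.
    """
    if len(token_ids) <= max_length:
        return token_ids
    head = max_length // 2       # e.g. 256 tokens for a 512-token budget
    tail = max_length - head     # remaining budget (also 256 here)
    return token_ids[:head] + token_ids[-tail:]
```

A built-in option would still be preferable, since a manual helper like this sits awkwardly between tokenization and padding/special-token handling.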
Your contribution
I have limited bandwidth right now, but I might consider contributing if this can be done as a quick fix and someone from HuggingFace can provide oversight.
Issue Analytics
- Created a year ago
- Comments: 5 (4 by maintainers)
Top GitHub Comments
100% agree with @SaulLu .
There might be a use case, but it doesn’t seem like a blatantly missing feature (and we try to focus on those). Future reader, make yourself heard so that we can revisit our opinion 😃
Ok, that’s fine, thanks a lot for getting back to me @SaulLu. Let’s see if there is more appetite; if not, we can leave it here for now. I can always implement the truncation myself for my specific model and tokenizer. I just thought it might be a helpful feature to have, but as you said, we’d need to see how much demand there is. Feel free to close the issue if appropriate.