
Consider adding "middle" option for tokenizer truncation_side argument

See original GitHub issue

Feature request

At the moment, thanks to PR https://github.com/huggingface/transformers/pull/14947, it is possible to truncate text from the left instead of only from the right. However, for some NLP tasks, such as summarization of long documents, it can also be advantageous to truncate the middle of the document. For example, if the maximum sequence length is 512 tokens and a document exceeds this length, we might want to keep the first 256 and the last 256 tokens and truncate everything in between. This issue requests that such an option be implemented.
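For context, the behaviour added by that PR is controlled through the tokenizer's truncation_side setting. A minimal sketch of how it is used today, assuming a recent transformers release that includes that PR (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Currently only "right" (the default) and "left" are accepted.
tokenizer.truncation_side = "left"

encoded = tokenizer(
    "a very long document ...",
    truncation=True,
    max_length=512,
)

# This issue asks for a third value, e.g. truncation_side="middle",
# which would keep the first 256 and the last 256 tokens instead.
```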

Motivation

This feature could be helpful because, particularly with long documents (for example, longformer summarization tasks), the start of a document may set out relevant information and the end may contain a useful recap of the main points discussed, depending on the document's domain. Both can therefore be very relevant and valuable to keep, whereas the text in the middle may be less important. Adding a truncation_side="middle" option, allowing retention of the first 256 and the last 256 tokens, could be very helpful for certain use cases.

Your contribution

I have limited bandwidth right now, but I might consider contributing if this can be done as a quick fix and someone from Hugging Face can provide oversight.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
Narsil commented, Jul 15, 2022

100% agree with @SaulLu.

There might be a use case, but it doesn’t seem like a blatantly missing feature (and we try to focus on those). Future reader, make yourself heard so that we can revisit our opinion 😃

1 reaction
AndreaSottana commented, Jul 4, 2022

Ok, that’s fine, thanks a lot for getting back to me @SaulLu. Let’s see if there is more appetite; if not, we can leave it here for now. I can always implement the truncation myself for my specific model and tokenizer, I just thought it may be a helpful feature to have, but as you said we’d need to see how much demand there is. Feel free to close the issue if appropriate.
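For readers looking for the manual workaround mentioned above, here is a minimal sketch that keeps the first and last halves of the token budget and drops everything in between; the truncate_middle helper is hypothetical and not part of transformers, and special tokens are ignored for simplicity:

```python
from transformers import AutoTokenizer

def truncate_middle(text, tokenizer, max_length=512):
    # Hypothetical helper: keep the first and last halves of the token
    # budget and drop everything in between.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) <= max_length:
        return ids
    head = max_length // 2
    tail = max_length - head
    return ids[:head] + ids[-tail:]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_document = "word " * 2000  # stand-in for a long document
input_ids = truncate_middle(long_document, tokenizer, max_length=512)
print(len(input_ids))  # 512
```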

Read more comments on GitHub >

