
Consider adding "middle" option for tokenizer truncation_side argument

See original GitHub issue

Feature request

At the moment, thanks to PR https://github.com/huggingface/transformers/pull/14947, it is possible to truncate text from the left instead of only from the right. However, for some NLP tasks, such as summarization of long documents, it can also be advantageous to truncate the middle of the document. For example, if the maximum sequence length is 512 tokens and a document exceeds this length, we might want to keep the first 256 and the last 256 tokens and truncate everything in between. This issue requests that such an option be implemented.
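For context, the behaviour added by that PR is controlled through the tokenizer's truncation_side setting. A minimal sketch of how it is used today, assuming a recent transformers release that includes that PR (the checkpoint name is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Currently only "right" (the default) and "left" are accepted.
tokenizer.truncation_side = "left"

encoded = tokenizer(
    "a very long document ...",
    truncation=True,
    max_length=512,
)

# This issue asks for a third value, e.g. truncation_side="middle",
# which would keep the first 256 and the last 256 tokens instead.
```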

Motivation

This feature could be helpful because, particularly with long documents (for example, longformer summarization tasks), the start of a document may set out relevant information and the end may contain a useful recap of the main points discussed, depending on the document's domain. Both can therefore be very relevant and valuable to keep, whereas the text in the middle may be less important. Adding a truncation_side="middle" option, allowing retention of the first 256 and the last 256 tokens, could be very helpful for certain use cases.

Your contribution

I have limited bandwidth right now, but I might consider contributing if this can be done as a quick fix and someone from Hugging Face can provide oversight.

Issue Analytics

  • State: closed
  • Created: a year ago
  • Comments: 5 (4 by maintainers)

Top GitHub Comments

2 reactions
Narsil commented, Jul 15, 2022

100% agree with @SaulLu.

There might be a use case, but it doesn’t seem like a blatantly missing feature (and we try to focus on those). Future reader, make yourself heard so that we can revisit our opinion 😃

1 reaction
AndreaSottana commented, Jul 4, 2022

Ok, that’s fine, thanks a lot for getting back to me @SaulLu. Let’s see if there is more appetite; if not, we can leave it here for now. I can always implement the truncation myself for my specific model and tokenizer, I just thought it may be a helpful feature to have, but as you said we’d need to see how much demand there is. Feel free to close the issue if appropriate.
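For readers looking for the manual workaround mentioned above, here is a minimal sketch that keeps the first and last halves of the token budget and drops everything in between; the truncate_middle helper is hypothetical and not part of transformers, and special tokens are ignored for simplicity:

```python
from transformers import AutoTokenizer

def truncate_middle(text, tokenizer, max_length=512):
    # Hypothetical helper: keep the first and last halves of the token
    # budget and drop everything in between.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    if len(ids) <= max_length:
        return ids
    head = max_length // 2
    tail = max_length - head
    return ids[:head] + ids[-tail:]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_document = "word " * 2000  # stand-in for a long document
input_ids = truncate_middle(long_document, tokenizer, max_length=512)
print(len(input_ids))  # 512
```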

Read more comments on GitHub >

