Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

How to prevent tokenizer from outputting certain information

See original GitHub issue

Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even if some tokens have been removed.

Issue Analytics

State:
Created 2 years ago
Reactions:4
Comments:10 (5 by maintainers)

Top GitHub Comments

11reactions

eduOScommented, Nov 15, 2021

Set the verbosity level as follows:

transformers.logging.set_verbosity_error()

3reactions

lextoumbouroucommented, Jun 25, 2022

This warning can be pretty noisy when your batch size is low, and the dataset is big. It would be nice only to see this warning once, as nreimers mentioned.

For anyone coming from Google who cannot suppress the error with eduOS’s solution. The nuclear option is to disable all warnings in Python like this:

import logging
logging.disable(logging.WARNING)

Top Results From Across the Web

How can I prevent spacy's tokenizer from splitting a specific ...

Adding an exception to prevent the string shell from being split by the tokenizer can be done with nlp.tokenizer.add_special_case as follows:

Building a tokenizer, block by block - Hugging Face Course

To build a tokenizer with the 🤗 Tokenizers library, we start by instantiating a Tokenizer object with a model , then set its...

How to use tokenization to improve data security and reduce ...

Tokenization is the process of replacing actual sensitive data elements with non-sensitive data elements that have no exploitable value for data ...

Tokenizer reference | Elasticsearch Guide [8.5] | Elastic

The keyword tokenizer is a “noop” tokenizer that accepts whatever text it is given and outputs the exact same text as a single...

Tokenization - Stanford NLP Group

Given a character sequence and a defined document unit, tokenization is the task ... However, if to is omitted from the index (as...

Troubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.

Start Free

Top Related Reddit Thread

No results found

Top Related Tweet

No results found

Top Related Dev.to Post

No results found

How to prevent tokenizer from outputting certain information

Issue Analytics

Top GitHub Comments

Top Results From Across the Web

Top Related Medium Post

Top Related StackOverflow Question

Troubleshoot Live Code

Top Related Reddit Thread

Top Related Hackernoon Post

Top Related Tweet

Top Related Dev.to Post

Top Related Hashnode Post

Predictions for pre-tokenized tokens with Roberta have strange offset_mapping

Mismatch between sentinel token IDs from T5 data collator and T5 tokenizer