
Tokenizers: setting bos_token_id = 0 and adding language_pair_codes

See original GitHub issue

I am unable to set bos_token_id=0 for a new SentencePiece tokenizer (MBART). Here is what I'm doing:

# In a shell:
wget https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model

# Then in Python:
from transformers import T5Tokenizer
vocab_file = 'sentence.bpe.model'
t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0)
t2.bos_token_id  # => 1

The following also returns 1:

t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0,
                 additional_special_tokens=['<s>'])
t2.bos_token_id

Help much appreciated!
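For context on why the keyword argument has no effect: in transformers, bos_token_id is a read-only property derived from the vocabulary (roughly convert_tokens_to_ids(bos_token)), so a bos_token_id=0 constructor argument is silently ignored. A minimal, hypothetical stand-in illustrating that behavior (the ToyTokenizer class and its vocab are invented for illustration; this is not the real transformers API):

```python
# Hypothetical stand-in mimicking how transformers derives bos_token_id:
# it is looked up from the vocab, not taken from a constructor kwarg.
class ToyTokenizer:
    def __init__(self, vocab, bos_token='<s>', **kwargs):
        # Any bos_token_id passed in kwargs is simply never used.
        self.vocab = vocab
        self.bos_token = bos_token

    def convert_tokens_to_ids(self, token):
        return self.vocab[token]

    @property
    def bos_token_id(self):
        # Derived from the vocab mapping, like the real property.
        return self.convert_tokens_to_ids(self.bos_token)

# '<s>' sits at id 1 in this toy vocab, so bos_token_id is 1
# regardless of the bos_token_id=0 kwarg.
t = ToyTokenizer({'<pad>': 0, '<s>': 1, '</s>': 2},
                 bos_token='<s>', bos_token_id=0)
print(t.bos_token_id)  # => 1
```

To actually move a token to a different id, the id mapping itself has to change, which is what the offset scheme discussed below the fold does.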

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
thomwolf commented, Apr 1, 2020

Yes, you can check how we do this token-index offset handling (it's specific to fairseq + SentencePiece) in the Camembert and XLMRoberta tokenizers.

0 reactions
kellymarchisio commented, May 3, 2022

> Yes, you can check how we do this token-index offset handling (it's specific to fairseq + SentencePiece) in the Camembert and XLMRoberta tokenizers.

For posterity, I think Thomas means this:

https://huggingface.co/transformers/v4.6.0/_modules/transformers/models/camembert/tokenization_camembert.html
https://huggingface.co/transformers/v3.5.1/_modules/transformers/tokenization_xlm_roberta.html
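The offset scheme those tokenizers use looks roughly like this: fairseq pins '<s>', '<pad>', '</s>', '<unk>' to ids 0-3 and shifts every SentencePiece id up by a fixed offset to make room. A sketch modeled on XLMRobertaTokenizer (the sp_ids dict below is a hypothetical stand-in for the loaded SentencePiece model, not real vocab data):

```python
# Sketch of the fairseq + SentencePiece id-offset scheme used by the
# Camembert / XLMRoberta tokenizers in transformers.
# `sp_ids` stands in for the loaded SentencePiece model (invented vocab).
sp_ids = {'<unk>': 0, '<s>': 1, '</s>': 2, '▁Hello': 3, '▁world': 4}

# fairseq reserves the first ids for its own special tokens ...
fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
# ... and shifts every SentencePiece id to make room for '<pad>'.
fairseq_offset = 1

def convert_token_to_id(token):
    # Special tokens resolve through the fairseq table first;
    # every other token maps to its SentencePiece id plus the offset.
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    return sp_ids[token] + fairseq_offset

print(convert_token_to_id('<s>'))     # => 0  (bos pinned to 0)
print(convert_token_to_id('▁Hello'))  # => 4  (sp id 3 + offset 1)
```

So getting bos_token_id == 0 with a SentencePiece model means overriding the id conversion this way in a tokenizer subclass, not passing a constructor argument.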

Top Results From Across the Web

Tokenizer - Hugging Face
Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure ...
tokenizers - PyPI
Tokenizers. Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Bindings over the Rust implementation.
what 's the meaning of "Using bos_token, but it is not set yet."
This means adding the BOS (beginning of a sentence) token at the beginning and the EOS (end of a sentence) token at the...
Tokenizers | Apache Solr Reference Guide 6.6
You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer> : <fieldType name="text"...
Tokenizers - Nominatim 4.2.0
For information on how to configure a specific tokenizer for a database see the ... the indexer needs to add the token IDs...
