
Tokenizers: setting bos_token_id = 0 and adding language_pair_codes

See original GitHub issue

I am unable to set bos_token_id=0 for a new SentencePiece tokenizer (MBART). Here is what I'm doing:

# In a shell:
wget https://s3.amazonaws.com/models.huggingface.co/bert/facebook/mbart-large-en-ro/sentence.bpe.model

# Then in Python:
from transformers import T5Tokenizer
vocab_file = 'sentence.bpe.model'
t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0)
t2.bos_token_id  # => 1

The following also returns 1:

t2 = T5Tokenizer(vocab_file, bos_token='<s>', bos_token_id=0,
                 additional_special_tokens=['<s>'])
t2.bos_token_id

Help much appreciated!
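For context on why the keyword argument has no effect: in transformers, bos_token_id is a read-only property derived from the vocabulary (roughly convert_tokens_to_ids(bos_token)), so a bos_token_id=0 constructor argument is silently ignored. A minimal, hypothetical stand-in illustrating that behavior (the ToyTokenizer class and its vocab are invented for illustration; this is not the real transformers API):

```python
# Hypothetical stand-in mimicking how transformers derives bos_token_id:
# it is looked up from the vocab, not taken from a constructor kwarg.
class ToyTokenizer:
    def __init__(self, vocab, bos_token='<s>', **kwargs):
        # Any bos_token_id passed in kwargs is simply never used.
        self.vocab = vocab
        self.bos_token = bos_token

    def convert_tokens_to_ids(self, token):
        return self.vocab[token]

    @property
    def bos_token_id(self):
        # Derived from the vocab mapping, like the real property.
        return self.convert_tokens_to_ids(self.bos_token)

# '<s>' sits at id 1 in this toy vocab, so bos_token_id is 1
# regardless of the bos_token_id=0 kwarg.
t = ToyTokenizer({'<pad>': 0, '<s>': 1, '</s>': 2},
                 bos_token='<s>', bos_token_id=0)
print(t.bos_token_id)  # => 1
```

To actually move a token to a different id, the id mapping itself has to change, which is what the offset scheme discussed below the fold does.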

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 7 (5 by maintainers)

Top GitHub Comments

1 reaction
thomwolf commented, Apr 1, 2020

Yes, you can check how we do this token-index offset handling (it's specific to fairseq + SentencePiece) in the Camembert and XLMRoberta tokenizers.

0 reactions
kellymarchisio commented, May 3, 2022

> Yes, you can check how we do this token-index offset handling (it's specific to fairseq + SentencePiece) in the Camembert and XLMRoberta tokenizers.

For posterity, I think Thomas means this:

https://huggingface.co/transformers/v4.6.0/_modules/transformers/models/camembert/tokenization_camembert.html
https://huggingface.co/transformers/v3.5.1/_modules/transformers/tokenization_xlm_roberta.html
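The offset scheme those tokenizers use looks roughly like this: fairseq pins '<s>', '<pad>', '</s>', '<unk>' to ids 0-3 and shifts every SentencePiece id up by a fixed offset to make room. A sketch modeled on XLMRobertaTokenizer (the sp_ids dict below is a hypothetical stand-in for the loaded SentencePiece model, not real vocab data):

```python
# Sketch of the fairseq + SentencePiece id-offset scheme used by the
# Camembert / XLMRoberta tokenizers in transformers.
# `sp_ids` stands in for the loaded SentencePiece model (invented vocab).
sp_ids = {'<unk>': 0, '<s>': 1, '</s>': 2, '▁Hello': 3, '▁world': 4}

# fairseq reserves the first ids for its own special tokens ...
fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
# ... and shifts every SentencePiece id to make room for '<pad>'.
fairseq_offset = 1

def convert_token_to_id(token):
    # Special tokens resolve through the fairseq table first;
    # every other token maps to its SentencePiece id plus the offset.
    if token in fairseq_tokens_to_ids:
        return fairseq_tokens_to_ids[token]
    return sp_ids[token] + fairseq_offset

print(convert_token_to_id('<s>'))     # => 0  (bos pinned to 0)
print(convert_token_to_id('▁Hello'))  # => 4  (sp id 3 + offset 1)
```

So getting bos_token_id == 0 with a SentencePiece model means overriding the id conversion this way in a tokenizer subclass, not passing a constructor argument.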

Top Results From Across the Web

Tokenizer - Hugging Face
Managing special tokens (like mask, beginning-of-sentence, etc.): adding them, assigning them to attributes in the tokenizer for easy access and making sure ...
tokenizers - PyPI
Tokenizers. Provides an implementation of today's most used tokenizers, with a focus on performance and versatility. Bindings over the Rust implementation.
what 's the meaning of "Using bos_token, but it is not set yet."
This means adding the BOS (beginning of a sentence) token at the beginning and the EOS (end of a sentence) token at the...
Tokenizers | Apache Solr Reference Guide 6.6
You configure the tokenizer for a text field type in schema.xml with a <tokenizer> element, as a child of <analyzer> : <fieldType name="text"...
Tokenizers - Nominatim 4.2.0
For information on how to configure a specific tokenizer for a database see the ... the indexer needs to add the token IDs...
