Special token index is verbose.

Context: Special tokens are frequently used for masking, padding, or interpreting the model. In an encoder/decoder setting, it is important that the encoder and decoder share the same indices for EOS, SOS, and PAD.

Problem: When creating two fields, one for French and one for English, there are no class constants for the index of eos_token. The index can only be looked up per instance of the class (e.g. self.stoi[eos_token]).

By default, the code does not guarantee that the French dictionary has the same EOS index as the English dictionary.

Possible Solution A: When the optional parameter ‘eos_token’ is set, would it be possible to also set ‘eos_token_index’?

Possible Solution B: A Vocab or Field constant for the index of each special token.
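To make the request concrete, here is a minimal plain-Python sketch of the idea behind Solution B. It does not use the library's API: build_stoi, SPECIALS, and the *_INDEX constants are hypothetical names invented for illustration. The assumption is that if every vocabulary reserves its first slots for the special tokens in one fixed order, their indices become constants shared across languages.

```python
from collections import Counter

# Hypothetical shared constants (not part of any library): the special tokens
# always occupy the first indices, in this fixed order.
PAD_TOKEN, SOS_TOKEN, EOS_TOKEN = "<pad>", "<sos>", "<eos>"
SPECIALS = (PAD_TOKEN, SOS_TOKEN, EOS_TOKEN)
PAD_INDEX, SOS_INDEX, EOS_INDEX = range(len(SPECIALS))


def build_stoi(sentences, specials=SPECIALS):
    """Map tokens to indices, reserving the first indices for the specials."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Specials come first, so their indices do not depend on the corpus.
    itos = list(specials)
    # Remaining tokens: most frequent first, ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos += [tok for tok, _ in ranked if tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}


english = build_stoi(["the cat sat", "the dog ran"])
french = build_stoi(["le chat dort", "le chien court vite"])

# Both vocabularies agree on the special-token indices.
assert english[EOS_TOKEN] == french[EOS_TOKEN] == EOS_INDEX
```

With this convention, an encoder and a decoder can refer to EOS_INDEX directly instead of looking the token up in each vocabulary instance.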

Issue Analytics

Created 6 years ago
Comments: 12 (12 by maintainers)

Top GitHub Comments

nelson-liu commented, Jul 18, 2017

i feel like it’d be a mistake to design this library to mimic opennmt’s data-handling utils / focus on the seq2seq application. I personally don’t really see the need to have it as a constant (it’s not too hard to reference self.field.vocab anyway). frankly i don’t think the verbosity is an issue, as long as it’s keeping it clear.

PetrochukM commented, Jul 18, 2017

Okay! Thanks for your input!
