Special token index is verbose.

Context: Special tokens are frequently used for masking, padding, or interpreting the model. In an encoder/decoder setup it’s important that the encoder and decoder share the same indexes for EOS, SOS, and PAD.

Problem: When creating two fields, one for French and one for English, there are no class constants for the index of eos_token. The only way to find the index of eos_token is per instance of the class (e.g. self.stoi[eos_token]).

The code by default is not designed to guarantee that the French dictionary has the same EOS index as the English dictionary.
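
To make the mismatch concrete, here is a minimal sketch in plain Python. It does not use the library's real classes: build_stoi is a hypothetical stand-in that assumes special tokens are simply placed first, in the order given, which may not match the library's actual behavior.

```python
# Hypothetical stand-in for a vocabulary builder; not the library's real API.
def build_stoi(tokens, specials):
    """Map each token to an index, placing special tokens first."""
    stoi = {}
    for token in list(specials) + list(tokens):
        if token not in stoi:
            stoi[token] = len(stoi)
    return stoi

# Assumption: the English side declares a start token, the French side does not.
en_stoi = build_stoi(["hello", "world"], specials=["<pad>", "<sos>", "<eos>"])
fr_stoi = build_stoi(["bonjour", "monde"], specials=["<pad>", "<eos>"])

# The EOS index can only be found by asking each instance separately.
en_eos = en_stoi["<eos>"]
fr_eos = fr_stoi["<eos>"]
eos_indexes_match = en_eos == fr_eos
```

Under these assumptions the two dictionaries disagree on the EOS index, so a decoder using one vocabulary cannot reuse the other's index without an explicit lookup.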

Possible Solution A: When setting the optional parameter ‘eos_token’, would it be possible to also set an ‘eos_token_index’?

Possible Solution B: A Vocab or Field constant for the index of each special token.
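
One way the idea behind Solution B might look is sketched below. Everything here is hypothetical: SharedSpecialsVocab and the PAD_INDEX, SOS_INDEX, and EOS_INDEX constants are illustrative names, not part of the library, and the sketch only shows how reserving fixed indexes up front would keep two vocabularies aligned.

```python
# Hypothetical module-level constants shared by every vocabulary.
PAD_INDEX, SOS_INDEX, EOS_INDEX = 0, 1, 2
SPECIALS = {"<pad>": PAD_INDEX, "<sos>": SOS_INDEX, "<eos>": EOS_INDEX}


class SharedSpecialsVocab:
    """Illustrative vocabulary that always reserves the same special indexes."""

    def __init__(self, tokens):
        # Start from the reserved specials so their indexes never vary.
        self.stoi = dict(SPECIALS)
        for token in tokens:
            if token not in self.stoi:
                self.stoi[token] = len(self.stoi)
        # Reverse mapping: position i holds the token with index i.
        self.itos = sorted(self.stoi, key=self.stoi.get)


english = SharedSpecialsVocab(["hello", "world"])
french = SharedSpecialsVocab(["bonjour", "le", "monde"])
```

With this layout, model code could refer to EOS_INDEX directly instead of looking it up on each instance, at the cost of every vocabulary always carrying all three special tokens.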

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

1 reaction
nelson-liu commented, Jul 18, 2017

i feel like it’d be a mistake to design this library to mimic opennmt’s data-handling utils / focus on the seq2seq application. I personally don’t really see the need to have it as a constant (it’s not too hard to reference self.field.vocab anyway). frankly i don’t think the verbosity is an issue, as long as it’s keeping it clear.

0 reactions
PetrochukM commented, Jul 18, 2017

Okay! Thanks for your input!
