XLMRobertaTokenizer vocab size
I think the XLMRobertaTokenizer vocab_size is off: it currently double counts '<unk>', '<s>', and '</s>'.
Maybe change it to:
def vocab_size(self):
    return len(self.sp_model) + self.fairseq_offset
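For context, a minimal sketch of the symptom, assuming the sentencepiece package and a downloaded xlm-roberta-base checkpoint; it relies only on the sp_model and fairseq_offset attributes referenced in the snippet above:

    from transformers import XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

    sp_size = len(tokenizer.sp_model)          # pieces in the underlying SentencePiece model
    print(tokenizer.vocab_size)                # what the tokenizer currently reports
    print(sp_size + tokenizer.fairseq_offset)  # what the proposed property would return

    # '<unk>', '<s>' and '</s>' already exist inside the SentencePiece vocabulary,
    # so counting them again on top of len(sp_model) inflates the total.
    for tok in ["<unk>", "<s>", "</s>"]:
        print(tok, tokenizer.sp_model.piece_to_id(tok))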
Issue Analytics
- Created 4 years ago
- Comments: 7 (2 by maintainers)
Top Results From Across the Web
XLM-RoBERTa-XL - Hugging Face
vocab_size ( int , optional, defaults to 250880) — Vocabulary size of the XLM_ROBERTA_XL model. Defines the number of different tokens that can...
Create a Tokenizer and Train a Huggingface RoBERTa Model ...
We choose a vocab size of 8,192 and a min frequency of 2 (you can tune this value depending on your max vocabulary...
Tutorial: How to train a RoBERTa Language Model for Spanish
This dataset has a size of 5.4 GB and we will train on a subset of ~300 MB. ... RoBERTa uses a Byte-Level...
Basics of BERT and XLM-RoBERTa - PyTorch - Kaggle
from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification MODEL_TYPE ... Check the vocab size tokenizer.vocab_size.
Understanding the vaccine stance of Italian tweets and ...
The model we selected for this task, XLM-RoBERTa-large, uses a tokenizer with a vocabulary size of 250,002. For comparison, another popular ...
This issue is known and will be fixed.
This should have been fixed with https://github.com/huggingface/transformers/pull/3198
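Once that PR is in the installed transformers version, a quick sanity check (a sketch, assuming no extra tokens have been added to the tokenizer) is to compare vocab_size against the full id-to-token mapping:

    from transformers import XLMRobertaTokenizer

    tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")

    vocab = tokenizer.get_vocab()              # token -> id for every id below vocab_size
    print(tokenizer.vocab_size, len(vocab))    # should agree (250,002 for the released checkpoints)
    assert len(vocab) == tokenizer.vocab_size  # no token string is counted twice

    # each special token resolves to exactly one id
    for tok in ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]:
        print(tok, vocab[tok])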