XLM tokenizer lang2id attribute is None
Environment info
- `transformers` version: 4.5.1
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.8.8
- PyTorch version (GPU?): 1.8.1+cpu (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: No
Who can help
Models:
- albert, bert, xlm: @LysandreJik
Library:
- tokenizers: @LysandreJik
Information
Model I am using (with causal language modelling): XLM
The problem arises when using:
- the official example scripts: (give details below)
To reproduce
Steps to reproduce the behaviour:
- Run the example code from https://huggingface.co/transformers/multilingual.html

```python
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
language_id = tokenizer.lang2id['en']
```
The attribute `lang2id` is None, so I get a `TypeError: 'NoneType' object is not subscriptable` error. Following the example, I am expecting `language_id` to be 0.
As a side note, the docs say these checkpoints require language embeddings, which I'm assuming are supplied via the `langs` argument. What is the default behaviour when this is not provided? I looked at https://huggingface.co/transformers/glossary.html but could not find any reference to it.
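For context on what the example is trying to do: a sketch of how the `langs` tensor would be built from the language id, assuming the mapping the multilingual docs describe for `xlm-clm-enfr-1024` ("en" → 0, "fr" → 1); the token ids below are dummy values, not real tokenizer output:

```python
import torch

# Assumed mapping for xlm-clm-enfr-1024 ("en" -> 0, "fr" -> 1),
# mirroring what tokenizer.lang2id should return once populated.
lang2id = {"en": 0, "fr": 1}
language_id = lang2id["en"]

# XLM's `langs` argument must have the same shape as `input_ids`,
# with every position filled with the id of that sequence's language.
input_ids = torch.tensor([[0, 5, 7, 1]])  # dummy token ids
langs = torch.full_like(input_ids, language_id)
# model(input_ids, langs=langs) would then select the English embedding.
```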
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 5 (1 by maintainers)
Hello! Sorry for taking so long to get back to this issue - the issue should normally be fixed now, for all versions. We updated the configurations of the XLM models on the hub.
Thanks for flagging!
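For anyone still hitting this before the updated hub configs reach their environment, one possible workaround (a sketch: `get_language_id` is a hypothetical helper, and it assumes the model config also carries a `lang2id` dict, as XLM configs do) is to fall back to `model.config.lang2id` when the tokenizer attribute is None:

```python
from types import SimpleNamespace


def get_language_id(tokenizer, model, lang):
    """Look up a language id, falling back to the model config when
    the tokenizer's lang2id attribute is None (the bug reported here)."""
    mapping = tokenizer.lang2id or model.config.lang2id
    if mapping is None:
        raise ValueError(f"No lang2id mapping available for {lang!r}")
    return mapping[lang]


# Stub objects standing in for the real tokenizer/model, so the
# fallback path can be shown without downloading any checkpoints.
tokenizer = SimpleNamespace(lang2id=None)
model = SimpleNamespace(config=SimpleNamespace(lang2id={"en": 0, "fr": 1}))
print(get_language_id(tokenizer, model, "en"))  # -> 0
```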
FYI, I tried downgrading and found that the most recent version that doesn't have this bug is `transformers==4.3.3`. So you could try downgrading to that version for now, until someone fixes it.