XLM tokenizer lang2id attribute is None
Environment info
- `transformers` version: 4.5.1
- Platform: Windows-10-10.0.19041-SP0
- Python version: 3.8.8
- PyTorch version (GPU?): 1.8.1+cpu (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: No
Who can help
Models:
- albert, bert, xlm: @LysandreJik
Library:
- tokenizers: @LysandreJik
Information
Model I am using (with causal language modelling): XLM
The problem arises when using:
- the official example scripts: (give details below)
To reproduce
Steps to reproduce the behaviour:
- Run the example code from https://huggingface.co/transformers/multilingual.html

```python
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")
language_id = tokenizer.lang2id['en']
```
The attribute `lang2id` is None, so I get a `TypeError: 'NoneType' object is not subscriptable` error. Following the example, I am expecting `language_id` to be 0.
As a side note, the docs say these checkpoints require language embeddings, which I'm assuming are supplied via the `langs` argument. What is the default behaviour when this is not provided? I looked at https://huggingface.co/transformers/glossary.html but could not find any reference to it.
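For context on what the example is trying to do: a sketch of how the `langs` tensor would be built from the language id, assuming the mapping the multilingual docs describe for `xlm-clm-enfr-1024` ("en" → 0, "fr" → 1); the token ids below are dummy values, not real tokenizer output:

```python
import torch

# Assumed mapping for xlm-clm-enfr-1024 ("en" -> 0, "fr" -> 1),
# mirroring what tokenizer.lang2id should return once populated.
lang2id = {"en": 0, "fr": 1}
language_id = lang2id["en"]

# XLM's `langs` argument must have the same shape as `input_ids`,
# with every position filled with the id of that sequence's language.
input_ids = torch.tensor([[0, 5, 7, 1]])  # dummy token ids
langs = torch.full_like(input_ids, language_id)
# model(input_ids, langs=langs) would then select the English embedding.
```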
Issue Analytics
- Created: 2 years ago
- Reactions: 1
- Comments: 5 (1 by maintainers)
Hello! Sorry for taking so long to get back to this issue - the issue should normally be fixed now, for all versions. We updated the configurations of the XLM models on the hub.
Thanks for flagging!
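For anyone still hitting this before the updated hub configs reach their environment, one possible workaround (a sketch: `get_language_id` is a hypothetical helper, and it assumes the model config also carries a `lang2id` dict, as XLM configs do) is to fall back to `model.config.lang2id` when the tokenizer attribute is None:

```python
from types import SimpleNamespace


def get_language_id(tokenizer, model, lang):
    """Look up a language id, falling back to the model config when
    the tokenizer's lang2id attribute is None (the bug reported here)."""
    mapping = tokenizer.lang2id or model.config.lang2id
    if mapping is None:
        raise ValueError(f"No lang2id mapping available for {lang!r}")
    return mapping[lang]


# Stub objects standing in for the real tokenizer/model, so the
# fallback path can be shown without downloading any checkpoints.
tokenizer = SimpleNamespace(lang2id=None)
model = SimpleNamespace(config=SimpleNamespace(lang2id={"en": 0, "fr": 1}))
print(get_language_id(tokenizer, model, "en"))  # -> 0
```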
FYI, I tried downgrading and found that the most recent version that doesn't have this bug is `transformers==4.3.3`. So you could try downgrading to that version for now, until someone fixes it.