
XLM tokenizer lang2id attribute is None

See original GitHub issue

Environment info

  • transformers version: 4.5.1
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.8.8
  • PyTorch version (GPU?): 1.8.1+cpu (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: <fill in>
  • Using distributed or parallel set-up in script?: No

Information

Model I am using: XLM with causal language modelling (CLM).

The problem arises when using:

  • the official example scripts (details below)

To reproduce

Steps to reproduce the behaviour:

  1. Run the example code from https://huggingface.co/transformers/multilingual.html:
import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

language_id = tokenizer.lang2id['en']  # expected to be 0, per the docs

The lang2id attribute is None, so this line raises TypeError: 'NoneType' object is not subscriptable. Following the example, I expect language_id to be 0.
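Until the fix lands, one stopgap is to set the mapping by hand after loading. This is only a sketch, and the {'en': 0, 'fr': 1} dict is an assumption based on the example's expected output for this English/French checkpoint:

from transformers import XLMTokenizer

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")

# Stopgap: lang2id is a plain attribute, so it can be patched manually.
# The mapping below is assumed, not read from the checkpoint.
if tokenizer.lang2id is None:
    tokenizer.lang2id = {"en": 0, "fr": 1}

language_id = tokenizer.lang2id["en"]  # 0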

As a side note, the documentation says these checkpoints require language embeddings, which I assume are supplied through the langs argument. What is the default behaviour when langs is not provided? I looked at https://huggingface.co/transformers/glossary.html but could not find any reference to it.
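For reference, the linked multilingual page passes a langs tensor alongside input_ids; a condensed sketch of that example (assuming lang2id is populated, with 'en' mapping to 0):

import torch
from transformers import XLMTokenizer, XLMWithLMHeadModel

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
model = XLMWithLMHeadModel.from_pretrained("xlm-clm-enfr-1024")

input_ids = torch.tensor([tokenizer.encode("Wikipedia was used to")])  # batch size 1

# One language ID per input token, shaped (batch_size, sequence_length).
language_id = tokenizer.lang2id["en"]  # 0
langs = torch.full_like(input_ids, language_id)

outputs = model(input_ids, langs=langs)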

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 1
  • Comments: 5 (1 by maintainers)

Top GitHub Comments

1 reaction
LysandreJik commented, Jul 8, 2021

Hello! Sorry for taking so long to get back to this - the issue should now be fixed for all versions. We updated the configurations of the XLM models on the hub.

Thanks for flagging!
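Since the fix lives in the model configurations on the hub rather than in a library release, a stale local cache could in principle still serve the old files; re-downloading rules that out (force_download is a standard from_pretrained argument):

from transformers import XLMTokenizer

# Bypass the local cache and fetch the updated files from the hub.
tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024", force_download=True)
print(tokenizer.lang2id)  # should now be a populated dict rather than None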

1 reaction
cbaziotis commented, Jun 14, 2021

FYI, I tried downgrading and I found that the most recent version that doesn’t have this bug is transformers==4.3.3. So you could try downgrading to that version for now, until someone fixes it.

pip install transformers==4.3.3
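To confirm the downgrade took effect before re-running the repro, a quick check along these lines should suffice:

import transformers
from transformers import XLMTokenizer

print(transformers.__version__)  # expect 4.3.3 after the downgrade

tokenizer = XLMTokenizer.from_pretrained("xlm-clm-enfr-1024")
print(tokenizer.lang2id)  # populated on 4.3.3, per the comment above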

Top Results From Across the Web

XLM - Hugging Face
Construct an XLM tokenizer. ... The lang2id attribute maps the languages supported by the model with their IDs.
transformers/tokenization_xlm.py at main · huggingface ...
Construct an XLM tokenizer. Based on Byte-Pair Encoding. The tokenization process is the following: - Moses preprocessing and tokenization for ...
python - AttributeError: module transformers has no attribute ...
Try without using from_tf=True flag like below: from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer ...
pytext.data.xlm_tensorizer
__init__(tokenizer, vocab, max_seq_len) self.language_vocab = ScriptVocabulary(language_vocab) self.default_language = torch.jit.Attribute(default_language ...
Training RoBERTa from scratch - the missing guide
Every file is a huge XML containing articles with ... I think that it affects the quality of both tokenization and training.
