xlm-roberta (large/base): run_language_modeling.py cannot start training
Hi HuggingFace, thank you very much for your great contribution.
❓ Questions & Help
My problem is that run_language_modeling.py takes an abnormally long time for xlm-roberta-large and xlm-roberta-base before it starts training. It got stuck at the following step for 7 hours (so I gave up eventually):
transformers.data.datasets.language_modeling - Creating features from dataset file at ./
I have successfully run gpt2-large and distilbert-base-multilingual-cased using exactly the same command below (only changing the model), and they start training within just 2-3 minutes. At first I thought it was because of the large size of XLM-RoBERTa. However, as gpt2-large has a similar size, is there perhaps a problem with finetuning XLM-RoBERTa? (So maybe a bug in the current version.)
I also tried to rerun the same command on another machine, but it got stuck in the same way (which is not the case for gpt2-large and distilbert-base-multilingual-cased).
Update: the same thing happens with xlm-roberta-base.
Command details I used
Machine: AWS p3.2xlarge (V100, 64 GB RAM). Training file size is around 60 MB.
!python transformers/examples/run_language_modeling.py \
    --model_type=xlm-roberta \
    --model_name_or_path=xlm-roberta-large \
    --do_train \
    --mlm \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --train_data_file={TRAIN_FILE} \
    --num_train_epochs=2 \
    --block_size=225 \
    --output_dir=output_lm \
    --save_total_limit=1 \
    --save_steps=10000 \
    --cache_dir=output_lm \
    --overwrite_cache \
    --overwrite_output_dir
Top GitHub Comments
Have you tried launching a debugger to see exactly what takes so long?
I would use VS Code remote debugging.
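For example, a minimal sketch of attaching VS Code remotely, assuming the debugpy package is installed (port 5678 is just a placeholder); you would drop these lines near the top of run_language_modeling.py on the remote machine:

```python
# Hedged sketch: pause the script at startup so a VS Code remote debugger can attach.
# Requires `pip install debugpy`; the host/port are arbitrary placeholders.
import debugpy

debugpy.listen(("0.0.0.0", 5678))  # open a debug server on all interfaces
debugpy.wait_for_client()          # block until VS Code attaches
debugpy.breakpoint()               # then break here and step into the slow dataset code
```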
I would guess that your tokenization process takes too long. If you’re training a new LM from scratch, I would recommend using the fast Tokenizers library written in Rust. You can initialize a new ByteLevelBPETokenizer instance in your LineByLineTextDataset class and encode_batch your text with it.
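For illustration, here is a minimal sketch of that suggestion, assuming the tokenizers library is installed; the class name FastLineByLineTextDataset, the vocabulary size, and the train-from-scratch setup are placeholders rather than the maintainers' actual code:

```python
# Sketch: a line-by-line dataset that tokenizes with the Rust-backed `tokenizers`
# library instead of the slow Python tokenizer. File path and block size are placeholders.
import torch
from torch.utils.data import Dataset
from tokenizers import ByteLevelBPETokenizer


class FastLineByLineTextDataset(Dataset):
    def __init__(self, file_path: str, block_size: int = 225):
        # Train (or load) a byte-level BPE tokenizer; for an existing model you would
        # load its vocab/merges files instead of training from scratch as done here.
        tokenizer = ByteLevelBPETokenizer()
        tokenizer.train(files=[file_path], vocab_size=30_000, min_frequency=2)
        tokenizer.enable_truncation(max_length=block_size)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if line.strip()]

        # encode_batch tokenizes all lines in parallel in Rust, which avoids the
        # hours-long "Creating features from dataset file" step.
        self.examples = [enc.ids for enc in tokenizer.encode_batch(lines)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)
```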