
xlm-roberta (large/base): run_language_modeling.py cannot start training

See original GitHub issue

Hi HuggingFace, thank you very much for your great contribution.

❓ Questions & Help

My problem is: run_language_modeling.py takes an abnormally long time to start training with xlm-roberta-large and xlm-roberta-base. It got stuck at the following step for 7 hours (so I eventually gave up):

transformers.data.datasets.language_modeling - Creating features from dataset file at ./

I have successfully run gpt2-large and distilbert-base-multilingual-cased using exactly the same command below (only changing the model), and they start training within 2-3 minutes. At first I thought it was because of the large size of XLM-RoBERTa, but gpt2-large is of similar size, so is there perhaps a problem with fine-tuning XLM-RoBERTa? (Maybe a bug in the current version.)

I also tried rerunning the same command on another machine, but it got stuck at the same step (which is not the case for gpt2-large or distilbert-base-multilingual-cased).

Update: the same thing happens with xlm-roberta-base.

Command Details I used

Machine: AWS p3.2xlarge (V100, 64 GB RAM). Training file size is around 60 MB.

!python transformers/examples/run_language_modeling.py \
    --model_type=xlm-roberta \
    --model_name_or_path=xlm-roberta-large \
    --do_train \
    --mlm \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --train_data_file={TRAIN_FILE} \
    --num_train_epochs=2 \
    --block_size=225 \
    --output_dir=output_lm \
    --save_total_limit=1 \
    --save_steps=10000 \
    --cache_dir=output_lm \
    --overwrite_cache \
    --overwrite_output_dir

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
julien-c commented, Apr 23, 2020

Have you tried launching a debugger to see exactly what takes a long time?

I would use vscode remote debugging.
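
For reference, a minimal sketch of how one might wire that up with debugpy, the library behind VS Code's Python remote debugging; the port is arbitrary and this is not the exact setup julien-c had in mind:

# Add near the top of run_language_modeling.py, then attach from VS Code
# using a "Python: Remote Attach" launch configuration on the same port.
import debugpy

debugpy.listen(("0.0.0.0", 5678))   # open a debug server on port 5678
print("Waiting for debugger attach...")
debugpy.wait_for_client()           # block until VS Code attaches
debugpy.breakpoint()                # drop into the debugger, then step into
                                    # the dataset-creation code to see where
                                    # the time goes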

1 reaction
mfilipav commented, Apr 28, 2020

I would guess that your tokenization process takes too long. If you’re training a new LM from scratch, I would recommend using the fast Tokenizers library written in Rust. You can initialize a new ByteLevelBPETokenizer instance in your LineByLineTextDataset class and encode_batch your text with it.
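
A sketch of what that could look like, following the comment's assumption of training a new LM from scratch; the file path, vocabulary size, special tokens, and truncation length are illustrative choices, not values from the issue:

from tokenizers import ByteLevelBPETokenizer

# Train a fast (Rust-backed) byte-level BPE tokenizer on the training file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],
    vocab_size=30_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.enable_truncation(max_length=225)

# Inside a LineByLineTextDataset-style class, encode all lines in one
# batched call instead of line by line with the slow Python tokenizer.
with open("train.txt", encoding="utf-8") as f:
    lines = [line for line in f if line.strip()]

encodings = tokenizer.encode_batch(lines)
examples = [enc.ids for enc in encodings]  # token-id lists ready for batching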

Read more comments on GitHub >

Top Results From Across the Web

XLM-RoBERTa - Hugging Face
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks...

Hugging Face Transformers and Habana Gaudi AWS DL1 ...
Learn how to fine-tune XLM-RoBERTa for multi-lingual multi-class text classification using a Habana Gaudi-based DL1 instance.

Inference with XLM-RoBerta - Kaggle
This model was fine-tuned on the XNLI dataset, which contains 15 different languages; its base model xlm-roberta-large was trained on 100 different languages...

Training RoBERTa from scratch - the missing guide
Every file is a huge XML containing articles with mediawiki-specific ... cat train.txt | xargs -I@ python wikiextractor/WikiExtractor.py ...

Training Large Models With Your GPU | HP® Official Site
As I mentioned before, the workstation is equipped with two 24 GB VRAM RTX6000 GPUs, while in the experiments I only used one GPU...
