xlm-roberta (large/base): run_language_modeling.py cannot start training
Hi HuggingFace, thank you very much for your great contribution.
❓ Questions & Help
My problem is that run_language_modeling.py takes an abnormally long time for xlm-roberta-large and xlm-roberta-base before it starts training. It got stuck at the following step for 7 hours (so I gave up eventually):
transformers.data.datasets.language_modeling - Creating features from dataset file at ./
I have successfully run gpt2-large and distilbert-base-multilingual-cased using exactly the same command below (only changing the model), and they start training within just 2-3 minutes. At first I thought it was because of the large size of XLM-RoBERTa. However, as gpt2-large has a similar size, is there perhaps a problem with finetuning XLM-RoBERTa? (So maybe a bug in the current version.)
I also tried to rerun the same command on another machine, but it got stuck in the same way (which is not the case for gpt2-large and distilbert-base-multilingual-cased).
Update: the same thing happens with xlm-roberta-base.
Command details I used
Machine: AWS p3.2xlarge (V100, 64 GB RAM). Training file size is around 60 MB.
!python transformers/examples/run_language_modeling.py \
    --model_type=xlm-roberta \
    --model_name_or_path=xlm-roberta-large \
    --do_train \
    --mlm \
    --per_gpu_train_batch_size=1 \
    --gradient_accumulation_steps=8 \
    --train_data_file={TRAIN_FILE} \
    --num_train_epochs=2 \
    --block_size=225 \
    --output_dir=output_lm \
    --save_total_limit=1 \
    --save_steps=10000 \
    --cache_dir=output_lm \
    --overwrite_cache \
    --overwrite_output_dir
Top GitHub Comments
Have you tried launching a debugger to see exactly what takes so long?
I would use VS Code remote debugging.
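For example, a minimal sketch of attaching VS Code remotely, assuming the debugpy package is installed (port 5678 is just a placeholder); you would drop these lines near the top of run_language_modeling.py on the remote machine:

```python
# Hedged sketch: pause the script at startup so a VS Code remote debugger can attach.
# Requires `pip install debugpy`; the host/port are arbitrary placeholders.
import debugpy

debugpy.listen(("0.0.0.0", 5678))  # open a debug server on all interfaces
debugpy.wait_for_client()          # block until VS Code attaches
debugpy.breakpoint()               # then break here and step into the slow dataset code
```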
I would guess that your tokenization process takes too long. If you’re training a new LM from scratch, I would recommend using the fast Tokenizers library written in Rust. You can initialize a new ByteLevelBPETokenizer instance in your LineByLineTextDataset class and encode_batch your text with it.
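For illustration, here is a minimal sketch of that suggestion, assuming the tokenizers library is installed; the class name FastLineByLineTextDataset, the vocabulary size, and the train-from-scratch setup are placeholders rather than the maintainers' actual code:

```python
# Sketch: a line-by-line dataset that tokenizes with the Rust-backed `tokenizers`
# library instead of the slow Python tokenizer. File path and block size are placeholders.
import torch
from torch.utils.data import Dataset
from tokenizers import ByteLevelBPETokenizer


class FastLineByLineTextDataset(Dataset):
    def __init__(self, file_path: str, block_size: int = 225):
        # Train (or load) a byte-level BPE tokenizer; for an existing model you would
        # load its vocab/merges files instead of training from scratch as done here.
        tokenizer = ByteLevelBPETokenizer()
        tokenizer.train(files=[file_path], vocab_size=30_000, min_frequency=2)
        tokenizer.enable_truncation(max_length=block_size)

        with open(file_path, encoding="utf-8") as f:
            lines = [line for line in f.read().splitlines() if line.strip()]

        # encode_batch tokenizes all lines in parallel in Rust, which avoids the
        # hours-long "Creating features from dataset file" step.
        self.examples = [enc.ids for enc in tokenizer.encode_batch(lines)]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return torch.tensor(self.examples[i], dtype=torch.long)
```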