Memory leak
Environment info
- transformers version: 3.1.0
- Platform: Linux-3.10.0-1127.el7.x86_64-x86_64-with-debian-buster-sid
- Python version: 3.7.0
- PyTorch version (GPU?): 1.5.1 (True)
- Tensorflow version (GPU?): 2.2.0 (False)
- Using GPU in script?: yes
- Using distributed or parallel set-up in script?: distributed
Who can help
@LysandreJik, @sgugger, @patrickvonplaten
Information
Model I am using (BERT, GPT-2):
The problem arises when using:
- [x] the official example scripts: (give details below)
- [ ] my own modified scripts: (give details below)
The task I am working on is:
- [ ] an official GLUE/SQuAD task: (give the name)
- [x] my own task or dataset: (give details below)
To reproduce
When I pretrain or fine-tune a model (in my case BERT and GPT-2) using torch.distributed.launch, CPU memory usage grows until it reaches the memory limit (>500 GB) and the first process is killed. With bert-base it takes around 30 epochs before the first process is killed; with GPT-2 large it takes only 3 epochs. Below is the command I run to train/fine-tune bert-base (the GPT-2 command is similar). The script run_language_modeling.py is a copy of transformers/examples/language-modeling/run_language_modeling.py (version 3.1.0).
python -m torch.distributed.launch --nproc_per_node=8 \
    …/run_language_modeling.py \
    --output_dir $model_target \
    --model_name_or_path $model_source \
    --config_name $model_source \
    --tokenizer_name $model_source \
    --train_data_file $target_train \
    --eval_data_file $target_test \
    --save_total_limit 5 \
    --block_size 128 \
    --overwrite_output_dir \
    --fp16 \
    --num_train_epochs 50 \
    --do_train --do_eval \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 4 \
    --mlm
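To confirm that resident memory really grows on every rank across epochs, a minimal sketch of a logging helper is shown below (assumptions: psutil is installed, and the function name log_cpu_memory is hypothetical; LOCAL_RANK is the environment variable set by torch.distributed.launch). Calls can be sprinkled around trainer.train() or the evaluation loop in the script.

```python
import os
import psutil  # third-party; `pip install psutil`

def log_cpu_memory(tag: str = "") -> None:
    """Print the resident set size (RSS) of the current process in MB."""
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)
    rank = os.environ.get("LOCAL_RANK", "?")  # set by torch.distributed.launch
    print(f"[rank {rank}] {tag} RSS: {rss_mb:.0f} MB")

# e.g. log_cpu_memory("after epoch") before/after training or evaluation;
# if the leak is real, RSS climbs on every rank with each epoch.
```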
Expected behavior
I would expect the distributed training to run to completion without any memory issues. Thanks for checking it.
The size of the dataset (Indonesian Wikipedia) is around 522 MB.
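For scale, a rough back-of-the-envelope estimate (all numbers below are assumptions: ~4 characters of raw text per token, and token IDs cached as plain Python ints in lists, which is roughly how the 3.1.0 TextDataset stores its examples) suggests the tokenized dataset alone should only need a few GB per rank, nowhere near the >500 GB limit:

```python
# Rough estimate only; the 4-chars-per-token ratio and per-int overhead are assumptions.
raw_bytes = 522 * 1024 ** 2           # ~522 MB of raw Indonesian Wikipedia text
approx_tokens = raw_bytes // 4        # heuristic: ~4 bytes of text per token
bytes_per_token = 28 + 8              # CPython int object + list slot pointer
per_rank_gb = approx_tokens * bytes_per_token / 1024 ** 3
print(f"~{per_rank_gb:.1f} GB per rank, ~{8 * per_rank_gb:.0f} GB across 8 ranks")
# -> roughly 4-5 GB per rank (~37 GB total), so growth to >500 GB points to a leak
```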