[Trainer] possible DDP memory regression
I think we may have introduced a memory regression somewhere recently.
I tried with PyTorch 1.7 and 1.8, with the same results. The memory limit on this setup is 8 GB per GPU.

On transformers master, this takes about 5.5 GB/GPU:
PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10
(No need to run more than a few seconds; we are just checking that the job can start training.)
Switching to DDP immediately OOMs:
PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10
Even if I reduce the batch size from 4 to 1, it still goes over 8 GB.
@sgugger, could you please confirm whether you're seeing the same?
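As a back-of-envelope check of why 8 GB is tight here, the sketch below estimates the fp32 training footprint (weights + gradients + Adam moments) per model replica. The ~300M parameter count for mt5-small is an assumption, and the estimate deliberately ignores activations, CUDA context, and temporary buffers, so it is a lower bound, not a measurement.

```python
def training_footprint_gb(n_params, optimizer="adam", dtype_bytes=4):
    """Rough fp32 training footprint per replica: weights + grads + optimizer states.

    Adam keeps two fp32 moment tensors per parameter; SGD with momentum keeps one.
    Activations, CUDA context, and workspace buffers are NOT included.
    """
    state_copies = {"adam": 2, "sgd_momentum": 1, "sgd": 0}[optimizer]
    total_bytes = n_params * dtype_bytes * (1 + 1 + state_copies)  # weights + grads + states
    return total_bytes / 2**30

# Assumption: mt5-small has roughly 300M parameters.
print(round(training_footprint_gb(300_000_000), 2))  # ~4.47 GB before any activations
```

With ~4.5 GB consumed by model state alone, activations for even batch size 1 can plausibly push a replica past an 8 GB card.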
Issue Analytics
- Created: 2 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
This is a bit of a problem for our memory metrics reporting, since we only report gpu0. But since most users have symmetrical setups (cards of the same size), and gpu0 consumes the most memory under DP/DDP, it's probably OK.
We will have to think about how to extend the metrics for setups where it's critical to know each GPU's allocation, e.g. pipeline or model parallelism.
With DP, the gradients and optimizer states live on only one GPU; I think that is why we see the big difference. With DDP, they are replicated on both GPUs.
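The asymmetry described above can be sketched with simple arithmetic: under DP only the source GPU accumulates gradients and runs the optimizer step, while under DDP every rank holds a full copy of weights, gradients, and optimizer states. This is a simplified model of steady-state memory, not a measurement; DDP also allocates gradient communication buckets, which are ignored here, and the mt5-small parameter count is an assumption.

```python
def per_gpu_bytes(n_params, rank, mode, dtype_bytes=4, adam_states=2):
    """Very rough per-GPU steady-state model memory, ignoring activations.

    DP:  every GPU holds a weight replica, but gradients are reduced to,
         and Adam states live on, rank 0 only.
    DDP: every rank holds its own weights + gradients + optimizer states.
    """
    weights = n_params * dtype_bytes
    grads = n_params * dtype_bytes
    states = n_params * dtype_bytes * adam_states
    if mode == "ddp":
        return weights + grads + states
    if mode == "dp":
        return weights + (grads + states if rank == 0 else 0)
    raise ValueError(f"unknown mode: {mode}")

n = 300_000_000  # assumed parameter count for mt5-small
gb = 2**30
print(per_gpu_bytes(n, rank=1, mode="dp") / gb)   # non-source GPU under DP: weights only
print(per_gpu_bytes(n, rank=1, mode="ddp") / gb)  # every DDP rank: the full set
```

Under this model, the non-source GPU in DP carries roughly a quarter of what every DDP rank carries, which is consistent with DDP hitting the 8 GB limit while DP stays around 5.5 GB.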