
[Trainer] possible DDP memory regression


I think we may have created a memory regression somewhere recently.

I tried with pt-1.7 and pt-1.8, with the same results.

The memory limit on this setup is 8GB.

On transformers master:

This takes about 5.5GB per GPU:

PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10

(No need to run for more than a few seconds; we are just trying to see that the job can start training.)

Switching to DDP immediately OOMs:

PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python  -m torch.distributed.launch --nproc_per_node=2  examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10

Even if I reduce the batch size from 4 to 1, it still goes over 8GB.

@sgugger, could you please confirm whether you’re seeing the same?
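
For anyone comparing the two runs, here is a hypothetical helper (not part of run_translation.py or the Trainer; just a sketch) that prints the peak CUDA memory seen by the current process:

# Hypothetical helper: call at the end of (or during) training in each process
# to compare peak per-GPU memory between the single-process and the DDP run.
import torch

def report_peak_memory(tag: str) -> None:
    device = torch.cuda.current_device()
    peak_alloc = torch.cuda.max_memory_allocated(device) / 2**30
    peak_reserved = torch.cuda.max_memory_reserved(device) / 2**30
    print(
        f"[{tag}] cuda:{device} "
        f"peak allocated={peak_alloc:.2f}GB, peak reserved={peak_reserved:.2f}GB"
    )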

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

1 reaction
stas00 commented, Mar 29, 2021

This is a bit of a problem with our memory metrics reporting, as we only report gpu0. But since most users will have symmetrical setups (cards of the same size) and gpu0 consumes the largest amount of memory in DP/DDP, I guess it’s OK.

We will have to think about how to extend the metrics for setups where it’s critical to know each GPU’s allocations, e.g. pipeline or model parallelism.
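
A minimal sketch of one possible approach (an assumption, not the Trainer’s actual metrics code; it requires an initialized process group and torch >= 1.8 for all_gather_object):

# Sketch only: gather each rank's peak CUDA memory onto rank 0 so the metrics
# can report every GPU instead of just gpu0.
# Assumes dist.init_process_group() has already been called.
import torch
import torch.distributed as dist

def gather_gpu_memory_stats() -> None:
    local = {
        "rank": dist.get_rank(),
        "peak_alloc_gb": torch.cuda.max_memory_allocated() / 2**30,
    }
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local)  # available in torch >= 1.8
    if dist.get_rank() == 0:
        for stats in gathered:
            print(f"rank {stats['rank']}: peak allocated {stats['peak_alloc_gb']:.2f}GB")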

1 reaction
sgugger commented, Mar 29, 2021

With DP, the gradients and optimizer states live on only one GPU; I think that is why we see the big difference. With DDP, they are replicated on both.
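
A rough back-of-the-envelope accounting of that difference (the parameter count and dtypes below are assumptions, not measurements):

# Assumed numbers: mt5-small at roughly 300M parameters, trained in fp32 with Adam.
params = 300e6
weights_gb = params * 4 / 2**30   # fp32 weights
grads_gb = params * 4 / 2**30     # fp32 gradients
adam_gb = params * 8 / 2**30      # two fp32 moments per parameter

print(f"weights:             ~{weights_gb:.1f} GB per GPU")
print(f"grads + Adam states: ~{grads_gb + adam_gb:.1f} GB, replicated on every rank under DDP")

Under those assumptions, each DDP rank would carry several extra GB of gradient and optimizer state on top of weights and activations, which would be consistent with going over the 8GB limit.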
