[Trainer] possible DDP memory regression
I think we may have introduced a memory regression somewhere recently.
I tried with PyTorch 1.7 and 1.8, with the same results. The memory limit on this setup is 8 GB per GPU.

On transformers master, this takes about 5.5 GB/GPU:
PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10
(No need to run more than a few seconds; we are just checking that the job can start training.)
Switching to DDP immediately OOMs:
PYTHONPATH=src USE_TF=0 CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 examples/seq2seq/run_translation.py --model_name_or_path google/mt5-small --do_train --do_eval --source_lang en --target_lang ro --dataset_name wmt16 --dataset_config_name ro-en --output_dir /tmp/test --per_device_train_batch_size=4 --per_device_eval_batch_size=4 --overwrite_output_dir --predict_with_generate --logging_step 10
Even if I reduce the batch size from 4 to 1, it still goes over 8 GB.
@sgugger, could you please confirm whether you're seeing the same?
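As a back-of-envelope check of why 8 GB is tight here, the sketch below estimates the fp32 training footprint (weights + gradients + Adam moments) per model replica. The ~300M parameter count for mt5-small is an assumption, and the estimate deliberately ignores activations, CUDA context, and temporary buffers, so it is a lower bound, not a measurement.

```python
def training_footprint_gb(n_params, optimizer="adam", dtype_bytes=4):
    """Rough fp32 training footprint per replica: weights + grads + optimizer states.

    Adam keeps two fp32 moment tensors per parameter; SGD with momentum keeps one.
    Activations, CUDA context, and workspace buffers are NOT included.
    """
    state_copies = {"adam": 2, "sgd_momentum": 1, "sgd": 0}[optimizer]
    total_bytes = n_params * dtype_bytes * (1 + 1 + state_copies)  # weights + grads + states
    return total_bytes / 2**30

# Assumption: mt5-small has roughly 300M parameters.
print(round(training_footprint_gb(300_000_000), 2))  # ~4.47 GB before any activations
```

With ~4.5 GB consumed by model state alone, activations for even batch size 1 can plausibly push a replica past an 8 GB card.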
Issue Analytics
- Created: 2 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
This is a bit of a problem for our memory metrics reporting, since we only report gpu0. But since most users have symmetrical setups (cards of the same size), and gpu0 consumes the most memory under DP/DDP, it's probably OK.
We will have to think about how to extend the metrics for setups where it's critical to know each GPU's allocation, e.g. pipeline or model parallelism.
With DP, the gradients and optimizer states live on only one GPU; I think that is why we see the big difference. With DDP, they are replicated on both GPUs.
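The asymmetry described above can be sketched with simple arithmetic: under DP only the source GPU accumulates gradients and runs the optimizer step, while under DDP every rank holds a full copy of weights, gradients, and optimizer states. This is a simplified model of steady-state memory, not a measurement; DDP also allocates gradient communication buckets, which are ignored here, and the mt5-small parameter count is an assumption.

```python
def per_gpu_bytes(n_params, rank, mode, dtype_bytes=4, adam_states=2):
    """Very rough per-GPU steady-state model memory, ignoring activations.

    DP:  every GPU holds a weight replica, but gradients are reduced to,
         and Adam states live on, rank 0 only.
    DDP: every rank holds its own weights + gradients + optimizer states.
    """
    weights = n_params * dtype_bytes
    grads = n_params * dtype_bytes
    states = n_params * dtype_bytes * adam_states
    if mode == "ddp":
        return weights + grads + states
    if mode == "dp":
        return weights + (grads + states if rank == 0 else 0)
    raise ValueError(f"unknown mode: {mode}")

n = 300_000_000  # assumed parameter count for mt5-small
gb = 2**30
print(per_gpu_bytes(n, rank=1, mode="dp") / gb)   # non-source GPU under DP: weights only
print(per_gpu_bytes(n, rank=1, mode="ddp") / gb)  # every DDP rank: the full set
```

Under this model, the non-source GPU in DP carries roughly a quarter of what every DDP rank carries, which is consistent with DDP hitting the 8 GB limit while DP stays around 5.5 GB.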