Fairseq stuck during Multi-gpu training without OOM warnings
I am facing much the same problem as https://github.com/pytorch/fairseq/issues/708 when using multi-GPU training.
Training hangs indefinitely (8+ hours). GPU utilization sits at 100% while memory is mostly free, and power consumption does not match the reported utilization.
Here are my thoughts:
- I doubt it is an OOM problem: I never got OOM messages, and I kept reducing my batch size until less than 50% of GPU memory was in use.
- I doubt it has anything to do with evaluation or data loading, since it hangs in the first epoch and the data is already loaded into memory.
- I ran the same job twice with the same fixed seed on the same Tesla V100s, and they hung at two different update steps.
- I am using the no_c10d DDP backend and fp16.
Here are the last lines of my log file for one of the halted jobs.
| epoch 001: 124 / 9333 loss=14.312, ppl=20345.88, wps=2785, ups=0, wpb=23490.943, bsz=311.082, num_updates=122, lr=0.00762599, gnorm=4.038, clip=1.000, oom=0.000, loss_scale=16.000, wall=1029, train_wall=400, MDSU:loss=14.3124, MDSU:ntokens=23490.9, MDSU:nsentences=311.082, MDSU:sample_size=23490.9
| epoch 001: 125 / 9333 loss=14.298, ppl=20140.58, wps=2797, ups=0, wpb=23480.715, bsz=311.350, num_updates=123, lr=0.00768849, gnorm=4.017, clip=1.000, oom=0.000, loss_scale=16.000, wall=1032, train_wall=403, MDSU:loss=14.2978, MDSU:ntokens=23480.7, MDSU:nsentences=311.35, MDSU:sample_size=23480.7
| epoch 001: 126 / 9333 loss=14.283, ppl=19934.59, wps=2810, ups=0, wpb=23475.718, bsz=311.548, num_updates=124, lr=0.00775099, gnorm=3.996, clip=1.000, oom=0.000, loss_scale=16.000, wall=1036, train_wall=406, MDSU:loss=14.283, MDSU:ntokens=23475.7, MDSU:nsentences=311.548, MDSU:sample_size=23475.7
| epoch 001: 127 / 9333 loss=14.266, ppl=19698.79, wps=2827, ups=0, wpb=23503.544, bsz=311.744, num_updates=125, lr=0.00781349, gnorm=3.975, clip=1.000, oom=0.000, loss_scale=16.000, wall=1039, train_wall=409, MDSU:loss=14.2658, MDSU:ntokens=23503.5, MDSU:nsentences=311.744, MDSU:sample_size=23503.5
| epoch 001: 128 / 9333 loss=14.252, ppl=19511.45, wps=2837, ups=0, wpb=23486.492, bsz=312.063, num_updates=126, lr=0.00787599, gnorm=3.955, clip=1.000, oom=0.000, loss_scale=16.000, wall=1043, train_wall=412, MDSU:loss=14.252, MDSU:ntokens=23486.5, MDSU:nsentences=312.063, MDSU:sample_size=23486.5
| epoch 001: 129 / 9333 loss=14.237, ppl=19307.72, wps=2850, ups=0, wpb=23484.008, bsz=312.315, num_updates=127, lr=0.00793849, gnorm=3.935, clip=1.000, oom=0.000, loss_scale=16.000, wall=1046, train_wall=415, MDSU:loss=14.2369, MDSU:ntokens=23484, MDSU:nsentences=312.315, MDSU:sample_size=23484
| epoch 001: 130 / 9333 loss=14.225, ppl=19151.72, wps=2858, ups=0, wpb=23444.039, bsz=312.188, num_updates=128, lr=0.00800099, gnorm=3.915, clip=1.000, oom=0.000, loss_scale=16.000, wall=1050, train_wall=418, MDSU:loss=14.2252, MDSU:ntokens=23444, MDSU:nsentences=312.188, MDSU:sample_size=23444
Here is the command I use to run training:
python train.py --task xxxxxxxxx \
--MDSUdata xxxxxxxxx \
--arch xxxxxxxxx \
--max-update 670400 \
--lr-period-updates 270000 \
--lr-scheduler cosine --lr-shrink 0.75 \
--max-lr 1 \
--log-interval 1 \
--warmup-updates 16000 --warmup-init-lr 1e-06 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--update-freq 5 --seed 666 --skip-invalid-size-inputs-valid-test \
--source-lang input --target-lang output --max-tokens 10000 --max-source-positions 1500 \
--max-target-positions 150 \
--dropout 0.1 \
--input-dict xxxxxxxx \
--output-dict xxxxxxxx \
--save-dir xxxxxxxxxx \
--ddp-backend=no_c10d --fp16
The issue is reproducible on both PyTorch 1.2 and 1.3, using Python 3.7.4 and the latest fairseq.
Top GitHub Comments
Yes, it is "OOM in the middle of a train_step with update_freq > 5".
It seems that this usually happens when two conditions are satisfied:
When OOM happens in all the workers, you usually get an error message like this:
When only some of them OOM, it usually freezes forever.
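To make the "freezes forever" case concrete, here is a minimal sketch (my illustration, not fairseq code) of the underlying deadlock: gradient synchronization is a collective operation, so a rank that drops out of the step after an OOM leaves every other rank blocked in the collective, with the NCCL kernel spinning the GPU at 100% while no progress is made, which matches the symptoms above.
# Minimal deadlock sketch; assumes torch.distributed is already initialized
# with the NCCL backend, one process per GPU. Not fairseq code.
import torch
import torch.distributed as dist

def step(rank):
    grad = torch.ones(1, device=f"cuda:{rank}")
    if rank == 0:
        return                # pretend rank 0 hit CUDA OOM and bailed out of the step
    dist.all_reduce(grad)     # every other rank waits here forever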
To reproduce this error on a single machine with multiple GPUs, try setting --max-tokens not too high, but right at the limit where your GPU memory hits 100%. This way OOM may happen on some workers and not on others.
Yes, I later found that an unhandled OOM error caused this, triggered mainly by an atypical spike in GPU memory utilization.
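One way to avoid the hang, sketched below under my own naming rather than as fairseq's actual handler, is to catch the OOM on each rank and then let all ranks agree, via an all_reduce on a small flag tensor, whether to skip the update, so no rank is ever left waiting in a collective alone.
# Hypothetical mitigation sketch (safe_train_step is my name, not a fairseq API).
# Assumes a plain, non-DDP-wrapped model and an initialized process group;
# gradients are synchronized by hand to keep the pattern visible.
import torch
import torch.distributed as dist

def safe_train_step(model, optimizer, batch, device):
    oom = torch.zeros(1, device=device)
    try:
        loss = model(batch.to(device)).sum()
        loss.backward()                       # the step that can raise "CUDA out of memory"
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        oom.fill_(1.0)
        optimizer.zero_grad()
        torch.cuda.empty_cache()

    # Every rank reaches this collective, OOM or not, so nobody hangs.
    dist.all_reduce(oom)
    if oom.item() > 0:
        optimizer.zero_grad()
        return None                           # all ranks skip the update together

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)
            p.grad /= world_size
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()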
This wasn't detectable in Grafana, since it averages GPU memory over a one-minute window, so on average GPU memory utilization did indeed stay below 100%.
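If you want to catch such spikes directly, one option (my suggestion, not something from the thread) is to log PyTorch's high-water-mark counter once per training step instead of relying on an averaged dashboard:
# Hypothetical per-step peak-memory logging. torch.cuda.max_memory_allocated
# tracks the allocation high-water mark, so short spikes that a minute-averaged
# dashboard smooths away still show up.
import torch

def log_peak_memory(step, device=None):
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    print(f"| step {step}: peak allocated {peak_gib:.2f} GiB")
    torch.cuda.reset_max_memory_allocated(device)  # start a fresh window for the next step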
What should concern you is that fairseq failed silently and just froze for hours without any error. Let me know if you are interested in reproducing it or investigating further.