Fairseq stuck during Multi-gpu training without OOM warnings
I am facing much the same problem as https://github.com/pytorch/fairseq/issues/708 when using multi-GPU training.
Training hangs indefinitely (8+ hours). GPU utilization sits at 100% while memory is mostly free, and power consumption does not match the reported utilization.
Here are my thoughts:
- I doubt it is an OOM problem: I never got OOM messages, and I kept reducing my batch size until less than 50% of GPU memory was in use.
- I doubt it has anything to do with evaluation or data loading, since it hangs in the first epoch and the data is already loaded into memory.
- I ran the same job twice with the same fixed seed on the same Tesla V100s, and they hung at two different update steps.
- I am using the no_c10d DDP backend and fp16.
Here are the last lines of my log file for one of the halted jobs.
| epoch 001: 124 / 9333 loss=14.312, ppl=20345.88, wps=2785, ups=0, wpb=23490.943, bsz=311.082, num_updates=122, lr=0.00762599, gnorm=4.038, clip=1.000, oom=0.000, loss_scale=16.000, wall=1029, train_wall=400, MDSU:loss=14.3124, MDSU:ntokens=23490.9, MDSU:nsentences=311.082, MDSU:sample_size=23490.9
| epoch 001: 125 / 9333 loss=14.298, ppl=20140.58, wps=2797, ups=0, wpb=23480.715, bsz=311.350, num_updates=123, lr=0.00768849, gnorm=4.017, clip=1.000, oom=0.000, loss_scale=16.000, wall=1032, train_wall=403, MDSU:loss=14.2978, MDSU:ntokens=23480.7, MDSU:nsentences=311.35, MDSU:sample_size=23480.7
| epoch 001: 126 / 9333 loss=14.283, ppl=19934.59, wps=2810, ups=0, wpb=23475.718, bsz=311.548, num_updates=124, lr=0.00775099, gnorm=3.996, clip=1.000, oom=0.000, loss_scale=16.000, wall=1036, train_wall=406, MDSU:loss=14.283, MDSU:ntokens=23475.7, MDSU:nsentences=311.548, MDSU:sample_size=23475.7
| epoch 001: 127 / 9333 loss=14.266, ppl=19698.79, wps=2827, ups=0, wpb=23503.544, bsz=311.744, num_updates=125, lr=0.00781349, gnorm=3.975, clip=1.000, oom=0.000, loss_scale=16.000, wall=1039, train_wall=409, MDSU:loss=14.2658, MDSU:ntokens=23503.5, MDSU:nsentences=311.744, MDSU:sample_size=23503.5
| epoch 001: 128 / 9333 loss=14.252, ppl=19511.45, wps=2837, ups=0, wpb=23486.492, bsz=312.063, num_updates=126, lr=0.00787599, gnorm=3.955, clip=1.000, oom=0.000, loss_scale=16.000, wall=1043, train_wall=412, MDSU:loss=14.252, MDSU:ntokens=23486.5, MDSU:nsentences=312.063, MDSU:sample_size=23486.5
| epoch 001: 129 / 9333 loss=14.237, ppl=19307.72, wps=2850, ups=0, wpb=23484.008, bsz=312.315, num_updates=127, lr=0.00793849, gnorm=3.935, clip=1.000, oom=0.000, loss_scale=16.000, wall=1046, train_wall=415, MDSU:loss=14.2369, MDSU:ntokens=23484, MDSU:nsentences=312.315, MDSU:sample_size=23484
| epoch 001: 130 / 9333 loss=14.225, ppl=19151.72, wps=2858, ups=0, wpb=23444.039, bsz=312.188, num_updates=128, lr=0.00800099, gnorm=3.915, clip=1.000, oom=0.000, loss_scale=16.000, wall=1050, train_wall=418, MDSU:loss=14.2252, MDSU:ntokens=23444, MDSU:nsentences=312.188, MDSU:sample_size=23444
Here is the command I use to run training:
python train.py --task xxxxxxxxx \
--MDSUdata xxxxxxxxx \
--arch xxxxxxxxx \
--max-update 670400 \
--lr-period-updates 270000 \
--lr-scheduler cosine --lr-shrink 0.75 \
--max-lr 1 \
--log-interval 1 \
--warmup-updates 16000 --warmup-init-lr 1e-06 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--update-freq 5 --seed 666 --skip-invalid-size-inputs-valid-test \
--source-lang input --target-lang output --max-tokens 10000 --max-source-positions 1500 \
--max-target-positions 150 \
--dropout 0.1 \
--input-dict xxxxxxxx \
--output-dict xxxxxxxx \
--save-dir xxxxxxxxxx \
--ddp-backend=no_c10d --fp16
The issue is reproducible on both PyTorch 1.2 and 1.3, using Python 3.7.4 and the latest fairseq.
Top GitHub Comments
Yes, it is "OOM in the middle of a train_step with update_freq > 5".
It seems that this usually happens when two conditions are satisfied:
When OOM happens in all the workers, you usually get an error message like this:
When only some of them OOM, it usually freezes forever.
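To make the "freezes forever" case concrete, here is a minimal sketch (my illustration, not fairseq code) of the underlying deadlock: gradient synchronization is a collective operation, so a rank that drops out of the step after an OOM leaves every other rank blocked in the collective, with the NCCL kernel spinning the GPU at 100% while no progress is made, which matches the symptoms above.
# Minimal deadlock sketch; assumes torch.distributed is already initialized
# with the NCCL backend, one process per GPU. Not fairseq code.
import torch
import torch.distributed as dist

def step(rank):
    grad = torch.ones(1, device=f"cuda:{rank}")
    if rank == 0:
        return                # pretend rank 0 hit CUDA OOM and bailed out of the step
    dist.all_reduce(grad)     # every other rank waits here forever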
To reproduce this error on a single machine with multiple GPUs, try setting --max-tokens not too high, but right at the limit where your GPU memory hits 100%. This way OOM may happen on some workers and not on others.
Yes, I later found that an unhandled OOM error caused this, triggered mainly by an atypical spike in GPU memory utilization.
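One way to avoid the hang, sketched below under my own naming rather than as fairseq's actual handler, is to catch the OOM on each rank and then let all ranks agree, via an all_reduce on a small flag tensor, whether to skip the update, so no rank is ever left waiting in a collective alone.
# Hypothetical mitigation sketch (safe_train_step is my name, not a fairseq API).
# Assumes a plain, non-DDP-wrapped model and an initialized process group;
# gradients are synchronized by hand to keep the pattern visible.
import torch
import torch.distributed as dist

def safe_train_step(model, optimizer, batch, device):
    oom = torch.zeros(1, device=device)
    try:
        loss = model(batch.to(device)).sum()
        loss.backward()                       # the step that can raise "CUDA out of memory"
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        oom.fill_(1.0)
        optimizer.zero_grad()
        torch.cuda.empty_cache()

    # Every rank reaches this collective, OOM or not, so nobody hangs.
    dist.all_reduce(oom)
    if oom.item() > 0:
        optimizer.zero_grad()
        return None                           # all ranks skip the update together

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad)
            p.grad /= world_size
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()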
This wasn't detectable in Grafana, since it averages GPU memory over a one-minute window, so on average GPU memory utilization did indeed stay below 100%.
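If you want to catch such spikes directly, one option (my suggestion, not something from the thread) is to log PyTorch's high-water-mark counter once per training step instead of relying on an averaged dashboard:
# Hypothetical per-step peak-memory logging. torch.cuda.max_memory_allocated
# tracks the allocation high-water mark, so short spikes that a minute-averaged
# dashboard smooths away still show up.
import torch

def log_peak_memory(step, device=None):
    peak_gib = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    print(f"| step {step}: peak allocated {peak_gib:.2f} GiB")
    torch.cuda.reset_max_memory_allocated(device)  # start a fresh window for the next step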
What should concern you is that fairseq failed silently and just froze for hours without any error. Let me know if you are interested in reproducing it or investigating further.