Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging third-party libraries. It collects links to all the places you might be looking while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

Fairseq stuck during Multi-gpu training without OOM warnings

See original GitHub issue

I am facing much the same problem as https://github.com/pytorch/fairseq/issues/708 when using multi-GPU training.

The training hangs indefinitely (8+ hours). GPU utilization goes up to 100% while memory is mostly free, and power consumption does not reflect the reported GPU utilization.

[screenshot: GPU utilization at 100% while memory is mostly free and power draw stays low]

Here are my thoughts:

  • I doubt it is an OOM problem: I never got OOM messages, and I kept reducing my batch size until less than 50% of GPU memory was used.

[screenshot: GPU memory usage below 50%]

  • I doubt it has anything to do with evaluation or data loading, since training halts in the first epoch and the data is already loaded into memory.

  • I ran the same job twice with the same fixed seed on the same Tesla V100, and the two runs halted at different update steps.

  • I am using the no_c10d DDP backend and fp16.

Here are the last lines of my log file for one of the halted jobs.

| epoch 001:    124 / 9333 loss=14.312, ppl=20345.88, wps=2785, ups=0, wpb=23490.943, bsz=311.082, num_updates=122, lr=0.00762599, gnorm=4.038, clip=1.000, oom=0.000, loss_scale=16.000, wall=1029, train_wall=400, MDSU:loss=14.3124, MDSU:ntokens=23490.9, MDSU:nsentences=311.082, MDSU:sample_size=23490.9
| epoch 001:    125 / 9333 loss=14.298, ppl=20140.58, wps=2797, ups=0, wpb=23480.715, bsz=311.350, num_updates=123, lr=0.00768849, gnorm=4.017, clip=1.000, oom=0.000, loss_scale=16.000, wall=1032, train_wall=403, MDSU:loss=14.2978, MDSU:ntokens=23480.7, MDSU:nsentences=311.35, MDSU:sample_size=23480.7
| epoch 001:    126 / 9333 loss=14.283, ppl=19934.59, wps=2810, ups=0, wpb=23475.718, bsz=311.548, num_updates=124, lr=0.00775099, gnorm=3.996, clip=1.000, oom=0.000, loss_scale=16.000, wall=1036, train_wall=406, MDSU:loss=14.283, MDSU:ntokens=23475.7, MDSU:nsentences=311.548, MDSU:sample_size=23475.7
| epoch 001:    127 / 9333 loss=14.266, ppl=19698.79, wps=2827, ups=0, wpb=23503.544, bsz=311.744, num_updates=125, lr=0.00781349, gnorm=3.975, clip=1.000, oom=0.000, loss_scale=16.000, wall=1039, train_wall=409, MDSU:loss=14.2658, MDSU:ntokens=23503.5, MDSU:nsentences=311.744, MDSU:sample_size=23503.5
| epoch 001:    128 / 9333 loss=14.252, ppl=19511.45, wps=2837, ups=0, wpb=23486.492, bsz=312.063, num_updates=126, lr=0.00787599, gnorm=3.955, clip=1.000, oom=0.000, loss_scale=16.000, wall=1043, train_wall=412, MDSU:loss=14.252, MDSU:ntokens=23486.5, MDSU:nsentences=312.063, MDSU:sample_size=23486.5
| epoch 001:    129 / 9333 loss=14.237, ppl=19307.72, wps=2850, ups=0, wpb=23484.008, bsz=312.315, num_updates=127, lr=0.00793849, gnorm=3.935, clip=1.000, oom=0.000, loss_scale=16.000, wall=1046, train_wall=415, MDSU:loss=14.2369, MDSU:ntokens=23484, MDSU:nsentences=312.315, MDSU:sample_size=23484
| epoch 001:    130 / 9333 loss=14.225, ppl=19151.72, wps=2858, ups=0, wpb=23444.039, bsz=312.188, num_updates=128, lr=0.00800099, gnorm=3.915, clip=1.000, oom=0.000, loss_scale=16.000, wall=1050, train_wall=418, MDSU:loss=14.2252, MDSU:ntokens=23444, MDSU:nsentences=312.188, MDSU:sample_size=23444

Here’s the command I use to run training:

python train.py --task xxxxxxxxx \
--MDSUdata xxxxxxxxx \
--arch xxxxxxxxx \
--max-update 670400 \
--lr-period-updates 270000 \
--lr-scheduler cosine --lr-shrink 0.75 \
--max-lr 1 \
--log-interval 1 \
--warmup-updates 16000 --warmup-init-lr 1e-06 --min-lr 1e-09 --optimizer nag --lr 0.0001 --clip-norm 0.1 \
--update-freq 5  --seed 666 --skip-invalid-size-inputs-valid-test \
--source-lang input --target-lang output --max-tokens 10000 --max-source-positions 1500 \
--max-target-positions 150 \
--dropout 0.1 \
--input-dict xxxxxxxx \
--output-dict xxxxxxxx \
--save-dir xxxxxxxxxx \
--ddp-backend=no_c10d --fp16

The issue is reproducible on both PyTorch 1.2 and 1.3, using Python 3.7.4 and the latest fairseq.
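
One way to see where the job is actually blocked once it hangs (a suggestion that is not part of the original report, using only Python's standard library) is to register a signal-triggered stack dump near the top of train.py:

# Illustrative diagnostic, not part of fairseq: dump every thread's Python
# stack to stderr when the process receives SIGUSR1. Once training freezes,
# `kill -USR1 <pid>` on each worker shows whether it is stuck inside a
# distributed collective, the data loader, or somewhere else.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

Sending SIGUSR1 to each GPU worker then prints a traceback per rank without killing the job, which helps distinguish a hung all_reduce from a data-loading stall.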

Issue Analytics

  • State: open
  • Created 4 years ago
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

7 reactions
hadyelsahar commented, Dec 18, 2019

Any idea where the OOM occurred?

Yes, it is "OOM in a middle train_step" with update_freq > 5.

It seems this usually happens when the following conditions are satisfied:

  • You are training on multiple GPUs
  • OOM happens in some of the workers but not all of them
  • possibly update_freq > 1 (not sure, though)

When OOM happens in all the workers, you usually get a warning like this:

| epoch 001:   0%|▏                                               
| WARNING: OOM in all workers, skipping update

When only some of them hit OOM, training usually freezes forever.

To reproduce this error on a single machine with multiple GPUs, try setting --max-tokens not too high but right at the limit where your GPU memory reaches 100%. That way OOM may happen on some workers but not on others.
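
This matches how collective communication behaves: if one rank raises an OOM and skips its gradient all_reduce while the healthy ranks enter it, those ranks block forever waiting for the missing peer, with no error printed. The snippet below is not fairseq's actual code; it is a minimal sketch (assuming torch.distributed is already initialized and gradient synchronization is handled by the surrounding code) of one way to keep ranks in agreement by all-reducing an OOM flag before deciding whether to apply the update.

import torch
import torch.distributed as dist


def safe_train_step(model, criterion, optimizer, batch, device):
    # Illustrative sketch, not fairseq code: run one update, and skip it on
    # *all* ranks if *any* rank hits OOM, so no rank is left waiting inside
    # a collective. batch["input"] / batch["target"] are placeholder keys.
    oom = torch.zeros(1, device=device)
    try:
        optimizer.zero_grad()
        loss = criterion(model(batch["input"]), batch["target"])
        loss.backward()
    except RuntimeError as e:
        if "out of memory" not in str(e):
            raise
        oom.fill_(1)
        torch.cuda.empty_cache()

    # Every rank reaches this collective, so nobody deadlocks: if any rank
    # ran out of memory, the whole group skips the optimizer step together.
    dist.all_reduce(oom, op=dist.ReduceOp.SUM)
    if oom.item() > 0:
        optimizer.zero_grad()
        return None

    optimizer.step()
    return loss.item()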

2 reactions
hadyelsahar commented, Dec 17, 2019

Yes, I later found that this was caused by an unhandled OOM error, triggered mainly by an unusual spike in GPU memory utilization.
It wasn't detectable in Grafana, which averages GPU memory over one-minute windows, so on average the GPU memory utilization was indeed below 100%.
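
A rough way to catch such spikes, rather than relying on a dashboard average, is to log PyTorch's peak-allocated counter once per update (an illustrative helper, not part of fairseq or the original comment):

import torch


def log_peak_memory(update_idx, device=None):
    # Report the peak GPU memory allocated since the previous call. Per-update
    # peaks expose transient spikes that a per-minute average (e.g. Grafana)
    # smooths away.
    peak_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    print("update {}: peak allocated {:.0f} MiB".format(update_idx, peak_mb))
    # Reset the counter so the next call reports only the next update's peak.
    torch.cuda.reset_max_memory_allocated(device)

Calling this right after each optimizer step shows whether a single update briefly approaches the card's capacity even though the averaged utilization stays well below 100%.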

What should concern you is that fairseq failed silently and just froze for hours without any error. Let me know if you are interested in reproducing it or investigating further.

Top Results From Across the Web

  • Fairseq stuck during Multi-gpu training without OOM warnings (this issue, on GitHub)
  • Model Zoo - Deep learning code and pretrained models for ...
  • The Best GPUs for Deep Learning in 2020 - Tim Dettmers
  • Bristol-Myers Squibb – Molecular Translation | Kaggle
  • Ray Tune FAQ — Ray 2.2.0 - the Ray documentation
