
Distributed training hanging without any output or any error hint.


What is your question?

During training, it sometimes just hangs without any visible output.

Code

fairseq-train \
        --memory-efficient-fp16 --user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
        --optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 \
        --lr 0.00001 \
        --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 1000 \
        --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
        --criterion $CRITERION --label-smoothing 0.1 \
        --update-freq 32  --max-sentences 4 \
        --num-workers 8 --load-from-pretrained-model $PRETRAINED_MODEL \
        --ddp-backend=no_c10d --max-epoch 20 \
        --max-source-positions 512 --max-target-positions 512 \
        --skip-invalid-size-inputs-valid-test \
        --seed 1 \
        --save-dir $SAVE_DIR \
        --keep-last-epochs 20 \
        --tensorboard-logdir $TENSORBOARD_LOGDIR \
        $DATA_DIR
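
One hedged first step (an illustrative sketch, not part of the original report): NCCL is silent by default, so a collective that fails on one rank can look like a quiet hang on all the others. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables that make such failures visible; the training arguments stay exactly as above.

# Sketch: surface NCCL-level errors that would otherwise look like a silent hang.
export NCCL_DEBUG=INFO              # log NCCL initialization and collective errors
export NCCL_DEBUG_SUBSYS=INIT,COLL  # restrict the output to init and collectives
fairseq-train ...                   # same arguments as in the command above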

What have you tried?

Sometimes the hang disappears when I reduce --max-sentences, so I suspect it is caused by a silent OOM problem.

But this time it keeps appearing even after I reduce the batch size and turn on the --memory-efficient-fp16 option.

The progress bar simply freezes. nvidia-smi shows 100% Volatile GPU-Util on all GPUs, but the power draw is only about 75 W (far less than during normal training).

Most of the time, after I Ctrl-C the hanging run, I still have to kill the fairseq-train processes manually; they keep showing up as running in top. Sometimes, after Ctrl-C, this error pops up:

Traceback (most recent call last):
  File "/usr/local/bin/fairseq-train", line 11, in <module>
    load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
  File "/examples/fairseq/fairseq_cli/train.py", line 317, in cli_main
    distributed_main(args.device_id, args)
  File "/examples/fairseq/fairseq_cli/train.py", line 296, in distributed_main
    main(args, init_distributed=True)
  File "/examples/fairseq/fairseq_cli/train.py", line 86, in main
    train(args, trainer, task, epoch_itr)
  File "/examples/fairseq/fairseq_cli/train.py", line 127, in train
    log_output = trainer.train_step(samples)
  File "/examples/fairseq/fairseq/trainer.py", line 330, in train_step
    sample, self.model, self.criterion, self.optimizer, ignore_grad
  File "/examples/fairseq/fairseq/tasks/fairseq_task.py", line 254, in train_step
    optimizer.backward(loss)
  File "/examples/fairseq/fairseq/optim/fp16_optimizer.py", line 103, in backward
    loss.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 150, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/examples/fairseq/fairseq/legacy_distributed_data_parallel.py", line 136, in reduction_fn
    def reduction_fn():
KeyboardInterrupt
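
The interrupt lands inside reduction_fn in legacy_distributed_data_parallel.py, i.e. during the gradient all-reduce, which is consistent with one rank having stopped while the others spin inside a collective (that would also match 100% utilization at low power draw). One way to confirm this while the progress bar is frozen is to dump the Python stack of every worker; the sketch below assumes py-spy (a third-party tool, pip install py-spy) is available on the machine.

# Sketch: dump the Python stack of each fairseq-train worker during the hang.
for pid in $(pgrep -f fairseq-train); do
    echo "=== PID $pid ==="
    py-spy dump --pid "$pid"
done

If most ranks show an all_reduce frame and one rank does not, that odd rank is usually where the real problem (for example a silent OOM retry or a stuck data-loading worker) is hiding.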

What’s your environment?

  • fairseq Version: 0.9.0
  • PyTorch Version: 1.3.0
  • OS (e.g., Linux): Ubuntu
  • How you installed fairseq (pip, source): built from source
  • Build command you used (if compiling from source): git clone https://github.com/pytorch/fairseq && cd fairseq && git checkout tags/v0.9.0 && pip install --editable .
  • CUDA/cuDNN version: CUDA 10.0.130, cuDNN 7603
  • GPU models and configuration: NVIDIA V100 16 GB x 8
  • Any other relevant information: Apex and OpenMPI are installed. Every time the hang happens, it stops at the same step, unless --max-sentences or the fp16 option has been changed.
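
Since the hang always stops at the same step on 16 GB cards, logging GPU memory over time would show whether usage climbs toward the limit right before the freeze, which would support the silent-OOM suspicion. A minimal sketch using nvidia-smi's standard query mode (nothing fairseq-specific):

# Sketch: log per-GPU memory, utilization and power once per second in the
# background; inspect gpu_usage.log after the next hang.
nvidia-smi --query-gpu=timestamp,index,memory.used,utilization.gpu,power.draw \
           --format=csv -l 1 > gpu_usage.log &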

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 2
  • Comments: 7 (3 by maintainers)

Top GitHub Comments

1 reaction
YuiTH commented, Mar 10, 2022

Maybe some torch.distributed.launch option? Can’t remember clearly, sorry.

> Wasn’t aware there are error logs for individual GPUs. Any pointers to how I can access them?


0 reactions
felixkreuk commented, Mar 10, 2022

Wasn’t aware there are error logs for individual GPUs. Any pointers to how I can access them?
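
On the per-GPU logs question: fairseq 0.9 spawns all ranks from a single fairseq-train process, so their output is interleaved on one stdout. Newer PyTorch launchers can split it per rank; the sketch below uses torchrun (torch.distributed.run), which did not exist in the PyTorch 1.3.0 from this issue, and train_script.py is just a placeholder, so treat the flag names and the wiring into fairseq as assumptions to verify against your PyTorch version.

# Sketch: torchrun can write each local rank's stdout/stderr to its own file
# under rank_logs/ while still echoing it to the console (--tee=3 = both streams).
torchrun --nproc_per_node=8 --log_dir=rank_logs --tee=3 train_script.py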
