Distributed training hangs without any output or error hint.
What is your question?
During training, it sometimes just hangs without any visible output.
Code
fairseq-train \
--memory-efficient-fp16 --user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
--optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 \
--lr 0.00001 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 1000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--criterion $CRITERION --label-smoothing 0.1 \
--update-freq 32 --max-sentences 4 \
--num-workers 8 --load-from-pretrained-model $PRETRAINED_MODEL \
--ddp-backend=no_c10d --max-epoch 20 \
--max-source-positions 512 --max-target-positions 512 \
--skip-invalid-size-inputs-valid-test \
--seed 1 \
--save-dir $SAVE_DIR \
--keep-last-epochs 20 \
--tensorboard-logdir $TENSORBOARD_LOGDIR \
$DATA_DIR
What have you tried?
Sometimes the hang disappears when I reduce --max-sentences, so I suspect it is caused by a silent OOM problem.
But this time it keeps happening even after I reduce the batch size and turn on the --memory-efficient-fp16 option.
The progress bar simply freezes. nvidia-smi shows every GPU's volatile GPU utilization at 100%, but the power draw is only about 75 W (far less than during normal training).
Most of the time, after I Ctrl-C the hanging run, I still have to kill the fairseq-train processes manually; they still show up as running in top. Sometimes after Ctrl-C, the error below pops up (a diagnostic sketch for inspecting the hung processes follows the traceback).
Traceback (most recent call last):
File "/usr/local/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/examples/fairseq/fairseq_cli/train.py", line 317, in cli_main
distributed_main(args.device_id, args)
File "/examples/fairseq/fairseq_cli/train.py", line 296, in distributed_main
main(args, init_distributed=True)
File "/examples/fairseq/fairseq_cli/train.py", line 86, in main
train(args, trainer, task, epoch_itr)
File "/examples/fairseq/fairseq_cli/train.py", line 127, in train
log_output = trainer.train_step(samples)
File "/examples/fairseq/fairseq/trainer.py", line 330, in train_step
sample, self.model, self.criterion, self.optimizer, ignore_grad
File "/examples/fairseq/fairseq/tasks/fairseq_task.py", line 254, in train_step
optimizer.backward(loss)
File "/examples/fairseq/fairseq/optim/fp16_optimizer.py", line 103, in backward
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/examples/fairseq/fairseq/legacy_distributed_data_parallel.py", line 136, in reduction_fn
def reduction_fn():
KeyboardInterrupt
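One way to see where each rank is actually blocked, before killing it, is to dump the Python stack of every hanging fairseq-train process. This is not part of the original report; it is a minimal sketch that assumes py-spy is installed and uses <PID> as a placeholder for a real process id:
pip install py-spy            # assumption: py-spy is not already installed
pgrep -af fairseq-train       # list the worker PIDs that top still shows as running
py-spy dump --pid <PID>       # print the current Python stack of one worker; repeat per PID
If every rank is waiting inside the gradient all-reduce in legacy_distributed_data_parallel.py while one rank is missing, that points to one worker having died or skipped the collective (for example after an out-of-memory error), which would match the silent-OOM theory above.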
What’s your environment?
- fairseq Version: 0.9.0
- PyTorch Version: 1.3.0
- OS (e.g., Linux): Ubuntu
- How you installed fairseq (pip, source): built from source
- Build command you used (if compiling from source): git clone https://github.com/pytorch/fairseq && cd fairseq && git checkout tags/v0.9.0 && pip install --editable .
- CUDA/cuDNN version: CUDA 10.0.130, cuDNN 7603
- GPU models and configuration: NVIDIA V100 16G x 8
- Any other relevant information: Apex and OpenMPI are installed. Every time the hang happens, it stops at the same step, unless --max-sentences or the fp16 option has been changed.
Top GitHub Comments
Maybe some torch.distributed.launch option? Can’t remember clearly, sorry.
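The specific torch.distributed.launch option is not recorded in the thread. A commonly used debugging aid for hangs like this (a hedged aside, not part of the original reply) is to enable NCCL's own logging before launching, so each rank reports which collective it is stuck in; the lines below only set standard NCCL environment variables and otherwise reuse the command from the issue.
export NCCL_DEBUG=INFO         # print NCCL ring setup and collective activity for each rank
export NCCL_DEBUG_SUBSYS=ALL   # include all NCCL subsystems in the log
# then re-run the same fairseq-train command shown above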
Wasn’t aware there are error logs for individual GPUs. Any pointers on how I can access them?
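As far as I know, fairseq 0.9.0 does not write separate per-GPU log files by default, but per-rank output can be captured by launching one process per GPU and redirecting each rank's output to its own file. The sketch below is an assumption-heavy illustration: the --distributed-* flags exist in fairseq 0.9.0, while the launch loop, the port in the init method, and the log file names are made up for the example, and the remaining training flags from the original command are omitted for brevity.
# Hedged sketch: one fairseq-train process per GPU, each with its own log file,
# so per-GPU errors are not interleaved on a single console.
for RANK in $(seq 0 7); do
  CUDA_VISIBLE_DEVICES=$RANK fairseq-train \
      --user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
      --distributed-world-size 8 --distributed-rank $RANK \
      --distributed-init-method tcp://localhost:12345 \
      $DATA_DIR > rank${RANK}.log 2>&1 &    # one log per GPU/rank
done
wait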