Distributed training hangs without any output or error hint.
What is your question?
During training, it sometimes just hangs without any visible output.
Code
fairseq-train \
--memory-efficient-fp16 --user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
--optimizer adam --adam-betas '(0.9, 0.999)' --clip-norm 0.1 \
--lr 0.00001 \
--lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 1000 \
--dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
--criterion $CRITERION --label-smoothing 0.1 \
--update-freq 32 --max-sentences 4 \
--num-workers 8 --load-from-pretrained-model $PRETRAINED_MODEL \
--ddp-backend=no_c10d --max-epoch 20 \
--max-source-positions 512 --max-target-positions 512 \
--skip-invalid-size-inputs-valid-test \
--seed 1 \
--save-dir $SAVE_DIR \
--keep-last-epochs 20 \
--tensorboard-logdir $TENSORBOARD_LOGDIR \
$DATA_DIR
What have you tried?
Sometimes the hang disappears when I reduce --max-sentences, so I suspect it is caused by a silent OOM problem.
But this time it keeps happening even after I reduce the batch size and turn on the --memory-efficient-fp16 option.
The progress bar simply freezes. nvidia-smi shows every GPU's volatile GPU utilization at 100%, but the power draw is only about 75 W (far less than during normal training).
Most of the time, after I Ctrl-C the hanging run, I still have to kill the fairseq-train processes manually; they still show up as running in top. Sometimes after Ctrl-C, the error below pops up (a diagnostic sketch for inspecting the hung processes follows the traceback).
Traceback (most recent call last):
File "/usr/local/bin/fairseq-train", line 11, in <module>
load_entry_point('fairseq', 'console_scripts', 'fairseq-train')()
File "/examples/fairseq/fairseq_cli/train.py", line 317, in cli_main
distributed_main(args.device_id, args)
File "/examples/fairseq/fairseq_cli/train.py", line 296, in distributed_main
main(args, init_distributed=True)
File "/examples/fairseq/fairseq_cli/train.py", line 86, in main
train(args, trainer, task, epoch_itr)
File "/examples/fairseq/fairseq_cli/train.py", line 127, in train
log_output = trainer.train_step(samples)
File "/examples/fairseq/fairseq/trainer.py", line 330, in train_step
sample, self.model, self.criterion, self.optimizer, ignore_grad
File "/examples/fairseq/fairseq/tasks/fairseq_task.py", line 254, in train_step
optimizer.backward(loss)
File "/examples/fairseq/fairseq/optim/fp16_optimizer.py", line 103, in backward
loss.backward()
File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 150, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 99, in backward
allow_unreachable=True) # allow_unreachable flag
File "/examples/fairseq/fairseq/legacy_distributed_data_parallel.py", line 136, in reduction_fn
def reduction_fn():
KeyboardInterrupt
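One way to see where each rank is actually blocked, before killing it, is to dump the Python stack of every hanging fairseq-train process. This is not part of the original report; it is a minimal sketch that assumes py-spy is installed and uses <PID> as a placeholder for a real process id:
pip install py-spy            # assumption: py-spy is not already installed
pgrep -af fairseq-train       # list the worker PIDs that top still shows as running
py-spy dump --pid <PID>       # print the current Python stack of one worker; repeat per PID
If every rank is waiting inside the gradient all-reduce in legacy_distributed_data_parallel.py while one rank is missing, that points to one worker having died or skipped the collective (for example after an out-of-memory error), which would match the silent-OOM theory above.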
What’s your environment?
- fairseq Version: 0.9.0
- PyTorch Version: 1.3.0
- OS (e.g., Linux): Ubuntu
- How you installed fairseq (pip, source): built from source
- Build command you used (if compiling from source): git clone https://github.com/pytorch/fairseq && cd fairseq && git checkout tags/v0.9.0 && pip install --editable .
- CUDA/cuDNN version: CUDA 10.0.130, cuDNN 7603
- GPU models and configuration: NVIDIA V100 16G x 8
- Any other relevant information: Apex and OpenMPI are installed. Every time the hang happens, it stops at the same step, unless --max-sentences or the fp16 option has been changed.
Top GitHub Comments
Maybe some torch.distributed.launch option? Can’t remember clearly, sorry.
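The specific torch.distributed.launch option is not recorded in the thread. A commonly used debugging aid for hangs like this (a hedged aside, not part of the original reply) is to enable NCCL's own logging before launching, so each rank reports which collective it is stuck in; the lines below only set standard NCCL environment variables and otherwise reuse the command from the issue.
export NCCL_DEBUG=INFO         # print NCCL ring setup and collective activity for each rank
export NCCL_DEBUG_SUBSYS=ALL   # include all NCCL subsystems in the log
# then re-run the same fairseq-train command shown above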
Wasn’t aware there are error logs for individual GPUs. Any pointers on how I can access them?
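As far as I know, fairseq 0.9.0 does not write separate per-GPU log files by default, but per-rank output can be captured by launching one process per GPU and redirecting each rank's output to its own file. The sketch below is an assumption-heavy illustration: the --distributed-* flags exist in fairseq 0.9.0, while the launch loop, the port in the init method, and the log file names are made up for the example, and the remaining training flags from the original command are omitted for brevity.
# Hedged sketch: one fairseq-train process per GPU, each with its own log file,
# so per-GPU errors are not interleaved on a single console.
for RANK in $(seq 0 7); do
  CUDA_VISIBLE_DEVICES=$RANK fairseq-train \
      --user-dir $USER_DIR --task translation_prophetnet --arch $ARCH \
      --distributed-world-size 8 --distributed-rank $RANK \
      --distributed-init-method tcp://localhost:12345 \
      $DATA_DIR > rank${RANK}.log 2>&1 &    # one log per GPU/rank
done
wait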