Can't train customized model on multiple GPUs
I have customized a model using fairseq and want to train it on multiple GPUs, but I get the following error:
```
Traceback (most recent call last):
  File "train.py", line 313, in <module>
    cli_main()
  File "train.py", line 305, in cli_main
    nprocs=args.distributed_world_size,
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 167, in spawn
    while not spawn_context.join():
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 114, in join
    raise Exception(msg)
Exception:

-- Process 3 terminated with the following error:
Traceback (most recent call last):
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/cluster/home/nnikolov/convtransformer/train.py", line 272, in distributed_main
    main(args, init_distributed=True)
  File "/cluster/home/nnikolov/convtransformer/train.py", line 85, in main
    train(args, trainer, task, epoch_itr)
  File "/cluster/home/nnikolov/convtransformer/train.py", line 128, in train
    log_output = trainer.train_step(samples)
  File "/cluster/home/nnikolov/convtransformer/fairseq/trainer.py", line 267, in train_step
    ignore_grad
  File "/cluster/home/nnikolov/convtransformer/fairseq/tasks/fairseq_task.py", line 237, in train_step
    optimizer.backward(loss)
  File "/cluster/home/nnikolov/convtransformer/fairseq/optim/fairseq_optimizer.py", line 74, in backward
    loss.backward()
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/cluster/apps/python/3.7.1/x86_64/lib64/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

Invoked with: <torch.distributed.ProcessGroupNCCL object at 0x2aba50398848>, [[tensor([ 5.1338e-04,  3.3774e-03, -2.8672e-04,  4.5749e-04, -2.5838e-04,
        -8.6781e-05, -1.4793e-03,  5.2592e-05,  3.1613e-04,  1.7187e-04,
        -4.1386e-06,  3.4677e-04, -2.4201e-03,  1.7680e-04,  7.3398e-04,
        -9.7241e-04,  5.3283e-05, -5.9374e-04, -6.8135e-05,  7.6470e-04,
        -1.7447e-04, -7.4860e-06,  2.7631e-03, -7.9307e-04,  3.0332e-04,
        -5.0689e-04, -2.0335e-04, -5.0705e-04, -1.8029e-04,  1.7714e-03,
```

Any idea on how to solve this issue? Thanks!
Top GitHub Comments
Thanks for the reply. Using `--ddp-backend=no_c10d` does help. Problem solved.

What is the dtype and device of the tensor in your traceback? It's currently cut off so I can't tell for sure, but is it possible you have a model parameter whose type is not supported by NCCL?
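For anyone whose traceback is similarly truncated, a quick way to answer that question is to print each parameter's dtype and device before launching distributed training. The sketch below is not from the original thread; `model` and the helper name `report_param_types` are placeholders for whatever `nn.Module` you hand to the fairseq trainer.

```python
import torch

def report_param_types(model: torch.nn.Module) -> None:
    """Print one example parameter per (dtype, device) combination.

    Hypothetical helper: anything that is not an ordinary floating-point
    CUDA parameter is worth a closer look before all-reduce.
    """
    seen = set()
    for name, param in model.named_parameters():
        key = (param.dtype, param.device.type)
        if key not in seen:
            seen.add(key)
            print(f"{name}: dtype={param.dtype}, device={param.device}")
```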
You might also try `--ddp-backend=no_c10d`. That will switch to a simpler distributed training implementation that is typically more robust, although it still uses NCCL, so it won't help if you have some unusual parameter type in your model.
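For context, here is roughly how the flag slots into a fairseq launch command. This is only a sketch: the data directory, architecture name, batch size, and world size are placeholders, not values from this issue.

```bash
# Hypothetical invocation; data-bin/my-dataset and my_custom_arch are placeholders.
python train.py data-bin/my-dataset \
    --arch my_custom_arch \
    --max-tokens 4000 \
    --distributed-world-size 4 \
    --ddp-backend=no_c10d
```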