DistributedDataParallel on AWS with multi-GPU EC2 instances (p3.8xlarge / p3.16xlarge)
@sshaoshuai I am trying to use your codebase on AWS. It works on a p3.2xlarge instance with a single GPU, but on larger instances with multiple GPUs (e.g. p3.8xlarge or p3.16xlarge) train.py cannot get past the following line: https://github.com/open-mmlab/OpenPCDet/blob/f982b5bfdf0e8e15a2e2d7fead2925ff564051d7/tools/train.py#L142
It simply hangs at that line: no error is raised, but the call never returns. Do you have an idea what could cause this?
I tried to stick to those guides:
https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/
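A hang inside the DistributedDataParallel constructor is usually easiest to localize with NCCL's own logging. A minimal sketch using standard NCCL environment variables, nothing OpenPCDet-specific (ens3 is the interface name already used in the script below):

export NCCL_DEBUG=INFO          # each rank prints its NCCL version, transports and ring setup
export NCCL_DEBUG_SUBSYS=INIT   # optional: restrict the logging to initialization
export NCCL_SOCKET_IFNAME=ens3  # pin NCCL to the instance's primary network interface
# If initialization still stalls, these help rule out individual transports:
# export NCCL_P2P_DISABLE=1
# export NCCL_IB_DISABLE=1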
So here is how I invoke train.py:
#!/bin/bash
config_path=$1
output_dir=$2
tcp_port=$3

# from https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
export NCCL_SOCKET_IFNAME=ens3

python -u train.py --launcher pytorch \
    --workers 8 \
    --tcp_port=${tcp_port} \
    --cfg_file=${config_path} \
    --extra_tag=$(basename ${output_dir})
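Note that, as written, the script starts a single Python process even on a multi-GPU instance, whereas distributed training with the pytorch launcher is normally driven by one process per GPU. A sketch of the same invocation spawned once per GPU via torch.distributed.launch, which passes --local_rank to each copy of train.py (the flag values are taken from the script above; adjust --nproc_per_node to the instance):

export NCCL_SOCKET_IFNAME=ens3
# --nproc_per_node = number of GPUs on the instance (4 on p3.8xlarge, 8 on p3.16xlarge)
python -m torch.distributed.launch --nproc_per_node=4 train.py \
    --launcher pytorch \
    --workers 8 \
    --tcp_port=${tcp_port} \
    --cfg_file=${config_path} \
    --extra_tag=$(basename ${output_dir})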
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 49C P0 55W / 300W | 776MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 44C P0 39W / 300W | 11MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 41C P0 42W / 300W | 11MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 44C P0 41W / 300W | 11MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2519 C python 765MiB |
+-----------------------------------------------------------------------------+

Hi Martin, I have hardcoded the PyTorch launcher:
./aws_driver.sh cfgs/kitti_models/pv_rcnn.yaml ./…/output/ "" 1 1 20 1
It worked only for batch size 1.
When I use only 1 GPU, it works fine. My instance has 4 GPUs, and I would like to use a higher batch size and utilize all of them. If I use more than one GPU, I encounter the timeout issue below.
Did you modify any part of the script or the code to run on all 4 GPUs?
(cherry) ubuntu@ip-172-31-30-100:/projectdata/OpenPCDet/tools$ ./aws_driver.sh cfgs/kitti_models/pv_rcnn.yaml 4 64 20 4 'none'
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=0
./aws_driver.sh: line 45: /train.out: Permission denied
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=1
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=2
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=3
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=4
Each rank prints the same traceback (interleaved in the original output):

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 64, in main
    total_gpus, cfg.LOCAL_RANK = getattr(common_utils, 'init_dist_%s' % args.launcher)(
  File "/projectdata/OpenPCDet/pcdet/utils/common_utils.py", line 147, in init_dist_pytorch
    dist.init_process_group(
  File "/projectdata/anaconda3/envs/cherry/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/projectdata/anaconda3/envs/cherry/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
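A connect() timeout from TCPStore means a worker process could not reach the rank-0 rendezvous store within the timeout, for example because rank 0 never actually started or because the number of launched processes does not match the expected world size. Two things stand out in the output above: five processes are launched (--local_rank=0 through 4) on a 4-GPU instance, and one launch fails with '/train.out: Permission denied'. A hand-rolled per-GPU launch has to start exactly as many processes as the intended world size and point them all at the same TCP port. A minimal sketch of such a loop, using the train.py flags as they appear in the log above and an arbitrary free port (aws_driver.sh itself is not shown, so this is an approximation, not its actual contents):

# One process per GPU, ranks 0..NUM_GPUS-1, all sharing one rendezvous port.
NUM_GPUS=4
TCP_PORT=18888   # any free port
for LOCAL_RANK in $(seq 0 $((NUM_GPUS - 1))); do
    python -u train.py --launcher pytorch \
        --workers 4 \
        --cfg_file cfgs/kitti_models/pv_rcnn.yaml \
        --batch_size 64 \
        --tcp_port ${TCP_PORT} \
        --local_rank ${LOCAL_RANK} &
done
wait

Launching through torch.distributed.launch, as sketched after the original script above, avoids managing the ranks and the rendezvous port by hand.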
I don’t know the GPU utilization, but these are snippets from my launch script. Maybe it helps.
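For the GPU utilization question, stock nvidia-smi can log it while a run is in progress; a small sketch (the query fields and options are standard nvidia-smi, the output file name is arbitrary):

# Log per-GPU utilization and memory once per second to a CSV file.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1 | tee gpu_util.csv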