
DistributedDataParallel on AWS with multi GPU EC2 instances (p3.8xlarge / p3.16xlarge)

See original GitHub issue

@sshaoshuai I am trying to use your codebase on AWS. While it works on a p3.2xlarge instance with a single GPU, train.py cannot successfully execute the following line on bigger instances with multiple GPUs (e.g. p3.8xlarge or p3.16xlarge): https://github.com/open-mmlab/OpenPCDet/blob/f982b5bfdf0e8e15a2e2d7fead2925ff564051d7/tools/train.py#L142

What I mean is it gets stuck there: there is no error, but this line never returns. Do you have an idea what could cause this?
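
For context, a multi-GPU run in that script first initializes torch.distributed and then wraps the model in DistributedDataParallel, and a hang with no error during that setup usually points at the rendezvous or at NCCL rather than at a Python exception. A minimal sketch of the sequence involved (illustrative values only, not the exact OpenPCDet code):

import torch
import torch.distributed as dist
import torch.nn as nn

local_rank = 0                                    # set differently in each process
dist.init_process_group(backend='nccl',           # TCP rendezvous: every rank must connect
                        init_method='tcp://127.0.0.1:18888',
                        rank=local_rank,
                        world_size=torch.cuda.device_count())
torch.cuda.set_device(local_rank)
model = nn.Linear(10, 10).cuda(local_rank)
model = nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])  # blocks until all ranks arrive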

I tried to follow these guides:

  • https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
  • https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/

So here is how I invoke train.py:

#!/bin/bash

config_path=$1
output_dir=$2
tcp_port=$3

# from https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
export NCCL_SOCKET_IFNAME=ens3

python -u train.py --launcher pytorch \
                   --workers 8 \
                   --tcp_port=${tcp_port} \
                   --cfg_file=${config_path} \
                   --extra_tag=$(basename ${output_dir})
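
(Side note: since the hang only shows up with multiple GPUs, NCCL is the usual suspect even on a single node. NCCL_DEBUG=INFO is a real NCCL setting that prints its interface and transport choices and often reveals why a collective never completes; the snippet below is just an illustrative way to set it from Python, the variables can equally be exported in the shell as above.)

# illustrative: enable NCCL logging before torch.distributed / DDP is initialized
import os
os.environ['NCCL_DEBUG'] = 'INFO'          # have NCCL print its topology and transport choices
os.environ['NCCL_SOCKET_IFNAME'] = 'ens3'  # pin NCCL to the EC2 network interface, as in the script above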

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   49C    P0    55W / 300W |    776MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   44C    P0    39W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   41C    P0    42W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P0    41W / 300W |     11MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      2519      C   python                                       765MiB |
+-----------------------------------------------------------------------------+

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Comments: 12 (5 by maintainers)

Top GitHub Comments

1 reaction
suvasis commented, Nov 3, 2020

Hi Martin, I have hardcoded the PyTorch launcher.

./aws_driver.sh cfgs/kitti_models/pv_rcnn.yaml ./…/output/ "" 1 1 20 1

It worked only with batch size 1.

When I use only one GPU, it works fine.

My instance has 4 GPUs; I would like to use a higher batch size and utilize all of them.

If I use more than one GPU, I run into the timeout issue below.

Did you modify any part of the script or the code to run on all 4 GPUs?

(cherry) ubuntu@ip-172-31-30-100:/projectdata/OpenPCDet/tools$ ./aws_driver.sh cfgs/kitti_models/pv_rcnn.yaml 4 64 20 4 'none'
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=0

./aws_driver.sh: line 45: /train.out: Permission denied
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=1

basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=2

basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=3

basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=4

Each of the extra ranks then fails with the same traceback (the original output is interleaved because several processes crash at once):

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 64, in main
    total_gpus, cfg.LOCAL_RANK = getattr(common_utils, 'init_dist_%s' % args.launcher)(
  File "/projectdata/OpenPCDet/pcdet/utils/common_utils.py", line 147, in init_dist_pytorch
    dist.init_process_group(
  File "/projectdata/anaconda3/envs/cherry/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/projectdata/anaconda3/envs/cherry/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
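
The connect() timed out error comes from the TCP rendezvous: the non-zero ranks could not reach the store that rank 0 is supposed to open on the given port within the timeout. A quick way to check the rendezvous machinery independently of OpenPCDet is a minimal smoke test like this hypothetical script (gloo backend, so it only exercises the TCP store, not NCCL):

# ddp_rendezvous_check.py -- hypothetical minimal rendezvous smoke test
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size, port):
    dist.init_process_group(backend='gloo',
                            init_method='tcp://127.0.0.1:%d' % port,
                            rank=rank, world_size=world_size)
    t = torch.tensor([float(rank)])
    dist.all_reduce(t)                       # sums the ranks across all processes
    print('rank %d: all_reduce result = %.0f' % (rank, t.item()))
    dist.destroy_process_group()

if __name__ == '__main__':
    world_size = 4                           # one process per GPU on a p3.8xlarge
    mp.spawn(worker, args=(world_size, 29500), nprocs=world_size)

If this also hangs or times out, the port or interface is the problem; if it passes, the issue more likely lies in how the per-rank training processes are launched.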

0 reactions
MartinHahner commented, Feb 19, 2021

I don’t know the GPU utilization, but these are snippets from my launch script. Maybe they help.

#!/bin/bash

output_dir=$1
log_level=$2
non_root=$3
hash=$4
config_path=$5
timestamp=$6
n_tasks=$7
batch_size=$8
workers=$9

...

# create random TCP port for distributed training
tcp_port=$((RANDOM+32767))                        # RANDOM provides a number in 0 – 32767, max TCP port number is 65535

for i in $(seq 0 ${n_tasks}); do

    if [ "${checkpoint}" = "none" ]; then

        python_args="--local_rank=${i}
                     --launcher pytorch
                     --workers=${workers}
                     --tcp_port=${tcp_port}
                     --log_level=${log_level}
                     --cfg_file=${config_path}
                     --batch_size=${batch_size}
                     --extra_tag=$(basename ${output_dir})"

    else

        python_args="--local_rank=${i}
                     --launcher pytorch
                     --workers=${workers}
                     --tcp_port=${tcp_port}
                     --log_level=${log_level}
                     --cfg_file=${config_path}
                     --batch_size=${batch_size}
                     --extra_tag=$(basename ${output_dir})
                     --ckpt=${checkpoint}
                     --pretend_from_scratch=True"

    fi

    mycmd=(python -u train.py)
    mycmd+=(${python_args})

    echo "${mycmd[@]}"
    echo ""

    if [ "${i}" = "0" ]; then

        # log output of local_rank=0
        eval "${mycmd[@]}" &>> ${output_dir}/train.out &

    elif [ "${i}" = "${n_tasks}" ]; then

        # wait after issuing the last training command
        eval "${mycmd[@]}"

    else

        # continue after issuing every other training command
        eval "${mycmd[@]}" &

    fi

done

# prepare evaluation command
python_args="--folder ${output_dir}"

mycmd=(python -u eval.py)
mycmd+=(${python_args})

echo "${mycmd[@]}"
echo ""

# start evaluation
eval "${mycmd[@]}" &>> ${output_dir}/test.out

...
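
As a side note, this per-rank loop reproduces by hand what PyTorch's built-in launcher (python -m torch.distributed.launch --nproc_per_node=<num_gpus> train.py …) automates: it starts one process per GPU and hands each a distinct --local_rank, while the tcp_port must be identical across all of them. On the Python side the script only needs to pick that argument up and bind the process to its GPU, roughly like this illustrative fragment:

# illustrative: how a per-process --local_rank is typically consumed
import argparse
import torch

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)  # filled in per process by the launcher
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)  # bind this process to its own GPU before init_process_group()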