DistributedDataParallel on AWS with multi-GPU EC2 instances (p3.8xlarge / p3.16xlarge)
@sshaoshuai I am trying to use your codebase on AWS. It works on a p3.2xlarge instance with a single GPU, but on larger instances with multiple GPUs (e.g. p3.8xlarge or p3.16xlarge) train.py cannot get past the following line: https://github.com/open-mmlab/OpenPCDet/blob/f982b5bfdf0e8e15a2e2d7fead2925ff564051d7/tools/train.py#L142
It simply hangs at that line: no error is raised, but the call never returns. Do you have an idea what could cause this?
I tried to stick to those guides:
https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/
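A hang inside the DistributedDataParallel constructor is usually easiest to localize with NCCL's own logging. A minimal sketch using standard NCCL environment variables, nothing OpenPCDet-specific (ens3 is the interface name already used in the script below):

export NCCL_DEBUG=INFO          # each rank prints its NCCL version, transports and ring setup
export NCCL_DEBUG_SUBSYS=INIT   # optional: restrict the logging to initialization
export NCCL_SOCKET_IFNAME=ens3  # pin NCCL to the instance's primary network interface
# If initialization still stalls, these help rule out individual transports:
# export NCCL_P2P_DISABLE=1
# export NCCL_IB_DISABLE=1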
So here is how I invoke train.py:
#!/bin/bash
config_path=$1
output_dir=$2
tcp_port=$3

# from https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html
export NCCL_SOCKET_IFNAME=ens3

python -u train.py --launcher pytorch \
    --workers 8 \
    --tcp_port=${tcp_port} \
    --cfg_file=${config_path} \
    --extra_tag=$(basename ${output_dir})
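Note that, as written, the script starts a single Python process even on a multi-GPU instance, whereas distributed training with the pytorch launcher is normally driven by one process per GPU. A sketch of the same invocation spawned once per GPU via torch.distributed.launch, which passes --local_rank to each copy of train.py (the flag values are taken from the script above; adjust --nproc_per_node to the instance):

export NCCL_SOCKET_IFNAME=ens3
# --nproc_per_node = number of GPUs on the instance (4 on p3.8xlarge, 8 on p3.16xlarge)
python -m torch.distributed.launch --nproc_per_node=4 train.py \
    --launcher pytorch \
    --workers 8 \
    --tcp_port=${tcp_port} \
    --cfg_file=${config_path} \
    --extra_tag=$(basename ${output_dir})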
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1B.0 Off | 0 |
| N/A 49C P0 55W / 300W | 776MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000000:00:1C.0 Off | 0 |
| N/A 44C P0 39W / 300W | 11MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000000:00:1D.0 Off | 0 |
| N/A 41C P0 42W / 300W | 11MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 44C P0 41W / 300W | 11MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 2519 C python 765MiB |
+-----------------------------------------------------------------------------+

Hi Martin, I have hardcoded the PyTorch launcher:
./aws_driver.sh cfgs/kitti_models/pv_rcnn.yaml ./…/output/ "" 1 1 20 1
It worked only for batch size 1.
When I use only 1 GPU, it works fine. My instance has 4 GPUs, and I would like to use a higher batch size and utilize all of them. If I use more than one GPU, I encounter the timeout issue below.
Did you modify any part of the script or the code to run on all 4 GPUs?
(cherry) ubuntu@ip-172-31-30-100:/projectdata/OpenPCDet/tools$ ./aws_driver.sh cfgs/kitti_models/pv_rcnn.yaml 4 64 20 4 'none'
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=0
./aws_driver.sh: line 45: /train.out: Permission denied
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=1
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=2
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=3
basename: missing operand
Try 'basename --help' for more information.
python -u train.py --launcher='pytorch' --workers=4 --cfg_file=cfgs/kitti_models/pv_rcnn.yaml --batch_size=64 --epoch=20 --extra_tag= --local_rank=4
Each rank prints the same traceback (interleaved in the original output):

Traceback (most recent call last):
  File "train.py", line 198, in <module>
    main()
  File "train.py", line 64, in main
    total_gpus, cfg.LOCAL_RANK = getattr(common_utils, 'init_dist_%s' % args.launcher)(
  File "/projectdata/OpenPCDet/pcdet/utils/common_utils.py", line 147, in init_dist_pytorch
    dist.init_process_group(
  File "/projectdata/anaconda3/envs/cherry/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/projectdata/anaconda3/envs/cherry/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 126, in _tcp_rendezvous_handler
    store = TCPStore(result.hostname, result.port, world_size, start_daemon, timeout)
RuntimeError: connect() timed out.
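A connect() timeout from TCPStore means a worker process could not reach the rank-0 rendezvous store within the timeout, for example because rank 0 never actually started or because the number of launched processes does not match the expected world size. Two things stand out in the output above: five processes are launched (--local_rank=0 through 4) on a 4-GPU instance, and one launch fails with '/train.out: Permission denied'. A hand-rolled per-GPU launch has to start exactly as many processes as the intended world size and point them all at the same TCP port. A minimal sketch of such a loop, using the train.py flags as they appear in the log above and an arbitrary free port (aws_driver.sh itself is not shown, so this is an approximation, not its actual contents):

# One process per GPU, ranks 0..NUM_GPUS-1, all sharing one rendezvous port.
NUM_GPUS=4
TCP_PORT=18888   # any free port
for LOCAL_RANK in $(seq 0 $((NUM_GPUS - 1))); do
    python -u train.py --launcher pytorch \
        --workers 4 \
        --cfg_file cfgs/kitti_models/pv_rcnn.yaml \
        --batch_size 64 \
        --tcp_port ${TCP_PORT} \
        --local_rank ${LOCAL_RANK} &
done
wait

Launching through torch.distributed.launch, as sketched after the original script above, avoids managing the ranks and the rendezvous port by hand.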
I don’t know the GPU utilization, but these are snippets from my launch script. Maybe it helps.
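For the GPU utilization question, stock nvidia-smi can log it while a run is in progress; a small sketch (the query fields and options are standard nvidia-smi, the output file name is arbitrary):

# Log per-GPU utilization and memory once per second to a CSV file.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1 | tee gpu_util.csv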