Multi-node setup, host can't connect to its own provided IP address
Hi 🤗 I have 2 nodes, each of 8xA100, for a total of 16 GPUs. I’m using SLURM to launch the jobs:
SLURM scripts for the curious: https://rentry.co/9geu8n
Here, the main sbatch script takes the allotted 2 nodes and runs srun over them, i.e. each node is given the PY file to execute once.
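For reference, here is a minimal sketch of the shape of such a launch (the script path, port, and sbatch options are placeholders rather than my exact setup; the real scripts are at the rentry link above):

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:8

# Pick the first allotted node as the rendezvous host (placeholder port).
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=16543

# One srun task per node; each task launches 8 local processes,
# so 16 processes in total across the two machines.
srun bash -c 'accelerate launch \
    --num_processes 16 --num_machines 2 --multi_gpu --mixed_precision fp16 \
    --machine_rank $SLURM_NODEID \
    --main_process_ip $MASTER_ADDR --main_process_port $MASTER_PORT \
    scripts/train.py'    # placeholder path
```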
Env
- `Accelerate` version: 0.13.0.dev0
- Platform: Linux-5.10.126-117.518.amzn2.x86_64-x86_64-with-glibc2.10
- Python version: 3.8.13
- Numpy version: 1.22.4
- PyTorch version (GPU?): 1.13.0a0+08820cb (True)
- `Accelerate` default config: Not found
Now, I noticed a peculiar behavior. When I’m on a single node (no SLURM, no multi-node, only multi-GPU) and run this:
accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
    --mixed_precision fp16 --machine_rank 0 --main_process_ip 172.... --main_process_port 69420 \
    scripts/...
The script won’t run: the command simply exits, and I’m back at the command prompt again with no stdout or stderr.
But with
accelerate launch --num_processes 8 --num_machines 1 --multi_gpu \
    --mixed_precision fp16 \
    scripts/torch_...
it works fine. The script runs on the 8 GPUs, and I can monitor the WandB logs.
This is a little quirk which puzzled me and of which I can make neither head nor tail. I suspect it might mean something to someone here…
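One basic sanity check (a hypothetical sketch, not something from my original setup) is to confirm on the host itself that the address passed as `--main_process_ip` is actually bound to an interface and that the port is bindable (valid TCP ports only go up to 65535):

```bash
# Placeholders -- substitute the values actually passed to `accelerate launch`.
MAIN_IP=172.31.0.1
MAIN_PORT=29500

# Is the address assigned to an interface on this host?
hostname -I | grep -qwF "$MAIN_IP" && echo "address bound" || echo "address NOT bound"

# Can a process bind a listening socket to that (address, port) pair?
python -c "import socket; s = socket.socket(); s.bind(('$MAIN_IP', int('$MAIN_PORT'))); print('bind ok')"
```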
Multi-node training
For multi-node training, this is the PY script being executed: https://rentry.co/tz465
- This script works correctly for multi-GPU cases, but NOT for multi-node
- Most of it is standard snippets, but it may have some glaring flaw
Output:
This is the output of the main sbatch script, which tells SLURM to deploy the job:
Number of Nodes: 2
Name of all Hosts: gpu-st-p4d-24xlarge-60 gpu-st-p4d-24xlarge-61 # two nodes here, each 8xA100s
Master IP: 172.3.... # IP address of the main node
MASTER_PORT= 16543
ID: 0 # Each node reporting its RANK
ID: 1
NODE_COUNT=2 #number of nodes deployed
[18:14:34] WARNING The following values were not passed to launch.py:838
`accelerate launch` and had defaults used
instead:
`--num_cpu_threads_per_process` was
set to `48` to improve out-of-box performance
To avoid this warning pass in values for each
of the problematic parameters or run
`accelerate config`.
[18:14:35] WARNING The following values were not passed to launch.py:838
`accelerate launch` and had defaults used
instead:
`--num_cpu_threads_per_process` was
set to `48` to improve out-of-box performance
To avoid this warning pass in values for each
of the problematic parameters or run
`accelerate config`.
{Waiting about 15 mins}
[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
[E socket.cpp:858] [c10d] The client socket has timed out after 900s while trying to connect to (gpu-st-p4d-24xlarge-60, 16543).
Trying random ports yields no results.
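While the job sits in that wait, a quick reachability check from the second node (a sketch; the hostname and port are taken from the log above) can at least rule out basic network blockage:

```bash
# Run from the non-master node while the job is waiting.
nc -zv gpu-st-p4d-24xlarge-60 16543

# The same check without netcat:
python -c "import socket; socket.create_connection(('gpu-st-p4d-24xlarge-60', 16543), timeout=5); print('reachable')"
```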
I think it might be connected to the problem described above. Does anyone have any ideas?
Top GitHub Comments
@neel04 should have a solution here soon for you to try. Thanks for the clear bug report!
An interesting bag of results. Using the new `torch.distributed.launch` commands, the first one half-works: it complains about `local_rank`, but it waits at the `***** Setting` part unless I run the same command on the second machine, which implies there is at least some inter-node communication. I feel the error could be resolved after some effort, on which I will update later 😃
The second command seems to work quite well 👌 I wasn’t able to train more than a couple of steps (pre-emption), but the synchronized initial loss leads me to believe that at least the parameters synced initially, and since training worked, inter-node comms are working.
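For reference, a per-node `torchrun` launch of this kind (endpoint and script path are placeholders, not my exact command) looks roughly like:

```bash
# Placeholder sketch of a multi-node torchrun launch, run once per node.
torchrun --nnodes 2 --nproc_per_node 8 \
    --rdzv_backend c10d --rdzv_endpoint "$MASTER_ADDR:$MASTER_PORT" \
    scripts/train.py    # placeholder path
```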
So it appears there is some problem in `accelerate` with the multi-node setup’s networking. While `torchrun` works, I think I might need to add AMP to my setup for fp16. I’d still love to get to the core of this issue so that future users have no problem, and as such, I’m up for further debugging and testing on my side 🤗 Let me know if there is anything further you’d like me to try to help triage the bug! 🚀 I’ve put the error traceback for the first command below just in case, though I’m pretty sure I can get it to work.
Error @ command - 1 [Main Host]