Address Already in Use Error When Training on 2 GPUs and Starting a New Job on the Remaining 2 GPUs
Describe the bug
Following the steps in the [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html):
GlowTTS:
python3 -m trainer.distribute --script train.py --gpus "0,1"
Vocoder:
python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"
The second command fails with `Address already in use`.
To Reproduce
- Download the dataset per this [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html)
- Run GlowTTS: `python3 -m trainer.distribute --script train.py --gpus "0,1"`
- Run Vocoder: `python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"`
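Note that both runs fall back to the same rendezvous address (`tcp://localhost:54321`, as the traceback below shows), so giving the second run its own port via the flag used later in the comments should sidestep the port clash, e.g.:
python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3" --coqpit.distributed_url "tcp://localhost:54322"
This only avoids the port collision; the rank/GPU-id problem discussed in the comments is a separate issue.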
Expected behavior
No response
Logs
Traceback (most recent call last):
File "train_vocoder.py", line 44, in <module>
TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
File "/apps/tts/Trainer/trainer/trainer.py", line 460, in __init__
self.config.distributed_url,
File "/apps/tts/Trainer/trainer/utils/distributed.py", line 62, in init_distributed
group_name=group_name,
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
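For context, errno 98 (`EADDRINUSE`) simply means a second process tried to open a listening socket on a port the first process already holds. A minimal sketch with the plain `socket` module (nothing Coqui- or PyTorch-specific) reproduces the same failure:

```python
# Minimal reproduction of errno 98: two listening sockets on one port.
# The two trainer jobs hit the same condition because both default to the
# rendezvous port 54321 shown in the traceback above.
import socket

def listen_on(port: int) -> socket.socket:
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))  # 2nd call: OSError: [Errno 98] Address already in use
    s.listen()
    return s

first = listen_on(54321)   # stands in for the GlowTTS job's c10d store
second = listen_on(54321)  # stands in for the vocoder job -> EADDRINUSE
```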
Environment
{
"CUDA": {
"GPU": [
"NVIDIA A100-SXM4-40GB",
"NVIDIA A100-SXM4-40GB",
"NVIDIA A100-SXM4-40GB",
"NVIDIA A100-SXM4-40GB"
],
"available": true,
"version": "11.5"
},
"Packages": {
"PyTorch_debug": false,
"PyTorch_version": "1.11.0+cu115",
"TTS": "0.7.0",
"numpy": "1.21.6"
},
"System": {
"OS": "Linux",
"architecture": [
"64bit",
"ELF"
],
"processor": "x86_64",
"python": "3.7.13",
"version": "#61~18.04.3-Ubuntu SMP Fri Oct 1 14:04:01 UTC 2021"
}
}
Additional context
First command starts with this log:
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=0']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=1']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=2']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=3']
> Using CUDA: True
> Number of GPUs: 4
Second command starts with this log:
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=0']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=1']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=2']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=3']
> Using CUDA: True
> Number of GPUs: 4
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.
Top GitHub Comments
@erogol @Dapwner
If GPU 0 is in your list of GPUs, it is used correctly, as in my first command. Otherwise, the first GPU id in your list becomes the master, and there is a mismatch between the GPU ID (which the currently released code uses) and the rank (rank 0 is treated as the master). This also explains the test @Dapwner ran: the exception happens in distributed training whenever the GPU list starts at 1 or higher. Essentially, those runs are master-less; no model files get stored by the master, and the failure then shows up as a `FileNotFoundError`.
So, in my second example I used the GPU list 5,6,7:
/apps/tts/TTS # nohup python3 -m trainer.distribute --script train_hifigan_vocoder_en.py --gpus "5,6,7" --coqpit.distributed_url "tcp://localhost:54322" </dev/null > hifigan_en.log 2>&1 &
The rank is printed correctly in the logs at first (GPU 5 gets rank 0), but at runtime, once training is actually running on the GPUs, the currently released code picks up the GPU ID instead of the rank, so it loses the master altogether, hence the ghost processes.
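To make the GPU-id/rank mismatch concrete, here is a hypothetical illustration (plain Python, not Trainer's actual code) of why a GPU list that does not include device 0 ends up master-less:

```python
# Hypothetical illustration of the mismatch described above (not Trainer's code).
# With --gpus "5,6,7", the three workers get ranks 0, 1, 2 (see the --rank=N
# arguments in the logs) but train on CUDA devices 5, 6, 7.
gpu_ids = [5, 6, 7]          # from --gpus "5,6,7"

for rank in range(len(gpu_ids)):
    gpu_id = gpu_ids[rank]   # device this worker actually uses

    # Buggy master check: compares the GPU id with 0. With any GPU list that
    # does not contain device 0, no worker ever qualifies as master, nothing
    # saves checkpoints, and the run is effectively master-less.
    is_master_buggy = (gpu_id == 0)

    # Correct master check: rank 0 is the master, whichever GPU it sits on.
    is_master_fixed = (rank == 0)

    print(f"rank={rank} gpu={gpu_id} buggy={is_master_buggy} fixed={is_master_fixed}")
```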
@lexkoro @erogol @Dapwner
I ended up debugging this further at runtime and tested a fix. It turns out that the current Trainer `distributed.py` identifies the current device rank from a GPU id taken from one of the env vars. I opened a PR for the Trainer project; the proposed behavior is to use `torch.distributed.get_rank()` instead.
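For readers following along, a rough sketch of that direction (the names below are illustrative, not the PR's actual diff): once the process group is initialised, master-only work keys off `torch.distributed.get_rank()` rather than the CUDA device id.

```python
# Illustrative sketch only; assumed function names, not the actual Trainer PR.
import torch
import torch.distributed as dist

def is_master_process() -> bool:
    # Rank 0 is the master for any --gpus list ("0,1", "2,3", "5,6,7", ...),
    # whereas the CUDA device id would be 2 or 5 for the latter lists.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True  # single-process / non-DDP run

def save_checkpoint_if_master(model: torch.nn.Module, path: str) -> None:
    # Only the master writes model files; the master-less runs described above
    # never do this, which is where the FileNotFoundError comes from.
    if is_master_process():
        torch.save(model.state_dict(), path)
```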