
Address Already in Use Error When Training on 2 GPUs and Starting a New Job on the Remaining 2 GPUs

See original GitHub issue

Describe the bug

Following the steps in the [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html), I run these two commands:

GlowTTS: python3 -m trainer.distribute --script train.py --gpus "0,1"

Vocoder: python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"

The second command fails with this exception: Address already in use

To Reproduce

  1. Download the dataset per this [tutorial](https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html)

  2. Run GlowTTS: python3 -m trainer.distribute --script train.py --gpus "0,1"

  3. Run Vocoder: python3 -m trainer.distribute --script train_vocoder.py --gpus "2,3"
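
Both jobs appear to rely on the same default rendezvous port (54321, as the traceback below shows), so the second job's c10d store tries to bind a port the first job is still listening on. The clash can be avoided by giving the second job its own address, which in Trainer corresponds to the distributed_url setting (see the --coqpit.distributed_url flag used in the comments below). A minimal PyTorch-level sketch, assuming 54322 is a free port:

```python
import torch.distributed as dist

# Minimal sketch (not the Trainer code): give the second job its own
# rendezvous address instead of the default one on port 54321.
dist.init_process_group(
    backend="gloo",                       # "nccl" for the real multi-GPU run
    init_method="tcp://localhost:54322",  # distinct port -> no clash with the job on 54321
    rank=0,
    world_size=1,                         # single process so the sketch runs standalone
)
dist.destroy_process_group()
```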

Expected behavior

No response

Logs

Traceback (most recent call last):
  File "train_vocoder.py", line 44, in <module>
    TrainerArgs(), config, output_path, model=model, train_samples=train_samples, eval_samples=eval_samples
  File "/apps/tts/Trainer/trainer/trainer.py", line 460, in __init__
    self.config.distributed_url,
  File "/apps/tts/Trainer/trainer/utils/distributed.py", line 62, in init_distributed
    group_name=group_name,
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 186, in _tcp_rendezvous_handler
    store = _create_c10d_store(result.hostname, result.port, rank, world_size, timeout)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
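
For reference, errno 98 (EADDRINUSE) simply means another process already holds the address, here TCP port 54321, which is owned by the c10d store of the GlowTTS job from step 2. A small diagnostic sketch (the helper name is hypothetical) to confirm whether that port is free before launching the vocoder run:

```python
import socket

def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:  # errno 98 / EADDRINUSE while the first job is running
            return False

print(port_is_free(54321))  # expected: False while the GlowTTS job is alive
```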

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A100-SXM4-40GB",
            "NVIDIA A100-SXM4-40GB",
            "NVIDIA A100-SXM4-40GB",
            "NVIDIA A100-SXM4-40GB"
        ],
        "available": true,
        "version": "11.5"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "1.11.0+cu115",
        "TTS": "0.7.0",
        "numpy": "1.21.6"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.7.13",
        "version": "#61~18.04.3-Ubuntu SMP Fri Oct 1 14:04:01 UTC 2021"
    }
}

Additional context

First command starts with this log:

['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=0']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=1']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=2']
['train.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205837', '--use_ddp=true', '--rank=3']
 > Using CUDA: True
 > Number of GPUs: 4

Second command starts with this log:

['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=0']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=1']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=2']
['train_vocoder.py', '--continue_path=', '--restore_path=', '--group_id=group_2022_06_21-205851', '--use_ddp=true', '--rank=3']
 > Using CUDA: True
 > Number of GPUs: 4
[W socket.cpp:401] [c10d] The server socket has failed to bind to [::]:54321 (errno: 98 - Address already in use).
[W socket.cpp:401] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:435] [c10d] The server socket has failed to listen on any local network address.

Issue Analytics

  • State: closed
  • Created a year ago
  • Comments: 16 (12 by maintainers)

Top GitHub Comments

2 reactions
iprovalo commented, Jul 11, 2022

@erogol @Dapwner

If GPU 0 is in your list of GPUs, it is used correctly, as in my first command; otherwise the first GPU id in your list becomes the master. That creates a mismatch between the GPU id (which the currently released code uses) and the rank (rank 0 is what is treated as the master). This also explains the test @Dapwner ran:

Interestingly, I have tried to run the code on our nvidia dgx station (4 V100 gpus) with the following gpus: 0 + 1, 0 + 3, 1 + 3, 1 + 2 and it turned out that the error did not appear in the first two cases (when gpu 0 was included), but did occur in the latter two (when gpu 0 was not present).

Basically, the exception happens in distributed training whenever the GPU list starts at 1 or higher: those processes essentially run master-less. No model files get stored by a master, and then a FileNotFoundError follows.

So, in my second example I used GPU list 5,6,7:

/apps/tts/TTS # nohup python3 -m trainer.distribute --script train_hifigan_vocoder_en.py --gpus "5,6,7" --coqpit.distributed_url "tcp://localhost:54322" </dev/null > hifigan_en.log 2>&1 &

The rank is printed correctly in the logs initially: GPU 5 gets rank 0. But at runtime, once training is running on the GPUs, the currently released code hands PyTorch the GPU id instead of the rank, so it loses the master altogether, hence the ghost processes.
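
To make the mismatch concrete, here is a purely illustrative comparison (not the actual Trainer code): picking the master by GPU id only works when GPU 0 happens to be requested, while picking it by rank always yields exactly one master.

```python
# Purely illustrative, not the Trainer implementation.
gpu_ids = [1, 3]             # e.g. --gpus "1,3" (GPU 0 not requested)
ranks = range(len(gpu_ids))  # one spawned process per requested GPU

# Buggy pattern: "master" is whichever process sits on GPU 0 -> no master here.
masters_by_gpu_id = [r for r in ranks if gpu_ids[r] == 0]   # []

# Correct pattern: rank 0 is always present, whatever the GPU ids are.
masters_by_rank = [r for r in ranks if r == 0]              # [0]

print(masters_by_gpu_id, masters_by_rank)
```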

1 reaction
iprovalo commented, Jul 7, 2022

@lexkoro @erogol @Dapwner

I ended up debugging this more at runtime and tested a fix.

It turns out that the current Trainer distributed.py uses a GPU id taken from one of the environment variables to identify the current device rank. I opened a PR for the Trainer project; the proposed behavior is to use torch.distributed.get_rank() instead.
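
A minimal sketch of that proposed behavior (the actual PR may differ): once the process group is initialised, ask torch.distributed for the rank rather than deriving it from a GPU-id environment variable, and let rank 0 act as the master.

```python
import os
import torch.distributed as dist

# Sketch only: single-process "gloo" group so it runs standalone;
# a real run would use "nccl" with world_size equal to the number of GPUs.
dist.init_process_group(
    backend="gloo",
    init_method="tcp://localhost:54323",  # hypothetical free port
    rank=int(os.environ.get("RANK", 0)),
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
)

rank = dist.get_rank()   # always 0..world_size-1, independent of the GPU ids requested
if rank == 0:
    print("this process is the master and saves checkpoints")

dist.destroy_process_group()
```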
