accelerate test -- RuntimeError: Address already in use


(Acc) water@amax:~/Basecode/law-qa-competition2021/torchVersion/src$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
(Acc) water@amax:~/Basecode/law-qa-competition2021/torchVersion/src$ accelerate test

Running:  accelerate-launch --config_file=None /home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py
stderr: Traceback (most recent call last):
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 225, in <module>
stderr:     main()
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 205, in main
stderr:     accelerator = Accelerator()
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/accelerator.py", line 79, in __init__
stderr:     self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/state.py", line 125, in __init__
stderr:     torch.distributed.init_process_group(backend="nccl")
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
stderr:     store, rank, world_size = next(rendezvous_iterator)
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
stderr:     store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
stderr: RuntimeError: Address already in use
stderr: Traceback (most recent call last):
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
stderr:     return _run_code(code, main_globals, None,
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 87, in _run_code
stderr:     exec(code, run_globals)
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
stderr:     main()
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
stderr:     sigkill_handler(signal.SIGTERM, None)  # not coming back
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
stderr:     raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-u', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
stdout: *****************************************
stdout: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
stdout: *****************************************
stdout: Killing subprocess 43038
stdout: Killing subprocess 43039
stdout: Killing subprocess 43040
stdout: Killing subprocess 43041
stderr: Traceback (most recent call last):
stderr:   File "/home/water/anaconda3/envs/Acc/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 319, in main
stderr:     launch_command(args)
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 307, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 151, in multi_gpu_launcher
stderr:     raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/home/water/anaconda3/envs/Acc/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
    args.func(args)
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/test.py", line 52, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/testing.py", line 134, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch --config_file=None /home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 225, in <module>
    main()
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 205, in main
    accelerator = Accelerator()
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/accelerator.py", line 79, in __init__
    self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/state.py", line 125, in __init__
    torch.distributed.init_process_group(backend="nccl")
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-u', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
  File "/home/water/anaconda3/envs/Acc/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 319, in main
    launch_command(args)
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 307, in launch_command
    multi_gpu_launcher(args)
  File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 151, in multi_gpu_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Comments: 8 (3 by maintainers)

Top GitHub Comments

sgugger commented, Dec 15, 2021 (1 reaction)

You should override the port for your second launch with the --main_process_port argument.
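
As a rough sketch of that advice: the traceback above fails while creating the `TCPStore` on the master port, which suggests another job on the same machine is already bound to it (by default `torch.distributed.launch` uses port 29500). The Python helper below is not part of Accelerate; `find_free_port`, `port_in_use`, and the script name `train.py` are hypothetical names used for illustration. It asks the operating system for an unused port and prints a launch command that passes it through `--main_process_port`.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding to (host, port) fails, i.e. the port is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return True
        return False


# 29500 is the default master port of torch.distributed.launch.
default_port = 29500
port = default_port if not port_in_use(default_port) else find_free_port()

# train.py is a placeholder for your own training script.
print(f"accelerate launch --main_process_port {port} train.py")
```

The port is released again before the launcher starts, so another process could still claim it in between; for two jobs on one workstation this is usually good enough, but it is not a guarantee.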

sgugger commented, Dec 17, 2021 (0 reactions)

There is nothing in the screenshot you took, just PyTorch telling us the process was killed, but no reason why. If there is no other error message, my best guess would be a RAM error, but it’s hard to know for sure.


Top Results From Across the Web

How to solve "RuntimeError: Address already in use" in ...
My Solution: It simply means that the GPU is already occupied under some other ddp training. Try deleting all the processes related to...

Python [Errno 98] Address already in use - Stack Overflow
A simple solution that worked for me is to close the Terminal and restart it.

Pytorch distributed RuntimeError: Address already in use
If you are training with PyTorch distributed on a single machine with multiple GPUs and hit this error, it is very easy to fix.

Multi-GPU Training - YOLOv5 Documentation
This guide explains how to properly use multiple GPUs to train a dataset with ... If you get RuntimeError: Address already in use...

Why my Accelerate just doesn't work? - Hugging Face Forums
[yes/NO]: No How many processes in total will you use? ... Now the error comes out, when I check with accelerate test ,...
