accelerate test -- RuntimeError: Address already in use
(Acc) water@amax:~/Basecode/law-qa-competition2021/torchVersion/src$ accelerate config
In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 0
Which type of machine are you using? ([0] No distributed training, [1] multi-GPU, [2] TPU): 1
How many different machines will you use (use more than 1 for multi-node training)? [1]: 1
How many processes in total will you use? [1]: 4
Do you wish to use FP16 (mixed precision)? [yes/NO]: yes
(Acc) water@amax:~/Basecode/law-qa-competition2021/torchVersion/src$ accelerate test
Running: accelerate-launch --config_file=None /home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py
stderr: Traceback (most recent call last):
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 225, in <module>
stderr: main()
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 205, in main
stderr: accelerator = Accelerator()
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/accelerator.py", line 79, in __init__
stderr: self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/state.py", line 125, in __init__
stderr: torch.distributed.init_process_group(backend="nccl")
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
stderr: store, rank, world_size = next(rendezvous_iterator)
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
stderr: store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
stderr: RuntimeError: Address already in use
stderr: Traceback (most recent call last):
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
stderr: return _run_code(code, main_globals, None,
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 87, in _run_code
stderr: exec(code, run_globals)
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
stderr: main()
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
stderr: sigkill_handler(signal.SIGTERM, None) # not coming back
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
stderr: raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-u', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
stdout: *****************************************
stdout: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
stdout: *****************************************
stdout: Killing subprocess 43038
stdout: Killing subprocess 43039
stdout: Killing subprocess 43040
stdout: Killing subprocess 43041
stderr: Traceback (most recent call last):
stderr: File "/home/water/anaconda3/envs/Acc/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 319, in main
stderr: launch_command(args)
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 307, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 151, in multi_gpu_launcher
stderr: raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/home/water/anaconda3/envs/Acc/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 41, in main
args.func(args)
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/test.py", line 52, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/testing.py", line 134, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch --config_file=None /home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py' failed with returncode 1
The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 225, in <module>
main()
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py", line 205, in main
accelerator = Accelerator()
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/accelerator.py", line 79, in __init__
self.state = AcceleratorState(fp16=fp16, cpu=cpu, _from_accelerator=True)
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/state.py", line 125, in __init__
torch.distributed.init_process_group(backend="nccl")
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/water/anaconda3/envs/Acc/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-u', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Traceback (most recent call last):
File "/home/water/anaconda3/envs/Acc/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 319, in main
launch_command(args)
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 307, in launch_command
multi_gpu_launcher(args)
File "/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/commands/launch.py", line 151, in multi_gpu_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/water/anaconda3/envs/Acc/bin/python', '-m', 'torch.distributed.launch', '--use_env', '--nproc_per_node', '4', '/home/water/anaconda3/envs/Acc/lib/python3.8/site-packages/accelerate/test_utils/test_script.py']' returned non-zero exit status 1.
Top GitHub Comments
You should override the port for your second launch with the --main_process_port argument.

There is nothing in the screenshot you took, just PyTorch telling us the process was killed, but no reason why. If there is no other error message, my best guess would be a RAM error, but it’s hard to know for sure.
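
To make the port override suggested above concrete, here is a minimal Python sketch that checks whether the rendezvous port is free and, if not, asks the OS for an unused one before building the launch command. It assumes the default port is 29500 (the usual torch.distributed.launch default, not stated in this thread), and the script name train.py and the helper names are hypothetical. Only the --main_process_port flag comes from the comment itself.

```python
import socket

# Assumption: 29500 is the default MASTER_PORT used by torch.distributed.launch.
DEFAULT_PORT = 29500


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if (host, port) can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process (for example a previous
            # or concurrent distributed run) already holds the port.
            return False
        return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))
        return sock.getsockname()[1]


def build_launch_command(script: str) -> list:
    """Build an `accelerate launch` command that avoids a busy default port."""
    port = DEFAULT_PORT if port_is_free(DEFAULT_PORT) else find_free_port()
    return ["accelerate", "launch", "--main_process_port", str(port), script]


if __name__ == "__main__":
    # "train.py" is a placeholder script name for illustration only.
    print(" ".join(build_launch_command("train.py")))
```

The check is only a best-effort hint: another process could grab the port between the check and the actual launch, so a stale run that still holds the port should be stopped where possible rather than worked around.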