Port address conflict causes the head node to enter a broken, unusable state
What is the problem?
ray start --head ...
with a group of explicitly specified ports will run and appear to succeed. But ray status
will then fail with "ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.". The logs in /tmp/ray/session_latest/logs/raylet.err show an "Address already in use" error like the one below:
[root@/tmp/ray/session_latest/logs #]cat raylet.err
E0215 01:53:44.582325479 2825 server_chttp2.cc:40] {"created":"@1613354024.582257191","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1613354024.582255389","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1613354024.582244433","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1613354024.582234203","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1613354024.582254497","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1613354024.582251910","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: *** Aborted at 1613354024 (unix time) try "date -d @1613354024" if you are using GNU date ***
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: PC: @ 0x0 (unknown)
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: *** SIGSEGV (@0x58) received by PID 2825 (TID 0x7ffa56db57c0) from PID 88; stack trace: ***
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: @ 0x555562d3743f google::(anonymous
...
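A quick way to see which of the configured ports is already taken, without parsing the gRPC error above, is to probe each candidate port directly. This is a minimal, generic diagnostic sketch (it uses only the standard library, not any Ray API); the port numbers in the example mirror the repro command below and are otherwise arbitrary:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on a successful TCP connect, i.e. a listener exists.
        return s.connect_ex((host, port)) == 0

# Ports from the failing `ray start` command (adjust to your own config).
for port in (31001, 31002, 31003):
    print(port, "in use" if port_in_use(port) else "free")
```

Running this before and after `ray start` shows which listeners actually came up and which port the failed bind collided with.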
Ray version and other system information (Python version, TensorFlow version, OS): Ray 2.0.0.dev0 has this issue. Ray 1.0.0 appears to be OK.
Reproduction (REQUIRED)
Reproducible on either Debian GNU/Linux 9 (stretch) or macOS Catalina 10.15.7 (19H512) under Python 3.6.9.
Command that will fail:
ray start --head --dashboard-port=31003 --port=31001 --object-manager-port=31002 --min-worker-port=31005 --max-worker-port=31011 --num-cpus=5 --dashboard-host="0.0.0.0"
Run ray status
after this and it shows the error above; the Ray head node is now in an unhealthy, unusable state.
Run ray stop
to clear any unhealthy Ray processes, then try again with a command that only changes --object-manager-port to a value far away from 3100x. This command succeeds and ray status
reports the correct status.
Command that will succeed:
ray start --head --dashboard-port=31003 --port=31001 --object-manager-port=31099 --min-worker-port=31005 --max-worker-port=31011 --num-cpus=5 --dashboard-host="0.0.0.0"
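Before starting the head node, it can help to sanity-check the port flags as a group: no duplicates, nothing falling inside the worker port range, and nothing already bound. The following is a hedged sketch of such a pre-flight check (the dict keys just mirror the CLI flag names; this is not a Ray API, and Ray may also allocate internal ports this check cannot see):

```python
import socket

def check_port_config(ports, worker_range):
    """Return a list of problems with an explicit port assignment.

    ports: mapping of flag name -> port number (mirrors the ray start flags).
    worker_range: the inclusive [min-worker-port, max-worker-port] range.
    """
    problems = []
    seen = {}
    for name, port in ports.items():
        if port in worker_range:
            problems.append(f"{name}={port} falls inside the worker port range")
        if port in seen:
            problems.append(f"{name}={port} duplicates {seen[port]}")
        seen[port] = name
        try:
            # Try a throwaway bind to detect ports already held by another process.
            with socket.socket() as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("", port))
        except OSError:
            problems.append(f"{name}={port} is already in use")
    return problems

# The port layout from the succeeding command above.
good = {"port": 31001, "object-manager-port": 31099, "dashboard-port": 31003}
print(check_port_config(good, range(31005, 31012)))
```

An empty result means the explicit ports are at least mutually consistent and free at check time; it cannot rule out a race with another process binding them afterwards.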
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Issue Analytics
- Created: 3 years ago
- Comments: 28 (27 by maintainers)
All right, I see. If you retry at a higher level, it will probably be fine as well. I will try the latest code to see if things work fine. Thanks.
Hi @rkooo567, I encountered a similar problem while using Ray. I checked PR #14435, and the relevant code has been removed in the latest branch. Did you retry in the higher-level code?
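For reference, "retry at a higher level" roughly means attempting the bind over a range of candidate ports instead of aborting on the first EADDRINUSE. A generic sketch of that idea (this is not Ray's actual code; PR #14435 and the current branch are the place to check what Ray really does):

```python
import socket

def bind_first_free(candidate_ports):
    """Bind a listening socket to the first free port in candidate_ports.

    Returns (socket, bound_port). Raises RuntimeError if every candidate is taken.
    A candidate of 0 asks the OS to pick any free ephemeral port.
    """
    for port in candidate_ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))
            s.listen()
            # getsockname() reports the real port even when we asked for 0.
            return s, s.getsockname()[1]
        except OSError:
            s.close()  # this candidate is taken; try the next one
    raise RuntimeError("no free port among candidates")
```

With a retry loop like this, an "Address already in use" on one candidate degrades gracefully instead of crashing the server, which is the behavior the comment above is asking about.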