Port address conflict causes the head node to enter a broken, unusable state

ray core

What is the problem?

Running ray start --head ... with a group of explicitly specified ports appears to succeed, but ray status then fails with “ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.” The log at /tmp/ray/session_latest/logs/raylet.err shows an “Address already in use” error, as below:

[root@/tmp/ray/session_latest/logs #]cat raylet.err 
E0215 01:53:44.582325479    2825 server_chttp2.cc:40]        {"created":"@1613354024.582257191","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1613354024.582255389","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1613354024.582244433","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1613354024.582234203","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1613354024.582254497","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1613354024.582251910","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: *** Aborted at 1613354024 (unix time) try "date -d @1613354024" if you are using GNU date ***
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: PC: @                0x0 (unknown)
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: *** SIGSEGV (@0x58) received by PID 2825 (TID 0x7ffa56db57c0) from PID 88; stack trace: ***
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435:     @     0x555562d3743f google::(anonymous
... 
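The gRPC error in the log above is a nested JSON object, and the useful detail (the failing syscall and OS error) sits at the innermost level. As a minimal sketch, assuming the JSON payload begins at the first `{` on the log line, the helper below walks the `referenced_errors` chain and collects the leaf errors. The function names `leaf_errors` and `summarize_grpc_error` are hypothetical and are not part of Ray or gRPC.

```python
import json


def leaf_errors(node):
    """Yield the innermost entries of a nested gRPC error object."""
    children = node.get("referenced_errors", [])
    if not children:
        yield node
        return
    for child in children:
        yield from leaf_errors(child)


def summarize_grpc_error(log_line):
    """Return unique (syscall, os_error, errno) tuples from one log line.

    Assumption: the JSON payload starts at the first '{' and runs to the
    end of the line, as in the raylet.err excerpt above.
    """
    start = log_line.find("{")
    if start == -1:
        return []
    payload = json.loads(log_line[start:].strip())
    summary = []
    for err in leaf_errors(payload):
        item = (err.get("syscall"), err.get("os_error"), err.get("errno"))
        if item not in summary:
            summary.append(item)
    return summary
```

Applied to the first line of raylet.err shown above, this should reduce the two repeated leaf entries to a single bind / "Address already in use" / 98 tuple.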

Ray version and other system information (Python version, TensorFlow version, OS): Ray 2.0.0.dev0 has this issue. Ray 1.0.0 appears to be OK.

Reproduction (REQUIRED)

Reproducible on either Debian GNU/Linux 9 (stretch) or macOS Catalina 10.15.7 (19H512) under Python 3.6.9.

Command that will fail:

ray start --head --dashboard-port=31003 --port=31001 --object-manager-port=31002 --min-worker-port=31005 --max-worker-port=31011 --num-cpus=5 --dashboard-host="0.0.0.0"

Running ray status after this shows the error: the Ray head node has entered an unhealthy, unusable state. Run ray stop to clear any unhealthy Ray processes, then try again with a command that changes only --object-manager-port to a value far away from 3100x. That command succeeds, and ray status reports the correct status.

Command that will succeed:

ray start --head --dashboard-port=31003 --port=31001 --object-manager-port=31099 --min-worker-port=31005 --max-worker-port=31011 --num-cpus=5 --dashboard-host="0.0.0.0"
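Since the raylet fails on bind with errno 98, one way to narrow things down is to probe which of the requested ports are already held by some process. The following is a rough diagnostic sketch using only the Python standard library; `port_in_use` and `busy_ports` are hypothetical helper names, not Ray APIs. It reports only whether a plain TCP bind fails with EADDRINUSE, and it does not identify which process or Ray component owns the port.

```python
import errno
import socket


def port_in_use(port, host="0.0.0.0"):
    """Return True if a TCP bind to (host, port) fails with EADDRINUSE."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # other failures (e.g. permissions) are not port conflicts
    return False


def busy_ports(ports, host="0.0.0.0"):
    """Return the subset of `ports` that cannot currently be bound."""
    return [port for port in ports if port_in_use(port, host)]
```

For the failing command, the ports to probe would be 31001, 31002, 31003, and the worker range 31005 to 31011, for example `busy_ports([31001, 31002, 31003, *range(31005, 31012)])`. Running the probe after ray stop and again after the failed ray start may show which port was claimed unexpectedly.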

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 28 (27 by maintainers)

Top GitHub Comments

1 reaction
yuduber commented, Mar 3, 2021

All right, I see. If you retry at a higher level, it will probably be fine as well. I will try the latest code to see if things work. Thanks.

0 reactions
lalalapotter commented, Sep 6, 2022

Hi @rkooo567, I encountered a similar problem while using Ray. I checked PR #14435, and the relevant code has been removed in the latest branch. Did you retry in the higher-level code?
