Port address conflict causes the head node to enter a broken, unusable state
What is the problem?
ray start --head ...
with a group of explicitly specified ports will run and appear to succeed. But ray status
will then fail with "ConnectionError: Could not find any running Ray instance. Please specify the one to connect to by setting address.". The logs in /tmp/ray/session_latest/logs/raylet.err show an "Address already in use" error like the one below:
[root@/tmp/ray/session_latest/logs #]cat raylet.err
E0215 01:53:44.582325479 2825 server_chttp2.cc:40] {"created":"@1613354024.582257191","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":394,"referenced_errors":[{"created":"@1613354024.582255389","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1613354024.582244433","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1613354024.582234203","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1613354024.582254497","description":"Unable to configure socket","fd":32,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":207,"referenced_errors":[{"created":"@1613354024.582251910","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":181,"os_error":"Address already in use","syscall":"bind"}]}]}]}
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: *** Aborted at 1613354024 (unix time) try "date -d @1613354024" if you are using GNU date ***
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: PC: @ 0x0 (unknown)
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: *** SIGSEGV (@0x58) received by PID 2825 (TID 0x7ffa56db57c0) from PID 88; stack trace: ***
[2021-02-15 01:53:44,582 E 2825 2825] logging.cc:435: @ 0x555562d3743f google::(anonymous
...
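A quick way to see which of the configured ports is already taken, without parsing the gRPC error above, is to probe each candidate port directly. This is a minimal, generic diagnostic sketch (it uses only the standard library, not any Ray API); the port numbers in the example mirror the repro command below and are otherwise arbitrary:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(0.5)
        # connect_ex returns 0 on a successful TCP connect, i.e. a listener exists.
        return s.connect_ex((host, port)) == 0

# Ports from the failing `ray start` command (adjust to your own config).
for port in (31001, 31002, 31003):
    print(port, "in use" if port_in_use(port) else "free")
```

Running this before and after `ray start` shows which listeners actually came up and which port the failed bind collided with.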
Ray version and other system information (Python version, TensorFlow version, OS): Ray 2.0.0.dev0 has this issue. Ray 1.0.0 appears to be OK.
Reproduction (REQUIRED)
Reproducible on either Debian GNU/Linux 9 (stretch) or macOS Catalina 10.15.7 (19H512) under Python 3.6.9.
Command that will fail:
ray start --head --dashboard-port=31003 --port=31001 --object-manager-port=31002 --min-worker-port=31005 --max-worker-port=31011 --num-cpus=5 --dashboard-host="0.0.0.0"
Run ray status
after this and it shows the error above; the Ray head node is now in an unhealthy, unusable state.
Run ray stop
to clear any unhealthy Ray processes, then try again with a command that only changes --object-manager-port to a value far away from 3100x. This command succeeds and ray status
reports the correct status.
Command that will succeed:
ray start --head --dashboard-port=31003 --port=31001 --object-manager-port=31099 --min-worker-port=31005 --max-worker-port=31011 --num-cpus=5 --dashboard-host="0.0.0.0"
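Before starting the head node, it can help to sanity-check the port flags as a group: no duplicates, nothing falling inside the worker port range, and nothing already bound. The following is a hedged sketch of such a pre-flight check (the dict keys just mirror the CLI flag names; this is not a Ray API, and Ray may also allocate internal ports this check cannot see):

```python
import socket

def check_port_config(ports, worker_range):
    """Return a list of problems with an explicit port assignment.

    ports: mapping of flag name -> port number (mirrors the ray start flags).
    worker_range: the inclusive [min-worker-port, max-worker-port] range.
    """
    problems = []
    seen = {}
    for name, port in ports.items():
        if port in worker_range:
            problems.append(f"{name}={port} falls inside the worker port range")
        if port in seen:
            problems.append(f"{name}={port} duplicates {seen[port]}")
        seen[port] = name
        try:
            # Try a throwaway bind to detect ports already held by another process.
            with socket.socket() as s:
                s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
                s.bind(("", port))
        except OSError:
            problems.append(f"{name}={port} is already in use")
    return problems

# The port layout from the succeeding command above.
good = {"port": 31001, "object-manager-port": 31099, "dashboard-port": 31003}
print(check_port_config(good, range(31005, 31012)))
```

An empty result means the explicit ports are at least mutually consistent and free at check time; it cannot rule out a race with another process binding them afterwards.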
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Issue Analytics
- Created: 3 years ago
- Comments: 28 (27 by maintainers)
All right, I see. If you retry at a higher level, it will probably be fine as well. I will try the latest code to see if things work fine. Thanks.
Hi @rkooo567, I encountered a similar problem while using Ray. I checked PR #14435, and the relevant code has been removed in the latest branch. Did you retry in the higher-level code?
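For reference, "retry at a higher level" roughly means attempting the bind over a range of candidate ports instead of aborting on the first EADDRINUSE. A generic sketch of that idea (this is not Ray's actual code; PR #14435 and the current branch are the place to check what Ray really does):

```python
import socket

def bind_first_free(candidate_ports):
    """Bind a listening socket to the first free port in candidate_ports.

    Returns (socket, bound_port). Raises RuntimeError if every candidate is taken.
    A candidate of 0 asks the OS to pick any free ephemeral port.
    """
    for port in candidate_ports:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            s.bind(("", port))
            s.listen()
            # getsockname() reports the real port even when we asked for 0.
            return s, s.getsockname()[1]
        except OSError:
            s.close()  # this candidate is taken; try the next one
    raise RuntimeError("no free port among candidates")
```

With a retry loop like this, an "Address already in use" on one candidate degrades gracefully instead of crashing the server, which is the behavior the comment above is asking about.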