Dashboard failures with include_dashboard set to false
Running Tune with A3C fails straight at the beginning with the following traceback:
2020-11-11 14:13:37,114 WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
loop.run_until_complete(agent.run())
File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
return future.result()
File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
agent_ip_address=self.ip))
File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>
**This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False. It might be related to https://github.com/ray-project/ray/issues/11943, but it shouldn't happen if this flag is set to False, so it's a different issue.**
Ray version and other system information (Python version, TensorFlow version, OS): Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only on both latest master and releases/1.0.1
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
import ray
from ray import tune
from ray.rllib.agents.a3c import A3CTrainer

ray.init(include_dashboard=False)
tune.run(
    A3CTrainer,
    config=<any config>,
    stop={
        "timesteps_total": 50e6,
    },
)
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Issue Analytics
- State:
- Created 3 years ago
- Reactions:1
- Comments:63 (61 by maintainers)

When I run the following script:
on a machine with http_proxy and https_proxy set, it spits out
This looks like the same error, right? (The sleeps are necessary so that the script doesn’t exit before the error appears; the length of sleep necessary presumably varies by machine.)
Obviously, on any machine with http_proxy and https_proxy set, no_proxy is also going to be set, presumably with localhost and 127.0.0.1, but no_proxy usually won't include the machine's external IP address. Ray is using that external IP address from get_node_ip_address().

For my machine, at least, adding the external IP address to no_proxy makes everything go through without that error message.
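The workaround above can also be sketched in Python instead of the shell. This is only a minimal sketch: `merge_no_proxy` and `apply_no_proxy` are hypothetical helper names, not part of Ray, and it assumes that processes launched afterwards inherit the driver's environment.

```python
import os


def merge_no_proxy(current, extra_hosts):
    """Return a comma-separated no_proxy value with extra_hosts appended.

    Existing entries keep their order; duplicates and empty entries are skipped.
    """
    entries = [e.strip() for e in current.split(",") if e.strip()]
    for host in extra_hosts:
        if host and host not in entries:
            entries.append(host)
    return ",".join(entries)


def apply_no_proxy(extra_hosts, environ=os.environ):
    """Write the merged value to both spellings of the variable (hypothetical helper)."""
    merged = merge_no_proxy(environ.get("no_proxy", ""), extra_hosts)
    environ["no_proxy"] = merged
    environ["NO_PROXY"] = merged
    return merged
```

Calling `apply_no_proxy([node_ip])` before `ray.init()`, where `node_ip` is the address that get_node_ip_address() reports, should have the same effect as exporting no_proxy in the shell, assuming the environment really is inherited as described.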
I think @fyrestone hit the nail on the head.
Unfortunately, the problem being diagnosed is not the same thing as the problem being solved. Setting no_proxy that way works for a simple standalone script like that one, but for more complicated operations such as ray start and tune, the new processes don't get started with the new value of no_proxy, even if you export no_proxy. The new processes must pull the values of the variables from some deeper level when they get started up, and I'm not sure where. Not .bashrc, I assume, since these new processes aren't starting in shells as such. ~~Looking at https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1438, the dashboard process doesn't have shell=True, so I'm really not sure where it's pulling the proxy information from. And yet setting no_proxy on the command line works when running a simple ray.init() script…? https://stackoverflow.com/questions/12060863/python-subprocess-call-a-bash-alias~~
It's sort of baffling, because actors are also separate processes, but apparently those actors started from that script do somehow inherit the value of no_proxy. (export no_proxy="$(hostname -i),$no_proxy" makes that script go through just fine; it doesn't matter whether no_proxy is set on the same line, no_proxy just needs to be set.) Yet other workers created by ray start do not, so that still results in workers spitting out that error.

All the various processes started by Ray inherit no_proxy. I dunno how, but they do. You do need to set no_proxy on all machines involved, though, with the numerical IP addresses of all machines involved (comma-separated), including its own. Remember that the IP address by which one machine can find another machine is not necessarily the same IP address that hostname -i brings up on the target machine.

You might not know in advance the IP address of every machine that will be joining. But you could probably brute-force that by just adding every IP address ending in a number to no_proxy:
(Presumably, the problem only happens because we’re using raw numerical IP addresses; presumably, no_proxy is already set to cover all relevant domains.) We could add that to the documentation as a recommendation to just always do.
I'm not sure what to do about this with respect to the port-checking documentation. netcat and nmap (naturally, I think?) completely ignore http_proxy and https_proxy for non-HTTP traffic. (This isn't HTTP traffic, is it? This is a metric-export thing and that's why it happens even with the dashboard disabled? I'm not sure why gRPC is using the proxy settings. I'm guessing there are some kind of gRPC-over-HTTP shenanigans going on, for some reason?)
(Okay, I guess they just always ignore proxy settings. I still don't get why gRPC is using the proxy. I assume this is dashboard-specific somehow, since nothing else goes wrong if you run http_proxy=http://some.imaginary.proxy:80 https_proxy=http://some.imaginary.proxy:80 python test_actors.py, just the dashboard thing. You even still get the correct answer, despite the error messages the dashboard is spitting out.)

(To be clear, if you literally use an imaginary proxy like http://some.random.proxy:80, you'll get a different error message. But the computation will still go through, so it's only the dashboard gRPC thing that's looking at http_proxy.)

In any case, we could have an error message that gives the IP and port that failed to reach, possibly with a suggestion to add them to no_proxy.
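A rough sketch of what such a message could look like is below. `proxy_hint` is a hypothetical helper, not existing Ray code, and the exact-match check against no_proxy is a simplification that does not reproduce how gRPC itself matches entries.

```python
def proxy_hint(ip, port, environ):
    """Build a connection-failure message that mentions no_proxy when relevant.

    Hypothetical helper: `environ` is any mapping shaped like os.environ.
    """
    proxy = environ.get("http_proxy") or environ.get("https_proxy")
    no_proxy = [e.strip() for e in environ.get("no_proxy", "").split(",") if e.strip()]
    message = f"Failed to connect to {ip}:{port}."
    # Only add the hint when a proxy is configured and the IP is not exempted.
    if proxy and ip not in no_proxy:
        message += (
            f" A proxy is configured ({proxy}) and {ip} is not listed in"
            " no_proxy; consider adding it."
        )
    return message
```

Passing `os.environ` as `environ` would use the real process environment; taking the mapping as a parameter keeps the helper easy to check in isolation.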
This looks like a bad bug. @mfitton can you take a look at it?