
Dashboard failures with include_dashboard set to false


Running Tune with A3C fails straight at the beginning with the following traceback:

2020-11-11 14:13:37,114	WARNING worker.py:1111 -- The agent on node *** failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 298, in <module>
    loop.run_until_complete(agent.run())
  File "/usr/lib/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.6/dist-packages/ray/new_dashboard/agent.py", line 172, in run
    agent_ip_address=self.ip))
  File "/usr/local/lib/python3.6/dist-packages/grpc/experimental/aio/_call.py", line 286, in __await__
    self._cython_call._status)
grpc.experimental.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1605096817.110308830","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1605096817.110303917","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"
>

**This is obviously a dashboard-related exception, which is unexpected since include_dashboard is set to False. It might be related to https://github.com/ray-project/ray/issues/11943, but that shouldn’t happen with the flag set to False, so this is a different issue.**

Ray version and other system information (Python version, TensorFlow version, OS): Ray installed via https://docs.ray.io/en/master/development.html#building-ray-python-only on both latest master and releases/1.0.1

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):

    import ray
    from ray import tune
    from ray.rllib.agents.a3c import A3CTrainer

    ray.init(include_dashboard=False)
    tune.run(
        A3CTrainer,
        config=<any config>,
        stop={
            "timesteps_total": 50e6,
        },
    )
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Issue Analytics

  • State: closed
  • Created: 3 years ago
  • Reactions: 1
  • Comments: 63 (61 by maintainers)

Top GitHub Comments

3 reactions
dHannasch commented, Nov 20, 2020

> environment variables http_proxy or https_proxy

When I run the following script:

import time
import ray
import ray.services

@ray.remote
def f():
    time.sleep(8)
    return ray.services.get_node_ip_address()

if __name__ == "__main__":
    ray.init(num_cpus=1)
    IPaddresses = set(ray.get([f.remote() for _ in range(4)]))
    print('IPaddresses =', IPaddresses)
    ray.shutdown()

on a machine with http_proxy and https_proxy set, it spits out

Traceback (most recent call last):
  File "ray/new_dashboard/agent.py", line 305, in <module>
    loop.run_until_complete(agent.run())
  File "python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "python3.8/site-packages/ray/new_dashboard/agent.py", line 169, in run
    await raylet_stub.RegisterAgent(
  File "python3.8/site-packages/grpc/aio/_call.py", line 285, in __await__
    raise _create_rpc_error(self._cython_call._initial_metadata,
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "failed to connect to all addresses"
        debug_error_string = "{"description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4165,"referenced_errors":[{"description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":397,"grpc_status":14}]}"

This looks like the same error, right? (The sleeps are necessary so that the script doesn’t exit before the error appears; the length of sleep necessary presumably varies by machine.)

Obviously, on any machine with http_proxy and https_proxy set, no_proxy is also going to be set, presumably with localhost and 127.0.0.1…but no_proxy usually won’t include the machine’s external IP address. Ray is using that external IP address from get_node_ip_address().

For my machine, at least, adding the external IP address to no_proxy makes everything go through without that error message.

$ no_proxy="$(hostname -i),$no_proxy" python test_actors.py

I think @fyrestone hit the nail on the head.

Unfortunately, the problem being diagnosed is not the same thing as the problem being solved. Setting no_proxy that way works for a simple standalone script like that one, but for the more complicated operations such as ray start and tune, the new processes don’t get started with the new value of no_proxy, even if you export no_proxy. The new processes must pull the values of the variables from some deeper level when they get started up, and I’m not sure where. Not .bashrc, I assume, since these new processes aren’t starting in shells as such.
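For what it's worth, ordinary child processes spawned with the default `env=None` do inherit the parent's environment, which is what makes the `ray start` behavior puzzling. A minimal stdlib-only sketch (the 10.0.0.5 address is made up) demonstrates the normal inheritance:

```python
import os
import subprocess
import sys

# Children spawned with env=None (the default) inherit a copy of the
# parent's environment at spawn time.
os.environ["no_proxy"] = "10.0.0.5," + os.environ.get("no_proxy", "")

child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['no_proxy'])"],
    capture_output=True,
    text=True,
)
print(child.stdout.strip())  # includes 10.0.0.5
```

One plausible explanation for the discrepancy: a daemon that was already running before you exported the variable keeps the environment it was started with, so only processes spawned after the export see the new value.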

~~Looking at https://github.com/ray-project/ray/blob/master/python/ray/_private/services.py#L1438, the dashboard process doesn’t have shell=True, so I’m really not sure where it’s pulling the proxy information from. And yet setting no_proxy on the command line works when running a simple ray.init() script…? https://stackoverflow.com/questions/12060863/python-subprocess-call-a-bash-alias~~

It’s sort of baffling, because actors are also separate processes, but apparently those actors started from that script do somehow inherit the value of no_proxy. (export no_proxy="$(hostname -i),$no_proxy" makes that script go through just fine; it doesn’t matter whether no_proxy is set on the same line, no_proxy just needs to be set.) Yet other workers created by ray start do not, so that

$ export no_proxy="$(hostname -i),$no_proxy"
$ ray start

still results in workers spitting out that error.

All the various processes started by Ray inherit no_proxy. I dunno how, but they do. You do need to set no_proxy on all machines involved, though, with the numerical IP addresses of all machines involved (comma-separated), including its own. Remember that the IP address by which one machine can find another machine is not necessarily the same IP address that hostname -i brings up on the target machine.
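Concretely, a sketch of that setup on each machine (the addresses and port below are placeholders for your actual nodes):

```shell
# Substitute the real node addresses; these are placeholders.
export no_proxy="10.0.0.1,10.0.0.2,10.0.0.3,$no_proxy"

# Then start Ray with the bypass already in the environment, e.g.:
#   ray start --head                      # on the head node
#   ray start --address=10.0.0.1:6379     # on each worker node
```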

You might not know in advance the IP address of every machine that will be joining. But you could probably brute-force that by adding the single digits 0–9 to no_proxy, which suffix-matches every numerical IP address:

no_proxy="0,1,2,3,4,5,6,7,8,9,$no_proxy" python test_actors.py

(Presumably, the problem only happens because we’re using raw numerical IP addresses; presumably, no_proxy is already set to cover all relevant domains.) We could add that to the documentation as a standing recommendation.
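As a sanity check on the digit trick: no_proxy is conventionally matched as a comma-separated list of suffix patterns, though implementations differ on details like leading dots and ports. A rough sketch of the common matching rule (not any particular library's exact implementation):

```python
def bypasses_proxy(host, no_proxy):
    """Common (not universal) no_proxy semantics: a host bypasses the
    proxy if it equals an entry or ends with one (suffix match)."""
    for entry in no_proxy.split(","):
        entry = entry.strip().lstrip(".")
        if entry and (host == entry or host.endswith(entry)):
            return True
    return False

# With the digit trick, every dotted-quad IP matches, since every
# IP address ends in a digit:
print(bypasses_proxy("10.23.45.67", "0,1,2,3,4,5,6,7,8,9"))   # True
print(bypasses_proxy("example.com", "0,1,2,3,4,5,6,7,8,9"))   # False
```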

I’m not sure what to do about this with respect to the port-checking documentation. netcat and nmap (naturally, I think?) completely ignore http_proxy and https_proxy for non-HTTP traffic. (This isn’t HTTP traffic, is it? This is a metric-export thing, and that’s why it happens even with the dashboard disabled? I’m not sure why gRPC is using the proxy settings. I’m guessing there are some kind of gRPC-over-HTTP shenanigans going on, for some reason?)

(Okay, I guess they just always ignore proxy settings.

$ http_proxy=http://some.random.proxy:80 https_proxy=http://some.random.proxy:80 nc -vv -z www.google.com 80
Connection to www.google.com 80 port [tcp/http] succeeded!
$ http_proxy=http://some.random.proxy:80 https_proxy=http://some.random.proxy:80 nmap -p 80 www.google.com
PORT   STATE SERVICE
80/tcp open  http

I still don’t get why gRPC is using the proxy. I assume this is dashboard-specific somehow, since nothing else goes wrong if you run http_proxy=http://some.imaginary.proxy:80 https_proxy=http://some.imaginary.proxy:80 python test_actors.py, just the dashboard thing. You even still get the correct answer, despite the error messages the dashboard is spitting out.)

(To be clear, if you literally use an imaginary proxy like http://some.random.proxy:80, you’ll get a different error message. But the computation will still go through, so it’s only the dashboard gRPC thing that’s looking at http_proxy.)
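If the goal is to stop gRPC consulting the proxy variables at all, gRPC's C core does expose a per-channel argument for exactly that: "grpc.enable_http_proxy" (GRPC_ARG_ENABLE_HTTP_PROXY). A sketch of how a client channel could opt out; whether Ray's dashboard agent can simply be patched to pass this is an open question, and the target address here is made up:

```python
# The channel argument that turns off http_proxy/https_proxy resolution
# for a single gRPC channel.
CHANNEL_OPTIONS = [("grpc.enable_http_proxy", 0)]

try:
    import grpc

    # Hypothetical target; in the dashboard agent this would be the
    # raylet's IP address and gRPC port.
    channel = grpc.insecure_channel("127.0.0.1:50051", options=CHANNEL_OPTIONS)
    print("channel created with proxy resolution disabled")
except ImportError:
    print("grpc not installed; the option above is still the relevant knob")
```

The aio channels accept the same options argument, so the same knob should apply to the async call that fails in the traceback above.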

In any case, we could have an error message that gives the IP and port it failed to reach, possibly with a suggestion to add them to no_proxy.
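A rough, stdlib-only sketch of what such an error message could look like (the helper name, target format, and sample address are made up for illustration):

```python
import json

def describe_unavailable(target, debug_error_string):
    """Turn a gRPC UNAVAILABLE debug string into an actionable hint.
    `target` is the "host:port" the channel was dialing."""
    try:
        description = json.loads(debug_error_string).get("description", "")
    except (ValueError, TypeError):
        description = debug_error_string
    host = target.rsplit(":", 1)[0]
    return (
        f"Could not reach {target} ({description}). "
        f"If http_proxy/https_proxy are set, try adding {host} to no_proxy."
    )

print(describe_unavailable(
    "172.16.0.9:49703",
    '{"description":"Failed to pick subchannel"}',
))
```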

3 reactions
rkooo567 commented, Nov 11, 2020

This looks like a bad bug. @mfitton can you take a look at it?
