Failed to forward task to node manager [ray]
In a private cluster setup, it started with Tune not using any nodes other than the head. We then took a few steps back and tried to run lower-level Ray snippets to identify the issue, and ended up getting:
I0220 14:08:16.620895 3799 node_manager.cc:2149] Failed to forward task ff237343da81f0ef334299ed90993428 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620929 3799 node_manager.cc:2149] Failed to forward task 60945591c66e8a468153e975ec34745b to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620961 3799 node_manager.cc:2149] Failed to forward task df3e50eb1c1503bb6f7446700473889e to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620995 3799 node_manager.cc:2149] Failed to forward task 21e58caf15d4be68b02b94621b96e6b5 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.621026 3799 node_manager.cc:2149] Failed to forward task 28cfb580f25d19ca87f3c56aa7bb1ed5 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
This seems related to #5223, but that issue was unfortunately closed without resolution, so I’m opening this one instead.
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.7.2
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I’m running on two local machines with Ubuntu 16.04, each inside a Docker container.
Steps followed:
- Disabled any firewall
- Opened all ports for the Docker container
- Ran:
ray start --head --redis-port=6379 --redis-shard-ports=6380 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Driver": 1.0}' --num-cpus=0
on the head, and:
ray start --redis-address=<IP_OF_HEAD>:6379 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Node": 1.0}'
on the node
- Verified (with telnet) that all of the ports above are accessible from master->node AND node->master
- Inside the head’s Python session, I ran:
import ray
import time
ray.init(redis_address="<HEAD_IP>:6379")  # <HEAD_IP> here is 10.67.0.201
ray.cluster_resources()
gives:
{'GPU': 2.0, 'Driver': 1.0, 'CPU': 12.0, 'Node': 1.0}
ray.nodes()
gives:
[
{'ClientID': 'e98dd25ed961a708adf90566c20abd2e77b4deb5', 'EntryType': 0, 'NodeManagerAddress': '10.67.0.201', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'GPU': 1.0, 'Driver': 1.0}},
{'ClientID': 'd38de4071fd540f93c2dd531915c327ef877ed8b', 'EntryType': 1, 'NodeManagerAddress': '10.67.0.163', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'CPU': 12.0, 'GPU': 1.0, 'Node': 1.0}}]
and
@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(100)]))
This just hangs for a while, until I get:
...
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 62a3c929e6460dfe47fe94ce0acb682b is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 7f4d78d43b607006628488e069753533 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 82e8b3db13d57242e0b7112954d04c83 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID c2668d440194bc623a1b95ab848731a1 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
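The NodeManagerAddress/NodeManagerPort pairs reported by ray.nodes() above are the endpoints the head’s raylet forwards tasks to, so one extra check that narrows this down is to confirm from the driver that each of them accepts a plain TCP connection. A minimal sketch, not from the original report, assuming the cluster from the steps above is still running:

import socket
import ray

ray.init(redis_address="<HEAD_IP>:6379")

# Try a plain TCP connection (the same thing telnet does) to every raylet's
# node manager and object manager ports, as reported by ray.nodes().
for node in ray.nodes():
    address = node["NodeManagerAddress"]
    for port in (node["NodeManagerPort"], node["ObjectManagerPort"]):
        try:
            socket.create_connection((address, port), timeout=3).close()
            print("reachable:", address, port)
        except OSError as exc:
            print("NOT reachable:", address, port, exc)

If the worker’s node manager port turns out to be unreachable from the head here, that would match the "Failed to forward task" lines above and point at the Docker networking setup rather than at Ray’s scheduler.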
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Top GitHub Comments
OK, I now managed to supply the worker node with its own external IP and can see it in the head node’s resource list. This didn’t solve the problem, though; I’m still getting the same "Failed to forward task" errors.
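One further check that might help at this point (a sketch, not from the original comment, assuming the custom resources from the ray start commands above): pin a task to the worker’s custom "Node" resource and give it num_cpus=0, so the only place it can run is the worker’s raylet. If even this never returns, nothing is being scheduled on the worker at all, which again points at the raylet-to-raylet connection.

import ray

ray.init(redis_address="<HEAD_IP>:6379")

# num_cpus=0 plus the custom "Node" resource means this task can only be
# scheduled on the worker machine, the only node that advertises "Node".
@ray.remote(num_cpus=0, resources={"Node": 1})
def where_am_i():
    return ray.services.get_node_ip_address()

# Should print the worker's IP (10.67.0.163) if its raylet can run tasks.
print(ray.get(where_am_i.remote()))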
@deepankar27 @wangzelong0663 It is probably related to networking. Try disabling your firewall, working around your proxy, and testing for open connections with common tools such as telnet.
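As a quick way to act on the proxy part of that advice (an aside, not from the original thread), it can be worth printing the proxy-related environment variables inside each container, since values inherited from the host or baked into the image are easy to miss:

import os

# Conventional proxy variable names; none of these are Ray-specific.
for var in ("http_proxy", "https_proxy", "no_proxy",
            "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
    print(var, "=", os.environ.get(var))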