Failed to forward task to node manager [ray]
In a private cluster setup, it started with Tune not using any nodes other than the head. We then took a few steps back and tried to run lower-level Ray snippets to identify the issue, and ended up getting:
I0220 14:08:16.620895 3799 node_manager.cc:2149] Failed to forward task ff237343da81f0ef334299ed90993428 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620929 3799 node_manager.cc:2149] Failed to forward task 60945591c66e8a468153e975ec34745b to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620961 3799 node_manager.cc:2149] Failed to forward task df3e50eb1c1503bb6f7446700473889e to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.620995 3799 node_manager.cc:2149] Failed to forward task 21e58caf15d4be68b02b94621b96e6b5 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
I0220 14:08:16.621026 3799 node_manager.cc:2149] Failed to forward task 28cfb580f25d19ca87f3c56aa7bb1ed5 to node manager d38de4071fd540f93c2dd531915c327ef877ed8b
This seems related to #5223, but that issue was unfortunately closed without resolution, so I’m opening this one instead.
Ray version and other system information (Python version, TensorFlow version, OS): Ray 0.7.2
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I’m running on two local machines with Ubuntu 16.04, each inside a Docker container.
Steps followed:
- Disabled any firewall
- Opened all ports for the Docker container
- Ran:
ray start --head --redis-port=6379 --redis-shard-ports=6380 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Driver": 1.0}' --num-cpus=0
on the head, and:
ray start --redis-address=<IP_OF_HEAD>:6379 --node-manager-port=12345 --object-manager-port=12346 --resources='{"Node": 1.0}'
on the node
- Verified (with telnet) that all of the ports above are accessible from master->node AND node->master
- Inside the head’s Python session, I ran:
import ray
import time
ray.init(redis_address="<HEAD_IP>:6379")  # <HEAD_IP> here is 10.67.0.201
ray.cluster_resources()
gives:
{'GPU': 2.0, 'Driver': 1.0, 'CPU': 12.0, 'Node': 1.0}
ray.nodes()
gives:
[
{'ClientID': 'e98dd25ed961a708adf90566c20abd2e77b4deb5', 'EntryType': 0, 'NodeManagerAddress': '10.67.0.201', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'GPU': 1.0, 'Driver': 1.0}},
{'ClientID': 'd38de4071fd540f93c2dd531915c327ef877ed8b', 'EntryType': 1, 'NodeManagerAddress': '10.67.0.163', 'NodeManagerPort': 12345, 'ObjectManagerPort': 12346, 'ObjectStoreSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/plasma_store', 'RayletSocketName': '/tmp/ray/session_2020-02-20_14-04-58_584612_3775/sockets/raylet', 'Resources': {'CPU': 12.0, 'GPU': 1.0, 'Node': 1.0}}]
and
@ray.remote
def f():
    time.sleep(0.01)
    return ray.services.get_node_ip_address()

# Get a list of the IP addresses of the nodes that have joined the cluster.
set(ray.get([f.remote() for _ in range(100)]))
This just hangs for a while, until I get:
...
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 62a3c929e6460dfe47fe94ce0acb682b is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 7f4d78d43b607006628488e069753533 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID 82e8b3db13d57242e0b7112954d04c83 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
2020-02-20 14:21:07,437 ERROR worker.py:1672 -- The task with ID c2668d440194bc623a1b95ab848731a1 is infeasible and cannot currently be executed. It requires {CPU,1.000000} for execution and {CPU,1.000000} for placement. Check the client table to view node resources.
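The NodeManagerAddress/NodeManagerPort pairs reported by ray.nodes() above are the endpoints the head’s raylet forwards tasks to, so one extra check that narrows this down is to confirm from the driver that each of them accepts a plain TCP connection. A minimal sketch, not from the original report, assuming the cluster from the steps above is still running:

import socket
import ray

ray.init(redis_address="<HEAD_IP>:6379")

# Try a plain TCP connection (the same thing telnet does) to every raylet's
# node manager and object manager ports, as reported by ray.nodes().
for node in ray.nodes():
    address = node["NodeManagerAddress"]
    for port in (node["NodeManagerPort"], node["ObjectManagerPort"]):
        try:
            socket.create_connection((address, port), timeout=3).close()
            print("reachable:", address, port)
        except OSError as exc:
            print("NOT reachable:", address, port, exc)

If the worker’s node manager port turns out to be unreachable from the head here, that would match the "Failed to forward task" lines above and point at the Docker networking setup rather than at Ray’s scheduler.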
If we cannot run your script, we cannot fix your issue.
- I have verified my script runs in a clean environment and reproduces the issue.
- I have verified the issue also occurs with the latest wheels.
Top GitHub Comments
OK, I now managed to supply the worker node with its own external IP and can see it in the head node’s resource list. This didn’t solve the problem, though; I’m still getting the same "Failed to forward task" errors.
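One further check that might help at this point (a sketch, not from the original comment, assuming the custom resources from the ray start commands above): pin a task to the worker’s custom "Node" resource and give it num_cpus=0, so the only place it can run is the worker’s raylet. If even this never returns, nothing is being scheduled on the worker at all, which again points at the raylet-to-raylet connection.

import ray

ray.init(redis_address="<HEAD_IP>:6379")

# num_cpus=0 plus the custom "Node" resource means this task can only be
# scheduled on the worker machine, the only node that advertises "Node".
@ray.remote(num_cpus=0, resources={"Node": 1})
def where_am_i():
    return ray.services.get_node_ip_address()

# Should print the worker's IP (10.67.0.163) if its raylet can run tasks.
print(ray.get(where_am_i.remote()))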
@deepankar27 @wangzelong0663 It is probably related to networking. Try disabling your firewall, working around your proxy, and testing for open connections with common tools such as telnet.
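As a quick way to act on the proxy part of that advice (an aside, not from the original thread), it can be worth printing the proxy-related environment variables inside each container, since values inherited from the host or baked into the image are easy to miss:

import os

# Conventional proxy variable names; none of these are Ray-specific.
for var in ("http_proxy", "https_proxy", "no_proxy",
            "HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY"):
    print(var, "=", os.environ.get(var))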