`ray.get` in cluster mode sometimes does not return
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
- Ray installed from (source or binary): wheels
- Ray version: 0.8.0.dev3
- Python version: 3.6
- Exact command to reproduce: With 2 nodes:
```python
In [1]: import ray

In [2]: ray.init(redis_address="localhost:6379")
2019-08-01 02:57:18,898 WARNING worker.py:1372 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Out[2]:
{'node_ip_address': '172.31.95.217',
 'redis_address': '172.31.95.217:6379',
 'object_store_address': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867'}

In [3]: @ray.remote
   ...: def test():
   ...:     print("hello!")
   ...:     return 123

In [4]: ray.get(test.remote())
(pid=2896) hello!
Out[4]: 123

In [5]: ray.get(test.remote())
(pid=2833, ip=172.31.89.59) hello!
```
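When the hang strikes, a call like `In [5]` above blocks indefinitely. One way to detect a wedged task instead of blocking forever is to poll with a timeout before fetching the result. Exercising real `ray` calls needs a live cluster, so here is the same guard pattern sketched with stdlib futures; the `get_with_timeout` helper is hypothetical, not a Ray API:

```python
import concurrent.futures

def get_with_timeout(future, timeout):
    """Return the result if it arrives within `timeout` seconds, else raise."""
    done, not_done = concurrent.futures.wait([future], timeout=timeout)
    if not_done:
        raise TimeoutError("task did not finish; the cluster may be wedged")
    return future.result()

with concurrent.futures.ThreadPoolExecutor() as pool:
    # Stand-in for test.remote(): a task that returns 123 promptly.
    f = pool.submit(lambda: 123)
    print(get_with_timeout(f, timeout=5))  # → 123
```

In Ray itself the analogous guard is `ray.wait`, which accepts a timeout and returns the subset of object IDs that are ready, so a stuck task shows up as an empty ready list rather than an indefinite block.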
Sometimes, `ray.get` does not return. The cluster was launched with the following config:
```yaml
# A unique identifier for the head node and workers of this cluster.
cluster_name: sgd-pytorch

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
min_workers: 1
initial_workers: 1
max_workers: 1
target_utilization_fraction: 0.9

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 20

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1f

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.xlarge
    ImageId: ami-0d96d570269578cd7

worker_nodes:
    InstanceType: c5.xlarge
    ImageId: ami-0d96d570269578cd7

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl

file_mounts: {}

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --object-store-memory=1000000000

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --object-store-memory=1000000000
```
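For reference, a config like the one above is typically driven with the Ray cluster launcher CLI; a sketch, assuming it is saved as `sgd-pytorch.yaml`:

```shell
# Assumes the YAML above is saved as sgd-pytorch.yaml
ray up sgd-pytorch.yaml      # launch the head node (workers follow per min/max_workers)
ray attach sgd-pytorch.yaml  # open an SSH session on the head node
ray down sgd-pytorch.yaml    # tear the cluster down when finished
```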
Issue Analytics
- State:
- Created 4 years ago
- Comments: 5 (5 by maintainers)
Top GitHub Comments
I traced down the real issue. This bug only occurs after https://github.com/ray-project/ray/pull/5120. However, the real issue seems to be that somehow `ray stop` is unable to stop the `default_worker` and `raylet` processes. I'm suspecting that `kill` failed for some reason, or that the gRPC thread doesn't respond to signals correctly. Checking…
Steps to reproduce the ray stop bug
https://gist.github.com/simon-mo/7194824e161f336d699d9a0bcb65c13e
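The hypothesis that the kill fails because a process masks or ignores the signal can be checked in isolation. A minimal stdlib sketch (not Ray code, POSIX only): a child that ignores SIGTERM survives the polite kill that a `ray stop`-style shutdown typically sends, and only dies on SIGKILL, which cannot be caught or ignored:

```python
import subprocess
import sys
import time

# Child that ignores SIGTERM, standing in for a worker whose signal
# handling is broken (e.g. a gRPC thread swallowing the signal).
child = subprocess.Popen([
    sys.executable, "-c",
    "import signal, time; signal.signal(signal.SIGTERM, signal.SIG_IGN); "
    "print('ready', flush=True); time.sleep(60)",
], stdout=subprocess.PIPE)
child.stdout.readline()          # wait until the handler is installed

child.terminate()                # SIGTERM: the polite kill
time.sleep(0.5)
print("survived SIGTERM:", child.poll() is None)   # → survived SIGTERM: True

child.kill()                     # SIGKILL cannot be ignored
child.wait(timeout=5)
print("gone after SIGKILL:", child.poll() is not None)  # → gone after SIGKILL: True
```

If Ray workers behave like this child, a forceful SIGKILL fallback (or fixing the signal handling in the worker) would be needed for `ray stop` to reliably clear the old processes.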