`ray.get` on cluster mode sometimes does not return


System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
  • Ray installed from (source or binary): wheels
  • Ray version: 0.8.0.dev3
  • Python version: 3.6
  • Exact command to reproduce: With 2 nodes:
In [1]: import ray

In [2]: ray.init(redis_address="localhost:6379")
2019-08-01 02:57:18,898	WARNING worker.py:1372 -- WARNING: Not updating worker name since `setproctitle` is not installed. Install this with `pip install setproctitle` (or ray[debug]) to enable monitoring of worker processes.
Out[2]:
{'node_ip_address': '172.31.95.217',
 'redis_address': '172.31.95.217:6379',
 'object_store_address': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867/sockets/raylet',
 'webui_url': None,
 'session_dir': '/tmp/ray/session_2019-08-01_02-55-05_728763_2867'}

In [3]: @ray.remote
   ...: def test():
   ...:     print("hello!")
   ...:     return 123
   ...:
   ...:

In [4]: ray.get(test.remote())
(pid=2896) hello!
Out[4]: 123

In [5]: ray.get(test.remote())
(pid=2833, ip=172.31.89.59) hello!

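Sometimes `ray.get` does not return. In the session above, the second call prints `hello!` from the worker node (ip=172.31.89.59), but no `Out[5]` ever appears.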
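To tell a hang apart from a slow task while reproducing this, the blocking call can be wrapped in a timeout on the driver side. The following is only a minimal sketch using the standard library: `call_with_timeout` is a hypothetical helper (not part of Ray), and it assumes that calling `ray.get` from a helper thread is acceptable in your Ray version.

```python
import threading


def call_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread and wait at most timeout_s seconds.

    Returns (finished, result). If fn() is still running when the timeout
    expires, returns (False, None); the thread is left running in the
    background and is discarded when the interpreter exits.
    """
    box = {}

    def runner():
        try:
            box["result"] = fn()
        except BaseException as exc:  # re-raised in the caller below
            box["error"] = exc

    thread = threading.Thread(target=runner, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        return False, None
    if "error" in box:
        raise box["error"]
    return True, box["result"]


# Hypothetical usage against the reproduction above (requires a running cluster):
#   finished, value = call_with_timeout(lambda: ray.get(test.remote()), 10.0)
#   if not finished:
#       print("ray.get did not return within 10 seconds")
```

Ray's own `ray.wait` also accepts a `timeout` argument and may be the more natural choice when only the readiness of the object matters; the wrapper above merely avoids depending on a particular Ray version.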

# A unique identifier for the head node and workers of this cluster.
cluster_name: sgd-pytorch

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers. min_workers defaults to 0.
min_workers: 1
initial_workers: 1
max_workers: 1

target_utilization_fraction: 0.9

# If a node is idle for this many minutes, it will be removed.
idle_timeout_minutes: 20
provider:
    type: aws
    region: us-east-1
    availability_zone: us-east-1f

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: c5.xlarge
    ImageId: ami-0d96d570269578cd7

worker_nodes:
    InstanceType: c5.xlarge
    ImageId: ami-0d96d570269578cd7

setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl

file_mounts: {}

# Custom commands that will be run on the head node after common setup.
head_setup_commands: []

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --object-store-memory=1000000000

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ray start --redis-address=$RAY_HEAD_IP:6379 --object-manager-port=8076 --object-store-memory=1000000000

Issue Analytics

  • State: closed
  • Created 4 years ago
  • Comments: 5 (5 by maintainers)

Top GitHub Comments

simon-mo commented, Aug 2, 2019

I tracked down the real issue. This bug only occurs after https://github.com/ray-project/ray/pull/5120. However, the underlying problem seems to be that `ray stop` is somehow unable to stop the `default_worker` and `raylet` processes. I suspect that `kill` failed for some reason, or that the gRPC thread doesn't respond to signals correctly.

Checking…
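One way to check the suspicion above is to look for surviving processes after running `ray stop` on a node. The sketch below is an assumption-laden illustration, not the maintainers' method: it parses `ps -eo pid,args` output on Linux and matches the substrings `raylet` and `default_worker` taken from the comment; the helper names are hypothetical.

```python
import subprocess

# Process-name fragments taken from the comment above. Matching on
# substrings of the command line is an assumption of this sketch.
RAY_PROCESS_MARKERS = ("raylet", "default_worker")


def find_leftover_ray_processes(ps_output):
    """Return (pid, command) pairs whose command line mentions a Ray process."""
    leftovers = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header row and malformed lines
        pid, command = int(parts[0]), parts[1]
        if any(marker in command for marker in RAY_PROCESS_MARKERS):
            leftovers.append((pid, command))
    return leftovers


def list_processes():
    """Capture the output of `ps -eo pid,args` as text (Linux)."""
    completed = subprocess.run(
        ["ps", "-eo", "pid,args"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return completed.stdout


if __name__ == "__main__":
    # Run this after `ray stop`; any line printed is a process that survived.
    for pid, command in find_leftover_ray_processes(list_processes()):
        print(pid, command)
```

If the script prints anything after `ray stop`, the listed processes are still alive and would have to be terminated by hand before restarting the cluster.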


