question-mark
Stuck on an issue?

Lightrun Answers was designed to reduce the constant googling that comes with debugging 3rd party libraries. It collects links to all the places you might be looking at while hunting down a tough bug.

And, if you’re still stuck at the end, we’re happy to hop on a call to see how we can help out.

[Bug] can't get ray cluster fully utilized

See original GitHub issue

Search before asking

  • I searched the issues and found no similar issues.

(I created this issue based on a chat with Sang Cho)

Ray Component

Ray Clusters

What happened + What you expected to happen

i created a cluster with 120 nodes, each having 8 vCPU and 32GB of memory. I triggered 10k task executions, and used the ray decorator assigning 0.15 vCPU to each task execution. While that should be enough work to fully utilize the cluster, many nodes are only lightly or not utilized at all (according to the ray dashboard).

Is this familiar behavior? Have you seen this in other scenarios of an equivalent size?

image

Versions / Dependencies

1.9

Reproduction script

I just call the task 10k times, which usually finished within a small, single-digit number of seconds. I also had tried the RAY_max_pending_lease_requests_per_scheduling_category == num_nodes environment variable setting, but didn’t see a major effect.

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State:open
  • Created 2 years ago
  • Comments:6 (4 by maintainers)

github_iconTop GitHub Comments

1reaction
scv119commented, Jun 24, 2022

we should tackle this at ray 2.1

1reaction
scv119commented, Jun 24, 2022

the fundamental reason is ray single node scheduler becomes a bottleneck for embarrassingly parallel workload like this. An simple workaround for now is to submit jobs through actors placed on different nodes.

Read more comments on GitHub >

github_iconTop Results From Across the Web

Dask on Ray + Ray Distributed Cluster - Workers not getting ...
Problem. It doesn't seem that any of my worker nodes are getting utilized. I am submitting 1000 graphs to the cluster at a...
Read more >
How to clear objects from the object store in ray?
The problem is that your remote function captures big_data_object_ref , and the ... So no need to use ray.get() in the remote function....
Read more >
Spark, Dask, and Ray: Choosing the Right Framework
Unlike Dask, however, Ray doesn't try to mimic the NumPy and Pandas ... clustering and parallelisation framework that can be used to build ......
Read more >
The Biggest Spark Troubleshooting Challenges in 2022
It's very hard to know how long a job “should” take, or where to start in optimizing a job or a cluster. Excessive...
Read more >
Autoscaling - Amazon EKS - AWS Documentation
Auto Scaling groups are suitable for a large number of use cases. ... An existing Amazon EKS cluster – If you don't have...
Read more >

github_iconTop Related Medium Post

No results found

github_iconTop Related StackOverflow Question

No results found

github_iconTroubleshoot Live Code

Lightrun enables developers to add logs, metrics and snapshots to live code - no restarts or redeploys required.
Start Free

github_iconTop Related Reddit Thread

No results found

github_iconTop Related Hackernoon Post

No results found

github_iconTop Related Tweet

No results found

github_iconTop Related Dev.to Post

No results found

github_iconTop Related Hashnode Post

No results found