"slow start" launching worker processes on new nodes

What is the problem?

This is with Ray 1.0.1 on Ubuntu 20.04 on AWS c5a servers.

I create a cluster with 13 worker machines, each a c5a.16xlarge node on AWS with 64 vCPUs (13 × 64 = 832 vCPUs in total). The bottom of the Ray dashboard shows the number of worker processes next to the number of cores. It takes between 3.5 and 4 minutes for the number of worker processes to reach the number of cores.

Under Ray 1.0.0, the worker processes were launched at the start, and I’d immediately have full CPU utilization across my cluster. Now it takes nearly four minutes. Once some other issues get resolved, I’d like to increase the number of vCPUs by a factor of ten or more, at which point this “slow start” behavior would be the gating factor in my ability to achieve scalable performance.

(Right now, some other unrelated bugs are limiting my ability to add more workers.)

Reproduction (REQUIRED)

I ran a demo for @rkooo567 so he could see the behavior.

Desired fix? Some way of telling Ray to start worker processes immediately when nodes are launched.

Issue Analytics

Created 3 years ago
Comments: 22 (20 by maintainers)

Top GitHub Comments

ericl commented, Nov 18, 2020

@rkooo567 can we come up with a simpler repro? The problem with a complex repro is that the issue can always be an issue in the application code. Have you tried reproducing this on that cluster with a simple wave of tasks?

Another thing to try is reproducing the issue on a different cluster. If it does not reproduce there, there might be some environment-specific problem (a slow NFS mount, etc.).
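A "simple wave of tasks" of the kind suggested above could look roughly like the sketch below. It is not the reproduction from the issue. It only assumes the standard Ray calls `ray.init`, `ray.remote` and `ray.get`; the helper names `probe`, `summarize` and `run_wave` are made up for illustration, and the start delay relies on wall clocks being roughly in sync across nodes.

```python
import os
import socket
import time


def probe(sleep_s=1.0):
    """Task body: report which host and process ran it, and when it started."""
    started = time.time()
    time.sleep(sleep_s)  # hold the worker briefly so tasks spread across processes
    return socket.gethostname(), os.getpid(), started


def summarize(results, t0):
    """Return (distinct worker processes, start delay in seconds of the last task)."""
    workers = {(host, pid) for host, pid, _ in results}
    last_start_delay = max(started for _, _, started in results) - t0
    return len(workers), last_start_delay


def run_wave(num_tasks, sleep_s=1.0):
    """Submit one wave of trivial tasks to an already running cluster."""
    import ray  # imported lazily so the helpers above work without Ray installed

    ray.init(address="auto")  # assumption: run on a node of an existing cluster
    remote_probe = ray.remote(probe)
    t0 = time.time()
    results = ray.get([remote_probe.remote(sleep_s) for _ in range(num_tasks)])
    return summarize(results, t0)
```

On the cluster described in the issue, `run_wave(832)` would submit one task per vCPU. If the last task starts minutes after submission even though each task only sleeps, the slow start shows up without any application code involved. A short `sleep_s` lets one worker process run several tasks in turn, so the distinct-worker count is only meaningful when the sleep is long relative to the scheduling time.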

ericl commented, Nov 25, 2020

@kfstorm sure, I’ll take this one over. Hopefully we don’t need to revert.