"slow start" launching worker processes on new nodes

What is the problem?

This is with Ray 1.0.1 on Ubuntu 20.04 on AWS c5a servers.

I create a cluster with 13 worker nodes, each a c5a.16xlarge instance on AWS with 64 vCPUs (13 * 64 = 832 vCPUs in total). On the Ray dashboard I can compare the number of worker processes, shown at the bottom, with the number of cores. It takes between 3.5 and 4 minutes for the number of worker processes to match the number of cores.

Under Ray 1.0.0, the worker processes were launched at the start, and I’d immediately have full CPU utilization across my cluster. Now it takes nearly four minutes. Once some other issues get resolved, I’d like to increase the number of vCPUs by a factor of ten or more, at which point this “slow start” behavior would be the gating factor in my ability to achieve scalable performance.

(Right now, some other unrelated bugs are limiting my ability to add more workers.)

Reproduction (REQUIRED)

I ran a demo for @rkooo567 so he could see the behavior.

Desired fix? Some way of telling Ray to start worker processes immediately when nodes are launched.

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 22 (20 by maintainers)

Top GitHub Comments

1 reaction
ericl commented, Nov 18, 2020

@rkooo567 can we come up with a simpler repro? The problem with a complex repro is that the issue can always be an issue in the application code. Have you tried reproducing this on that cluster with a simple wave of tasks?

Another thing to try is reproducing the issue on a different cluster. If it does not reproduce there, there might be some environment-specific problem (slow NFS mount, etc.).
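
For illustration, a "simple wave of tasks" repro might look roughly like the sketch below. It is not code from the issue: the names `ramp_up_seconds`, `run_wave`, and `record_start` are hypothetical, and it assumes the script runs on the head node of an already-running cluster, that node clocks are reasonably in sync, and that each task sleeps longer than the expected ramp-up so every task needs its own worker process.

```python
import math
import time


def ramp_up_seconds(start_times, submitted_at, fraction=1.0):
    """Seconds from submission until `fraction` of the tasks had started."""
    if not start_times:
        raise ValueError("no task start times recorded")
    ordered = sorted(start_times)
    # Position of the task whose start marks the requested fraction.
    index = max(0, math.ceil(len(ordered) * fraction) - 1)
    return ordered[index] - submitted_at


def run_wave(num_tasks, task_seconds=300.0):
    """Submit one wave of sleeping tasks and collect their start times."""
    import ray  # imported here so the helper above works without Ray installed

    ray.init(address="auto")  # connect to the existing cluster

    @ray.remote(num_cpus=1)
    def record_start(sleep_for):
        started = time.time()  # wall clock on the worker's node (assumes synced clocks)
        time.sleep(sleep_for)  # keep the CPU slot occupied so later tasks need new workers
        return started

    submitted_at = time.time()
    start_times = ray.get([record_start.remote(task_seconds) for _ in range(num_tasks)])
    return start_times, submitted_at


def main():
    # 832 = 13 nodes * 64 vCPUs, the cluster size described in the report.
    starts, submitted = run_wave(num_tasks=832)
    print(f"50% of tasks started after {ramp_up_seconds(starts, submitted, 0.5):.1f}s")
    print(f"100% of tasks started after {ramp_up_seconds(starts, submitted):.1f}s")
```

Calling `main()` on the head node prints how long the wave took to get every task running. If the last tasks start minutes after submission with nothing but `sleep` in the task body, application code can be ruled out as the cause.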

0 reactions
ericl commented, Nov 25, 2020

@kfstorm sure, I’ll take this one over. Hopefully we don’t need to revert.
