
[tune] workers sit idle between training iterations when running on autoscaler

See original GitHub issue

System information

Describe the problem

I’m seeing an issue where all of my Tune trials running on workers sit idle most of the time. For example, in the following case, the cluster was started about 30 minutes before this log, and only the first two trials (the ones running on the head node) are making reasonable progress. The rest of the trials advance at roughly 1 step per 10 minutes, even though, based on time_total_s, each iteration takes only about 20 seconds (a minimal sketch of a comparable launch follows the trial list below).

RUNNING trials:
- id=f38ff9fe-seed=7678:       RUNNING [pid=3482], 1895 s, 58 iter, 58000 ts
- id=314e0483-seed=291:        RUNNING [pid=3483], 1897 s, 58 iter, 58000 ts
- id=3b333c47-seed=5763:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-9aea164f pid=3161], 92 s, 4 iter, 4000 ts
- id=512ae016-seed=2978:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-83ac45ed pid=3151], 93 s, 4 iter, 4000 ts
- id=8388ec55-seed=7037:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-087e6b1d pid=3146], 92 s, 4 iter, 4000 ts
... 71 not shown
- id=5d69cff2-seed=1606:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-40e01303 pid=3123], 79 s, 3 iter, 3000 ts
- id=0bf67558-seed=7161:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-dcdda17b pid=3150], 77 s, 3 iter, 3000 ts
- id=2d8d3f48-seed=8881:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-4774d905 pid=3192], 85 s, 3 iter, 3000 ts
- id=e805dfe9-seed=1388:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-40e01303 pid=3119], 78 s, 3 iter, 3000 ts
- id=e1bc2581-seed=6844:       RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-4774d905 pid=3193], 85 s, 3 iter, 3000 ts
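
For reference, here is a minimal sketch of a comparable Tune launch. The original training script is not included in the issue, so everything below is an illustrative assumption: it uses the later tune.run() API rather than the 0.6.x run_experiments() interface, and the trainable, sample count, and resource settings are stand-ins.

import time

import ray
from ray import tune

def train_fn(config):
    # Stand-in trainable: each iteration does ~20 s of work, matching
    # the per-iteration time_total_s reported above.
    for _ in range(100):
        time.sleep(20)
        tune.report(mean_loss=0.0)  # one Tune training iteration per report

if __name__ == "__main__":
    ray.init(address="auto")  # attach to the autoscaled cluster
    tune.run(
        train_fn,
        num_samples=80,  # roughly the number of RUNNING trials in the log
        resources_per_trial={"cpu": 1},
    )

Since each trial is a single-CPU task, a trial placed on an autoscaled worker should take about as long per iteration as one on the head node, which is what makes the 10-minute steps above surprising.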

Here are some diagnostics for one of the workers from the GCP console: [image: GCP console CPU utilization graph for a worker instance, showing the CPUs mostly idle]

The CPU diagnostic shows that the CPUs sit idle most of the time, which matches the slowness showing up in the Ray results. The same thing happens both on 0.6.2 and on https://github.com/ray-project/ray/pull/4154/commits/2b623a118db889935a53cafe4f2441b8cf430fcc, and I don’t see any problems when running things locally.
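
A rough way to cross-check those numbers directly on a worker node, using psutil (an assumption; the figures above come from the GCP console rather than from any script in the issue):

import psutil

# Sample per-core CPU utilization every 10 s for ~5 minutes. On an
# affected worker this should show the cores mostly idle, with only
# short bursts of training work.
for _ in range(30):
    per_core = psutil.cpu_percent(interval=10, percpu=True)
    avg = sum(per_core) / len(per_core)
    print(f"avg={avg:5.1f}%  per-core={per_core}")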

I’m not sure whether this is a bug; is it possible that, for example, a slow head node could somehow stall the workers?

Issue Analytics

  • State: closed
  • Created: 5 years ago
  • Comments: 14 (12 by maintainers)

Top GitHub Comments

1 reaction
stephanie-wang commented, Mar 4, 2019

Hey @hartikainen, we merged #4242. Do you want to try it out? Thanks!

0 reactions
richardliaw commented, Mar 21, 2019

I think we closed this issue with #4379 and #4302; feel free to reopen if otherwise.

Read more comments on GitHub >

Top Results From Across the Web

Model training remain idle for 12hrs! - Ray
we started training a model in ray cluster of 2 workers model was trained for several epochs. suddenly it remains idle, neither trained ......
Read more >
Running Ray Tune 1 Iteration At A Time - ADocLib
Ray Tune periodically checkpoints the experiment state so that it can be restarted ... [tune] workers sit idle between training iterations when running...
Read more >
Scaling Applications on Kubernetes with Ray | by Vishnu Deva
From my previous experience with Ray Tune, connectivity was never an issue — we launch a Ray cluster on EC2, run an ML...
Read more >
A Predictive Autoscaler for Elastic Batch Jobs - arXiv
from the cloud provider and do the reverse when instances become idle. The system is built on the Kubernetes(Burns.
Read more >
Cioppino: Multi-Tenant Crowd Management
workers between applications to improve performance and match workers with tasks they enjoy most ... Too large, and workers will sit idle, incurring...
Read more >
