[tune] workers sit idle between training iterations when running on autoscaler
System information
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
- Ray installed from (source or binary): Both
- Ray version: 0.6.2 and https://github.com/ray-project/ray/pull/4154/commits/2b623a118db889935a53cafe4f2441b8cf430fcc
- Python version: 3.6
- Exact command to reproduce: n/a
Describe the problem
I’m seeing an issue where all of my tune trials running on workers sit idle most of the time. For example, in the following case, the cluster was started about 30 minutes before this log, and only the first two trials (which are running on the head node) are making reasonable progress. The rest of the trials are running at about 1 step per 10 minutes, even though, based on time_total_s, an iteration takes only about 20 seconds.
RUNNING trials:
- id=f38ff9fe-seed=7678: RUNNING [pid=3482], 1895 s, 58 iter, 58000 ts
- id=314e0483-seed=291: RUNNING [pid=3483], 1897 s, 58 iter, 58000 ts
- id=3b333c47-seed=5763: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-9aea164f pid=3161], 92 s, 4 iter, 4000 ts
- id=512ae016-seed=2978: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-83ac45ed pid=3151], 93 s, 4 iter, 4000 ts
- id=8388ec55-seed=7037: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-087e6b1d pid=3146], 92 s, 4 iter, 4000 ts
... 71 not shown
- id=5d69cff2-seed=1606: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-40e01303 pid=3123], 79 s, 3 iter, 3000 ts
- id=0bf67558-seed=7161: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-dcdda17b pid=3150], 77 s, 3 iter, 3000 ts
- id=2d8d3f48-seed=8881: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-4774d905 pid=3192], 85 s, 3 iter, 3000 ts
- id=e805dfe9-seed=1388: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-40e01303 pid=3119], 78 s, 3 iter, 3000 ts
- id=e1bc2581-seed=6844: RUNNING [ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-4774d905 pid=3193], 85 s, 3 iter, 3000 ts
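The gap between training time and wall-clock time in the log above can be made explicit with a small script. This is only a rough sketch: the regular expression is written against the status-line format shown in this log, the helper names (`parse_status_line`, `idle_summary`) are made up for illustration and are not part of Ray, and the 1800 s wall-clock value is an assumption taken from the "about 30 minutes" estimate.

```python
import re

# Matches the per-trial status lines as they appear in the log above.
# The format is inferred from this log only, not from any Ray API.
STATUS_RE = re.compile(
    r"id=(?P<trial_id>[^:]+): RUNNING "
    r"\[(?P<host>.*?)\s*pid=(?P<pid>\d+)\], "
    r"(?P<train_s>\d+) s, (?P<iters>\d+) iter, (?P<ts>\d+) ts"
)


def parse_status_line(line):
    """Return the fields of one status line as a dict, or None if no match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {
        "trial_id": match.group("trial_id"),
        "host": match.group("host"),  # empty string when no host is printed
        "pid": int(match.group("pid")),
        "train_s": int(match.group("train_s")),
        "iters": int(match.group("iters")),
        "ts": int(match.group("ts")),
    }


def idle_summary(row, wall_clock_s):
    """Compare reported training time with elapsed wall-clock time."""
    return {
        # Seconds of training time per iteration (from the "N s" field).
        "train_s_per_iter": row["train_s"] / row["iters"],
        # Seconds of wall-clock time per iteration.
        "wall_s_per_iter": wall_clock_s / row["iters"],
        # Share of wall-clock time not accounted for by training, floored at 0.
        "idle_fraction": max(0.0, 1.0 - row["train_s"] / wall_clock_s),
    }


# Assumption: the cluster had been up for roughly 30 minutes (1800 s).
WALL_CLOCK_S = 1800

worker_line = (
    "- id=3b333c47-seed=5763: RUNNING "
    "[ray-20190301t094602-31c17c-dclaw3-screwv2-v0-worker-9aea164f pid=3161], "
    "92 s, 4 iter, 4000 ts"
)
head_line = "- id=f38ff9fe-seed=7678: RUNNING [pid=3482], 1895 s, 58 iter, 58000 ts"

for line in (worker_line, head_line):
    row = parse_status_line(line)
    print(row["trial_id"], idle_summary(row, WALL_CLOCK_S))
```

For the worker trial this gives 23 s of training time per iteration against several minutes of wall-clock time per iteration, so under the 30-minute assumption the process spent well over 90% of its time outside training.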
Here are some diagnostics for one of the workers from the GCP console:
The CPU diagnostic shows that the CPUs sit idle most of the time, which matches the slowness showing up in the Ray results. The same thing happens on both 0.6.2 and https://github.com/ray-project/ray/pull/4154/commits/2b623a118db889935a53cafe4f2441b8cf430fcc, and I don’t see any problems when running things locally.
I’m not sure whether this is a bug. Is it possible that, for example, a slow head node could somehow stall the workers?
Issue Analytics
- Created 5 years ago
- Comments:14 (12 by maintainers)
Top GitHub Comments
Hey @hartikainen, we merged #4242. Do you want to try it out? Thanks!
I think we closed this issue with #4379 and #4302; feel free to reopen if otherwise.