[autoscaler][tune] only two instances are spawned and one used on GCP
Problem
I am trying to use Ray Tune with the autoscaler on a GCP cluster with up to 3 additional workers (see the config in the Reproduction section):
- Start the cluster using ray up ray_cluster.yaml -y and wait until the head node is up and running.
- Submit the Tune script using ray submit ray_cluster.yaml tune_script.py --start.
- tune_script.py starts on the head node; Ray spawns the first worker instance.
- The first worker instance is spawned and starts running tune_script.py's trials; Ray spawns the second worker instance.
- The second worker instance is spawned. This is where the error happens, since tune_script.py is never executed there and the autoscaling stops.
Here are the logs of a session where this bug happens (at around 20:58:13): ray_logs.tar.gz
monitor.err
2020-11-24 20:58:08,107 INFO autoscaler.py:591 -- Cluster status: 2/2 target nodes (0 pending)
- MostDelayedHeartbeats: {'10.164.0.50': 1.2863950729370117, '10.164.0.49': 1.0419065952301025, '10.164.0.48': 1.0417735576629639}
- NodeIdleSeconds: Min=1 Mean=1 Max=1
- NumNodesConnected: 2
- NumNodesUsed: 2.0
- ResourceUsage: 4.0/4.0 CPU, 0.0 GiB/10.01 GiB memory, 0.0 GiB/3.15 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1
2020-11-24 20:58:13,164 INFO autoscaler.py:591 -- Cluster status: 2/2 target nodes (0 pending)
- MostDelayedHeartbeats: {'10.164.0.49': 1.05641508102417, '10.164.0.50': 1.056370735168457, '10.164.0.48': 1.056330919265747}
- NodeIdleSeconds: Min=1 Mean=2 Max=5
- NumNodesConnected: 3
- NumNodesUsed: 2.0
- ResourceUsage: 4.0/6.0 CPU, 0.0 GiB/15.47 GiB memory, 0.0 GiB/4.72 GiB object_store_memory
- TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1
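To cross-check what monitor.err reports, the cluster state can also be queried from a driver on the head node. A minimal sketch (assuming it is run inside the Ray container on the head node, e.g. after ray attach ray_cluster.yaml):

import ray

# Connect to the already-running cluster on the head node.
ray.init(address='auto')

# Aggregate resources the cluster currently reports (CPU, memory, ...).
print(ray.cluster_resources())

# One entry per node that has joined; 'Alive' marks still-connected nodes.
for node in ray.nodes():
    print(node['NodeManagerAddress'], node['Alive'], node['Resources'])

With three connected n1-standard-2 nodes this should show 6.0 CPU in total, matching the 4.0/6.0 reading in the second log entry above.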
System
Using the instance image and Docker image from the example config:
- Image: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
- Docker image: rayproject/ray:latest-gpu
- OS: Ubuntu 18.04
- Ray: 1.0.1
- Python: 3.7.7
Reproduction (REQUIRED)
ray_cluster.yaml (set project_id):
cluster_name: default
min_workers: 0
max_workers: 3
initial_workers: 0
autoscaling_mode: default
docker:
    image: "rayproject/ray:latest-gpu"
    container_name: "ray_container"
    pull_before_run: True
    run_options: []
target_utilization_fraction: 0.8
idle_timeout_minutes: 10
provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: null
auth:
    ssh_user: ubuntu
head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    scheduling:
      - onHostMaintenance: TERMINATE
worker_nodes:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    scheduling:
      - onHostMaintenance: TERMINATE
cluster_synced_files: []
file_mounts_sync_continuously: True
initialization_commands: []
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
setup_commands: []
head_setup_commands:
    - pip install google-api-python-client==1.7.8
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --head
        --port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076
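A note on the numbers in monitor.err: each n1-standard-2 node provides 2 CPUs, and with resources_per_trial={"cpu": 2} every running trial occupies a full node. Assuming the legacy autoscaler in this release scales up once used/total CPU exceeds target_utilization_fraction (0.8 here), the logged 4.0/6.0 CPU reading sits below the threshold, which would be consistent with the stall. A back-of-the-envelope check:

# Utilization arithmetic for the second monitor.err entry above.
# Assumption: scale-up triggers when utilization > target_utilization_fraction.
cpus_per_node = 2        # n1-standard-2
nodes_connected = 3      # NumNodesConnected: 3 (head + 2 workers)
cpus_used = 4.0          # two trials x resources_per_trial={"cpu": 2}

utilization = cpus_used / (cpus_per_node * nodes_connected)
print(round(utilization, 3))    # 0.667
print(utilization > 0.8)        # False -> no further scale-up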
tune_script.py
from ray import tune
import ray
from ray.tune.integration.docker import DockerSyncer


def trainable(config):
    import time
    t1 = time.time()
    while True:
        # NOTE: this loop never calls tune.report() -- see the resolution below.
        print(time.time() - t1)
        time.sleep(1)


def tune_mnist_pbt():
    ray.init(address='auto')
    tune.run(
        trainable,
        resources_per_trial={"cpu": 2},
        config={},
        sync_config=tune.SyncConfig(sync_to_driver=DockerSyncer),
        num_samples=100,
        scheduler=None,
        fail_fast=True,
        queue_trials=True,
        reuse_actors=True,
        name="example")


if __name__ == '__main__':
    tune_mnist_pbt()
Resolution
@richardliaw: Oh, I see; can you try adding tune.report in your training function? Tune doesn't launch new jobs unless it receives information from its workers.

Reporter: That's it, thank you so much @richardliaw!! I also had this problem in my real training script, since one epoch runs for several hours, so tune.report is not called within that time. An easy workaround is to add callbacks within the training loop.
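For reference, a minimal sketch of the suggested fix, with an arbitrary metric name (elapsed is not from the original script): reporting from inside the loop gives the Tune driver the per-trial results it needs to keep launching new jobs.

import time

from ray import tune


def trainable(config):
    t1 = time.time()
    while True:
        # Report a metric on every iteration so the driver keeps
        # receiving results from this trial.
        tune.report(elapsed=time.time() - t1)
        time.sleep(1)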