[autoscaler][tune] only two instances are spawned and one used on GCP

Problem

I am trying to use ray tune with the autoscaler on a GCP cluster with up to 3 additional workers (see config in Reproduction section):

1. Start the cluster with ray up ray_cluster.yaml -y.
2. Wait until the head node is up and running.
3. Submit the Tune script with ray submit ray_cluster.yaml tune_script.py --start.
4. tune_script.py starts on the head node, and Ray spawns the first worker instance.
5. The first worker instance comes up and starts running tune_script.py, and Ray spawns the second worker instance.
6. The second worker instance comes up. This is where the error happens: tune_script.py is never executed on it, and autoscaling stops.

Here are the logs of a session where this bug happens at around 20:58:13: ray_logs.tar.gz

monitor.err

2020-11-24 20:58:08,107	INFO autoscaler.py:591 -- Cluster status: 2/2 target nodes (0 pending)
 - MostDelayedHeartbeats: {'10.164.0.50': 1.2863950729370117, '10.164.0.49': 1.0419065952301025, '10.164.0.48': 1.0417735576629639}
 - NodeIdleSeconds: Min=1 Mean=1 Max=1
 - NumNodesConnected: 2
 - NumNodesUsed: 2.0
 - ResourceUsage: 4.0/4.0 CPU, 0.0 GiB/10.01 GiB memory, 0.0 GiB/3.15 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1

2020-11-24 20:58:13,164	INFO autoscaler.py:591 -- Cluster status: 2/2 target nodes (0 pending)
 - MostDelayedHeartbeats: {'10.164.0.49': 1.05641508102417, '10.164.0.50': 1.056370735168457, '10.164.0.48': 1.056330919265747}
 - NodeIdleSeconds: Min=1 Mean=2 Max=5
 - NumNodesConnected: 3
 - NumNodesUsed: 2.0
 - ResourceUsage: 4.0/6.0 CPU, 0.0 GiB/15.47 GiB memory, 0.0 GiB/4.72 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1
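
The two status blocks above can be compared mechanically. The following is a minimal sketch and not part of Ray: parse_cluster_status is a hypothetical helper that assumes the status format shown in monitor.err and extracts the node and CPU counts from one block.

```python
import re


def parse_cluster_status(block: str) -> dict:
    """Extract node and CPU counts from one autoscaler status block.

    Hypothetical helper; it assumes the line format seen in monitor.err above.
    """
    connected = re.search(r"NumNodesConnected:\s*(\d+)", block)
    used = re.search(r"NumNodesUsed:\s*([\d.]+)", block)
    cpu = re.search(r"ResourceUsage:\s*([\d.]+)/([\d.]+) CPU", block)
    if not (connected and used and cpu):
        raise ValueError("not a recognised status block")
    cpu_used, cpu_total = float(cpu.group(1)), float(cpu.group(2))
    return {
        "nodes_connected": int(connected.group(1)),
        "nodes_used": float(used.group(1)),
        "cpu_used": cpu_used,
        "cpu_total": cpu_total,
        "cpu_idle": cpu_total - cpu_used,
    }


# Lines copied from the 20:58:13 status block in the issue.
STATUS_20_58_13 = """\
 - NumNodesConnected: 3
 - NumNodesUsed: 2.0
 - ResourceUsage: 4.0/6.0 CPU, 0.0 GiB/15.47 GiB memory, 0.0 GiB/4.72 GiB object_store_memory
"""

status = parse_cluster_status(STATUS_20_58_13)
print(status)
```

For the 20:58:13 block this reports 3 connected nodes, 2 used nodes, and 2 idle CPUs: the third node has joined, and there is room for one more 2-CPU trial, but none is scheduled.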

System

Using the instance image and Docker image from the example config:

Image: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
Docker image: rayproject/ray:latest-gpu

Ubuntu 18.04
ray 1.0.1
python 3.7.7

Reproduction (REQUIRED)

ray_cluster.yaml (set project_id)

cluster_name: default
min_workers: 0
max_workers: 3
initial_workers: 0
autoscaling_mode: default

docker:
    image: "rayproject/ray:latest-gpu"
    container_name: "ray_container"

    pull_before_run: True
    run_options: []

target_utilization_fraction: 0.8
idle_timeout_minutes: 10

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: null

auth:
    ssh_user: ubuntu

head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

worker_nodes:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

cluster_synced_files: []
file_mounts_sync_continuously: True
initialization_commands: []

rsync_exclude:
    - "**/.git"
    - "**/.git/**"


rsync_filter:
    - ".gitignore"

setup_commands: []

head_setup_commands:
  - pip install google-api-python-client==1.7.8

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

tune_script.py

from ray import tune
import ray
from ray.tune.integration.docker import DockerSyncer


def trainable(config):
    import time

    t1 = time.time()
    while True:
        print(time.time() - t1)
        time.sleep(1)

def tune_mnist_pbt():
    ray.init(address='auto')

    tune.run(
        trainable,
        resources_per_trial={
            "cpu": 2
        },
        config={},
        sync_config=tune.SyncConfig(sync_to_driver=DockerSyncer),
        num_samples=100,
        scheduler=None,
        fail_fast=True,
        queue_trials=True,
        reuse_actors=True,
        name="example")


if __name__ == '__main__':
    tune_mnist_pbt()

Issue Analytics

Created: 3 years ago
Comments: 7 (4 by maintainers)

Top GitHub Comments

tuxa commented, Nov 25, 2020

That's it, thank you so much @richardliaw!! I also had this problem in my real training script: one epoch runs for several hours, so tune.report is not called within that time. An easy workaround is to add callbacks within the training loop.
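
One way to apply this workaround to a long epoch is to report from inside the batch loop, but rate-limited so that Tune is not flooded with results. The sketch below is an illustration under stated assumptions, not code from the issue: ThrottledReporter and train_one_epoch are hypothetical names, the loss computation is a placeholder, and report_fn is expected to be tune.report, which accepts keyword metrics in Ray 1.0.x.

```python
import time


class ThrottledReporter:
    """Call report_fn at most once every min_interval_s seconds.

    Hypothetical helper; report_fn is assumed to be tune.report or any
    callable that accepts keyword metrics.
    """

    def __init__(self, report_fn, min_interval_s=30.0, clock=time.monotonic):
        self._report_fn = report_fn
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last = None  # time of the last report, None before the first one

    def maybe_report(self, **metrics):
        """Report if enough time has passed; return True if a report was sent."""
        now = self._clock()
        if self._last is None or now - self._last >= self._min_interval_s:
            self._report_fn(**metrics)
            self._last = now
            return True
        return False


def train_one_epoch(batches, reporter, epoch=0):
    """Stand-in for a long epoch that reports progress between batches."""
    for step, batch in enumerate(batches):
        loss = float(batch)  # placeholder for the real training step
        reporter.maybe_report(epoch=epoch, step=step, loss=loss)
```

Inside a Tune trainable this would be used as reporter = ThrottledReporter(tune.report), followed by train_one_epoch(batches, reporter) for each epoch. Keep in mind that Tune treats every report as one training iteration, so schedulers and stopping criteria based on training_iteration will see these intermediate reports too.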

richardliaw commented, Nov 25, 2020

oh i see; can you try adding tune.report in your training function? Tune doesn’t launch new jobs unless it receives information from its workers.
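
Applied to the reproduction script, the suggestion could look like the sketch below. It assumes Ray 1.0.x, where tune.report takes keyword metrics; later Ray releases changed the reporting API. make_trainable, run_on_cluster, the heartbeat and elapsed_s metric names, and the max_steps config key are all hypothetical and only there so the loop can be tried locally without a cluster. The sync_config and the other tune.run arguments from the original script are left out for brevity.

```python
import time


def make_trainable(report_fn, sleep_s=1.0):
    """Build a trainable that reports once per loop iteration.

    report_fn is assumed to be tune.report when running under Tune.
    """

    def trainable(config):
        start = time.time()
        step = 0
        # Hypothetical "max_steps" key: None means loop until Tune stops the trial.
        max_steps = config.get("max_steps")
        while max_steps is None or step < max_steps:
            time.sleep(sleep_s)
            step += 1
            # Each report gives Tune information from this worker.
            report_fn(heartbeat=step, elapsed_s=time.time() - start)

    return trainable


def run_on_cluster():
    """Run on an existing cluster, as in the issue (assumes Ray 1.0.x)."""
    import ray
    from ray import tune

    ray.init(address="auto")
    tune.run(
        make_trainable(tune.report),
        resources_per_trial={"cpu": 2},
        config={},
        num_samples=100,
        queue_trials=True,
        name="example",
    )
```

The only change that matters compared with the original trainable is the report call inside the loop; the original loop only printed the elapsed time and never reported to Tune.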