[autoscaler][tune] only two instances are spawned and one used on GCP

Problem

I am trying to use Ray Tune with the autoscaler on a GCP cluster with up to 3 additional workers (see the config in the Reproduction section):

  1. Start the cluster using ray up ray_cluster.yaml -y
  2. Wait until the head node is up and running
  3. Submit the tune script using ray submit ray_cluster.yaml tune_script.py --start
  4. tune_script.py starts on the head node, and Ray spawns the first worker instance
  5. The first worker instance comes up and starts running tune_script.py, and Ray spawns the second worker instance
  6. The second worker instance comes up. This is where the error happens: tune_script.py is never executed on it, and autoscaling stops

Here are the logs of a session where this bug happens at around 20:58:13: ray_logs.tar.gz

monitor.err

2020-11-24 20:58:08,107	INFO autoscaler.py:591 -- Cluster status: 2/2 target nodes (0 pending)
 - MostDelayedHeartbeats: {'10.164.0.50': 1.2863950729370117, '10.164.0.49': 1.0419065952301025, '10.164.0.48': 1.0417735576629639}
 - NodeIdleSeconds: Min=1 Mean=1 Max=1
 - NumNodesConnected: 2
 - NumNodesUsed: 2.0
 - ResourceUsage: 4.0/4.0 CPU, 0.0 GiB/10.01 GiB memory, 0.0 GiB/3.15 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1

2020-11-24 20:58:13,164	INFO autoscaler.py:591 -- Cluster status: 2/2 target nodes (0 pending)
 - MostDelayedHeartbeats: {'10.164.0.49': 1.05641508102417, '10.164.0.50': 1.056370735168457, '10.164.0.48': 1.056330919265747}
 - NodeIdleSeconds: Min=1 Mean=2 Max=5
 - NumNodesConnected: 3
 - NumNodesUsed: 2.0
 - ResourceUsage: 4.0/6.0 CPU, 0.0 GiB/15.47 GiB memory, 0.0 GiB/4.72 GiB object_store_memory
 - TimeSinceLastHeartbeat: Min=1 Mean=1 Max=1
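
The two status snapshots above can be compared with a few lines of Python. This is only a sketch for reading the excerpt: the helper name `idle_cpus` and the regular expression are illustrative and assume the `ResourceUsage: used/total CPU` line format shown in these logs.

```python
import re

# The two ResourceUsage lines copied from the monitor.err excerpt above.
STATUS_LINES = """
 - ResourceUsage: 4.0/4.0 CPU, 0.0 GiB/10.01 GiB memory, 0.0 GiB/3.15 GiB object_store_memory
 - ResourceUsage: 4.0/6.0 CPU, 0.0 GiB/15.47 GiB memory, 0.0 GiB/4.72 GiB object_store_memory
"""


def idle_cpus(status_text):
    """Return total minus used CPUs for each ResourceUsage line found."""
    pairs = re.findall(r"ResourceUsage: ([\d.]+)/([\d.]+) CPU", status_text)
    return [float(total) - float(used) for used, total in pairs]


print(idle_cpus(STATUS_LINES))  # [0.0, 2.0]
```

After the third node connects, 2 CPUs stay idle, which matches the 2 CPUs a single trial requests in the script below, yet no trial is placed on them.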

System

Using the instance image and Docker image from the example config:

  • Image: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
  • Docker image: rayproject/ray:latest-gpu
  • OS: Ubuntu 18.04
  • ray 1.0.1
  • python 3.7.7

Reproduction (REQUIRED)

ray_cluster.yaml (set project_id)

cluster_name: default
min_workers: 0
max_workers: 3
initial_workers: 0
autoscaling_mode: default

docker:
    image: "rayproject/ray:latest-gpu"
    container_name: "ray_container"

    pull_before_run: True
    run_options: []

target_utilization_fraction: 0.8
idle_timeout_minutes: 10

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: null

auth:
    ssh_user: ubuntu

head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

worker_nodes:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-1-13-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

cluster_synced_files: []
file_mounts_sync_continuously: True
initialization_commands: []

rsync_exclude:
    - "**/.git"
    - "**/.git/**"


rsync_filter:
    - ".gitignore"

setup_commands: []

head_setup_commands:
  - pip install google-api-python-client==1.7.8

worker_setup_commands: []

head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

tune_script.py

from ray import tune
import ray
from ray.tune.integration.docker import DockerSyncer


def trainable(config):
    import time

    t1 = time.time()
    while True:
        print(time.time() - t1)
        time.sleep(1)

def tune_mnist_pbt():
    ray.init(address='auto')

    tune.run(
        trainable,
        resources_per_trial={
            "cpu": 2
        },
        config={},
        sync_config=tune.SyncConfig(sync_to_driver=DockerSyncer),
        num_samples=100,
        scheduler=None,
        fail_fast=True,
        queue_trials=True,
        reuse_actors=True,
        name="example")


if __name__ == '__main__':
    tune_mnist_pbt()

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 7 (4 by maintainers)

Top GitHub Comments

tuxa commented, Nov 25, 2020 (1 reaction)

That's it, thank you so much @richardliaw! I also had this problem in my real training script: one epoch runs for several hours, so tune.report is not called within that time. An easy workaround is to add callbacks within the training loop.
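
One way to sketch that workaround is a small throttle that reports from inside the batch loop at most once per interval. Everything here is hypothetical (`PeriodicReporter`, `train_one_epoch`, the 60-second default); the `report` callable is assumed to be `tune.report` from the Ray 1.0.x function API, and the real training step is left as a placeholder.

```python
import time


class PeriodicReporter:
    """Forward metrics to `report` at most once every `interval_s` seconds."""

    def __init__(self, report, interval_s=60.0, clock=time.monotonic):
        self._report = report          # assumed: ray.tune.report on the cluster
        self._interval_s = interval_s
        self._clock = clock            # injectable so the throttle can be tested
        self._last = None              # time of the last forwarded report

    def maybe_report(self, **metrics):
        now = self._clock()
        if self._last is None or now - self._last >= self._interval_s:
            self._last = now
            self._report(**metrics)
            return True
        return False


def train_one_epoch(batches, reporter):
    """Hypothetical long-running epoch that reports while it runs."""
    for i, _batch in enumerate(batches):
        # ... the real training step for this batch would go here ...
        reporter.maybe_report(batches_done=i + 1)
```

Note that Tune counts each reported result as a training iteration, so schedulers or stopping criteria keyed on the iteration count may need adjusting if reports arrive mid-epoch.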

richardliaw commented, Nov 25, 2020 (0 reactions)

Oh, I see. Can you try adding tune.report in your training function? Tune doesn't launch new jobs unless it receives information from its workers.
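
A minimal sketch of that suggestion applied to the trainable from the reproduction, assuming the Ray 1.0.x function API in which `tune.report(**metrics)` sends a result back to Tune. The `make_trainable` factory and its `max_steps` and `sleep_s` parameters are hypothetical, added only so the loop can be exercised without a cluster.

```python
import time


def make_trainable(report, max_steps=None, sleep_s=1.0):
    """Build a trainable that reports once per loop iteration.

    `report` is assumed to be `ray.tune.report` when run under Tune;
    `max_steps` and `sleep_s` are hypothetical knobs for local dry runs.
    """

    def trainable(config):
        t1 = time.time()
        step = 0
        while max_steps is None or step < max_steps:
            time.sleep(sleep_s)
            step += 1
            # Each reported result gives Tune the information it waits for
            # before launching further trials.
            report(elapsed=time.time() - t1, step=step)

    return trainable


# Local dry run with a stand-in reporter instead of tune.report.
reported = []
make_trainable(lambda **m: reported.append(m), max_steps=3, sleep_s=0.0)({})
```

On the cluster, the same function would be passed to Tune as `tune.run(make_trainable(tune.report), ...)` with the other arguments from the original script unchanged.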
