[Feature] Autoscaler should understand AWS availability and act accordingly

Search before asking

I searched the issues and found no similar feature request.

Description

I ran into this issue because spot instances are frequently unavailable on AWS.

This manifests itself in a repeated log of ‘spawning x’ without any progress: [screenshot: bildschirmfoto_2021-11-24_um_16 25 49]

Looking into the logs, the following error is logged repeatedly:

botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (us-west-1b). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-1c.

Interestingly, the message seems to be at least partly untrue: the same text is logged even when there is no capacity in us-west-1c, or at least that is what the logs suggest, since the suggested zone ping-pongs between us-west-1c and us-west-1b.

One can work around this issue by manually removing the affected instance types from the config.yaml and relaunching the Ray cluster. However, it would be nice if Ray handled this problem automatically. Ideally, it would consider other worker types as a fallback.
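The fallback idea could look roughly like the sketch below. It is not Ray's autoscaler code: `launch_with_fallback`, `CapacityError`, and `fake_launch` are hypothetical names made up for illustration, and `CapacityError` stands in for whatever exception the cloud provider raises on `InsufficientInstanceCapacity`.

```python
from typing import Callable, Optional, Sequence


class CapacityError(Exception):
    """Hypothetical stand-in for a provider error such as InsufficientInstanceCapacity."""


def launch_with_fallback(
    node_types: Sequence[str],
    launch: Callable[[str], str],
) -> Optional[str]:
    """Try each node type in order and return the first node id that launches.

    Returns None if every candidate type fails with a capacity error.
    """
    for node_type in node_types:
        try:
            return launch(node_type)
        except CapacityError as exc:
            # Record the failure and move on to the next candidate type.
            print(f"no capacity for {node_type}: {exc}; trying next type")
    return None


def fake_launch(node_type: str) -> str:
    # Hypothetical launcher: pretend only g4dn.2xlarge has spare capacity.
    if node_type != "g4dn.2xlarge":
        raise CapacityError("InsufficientInstanceCapacity")
    return f"node-{node_type}"


if __name__ == "__main__":
    print(launch_with_fallback(["g4dn.xlarge", "g4dn.2xlarge"], fake_launch))
```

Only capacity errors trigger the fallback here. Any other exception propagates, so misconfiguration is not silently masked by switching instance types.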

Use case

Better consistency in spawning nodes on AWS with low availability.

Reproducibility

Hard to reproduce on demand, since AWS availability is non-deterministic. GPU spot instances seem to be in high demand, though. Not sure whether setting a low max spot price helps. Example setup:

available_node_types:
    ray.head.default:
        min_workers: 0
        max_workers: 0
        resources: {}
        node_config:
            InstanceType: t3.large
            ImageId: XXXXXXXX
            IamInstanceProfile:
                Arn: XXXXXXXX
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: '0.5'  # $

    ray.worker.gpu.g4dn.xlarge:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            InstanceType: g4dn.xlarge
            ImageId: XXXXXXXX
            IamInstanceProfile:
                Arn: XXXXXXXX
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: '2'  # $

Related issues

No response

Are you willing to submit a PR?

Yes I am willing to submit a PR!

Issue Analytics

State:
Created 2 years ago
Reactions: 1
Comments: 6 (6 by maintainers)

Top GitHub Comments

DmitriGekhtman commented, Dec 8, 2021

Somewhat mitigated by https://github.com/ray-project/ray/pull/20814. Decreasing priority to P2.

DmitriGekhtman commented, Nov 30, 2021

I agree.

Currently you get a failed node launch message if a node updater fails. We can update the node launcher with some error handling around the create_node call to produce the same or a similar message.
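A minimal sketch of that kind of error handling follows. It assumes only that the provider exception may carry a botocore-style `response["Error"]["Code"]` field. `safe_create_node`, `StubClientError`, and the message wording are hypothetical, and this is not the actual node launcher code.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("node_launcher_sketch")


class StubClientError(Exception):
    """Hypothetical stand-in that mirrors the `response` dict of botocore's ClientError."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.response = {"Error": {"Code": code, "Message": message}}


def safe_create_node(
    create_node: Callable[[Dict[str, Any], int], Any],
    node_config: Dict[str, Any],
    count: int,
) -> Optional[str]:
    """Call create_node; return None on success or a failure message on error."""
    try:
        create_node(node_config, count)
    except Exception as exc:
        # Use the provider's error code when present, else the exception type name.
        error = getattr(exc, "response", {}).get("Error", {})
        code = error.get("Code", type(exc).__name__)
        instance_type = node_config.get("InstanceType", "unknown")
        message = f"Failed to launch {count} node(s) of type {instance_type}: {code}"
        logger.error(message)
        return message
    return None


def failing_create_node(node_config: Dict[str, Any], count: int) -> None:
    # Hypothetical provider call that always reports missing capacity.
    raise StubClientError("InsufficientInstanceCapacity", "no capacity in us-west-1b")


if __name__ == "__main__":
    print(safe_create_node(failing_create_node, {"InstanceType": "g4dn.xlarge"}, 1))
```

Returning the message instead of re-raising lets the caller surface it the same way as an updater failure, instead of only logging a stack trace.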
