
[Feature] Autoscaler should understand AWS availability and act accordingly


Search before asking

  • I searched the issues and found no similar feature request.

Description

I ran into this issue because spot instances are frequently unavailable on AWS.

This manifests itself as a repeated ‘spawning x’ log line without any progress:

[Screenshot: bildschirmfoto_2021-11-24_um_16 25 49]

When peeking into the logs, the following can be found to be repeatedly logged:

botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (us-west-1b). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-1c.

Interestingly, the message seems to be at least partly wrong: this text is logged even when there is no availability in us-west-1c. At least, that is what the logs suggest, since the suggested zone ping-pongs between us-west-1c and us-west-1b.

One can work around this issue by manually removing the affected instance types from the config.yaml and relaunching the Ray cluster. However, it would be nice if Ray handled this problem automatically. Ideally, it would fall back to other worker types when one is unavailable.
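To make the requested fallback concrete, here is a minimal sketch in Python. It is not how the Ray autoscaler works today. The names `launch_with_fallback`, `error_code`, and the `launch` callable are hypothetical. The only AWS-specific assumption is that the raised exception carries the error code under `response["Error"]["Code"]`, as botocore's `ClientError` does.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Error code taken from the log above. Other codes are deliberately not handled.
CAPACITY_ERROR_CODES = {"InsufficientInstanceCapacity"}

Candidate = Tuple[str, str]  # (instance type, availability zone)


def error_code(exc: Exception) -> Optional[str]:
    """Return the AWS error code of a botocore-style ClientError, if present."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code")


def launch_with_fallback(
    launch: Callable[[str, str], None],  # hypothetical launcher, e.g. a RunInstances wrapper
    candidates: Iterable[Candidate],
) -> Tuple[Optional[Candidate], List[Candidate]]:
    """Try each candidate in order and skip those that lack capacity.

    Returns the candidate that launched (or None) and the candidates skipped.
    Any error other than a capacity error is re-raised unchanged.
    """
    skipped: List[Candidate] = []
    for instance_type, zone in candidates:
        try:
            launch(instance_type, zone)
            return (instance_type, zone), skipped
        except Exception as exc:
            if error_code(exc) not in CAPACITY_ERROR_CODES:
                raise
            skipped.append((instance_type, zone))
    return None, skipped
```

Recording the skipped candidates means a caller could avoid retrying the same type and zone immediately, which is the ping-pong pattern described above.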

Use case

Better consistency in spawning nodes on AWS with low availability.

Reproducibility

Hard to reproduce on demand, since AWS availability is non-deterministic. GPU spot instances seem to be in high demand, though. I am not sure whether setting a low maximum spot price helps. Example setup:

available_node_types:
    ray.head.default:
        min_workers: 0
        max_workers: 0
        resources: {}
        node_config:
            InstanceType: t3.large
            ImageId: XXXXXXXX
            IamInstanceProfile:
                Arn: XXXXXXXX
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: '0.5'  # $

    ray.worker.gpu.g4dn.xlarge:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            InstanceType: g4dn.xlarge
            ImageId: XXXXXXXX
            IamInstanceProfile:
                Arn: XXXXXXXX
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: '2'  # $

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Issue Analytics

  • State: open
  • Created 2 years ago
  • Reactions: 1
  • Comments: 6 (6 by maintainers)

Top GitHub Comments

1 reaction
DmitriGekhtman commented, Dec 8, 2021

Somewhat mitigated by https://github.com/ray-project/ray/pull/20814. Decreasing priority to P2.

1 reaction
DmitriGekhtman commented, Nov 30, 2021

I agree.

Currently you get a failed node launch message if a node updater fails. We can update the node launcher with some error handling around the create_node call to produce the same or a similar message.
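A rough sketch of that kind of error handling, in Python, might look like the following. It is not the actual Ray node launcher code. The helper name `create_node_reporting_failure` and the log wording are hypothetical, and the `create_node(node_config, tags, count)` call shape is an assumption about the provider interface.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def create_node_reporting_failure(
    provider: Any,  # assumed to expose create_node(node_config, tags, count)
    node_config: Dict[str, Any],
    tags: Dict[str, str],
    count: int,
) -> bool:
    """Call provider.create_node and log a clear message if the launch fails.

    Returns True if the call succeeded and False otherwise, so the caller can
    decide whether to retry, back off, or try another node type.
    """
    try:
        provider.create_node(node_config, tags, count)
        return True
    except Exception as exc:
        # Hypothetical message format. The real wording would follow the
        # existing failed-node-launch message mentioned above.
        logger.error(
            "Failed to launch %d node(s) of type %s: %s",
            count,
            node_config.get("InstanceType", "<unknown>"),
            exc,
        )
        return False
```

Returning a boolean instead of re-raising keeps the launcher loop alive. Whether that is the right trade-off depends on how the autoscaler is meant to surface launch failures.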
