[Feature] Autoscaler should understand AWS availability and act accordingly
Search before asking
- I had searched in the issues and found no similar feature requirement.
Description
I ran into this issue because spot instances are frequently unavailable on AWS.
This manifests as a repeated ‘spawning x’ log message without any progress.
The logs show the following error over and over:
botocore.exceptions.ClientError: An error occurred (InsufficientInstanceCapacity) when calling the RunInstances operation (reached max retries: 0): We currently do not have sufficient g4dn.xlarge capacity in the Availability Zone you requested (us-west-1b). Our system will be working on provisioning additional capacity. You can currently get g4dn.xlarge capacity by not specifying an Availability Zone in your request or choosing us-west-1c.
Interestingly, the message seems to be at least partially wrong: it is logged even when there is no availability in us-west-1c, or at least that is what the logs suggest, since the message ping-pongs between us-west-1c and us-west-1b.
One can work around this issue by manually disabling the instance types from the config.yaml and re-launching the Ray cluster.
However, it would be nice if Ray handled this problem automatically. Ideally, it would consider other worker types as a fallback.
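To illustrate the fallback idea, here is a minimal Python sketch, not Ray's actual autoscaler code. It assumes a hypothetical `launch` callable that starts one node of a given type and raises a botocore-style exception (one carrying a `response["Error"]["Code"]` dict, as `botocore.exceptions.ClientError` does) when the launch fails. The function names and the set of codes treated as capacity errors are assumptions for illustration.

```python
from typing import Callable, Optional, Sequence

# Error codes treated as "no capacity right now" (assumption for this sketch).
# InsufficientInstanceCapacity is the code seen in the log above.
CAPACITY_ERROR_CODES = {"InsufficientInstanceCapacity"}


def error_code(exc: Exception) -> Optional[str]:
    """Return the AWS error code of a ClientError-like exception, if any."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return None
    return response.get("Error", {}).get("Code")


def launch_with_fallback(
    launch: Callable[[str], str],
    candidate_types: Sequence[str],
) -> str:
    """Try each node type in order, skipping types that lack capacity.

    `launch` is a hypothetical callable that starts one node of the given
    type and returns an identifier for it.
    """
    skipped = []
    for node_type in candidate_types:
        try:
            return launch(node_type)
        except Exception as exc:
            if error_code(exc) not in CAPACITY_ERROR_CODES:
                raise  # unrelated failure: do not mask it
            skipped.append(node_type)
    raise RuntimeError(f"No capacity for any candidate node type: {skipped}")
```

With something like this, `ray.worker.gpu.g4dn.xlarge` could be listed first and a second GPU worker type after it, so a capacity error on the first type leads to an attempt on the second instead of an endless retry.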
Use case
Better consistency in spawning nodes on AWS with low availability.
Reproducibility
Hard to reproduce on demand since AWS availability is non-deterministic. GPU instances on spot seem to be in high demand though. Not sure if setting the max spot price low helps. Example setup:
available_node_types:
    ray.head.default:
        min_workers: 0
        max_workers: 0
        resources: {}
        node_config:
            InstanceType: t3.large
            ImageId: XXXXXXXX
            IamInstanceProfile:
                Arn: XXXXXXXX
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: '0.5' # $
    ray.worker.gpu.g4dn.xlarge:
        min_workers: 1
        max_workers: 1
        resources: {}
        node_config:
            InstanceType: g4dn.xlarge
            ImageId: XXXXXXXX
            IamInstanceProfile:
                Arn: XXXXXXXX
            InstanceMarketOptions:
                MarketType: spot
                SpotOptions:
                    MaxPrice: '2' # $
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Issue Analytics
- State:
- Created 2 years ago
- Reactions:1
- Comments:6 (6 by maintainers)
Somewhat mitigated by https://github.com/ray-project/ray/pull/20814. Decreasing priority to P2.
I agree.
Currently you get a failed node launch message if a node updater failed. We can update the node launcher with some error handling around the create_node call to produce the same or similar message.
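A rough sketch of what that error handling could look like, in Python. It is only an illustration: `try_create_node` is a hypothetical helper, the message wording is invented rather than Ray's actual failed-launch message, and `provider` is assumed to be any object exposing a `create_node(node_config, tags, count)` method.

```python
import logging
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


def try_create_node(
    provider: Any,
    node_type: str,
    node_config: Dict[str, Any],
    tags: Dict[str, str],
    count: int,
) -> Optional[str]:
    """Call provider.create_node and report failures instead of crashing.

    Returns None on success, or a human-readable failure message that the
    caller can surface to the user (hypothetical wording).
    """
    try:
        provider.create_node(node_config, tags, count)
    except Exception as exc:
        message = f"Failed to launch {count} node(s) of type {node_type}: {exc}"
        logger.error(message)
        return message
    return None
```

Returning the message rather than re-raising lets the launcher keep running and decide separately whether to retry, back off, or try another node type.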