aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks
Describe the bug
Current Scenario
The scaling policy of the queue processing Fargate service is made up of two parts:
- Queue length-based scaling - if the user has not provided a scaling step that steps the task count down, the construct determines that no scale-in is needed and does not create a scale-in alarm.
- CPU-based scaling - scale-in and scale-out depend on the average CPU utilization of the service.
This can be found here - https://github.com/aws/aws-cdk/blob/fd5808f47111c0f72050f60e414df7f3f4ded6aa/packages/%40aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts#L344 (a rough sketch of that method follows).
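For reference, a paraphrased sketch of what that method wires up, expressed through the public aws-ecs autoscaling API (not the actual CDK source; names and defaults may differ by version):

import { BaseService } from 'aws-cdk-lib/aws-ecs';
import { ScalingInterval } from 'aws-cdk-lib/aws-applicationautoscaling';
import { IQueue } from 'aws-cdk-lib/aws-sqs';

// Rough equivalent of what the pattern configures for its service (sketch only):
// a CPU target-tracking policy plus queue-depth step scaling on the visible-message count.
function configureAutoscaling(
  service: BaseService,
  queue: IQueue,
  scalingSteps: ScalingInterval[],
  minCapacity: number,
  maxCapacity: number,
) {
  const scalingTarget = service.autoScaleTaskCount({ minCapacity, maxCapacity });

  // CPU-based scaling: always added by the pattern and not exposed through its props.
  scalingTarget.scaleOnCpuUtilization('CpuScaling', {
    targetUtilizationPercent: 50,
  });

  // Queue length-based scaling: step scaling on ApproximateNumberOfMessagesVisible,
  // driven by the user-supplied scalingSteps.
  scalingTarget.scaleOnMetric('QueueMessagesVisibleScaling', {
    metric: queue.metricApproximateNumberOfMessagesVisible(),
    scalingSteps,
  });
}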
Issue
CPU-based scaling does not seem appropriate for a queue processing Fargate service; the service should scale out or in only based on the number of messages in the queue, not on the CPU utilization of its tasks.
Because of the CPU-based scaling, the service may start a new task that processes the same message again if the message triggers a CPU-intensive process that does not complete before the scaling alarm fires.
Also, if the process is memory intensive rather than CPU intensive, the CPU-based scale-in alarm will be in alarm constantly, causing auto scaling to keep removing tasks until it reaches the desired capacity.
The same scenarios apply to the memory utilization metric when the running task is actually CPU intensive.
Since there is no task-level termination protection, and the patterns provide no way to disable scale-in, this can terminate a task that is mid-execution.
Expected Behavior
When a queue processing Fargate service has been set up to scale out only on the approximate number of messages in the queue, and scale-in has been disabled, it should not terminate tasks.
Current Behavior
Auto scaling on the queue processing Fargate service starts terminating tasks when a task is memory intensive and has a long processing time, because the CloudWatch scale-in alarm created by the CPU utilization scaling policy is triggered, terminating a random task mid-execution.
Reproduction Steps
The following CDK code:

import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { QueueProcessingFargateService } from 'aws-cdk-lib/aws-ecs-patterns';
import { ContainerImage } from 'aws-cdk-lib/aws-ecs';

export class QueueProcessingFargateServiceAutoscaleTestStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const containerImage = ContainerImage.fromAsset('test');

    // Scale out by one task when at least one message is visible; never step down.
    new QueueProcessingFargateService(this, 'test', {
      image: containerImage,
      scalingSteps: [
        { upper: 0, change: 0 },
        { lower: 1, change: +1 },
      ],
    });
  }
}
This creates a QueueProcessingFargateService with both the queue-depth step scaling policy and the default CPU utilization scaling policy, which causes conflicting alarms that are constantly triggered.
Possible Solution
The issue is with this method in the queue processing service pattern base - https://github.com/aws/aws-cdk/blob/fd5808f47111c0f72050f60e414df7f3f4ded6aa/packages/%40aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts#L344
It adds a default CPU utilization scaling policy that cannot be removed, edited, or disabled.
Solution 1
Remove the CPU utilization scaling policy when it is not strictly required.
Solution 2
Add optional properties that let the user disable scale-in on the CPU utilization metric, or modify the CPU scaling values as needed (a hypothetical sketch of such properties follows).
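For illustration only, such properties might look like the following; the names (disableCpuScaling, cpuTargetUtilizationPercent, disableCpuScaleIn) are hypothetical and do not exist in the published construct:

// Hypothetical additions to the queue processing service props (sketch, not real API).
export interface HypotheticalCpuScalingProps {
  // When true, skip the default CPU utilization scaling policy entirely.
  readonly disableCpuScaling?: boolean;

  // When CPU scaling is kept, allow overriding the hard-coded 50% target.
  readonly cpuTargetUtilizationPercent?: number;

  // Keep CPU scale-out but suppress CPU scale-in, so long-running, memory-intensive
  // tasks are not terminated mid-execution.
  readonly disableCpuScaleIn?: boolean;
}

Internally, a disableCpuScaleIn flag could map onto the existing disableScaleIn option that scaleOnCpuUtilization already accepts.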
Additional Information/Context
No response
CDK CLI Version
2.27
Framework Version
No response
Node.js Version
16.14.2
OS
Linux
Language
Typescript
Language Version
3.9.7
Other information
No response
Top GitHub Comments
Sharing how I am working around this issue at the moment.
I created a class that derives from QueueProcessingFargateService and overrides the protected configureAutoscalingForService(service: BaseService) method, as shown in the sketch below. Then I can simply use the derived class (QueueAndMessageBasedScalingOnlyFargateService) in place of QueueProcessingFargateService within my software.
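A minimal sketch of what such an override might look like (not the commenter's actual code; it assumes the base class exposes sqsQueue, minCapacity, maxCapacity, and scalingSteps, as in the linked source):

import { BaseService } from 'aws-cdk-lib/aws-ecs';
import { QueueProcessingFargateService } from 'aws-cdk-lib/aws-ecs-patterns';

// Sketch only: keep the queue-depth step scaling but drop the default CPU policy,
// so no CPU-based scale-in alarm is ever created.
export class QueueAndMessageBasedScalingOnlyFargateService extends QueueProcessingFargateService {
  protected configureAutoscalingForService(service: BaseService): void {
    const scalingTarget = service.autoScaleTaskCount({
      minCapacity: this.minCapacity,
      maxCapacity: this.maxCapacity,
    });

    // Scale only on the number of visible messages in the queue.
    scalingTarget.scaleOnMetric('QueueMessagesVisibleScaling', {
      metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
      scalingSteps: this.scalingSteps,
    });
  }
}

The subclass is then a drop-in replacement: new QueueAndMessageBasedScalingOnlyFargateService(this, 'test', { image, scalingSteps }) instead of new QueueProcessingFargateService(...).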
I have the same problem.
In my case I have a service using QueueProcessingFargateService for text parsing. Sometimes there are more than 1000 messages in the SQS queue, and each one needs only a small amount of CPU to process. When my SQS message-count scaling rule adds a new task, the CPU scaling rule stops it within a minute; then the SQS count rule adds a new task again, and the CPU rule stops it too.
To work around this I manually deleted the CPU scaling rule in the AWS console, but I do not think doing it manually in the console is a good solution.