aws-ecs-patterns (QueueProcessingFargateService): non-editable Scaling Policy causes race conditions & dropped tasks

See original GitHub issue

Describe the bug

Current Scenario

The scaling policy of the Queue processing Fargate service has 2 parts -

  1. Queue length-based scaling - If the user has not provided a step that scales down, the system determines that no scale-in is needed and does not create a scale-in alarm.
  2. CPU-based scaling - Scale-in and scale-out depend on the average CPU utilization of the service.
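To make the first point concrete, here is a minimal sketch of how a list of scaling steps maps a queue depth to a capacity change. It is a simplified model and not the CDK implementation: the `ScalingInterval` shape mirrors the `scalingSteps` entries used in the reproduction further down, while `changeForQueueDepth`, `hasScaleInStep`, and the bound semantics (inclusive lower, exclusive upper) are assumptions made for illustration.

```typescript
// Simplified model of step-scaling intervals; not the CDK implementation.
interface ScalingInterval {
  lower?: number; // inclusive lower bound (assumption); omitted = unbounded
  upper?: number; // exclusive upper bound (assumption); omitted = unbounded
  change: number; // tasks to add (positive) or remove (negative)
}

// Hypothetical helper: return the capacity change for a given queue depth.
function changeForQueueDepth(steps: ScalingInterval[], depth: number): number {
  for (const step of steps) {
    const aboveLower = step.lower === undefined || depth >= step.lower;
    const belowUpper = step.upper === undefined || depth < step.upper;
    if (aboveLower && belowUpper) {
      return step.change;
    }
  }
  return 0; // depth falls in a gap between intervals: no change
}

// Hypothetical helper: true only if some step removes capacity.
function hasScaleInStep(steps: ScalingInterval[]): boolean {
  return steps.some((step) => step.change < 0);
}

// The steps from the reproduction in this issue.
const reproSteps: ScalingInterval[] = [
  { upper: 0, change: 0 },
  { lower: 1, change: +1 },
];

console.log(changeForQueueDepth(reproSteps, 0)); // 0
console.log(changeForQueueDepth(reproSteps, 5)); // 1
console.log(hasScaleInStep(reproSteps)); // false
```

With these steps no interval ever removes a task, so any scale-in the service performs has to come from the CPU-based policy.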

This can be found here - https://github.com/aws/aws-cdk/blob/fd5808f47111c0f72050f60e414df7f3f4ded6aa/packages/%40aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts#L344

Issue

CPU-based scaling does not seem appropriate for a Queue processing Fargate service. The service should scale out or in based only on the number of messages in the queue, not on the CPU utilization of the system.

Because of the CPU-based scaling, the auto-scaling group may start a new instance that processes the same message again if a message triggers a CPU-intensive process that does not complete before the scaling alarm fires.

Also, if the process is memory intensive, the CPU-based scale-in alarm will always be in the alarm state, causing the auto-scaling group to keep removing tasks until it reaches the desired capacity.

The same scenarios would apply to a memory utilization metric when the running task is actually CPU intensive.

Since there is no task-level termination protection, and the patterns offer no way to disable scale-in, the ASG can terminate a task that is mid-execution.

Expected Behavior

When a Queue processing Fargate service has been set up to scale out only on the approximate number of messages in the queue, and scale-in has been disabled, it should not terminate tasks.

Current Behavior

The ASG on the Queue processing Fargate service starts terminating tasks if a task is memory intensive and has a long processing time. A CloudWatch scale-in alarm from the CPUUtilizationMetric scaling policy fires and a random task is terminated mid-execution.
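The interaction between the two policies can be illustrated with a toy simulation. This is only a sketch under strong assumptions: real step scaling and target tracking use alarms, evaluation periods, and cooldowns that are not modeled here, and `simulateFlapping`, `SimulationOptions`, and all numeric values are hypothetical.

```typescript
// Toy model of two competing scaling policies; not how AWS evaluates alarms.
type ScalingEvent = { tick: number; policy: "queue" | "cpu"; tasks: number };

interface SimulationOptions {
  minCapacity: number;
  maxCapacity: number;
  cpuTargetPercent: number; // target used by the CPU-based policy
  cpuScaleInEnabled: boolean; // false models a "disable scale-in" switch
}

function simulateFlapping(
  ticks: number,
  queueDepth: number,
  avgCpuPercent: number,
  options: SimulationOptions,
): ScalingEvent[] {
  const events: ScalingEvent[] = [];
  let tasks = options.minCapacity;
  for (let tick = 0; tick < ticks; tick++) {
    // Queue-based policy: add a task while messages are visible.
    if (queueDepth >= 1 && tasks < options.maxCapacity) {
      tasks += 1;
      events.push({ tick, policy: "queue", tasks });
    }
    // CPU-based policy: remove a task while average CPU is under target.
    if (
      options.cpuScaleInEnabled &&
      avgCpuPercent < options.cpuTargetPercent &&
      tasks > options.minCapacity
    ) {
      tasks -= 1;
      events.push({ tick, policy: "cpu", tasks });
    }
  }
  return events;
}

const base = { minCapacity: 1, maxCapacity: 2, cpuTargetPercent: 50 };

// Memory-bound workload: a large backlog but low CPU usage.
const flapping = simulateFlapping(3, 1000, 10, { ...base, cpuScaleInEnabled: true });
console.log(flapping.map((e) => `${e.policy}:${e.tasks}`).join(" "));
// queue:2 cpu:1 queue:2 cpu:1 queue:2 cpu:1

const stable = simulateFlapping(3, 1000, 10, { ...base, cpuScaleInEnabled: false });
console.log(stable.map((e) => `${e.policy}:${e.tasks}`).join(" "));
// queue:2
```

In the first run every task added by the queue rule is removed again by the CPU rule, which matches the behavior described above. With CPU scale-in switched off in the model, the task count settles at the maximum.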

Reproduction Steps

The following CDK code -

  import { Stack, StackProps } from 'aws-cdk-lib';
  import { Construct } from 'constructs';
  import { QueueProcessingFargateService } from 'aws-cdk-lib/aws-ecs-patterns';
  import { ContainerImage } from 'aws-cdk-lib/aws-ecs';

  export class QueueProcessingFargateServiceAutoscaleTestStack extends Stack {
    constructor(scope: Construct, id: string, props?: StackProps) {
      super(scope, id, props);
      // Build the container image from the local 'test' directory.
      const containerImage = ContainerImage.fromAsset('test');
      // Only a scale-out step is defined; there is no scale-in step.
      new QueueProcessingFargateService(this, 'test', {
        image: containerImage,
        scalingSteps: [
          { upper: 0, change: 0 },
          { lower: 1, change: +1 },
        ],
      });
    }
  }

will create a new QueueProcessingFargateService with the following type of scaling policy -

image

which causes conflicting alarms to be triggered constantly -

image

image

Possible Solution

The issue is with this method in the Queue processing fargate service pattern base - https://github.com/aws/aws-cdk/blob/fd5808f47111c0f72050f60e414df7f3f4ded6aa/packages/%40aws-cdk/aws-ecs-patterns/lib/base/queue-processing-service-base.ts#L344

It adds a default CPUUtilizationScalingPolicy that cannot be removed, edited, or disabled.

Solution 1

Remove the CPU utilization scaling policy if it is not strictly required.

Solution 2

Add optional properties that let the user disable scale-in on the CPU utilization metric or change its values as needed.
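As a rough illustration of Solution 2, the sketch below shows how optional properties could control which policies get created. Every name here is hypothetical (`CpuScalingOptions`, `enableCpuScaling`, `cpuTargetUtilizationPercent`, `disableCpuScaleIn`, `planScalingPolicies`, and the `CpuScaling` id), the 50% default is an assumption about the current behavior, and the code only plans policies as plain objects rather than calling the CDK.

```typescript
// Hypothetical options; none of these names are confirmed CDK properties.
interface CpuScalingOptions {
  enableCpuScaling?: boolean; // default true keeps today's behavior
  cpuTargetUtilizationPercent?: number; // assumed default: 50
  disableCpuScaleIn?: boolean; // default false keeps today's behavior
}

interface PlannedPolicy {
  id: string;
  kind: "step" | "targetTracking";
  targetUtilizationPercent?: number;
  disableScaleIn?: boolean;
}

// Decide which scaling policies a service would get for the given options.
function planScalingPolicies(options: CpuScalingOptions = {}): PlannedPolicy[] {
  const policies: PlannedPolicy[] = [];
  if (options.enableCpuScaling ?? true) {
    policies.push({
      id: "CpuScaling",
      kind: "targetTracking",
      targetUtilizationPercent: options.cpuTargetUtilizationPercent ?? 50,
      disableScaleIn: options.disableCpuScaleIn ?? false,
    });
  }
  // The queue-length policy is always created.
  policies.push({ id: "QueueMessagesVisibleScaling", kind: "step" });
  return policies;
}

console.log(planScalingPolicies().map((p) => p.id));
console.log(planScalingPolicies({ enableCpuScaling: false }).map((p) => p.id));
console.log(planScalingPolicies({ disableCpuScaleIn: true })[0].disableScaleIn);
```

Defaulting every option to the existing behavior would keep the change backward compatible, while users with memory-bound or long-running tasks could opt out of CPU-driven scale-in.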

Additional Information/Context

No response

CDK CLI Version

2.27

Framework Version

No response

Node.js Version

16.14.2

OS

Linux

Language

TypeScript

Language Version

3.9.7

Other information

No response

Issue Analytics

  • State: open
  • Created a year ago
  • Reactions: 3
  • Comments: 5 (3 by maintainers)

Top GitHub Comments

3 reactions
inh3 commented, Dec 16, 2022

Sharing how I am working around this issue at the moment.

I created a class that derives from QueueProcessingFargateService and overrides the protected configureAutoscalingForService(service: BaseService) method:

// see: https://github.com/aws/aws-cdk/issues/20706
import { BaseService } from 'aws-cdk-lib/aws-ecs';
import { QueueProcessingFargateService } from 'aws-cdk-lib/aws-ecs-patterns';

export class QueueAndMessageBasedScalingOnlyFargateService extends QueueProcessingFargateService {
  /**
   * Configure autoscaling based only on the number of messages visible in the SQS queue.
   *
   * @param service the ECS/Fargate service to apply the autoscaling rules to
   */
  protected configureAutoscalingForService(service: BaseService) {
    const scalingTarget = service.autoScaleTaskCount({
      maxCapacity: this.maxCapacity,
      minCapacity: this.minCapacity,
    });
    scalingTarget.scaleOnMetric("QueueMessagesVisibleScaling", {
      metric: this.sqsQueue.metricApproximateNumberOfMessagesVisible(),
      scalingSteps: this.scalingSteps,
    });
  }
}

Then I can simply use the derived class (QueueAndMessageBasedScalingOnlyFargateService) in place of QueueProcessingFargateService within my software.

3 reactions
danilvalov commented, Aug 25, 2022

I have the same problem.

In my case I have a service that uses QueueProcessingFargateService for text parsing. Sometimes there are more than 1000 tasks in the SQS queue, and each task has a low CPU load. When my SQS count scaling rule adds a new instance, the CPU scaling rule stops it in 1 minute. Then my SQS count scaling rule adds a new instance again, and the CPU scaling rule stops that one too.

To solve it I manually deleted the CPU scaling rule in the AWS console, but I do not think doing this manually in the console is a good solution.
