(rds): AWS::RDS::DBInstance should not have EngineVersion property set for Aurora clusters


Describe the bug

Currently, upgrading RDS Aurora between versions causes a runtime error that leaves the CloudFormation stack in an unrecoverable UPDATE_ROLLBACK_FAILED state. This is because CDK sets the EngineVersion property on AWS::RDS::DBInstance, which the CloudFormation documentation says should not be set for an instance in an Aurora cluster:

Amazon Aurora

Not applicable. The version number of the database engine to be used by the DB instance is managed by the DB cluster.


Upgrading between versions that require downtime produces this error:

The stack named BlimmerTestAuroraUpgradeStack failed to deploy: UPDATE_ROLLBACK_FAILED (The following resource(s) failed to update: [DatabaseCluster68FC2945, DatabaseClusterInstance1C566869D]. ): The specified DB Instance is a member of a cluster. Modify the DB engine version for the DB Cluster using the ModifyDbCluster API (Service: Rds, Status Code: 400, Request ID: 9998e162-bb47-4ff0-a6ad-91665e964ff2), DB cluster isn’t available for modification with status upgrading. (Service: Rds, Status Code: 400, Request ID: 5b967f09-58a2-42dc-aa20-3a8bffbf705a)

(Screenshot in the original issue: the CloudFormation event log for a test stack.)

The worst part about this bug is that the Database stack is left in the unrecoverable UPDATE_ROLLBACK_FAILED state, which means that you have to either:

a) Attempt to complete the rollback. This is impossible because the cluster actually upgrades even though the CloudFormation steps fail, and RDS does not allow a rollback from the new target major version to the old one.

b) Delete the stack. This is obviously not ideal because databases should not be deleted in most cases. Worse, you can’t update the RetentionPolicy to try to retain the database while deleting the stack.

Expected Behavior

I expected to be able to upgrade a DatabaseCluster between supported major versions via the engine property on DatabaseCluster.

Current Behavior

As mentioned above, if you try to upgrade an Aurora cluster between major versions, you’ll encounter this error, which leaves the stack in an unrecoverable state.

The stack named BlimmerTestAuroraUpgradeStack failed to deploy: UPDATE_ROLLBACK_FAILED (The following resource(s) failed to update: [DatabaseCluster68FC2945, DatabaseClusterInstance1C566869D]. ): The specified DB Instance is a member of a cluster. Modify the DB engine version for the DB Cluster using the ModifyDbCluster API (Service: Rds, Status Code: 400, Request ID: 9998e162-bb47-4ff0-a6ad-91665e964ff2), DB cluster isn’t available for modification with status upgrading. (Service: Rds, Status Code: 400, Request ID: 5b967f09-58a2-42dc-aa20-3a8bffbf705a)

Reproduction Steps

  1. Create a new stack with an older major version of Aurora Postgres:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';
import { Vpc } from 'aws-cdk-lib/aws-ec2';

export class CdkBugReportsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new DatabaseCluster(this, 'DatabaseCluster', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_10_18,
      }),
      instanceProps: {
        vpc: new Vpc(this, 'Vpc')
      }
    })
  }
}
  2. Run cdk deploy on the stack above.
  3. Update to a newer major version of Aurora Postgres. At the time of writing, 10.18 -> 13.4 is a valid upgrade target:
import * as cdk from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { DatabaseCluster, DatabaseClusterEngine, AuroraPostgresEngineVersion } from 'aws-cdk-lib/aws-rds';
import { Vpc } from 'aws-cdk-lib/aws-ec2';

export class CdkBugReportsStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    new DatabaseCluster(this, 'DatabaseCluster', {
      engine: DatabaseClusterEngine.auroraPostgres({
        version: AuroraPostgresEngineVersion.VER_13_4,  // Changed
      }),
      instanceProps: {
        vpc: new Vpc(this, 'Vpc')
      }
    })
  }
}
  4. Run cdk deploy again.

Observe: the stack update will fail with the aforementioned error, leaving the Database stack in an unrecoverable state.

Possible Solution

Short Term Workaround

As a short-term solution, users can reach into the L1 construct to remove the EngineVersion property like this:

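import { CfnDBInstance } from 'aws-cdk-lib/aws-rds';
// Assumes `cluster` holds the DatabaseCluster construct created in the stack.
const cfnInstances = cluster.node.children.filter(
  (child): child is CfnDBInstance => child instanceof CfnDBInstance,
);
if (cfnInstances.length === 0) {
  throw new Error("Couldn't pull CfnDBInstances from the L1 constructs!");
}
// Drop EngineVersion from each instance; the cluster manages the engine version.
cfnInstances.forEach((cfnInstance) => delete cfnInstance.engineVersion);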

I’ve tested removing the EngineVersion property from the DBInstance in an existing CloudFormation stack. It shows the following diff:

Stack BlimmerTestAuroraUpgradeStack
Resources
[~] AWS::RDS::DBInstance DatabaseCluster/Instance1 DatabaseClusterInstance1C566869D
 └─ [-] EngineVersion
     └─ 10.18

And the change applied with (what appears to be) no changes to the actual instances:

BlimmerTestAuroraUpgradeStack: creating CloudFormation changeset...
BlimmerTestAuroraUpgradeStack | 0/3 | 11:31:22 AM | UPDATE_IN_PROGRESS   | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack User Initiated
BlimmerTestAuroraUpgradeStack | 0/3 | 11:31:29 AM | UPDATE_IN_PROGRESS   | AWS::RDS::DBInstance                        | DatabaseCluster/Instance1 (DatabaseClusterInstance1C566869D)
BlimmerTestAuroraUpgradeStack | 1/3 | 11:31:32 AM | UPDATE_COMPLETE      | AWS::RDS::DBInstance                        | DatabaseCluster/Instance1 (DatabaseClusterInstance1C566869D)
BlimmerTestAuroraUpgradeStack | 2/3 | 11:31:33 AM | UPDATE_COMPLETE_CLEA | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack
BlimmerTestAuroraUpgradeStack | 3/3 | 11:31:34 AM | UPDATE_COMPLETE      | AWS::CloudFormation::Stack                  | BlimmerTestAuroraUpgradeStack

I verified in the RDS console that the instance didn’t shut down or have any other events during this cdk deploy.

So, I believe it should be safe for all Aurora DatabaseCluster users to use this workaround without downtime. However, I’ve only tested this on my test cluster, so your mileage may vary.

Note that this should only be done for Aurora clusters, per the AWS::RDS::DBInstance CloudFormation documentation.

Longer Term Fix

CDK should detect when the engine passed is an Aurora engine. In that case, it should not set the EngineVersion property on AWS::RDS::DBInstance.
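To illustrate the rule such a fix would enforce, here is a minimal, self-contained sketch that works on a plain template object rather than on CDK constructs. It is not CDK's implementation. The helper name stripClusterInstanceEngineVersion, the TemplateLike and CfnResourceLike types, and the StandaloneInstance resource are hypothetical, and the sketch assumes that an AWS::RDS::DBInstance with DBClusterIdentifier set is a member of a cluster.

```typescript
// Minimal shape of a synthesized CloudFormation template (only what this sketch needs).
interface CfnResourceLike {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface TemplateLike {
  Resources: Record<string, CfnResourceLike>;
}

// Hypothetical helper: removes EngineVersion from DB instances that belong to a
// cluster and returns the logical IDs of the resources it changed.
// Assumption: an instance that sets DBClusterIdentifier is a cluster member.
function stripClusterInstanceEngineVersion(template: TemplateLike): string[] {
  const changedIds: string[] = [];
  for (const [logicalId, resource] of Object.entries(template.Resources)) {
    const props = resource.Properties;
    if (resource.Type !== 'AWS::RDS::DBInstance' || props === undefined) {
      continue;
    }
    if ('DBClusterIdentifier' in props && 'EngineVersion' in props) {
      delete props['EngineVersion'];
      changedIds.push(logicalId);
    }
  }
  return changedIds;
}

// Sample input modeled on the resources named in the error message above.
const sampleTemplate: TemplateLike = {
  Resources: {
    DatabaseCluster68FC2945: {
      Type: 'AWS::RDS::DBCluster',
      Properties: { Engine: 'aurora-postgresql', EngineVersion: '10.18' },
    },
    DatabaseClusterInstance1C566869D: {
      Type: 'AWS::RDS::DBInstance',
      Properties: {
        DBClusterIdentifier: { Ref: 'DatabaseCluster68FC2945' },
        Engine: 'aurora-postgresql',
        EngineVersion: '10.18',
      },
    },
    // Hypothetical non-Aurora instance: it keeps its EngineVersion.
    StandaloneInstance: {
      Type: 'AWS::RDS::DBInstance',
      Properties: { Engine: 'postgres', EngineVersion: '13.4' },
    },
  },
};

const changedIds = stripClusterInstanceEngineVersion(sampleTemplate);
console.log(changedIds); // [ 'DatabaseClusterInstance1C566869D' ]
```

The cluster resource keeps its EngineVersion, since that is where the version is meant to be managed; only the member instance loses the property.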

Additional Information/Context

I confirmed this issue by testing in my AWS accounts. I also have an internal ticket (case #10630169951) opened with AWS Support about this issue. In addition to a CDK fix, it seems this could be handled more elegantly on the CloudFormation side.

CDK CLI Version

2.38.1 (build a5ced21)

Framework Version

No response

Node.js Version

16 LTS

OS

macOS

Language

TypeScript

Language Version

No response

Other information

Because of the unrecoverable state in which the DatabaseCluster stack is left, I’d strongly recommend treating this as a P1 bug.

Issue Analytics

  • State: closed
  • Created a year ago
  • Reactions: 5
  • Comments: 8 (5 by maintainers)

Top GitHub Comments

davenix-palmetto commented, Sep 1, 2022 (2 reactions)

Thanks all for raising this issue. We are hitting this in production as well.

corymhall commented, Aug 25, 2022 (2 reactions)

It seems like this issue impacts a significant number of customers, and I’ve tagged it as P1, which means it should be on our near-term roadmap.

We welcome community contributions! If you are able, we encourage you to contribute (https://github.com/aws/aws-cdk/blob/master/CONTRIBUTING.md) a bug fix or new feature to the CDK. If you decide to contribute, please start an engineering discussion in this issue to ensure there is a commonly understood design before submitting code. This will minimize the number of review cycles and get your code merged faster.
