
(aws-logs): log retention failing cdk deployment with OperationAbortedException


What is the problem?

The LogRetention custom resource is causing CDK deployments to fail due to a race condition between the log group that LogRetention is trying to create and the log group being created (and its retention period being set) for the LogRetention Lambda itself.

An issue was previously filed for this and a fix was completed and released, but many people report that the problem persists. The previous issue is here: https://github.com/aws/aws-cdk/issues/15709

Reproduction Steps

This is difficult to reproduce consistently, but the more a CDK app makes use of the LogRetention custom resource, the more likely it is to happen.

The RFDK integration tests (RFDK is a CDK construct library) deploy multiple stacks in parallel and have seen the failure on 7 of the last 10 runs.
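The failure mode can be simulated without AWS: a fake CloudWatch-Logs-like service that allows only one in-flight mutation per resource will abort the second of two concurrent calls, which is why parallel stacks sharing the LogRetention Lambda make the failure more likely. This is a minimal sketch; `FakeLogService` and all other names here are hypothetical, not CDK or SDK internals.

```javascript
// Hypothetical in-memory stand-in for CloudWatch Logs: only one
// mutation may be in flight per log group at any moment.
class FakeLogService {
  constructor() {
    this.groups = new Set();
    this.locked = new Set(); // resources with an operation in progress
  }

  async createLogGroup(name, workMs) {
    if (this.locked.has(name)) {
      const err = new Error('A conflicting operation is currently in progress');
      err.code = 'OperationAbortedException';
      throw err;
    }
    this.locked.add(name);
    await new Promise(resolve => setTimeout(resolve, workMs)); // simulate API latency
    this.locked.delete(name);
    if (this.groups.has(name)) {
      const err = new Error('The specified log group already exists');
      err.code = 'ResourceAlreadyExistsException';
      throw err;
    }
    this.groups.add(name);
  }
}

// Two stacks deploying in parallel both try to ensure the log group
// for the shared LogRetention Lambda -- the second caller is aborted.
async function demo() {
  const svc = new FakeLogService();
  const results = await Promise.allSettled([
    svc.createLogGroup('/aws/lambda/log-retention', 50),
    svc.createLogGroup('/aws/lambda/log-retention', 50),
  ]);
  return results.map(r => (r.status === 'fulfilled' ? 'ok' : r.reason.code));
}
```

`demo()` resolves to `['ok', 'OperationAbortedException']`: the first create wins, and the concurrent one is rejected, matching the error seen in the deployment.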

What did you expect to happen?

We were expecting the CDK app to deploy successfully.

What actually happened?

The CDK app failed with the following error:

CREATE_FAILED        | Custom::LogRetention                        | RenderStructRQ2/RenderQueue/LogGroupWrapper (RenderStructRQ2RenderQueueLogGroupWrapper18972F11) Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 5a7d9dda-94e2-4d20-ad9b-eb66d61746fa)
        new LogRetention (/local/home/horsmand/workspace/rfdk/integ/node_modules/@aws-cdk/aws-logs/lib/log-retention.ts:53:22)
        \_ Function.createOrFetch (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/core/lib/log-group-factory.ts:46:9)
        \_ new RenderQueue (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/deadline/lib/render-queue.ts:398:37)
        \_ new RenderStruct (/local/home/horsmand/workspace/rfdk/integ/integ/lib/render-struct.ts:103:24)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/integ/components/deadline/deadline_02_renderQueue/bin/deadline_02_renderQueue.ts:52:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Module.m._compile (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1365:23)
        \_ Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Object.require.extensions.<computed> [as .ts] (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1368:12)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ main (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:331:12)
        \_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:482:3)
        \_ Module._compile (internal/modules/cjs/loader.js:1085:14)
        \_ Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
        \_ Module.load (internal/modules/cjs/loader.js:950:32)
        \_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
        \_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
        \_ /local/home/horsmand/.nvm/versions/node/v14.17.1/lib/node_modules/npm/node_modules/libnpx/index.js:268:14

CDK CLI Version

1.129.0

Framework Version

No response

Node.js Version

14.17.1

OS

Amazon Linux 2

Language

Typescript

Language Version

TypeScript ~4.4.4

Other information

I modified the LogRetention Lambda to include a few more log statements and was able to reproduce the error. This is my modified createLogGroupSafe function:

var _a;

let retryCount = (options === null || options === void 0 ? void 0 : options.maxRetries) == undefined ? 10 : options.maxRetries;
console.log(`retryCount: ${retryCount}`); // DEBUGGING
const delay = ((_a = options === null || options === void 0 ? void 0 : options.retryOptions) === null || _a === void 0 ? void 0 : _a.base) == undefined ? 10 : options.retryOptions.base;
do {
    try {
        const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
        await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
        console.log(`LogGroup ${logGroupName} created successfully. Returning.`); // DEBUGGING
        return;
    }
    catch (error) {
        console.log('Caught error'); // DEBUGGING
        console.log(error); // DEBUGGING
        console.log(error.code); // DEBUGGING
        if (error.code === 'ResourceAlreadyExistsException') {
            // The log group is already created by the lambda execution
            console.log('ResourceAlreadyExistsException'); // DEBUGGING
            return;
        }
        if (error.code === 'OperationAbortedException') {
            console.log(`Caught OperationAbortedException. Retry: ${retryCount}`); // DEBUGGING
            if (retryCount > 0) {
                retryCount--;
                console.log(`Retrying with new retry count: ${retryCount}`); // DEBUGGING
                await new Promise(resolve => setTimeout(resolve, delay));
                continue;
            }
            else {
                console.log('Out of retry attempts'); // DEBUGGING
                // The log group is still being created by another execution but we are out of retries
                throw new Error('Out of attempts to create a logGroup');
            }
        }
        // Any other error
        console.error(error);
        throw error;
    }
} while (true); // exit happens on retry count check
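Stripped of the debugging statements, the loop above amounts to a generic "retry on OperationAbortedException" pattern. The sketch below reproduces that logic against a hypothetical flaky call; `retryOnAbort` and `flakyCreate` are illustrative names, not CDK's actual helpers.

```javascript
// Retry `fn` while it throws OperationAbortedException, mirroring the
// loop in createLogGroupSafe: up to `maxRetries` extra attempts with a
// fixed `delayMs` pause between them (the code above defaults to 10 and 10).
async function retryOnAbort(fn, maxRetries = 10, delayMs = 10) {
  let retryCount = maxRetries;
  while (true) {
    try {
      return await fn();
    } catch (error) {
      if (error.code === 'ResourceAlreadyExistsException') {
        return; // another execution already created it -- that is fine
      }
      if (error.code === 'OperationAbortedException' && retryCount > 0) {
        retryCount--;
        await new Promise(resolve => setTimeout(resolve, delayMs));
        continue;
      }
      throw error; // out of retries, or an unrelated error
    }
  }
}

// A hypothetical call that aborts `failures` times before succeeding,
// standing in for createLogGroup under contention.
function flakyCreate(failures) {
  let remaining = failures;
  return async () => {
    if (remaining-- > 0) {
      const err = new Error('conflicting operation in progress');
      err.code = 'OperationAbortedException';
      throw err;
    }
    return 'created';
  };
}
```

For example, `retryOnAbort(flakyCreate(3), 10, 1)` resolves to `'created'` after three aborted attempts, while exhausting the retry budget rethrows the abort, which is what ultimately surfaces to CloudFormation.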

Deploying with this modified function produced the following log:

START RequestId: d0d569a4-4171-468a-988f-b09e041e0644 Version: $LATEST
2021-11-17T02:55:41.181Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    
{
    "RequestType": "Create",
    "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
    "ResponseURL": "https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3A##########%3Astack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f%7CRenderStructRQ2RenderQueueLogGroupWrapper18972F11%7C5a7d9dda-94e2-4d20-ad9b-eb66d61746fa?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20211117T025540Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SDABHSHLV%2F20211117%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=49cb653700d2f4d9d1a165e48f90cfade037da40c77ab2b356dfa6d73bb83ced",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "ResourceType": "Custom::LogRetention",
    "ResourceProperties": {
        "ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
        "RetentionInDays": "3",
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

2021-11-17T02:55:41.188Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:41.867Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    LogGroup RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue created successfully. Returning.
2021-11-17T02:55:42.006Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    retryCount: 10
2021-11-17T02:55:42.130Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Caught error
2021-11-17T02:55:42.170Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException: The specified log group already exists
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'ResourceAlreadyExistsException',
  time: 2021-11-17T02:55:42.129Z,
  requestId: '43e6312e-25b3-4d54-a95d-01c393e66280',
  statusCode: 400,
  retryable: false,
  retryDelay: 63.22735256810377
}
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.207Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    ResourceAlreadyExistsException
2021-11-17T02:55:42.290Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
    at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
    at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
    at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
    at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
    at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
    at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
  code: 'OperationAbortedException',
  time: 2021-11-17T02:55:42.290Z,
  requestId: '16d429d1-1e27-4575-921e-237d508d1318',
  statusCode: 400,
  retryable: false,
  retryDelay: 28.945744461931877
}
2021-11-17T02:55:42.307Z    d0d569a4-4171-468a-988f-b09e041e0644    INFO    Responding 
{
    "Status": "FAILED",
    "Reason": "A conflicting operation is currently in progress against this resource. Please try again.",
    "PhysicalResourceId": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue",
    "StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
    "RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
    "LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
    "Data": {
        "LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
    }
}

END RequestId: d0d569a4-4171-468a-988f-b09e041e0644

From this we can see that:

  1. It successfully created the log group it was asked to create.
  2. It hit a ResourceAlreadyExistsException, likely when trying to create the log group for the Lambda itself, which is expected if the Lambda has previously run.
  3. After that, it goes straight to an OperationAbortedException without hitting the “Caught error” logging line I added or flowing through any of the retry logic.

I believe this logging indicates that the OperationAbortedException might not be coming from createLogGroupSafe, and might instead be coming from the setRetentionPolicy call that happens afterwards.
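If that reading is right, the fix would be to apply the same retry treatment to the retention-policy call, not just to log-group creation. A sketch of the idea, using a hypothetical client whose putRetentionPolicy aborts twice before succeeding (none of these names are CDK's real internals):

```javascript
// Retry any CloudWatch Logs mutation on OperationAbortedException.
async function withAbortRetries(fn, maxRetries = 10, delayMs = 10) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (error.code !== 'OperationAbortedException' || attempt >= maxRetries) {
        throw error;
      }
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical client: putRetentionPolicy aborts twice, then succeeds --
// the shape of failure the log above suggests.
function makeClient() {
  let aborts = 2;
  return {
    async putRetentionPolicy(logGroupName, retentionInDays) {
      if (aborts-- > 0) {
        const err = new Error('conflicting operation in progress');
        err.code = 'OperationAbortedException';
        throw err;
      }
      return { logGroupName, retentionInDays };
    },
  };
}

async function setRetentionSafe(client, name, days) {
  // Wrapping the policy call -- not only createLogGroup -- absorbs the race.
  return withAbortRetries(() => client.putRetentionPolicy(name, days), 5, 1);
}
```

With this wrapping, `setRetentionSafe(makeClient(), 'my-group', 3)` resolves after the transient aborts instead of surfacing the failure to CloudFormation.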

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 7
  • Comments: 6 (2 by maintainers)

Top GitHub Comments

1 reaction
jonashdown commented, Dec 8, 2021

Could the speed of the machine CDK is running on manifest this? We see “works on my machine” when running CDK from developers’ MacBooks, but failures when running CDK through GitHub Actions.

0 reactions
mtsoltan commented, Dec 21, 2021

Still occurs as of v1.136.0


