(aws-logs): log retention failing CDK deployment with OperationAbortedException
What is the problem?
The LogRetention custom resource is causing CDK deployments to fail due to a race condition between the log group that the custom resource is trying to create and the log group that is created (and whose retention period is set) for the log retention Lambda itself.
An issue was previously filed for this and a fix was completed and released, but many people are still reporting that the problem persists. The previous issue is here: https://github.com/aws/aws-cdk/issues/15709
Reproduction Steps
This is difficult to reproduce consistently, but the more a CDK app makes use of the LogRetention custom resource, the more likely it is to happen.
The RFDK integration tests (RFDK is a CDK construct library) deploy multiple stacks in parallel and have seen the failure on 7 of the last 10 runs.
What did you expect to happen?
We were expecting the CDK app to deploy successfully.
What actually happened?
The CDK app failed with the following error:
CREATE_FAILED | Custom::LogRetention | RenderStructRQ2/RenderQueue/LogGroupWrapper (RenderStructRQ2RenderQueueLogGroupWrapper18972F11) Received response status [FAILED] from custom resource. Message returned: A conflicting operation is currently in progress against this resource. Please try again. (RequestId: 5a7d9dda-94e2-4d20-ad9b-eb66d61746fa)
new LogRetention (/local/home/horsmand/workspace/rfdk/integ/node_modules/@aws-cdk/aws-logs/lib/log-retention.ts:53:22)
\_ Function.createOrFetch (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/core/lib/log-group-factory.ts:46:9)
\_ new RenderQueue (/local/home/horsmand/workspace/rfdk/integ/packages/aws-rfdk/lib/deadline/lib/render-queue.ts:398:37)
\_ new RenderStruct (/local/home/horsmand/workspace/rfdk/integ/integ/lib/render-struct.ts:103:24)
\_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/integ/components/deadline/deadline_02_renderQueue/bin/deadline_02_renderQueue.ts:52:3)
\_ Module._compile (internal/modules/cjs/loader.js:1085:14)
\_ Module.m._compile (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1365:23)
\_ Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
\_ Object.require.extensions.<computed> [as .ts] (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/index.ts:1368:12)
\_ Module.load (internal/modules/cjs/loader.js:950:32)
\_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
\_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
\_ main (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:331:12)
\_ Object.<anonymous> (/local/home/horsmand/workspace/rfdk/integ/node_modules/ts-node/src/bin.ts:482:3)
\_ Module._compile (internal/modules/cjs/loader.js:1085:14)
\_ Object.Module._extensions..js (internal/modules/cjs/loader.js:1114:10)
\_ Module.load (internal/modules/cjs/loader.js:950:32)
\_ Function.Module._load (internal/modules/cjs/loader.js:790:14)
\_ Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:76:12)
\_ /local/home/horsmand/.nvm/versions/node/v14.17.1/lib/node_modules/npm/node_modules/libnpx/index.js:268:14
CDK CLI Version
1.129.0
Framework Version
No response
Node.js Version
14.17.1
OS
Amazon Linux 2
Language
Typescript
Language Version
TypeScript ~4.4.4
Other information
I modified the LogRetention Lambda to include a few more log statements and was able to reproduce the error. This is my modified createLogGroupSafe function:
var _a;
let retryCount = (options === null || options === void 0 ? void 0 : options.maxRetries) == undefined ? 10 : options.maxRetries;
console.log(`retryCount: ${retryCount}`); // DEBUGGING
const delay = ((_a = options === null || options === void 0 ? void 0 : options.retryOptions) === null || _a === void 0 ? void 0 : _a.base) == undefined ? 10 : options.retryOptions.base;
do {
  try {
    const cloudwatchlogs = new AWS.CloudWatchLogs({ apiVersion: '2014-03-28', region, ...options });
    await cloudwatchlogs.createLogGroup({ logGroupName }).promise();
    console.log(`LogGroup ${logGroupName} created successfully. Returning.`); // DEBUGGING
    return;
  } catch (error) {
    console.log('Caught error'); // DEBUGGING
    console.log(error); // DEBUGGING
    console.log(error.code); // DEBUGGING
    if (error.code === 'ResourceAlreadyExistsException') {
      // The log group is already created by the lambda execution
      console.log('ResourceAlreadyExistsException'); // DEBUGGING
      return;
    }
    if (error.code === 'OperationAbortedException') {
      console.log(`Caught OperationAbortedException. Retry: ${retryCount}`); // DEBUGGING
      if (retryCount > 0) {
        retryCount--;
        console.log(`Retrying with new retry count: ${retryCount}`); // DEBUGGING
        await new Promise(resolve => setTimeout(resolve, delay));
        continue;
      } else {
        console.log('Out of retry attempts'); // DEBUGGING
        // The log group is still being created by another execution but we are out of retries
        throw new Error('Out of attempts to create a logGroup');
      }
    }
    // Any other error
    console.error(error);
    throw error;
  }
} while (true); // exit happens on retry count check
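As an aside, the retry loop above sleeps for a fixed base delay (10 ms by default), so concurrent executions that collide once are likely to collide again on the next attempt. One way to reduce that is exponential backoff with jitter; here is a minimal sketch of that pattern (the helper name, parameters, and defaults are mine, not the CDK's actual implementation):

```javascript
// Minimal retry helper with exponential backoff and full jitter.
// `isRetryable` decides which errors get retried (e.g. OperationAbortedException).
// This is an illustrative sketch, not the CDK's actual code.
async function retryWithBackoff(fn, { maxRetries = 10, baseMs = 10, maxMs = 1000, isRetryable = () => true } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxRetries || !isRetryable(error)) {
        throw error;
      }
      // Full jitter: sleep a random time in [0, min(maxMs, baseMs * 2^attempt)).
      const delay = Math.random() * Math.min(maxMs, baseMs * 2 ** attempt);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

With a helper like this, the createLogGroup call could be wrapped as `retryWithBackoff(() => cloudwatchlogs.createLogGroup({ logGroupName }).promise(), { isRetryable: e => e.code === 'OperationAbortedException' })`.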
This produced the following log:
START RequestId: d0d569a4-4171-468a-988f-b09e041e0644 Version: $LATEST
2021-11-17T02:55:41.181Z d0d569a4-4171-468a-988f-b09e041e0644 INFO
{
"RequestType": "Create",
"ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
"ResponseURL": "https://cloudformation-custom-resource-response-uswest2.s3-us-west-2.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-west-2%3A##########%3Astack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f%7CRenderStructRQ2RenderQueueLogGroupWrapper18972F11%7C5a7d9dda-94e2-4d20-ad9b-eb66d61746fa?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20211117T025540Z&X-Amz-SignedHeaders=host&X-Amz-Expires=7200&X-Amz-Credential=AKIA54RCMT6SDABHSHLV%2F20211117%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=49cb653700d2f4d9d1a165e48f90cfade037da40c77ab2b356dfa6d73bb83ced",
"StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
"RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
"LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
"ResourceType": "Custom::LogRetention",
"ResourceProperties": {
"ServiceToken": "arn:aws:lambda:us-west-2:##########:function:RFDKInteg-RQ-ComponentTie-LogRetentionaae0aa3c5b4d-Wm8HvYy4jYOD",
"RetentionInDays": "3",
"LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
}
}
2021-11-17T02:55:41.188Z d0d569a4-4171-468a-988f-b09e041e0644 INFO retryCount: 10
2021-11-17T02:55:41.867Z d0d569a4-4171-468a-988f-b09e041e0644 INFO LogGroup RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue created successfully. Returning.
2021-11-17T02:55:42.006Z d0d569a4-4171-468a-988f-b09e041e0644 INFO retryCount: 10
2021-11-17T02:55:42.130Z d0d569a4-4171-468a-988f-b09e041e0644 INFO Caught error
2021-11-17T02:55:42.170Z d0d569a4-4171-468a-988f-b09e041e0644 INFO ResourceAlreadyExistsException: The specified log group already exists
at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
code: 'ResourceAlreadyExistsException',
time: 2021-11-17T02:55:42.129Z,
requestId: '43e6312e-25b3-4d54-a95d-01c393e66280',
statusCode: 400,
retryable: false,
retryDelay: 63.22735256810377
}
2021-11-17T02:55:42.207Z d0d569a4-4171-468a-988f-b09e041e0644 INFO ResourceAlreadyExistsException
2021-11-17T02:55:42.207Z d0d569a4-4171-468a-988f-b09e041e0644 INFO ResourceAlreadyExistsException
2021-11-17T02:55:42.290Z d0d569a4-4171-468a-988f-b09e041e0644 INFO OperationAbortedException: A conflicting operation is currently in progress against this resource. Please try again.
at Request.extractError (/var/runtime/node_modules/aws-sdk/lib/protocol/json.js:52:27)
at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:106:20)
at Request.emit (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:78:10)
at Request.emit (/var/runtime/node_modules/aws-sdk/lib/request.js:688:14)
at Request.transition (/var/runtime/node_modules/aws-sdk/lib/request.js:22:10)
at AcceptorStateMachine.runTo (/var/runtime/node_modules/aws-sdk/lib/state_machine.js:14:12)
at /var/runtime/node_modules/aws-sdk/lib/state_machine.js:26:10
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:38:9)
at Request.<anonymous> (/var/runtime/node_modules/aws-sdk/lib/request.js:690:12)
at Request.callListeners (/var/runtime/node_modules/aws-sdk/lib/sequential_executor.js:116:18) {
code: 'OperationAbortedException',
time: 2021-11-17T02:55:42.290Z,
requestId: '16d429d1-1e27-4575-921e-237d508d1318',
statusCode: 400,
retryable: false,
retryDelay: 28.945744461931877
}
2021-11-17T02:55:42.307Z d0d569a4-4171-468a-988f-b09e041e0644 INFO Responding
{
"Status": "FAILED",
"Reason": "A conflicting operation is currently in progress against this resource. Please try again.",
"PhysicalResourceId": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue",
"StackId": "arn:aws:cloudformation:us-west-2:##########:stack/RFDKInteg-RQ-ComponentTier1637117337359895769/98ae8520-4751-11ec-a61d-0284b18a205f",
"RequestId": "5a7d9dda-94e2-4d20-ad9b-eb66d61746fa",
"LogicalResourceId": "RenderStructRQ2RenderQueueLogGroupWrapper18972F11",
"Data": {
"LogGroupName": "RFDKInteg-RQ-ComponentTier1637117337359895769-RenderStructRQ2RenderQueue"
}
}
END RequestId: d0d569a4-4171-468a-988f-b09e041e0644
From this we can see that:
- It successfully created the log group it was initially trying to create.
- It hit a ResourceAlreadyExistsException, likely when trying to create the log group for the Lambda itself, which is expected if the Lambda has run before.
- After that, it went straight to an OperationAbortedException without hitting the "Caught error" logging line I added or flowing through any of the retry logic.

I believe this logging indicates that the OperationAbortedException might not be coming from createLogGroupSafe, and might instead be coming from the setRetentionPolicy call that happens afterwards.
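If that hypothesis is right, a fix would be to guard the retention-policy call with the same OperationAbortedException retry loop that createLogGroupSafe already has. A sketch of what that could look like follows; putRetentionPolicy is the real CloudWatch Logs API call, but the wrapper function, its name, and its parameters are hypothetical, and the client is injected rather than constructed:

```javascript
// Hypothetical sketch: retry the retention-policy call on
// OperationAbortedException the same way createLogGroupSafe retries
// createLogGroup. `client` is an injected CloudWatch Logs client
// (e.g. an AWS.CloudWatchLogs instance from the v2 SDK).
async function setRetentionPolicySafe(client, logGroupName, retentionInDays, maxRetries = 10, delayMs = 10) {
  let retryCount = maxRetries;
  do {
    try {
      await client.putRetentionPolicy({ logGroupName, retentionInDays }).promise();
      return;
    } catch (error) {
      if (error.code === 'OperationAbortedException' && retryCount > 0) {
        // Another concurrent operation (e.g. createLogGroup for the same
        // group) is in progress; wait and retry instead of failing the
        // custom resource.
        retryCount--;
        await new Promise(resolve => setTimeout(resolve, delayMs));
        continue;
      }
      throw error;
    }
  } while (true); // exit happens via return or throw above
}
```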
Issue Analytics
- State:
- Created: 2 years ago
- Reactions: 7
- Comments: 6 (2 by maintainers)
Could the speed of the machine CDK is running on manifest this? We see "works on my machine" when running CDK from developers' MacBooks, but failures when running CDK through GitHub Actions.

Still occurs as of v1.136.0