OpenSearch: Bug in Describe-Domain API is causing CFN GetAtt "Internal error occurred"


What is the problem?

When you create an OpenSearch Domain in a VPC and then reference its endpoint in the AWS CDK (which creates a GetAtt reference in CloudFormation), the Domain is created successfully, but the CloudFormation resource (Fargate) that references the endpoint fails with “Internal error occurred” (see attached screenshot). Additional findings from research are detailed in “Other information” below. [Screenshot: Screen Shot 2022-01-02 at 21 45 28]

Reproduction Steps

from aws_cdk import aws_opensearchservice as opensearch

# Runs inside a Stack method; opensearch_params and zone_count are defined elsewhere.
self.opensearch_domain = opensearch.Domain(self, "OpenSearchIndices",
    **opensearch_params,
    version=opensearch.EngineVersion.OPENSEARCH_1_0,
    vpc=self.scope.network_stack.vpc,
    logging={
        "slow_search_log_enabled": True,
        "app_log_enabled": True,
        "slow_index_log_enabled": True
    },
    encryption_at_rest={
        "enabled": True
    },
    zone_awareness=opensearch.ZoneAwarenessConfig(
        enabled=True,
        availability_zone_count=zone_count
    ),
    removal_policy=self.data_resources_removal_policy
)
# Referencing the endpoint is what produces the GetAtt in the synthesized template.
self.opensearch_endpoint = self.opensearch_domain.domain_endpoint

What did you expect to happen?

All resources are created successfully.

What actually happened?

The CloudFormation stack rolled back due to a resource creation failure (same screenshot as above).

CDK CLI Version

2.3

Framework Version

No response

Node.js Version

16.13.1

OS

Mac OS 12.1

Language

Python

Language Version

3.10.1

Other information

I noticed that I didn’t have this problem when creating a public OpenSearch Domain. So I thought it might have something to do with how the API is returning domain endpoints with Domains created in a VPC vs public Domains.

I created a public Domain and then ran aws opensearch describe-domain against both the Domain created with the CDK and the test public Domain. Here were the results:

# Public Domain
~ % aws opensearch describe-domain --domain-name test | jq '.DomainStatus.Endpoint'
"search-test-xxxxxxxxx.us-east-1.es.amazonaws.com"
# Domain in VPC
~ % aws opensearch describe-domain --domain-name dataindic-xxxxxxxxx | jq '.DomainStatus.Endpoint' 
null
~ % aws opensearch describe-domain --domain-name dataindic-xxxxx | jq '.DomainStatus.Endpoints'
{
  "vpc": "vpc-xxxxxxx-yyyyyyy-zzzzzzzz.us-east-1.es.amazonaws.com"
}

As you can see, the Endpoint value is null for Domains in a VPC. Instead, the value appears under a separate key called “Endpoints”. It appears that either CloudFormation wasn’t updated to read the “Endpoints” key, or OpenSearch should also be populating “Endpoint” for Domains in a VPC.

I understand that this might be a CloudFormation or OpenSearch bug, but until those teams sort it out, it’s obviously a bug in the AWS CDK. And it seems like this is something the CDK could maybe work around for the time being with a custom resource. Example:

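import boto3

# aws_opensearch_domain_name is assumed to hold the name of the Domain to look up.
opensearch_client = boto3.client('opensearch')
opensearch_domain_details = opensearch_client.describe_domain(
    DomainName=aws_opensearch_domain_name
)['DomainStatus']
# Public Domains populate 'Endpoint'; VPC Domains populate 'Endpoints' -> 'vpc' instead.
endpoints = opensearch_domain_details.get('Endpoints') or {}
opensearch_endpoint = opensearch_domain_details.get('Endpoint') or endpoints.get('vpc')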
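
For illustration only, the lookup above could be wrapped in a custom-resource handler along these lines. This is a minimal sketch, not the CDK's own fix: it assumes the CDK provider-framework convention of an on_event handler that returns a PhysicalResourceId and a Data map, and the names resolve_endpoint, DomainName (as a resource property), and DomainEndpoint (as the returned attribute) are hypothetical choices made for this example.

```python
def resolve_endpoint(domain_status):
    """Pick the endpoint out of a describe-domain 'DomainStatus' dict.

    Public Domains report 'Endpoint'; VPC Domains report 'Endpoints' -> 'vpc'.
    Returns None when neither is present (e.g. the Domain is still creating).
    """
    endpoint = domain_status.get("Endpoint")
    if endpoint:
        return endpoint
    endpoints = domain_status.get("Endpoints") or {}
    return endpoints.get("vpc")


def on_event(event, context, client=None):
    """Hypothetical on_event handler for a CDK provider-framework custom resource."""
    if event["RequestType"] == "Delete":
        # Nothing to clean up: the handler only reads the Domain's endpoint.
        return {"PhysicalResourceId": event["PhysicalResourceId"]}

    if client is None:
        # Imported lazily so the pure logic above can be exercised without boto3.
        import boto3
        client = boto3.client("opensearch")

    # 'DomainName' is an assumed property name passed in by the stack.
    domain_name = event["ResourceProperties"]["DomainName"]
    status = client.describe_domain(DomainName=domain_name)["DomainStatus"]
    endpoint = resolve_endpoint(status)
    if endpoint is None:
        raise RuntimeError(f"No endpoint reported yet for domain {domain_name}")

    return {
        "PhysicalResourceId": f"{domain_name}-endpoint",
        # Values under 'Data' are what the stack can read back as attributes.
        "Data": {"DomainEndpoint": endpoint},
    }
```

The client parameter exists only so the handler can be tried locally with a stub; in a deployed Lambda it would be left unset.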

Issue Analytics

  • State: closed
  • Created: 2 years ago
  • Reactions: 5
  • Comments: 39 (19 by maintainers)

Top GitHub Comments

automartin5000 commented, Feb 25, 2022 (2 reactions)

Yes, the service team has gotten back to me and confirmed for the stack arns available, that all of them were due to the throttling limits.

I’m working with them on getting the error message improved, to make it clear to the user what the failure is being caused by

@peterwoodworth I could imagine this being our issue too, although we only have 5 task definitions using that value. This seems like something CloudFormation should just auto-retry?
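
As a rough illustration of the auto-retry idea, a caller that hits throttling (for example, a custom resource that calls describe-domain itself) could retry with exponential backoff and jitter. This is a sketch only and says nothing about how CloudFormation behaves internally; call_with_backoff, is_throttling_error, and the set of error codes are assumptions made for this example, relying on botocore's ClientError exposing a response dict with an Error code.

```python
import random
import time

# Assumed set of throttling-style error codes; adjust for the service in use.
THROTTLE_CODES = {"ThrottlingException", "Throttling", "TooManyRequestsException"}


def is_throttling_error(exc):
    """True if exc looks like a botocore ClientError caused by throttling."""
    response = getattr(exc, "response", None) or {}
    code = response.get("Error", {}).get("Code", "")
    return code in THROTTLE_CODES


def call_with_backoff(fn, is_retryable, max_attempts=5, base_delay=0.5,
                      sleep=time.sleep):
    """Call fn(), retrying retryable failures with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            # Delay doubles each attempt; jitter spreads out concurrent callers.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))
```

A call such as call_with_backoff(lambda: client.describe_domain(DomainName=name), is_throttling_error) would then retry only throttling failures and re-raise everything else immediately.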

SamStephens commented, Aug 9, 2022 (1 reaction)

@peterwoodworth said:

I’ve been told that you will now receive a proper error message in the case of throttling. Can anyone here confirm this is the case?

Most of us currently following here have some form of workaround for this issue in place, and I don’t think any of us will be removing that workaround until this issue is fixed properly. We will not be removing our workarounds, because we cannot expose our stacks to non-deterministic failures. I’ve already described the experience I had where the rollback failed because I hit this error during a non-reversible upgrade. A decent error message would not have helped me get out of the awful position this CloudFormation defect left me in.

A proper error message is better than nothing. However, throttling is an implementation detail of the deployments CloudFormation does, and one that we as users should not be exposed to at all. The abstraction is leaking.
