NoCredentialsError reported occasionally, thread-safety problems?
Describe the bug
Here is the command I run every day to garbage-collect AWS EIPs across 6 regions.
c7n-org run -s log -c aws/accounts/nca-accounts.yml --not-accounts 123456789012 --metrics-uri 'aws://master?region=eu-west-2' -u aws/policies/gc-eip.yml
Here is the log I got for one particular account, myprod. You can see that only 3 regions reported errors. These errors only started appearing recently, and they do not occur every day: there are good days and bad days, which gives me a headache.
2022-03-16 23:09:44,541: c7n_org:ERROR Exception running policy:unused_eip_mark account:myprod region:ap-southeast-2 error:NoCredentialsError('Unable to locate credentials')
2022-03-16 23:09:44,570: c7n_org:ERROR Exception running policy:unused_eip_mark account:myprod region:us-west-2 error:NoCredentialsError('Unable to locate credentials')
2022-03-16 23:09:44,578: c7n_org:ERROR Exception running policy:unused_eip_mark account:myprod region:us-west-1 error:NoCredentialsError('Unable to locate credentials')
Then I dug into the detailed log in log/myprod/ap-southeast-2/unused_eip_mark/custodian-run.log. There are a few things I’d like to call out:
- log entries from some other regions showed up in this ap-southeast-2 log file
- the policy was run multiple times against the same region
- the NoCredentialsError was reported at 23:09:44, and then the following retries came
2022-03-15 23:41:36,980 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:ap-southeast-2 count:0 time:1.43
# NOTE here starts a brand new day, to see the error again
2022-03-16 23:09:45,148 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:ap-southeast-2 count:0 time:0.00
2022-03-16 23:09:45,151 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:ap-southeast-2 count:0 time:0.00
2022-03-16 23:09:45,155 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:ap-southeast-2 count:0 time:0.00
2022-03-16 23:09:48,739 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:48,741 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:48,744 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:50,267 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:50,270 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:50,272 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:50,276 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:51,854 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:51,857 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:51,859 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:51,861 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:52,328 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,331 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,333 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,335 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,801 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,803 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,805 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,808 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:53,263 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,266 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,268 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,271 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,833 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:53,836 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:53,838 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:53,840 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:54,365 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:54,367 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:54,370 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:54,372 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-west-2 count:0 time:0.00
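To put numbers on the first two observations above (entries from other regions, and repeated runs of the same policy in one region), a small script can tally the log lines. This is only a sketch: the regular expression is written against the line format shown in this excerpt, and `summarize` and `expected_region` are names made up for illustration, not part of Cloud Custodian.

```python
import re
from collections import Counter

# Matches the custodian.policy INFO lines shown above, e.g.
# "2022-03-16 23:09:45,148 - custodian.policy - INFO - policy:... region:... count:0 time:0.00"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) \S+ - custodian\.policy - INFO - "
    r"policy:(?P<policy>\S+) resource:\S+ region:(?P<region>\S+) "
)


def summarize(lines, expected_region):
    """Return (repeated_runs, foreign_regions) for one custodian-run.log.

    repeated_runs maps (date, policy, region) to a count when that count is > 1.
    foreign_regions maps every region other than expected_region to its line count.
    """
    runs = Counter()
    foreign = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip comments, tracebacks, and other loggers
        runs[(match["date"], match["policy"], match["region"])] += 1
        if match["region"] != expected_region:
            foreign[match["region"]] += 1
    repeated = {key: count for key, count in runs.items() if count > 1}
    return repeated, dict(foreign)


if __name__ == "__main__":
    # Path taken from the report above; adjust to your own output directory.
    path = "log/myprod/ap-southeast-2/unused_eip_mark/custodian-run.log"
    with open(path) as handle:
        repeated, foreign = summarize(handle, "ap-southeast-2")
    print("repeated runs:", repeated)
    print("entries from other regions:", foreign)
```

Grouping by date rather than by exact timestamp is a deliberate simplification; it treats any second run of the same policy and region on the same day as a repeat.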
I’ve heard that boto3 can have thread-safety problems, but I’ve never been able to dig into it further.
What did you expect to happen?
I want the good old days back
Cloud Provider
Amazon Web Services (AWS)
Cloud Custodian version and dependency information
NOTE: The errors were captured on an AWS EC2 instance, but the following info was collected on my MacBook. It should be the same except for the Platform section.
Custodian: 0.9.15
Python: 3.8.11 (default, Jul 23 2021, 04:25:24)
[Clang 12.0.5 (clang-1205.0.22.11)]
Platform: xxx
Using venv: True
Docker: False
Installed:
argcomplete==2.0.0
attrs==21.4.0
boto3==1.21.15
botocore==1.24.15
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
docutils==0.17.1
google-api-core==2.7.1
google-api-python-client==2.39.0
google-auth==2.6.0
google-auth-httplib2==0.1.0
google-cloud-appengine-logging==1.1.1
google-cloud-audit-log==0.2.0
google-cloud-core==2.2.3
google-cloud-logging==2.7.0
google-cloud-monitoring==2.9.1
google-cloud-storage==1.44.0
google-crc32c==1.3.0
google-resumable-media==2.3.2
googleapis-common-protos==1.55.0
grpc-google-iam-v1==0.12.3
grpcio==1.44.0
httplib2==0.20.4
idna==3.3
importlib-metadata==4.11.2
importlib-resources==5.4.0
jmespath==0.10.0
jsonschema==4.4.0
packaging==21.3
proto-plus==1.20.3
protobuf==3.19.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.7
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2021.3
pyyaml==6.0
ratelimiter==1.2.0.post0
requests==2.27.1
retrying==1.3.3
rsa==4.8
s3transfer==0.5.2
six==1.16.0
tabulate==0.8.9
typing-extensions==4.1.1
uritemplate==4.1.1
urllib3==1.26.8
zipp==3.7.0
Policy
- name: unused_eip_mark
  resource: network-addr
  description: Mark un-preserved, unattached EIP for notify in x days
  filters:
    - "tag:custodian_status_gc": absent
    - "tag:custodian_status_gc_notify": absent
    - "tag:Preserve": absent
    - and: *eip_unattached
  actions:
    - type: mark-for-op
      tag: custodian_status_gc_notify
      days: 14
      op: notify
Relevant log/traceback output
please see above
Extra information or context
No response
Issue Analytics
- State:
- Created 2 years ago
- Comments:8 (7 by maintainers)
Top GitHub Comments
Thanks again @kentnsw for reporting this, it’s a good one.
Yeah that feels legit to me. Here’s a bit more context:
As for where to go next…
C7N_ORG_PARALLEL environment variable. It may be worth checking to see what the default worker count would be on your target system:
And then trying lower values of C7N_ORG_PARALLEL to find a spot that “brings the good old days back” (to be clear, this is more of a diagnosis/confirmation aid than a fix) 😃.
AJ Kerrigan mentioned: “IIRC when I’ve seen it, it had more to do with hammering the local EC2 metadata service for creds than thread-safety issues.”
Here comes my theory: the AWS SDK/core/API (whichever layer it is) throttled the get credentials calls, and then NoCredentialsError was raised.
Supporting: get credentials has to be called/triggered by Cloud Custodian for every account/region/policy, which is 80 * 6 * 20 = 9600 calls in a few minutes. It makes sense to me that the EC2 metadata service may turn down some calls.
Of course, I have no proof of that, for either Cloud Custodian or the AWS service. What do you think?
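The two ideas in this thread, checking and lowering the c7n-org worker count and going easier on the EC2 metadata service, can be sketched roughly as below. Several things here are assumptions rather than confirmed facts: the fallback of CPU count times 4 is what I believe c7n-org uses when C7N_ORG_PARALLEL is unset, so check it against your installed version; the helper names are invented for illustration; and the retry and timeout values are arbitrary starting points. AWS_METADATA_SERVICE_NUM_ATTEMPTS and AWS_METADATA_SERVICE_TIMEOUT are standard botocore environment variables, but whether raising them makes this particular error go away is untested.

```python
import multiprocessing
import os


def default_worker_count(env=None):
    """Worker count c7n-org would likely use on this machine.

    Assumption: with C7N_ORG_PARALLEL unset, the default is CPU count * 4.
    """
    env = os.environ if env is None else env
    fallback = multiprocessing.cpu_count() * 4  # assumed default, verify locally
    return int(env.get("C7N_ORG_PARALLEL", fallback))


def estimated_credential_lookups(accounts, regions, policies):
    """Upper bound from the theory above: one lookup per account/region/policy."""
    return accounts * regions * policies


def build_run_env(base_env, parallel, imds_attempts=5, imds_timeout=2):
    """Copy base_env with fewer workers and more patient metadata lookups."""
    env = dict(base_env)
    env["C7N_ORG_PARALLEL"] = str(parallel)
    # botocore settings for the instance metadata credential provider
    env["AWS_METADATA_SERVICE_NUM_ATTEMPTS"] = str(imds_attempts)
    env["AWS_METADATA_SERVICE_TIMEOUT"] = str(imds_timeout)
    return env


if __name__ == "__main__":
    print("default workers:", default_worker_count())
    print("estimated lookups:", estimated_credential_lookups(80, 6, 20))
```

The dictionary returned by `build_run_env(os.environ, parallel=4)` can be passed as `env=` to `subprocess.run` when launching the c7n-org command from the report, which keeps the experiment isolated from the shell's own environment. As noted earlier in the thread, lowering the worker count is a way to confirm the diagnosis more than a fix.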