
NoCredentialsError reported occasionally, thread-safety problems?


Describe the bug

Here is the command I use to garbage-collect AWS EIPs; it targets 6 regions and runs every day.

c7n-org run -s log -c aws/accounts/nca-accounts.yml --not-accounts 123456789012 --metrics-uri 'aws://master?region=eu-west-2' -u aws/policies/gc-eip.yml

Here is the log I got for one particular account, myprod. You can see that only 3 regions reported errors. These errors only started appearing recently, and they do not occur every day; there are good days and bad days, which gives me a headache.

2022-03-16 23:09:44,541: c7n_org:ERROR Exception running policy:unused_eip_mark account:myprod region:ap-southeast-2 error:NoCredentialsError('Unable to locate credentials')
2022-03-16 23:09:44,570: c7n_org:ERROR Exception running policy:unused_eip_mark account:myprod region:us-west-2 error:NoCredentialsError('Unable to locate credentials')
2022-03-16 23:09:44,578: c7n_org:ERROR Exception running policy:unused_eip_mark account:myprod region:us-west-1 error:NoCredentialsError('Unable to locate credentials')

Then I dug into the detailed log in log/myprod/ap-southeast-2/unused_eip_mark/custodian-run.log. There are a few things I’d like to call out:

  • log entries for other regions appear in this ap-southeast-2 log file
  • the policies were run multiple times against the same region
  • the NoCredentialsError was reported at 23:09:44, and then came the following retries
2022-03-15 23:41:36,980 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:ap-southeast-2 count:0 time:1.43
# NOTE: a brand new day starts here, and the error appears again
2022-03-16 23:09:45,148 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:ap-southeast-2 count:0 time:0.00
2022-03-16 23:09:45,151 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:ap-southeast-2 count:0 time:0.00
2022-03-16 23:09:45,155 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:ap-southeast-2 count:0 time:0.00
2022-03-16 23:09:48,739 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:48,741 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:48,744 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:50,267 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:50,270 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:50,272 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:50,276 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:51,854 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:51,857 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:51,859 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:51,861 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:eu-central-1 count:0 time:0.00
2022-03-16 23:09:52,328 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,331 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,333 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,335 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,801 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,803 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,805 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:52,808 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-east-1 count:0 time:0.00
2022-03-16 23:09:53,263 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,266 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,268 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,271 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:53,833 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:53,836 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:53,838 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:53,840 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:ca-central-1 count:0 time:0.00
2022-03-16 23:09:54,365 - custodian.policy - INFO - policy:unused_eip_mark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:54,367 - custodian.policy - INFO - policy:unused_eip_notify resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:54,370 - custodian.policy - INFO - policy:unused_eip_unmark resource:network-addr region:us-west-2 count:0 time:0.00
2022-03-16 23:09:54,372 - custodian.policy - INFO - policy:unused_eip_release resource:network-addr region:us-west-2 count:0 time:0.00

I have heard that boto3 can have thread-safety problems, but I have never been able to dig into it further.

What did you expect to happen?

I want the good old days back: runs that complete in every region without NoCredentialsError.

Cloud Provider

Amazon Web Services (AWS)

Cloud Custodian version and dependency information

NOTE: The errors were captured on an AWS EC2 instance, but the following info was collected on my MacBook. It should be the same except for the Platform section.

Custodian:   0.9.15
Python:      3.8.11 (default, Jul 23 2021, 04:25:24)
             [Clang 12.0.5 (clang-1205.0.22.11)]
Platform:   xxx
Using venv:  True
Docker: False
Installed:

argcomplete==2.0.0
attrs==21.4.0
boto3==1.21.15
botocore==1.24.15
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
docutils==0.17.1
google-api-core==2.7.1
google-api-python-client==2.39.0
google-auth==2.6.0
google-auth-httplib2==0.1.0
google-cloud-appengine-logging==1.1.1
google-cloud-audit-log==0.2.0
google-cloud-core==2.2.3
google-cloud-logging==2.7.0
google-cloud-monitoring==2.9.1
google-cloud-storage==1.44.0
google-crc32c==1.3.0
google-resumable-media==2.3.2
googleapis-common-protos==1.55.0
grpc-google-iam-v1==0.12.3
grpcio==1.44.0
httplib2==0.20.4
idna==3.3
importlib-metadata==4.11.2
importlib-resources==5.4.0
jmespath==0.10.0
jsonschema==4.4.0
packaging==21.3
proto-plus==1.20.3
protobuf==3.19.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.7
pyrsistent==0.18.1
python-dateutil==2.8.2
pytz==2021.3
pyyaml==6.0
ratelimiter==1.2.0.post0
requests==2.27.1
retrying==1.3.3
rsa==4.8
s3transfer==0.5.2
six==1.16.0
tabulate==0.8.9
typing-extensions==4.1.1
uritemplate==4.1.1
urllib3==1.26.8
zipp==3.7.0

Policy

- name: unused_eip_mark
  resource: network-addr
  description: Mark un-preserved, unattached EIP for notify in x days
  filters:
    - "tag:custodian_status_gc": absent
    - "tag:custodian_status_gc_notify": absent
    - "tag:Preserve": absent
    - and: *eip_unattached
  actions:
    - type: mark-for-op
      tag: custodian_status_gc_notify
      days: 14
      op: notify

Relevant log/traceback output

Please see above.

Extra information or context

No response

Issue Analytics

  • State: open
  • Created: 2 years ago
  • Comments: 8 (7 by maintainers)

Top GitHub Comments

1 reaction
ajkerrigan commented, Mar 17, 2022

Thanks again @kentnsw for reporting this, it’s a good one.

Here comes my theory: the AWS SDK/core/API, whatever it is, throttled the get-credentials calls, and then NoCredentialsError was raised.

Yeah that feels legit to me. Here’s a bit more context:

  • Credentials do get cached, but with a c7n-org run there are a whole bunch of “first calls” happening at the same time (creating its own thundering herd problem).
  • Your error snippet shows 3 failed calls happening within the same tenth of a second, which supports that theory:
2022-03-16 23:09:44,541: ...
2022-03-16 23:09:44,570: ...
2022-03-16 23:09:44,578: ...

As for where to go next…

  • By default, c7n-org bases its number of parallel workers on 4 * the number of CPUs. That will vary with the size of your instance. But you can also define your own limit by setting the C7N_ORG_PARALLEL environment variable. It may be worth checking to see what the default worker count would be on your target system:
python -c 'import multiprocessing; print(multiprocessing.cpu_count())'

Then try lower values of C7N_ORG_PARALLEL to find a spot that “brings the good old days back” (to be clear, this is more of a diagnosis/confirmation aid than a fix) 😃.

  • From our side, it may also be worth trying to catch, back off, and retry when we hit that error; a rough sketch of that idea follows below. That’d probably require nailing down a way to consistently reproduce it, though. I see only a couple of mentions of it in previous issues here, and nothing that seems like the same pattern you’re seeing (though there are numerous related botocore issues).
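
As a rough sketch only (not the actual Cloud Custodian implementation), the catch/backoff/retry idea could look something like this, assuming the failure surfaces as botocore’s NoCredentialsError and that exponential backoff with jitter is acceptable:

# Rough illustration only -- not Cloud Custodian code.
import random
import time

import boto3
from botocore.exceptions import NoCredentialsError


def call_with_credential_retry(fn, max_attempts=5, base_delay=0.5):
    # Retry fn() when credential resolution fails (e.g. an overloaded
    # instance metadata service), backing off exponentially between attempts.
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except NoCredentialsError:
            if attempt == max_attempts:
                raise
            # Jitter avoids a synchronized retry storm across workers.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.25))


# Example: describe EIPs in one region, retrying if credentials cannot be
# resolved on the first attempt (resolution happens lazily at the first call).
client = boto3.client("ec2", region_name="ap-southeast-2")
addresses = call_with_credential_retry(lambda: client.describe_addresses())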
1 reaction
kentnsw commented, Mar 17, 2022

AJ Kerrigan mentioned that, IIRC, when he has seen this, it had more to do with hammering the local EC2 metadata service for creds than with thread-safety issues.

Here comes my theory: the AWS SDK/core/API, whatever it is, throttled the get-credentials calls, and then NoCredentialsError was raised.

Supporting evidence:

  • Back in the old days there were fewer policies, and everything worked like a charm. At the moment I have 80 accounts in my org, targeting 6 regions, with approximately 20 policies to run at a time. The errors happen from time to time.
  • IF a get-credentials call has to be triggered by Cloud Custodian for every account/region/policy combination, that is 80 * 6 * 20 = 9600 calls in a few minutes. It would make sense to me that the EC2 metadata service may turn down some of those calls.
  • In a single run, e.g. a policy applied to a couple of accounts and regions, some account/region combinations succeed and some don’t. That means some of the get-credentials calls did get the right response.

Of course, I have no proof of that, on either the Cloud Custodian side or the AWS side. What do you think?
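
To make the call-volume argument concrete, here is a minimal, hypothetical sketch (not how c7n-org actually works) of resolving credentials once and reusing the frozen copy for every region client, so only one metadata lookup happens instead of one per account/region/policy. Frozen credentials never refresh, so this is purely an illustration, not a production pattern:

import boto3

REGIONS = [
    "ap-southeast-2", "us-west-2", "us-west-1",
    "ca-central-1", "eu-central-1", "us-east-1",
]

# One credential lookup for the whole run (hits the metadata service once).
session = boto3.Session()
frozen = session.get_credentials().get_frozen_credentials()

# Reuse the frozen credentials for every region client instead of letting
# each client resolve credentials on its own.
for region in REGIONS:
    client = boto3.client(
        "ec2",
        region_name=region,
        aws_access_key_id=frozen.access_key,
        aws_secret_access_key=frozen.secret_key,
        aws_session_token=frozen.token,
    )
    count = len(client.describe_addresses()["Addresses"])
    print(f"{region}: {count} EIP(s)")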
