Credstash operation delay on ECS
Good day. We are using credstash==1.13.2 with our apps to get secrets from DynamoDB with KMS keys. When our service starts (it runs 2 tasks), we fetch the app secrets in our entrypoint.py:
from credstash import getSecret
getSecret(key, region=self.__region, table=self.__table,
          context=self.__context)
During app startup we are experiencing delays that we cannot explain, and they occur only intermittently. It can happen on a first deployment or on a new version deployment, and one container can start normally while another hangs on the credstash call. We don’t want to just increase the health check grace period for the app, because we don’t understand the nature of the issue and therefore cannot pick a startup time that is guaranteed not to fail. The delay is different each time. Here you can see the difference in the CloudWatch logs: one ECS task starts normally, while another stays in the PENDING state because it cannot GET the same secret.
Successful 1st task:
07:55:26 Key django.secretkey value 55 len
07:55:26 Getting key django.secretkey took 0.28343939781188965 seconds
Task with a delay for the same key (usually the 2nd task):
07:59:14 Key django.secretkey value 55 len
07:59:14 Getting key django.secretkey took 10.96645712852478 seconds
This is the same ECS service, but one task has an operation delay of ~10 sec while the other finishes in under 1 sec. Could there be some limit or timeout related to the number of keys credstash can fetch one after another?
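A minimal sketch of how per-key timings like the ones in the logs above could be collected. The `timed` helper and its `log` parameter are hypothetical names introduced here for illustration; they are not part of credstash, and the original entrypoint.py may measure this differently.

```python
import time
from functools import wraps


def timed(fn, log=print):
    """Wrap fn so every call reports how long it took (hypothetical helper)."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Runs on success and on failure, so slow failing calls are logged too.
            elapsed = time.perf_counter() - start
            key = args[0] if args else "?"
            log("Getting key %s took %s seconds" % (key, elapsed))
    return wrapper


# Assumed usage with credstash (requires credstash and AWS credentials):
#   from credstash import getSecret
#   getSecret = timed(getSecret)
#   getSecret("django.secretkey", region="us-east-1", table="credential-store")
```

Logging the elapsed time per call, rather than for startup as a whole, makes it easier to tell whether a single call stalls or whether every call is slow.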
I checked the request rates of the credstash DynamoDB tables for the times I specified, and there seem to be no throttle events or high peaks in consumed read capacity on the DynamoDB side. The same issue occurs on different DynamoDB tables with different read capacity units, all greater than 180.
Additionally, while we were trying to identify the reason for the delays, we ran into an issue with AWS where too frequent deployments made us hit the API rate limits. I'm not sure if that is useful.
Error
botocore.vendored.requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='169.254.170.2', port=80): Max retries exceeded with url: /v2/credentials/02exxf87-2xx6-4dxx-9cd9-1xxxxfa587xx (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPConnection object at 0x7f55c95b9278>, 'Connection to 169.254.170.2 timed out. (connect timeout=2)'))
That error no longer occurs, but now we're getting the delays described above.
Top GitHub Comments
I would look at how many secrets you have in credstash and what your DDB read throughput is set to. If you’re trying to pull down a ton of items and you have 1 unit of read throughput, it’ll take a while.
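A rough sketch of the check suggested above, assuming `dynamodb` is a client created with `boto3.client("dynamodb")` and that the table uses credstash's default name `credential-store`. The function name `table_read_summary` is made up for this example. Note that DynamoDB's `ItemCount` is only an approximate, periodically refreshed figure, and that credstash stores each version of a secret as its own item.

```python
def table_read_summary(dynamodb, table_name="credential-store"):
    """Return the approximate item count and provisioned read capacity.

    `dynamodb` is assumed to be a boto3 DynamoDB client; it is passed in
    so the function can also be exercised with a stub.
    """
    table = dynamodb.describe_table(TableName=table_name)["Table"]
    throughput = table.get("ProvisionedThroughput", {})
    return {
        "item_count": table.get("ItemCount"),
        "read_capacity_units": throughput.get("ReadCapacityUnits"),
    }


# Assumed usage (requires boto3 and AWS credentials):
#   import boto3
#   print(table_read_summary(boto3.client("dynamodb", region_name="us-east-1")))
```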
On Fri, Sep 29, 2017 at 7:24 AM, Eugene Starchenko <notifications@github.com
Looks like it was solved. It came down to Python using the disk and the burst balance of the EBS volumes causing delays.
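For anyone hitting the same symptom, one way to check for this is to look at the `BurstBalance` metric that CloudWatch publishes for burstable EBS volumes in the `AWS/EBS` namespace. The sketch below assumes `cloudwatch` is a client created with `boto3.client("cloudwatch")`; the function name `min_burst_balance` and the volume ID in the usage comment are hypothetical, and this is not necessarily how the problem was diagnosed in this issue.

```python
from datetime import datetime, timedelta, timezone


def min_burst_balance(cloudwatch, volume_id, hours=3):
    """Return the lowest BurstBalance (percent) seen in the last `hours`.

    `cloudwatch` is assumed to be a boto3 CloudWatch client. Returns None
    when CloudWatch has no datapoints for the volume in that window.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName="BurstBalance",
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets
        Statistics=["Minimum"],
    )
    points = response.get("Datapoints", [])
    if not points:
        return None
    return min(point["Minimum"] for point in points)


# Assumed usage (requires boto3 and AWS credentials; the volume ID is made up):
#   import boto3
#   print(min_burst_balance(boto3.client("cloudwatch"), "vol-0123456789abcdef0"))
```

A value near zero around the time of a slow startup would be consistent with the disk I/O explanation above.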