
Credstash operation delay on ECS

See original GitHub issue

Good day. We are using credstash==1.13.2 in our apps to fetch secrets from DynamoDB with KMS keys. During the start of our service (i.e., when its 2 tasks are launched), we grab the app secrets in our entrypoint.py:

from credstash import getSecret

getSecret(key, region=self.__region, table=self.__table,
          context=self.__context)
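For reference, a self-contained sketch of the same call pattern, with the timing that produces log lines like the ones below; the region, table name, and encryption context here are placeholders, not the reporter's actual values:

import time

from credstash import getSecret

# Hypothetical values for illustration only; the real region, table and
# encryption context come from the service configuration.
REGION = "us-east-1"
TABLE = "credential-store"
CONTEXT = {"app": "django"}

def fetch_secret(key):
    start = time.time()
    value = getSecret(key, region=REGION, table=TABLE, context=CONTEXT)
    print("Getting key {} took {} seconds".format(key, time.time() - start))
    return value

secret = fetch_secret("django.secretkey")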

During the start of the app we see delays that we cannot explain, because they occur intermittently. It can be the first deployment, a new version deployment, or one of the containers can start normally while the other hangs on the credstash delay. We don't want to simply increase the health check timeout for the app, because we don't understand the nature of this issue and therefore cannot pick a correct amount of time for the app to start without failing due to the delay; each delay is different. Here you can see the difference in the CloudWatch logs: one ECS task starts normally, while another stays in the PENDING state because it cannot GET the same secret.

Successful 1st task:
07:55:26 Key django.secretkey value 55 len
07:55:26 Getting key django.secretkey took 0.28343939781188965 seconds

Task with the delay for the same key (usually the 2nd task):
07:59:14 Key django.secretkey value 55 len
07:59:14 Getting key django.secretkey took 10.96645712852478 seconds

Same ECS service, but one of the tasks has an operation delay of ~10 seconds while the other did it in ~1 second. Maybe there is some limit or timeout on the number of keys that credstash can fetch one after another?

I checked the credstash DynamoDB tables' request rates for the times I specified, and it seems there are no throttle events or high consumed-read peaks on the DynamoDB side. The same issue occurs on different DynamoDB tables, all with read capacity units greater than 180.
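For reference, a minimal sketch of the kind of throttling check described above, using boto3 and CloudWatch metrics; the table name, region, and time window are placeholders:

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical table name and window; ReadThrottleEvents should stay at 0
# if the table is not being throttled (ConsumedReadCapacityUnits can be
# queried the same way to look for read peaks).
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DynamoDB",
    MetricName="ReadThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "credential-store"}],
    StartTime=datetime.datetime(2017, 9, 29, 7, 0),
    EndTime=datetime.datetime(2017, 9, 29, 9, 0),
    Period=300,
    Statistics=["Sum"],
)
print(resp["Datapoints"])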

Additionally, we had an issue with AWS where we hit API rate limits with too-frequent deployments while trying to identify the reason for the delays. I'm not sure if that is useful.

Error 
botocore.vendored.requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='169.254.170.2', port=80): Max retries exceeded with url: /v2/credentials/02exxf87-2xx6-4dxx-9cd9-1xxxxfa587xx (Caused by ConnectTimeoutError(<botocore.awsrequest.AWSHTTPConnection object at 0x7f55c95b9278>, 'Connection to 169.254.170.2 timed out. (connect timeout=2)'))

But those days have passed, and now we're getting delays.
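The timeout above points at the ECS task credentials endpoint (169.254.170.2), which botocore queries through the path in AWS_CONTAINER_CREDENTIALS_RELATIVE_URI. A minimal, purely illustrative probe of that endpoint's latency from inside a container, assuming the requests library is available:

import os
import time

import requests

# The ECS agent injects this variable into every task that uses a task role.
uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]

start = time.time()
resp = requests.get("http://169.254.170.2" + uri, timeout=5)
print("Credentials endpoint answered {} in {:.3f} seconds".format(
    resp.status_code, time.time() - start))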

Issue Analytics

  • State: closed
  • Created 6 years ago
  • Comments: 5 (2 by maintainers)

Top GitHub Comments

1 reaction
alex-luminal commented, Sep 29, 2017

I would look at how many secrets you have in credstash and what your DDB read throughput is set to. If you're trying to pull down a ton of items and you have 1 unit of read throughput, it'll take a while.
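For what it's worth, a minimal sketch of how both of those numbers could be checked with boto3; the table name here is just the credstash default, used as a placeholder:

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Hypothetical table name; credstash's default table is "credential-store".
table = dynamodb.describe_table(TableName="credential-store")["Table"]
print("Items (count is updated roughly every six hours):", table["ItemCount"])
print("Provisioned read capacity units:",
      table["ProvisionedThroughput"]["ReadCapacityUnits"])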

On Fri, Sep 29, 2017 at 7:24 AM, Eugene Starchenko <notifications@github.com> wrote:

I implemented DEBUG logging in the code and here is what I've got.

First, the GOOD ECS task: [image: good] https://user-images.githubusercontent.com/17835122/31012995-b25b6168-a51c-11e7-8962-9964c60fdbfb.jpg

Second, the BAD ECS task (the delay always starts from the first secret on the second ECS task during deployment): [image: bad] https://user-images.githubusercontent.com/17835122/31012997-b54c8b90-a51c-11e7-94c6-fbed68f1d4c8.jpg

Next you can see the timeline of the delays that occurred; it looks like boto spends too much time on connections. I'm not sure whether the issue is in boto, the AWS endpoints, or both. If you compare the timing of the two containers, the same actions are not nearly as time-consuming on the first one.

Delays on the events below, on the second task:

09:40:02 INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTP connection (1): 169.254.170.2
09:40:02 DEBUG:botocore.vendored.requests.packages.urllib3.connectionpool:"GET /v2/credentials/XXXXXXX-7091-XXXX-XXXX-c290XXXXbf4c HTTP/1.1" 200 879
09:40:05 DEBUG:botocore.loaders:Loading JSON file: /usr/local/lib/python3.6/dist-packages/botocore/data/endpoints.json
09:40:05 DEBUG:botocore.loaders:Loading JSON file: /usr/local/lib/python3.6/dist-packages/botocore/data/dynamodb/2012-08-10/service-2.json
09:40:05 DEBUG:botocore.loaders:Loading JSON file: /usr/local/lib/python3.6/dist-packages/botocore/data/_retry.json
09:40:05 INFO:botocore.vendored.requests.packages.urllib3.connectionpool:Starting new HTTPS connection (1): dynamodb.us-east-1.amazonaws.com
09:40:07 DEBUG:botocore.vendored.requests.packages.urllib3.connectionpool:"POST / HTTP/1.1" 200 609
09:40:07 DEBUG:botocore.hooks:Event needs-retry.kms.Decrypt: calling handler <botocore.retryhandler.RetryHandler object at XXXXX487f2320>
09:40:07 DEBUG:botocore.retryhandler:crc32 check skipped, the x-amz-crc32 header is not in the http response.
09:40:07 DEBUG:botocore.retryhandler:No retry needed.
09:40:13 WARNING:main:Key test.secretkey value 48 len
09:40:13 WARNING:main:Getting key test.secretkey took 25.796634197235107 seconds
09:40:13 DEBUG:botocore.client:Registering retry handlers for service: dynamodb
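For reference, botocore DEBUG output like the above is typically enabled with the standard library logging module; a minimal sketch, with an illustrative format string:

import logging

# Turn on DEBUG for everything, including botocore's connection pool,
# loaders, and retry handlers, so their timing shows up in the logs.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s:%(name)s:%(message)s",
)
logging.getLogger("botocore").setLevel(logging.DEBUG)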

Does anyone know where to dig in?

View it on GitHub: https://github.com/fugue/credstash/issues/168#issuecomment-333101390

0 reactions
eugenestarchenko commented, Nov 29, 2017

Looks like it was solved. It came down to Python using the disk, and the burst balance of the EBS volumes causing the delays.
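A minimal sketch of how that EBS burst balance could be confirmed in CloudWatch; the volume ID, region, and time window are placeholders:

import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Hypothetical volume ID and window; BurstBalance falls toward 0 when a
# gp2 volume has exhausted its I/O credits.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    StartTime=datetime.datetime(2017, 11, 28),
    EndTime=datetime.datetime(2017, 11, 29),
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])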
