Tasking Manager Load Testing
I load tested the Tasking Manager last week with locustio, and the findings have been pretty interesting.
Locust?
Locust works by simulating requests to specific endpoints and weighting the requests per endpoint. The endpoints are defined in a file, locustfile.py, and requests are sent to each of these endpoints.
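For illustration, here is a minimal locustfile sketch using the pre-1.0 locustio API that was current at the time; the endpoint paths, task weights, and host name are hypothetical placeholders, not the ones used in these tests.

```python
# locustfile.py - a minimal sketch using the pre-1.0 locustio API
# (newer Locust releases use HttpUser / wait_time instead of HttpLocust / min_wait).
# The endpoint paths and task weights below are hypothetical placeholders.
from locust import HttpLocust, TaskSet, task


class TaskingManagerTasks(TaskSet):
    @task(10)                  # weight: hit this endpoint 10x as often as home()
    def search_projects(self):
        self.client.get("/api/v1/project/search")

    @task(1)
    def home(self):
        self.client.get("/")


class TaskingManagerUser(HttpLocust):
    task_set = TaskingManagerTasks
    min_wait = 1000            # wait 1-5 s between tasks (milliseconds)
    max_wait = 5000

# Example run with the settings below (old locustio CLI flags; check
# `locust --help` for the flags of your installed version):
#   locust -f locustfile.py --host=https://tasks.example.org --no-web -c 500 -r 50
```

The weight argument to @task is what "weighting requests by endpoint" refers to above.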
Load Test Simulation Settings
- Number of users: 500
- Hatch rate (Number of users added/s): 50/s
Load Tests
Load Test 1
Initially, I tested on a single EC2 instance, a c3.2xlarge, which has 8 CPU cores and 15 GB of RAM. Snippets from the results are as follows:
Load Test 2
Next, I increased the number of instances to 4, so requests are distributed equally among all instances. I also got rid of the requests to the root endpoint (/). Snippets from the results are as follows:
Next actions
- The CPU utilization across instances is really low - it averages around 2-10%, even though I am running gunicorn with (cores * 2) + 1 workers. I also ran htop on these instances and saw that not all of the CPU cores are being used. This needs to be investigated - I am not sure if there is an option in gunicorn to spread workers across multiple processes. Testing async workers and optimising gunicorn may help with some of these issues (see the sketch after this list).
- When the request count dips and then increases sharply, it is accompanied by a massive spike in latency (as you can see in load test 1, where the latency increases to 720s at that yellow peak) - why does this happen? I think fixing 1 can help fix this issue as well.
- All the failures are a result of gateway timeouts - these occur when requests take longer than the ELB timeout period.
- Running tests on eu-west-1 vs us-east-1.
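As a starting point for the gunicorn investigation, below is a sketch of a gunicorn.conf.py with the (cores * 2) + 1 worker count and an async worker class; the bind address, timeout, and worker class are assumptions, not the project's actual configuration.

```python
# gunicorn.conf.py - a sketch only; the bind address, timeout and worker
# class are assumptions, not the Tasking Manager's actual settings.
import multiprocessing

# (cores * 2) + 1 workers, as used in the load tests above. Each gunicorn
# worker is a separate OS process, so with enough concurrent requests the
# load should spread across the available CPU cores.
workers = multiprocessing.cpu_count() * 2 + 1

# One option to test: gevent async workers (requires the gevent package),
# so a worker is not blocked while waiting on the database or network.
worker_class = "gevent"

bind = "0.0.0.0:8000"
timeout = 60   # worker timeout in seconds (gunicorn's default is 30)
```

This would be launched with something like gunicorn -c gunicorn.conf.py app:application, where the module:callable name is a placeholder for the actual WSGI entry point.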
Locust
- I need to figure out how to collect Locust stats across a time range and write them to a file (a sketch follows this list).
- A single Locust instance running on localhost is sufficient to generate load at our scale, but the results are also likely influenced by local network conditions and bandwidth. I am going to set this up on an EC2 instance once I figure out a way to get the output stats.
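One possible approach for the stats file, sketched under the assumption that the pre-1.0 locustio event hooks (events.request_success) are available; newer Locust versions expose a different events API and also have a --csv command-line option that writes stats files directly.

```python
# stats_logger.py - a sketch, assuming the pre-1.0 locustio event-hook API.
# Import this from (or paste it into) the locustfile so the hook is registered.
import csv
import time

from locust import events

_stats_file = open("request_stats.csv", "w", newline="")
_writer = csv.writer(_stats_file)
_writer.writerow(["timestamp", "method", "name", "response_time_ms", "response_length"])


def _on_request_success(request_type, name, response_time, response_length, **kwargs):
    # Record one row per successful request, with a wall-clock timestamp so the
    # stats can later be sliced by time range.
    _writer.writerow([time.time(), request_type, name, response_time, response_length])
    _stats_file.flush()


events.request_success += _on_request_success
```

Flushing on every request keeps the sketch simple but adds overhead; buffering rows and writing them when the test quits would be lighter at high request rates.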
cc/ @hotosm/tech
Top GitHub Comments
Overall, I feel really good about our new stack, and I think we are ready to move ahead based on the load tests. The latency looks sharp, and we are able to serve about 20-25 users per EC2 instance we are using, which should be a good indicator of how to scale up our stacks during peak traffic.
@willemarcel the latency we saw during your load tests was because we were running fewer instances on the CloudFormation stack vs the production stack, which we have fixed now. Thanks for identifying that issue. 🙇‍♀️
No next actions. Closing here!