Distributed-mode (Discrepancy in total stats) : Locust master fails to acknowledge the "last stats" from slave during end of test
Description:
I am running Locust tests in distributed mode in a Kubernetes cluster, currently with a 1 master / 1 slave configuration. At the end of a test, the master fails to pick up the last stats report from the slave. On completion, the master logs "Time limit reached. Stopping Locust." and sends a quit signal to the slave(s), then waits only 0.5 seconds for the final reports from the slaves before shutting down. This is shown in the code below.
main.py

```python
def spawn_run_time_limit_greenlet():
    logger.info("Run time limit set to %s seconds" % options.run_time)
    def timelimit_stop():
        logger.info("Time limit reached. Stopping Locust.")
        runners.locust_runner.quit()
    gevent.spawn_later(options.run_time, timelimit_stop)
```
runners.py

```python
def quit(self):
    for client in self.clients.all:
        self.server.send_to_client(Message("quit", None, client.id))
    gevent.sleep(0.5)  # wait for final stats report from all slaves
    self.greenlet.kill(block=True)
```
However, in most cases, by the time the slave sends its final report, the master's shutdown has already started, so the report is never received. Here is one example from the master and slave logs.
According to the master logs (below), the quit signal was sent at 2020-06-13 21:04:13,968; the master then waited exactly 0.5 seconds and began shutting down at 2020-06-13 21:04:14,469.
```
[2020-06-13 21:04:13,968] locust-master-1-n9qxq/INFO/locust.main: Time limit reached. Stopping Locust.
[2020-06-13 21:04:14,469] locust-master-1-n9qxq/INFO/locust.main: Shutting down (exit code 1), bye.
[2020-06-13 21:04:14,470] locust-master-1-n9qxq/INFO/locust.main: Cleaning up runner…
[2020-06-13 21:04:14,971] locust-master-1-n9qxq/INFO/locust.main: Running teardowns…
```
According to the slave logs (below), it only received the quit message at 2020-06-13 21:04:15,727, by which time the master was already shutting down.
```
[2020-06-13 21:04:15,727] locust-slave-1-jqpq8/INFO/locust.runners: Got quit message from master, shutting down…
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Shutting down (exit code 0), bye.
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Cleaning up runner…
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Running teardowns…
```
As you can see, the master does not wait for any acknowledgement from the slave; it relies only on the fixed 0.5-second wait, which is not a reliable criterion. This results in data loss: the master never receives the slave's final stats, so the aggregated results on the master are lower than the total number of requests actually executed by the slave.
Expected behavior
The master should wait for the slaves to acknowledge the quit signal and send their final stats, rather than relying on a fixed wait time.
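A minimal sketch of what this could look like, based on the pre-1.0 `quit()` shown above. The `stats_reported` counter, the `timeout` parameter, and the polling loop are hypothetical and are not part of Locust's actual code:

```python
# Hypothetical acknowledgement-based quit() (not Locust's actual implementation).
# Assumes the master's message handler increments self.stats_reported each time
# a slave sends its final stats report after receiving "quit". Message and
# gevent are the same names already imported at the top of runners.py.
import time

import gevent

def quit(self, timeout=10):
    self.stats_reported = 0
    for client in self.clients.all:
        self.server.send_to_client(Message("quit", None, client.id))

    # Poll until every slave has reported back (or the timeout expires),
    # instead of sleeping for a fixed 0.5 seconds.
    deadline = time.time() + timeout
    while self.stats_reported < len(self.clients) and time.time() < deadline:
        gevent.sleep(0.1)

    self.greenlet.kill(block=True)
```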
Environment
- Python version: 3.8
- Locust version: All the versions
- Locust command line that you ran:
  - Master: `locust --no-web --expect-slaves=1 -c 10 -r 2 --run-time=10m --csv=<> --logfile=<> -f <some-path>/locustfile.py -H <application-api-url> --master`
  - Slave: `locust --no-web -f <some-path>/locustfile.py -H <application-api-url> --slave --master-host=<master-host-point> --master-port=<master port>`
Top GitHub Comments
Yes, that was what I meant.
Looks like your test is very "normal" (assuming your endpoint calls are just HttpUser requests and "capture result" doesn't actually do anything blocking).
When I added this sleep (quite recently actually; until then we were always dropping the last few requests on the workers 😕), I first tried to do it in a more "safe" way, but ended up deadlocking somehow.
If you do have >0.5s latency between master and slave (introduced by k8s, underlying hardware, something else) then I’m afraid losing the last results is expected and unlikely to be fixed any time soon - unless you fix it yourself and make a PR 😃
I guess for your purposes you could just fork / monkey patch it to be 5 seconds instead.
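For example, a monkey patch along these lines could be placed at the top of the locustfile; this is only an illustration and assumes a pre-1.0 Locust where `MasterLocustRunner` and `Message` are importable from `locust.runners`:

```python
# Illustrative monkey patch only: replace MasterLocustRunner.quit() with a copy
# that waits 5 seconds instead of 0.5 before killing the runner greenlet.
import gevent
from locust.runners import MasterLocustRunner, Message

def patched_quit(self):
    for client in self.clients.all:
        self.server.send_to_client(Message("quit", None, client.id))
    gevent.sleep(5)  # wait longer for the final stats report from all slaves
    self.greenlet.kill(block=True)

MasterLocustRunner.quit = patched_quit
```

Both master and slave processes would import this, but only the master ever calls `quit()`, so the slaves are unaffected.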
Hi! Can you do this consistently? What about if you run it on a single machine?
I’ve had no issues getting the last samples since that sleep was introduced. The size of the sleep is designed to account for any latency between master & slave, but nothing more, so if it takes a full second or so before the slaves get the message to shut down then it will not work. Are you doing anything particular in your locustfile that might block the slaves from receiving the message for that long?
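To illustrate the kind of blocking meant here, a hypothetical pre-1.0 style locustfile where a task does CPU-bound work and therefore never yields to gevent, delaying the worker's handling of the master's quit message (the endpoint, hashing loop, and timings are all made up):

```python
# Hypothetical example of a task that can block a slave's gevent loop.
# CPU-bound work (or blocking calls into C extensions that gevent cannot
# monkey-patch) keeps the greenlet busy, so the slave cannot process the
# master's "quit" message until the work finishes.
import hashlib

from locust import HttpLocust, TaskSet, task

class UserTasks(TaskSet):
    @task
    def call_endpoint(self):
        self.client.get("/api/health")  # made-up endpoint
        # "capture result": hashing a large payload in a tight loop is
        # CPU-bound and never yields control back to gevent.
        data = b"x" * 1_000_000
        for _ in range(200):
            hashlib.sha256(data).hexdigest()

class WebsiteUser(HttpLocust):
    task_set = UserTasks
    min_wait = 1000
    max_wait = 2000
```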