
Distributed-mode (Discrepancy in total stats) : Locust master fails to acknowledge the "last stats" from slave during end of test

See original GitHub issue

Description:

I am running Locust tests in distributed mode on a Kubernetes cluster, currently with a single master and a single slave. At the end of a test run, the master fails to receive the last stats report from the slave. When the run-time limit is reached, the master prints "Time limit reached. Stopping Locust." and sends a quit signal to the slave(s), then waits only 0.5 seconds for their final reports. This behaviour comes from the following code:

main.py

def spawn_run_time_limit_greenlet():
    logger.info("Run time limit set to %s seconds" % options.run_time)

    def timelimit_stop():
        logger.info("Time limit reached. Stopping Locust.")
        runners.locust_runner.quit()

    gevent.spawn_later(options.run_time, timelimit_stop)

runners.py

def quit(self):
    for client in self.clients.all:
        self.server.send_to_client(Message("quit", None, client.id))
    gevent.sleep(0.5)  # wait for final stats report from all slaves
    self.greenlet.kill(block=True)
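The race introduced by this fixed sleep can be reproduced outside Locust. Below is a minimal sketch using plain threads as a stand-in for gevent greenlets; all names are illustrative, not Locust's actual code:

```python
import threading
import time

def run_test(report_latency, grace_period):
    """Simulate the master's fixed-sleep shutdown against a slow slave."""
    stats = []

    def slave():
        time.sleep(report_latency)  # network delay before the final report
        stats.append("last_stats")

    threading.Thread(target=slave).start()
    time.sleep(grace_period)        # master's fixed wait (0.5 s in Locust)
    return list(stats)              # whatever arrived before shutdown began

print(run_test(0.1, 0.5))  # report beats the grace period -> ['last_stats']
print(run_test(0.5, 0.1))  # report arrives too late       -> []
```

Whenever the report latency exceeds the grace period, the final stats are silently dropped, which is exactly the behaviour observed in the logs below.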

However, in most cases the master's shutdown has already begun by the time the slave sends its report, so the master never receives the final stats. Here is one example of the master and slave logs.

As the master logs below show, the quit signal was sent at 2020-06-13 21:04:13,968; the master then waited exactly 0.5 seconds, and the shutdown process began at 2020-06-13 21:04:14,469.

[2020-06-13 21:04:13,968] locust-master-1-n9qxq/INFO/locust.main: Time limit reached. Stopping Locust.
[2020-06-13 21:04:14,469] locust-master-1-n9qxq/INFO/locust.main: Shutting down (exit code 1), bye.
[2020-06-13 21:04:14,470] locust-master-1-n9qxq/INFO/locust.main: Cleaning up runner…
[2020-06-13 21:04:14,971] locust-master-1-n9qxq/INFO/locust.main: Running teardowns…

As the slave logs below show, the slave received the quit message from the master only at 2020-06-13 21:04:15,727, by which time the master was already shutting down.

[2020-06-13 21:04:15,727] locust-slave-1-jqpq8/INFO/locust.runners: Got quit message from master, shutting down…
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Shutting down (exit code 0), bye.
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Cleaning up runner…
[2020-06-13 21:04:16,632] locust-slave-1-jqpq8/INFO/locust.main: Running teardowns…

As you can see, the master did not wait for any acknowledgement from the slave; it relied solely on a fixed 0.5-second wait, which is not a sound stopping criterion. This results in data loss: the master never receives the slave's final stats, so the aggregate results on the master undercount the total requests actually executed by the slave.

Expected behavior

The master should wait for the slave to send back an acknowledgement (its final stats report) before shutting down, and the hard-coded wait time should be removed.
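A sketch of what such an acknowledgement-based shutdown could look like, using threading primitives as a stand-in for Locust's gevent internals (the class and method names below are hypothetical, not Locust's API):

```python
import threading

class AckingMaster:
    """Sketch: master that waits for every slave's final report before dying."""

    def __init__(self, expected_slaves):
        self.expected = expected_slaves
        self.reported = set()
        self.all_reported = threading.Event()

    def on_last_stats(self, slave_id):
        # Called when a stats message marked as final arrives from a slave.
        self.reported.add(slave_id)
        if len(self.reported) >= self.expected:
            self.all_reported.set()

    def quit(self, timeout=30.0):
        # Block until every slave has acknowledged, or until the timeout.
        # Returns True if all reports arrived, False if we gave up waiting.
        return self.all_reported.wait(timeout)

master = AckingMaster(expected_slaves=2)
master.on_last_stats("slave-1")
master.on_last_stats("slave-2")
print(master.quit(timeout=1.0))  # -> True: all reports received, no data lost
```

A generous timeout remains as a safety net, so that a crashed slave cannot hang the master forever, but in the normal case shutdown completes as soon as the last report arrives rather than after a fixed interval.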

Environment

  • Python version: 3.8
  • Locust version: All the versions
  • Locust command line that you ran:
    Locust master: locust --no-web --expect-slaves=1 -c 10 -r 2 --run-time=10m --csv=<> --logfile=<> -f <some-path>/locustfile.py -H <application-api-url> --master
    Locust slave: locust --no-web -f <some-path>/locustfile.py -H <application-api-url> --slave --master-host=<master-host-point> --master-port=<master port>

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments:5

Top GitHub Comments

1 reaction
cyberw commented, Jun 15, 2020

Correct me if I am wrong, but by single machine do you mean running both master and slave on a single host? In that case there is no issue at all; the request-response between slave and master is quite fast.

Yes, that was what I meant.

I haven’t added anything else apart from 2 API calls in locust file. It looks something like this

Looks like your test is very “normal” (assuming your endpoint calls are just HttpUser requests and “capture result” doesn't actually do anything blocking)

When I added this sleep (quite recently actually, until then we were always dropping the last few requests on the workers 😕 ), I first tried to do it a more “safe” way, but ended up deadlocking somehow.

If you do have >0.5s latency between master and slave (introduced by k8s, underlying hardware, something else) then I’m afraid losing the last results is expected and unlikely to be fixed any time soon - unless you fix it yourself and make a PR 😃

I guess for your purposes you could just fork / monkey patch it to be 5 seconds instead.
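Since the 0.5 s value is a literal inside quit() rather than a setting, the monkey patch would have to replace the method itself at startup. A sketch of the technique on a stand-in class (the real runner class name and import path differ across Locust versions, so this is illustrative only):

```python
class MasterRunner:
    """Stand-in for Locust's master runner class (hypothetical)."""

    def quit(self):
        # The real method sleeps 0.5 s before killing its greenlets;
        # here we simply return the grace period for illustration.
        return 0.5

def patched_quit(self):
    # Same logic, but with a 5 s grace period for high-latency clusters.
    return 5.0

# Apply once at startup (e.g. at the top of the locustfile),
# before any test run begins.
MasterRunner.quit = patched_quit

print(MasterRunner().quit())  # -> 5.0
```

The drawback is that every run then always waits the full 5 seconds, even when the slaves report promptly, which is why a proper acknowledgement mechanism would be preferable upstream.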

1 reaction
cyberw commented, Jun 15, 2020

Hi! Can you do this consistently? What about if you run it on a single machine?

I’ve had no issues getting the last samples since that sleep was introduced. The size of the sleep is designed to account for any latency between master & slave, but nothing more, so if it takes a full second or so before the slaves get the message to shut down then it will not work. Are you doing anything particular in your locustfile that might block the slaves from receiving the message for that long?
