Distributed test stopped despite workers running
Describe the bug
When executing a distributed load test, a worker node might not send its heartbeat back in time (the heartbeat timing is no longer configurable) because of CPU- and/or I/O-intensive tasks. When that happens, the whole test can be stopped even though the workers are fine and merely busy.
Expected behavior
The test continues to run.
Actual behavior
All workers are stopped by the master after the following messages:
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker ffaeb7471fb6_898127e830cc4c7487b6674f88b045fc failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker da57de88394e_76e79054084547768aa00e0adba033bf failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker d3c53c424e43_1644c39706a44f118090761360c76fe1 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,241] 7c22a81c40a0/INFO/locust.runners: Worker 95ac4f8adc8a_ce1bb0094a494d7f8a0540ebab54e105 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,242] 7c22a81c40a0/INFO/locust.runners: Worker c08fe40ccea4_b63905214f7846cea3f18cf529cb8767 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,242] 7c22a81c40a0/INFO/locust.runners: Worker 8784191196a2_9a136837b45d468cb46b0edaa3c3697c failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,243] 7c22a81c40a0/INFO/locust.runners: Worker a5c9ae640c92_1f323758e42c4b609fa5a050c28bac50 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: Worker 3f2bf4b8fc3f_90f59e8d0efb4e6fb424fb2e95c8c50a failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: The last worker went missing, stopping test.
After logging some more internals, it became evident that the calculation is simply wrong:
... The last worker went missing, stopping test (workers: 15, missing: 15).
…where:
workers = self.worker_count (despite actually running 30 workers in my case)
missing = len(self.clients.missing)
…however, self.worker_count doesn’t even include missing clients, so the condition subtracts the missing workers from a count that already excludes them:
if self.worker_count - len(self.clients.missing) <= 0:
So, either self.worker_count needs to include missing clients, or the condition should be changed to this instead:
if self.worker_count <= 0:
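The arithmetic can be checked with a small stand-alone sketch. This is not Locust's code: WorkerPool, should_stop_reported, and should_stop_proposed are hypothetical names, and the only assumption carried over from the report is that worker_count excludes missing workers. The numbers match the logged case (30 workers started, 15 flagged missing).

```python
from dataclasses import dataclass


@dataclass
class WorkerPool:
    """Toy stand-in for the master's view of its workers (hypothetical, not Locust's class)."""

    active: int   # workers still counted by the master
    missing: int  # workers whose heartbeat timed out

    @property
    def worker_count(self) -> int:
        # Assumption taken from the report: missing workers are NOT counted here.
        return self.active


def should_stop_reported(pool: WorkerPool) -> bool:
    # Condition quoted in the report: subtracts missing workers a second time.
    return pool.worker_count - pool.missing <= 0


def should_stop_proposed(pool: WorkerPool) -> bool:
    # Condition proposed in the report: stop only when no counted worker is left.
    return pool.worker_count <= 0


half_busy = WorkerPool(active=15, missing=15)  # the logged situation
all_gone = WorkerPool(active=0, missing=30)    # every worker really is missing

print(should_stop_reported(half_busy))  # stops although 15 workers remain
print(should_stop_proposed(half_busy))  # keeps running
print(should_stop_proposed(all_gone))   # stops, as intended
```

With the reported condition the test stops as soon as half of the workers are flagged missing; with the proposed one it stops only once no counted worker remains.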
Steps to reproduce
Create a load test with a CPU-intensive task that runs for more than 3 seconds on each worker.
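A minimal sketch of such a CPU-bound workload, using only the standard library. burn_cpu is a hypothetical helper, not part of Locust; the idea is that calling it from a task body with a duration above 3 seconds keeps the worker process busy without ever sleeping or yielding. Whether that is enough to delay the heartbeat depends on how the worker schedules it, which is assumed here rather than verified.

```python
import time


def burn_cpu(seconds: float) -> int:
    """Spin in a tight loop for roughly `seconds` without sleeping or yielding."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # pure CPU work, no I/O
    return iterations


if __name__ == "__main__":
    # A short duration for a quick local check; the report uses more than 3 seconds.
    print(burn_cpu(0.1) > 0)
```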
Environment
- OS: Ubuntu 20.04 LTS
- Python version: 3.8
- Locust version: 1.4.3
- Locust command line that you ran: docker-compose up --scale worker=30 (see docker-compose file below)
- Locust file contents (anonymized if necessary): the one that I have is too complex for this
Docker compose file:
version: "3.7"

x-base-service: &base_service
  image: "locustio/locust:latest"
  restart: "no"
  volumes:
    - ./:/mnt/tests:ro
  working_dir: "/mnt/tests"

services:
  locust-master:
    <<: *base_service
    container_name: locust-master
    command: [
      "--master",
      "--headless",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--run-time", "${RUN_TIME:-5m}",
      "--stop-timeout", "${STOP_TIMEOUT:-60}",
      "--expect-workers", "${LOCUST_WORKERS:-1}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]
  worker:
    <<: *base_service
    command: [
      "--worker",
      "--master-host", "locust-master",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]
.env file for the specific run:
LOCUST_FILE=<redacted>
LOCUST_WORKERS=30
NUM_USERS=240
SPAWN_RATE=30
RUN_TIME=15m
STOP_TIMEOUT=60
LOCUST_TARGET=<redacted>
Top GitHub Comments
Yes, I have opened this: https://github.com/locustio/locust/issues/1843
Sure, I’ll add a unit test. Didn’t have the time to look into the test setup itself, yet.