Distributed test stopped despite workers running

Describe the bug

When running a distributed load test in which worker nodes might not send their heartbeat back in time (this is no longer configurable) because of CPU- and/or I/O-intensive tasks, the whole test can be stopped even though the workers are fine and just busy.

Expected behavior

The test continues to run.

Actual behavior

The master stops all workers after logging the following messages:

[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker ffaeb7471fb6_898127e830cc4c7487b6674f88b045fc failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker da57de88394e_76e79054084547768aa00e0adba033bf failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:33,241] 7c22a81c40a0/INFO/locust.runners: Worker d3c53c424e43_1644c39706a44f118090761360c76fe1 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,241] 7c22a81c40a0/INFO/locust.runners: Worker 95ac4f8adc8a_ce1bb0094a494d7f8a0540ebab54e105 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:34,242] 7c22a81c40a0/INFO/locust.runners: Worker c08fe40ccea4_b63905214f7846cea3f18cf529cb8767 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,242] 7c22a81c40a0/INFO/locust.runners: Worker 8784191196a2_9a136837b45d468cb46b0edaa3c3697c failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:35,243] 7c22a81c40a0/INFO/locust.runners: Worker a5c9ae640c92_1f323758e42c4b609fa5a050c28bac50 failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: Worker 3f2bf4b8fc3f_90f59e8d0efb4e6fb424fb2e95c8c50a failed to send heartbeat, setting state to missing.
[2021-02-18 10:58:36,243] 7c22a81c40a0/INFO/locust.runners: The last worker went missing, stopping test.

After logging some more internals, it became evident that the calculation is simply wrong:

... The last worker went missing, stopping test (workers: 15, missing: 15).

…where:

  • workers = self.worker_count (despite actually running 30 workers in my case)
  • missing = len(self.clients.missing)

However, self.worker_count does not include missing clients in the first place, so the condition subtracts them a second time and fires as soon as half of the workers are missing:

if self.worker_count - len(self.clients.missing) <= 0:

So either self.worker_count needs to include missing clients, or the condition should be changed to this instead:

if self.worker_count <= 0:
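
The difference between the two conditions above can be illustrated with a small standalone sketch. This is not Locust's actual code: the function names are hypothetical, and it assumes, as described in this report, that worker_count counts only workers that are not missing.

```python
# Standalone sketch, not Locust's actual code. It assumes worker_count
# counts only workers that are not missing, as described in the report.
# Both function names are hypothetical.


def should_stop_reported(worker_count, missing_count):
    """Condition as quoted in the report: missing workers are subtracted again."""
    return worker_count - missing_count <= 0


def should_stop_proposed(worker_count):
    """Proposed condition: stop only when no non-missing worker is left."""
    return worker_count <= 0


total_workers = 30
for missing in (0, 14, 15, 30):
    worker_count = total_workers - missing  # already excludes missing workers
    print(
        f"missing={missing:2d} worker_count={worker_count:2d} "
        f"reported={should_stop_reported(worker_count, missing)} "
        f"proposed={should_stop_proposed(worker_count)}"
    )
```

With 30 workers, the reported condition fires once 15 of them are missing, which matches the "(workers: 15, missing: 15)" log line, while the proposed condition fires only when all 30 are missing.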

Steps to reproduce

Create a load test with a CPU-intensive task that runs for more than 3 seconds on each worker.

Environment

  • OS: Ubuntu 20.04 LTS
  • Python version: 3.8
  • Locust version: 1.4.3
  • Locust command line that you ran: docker-compose up --scale worker=30 (see docker-compose file below)
  • Locust file contents (anonymized if necessary): the one that I have is too complex for this

Docker compose file:

version: "3.7"

x-base-service: &base_service
  image: "locustio/locust:latest"
  restart: "no"
  volumes:
    - ./:/mnt/tests:ro
  working_dir: "/mnt/tests"

services:
  locust-master:
    <<: *base_service
    container_name: locust-master
    command: [
      "--master",
      "--headless",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--run-time", "${RUN_TIME:-5m}",
      "--stop-timeout", "${STOP_TIMEOUT:-60}",
      "--expect-workers", "${LOCUST_WORKERS:-1}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]

  worker:
    <<: *base_service
    command: [
      "--worker",
      "--master-host", "locust-master",
      "--locustfile", "/mnt/tests/${LOCUST_FILE:?Locustfile not specified}",
      "--users", "${NUM_USERS:-10}",
      "--spawn-rate", "${SPAWN_RATE:-7}",
      "--host", "${LOCUST_TARGET:?No test target host specified}"
    ]

.env file for the specific run:

LOCUST_FILE=<redacted>
LOCUST_WORKERS=30
NUM_USERS=240
SPAWN_RATE=30
RUN_TIME=15m
STOP_TIMEOUT=60
LOCUST_TARGET=<redacted>

Issue Analytics

  • State: closed
  • Created 3 years ago
  • Comments: 14 (4 by maintainers)

Top GitHub Comments

roquemoyano-tc commented, Aug 11, 2021 (1 reaction)
enote-kane commented, Feb 22, 2021 (1 reaction)

Sure, I’ll add a unit test. Didn’t have the time to look into the test setup itself, yet.
