Synapse workers Complement image results in flaky tests due to inconsistent worker process init

The worker-version of Synapse running in Complement (as described here) currently uses Supervisor as an init system to start all worker processes in the container. See the current config template we’re using.

This works, and eventually all processes start up. However, Complement decides that a homeserver is ready for testing once it responds successfully to a GET /_matrix/client/versions call. A worker that has already started can answer this endpoint while other workers are still starting up. This inconsistency can lead to test failures: Complement calls a different endpoint that should be handled by a worker that hasn’t started yet, nginx returns a 502, and the test fails.

The result of this is flaky Complement tests - which nobody wants.

I believe the solution is to start groups of processes in the container through a priority system. The next group should only be started once the previous one has successfully responded to healthchecks (indicating that its processes are ready to receive connections):

  1. Redis
  2. the main Synapse process
    • thus all database migrations are handled before any workers start up.
  3. all worker Synapse processes
  4. nginx

(Note that caddy [just used for custom CA stuff] and Postgres are started before even Supervisor is.) By starting nginx at the very end, which is the reverse proxy that actually routes matrix requests to the appropriate Synapse process, Complement will not receive a successful response to /_matrix/client/versions until everything else has started.
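
To make the proposed ordering concrete, here is a minimal sketch of starting processes in stages and gating each stage on an HTTP healthcheck. It is not what configure_workers_and_start.py does today; the function names and the shape of the `stages` argument are assumptions made up for illustration, and it uses only the Python standard library.

```python
import subprocess
import time
import urllib.error
import urllib.request


def wait_until_healthy(url, timeout=60.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or a non-2xx reply such as a 502; retry
        time.sleep(interval)
    return False


def start_in_stages(stages, timeout=60.0):
    """Start each stage only after the previous stage passed its healthchecks.

    `stages` is a list of stages (hypothetical structure). Each stage is a
    list of (argv, health_url) pairs; health_url may be None for a process
    that has no HTTP endpoint to poll.
    """
    started = []
    for stage in stages:
        # Launch everything in this stage at once...
        for argv, _ in stage:
            started.append(subprocess.Popen(argv))
        # ...then block until every process with a health URL responds.
        for argv, health_url in stage:
            if health_url and not wait_until_healthy(health_url, timeout):
                raise RuntimeError(f"{argv[0]} did not become healthy in time")
    return started
```

With this shape, the four groups above would be four entries in `stages`, with nginx alone in the last one so that nothing is routed until every Synapse process has answered its healthcheck. Processes without an HTTP endpoint (Redis, for instance) would need a different readiness probe than the one sketched here.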

Initially I had hoped to use systemd as an init system to replace Supervisor, but systemd apparently doesn’t work in docker containers. Additionally, we need each process to output its logs to stdout, as otherwise Complement won’t be able to display the homeserver logs after a test failure. systemd would make this a bit tricky as it tries to capture logs. Currently this works by having Supervisor redirect all process logs to stdout, which the ENTRYPOINT of the docker container, configure_workers_and_start.py, then relays.

I don’t believe we want to use synctl here, as the team has been trying to phase that out for a while now. We could simply do all of this via subprocess in configure_workers_and_start.py, but I’m hoping there’s a better, less manual way. Any ideas? I’d also love to be proven wrong as to whether Supervisor actually can do the following:

  • Wait for another process to start up before starting the next.
  • Healthchecks using HTTP.

@richvdh and @erikjohnston also mentioned that Synapse has a way to signal to processes that it’s ready to receive connections (that may potentially be better than just polling the /health endpoint), which may be useful for this discussion. Edit: I’ve just had a look, and it looks like we use a systemd-specific method called sdnotify, which won’t be useful here unfortunately.

Issue Analytics

  • State: closed
  • Created 2 years ago
  • Reactions: 1
  • Comments: 12 (12 by maintainers)

Top GitHub Comments

clokep commented, Sep 7, 2021 (1 reaction)

I attempted to implement what is proposed above in #10705:

Directly start the processes in configure_workers_and_start.py and wait for the READY=1 notification to ensure the process is ready.
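
For readers unfamiliar with the READY=1 mechanism: in the sd_notify protocol the parent passes the path of a Unix datagram socket in the NOTIFY_SOCKET environment variable, and the child sends the text READY=1 to it once it is up. The following is a rough sketch of the parent side only, not the code from #10705; `start_and_wait_ready` is a hypothetical helper, and it assumes the child honours NOTIFY_SOCKET when given a filesystem socket path.

```python
import os
import socket
import subprocess
import tempfile
import time


def start_and_wait_ready(argv, timeout=60.0):
    """Spawn `argv` and block until it sends READY=1 on NOTIFY_SOCKET."""
    sock_dir = tempfile.mkdtemp()
    sock_path = os.path.join(sock_dir, "notify.sock")
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.bind(sock_path)
        # The child learns where to send its notification from this variable.
        env = dict(os.environ, NOTIFY_SOCKET=sock_path)
        proc = subprocess.Popen(argv, env=env)
        deadline = time.monotonic() + timeout
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                proc.terminate()
                raise RuntimeError(f"{argv[0]} never signalled READY=1")
            sock.settimeout(remaining)
            try:
                message = sock.recv(4096)
            except socket.timeout:
                continue  # the loop re-checks the deadline and gives up
            # A notification may carry several newline-separated fields.
            if b"READY=1" in message.splitlines():
                return proc
    finally:
        sock.close()
        if os.path.exists(sock_path):
            os.unlink(sock_path)
        os.rmdir(sock_dir)
```

Calling this for the main process first, and only then for each worker, would give the same ordering that sytest enforces. It does not handle a child that exits before notifying, other than by waiting out the timeout.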

Note that working in these scripts is a bit painful as it seems that not all output actually ends up at stdout/stderr. I think some of this might be related to Python buffering output, but I’m not 100% sure.
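
If Python's output buffering is indeed the cause (which is only a guess here), one way to rule it out is to start each child with PYTHONUNBUFFERED=1 and to flush after every relayed line. A small sketch, where `relay_output` and its `prefix` argument are hypothetical names used for illustration:

```python
import os
import subprocess
import sys


def relay_output(argv, prefix, out=None):
    """Run `argv` and copy its stdout/stderr line by line, flushing each line."""
    out = sys.stdout if out is None else out
    # PYTHONUNBUFFERED disables stdout/stderr buffering in a Python child.
    env = dict(os.environ, PYTHONUNBUFFERED="1")
    proc = subprocess.Popen(
        argv,
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    )
    for line in proc.stdout:
        out.write(f"{prefix} | {line}")
        out.flush()  # do not let the relaying process buffer either
    return proc.wait()
```

This blocks until the child exits, so relaying several long-running processes at once would need one thread per child or an equivalent.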

Anyway – I’m going to be unable to continue researching this due to other responsibilities.

richvdh commented, Aug 25, 2021 (1 reaction)

At that point, I’m kinda led to wonder if supervisord is actually doing much for you - forking processes isn’t particularly hard, maybe it would be easier to manage the whole thing in python yourself (as sytest does, only in perl).

It isn’t too bad, although we seem to depend on supervisor’s “restart until you don’t crash” behaviour to get past the “upgrade your database” error that happens. Any pointers in sytest for where we manage all this state?

Actually, I guess this is the systemd stuff talked about earlier? (Unless there’s more to it?)

Sytest waits for a READY=1 notification from the main process before starting the workers. https://github.com/matrix-org/sytest/blob/develop/lib/SyTest/Homeserver/Synapse.pm#L1096-L1124.
