Some Synapse instances have been hammering their database after v1.66.0 -> v1.68.0 update
Description
Some EMS hosted Synapse instances are hammering their database after upgrading from v1.66.0 to v1.68.0. The host we are concentrating on here is ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live (an EMS internal host ID; please check with the EMS team for real hostnames).
The offending query is:
SELECT c.state_key FROM current_state_events as c
/* Get the depth of the event from the events table */
INNER JOIN events AS e USING (event_id)
WHERE c.type = ? AND c.room_id = ? AND membership = ?
/* Sorted by lowest depth first */
ORDER BY e.depth ASC
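To confirm where the cost comes from, the plan for this query can be inspected on an affected host. This is a diagnostic sketch only; the room ID and parameter values below are placeholders, not values taken from the affected hosts:
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.state_key FROM current_state_events as c
/* Same join onto the events table that appears to drive the extra reads */
INNER JOIN events AS e USING (event_id)
WHERE c.type = 'm.room.member' AND c.room_id = '!placeholder:example.org' AND membership = 'join'
ORDER BY e.depth ASC;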
The background update running at the time was `event_push_backfill_thread_id`, if relevant.
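If that turns out to be relevant, the state of the background update can be checked in the database directly. A sketch, assuming the standard Synapse schema where pending updates are tracked in the `background_updates` table:
/* Check whether the background update is still pending and how far it has progressed */
SELECT update_name, progress_json FROM background_updates
WHERE update_name = 'event_push_backfill_thread_id';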
Graphs:
IOPS increased at upgrade. The initial plateau at 4K was due to the database being capped at 4K IOPS; it now has 10K available and has continued to hammer the database consistently for ~7 hours since the upgrade.
Degraded event send times, especially when constrained to 4K IOPS, a limit the host had been running with fine for a long time.
Stateres worst-case seems to track the database usage; presumably just a side effect of a busy DB?
DB usage for background jobs had a rather massive spike for `notify_interested_appservices_ephemeral` right after the upgrade.
Taking that away from the graph, DB usage for background jobs is higher across the board since the upgrade.
DB transactions:
Cache eviction seems to indicate we should raise the `get_local_users_in_room` cache size, as it is being evicted a lot due to size. However, this was the case pre-upgrade as well.
Appservice transactions have not changed by a large factor during this time (3 bridges):
A few other hosts manually found:
- 01bbd800-4670-11e9-8324-b54a9efc8abc-live
- db0718c0-2480-11e9-83c4-ad579ecfcc33-live
Time-of-day changes in traffic have been ruled out; all of these issues started at the upgrade with no other changes to the hosting or deployment stack. There are probably more hosts affected by the DB usage increase.
Also discussed in backend internal.
Steps to reproduce
Upgrade from v1.66.0 to v1.68.0.
Homeserver
ecf6bc70-0bd7-11ec-8fb7-11c2f603e85f-live, 01bbd800-4670-11e9-8324-b54a9efc8abc-live, db0718c0-2480-11e9-83c4-ad579ecfcc33-live
Synapse Version
v1.68.0
Installation Method
Other (please mention below)
Platform
EMS flavour Docker images built from upstream images. Kubernetes cluster.
Relevant log output
-
Anything else that would be useful to know?
No response
Summarizing some more discussion in backend internal:
We don't know exactly which upstream callers account for all of these `get_users_in_room` calls, but everything kinda points to this being appservice related. This also makes sense since one of the affected hosts is Element One, which has a lot of bridged appservice users.
I suspect we have some more `get_users_in_room` mis-uses in the appservice code where we should be using `get_local_users_in_room`, since the appservice only needs to worry about users on its own server. Fixing these mis-uses doesn't fix `get_users_in_room` performance itself, but using `get_local_users_in_room` would be more performant than whatever we end up changing with `get_users_in_room` anyway.
@clokep pointed out a couple of `get_users_in_room` mis-uses already. I'll work on a PR now to change these over ⏩
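For illustration of why the local-only variant is cheaper, here is a rough SQL sketch only, assuming a homeserver named example.org and reusing the `current_state_events` table from the query above (the actual `get_local_users_in_room` implementation may well use a different table): a local-members lookup needs neither the depth ordering nor the join onto `events`.
/* Hypothetical: local joined members only, no ordering and no join onto events */
SELECT c.state_key FROM current_state_events as c
WHERE c.type = 'm.room.member' AND c.room_id = '!placeholder:example.org' AND membership = 'join'
AND c.state_key LIKE '%:example.org';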
I have to say I'm leaning towards backing out the change to `get_users_in_room` entirely. Yes, there are some mis-uses going around, but there are legitimate places where we call `get_users_in_room`, and I don't see why those won't be just as affected by the change.
It's also worth noting that Beeper (cc @Fizzadar) have been trying to rip out all the joins onto the events table for performance reasons.
I'd vote we add a separate `get_ordered_hosts_in_room` that returns the ordered list of hosts, and only use that when we actually care. That would also allow us in future to be more clever with the ordering of hosts (e.g. ordering by the perceived health of the remote, etc.).
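To make that split concrete, a rough sketch only, not the actual Synapse queries: callers that just need the membership set keep a query with no join onto `events`, and only a dedicated ordered variant pays for the depth ordering.
/* Hypothetical unordered variant: membership set only, no events join */
SELECT c.state_key FROM current_state_events as c
WHERE c.type = 'm.room.member' AND c.room_id = ? AND membership = 'join';

/* Hypothetical ordered variant, used only where host ordering actually matters */
SELECT c.state_key FROM current_state_events as c
INNER JOIN events AS e USING (event_id)
WHERE c.type = 'm.room.member' AND c.room_id = ? AND membership = 'join'
ORDER BY e.depth ASC;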