Matrix Downtime 2024 03 28

Hello all, I am writing to let you know about some unexpected downtime that the Matrix server experienced on this date.

At around 02:00 UTC, the stitchy.moe Matrix server experienced an outage in which messages could not be sent or received and state could not be updated; however, the server still reported as functioning (and clients showed it as functioning too).

Exploring the logs led me to a false trail in:

ERROR:  duplicate key value violates unique constraint "userapi_profiles_idx"
DETAIL:  Key (localpart, server_name)=(wz1OxJJrXdVT89IahQUtmHDi42ZnpzOY, stitchy.moe) already exists.
STATEMENT:  INSERT INTO userapi_profiles(localpart, server_name, display_name, avatar_url) VALUES ($1, $2, $3, $4)

This is an error in the Postgres logs complaining about a duplicate key being inserted into the userapi_profiles table. Querying the db, we can see that the localpart (basically the user ID) wz1OxJJrXdVT89IahQUtmHDi42ZnpzOY corresponds to the maucord appservice, which is my mautrix-discord bridge.
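For anyone chasing a similar error, here is a minimal sketch of pulling the offending key out of a Postgres DETAIL line so it can be looked up in the db. The function name parse_duplicate_key is my own invention for illustration, and the parsing assumes the key values contain no commas or parentheses.

```python
import re

# Matches the DETAIL line Postgres emits for a unique-constraint violation,
# e.g. "Key (localpart, server_name)=(abc, example.org) already exists."
DETAIL_RE = re.compile(
    r"Key \((?P<cols>[^)]*)\)=\((?P<vals>[^)]*)\) already exists\."
)


def parse_duplicate_key(detail_line):
    """Return {column: value} for a duplicate-key DETAIL line, or None.

    Hypothetical helper: assumes values contain no commas or parentheses.
    """
    match = DETAIL_RE.search(detail_line)
    if match is None:
        return None
    cols = [c.strip() for c in match.group("cols").split(",")]
    vals = [v.strip() for v in match.group("vals").split(",")]
    if len(cols) != len(vals):
        return None
    return dict(zip(cols, vals))
```

Fed the DETAIL line above, this gives back the localpart and server_name as a dict, which is enough to write the lookup query by hand.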

I tried deleting that table; however, that did not bring back the server. After messing with it for a bit more, I found more errors in the logs:

level=error msg="Failed to acquire database snapshot for sync request" error="context canceled"

This sync error was happening along with many devices failing to request cross-signing keys. From this, I decided to peruse the Dendrite repository and found issue 2911, which seemed to match my problem. Unfortunately, that issue currently has no resolution.
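Since the useful errors were buried in the logs, a small tally of error-level messages would have helped me spot the pattern faster. Below is a rough sketch, assuming the log lines are key=value pairs like the one quoted above; parse_logfmt and count_errors are hypothetical helper names, not anything Dendrite provides.

```python
import re
from collections import Counter

# One key=value pair; the value is either a double-quoted string or a bare token.
PAIR_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Split a key=value log line into a dict, stripping surrounding quotes."""
    fields = {}
    for key, raw in PAIR_RE.findall(line):
        if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
            raw = raw[1:-1]
        fields[key] = raw
    return fields


def count_errors(lines):
    """Count how often each msg appears on lines with level=error."""
    counts = Counter()
    for line in lines:
        fields = parse_logfmt(line)
        if fields.get("level") == "error":
            counts[fields.get("msg", "")] += 1
    return counts
```

Something like count_errors(open("dendrite.log")).most_common(5) would then list the noisiest errors first (the log file name here is a placeholder).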

From browsing the repository more, I found that a number of issues seemed to be fixed by deleting the jetstream data directory. My server is containerized in Docker, so I did just that, which miraculously solved the issue.

You heard it here, folks: rm -rf /var/dendrite/jetstream is what solved this issue :3
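A more cautious variant of the same fix is to move the directory aside rather than delete it, so a copy sticks around in case it is needed later. This is only a sketch: move_aside is a hypothetical helper, and it assumes the server is stopped first and that the path is reachable from wherever the script runs (in a Docker setup that may mean the host side of the volume).

```python
import shutil
import time
from pathlib import Path


def move_aside(directory, suffix=None):
    """Rename `directory` to `<name>.bak-<suffix>` beside it; return the new path.

    Hypothetical helper. The suffix defaults to the current local timestamp.
    """
    src = Path(directory)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    if suffix is None:
        suffix = time.strftime("%Y%m%dT%H%M%S")
    dst = src.with_name(f"{src.name}.bak-{suffix}")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    shutil.move(str(src), str(dst))
    return dst
```

Calling move_aside("/var/dendrite/jetstream") leaves the original path empty for the server to recreate on its next start, and the backup can be removed once everything looks healthy.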

Changes

Due to this incident, I have made a few changes to my setup to hopefully reduce the chance of this happening again. The first one was to disable the mautrix bridge. I don’t currently use it often and I had yet to open it to other people on my server, so this does not affect anyone besides myself.

You might have noticed that I had a message duct-taped to my server during this incident. I believe this was a good step; however, I need a way to add temporary messages that is easier than manually editing the compiled HTML on my server. I do not know what this will look like yet, but it is now on the think-about list.

Lastly, I am going to attempt to build out my Grafana dashboard more so that events like this can be displayed and caught more easily (perhaps a messages-sent graph, if Dendrite lets me). I also want to make some of my Grafana boards public. Once again, there is no timeline on this, but it has been added to the list of stuff to do.