Server Migration 2024-12-21

23 Dec 2024

This is a bit of a post-mortem on the server migration that I am now mostly done with (though I am still solving a couple of straggler issues). The main migration was from my old Debian 11 system to a NixOS machine managed under my NixOS config. I ended up naming it tanzanite, and you can find it in the config.

The Situation

First off, let me give a little context about the server and the situation. The server has 3 drives: two 2TB drives in RAID 1, plus a 256GB SSD for the boot drive and other system state that does not need to be redundant. As I mentioned, the server was running Debian 11, which meant the jump to a modern NixOS version was a massive leap in terms of software; however, because most of my services run in docker containers, I didn't feel like the migration would actually be that difficult.

Repartitioning

As I was installing a new system, I wanted to keep the old one available in case stuff went wrong, so the first thing I did was attempt to shrink the SSD's existing partition and add a btrfs partition that I could use as the new base system drive. This went… really not well. I, the genius that I am, forgot the order of operations for shrinking a filesystem from the command line, and in my infinite wisdom, I didn't look it up either. Well, I'll let the code block show you all you need to know:

 Mon 02:58-tanzanite@nyadmin ~ sudo mount /dev/sdc2 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdc2, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
 Mon 02:58-tanzanite@nyadmin ~

Well, you know exactly what that means :3
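
For reference (and for future me): the safe order is filesystem first, then partition. A minimal sketch of what I should have run, assuming an ext4 filesystem and a 100G target size:

# Shrink the filesystem FIRST, then the partition -- never the other way around.
sudo umount /dev/sdc2
sudo e2fsck -f /dev/sdc2         # resize2fs refuses to shrink without a clean fsck
sudo resize2fs /dev/sdc2 100G    # shrink the filesystem to the target size
# Only now shrink the partition; note resizepart takes the new END position,
# which must leave at least enough room for the filesystem.
sudo parted /dev/sdc resizepart 2 100GB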

Installation

Now that I have obliterated any ability to go back and fix my mistakes, I am ready to actually do this install. Since I had already done a ton of work on my config, it was as easy as: configure the hardware stuffs, pull in nyadmin, and it should justwork™. In fact, the me from the future would agree with this statement, at least for the hardware stuffs and the user config. Wow, it's almost like the stuff I have configured over the past year works really well, am I right? This is the point I got to with the initial commit for tanzanite.

Unfortunately, that is not the only thing needed for a functioning server. In fact, that was the bare minimum amount of config, which puts the NixOS install at about the same difficulty as any other distro. It is definitely still more work than just upgrading Debian would have been; but if there's anything I can say, it's that using NixOS feels a lot less fragile than most other operating systems I have used before. Maybe I am just coping, but I feel like I am a lot less likely to break things when I make changes on Nix than on any other distro.

Enough with the rambling, let's move on. The main things the old Debian system handled directly were managing ACME certs through certbot and running an nginx reverse proxy. Both needed to be re-implemented on the NixOS system; and even more unfortunately, I did not have a working reference config from the old system, because of what happened to it.

NixOS, the goat that it is, is well equipped to deal with these tasks. The inbuilt ACME options let Nix do basically all of the configuration for you. As I have a TURN server and a couple of other services that do not speak HTTP, I elected to set up DNS-based challenges. If you have ever verified a custom domain for something like GitHub Pages, you have seen the same idea; if not, let me explain it real quick. A DNS challenge is where you put a specific token (a piece of text) in a DNS record; a service (like the ACME server) reads that record to verify you control the domain, and then issues you the certificate.
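
Concretely, ACME's DNS-01 challenge wants the token in a TXT record on the _acme-challenge subdomain, something like:

_acme-challenge.stitchy.moe.  300  IN  TXT  "<token handed out by the ACME server>"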

Since my domains are on Porkbun, I used the inbuilt option, grabbed my API keys, and basically had a fully-functional cert-granting machine on my hands.

security.acme = {
  acceptTerms = true;
  defaults.email = "stitchy@stitchy.moe";
  defaults.dnsProvider = "porkbun";
  defaults.environmentFile = "/persist/acme/porkbun.tokens";
  certs = {};
};

Any other certs I need (ones not terminated at the nginx proxy) can be added to the certs option, which will generate the cert files required for TLS. Eventually, the tokens file will be Agenix-encrypted and stored in the nix store instead of in /persist. That would also let me use the tokens on other servers without any manual setup, which will be nice once I start moving more of my servers to nix.
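
For example, a standalone cert for the TURN server would only be a few more lines. This is just a sketch; the subdomain and group name are placeholders, not my actual config:

security.acme.certs."turn.stitchy.moe" = {
  # dnsProvider and environmentFile are inherited from the defaults above;
  # the resulting files land under /var/lib/acme/turn.stitchy.moe/.
  group = "turnserver";  # let the service's user read the key material
};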

In addition to the ACME stuff, I also needed to configure the nginx proxy. I am doing this through NixOS as well, since I want the reproducibility along with the config checking. As with ACME, NixOS has first-class support for configuring nginx. In fact, I was able to get just about everything working without needing extraConfig options, although there were a few that had to be added (more on that later).

services.nginx = {
  enable = true;
  recommendedProxySettings = true;
  recommendedTlsSettings = true;
  virtualHosts = {
    "stitchy.moe" = {
      enableACME = true;
      forceSSL = true;
      root = "/opt/www/stitchy.moe/public";
    };
    "gay.stitchy.moe" = {
      enableACME = true;
      forceSSL = true;
      locations."/" = {
        proxyPass = "http://127.0.0.1:3333";
      };
    };
    "matrix.stitchy.moe" = {
      enableACME = true;
      forceSSL = true;
      locations."/" = {
        proxyPass = "http://127.0.0.1:9008";
      };
    };
    "syncv3.stitchy.moe" = {
      enableACME = true;
      forceSSL = true;
      locations."/" = {
        proxyPass = "http://127.0.0.1:8009";
      };
    };
  };
};

With this setup, I had ACME certs generated for each of my domains (technically, I have a few more that are not yet migrated; but they are also not ones I particularly relied on, so they will wait until I have the desire to work on those services). With this, matrix was just about working, my website was working, and Forgejo was working; well, sorta. Besides the reverse-proxy config, there was a bunch of other migration work that needed to happen under the hood for me to be happy.

One of those things was migrating the docker containers from /opt/data/ to /opt/docker-containers/. I did this migration by hand to make sure everything came over properly and none of the extra directories or data were brought along. This also makes it easy to edit all of the containers with my nvim config, and will eventually let me keep them all in a git repo.

A quick addendum on that last part: I realized that almost every single one of the docker containers has plaintext database passwords and other secrets stored in it. Because of this, I will probably do an initial migration to git where I move all of these keys into an .env file and, at a later date, either integrate everything into NixOS or decide on a different way of doing it.
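
That first pass would look something like this in each compose file (the names here are hypothetical), with the env file itself kept out of git:

# docker-compose.yml -- reference a git-ignored env file instead of inlining secrets
services:
  db:
    image: postgres:17
    env_file:
      - ./secrets.env   # holds POSTGRES_PASSWORD=... and friends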

Services

In this section, I detail a number of services that had interesting stories about how they were restored, or mistakes I made while trying to bring them back up.

Docker

Because of my RAID 1 drive setup, I moved the docker daemon-root to my redundant drives (so my important data would survive even if a drive failed). This actually worked really well (despite the fact that I probably should have just made a docker subvolume and mounted it at the default docker location). You know, now that I've said that, I think I might actually do that once I finish moving all of my containers into the new directory and sorting out the rest of the migration.
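
In NixOS, relocating the daemon-root is just a daemon setting. A minimal sketch of the idea; the path here is an assumption, not necessarily what my config uses:

virtualisation.docker = {
  enable = true;
  daemon.settings = {
    # docker's daemon.json "data-root": keep container state on the RAID 1 pool
    "data-root" = "/opt/docker";
  };
};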

Anyways, my first mistake was eating the partition which had this configuration change on it. The second was not triple-checking where I had pointed the docker data directory. In my first commit here, we see a fool configuring stuff that she has not verified.

In this next commit here, we see the fool after wasting 3 hours trying to figure out why none of the docker data had saved itself. Once that was fixed, all of the old containers came back online. At this point, I shut down the ones that were unnecessary and migrated the rest to the new directory (and spun them back up).

Forgejo

While the Forgejo service was actually super easy to bring online, I feel it is important to mention, both because of how critical it is to my deployment and because it is a good general overview of how the docker-only migrations were supposed to go (and did go, for the most part).

As far as I recall, all it took was moving it from the old directory to the new one and calling docker compose up -d, at which point it was working and hosted. With the nginx config in place, it was back to public and working, so I could push and pull again and all of my changes could start being saved in git.
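
Which is to say, the whole 'migration' for a docker-only service was roughly this (the forgejo directory name is a stand-in):

# Move the compose project to its new home, then bring it back up.
sudo mv /opt/data/forgejo /opt/docker-containers/forgejo
cd /opt/docker-containers/forgejo
sudo docker compose up -d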

Website

The website was another one of the services hosted on the bare-metal machine. It is just a static site made with Hugo, so it's not like it is particularly difficult to deal with. One slight mishap I caused was losing a bunch of large files that I was storing on my website but didn't wanna keep in git. A lot of these were videos I would link for various reasons, and those links no longer work because I corrupted the drive they were on. Having this happen again would be unacceptable, so I moved the site to a btrfs subvolume mounted at /opt/www. With this, my web data is now redundant and I am still not forced to commit large files into git. I will likely make a directory for these files to sit in later and host that at a different location with nginx.
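
Declaratively, that mount is just another fileSystems entry. A sketch, assuming the subvolume was created with btrfs subvolume create and that the RAID 1 pool carries a filesystem label of raid:

fileSystems."/opt/www" = {
  device = "/dev/disk/by-label/raid";  # the redundant btrfs pool (label assumed)
  fsType = "btrfs";
  options = [ "subvol=www" ];
};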

Matrix

Oh dear, here's the big one. It feels like matrix is always causing me problems; well, it kinda is. I mean, I do use it pretty heavily as a chat program, so oh well. It is what I signed up for when I told people besides myself that they could use my server.

Anyways, matrix had like 3 stupid mistakes in it, so I'll start with the first:

CORS

Well, would you look at that. If it isn’t my best friend web problems :)

This is one of the most egregious wastes of my time out of the whole bunch of mistakes I made during this. Really, it is 1 mistake and 3 bugs that cropped up from decisions I made when first setting up matrix. Let's start with the first one.

Firstly, when I was originally setting up my matrix server, I hosted it under the stitchy.moe domain, moving it to matrix.stitchy.moe at a later time. Because of this, I left the stitchy.moe redirect alive on the server: it didn't hurt anything, and it meant that none of my users had to re-login to their accounts. Well, it has been over a year since that change (this is me coping about forgetting that I still had it implemented), and I did not reinstate this fix on the new server to keep old clients working. This meant that one of my clients (and a couple of other people on my server) hit CORS errors that kept them from working. Re-logging into these clients made them use the new domain, and thus they worked again.

What made this extra confusing was that I also had a CORS-related issue on the .well-known files for matrix. These required the 'Access-Control-Allow-Origin' '*' header to be added by nginx for those files. Without it, element.io was not able to read the .well-known and didn't know it was supposed to look at matrix.stitchy.moe; instead, it looked at stitchy.moe and ran into 404 errors because of this.
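
The fix looks roughly like this in the nginx config. This is a sketch of the pattern for the client file; the real setup also needs to cover /.well-known/matrix/server:

services.nginx.virtualHosts."stitchy.moe".locations."/.well-known/matrix/client" = {
  extraConfig = ''
    default_type application/json;
    add_header 'Access-Control-Allow-Origin' '*';
    return 200 '{"m.homeserver": {"base_url": "https://matrix.stitchy.moe"}}';
  '';
};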

Sliding Sync

As part of my matrix deployment, I host the sliding-sync proxy (since I run dendrite, which does not implement sliding sync itself). As this is also hosted in docker, getting it working was very similar to the other matrix services (as in, it was just a few lines of nix for the nginx config). The one issue I had (which cost me about 15 minutes to figure out) was that I used the wrong port in the proxy, as shown in this commit. This is why I should have a standard scheme for assigning ports so I stop duplicating them; well, that is something for another me :)

Room Joining Issue

One last issue I had with the matrix server was that I could not join rooms on another matrix server. The server in question was the conduwuit server robert.zip, which was spun up the day after my server migration. I tested joining a room on the matrix.org homeserver, which worked just fine; therefore, the only join path that was not working was stitchy.moe -> robert.zip. robert.zip was able to join rooms on my server. After looking at the logs with Robert, we did not have a good idea of why this was happening, so it remains an open issue that I would like to resolve in the future. One thing worth mentioning is that robert.zip users were able to join a room hosted by matrix.org, so it might be a one-off incompatibility between dendrite and conduwuit.

Postgres Migration

While I was migrating the whole server to NixOS, I also migrated the postgres database from version 15 to version 17. This sped up the matrix server a noticeable amount: there are now very few times where my clients disconnect while the server is running properly. I will need to keep monitoring to make sure it stays that way, but from what I hear in the dendrite matrix room, this upgrade is helping other people quite a bit.

The actual migration was not that difficult. I followed this tutorial by Thomas Bandt, which worked perfectly. The gist of it: bring down the service, dump the database, create a new volume and add it to the container, upgrade the container, and restore the database from the dump. With this major upgrade done and dendrite bumped from 0.13.8 -> 0.14.0, I should be picking up a lot of the improvements made recently.
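
Sketched out with hypothetical container and user names, the dance looks like this:

# Dump everything from the running postgres 15 container.
docker compose exec -T postgres pg_dumpall -U postgres > dump.sql
# Stop the stack, point the compose file at postgres:17 and a NEW volume,
# then bring just the database back up.
docker compose down
docker compose up -d postgres
# Restore the dump into the fresh postgres 17 cluster.
docker compose exec -T postgres psql -U postgres < dump.sql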

Duplicate Messages

Bumping the dendrite version in the last section introduced a couple of new bugs that are still bothering the users of my matrix server. One of them is a duplicate-messages bug whereby the server will 'freeze' for a time, and afterwards duplicate messages go through. This is weird because messages can still be sent in other rooms; it only impacts one room at a time. It tends to happen more frequently when multiple messages are sent in quick succession. One of the devs in the matrix chat commented that they could reproduce this issue, so it will hopefully be solved in the near future and messages will work properly again.

Presence Issues

I have also been informed of issues with presence showing people as online when they are not. I have not looked into this much, and I have not seen any mention of it in the matrix dev rooms either. It might just be an issue with one of my many matrix clients. I intend to look into it more later.

Conclusion

Wow, that was quite a lot of work. There are many services I didn't explicitly mention that were migrated (syncthing, my unifi container, multiple game servers, coturn, etc.). These were all either trivial because they run in docker containers, or are still down while I decide if I want to continue hosting them. I also need to finish migrating my Grafana service, though I am considering moving that one to a bare-metal implementation. On the topic of bare-metal migrations, I think coturn will be the first service migrated like this, as it already has host network access and doesn't fit very well into a docker container. Moving some of these to nix will be good, as it lets more of the services be managed in a way that gives automatic updates and automatic configuration upgrades.

One thing I think is interesting to show is the git diff. This was taken on my laptop after I pulled down all the changes from git.

 flake.nix                   |  18 +++++++++++++++
 hosts/default.nix           |   6 +++++
 hosts/gemini/default.nix    |   6 -----
 hosts/tanzanite/default.nix | 157 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 packages/default.nix        |   3 +++
 users/stitchynyan/home.nix  |   3 ---
 6 files changed, 184 insertions(+), 9 deletions(-)

As you can see, it came to about 157 lines of nix in the main tanzanite file. That is honestly not too bad, considering a not-insignificant number of those lines declare the drive configuration itself. Of course, this git diff does not count the work done on the actual hardware or all of the service changes that are not part of NixOS itself. In the future, I would like to move as much of that as possible either to a git repo specifically for the docker containers on my server, or into NixOS itself in specific cases. For example, I think coturn would be a good service to move to plain NixOS because it has no persistent data and just needs to be configured properly. Agenix is a prerequisite for this, however, as coturn needs secret keys and Agenix will let me store them in a good way.
