Server Downtime 2024 11 24

3 Dec 2024

Hello again. It is me, Stitchy, here to give some context for the absolute annihilation of my uptime statistics, er, downtime that was recently experienced. As you probably know, I run all of my servers out of my grandmother’s basement. Don’t believe me? Well, you’re wrong.

Anyways, my large server, the one running most of my services, was out for around 10 days while I was attempting to juggle fixing it and school. The gist of things is that no data was lost and that I chose to protect user data at the expense of a bit of downtime.

Context

There was a bout of bad power from the power company (according to the witnesses, the lights were flickering, devices were powering on and off, and it looked like a poltergeist had shown up). During this, one of the hard drives on my server failed, which left the server in a failed state where it was unable to boot due to the missing drive. This was two things:

So yeah, the Linux kernel does not really like booting when devices are missing. It is possible to manually mount the surviving drive and bring it back online with the degraded mount option, but that was not a great solution either. Namely, the server would still have been vulnerable to future power outages, and I would have been left with only a single copy of my data.

Wait, I don’t think I’ve ever shared info on the server this is run on. I will leave the more detailed information for a later article; however, for now, I will say that the main server has a RAID1 array with 2x2TB drives. This means I have a full 2TB of usable space, with every piece of data stored twice (once on each drive).
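For anyone wondering where the 2TB figure comes from, here is a minimal sketch of the arithmetic. It assumes a plain two-way mirror where every drive holds a full copy of the data; the function name `mirror_usable_tb` is made up for illustration and is not part of any tool I run.

```python
def mirror_usable_tb(drive_sizes_tb):
    """Usable capacity of a simple mirror.

    Assumption: every drive holds a full copy of the data,
    so the smallest drive sets the limit.
    """
    if len(drive_sizes_tb) < 2:
        raise ValueError("a mirror needs at least two drives")
    return min(drive_sizes_tb)


drives = [2.0, 2.0]  # the 2x2TB drives mentioned above
raw = sum(drives)
usable = mirror_usable_tb(drives)

# With full copies on every drive, all but one drive can fail.
print(f"raw {raw} TB, usable {usable} TB, survives {len(drives) - 1} drive failure(s)")
```

So half of the raw space goes to redundancy, which is exactly what let the server survive losing a drive.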

Of note though is that I do not have off-site backups for my stuff. This is an oversight and something I would like to fix, but I don’t have other servers with large amounts of storage on them. I think the near-future solution is to save my Docker Compose files and other important configurations to a git repository. This would reduce the damage from actually losing the computer, as I would know how to spin up a bunch of services that have no persistent data (or that have fully replicated data, like Syncthing). For the rest of it, I think I will have to pick and choose the services to fully back up.
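As a rough idea of what the git repository approach could look like, here is a minimal sketch that copies every compose file from a services directory into a separate folder that is tracked by git. The directory layout, the `COMPOSE_NAMES` set, and the `collect_compose_files` helper are all assumptions for illustration; this is not how my server is actually laid out.

```python
import shutil
from pathlib import Path

# Common compose file names; adjust to whatever your services actually use.
COMPOSE_NAMES = {
    "docker-compose.yml",
    "docker-compose.yaml",
    "compose.yml",
    "compose.yaml",
}


def collect_compose_files(services_dir: Path, repo_dir: Path) -> list[Path]:
    """Copy compose files under services_dir into repo_dir.

    Relative paths are preserved, so matrix/docker-compose.yml ends up
    at repo_dir/matrix/docker-compose.yml. Assumes repo_dir lives
    outside services_dir. Data files are deliberately left behind.
    """
    copied = []
    for path in sorted(services_dir.rglob("*")):
        if path.is_file() and path.name in COMPOSE_NAMES:
            target = repo_dir / path.relative_to(services_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(path, target)
            copied.append(target)
    return copied
```

After running something like this, a `git add` and `git commit` inside the repository folder, followed by a push to a remote somewhere else, would get the configuration off-site. It only covers the configuration, not the persistent data, so it is no substitute for real backups of something like Matrix.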

At least for now, Matrix is the most important service I run and I really want to get off-site backups enabled for it, but unfortunately today is not that day.

Back to the main story: I had to get another drive, which is why it took another week for the server to be brought back online. I have replaced the failed drive and rebalanced the array, so everything should be in full working order. The next time I go back to my server, I will do a reboot test and make sure it can come back online. I haven’t done one yet because I was waiting for the rebalance to finish and didn’t have 24 hours to dally around.

While I have your attention, I would like to mention that I will be upgrading the Postgres database for the Matrix server sometime in the next couple of weeks. The downtime should not be long and it should make Dendrite a lot happier (according to some people in the Matrix dev rooms).
