Servers - Shotbow Devblog: When scheduled maintenance becomes unscheduled

Discussion in 'Announcements' started by McJeffr, Nov 9, 2024.

  1. McJeffr When in Rogue Lead, Developer

    XP:
    226,286xp
    [IMG]

    As some of you might have noticed, Shotbow suffered from an outage at the end of October 2024. This fairly lengthy and technical devblog will go into the issue that occurred, the timelines of the incident, and what steps we will take to prevent similar incidents from occurring in the future.

    Scheduled maintenance

    On Friday 18 October, at 1:37AM CEST, our automated alerting fired off a warning with critical severity: "RAID array 'md2' on us01 is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically." This alert was noticed by a developer, who promptly alerted the other developers in the main developer channel. At this stage, the network continued to operate normally.

    [IMG]
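
    For the curious: a check like this can be driven by something as simple as reading /proc/mdstat, where the Linux software RAID driver reports each array's member status. The Python sketch below is purely illustrative and is not our actual alerting code; only the 'md2' name comes from the alert above.

    [CODE]
    import re

    def degraded_arrays(mdstat_path="/proc/mdstat"):
        """Return names of md arrays whose member status shows a missing disk ('_')."""
        degraded, current = [], None
        with open(mdstat_path) as f:
            for line in f:
                name = re.match(r"^(md\d+)\s*:", line)
                if name:
                    current = name.group(1)
                    continue
                # A healthy two-disk mirror shows "[2/2] [UU]"; a degraded one "[2/1] [U_]".
                status = re.search(r"\[([U_]+)\]\s*$", line)
                if current and status and "_" in status.group(1):
                    degraded.append(current)
        return degraded

    if __name__ == "__main__":
        bad = degraded_arrays()
        print(f"CRITICAL: degraded array(s): {', '.join(bad)}" if bad else "OK: all arrays healthy")
    [/CODE]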

    As highlighted in a previous devblog by Lazertester titled "Minecraft on Kubernetes", Shotbow runs its servers and other infrastructure on Kubernetes. Kubernetes is a scalable system that manages and automates the operation of services, packaged as containers, across multiple physical boxes. Shotbow has several physical servers, us01 being one of them. us01 is an important node, as it hosts the Shotbow website (which was not fully containerized when we embraced Kubernetes). It also acts as a control node in our Kubernetes cluster, which means that it is responsible for the "daily operations" of the entire Kubernetes cluster. Shotbow had two of these control nodes.
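
    To make the control node / worker split a bit more concrete, here is a small sketch (not our actual tooling) using the official Kubernetes Python client to list the nodes in a cluster along with their roles. It assumes a reachable kubeconfig.

    [CODE]
    from kubernetes import client, config

    config.load_kube_config()  # reads ~/.kube/config; use load_incluster_config() inside a pod
    v1 = client.CoreV1Api()

    for node in v1.list_node().items:
        labels = node.metadata.labels or {}
        # Control nodes carry a node-role label (older clusters use ".../master" instead).
        is_control = any(
            key in labels
            for key in ("node-role.kubernetes.io/control-plane", "node-role.kubernetes.io/master")
        )
        role = "control" if is_control else "worker"
        ready = next((c.status for c in (node.status.conditions or []) if c.type == "Ready"), "Unknown")
        print(f"{node.metadata.name:12s} {role:8s} Ready={ready}")
    [/CODE]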

    Further investigation into the automated alert on the evening of Friday 18 October (CEST timezone) confirmed that one of the disks in the us01 node had failed, after about 30,000 hours of uptime and after writing about 800TB of data in its lifetime. This was concerning, but did not lead to an immediate problem, because all of the nodes in our cluster run using RAID 1. This means that all data on one disk is replicated on another disk, so that a single disk failure does not take down a server. After some internal discussion, a plan was drawn up to get the failed disk replaced by our host provider on Sunday 20 October at 8PM CEST. This maintenance window was then communicated to the public in our Discord server.

    Just before the disk replacement took place, we took a backup of our web files, but we wouldn't need these anyway, because a disk replacement and subsequent RAID array reconstruction would be pretty straightforward and should totally not lead to any issues (this is foreshadowing). As part of the disk replacement, the us01 node was taken offline for about 45 minutes. The network proceeded to run mostly as normal, as we had moved the Minecraft and Bungee workloads off of the us01 node and onto the other nodes.

    When the us01 node came back online, all that was left was to reconstruct the RAID array, duplicating data from the healthy disk to the freshly installed disk. After a struggle to get the required tools installed (us01 was running a somewhat old operating system), the RAID reconstruction started. Estimated completion time: about 2 hours.

    [IMG]
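
    That two-hour estimate is roughly what simple arithmetic would suggest: copying the entire mirror at a typical sustained resync speed. The numbers below are assumptions for illustration, not measurements taken on us01.

    [CODE]
    # Back-of-the-envelope rebuild time estimate. All numbers are illustrative assumptions.
    array_bytes = 512 * 10**9     # assumed array size, matching the 512GB disk
    rebuild_speed = 70 * 10**6    # assumed sustained resync speed in bytes per second
    hours = array_bytes / rebuild_speed / 3600
    print(f"Estimated rebuild time: ~{hours:.1f} hours")  # roughly 2 hours
    [/CODE]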

    Unscheduled maintenance

    Just as we were getting somewhat relaxed seeing the RAID array being rebuilt, disaster struck. The RAID array reconstruction had stopped.

    [IMG]

    From here, the network spiralled into an unhealthy state:
    • The us01 node froze not long after the RAID array reconstruction stopped.
    • It subsequently went offline, presumably because the OS could not run properly without any disks to write data to.
    • The Kubernetes cluster became unresponsive; kubectl commands could no longer be executed.
    • As part of the normal lifecycle of our gamemodes, servers kept shutting down: Annihilation matches finished, MineZ servers hit their scheduled restarts, lobbies restarted. The custom operator described in Lazertester's previous devblog could not contact the Kubernetes API server, so no new games were spinning up to replace them.
    • Our internal monitoring and tooling went offline as our web load balancers failed.
    • The network was no longer joinable, and the website went offline.
    Somewhat confused as to what had just occurred, we attached a temporary "rescue" OS to the us01 node to view the state of the node. Investigation of the journalctl logs showed that the RAID array reconstruction had stopped because it encountered too many bad blocks on the previously healthy disk. And indeed, the healthy disk had become unhealthy and had been switched into a read-only mode to prevent further data loss. It has to be mentioned that suffering from a dual disk failure like this is unlikely, and we were extremely unlucky here.
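
    For those unfamiliar with journalctl: the kernel's messages about the md arrays end up in the systemd journal, so the digging looks roughly like the sketch below. This is a simplification; the actual log lines from us01 are not reproduced here.

    [CODE]
    import subprocess

    def md_kernel_logs(device="md2"):
        """Collect kernel journal lines mentioning the given md device or I/O errors."""
        out = subprocess.run(
            ["journalctl", "-k", "--no-pager", "-o", "short-iso"],
            capture_output=True, text=True, check=True,
        ).stdout
        needles = (device, "I/O error", "read error")
        return [line for line in out.splitlines() if any(n in line for n in needles)]

    if __name__ == "__main__":
        for line in md_kernel_logs():
            print(line)
    [/CODE]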

    From here, we decided our main priorities were, in order:
    1. Restore the Kubernetes cluster
    2. Try to recover as much data as possible from us01
    3. Restore the website

    Recovering Kubernetes

    It took us several days to discover the root cause of the unresponsive Kubernetes cluster: the internal database of Kubernetes had lost quorum. The Kubernetes database is distributed across all control nodes to enable high availability (HA). However, with us01 offline and only one of our two control nodes still in operation, the database could not reach the majority needed to elect a new leader, and so it was unable to fix itself. Once we reset the database and restored a snapshot of it, the Kubernetes cluster came back online and most of the Minecraft network healed itself automatically.
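
    Assuming the usual etcd-backed setup (the database behind Kubernetes in most clusters), writes and leader elections only happen while a strict majority of members is reachable. The quick calculation below shows why two control nodes gave no fault tolerance at all, and why the learnings further down call for at least three.

    [CODE]
    def quorum(members: int) -> int:
        """Smallest strict majority of a cluster of the given size."""
        return members // 2 + 1

    for members in (1, 2, 3, 5):
        tolerated = members - quorum(members)
        print(f"{members} control node(s): quorum is {quorum(members)}, "
              f"can lose {tolerated} node(s) and keep running")
    # With 2 control nodes the quorum is 2, so losing one leaves the database unable
    # to elect a leader; with 3 nodes the cluster survives a single failure.
    [/CODE]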

    Analyzing the faulty disk

    At the same time, an attempt to recover as much data as possible from the failed disk took place. We did have a backup of the website files, and most of our nodes are stateless (they do not store critical data). Nevertheless, we were not sure what other important data might still be on us01. This involved cloning the disk to the freshly installed disk, taking a snapshot of the freshly installed disk, and sending that snapshot over to an inactive node in our cluster. Once done, we could analyze the state of the disk. We discovered that about 45GB of the 512GB disk had "bad blocks", essentially dead / inaccessible data. We assume that a storage cell on the SSD had simply failed.

    [IMG]

    Bad blocks do not necessarily mean we lost 45GB worth of data; most of these blocks would have been empty. However, we concluded that it was not worth trying to mount this data, as the risk that corruption in important parts of the disk would lead to instability was too great. The backup of the web files taken earlier also meant we could restore the website from backup instead.
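
    For reference, a bad-block figure like that 45GB can be obtained by reading the device and counting the reads that fail with an I/O error. The sketch below is heavily simplified and only shows the principle, with a hypothetical device path; dedicated tools such as badblocks or GNU ddrescue do this job far more carefully.

    [CODE]
    import os

    CHUNK = 4 * 1024 * 1024  # read the device in 4 MiB slices

    def unreadable_bytes(device="/dev/sdX"):  # hypothetical device path
        """Count bytes that cannot be read back from the device (needs root)."""
        bad = 0
        fd = os.open(device, os.O_RDONLY)
        try:
            size = os.lseek(fd, 0, os.SEEK_END)
            offset = 0
            while offset < size:
                length = min(CHUNK, size - offset)
                os.lseek(fd, offset, os.SEEK_SET)
                try:
                    os.read(fd, length)
                except OSError:  # EIO: the underlying blocks are gone
                    bad += length
                offset += length
        finally:
            os.close(fd)
        return bad

    if __name__ == "__main__":
        print(f"~{unreadable_bytes() / 10**9:.1f} GB unreadable")
    [/CODE]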

    Restoring the website

    Whilst the investigation into the faulty disk was ongoing, we had switched our DNS to point shotbow.net to buy.shotbow.net, our shop. The shop was still running as it is hosted by an external party, and we figured that something was better than nothing.

    About one week after the remaining healthy disk had failed, we scheduled maintenance with our host provider again to replace this second failed disk. Once they finished this maintenance (and carried out a third intervention to fix remote access to the server), we could provision the us01 node again, this time on a more modern OS. We had to update our provisioning tools (in a somewhat risky operation), but this succeeded and we were able to restore us01 to a blank state. From here, we could rejoin us01 to the cluster as a control node, restore the website data, and restart the website processes. Shortly after, the website was back up and running, with the exception of the MineZ Map.

    It took us somewhat longer to restore the website due to a problem that has haunted some of our other nodes in the past. Two other nodes had already been taken out of service back in April 2024 because any workload deployed on them could not reach other nodes, or the public internet for that matter. Through countless hours of trial and error, we finally discovered that iptables had to be configured to a different mode, because one of the control nodes was running a different version of iptables. Discovering this allowed us to not only provision us01 successfully, but also add the other nodes back into the cluster.
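
    If you ever run into the same symptom: iptables on modern distributions can use either the legacy or the nf_tables backend, and a mix of the two across nodes is a common cause of pods that cannot reach anything. The sketch below simply reports which backend a node's iptables binary is built against; it is illustrative, not our exact debugging steps.

    [CODE]
    import subprocess

    def iptables_backend() -> str:
        """Report which backend this node's iptables uses, based on its version string."""
        out = subprocess.run(
            ["iptables", "--version"], capture_output=True, text=True, check=True
        ).stdout
        if "nf_tables" in out:
            return "nf_tables"
        if "legacy" in out:
            return "legacy"
        return "unknown (older iptables builds do not print a backend)"

    print(f"iptables backend on this node: {iptables_backend()}")
    [/CODE]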

    Learnings

    The full incident lasted for about a week and a half. These were stressful days for some of our developers, who were working on restoring operations whenever they had some free time. We had to dive deep into the core of the network, which was set up by developers who are no longer (actively) developing on Shotbow. A lot of knowledge had been lost over the years and had to be rediscovered. We found several key learnings through this incident:
    • A Kubernetes cluster needs at least 3 control nodes to run in high availability (HA) mode
    • Our internal documentation needs to be overhauled and expanded
    • A distributed file system that we operate is not running in high availability (HA) mode
    • We need to spend time to bring some components in our infrastructure to more modern versions
    • We need to improve our backup setup, as relying solely on local backups and RAID 1 is not sufficient
    • Read the documentation of the tools you use: it contains more information than you might think
    Some of our systems are not yet back up, such as the MineZ Map. We will of course try to get the remaining systems back up and running to restore full operations. In terms of data loss, the damage appears to be minimal. Whilst we did lose a node in our distributed file system, this file system was not used much and mostly stored old data pertaining to garaged games. It also stored the Annihilation map rotation and the MineZ Supply Drops schedule, which is the reason why these features are somewhat limited at the moment. We still have the files, but will have to restructure them.

    We will also be spending quite some time in the upcoming weeks and months following up on a lot of actions to stabilize the network and make it more resilient to future incidents. This should not have a big impact on development work on games, as the developers working on network infrastructure already do not work on games like Anni.

    Closing words

    Phew, that was quite the read, was it not? How about an XP code to reward you for sticking with it to the very end (and to make up for the downtime caused by this incident): DEVBLOG2024.

    If you enjoyed reading through this devblog, be sure to let us know! If there are other topics that you want to see a devblog written on, let us know as well!

    Thanks for flying Shotbow!

  2. TitoDonald Regular Member

    XP:
    134,583xp
    I have read the whole post; it has been interesting to learn what is behind the operation of Shotbow. I feel bad for the developers' headache. At least you guys were able to recover some data from the faulty disk. Wow, 30,000 hours of staying up and finally it's gone.
    JellySword8, Mistri and JACOBSMILE like this.
  3. SimplySanderZ Obsidian

    XP:
    116,418xp
    This blog post was an interesting read. As someone who reads journalctl logs for work, I can understand the amount of information that must have needed to be sifted through. Props to everyone involved who worked on the side to get Shotbow running again.
    Mistri and JACOBSMILE like this.
  4. Marijuana7 Regular Member

    XP:
    6,701xp
    Very well done to all the devs and staff involved. You got Shotbow out of a serious pickle.
    JACOBSMILE and Mistri like this.
  5. ThisIsMyUserID Platinum

    XP:
    97,773xp
    Outstanding work by everyone involved. I was initially worried it was the end of the Shotbow Network. This place holds a special place in my heart as it was a major part of my childhood growing up. Even though I am no longer an active member, I hope to see Shotbow return to her glory days. Lots of fond memories!
  6. PurrfectMistake Platinum

    XP:
    201,734xp
    Amazing work Lazer and all the developers.
    What I wouldn't give to work under and learn from the fine group of developers & sysadmins of Shotbow.

    Your work is just *chef's kiss*.
