Shotbow
Nov
09
![[IMG]](https://i.imgur.com/A3O6oIx.png)
As some of you might have noticed, Shotbow suffered from an outage at the end of October 2024. This fairly lengthy and technical devblog will go into the issue that occurred, the timeline of the incident, and what steps we will take to prevent similar incidents from occurring in the future.
Scheduled maintenance
On Friday 18 October, at 1:37AM CEST, our automated alerting fired off a warning with critical severity: RAID array 'md2' on us01 is in degraded state due to one or more disks failures. Number of spare drives is insufficient to fix issue automatically. This alert was noticed by a developer, who promptly alerted the other developers in the main developer channel. At this stage, the network continued to operate normally.
![[IMG]](https://i.imgur.com/25cLtI4.png)
As highlighted in a previous devblog by Lazertester titled "Minecraft on Kubernetes", Shotbow runs their servers and other infrastructure on Kubernetes. Kubernetes is a scalable system that manages and automates the operation of services through containers across multiple physical boxes. Shotbow has several physical servers, us01 being one of them. This node, us01, is also an important one as it hosts the Shotbow website (which was not fully containerized when we embraced Kubernetes). The us01 node also acts as a control node in our Kubernetes cluster, which means that it is responsible for the "daily operations" of the entire Kubernetes cluster. Shotbow had two of these control nodes.
Further investigation into the automated alert on the evening of Friday 18 October (CEST) confirmed that one of the disks in the us01 node had failed after about 30,000 hours of uptime and about 800TB of data written in its lifetime. This was concerning, but did not lead to an immediate problem, because all of the nodes in our cluster use RAID 1: all data on one disk is mirrored on a second disk, to guard against a single disk failure taking down a server. After some internal discussion, a plan was drafted to have the failed disk replaced by our host provider on Sunday 20 October at 8PM CEST. This maintenance window was then communicated to the public in our Discord server.
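To illustrate how a degraded RAID 1 array like the one in the alert can be spotted, here is a minimal Python sketch that parses text in the format of Linux's /proc/mdstat. The sample text, the device names and the `degraded_arrays` helper are illustrative assumptions, not Shotbow's actual alerting setup.

```python
import re

# Hypothetical sample in the format of /proc/mdstat; device names are made up.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      523264 blocks super 1.2 [2/2] [UU]

md2 : active raid1 sda3[0]
      498530240 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""

ARRAY_LINE = re.compile(r"^(md\d+) : ")
# "[2/1]" means: 2 members expected, 1 currently active.
STATUS = re.compile(r"\[(\d+)/(\d+)\]")


def degraded_arrays(mdstat_text):
    """Return {array: (expected, active)} for arrays with missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_LINE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if current and status:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


print(degraded_arrays(SAMPLE_MDSTAT))
```

In the sample, md2 has only one of its two mirrored members left, which is the situation the alert described: the array still works, but there is no redundancy left until the missing disk is replaced and rebuilt.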
Just before the disk replacement took place, we took a backup of our web files, but we wouldn't need these anyways because a disk replacement and subsequent RAID array reconstruction would be pretty straightforward and should totally not lead to any issues (this is foreshadowing). As part of the disk replacement, the us01 node was taken offline for about 45 minutes. The network proceeded to run mostly as normal, as we had moved Minecraft and Bungee workloads off the us01 node and onto the other nodes.
When the us01 node came back online, all that was left was to reconstruct the RAID array, duplicating data from the healthy disk to the freshly installed disk. After struggling to get the required tools installed due to running a somewhat old operating system on us01, the RAID reconstruction started. Estimated completion time: about 2 hours.
![[IMG]](https://i.imgur.com/BrSqbs7.png)
Unscheduled maintenance
Just as we were getting somewhat relaxed seeing the RAID array being rebuilt, disaster struck. The RAID array reconstruction had stopped.
![[IMG]](https://i.imgur.com/eEog7hG.png)
From here, the network spiralled into an unhealthy state:
- The us01 node froze not long after the RAID array reconstruction stopped
- It subsequently went offline, presumably due to the OS not being able to run properly without any disks to write data to.
- The Kubernetes cluster became unresponsive; kubectl commands could no longer be executed.
- Servers kept shutting down as part of the normal lifecycle of gamemodes (finished Annihilation matches, scheduled restarts of MineZ servers, lobby restarts). The custom operator described in Lazertester's previous devblog could not contact the Kubernetes API server, so no new games were spinning up.
- Our internal monitoring and tooling went offline as our web load balancers failed.
- The network was no longer joinable, and the website went offline.
From here, we decided our main priorities were, in order:
- Restore the Kubernetes cluster
- Try to recover as much data as possible from us01
- Restore the website
It took us several days to discover the root cause of the unresponsive Kubernetes cluster: the internal database of Kubernetes had lost quorum. The Kubernetes database is distributed across all control nodes to enable high availability (HA). However, it needs a majority of its members to agree before it can elect a leader. With only two control nodes, losing us01 left the one remaining control node without a majority, so the database could not elect a new leader and could not fix itself. Once we reset the database and restored a snapshot of it, the Kubernetes cluster came back online and most of the Minecraft network healed itself automatically.
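The quorum arithmetic behind this failure can be sketched as follows, assuming a majority-based consensus store (etcd, the database commonly used by Kubernetes, works this way; the devblog does not name the database in use). The function names are illustrative.

```python
def quorum(members):
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def tolerated_failures(members):
    """How many members can be lost while a majority still remains."""
    return members - quorum(members)


for n in (1, 2, 3, 5):
    print(f"{n} control nodes: quorum={quorum(n)}, "
          f"tolerated failures={tolerated_failures(n)}")
```

With two control nodes the quorum is 2, so losing either node leaves the survivor unable to elect a leader. Three control nodes is the smallest setup that survives the loss of one.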
Analyzing the faulty disk
At the same time, we attempted to recover as much data as possible from the failed disk. We did have a backup of the website files, and most of our nodes are stateless (they do not store critical data). Nevertheless, we were not sure what other important data might still be on us01. The recovery involved cloning the disk to the freshly installed disk, taking a snapshot of the freshly installed disk, and sending it over to an inactive node in our cluster. Once done, we could analyse the state of the disk. We discovered that about 45GB of the 512GB disk had "bad blocks", essentially dead / inaccessible data. We assume that a storage cell on the SSD had simply failed.
![[IMG]](https://i.imgur.com/bptqfAk.png)
Dead blocks do not necessarily mean we lost 45GB worth of data. Most of these blocks would have been empty. However, we concluded that it was not worth trying to mount this data, as the risk of corruption of important parts of the disk leading to instability was too great. Also, the previously taken backup of the web files meant we could restore the website from backup instead.
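The general idea behind cloning a disk that has unreadable regions can be sketched in Python: copy block by block, substitute zeros wherever a read fails, and record which blocks were lost. This is an assumption about the technique in general (dedicated recovery tools do this in a far more sophisticated way), not the tooling that was actually used here. `clone_skipping_bad_blocks`, its parameters and the `FlakySource` stand-in are hypothetical.

```python
import io


def clone_skipping_bad_blocks(src, dst, total_size, block_size=4096):
    """Copy src to dst block by block; zero-fill blocks that cannot be read.

    Returns the byte offsets of the blocks that failed to read.
    """
    bad_offsets = []
    for offset in range(0, total_size, block_size):
        length = min(block_size, total_size - offset)
        try:
            src.seek(offset)
            data = src.read(length)
        except OSError:
            data = b""
            bad_offsets.append(offset)
        # Pad short or failed reads so offsets in dst match offsets in src.
        dst.seek(offset)
        dst.write(data.ljust(length, b"\x00"))
    return bad_offsets


class FlakySource(io.BytesIO):
    """In-memory stand-in for a disk whose block at offset 4 is unreadable."""

    def read(self, size=-1):
        if self.tell() == 4:
            raise OSError("simulated bad block")
        return super().read(size)


src = FlakySource(b"abcdefghij")
dst = io.BytesIO()
bad = clone_skipping_bad_blocks(src, dst, total_size=10, block_size=4)
print(bad, dst.getvalue())
```

The list of bad offsets is what makes a later analysis possible: it shows how much of the clone is zero-filled rather than real data, even though it cannot say whether those blocks ever held anything important.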
Restoring the website
Whilst investigation into the faulty disk was ongoing, we had switched our DNS to point shotbow.net to buy.shotbow.net, our shop. The shop was still running as it is hosted by an external party, but we figured that something is better than nothing.
About one week after the remaining healthy disk had failed, we scheduled maintenance with our host provider again to replace the failed disk. Once they finished this maintenance (and had a third intervention to fix the remote access to the server), we could provision the us01 node again, this time on a more modern OS. We had to update our provisioning tools (in a somewhat risky operation), but this succeeded and we were able to restore us01 to a blank state. From here, we could reinstate us01 as a control node in the cluster, restore the website data, and restart the website processes. Shortly after, the website was back up and running, with the exception of the MineZ Map.
It took us somewhat longer to restore the website due to a problem that has haunted some of our other nodes in the past. Two other nodes had already been taken out of service back in April 2024 because any workload deployed on these nodes could not reach other nodes, or the public internet for that matter. Through countless hours of trial and error, we finally discovered that our iptables had to be configured to a different mode because one of the control nodes was running a different version of iptables. Discovering this allowed us to not only provision us01 successfully, but also add the other nodes back into the cluster.
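One way such a mismatch can be spotted is by comparing what `iptables --version` reports on each node. The sketch below assumes the "mode" in question is the legacy versus nf_tables backend, which iptables 1.8 and later print in parentheses. The devblog does not say which modes were involved, and the node names and version strings here are made-up samples.

```python
import re

VERSION_LINE = re.compile(r"iptables v(\d+(?:\.\d+)*)(?: \((\w+)\))?")


def parse_iptables_version(output):
    """Return (version, backend) from `iptables --version` output.

    backend is None when the output does not name one (older releases).
    """
    match = VERSION_LINE.search(output)
    if match is None:
        raise ValueError(f"unrecognised iptables version output: {output!r}")
    return match.group(1), match.group(2)


# Hypothetical outputs collected from each node.
nodes = {
    "node-a": "iptables v1.8.7 (nf_tables)",
    "node-b": "iptables v1.8.7 (nf_tables)",
    "node-c": "iptables v1.6.1",
}

backends = {name: parse_iptables_version(out)[1] for name, out in nodes.items()}
mismatch = len(set(backends.values())) > 1
print(backends, "mismatch" if mismatch else "consistent")
```

A check like this only flags that nodes disagree. Deciding which mode the cluster's networking components should use still depends on the specific setup.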
Learnings
The full incident lasted for about a week and a half. These were stressful days for some of our developers who were working on restoring operations whenever they had some free time. We had to dive deep into the core of the network, which was set up by developers who are no longer (actively) developing on Shotbow. A lot of knowledge had been lost over the years and had to be rediscovered. We found several key learnings through this incident:
- A Kubernetes cluster needs at least 3 control nodes to run in high availability (HA) mode
- Our internal documentation needs to be overhauled and expanded
- A distributed file system that we operate is not running in high availability (HA) mode
- We need to spend time to bring some components in our infrastructure to more modern versions
- We need to improve our backup setup, as relying solely on local backups and RAID 1 is not sufficient
- Read the documentation of the tools you use: it contains more information than you might think
We will also be spending quite some time in the upcoming weeks and months following up on a lot of actions to stabilize the network and make it more resilient to future incidents. This should not have a big impact on development work on games, as the developers working on network infrastructure are already not working on games like Anni.
Closing words
Phew, that was quite the read, was it not? How about an XP code to reward you for sticking with it to the very end (and to make up for the downtime caused by this incident): DEVBLOG2024.
If you enjoyed reading through this devblog, be sure to let us know! If there are other topics that you want to see a devblog written on, let us know as well!
Thanks for flying Shotbow!
Oct
19
![[IMG]](https://i.imgur.com/rxG62pk.png)
Hey there Shotbow!
I hope you're all doing fangtastic! It's time for the Spookiest time of the year - Halloween! I'm your host LangScott here, and I'm hyped to be back to spill the tea on our yearly Spookfest season here at Shotbow. We've got all the goOooOoods ready for you once again: game updates that'll give you goose bumps, thrilling events, contests that'll give you a scare, an XP code, and a whole bunch of other spooky surprises.
Get ready! I'm your host for this wild hay ride through everything that's happening during Shotbow's Spookfest 2024. Get ready for some Spooktacular fun!
Spooky Network Secrets
- We will be having a network-wide 2x XP Multiplier from the 24th till the end of October.
- Annihilation is enabling a 2x Rank Points Multiplier from around 10PM CST October 30th to 10PM CST November 3rd.
- Use code HALLOWEEN2024 for 15,000 Shotbow XP.
![[IMG]](https://i.imgur.com/PqEzXMz.png)
Pumpkin Patch Headhunt
Pumpkins have started to grow around the lobby! Grab 'em, and you'll be in for some spooky new cosmetic rewards!
Terrifying Game Changes Await
The spooky scary skeletons tell us that Annihilation has a freaky new map rotation, with spooky maps for you to do the mash. And that's not all: MineZ is back with its own Spookfest season, loaded with cursed items, builds to die for, and a dungeon if you dare to be scared. Keep an eye out for SMASH, as it's donning a Halloween-themed lobby, complete with freshly haunting items and mobs. Get ready for a hauntingly good time!
Spooktacular Store Bundle
The Spooktacular Store Bundle will soon be available on The Shotbow Store.
Closing Word
Lastly, thanks to you, our lovely community, for all the support! We really appreciate y'all. We hope you'll enjoy this year's Spookfest. Have fun, we hope to see you online!
![[IMG]](https://i.imgur.com/zDTm6bm.png)
Hey folks!
It's time for the Spookfest 2024 Skin Contest, and guess what? Anyone can join in on the fun! We've got cool prizes up for grabs, so don't miss out. This thread's got all you need to know for submitting your Halloween skin!
Prizes
1ST PLACE ━ Gold
2ND PLACE ━ 25K SHOTBOW XP
3RD PLACE ━ 10K SHOTBOW XP
Rules
- Screenshots must be taken on the Shotbow server.
- You may submit a maximum of 3 skin designs.
- All skin submissions should be related to the Halloween season.
- Any inappropriate submissions will immediately lead to disqualification.
You can submit your selfie by simply uploading your screenshot to imgur and posting it to this thread.
Deadline
The Selfie Competition wraps up on November 1st, 2024. If you're a winner, we'll shoot you a DM on the forums and shout it out on the forums and our Discord announcements channel! Stay spooky and stay tuned!
Good luck to everyone!
Oct
16
by halowars91 at 12:39 PM
![[IMG]](https://i.imgur.com/mjVH6Dl.png)
Greetings Survivors!
Spookfest has returned! MineZ Spookfest 2024 Phase I and II have arrived, bringing with them treats, feats and frights.
Treat Chests
Treat Chests have returned for Spookfest! These spooky supply drops have been a staple of Spookfest, and bring many different players together for the chance at getting some special loot. Loot from these events is special, and those who manage to grab a portion of the loot from the many chests that spawn during these events will walk away well rewarded.
This year, we are testing a new scheme for treat chests, isolating the major ones to just the weekend. These drops are larger than normal and have a few special additions. These treat chests feature backfilling chests. This means that a Spooky Medium chest will also fill with some items from the tier below it. This looks like the following image. Weekday drops also spawn a few items from the treat chest loot pool, but do not spawn nearly as many chests or feature large Halloween builds.
We try our best to make sure chests are available to as many people as possible and we are accommodating our players all over the world. For that reason, we've decided on this schedule for Treat Chests! (All times in Eastern Standard Time EST)
You can view this year's schedule in-game by typing the /drops command.
It is important to know how chests spawn during Treat Chest events, so you don't miss the chance to grab some loot! At the scheduled start time there will be an announcement across MineZ with the drop's location and a countdown to the first chest spawn! There will also be an announcement in the Shotbow Discord for those who have the @Minez Notifications role!
Chests spawn 10 minutes after the announcement and, similar to the regular supply drops, will not all spawn at the same time. Chests spawn in phases: little by little, each chest spawns in, staggered one after the other, so you'll either have to get out quick with the first drop of loot, or take control of the area and keep defending it to claim as much as you can! This year, the treat chests have very few, if any, expiring items. The only items set to expire are those which would otherwise be too problematic to have in the game permanently.
As always, alting is not allowed, and all other MineZ and Shotbow Rules apply as normal.
Spooky Explorations
![[IMG]](https://i.imgur.com/9ZlmJ8N.png)
Spooky builds have become a staple throughout the many years of Spookfest. Each year the team has brought new and old builds back. This year we have added many new spooky and spine-shivering additions. From returning favorites to new spooky builds, you'll have quite a lot of exploring to do. Like previous years, spooky locations offer a variety of special Halloween goodies and some even offer parkour challenges.
Step Right Up
Also new to Spookfest this year, the team has been hard at work creating a spooky dungeon for those of you brave enough to venture into it. We're not clowning around with this one. Best of Luck.
If you are brave enough to solo this dungeon, you will be eligible for a special contest. We will reward the first person to complete the dungeon SOLO and post their video to YouTube with a special, unique prize. You must DM this video to a MineZ Staff member to claim your prize.
A Solo run of this dungeon is eligible when you alone, from start to finish, complete the entire dungeon with no other assistance from players at the time of your run. You may use any prior knowledge from group runs. You may not use any resource blocks which change the appearance of blocks.
The Spooky Skeleton Hunt
![[IMG]](https://i.imgur.com/lNxrt8D.png)
With the release of Phase II of Spookfest there are now two active Halloween Headhunts.
You may have to search more thoroughly for some that are well hidden, but if you are persistent you’ll be greatly rewarded! Once you've found a head, simply right click it to be awarded a small amount of Shotbow XP. When you claim the last head, you'll be given some extra prizes.
You can view info about this headhunt in-game as well by typing the /checkheads command.
Closing
We're super excited for you all to explore Spookfest, head hunting, Treat Chests and more. These holiday events are always something to look forward to. Remember that Wonderland is not too far off!
Enjoy the spooky celebrations and as always,
Thanks for flying Shotbow!