This is going to be a slightly more technical read than most of our postings; you have been warned. After a hiccup or two, we have completed an (almost) invisible but fundamental change to our server infrastructure: every one of our Minecraft servers is now deployed into a Kubernetes cluster for orchestration, replacing the home-grown orchestration layer we had been using for over five years, which we simply referred to as "Deployment".

Deployment: A Legacy

A picture showcasing the internal Deployment file architecture and its numerous Python scripts.

Deployment was an amalgamation of Python scripts, cron jobs, and database polling that kept up with the state of the servers, rolled new ones, destroyed expired ones, and gave us insight into what was currently running. When it was first created it solved a HUGE need and was miles ahead of our old method of "Log in and run script A if we need another MineZ server, edit the server.properties, and execute the start command. Log in and run script B if we need another Anni server, edit the server.properties, etc...". It helped us break the model of uniquely managed server instances (anyone remember the old days of MINEZ_07_US loyalty?) and instead gave us repeatable units we could scale out and in at will.

Over the years, developers added to and removed from it to automate additional common tasks, and it was very useful, but its deep underbelly became a scary place that no one dared venture into, since it was impossible to untangle the web of scripts and schedules at its heart. Sometimes rsyncs of the server files would fail and we'd have to dig through our servers to find what went wrong. Other times, some bad data broke one of its unspoken "rules" and it refused to work entirely. It became a gigantic, spooky, mysterious volcano that could erupt at any moment.
Ultimately, the key shortcoming of Deployment was that it was tailor-made to run only our Minecraft instances. What happens if we want to run something else, like a new database, a web service, or even another game? We knew there had to be a better way, and we happened to have the background necessary to put it to work.

Enter Stage Right

Adam (aet2505) and I both work professionally as Software Engineers dealing with Kubernetes (k8s) on a daily basis. I asked him what he thought it would take to get everything running in k8s, and we both sort of verbally shrugged and got to work. As many of you know, I am based in Salt Lake City, Utah; what you might not know is that Adam is not. The time difference between us meant at least one of us could be working on it around the clock for almost two weeks straight: provisioning the cluster, establishing deployments, creating a dockerization strategy for our instances, and testing how Bungee liked the networking model inside the cluster (spoiler alert: it loved it).

Every Rose Has Its Thorns

The biggest hurdle we hit along the way was that Kubernetes pods are inherently ephemeral, meaning they come and go at will; our Minecraft servers, however, should only shut down under specific circumstances. To solve this, Adam established a Custom Resource Definition he aptly named "MinecraftSet" and created an operator that implements our desired control loop. The operator keeps the number of instances at or above minReplicas, and a scaler can react to server state changes (like population) as well as ask each instance whether it is ready to shut down. This lets game creators respond to shutdown requests using the existing event-driven pattern inside Spigot and indicate whether stopping is safe or not. It has given us the flexibility to deploy, with the same pattern, both a server that is usually fine to shut down on demand (such as DBV) and a server that should NEVER shut down under most circumstances (like Annihilation).
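To make that control loop concrete, here is a minimal, purely illustrative sketch in Python. The real operator works against the Kubernetes API and is not shown in this post; every name here (Instance, reconcile, ready_to_shutdown, and so on) is invented for the example and is not Shotbow's actual code.

```python
# Illustrative sketch of a "MinecraftSet"-style reconcile step: converge on
# the desired replica count without dropping below min_replicas, and never
# delete a server that has not agreed it is safe to stop.
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    players: int
    ready_to_shutdown: bool  # the server answers this via its plugin


def reconcile(instances, min_replicas, desired_replicas):
    """Return (how_many_to_create, names_to_delete) for one control-loop pass."""
    to_create = max(desired_replicas - len(instances), 0)
    to_delete = []
    # Never scale below the min_replicas floor, whatever was requested.
    surplus = len(instances) - max(desired_replicas, min_replicas)
    if surplus > 0:
        # Only instances that report they are safe to stop may be culled,
        # preferring the emptiest ones first.
        candidates = sorted(
            (i for i in instances if i.ready_to_shutdown),
            key=lambda i: i.players,
        )
        to_delete = [i.name for i in candidates[:surplus]]
    return to_create, to_delete


servers = [
    Instance("anni-1", players=80, ready_to_shutdown=False),  # mid-game
    Instance("anni-2", players=0, ready_to_shutdown=True),
    Instance("anni-3", players=2, ready_to_shutdown=True),
]
reconcile(servers, min_replicas=1, desired_replicas=1)
# anni-1 survives the scale-down because it refused the shutdown request
```

The point of the pattern is that last line: scaling down is a negotiation with the game, not an order, so a busy Annihilation instance simply says "not yet" and the loop tries again later.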
Annihilation itself owns the logic governing shutdowns by influencing the control loop based on game state, instead of a script running on a mystery box (could you imagine every game terminating just as Phase 3 starts!?).

Like Popsicles In Summer

Once we overcame that hurdle, we found that a few problems from our old Deployment world had simply melted away. Because of Kubernetes' networking model, every Minecraft server can simply run on port 25565: each instance is deployed into a "pod" in the cluster and receives its own IP address. The swathes of code for figuring out which port was available, and for ensuring that port was published when the server registered with our router, were simply deleted (deleting code is the BEST). Managing storage was a bit of a pain, but PersistentVolumes in Kubernetes and volume mounts in Docker let us handle storage in a way that is transparent to the running server.

Using Prometheus, we automatically discover and scrape metrics from running instances, which lets us build useful Grafana dashboards for everything from server health to player report spikes. ArgoCD helps us reconcile our repository of Kubernetes manifests with the running cluster, provides single-button rollback support, and gives an at-a-glance view of what is running, along with server logs. Previously, reverts consisted of tracking down the plugin, configuration, or build that had changed, manually reverting it, then re-releasing. Now we can roll back with ArgoCD, which takes the pressure off the team: they can take their time diagnosing what went wrong with a rollout and easily roll forward again with the fix.
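For the curious, the Prometheus side of this is pleasantly low-tech: each instance just serves its numbers as plain text over HTTP, and Prometheus discovers the pods and polls them. Here is a toy sketch of what such an endpoint might look like; the metric names and port are invented for illustration, and this is not our actual exporter.

```python
# Toy sketch of a /metrics endpoint in Prometheus' plain-text exposition
# format. Metric names and the port are made up for this example.
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(stats: dict) -> str:
    """Render simple gauge metrics in Prometheus' plain-text format."""
    lines = []
    for name, value in sorted(stats.items()):
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers GET /metrics with the current (here: hard-coded) server stats."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics({"mc_players_online": 17, "mc_tps": 19.8}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9225) -> None:
    # Prometheus would discover the pod and scrape http://<pod-ip>:9225/metrics.
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Once the cluster knows a pod exposes metrics (via annotations or a ServiceMonitor, depending on the setup), Prometheus picks it up with no per-server configuration, which is what makes those Grafana dashboards effectively free.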
Here's When in Rogue in the ArgoCD interface:

An image showing When in Rogue in the ArgoCD interface.

Say Goodbye to Proxy Reboots!

We used to be very limited in how often we could make changes to the Bungeecord proxies, because making a change kicked every single user at once, which was far from ideal. The new networking model has allowed us to eliminate hard proxy reboots entirely: Bungeecord now sits behind a Kubernetes service definition, which handles load balancing across running nodes automatically. This means that when we have a new Bungeecord deployment, we can spin up a new set of proxies, route all new incoming connections to those, and let the old proxies continue to serve players already connected to them, slowly draining as players sign off or reconnect. Let's see it in action:

Rebooting a proxy means standing up a new set and letting the old ones slowly drain.

Easy-to-understand dashboards showcasing the player count slowly draining from the old proxies.

What's Next?

One of the new additions I am working on is a proper social system, starting with Friends and soon expanding to network-wide parties. It has a backing web API and is built on a graph database, which is well suited to this kind of problem. I was able to spin up the dev stack in minutes, with full discoverability and connectivity from our Minecraft instances. This has allowed me to move quickly on rich social features that will improve your experience with us here at Shotbow. It has also enabled some of our new devs to dive right into crafting new experiences, provided a safety net so those working on established modes can experiment and roll back easily, and increased our ability to monitor and support the infrastructure that powers these games we love.

If you enjoyed this technical deep dive into Shotbow's core infrastructure, be sure to let us know! We may just continue to do more of these "Devblogs" in the future!
But for now, as always: Thanks for flying Shotbow!