Servers - Shotbow Devblog: Minecraft on Kubernetes

Discussion in 'Announcements' started by lazertester, Sep 27, 2020.

  1. lazertester Lead Developer

    XP:
    365,160xp
    [IMG]

    This is going to be a slightly more technical read than most of our postings; you have been warned. :wink:

    After a hiccup or two, we have completed an (almost) invisible but fundamental change to our server infrastructure: every one of our Minecraft servers is now deployed into a Kubernetes cluster for orchestration, instead of the home-grown orchestration layer we’ve been using for over 5 years, which we’ve simply referred to as “Deployment”.

    Deployment: A Legacy

    [IMG]
    A picture showcasing the internal Deployment file architecture and numerous python scripts.

    Deployment was an amalgamation of Python scripts, cron jobs, and database polling to keep up with the state of the servers, roll new ones, destroy expired ones, and give us insight into what was currently running.

    When it was first created it solved a HUGE need and was miles ahead of our old method of “log in and run script A if we need another MineZ server, edit the server.properties, and execute the start command; log in and run script B if we need another Anni server, edit the server.properties, etc...”.

    It helped us break the model of server instances being uniquely managed (Anyone remember the old days of MINEZ_07_US loyalty?), and instead allowed us to have repeatable units we could scale out and in at will.

    [IMG]

    Over the years, developers have added to and removed from it to automate additional common tasks, and it has been very useful, but its deep underbelly became a scary place that no one dared to venture into, since it was impossible to untangle the web of scripts and schedules at its heart.

    Sometimes rsyncs of the server files would fail and we’d have to dig through our servers to find what went wrong. Other times, some bad data broke one of its unspoken “rules” and it refused to work entirely. It became a gigantic, spooky, mysterious volcano that could erupt at any moment.

    Ultimately the key shortcoming of Deployment was that it was tailor-made to run only our Minecraft instances. What happens if we want to run something else, like a new database, a web service, or even host another game? We knew there had to be a better way, and we happened to have the background necessary to put it to work.

    Enter Stage Right

    Adam (aet2505) and I both work professionally as Software Engineers dealing with Kubernetes (k8s) on a daily basis.

    I asked him what he thought it would take us to get everything running in k8s, and we both sort of verbally shrugged and got to work. As many of you know, I am based in Salt Lake City, Utah; what you might not know is that Adam is not.

    The time difference between us basically meant at least one of us could be working on it around the clock for almost 2 weeks straight, provisioning the cluster, establishing deployments, creating a dockerization strategy for our instances, and testing out how Bungee liked the networking model inside the cluster (spoiler alert: It loved it).

    Every Rose Has Its Thorns

    The biggest hurdle we hit along the way was that Kubernetes pods are inherently ephemeral, meaning they come and go at will; we, however, want our Minecraft servers to shut down only under specific circumstances.

    To solve this, Adam established a Custom Resource Definition he aptly named “MinecraftSet” and created an operator that facilitates our desired control loop. This operator keeps the number of instances above minReplicas, and a scaler can react to server state changes (like population) as well as ask an instance whether it is ready to shut down.

    This lets game creators respond to shutdown requests in the existing event-driven pattern inside of Spigot and indicate whether or not it is safe. Using the same pattern, we can deploy a server that is usually fine to shut down on demand, such as DBV, alongside a server that should NEVER shut down under most circumstances, like Annihilation.

    Annihilation itself is able to own the logic governing shutdowns by influencing the control loop based on game state instead of a script running on a mystery box (could you imagine every game terminating just as Phase 3 starts!?).
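    To make that concrete, here is a minimal sketch of what the Spigot-side hook can look like. The event and class names below are purely illustrative (the real API lives in our own plugins and looks a little different), but the shape is the same: the operator asks, and the game mode answers OK or not OK.

    import org.bukkit.event.Event;
    import org.bukkit.event.EventHandler;
    import org.bukkit.event.HandlerList;
    import org.bukkit.event.Listener;

    // Hypothetical event fired when the operator asks this instance whether it
    // may shut down. Illustrative only.
    class ShutdownRequestEvent extends Event {
        private static final HandlerList HANDLERS = new HandlerList();
        private boolean safeToShutDown = true;

        public void allow() { this.safeToShutDown = true; }
        public void deny()  { this.safeToShutDown = false; }
        public boolean isSafeToShutDown() { return safeToShutDown; }

        @Override public HandlerList getHandlers() { return HANDLERS; }
        public static HandlerList getHandlerList() { return HANDLERS; }
    }

    // Game-mode side: a single listener answers OK / not OK based on game state.
    class ShutdownGate implements Listener {
        private volatile boolean matchInProgress; // maintained by the game mode itself

        public void setMatchInProgress(boolean inProgress) { this.matchInProgress = inProgress; }

        @EventHandler
        public void onShutdownRequest(ShutdownRequestEvent event) {
            // Annihilation-style rule: never terminate mid-game; an idle or
            // finished server is free to go whenever the operator asks.
            if (matchInProgress) {
                event.deny();
            } else {
                event.allow();
            }
        }
    }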

    Like Popsicles In Summer

    Once we overcame that hurdle, we found that a few problems from our Deployment world had simply melted away.

    Because of Kubernetes’ networking model, all Minecraft servers can simply run on port 25565. Every instance is deployed into a “pod” in the cluster and receives its own IP address. We had swathes of code for figuring out which port was available and ensuring that port was published when the server registered with our router; all of it was simply deleted (deleting code is the BEST).
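    As a rough illustration (this is not our actual router code, and the names here are made up), dynamically registering an instance from a BungeeCord plugin no longer involves any port bookkeeping; the only moving part is the pod's IP:

    import java.net.InetSocketAddress;

    import net.md_5.bungee.api.ProxyServer;
    import net.md_5.bungee.api.config.ServerInfo;

    // Sketch of dynamic registration: every pod listens on 25565 and is
    // addressed by its own cluster IP, so there is no "find a free port"
    // logic left to write.
    public final class PodServerRegistrar {

        public static void register(String serverName, String podIp) {
            ServerInfo info = ProxyServer.getInstance().constructServerInfo(
                    serverName,
                    new InetSocketAddress(podIp, 25565),
                    "Dynamically registered instance", // MOTD shown in the server list
                    false);                            // not restricted
            ProxyServer.getInstance().getServers().put(serverName, info);
        }
    }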

    Managing storage was a bit of a pain, but PersistentVolumes in Kubernetes and volume mounts in Docker allow us to handle storage in a way that is transparent to the running server. Using Prometheus, we are able to discover and scrape metrics on running instances automatically, allowing us to create useful Grafana dashboards for everything from server health to player report spikes.
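    For the curious, a bare-bones exporter using the Prometheus Java simpleclient looks something like the sketch below. This is not our production metrics plugin, just an illustration of the pattern: expose a gauge on an HTTP port (9225 is an arbitrary choice here) and let Prometheus discover and scrape it.

    import java.io.IOException;

    import io.prometheus.client.Gauge;
    import io.prometheus.client.exporter.HTTPServer;
    import org.bukkit.Bukkit;
    import org.bukkit.plugin.java.JavaPlugin;

    // Minimal metrics exporter sketch: publish an online-player gauge that
    // Prometheus can scrape.
    public class MetricsPlugin extends JavaPlugin {

        private static final Gauge ONLINE_PLAYERS = Gauge.build()
                .name("minecraft_online_players")
                .help("Players currently connected to this instance")
                .register();

        private HTTPServer exporter;

        @Override
        public void onEnable() {
            try {
                exporter = new HTTPServer(9225);
            } catch (IOException e) {
                getLogger().severe("Could not start metrics endpoint: " + e.getMessage());
                return;
            }
            // Refresh the gauge once a second (20 ticks).
            Bukkit.getScheduler().runTaskTimer(this, () ->
                    ONLINE_PLAYERS.set(Bukkit.getOnlinePlayers().size()), 0L, 20L);
        }

        @Override
        public void onDisable() {
            if (exporter != null) {
                exporter.stop();
            }
        }
    }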

    ArgoCD helps us reconcile our repository of Kubernetes manifests with the running cluster, provides single-button rollback support, and gives us a quick-glance view into what is running, as well as server logs. Previously, reverts consisted of tracking down the plugin, configuration, or build that had changed, manually reverting it, then re-releasing.

    In this universe we can roll back with ArgoCD, which takes the pressure off the team so they can take their time diagnosing what went wrong with a rollout and easily roll forward again with the fix. Here’s When in Rogue in the ArgoCD interface:

    [IMG]
    An image showing When in Rogue in the ArgoCD interface.

    Say Goodbye to Proxy Reboots!

    We used to be very limited in how often we could make changes to the BungeeCord proxies, because making a change kicked every single user at once, which was far from ideal.

    The new networking model has also allowed us to eliminate hard proxy reboots entirely, as BungeeCord now sits behind a Kubernetes service definition, which handles load balancing to running nodes automatically. This means when we have a new BungeeCord deployment, we can spin up a new set of proxies, route all new incoming connections to those, and allow the old proxies to continue serving those who already connected to them, slowly draining as players sign off or reconnect. Let’s see it in action:

    [IMG]
    Rebooting a proxy means standing up a new set and letting the old ones slowly drain.
    [IMG]
    [IMG]
    Easy to understand dashboards showcasing player count slowly draining from old proxies.

    What's Next?

    One of the new additions I am working on is a proper social system, starting with Friends and soon with network-wide parties. This has a backing web API and is built on a graph database which is well suited for this kind of problem.

    I was able to spin up the dev stack in minutes, with full discoverability and connectivity from our Minecraft instances to interact with it. This has allowed me to move quickly to create rich social features that will improve your experience with us here at Shotbow. It has also enabled some of our new devs to dive right into crafting new experiences, given those working with established modes a safety net so they can experiment and roll back easily, and increased our ability to monitor and support the infrastructure that powers these games we love.


    If you enjoyed this technical deep dive into Shotbow's core infrastructure, be sure to let us know! We may just continue to do more of these "Devblogs" in the future!

    But for now, as always,

    Thanks for flying Shotbow!


    Jingle535, Yesus42, Goliac and 33 others like this.

  2. xwickedxshadow72 Regular Member

    XP:
    75,281xp
    lazer the GOAT
    Mistri, wincing and JeTi_Brothers like this.
  3. JeTi_Brothers Platinum

    XP:
    215,738xp
    Very cool insight!
    Mistri likes this.
  4. otcathatsya Platinum

    XP:
    206,219xp
  5. Logiz_ Regular Member

    XP:
    17,060xp
    I found it very interesting! I am studying server architecture and backend, so getting some real-life examples of how it's applied is helpful. (still lots of studying left doe)
  6. schiindler Retired Staff

    XP:
    214,332xp
    I had no idea what anything meant in this post. 10/10 good work
  7. TravisArmstrong Retired Staff

    XP:
    39,261xp
    Great to see a detailed writeup. I'm surprised my initial deployment foundation lasted this long. Adam helped a lot to improve and expand upon it to make it viable for more game modes in the beginning.
    Eillom, noobfan, JACOBSMILE and 3 others like this.
  8. Pjstaab Regular Member

    XP:
    20xp
    Nice, always cool to see k8s used in random things. Wondering if you guys have ever looked at Agones. Might make the whole thing even more hands off.
    JACOBSMILE likes this.
  9. lazertester Lead Developer

    XP:
    365,160xp
    Seriously, 10/10 work Travis.
    JACOBSMILE likes this.
  10. lazertester Lead Developer

    XP:
    365,160xp
    No, I hadn't looked into it. I will now though :stuck_out_tongue:
    JACOBSMILE likes this.
  11. Jarool Mini Admin

    XP:
    327,930xp
    I wish I understood what it all meant, but so far, all I've understood is that lazertester and aet2505 are simply incredible.
    JACOBSMILE likes this.
  12. Eillom Emerald

    XP:
    61,792xp
    Jesus this is impressive. Great work gents.
  13. aet2505 Developer

    XP:
    69,257xp

    I did have a look at this when fleshing out the solution and there were a couple of reasons I chose not to pursue it.
    • I'd have to write a load of hooks into the system using gRPC, which felt much more complex than it needed to be; we expose all our monitoring information without adding anything extra to the Spigot server and just use the already-included Netty library.
    • It felt like it was designed around a system with a central matchmaker where a server lived the duration of a game and then terminated. This is great for most situations but doesn't really fit with our model around lobbies and MineZ etc.
    I did look at it, along with this rather nice article/tutorial from the folks at Google, https://cloud.google.com/solutions/gaming/running-dedicated-game-servers-in-kubernetes-engine, which partially shaped the implementation.
    I think the implementation we have now, though, is super hands-off and hides the complexity really quite nicely. A normal user of the system (a dev) doesn't have to think about how this works under the hood: they literally need to add one event listener to their game plugin to control shutdown (literally an OK / not OK response) and then write a pretty straightforward scaling task. The task effectively needs to return one of three actions: stop a server, deploy a new one, or do nothing at all.
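    To give a feel for the shape of it (these names are illustrative, not our actual interfaces), a scaling task boils down to something like this:

    // Illustrative only: the real interfaces live in our operator tooling and
    // look a bit different, but a scaling task boils down to this shape.
    enum ScalingAction { STOP_A_SERVER, DEPLOY_A_SERVER, DO_NOTHING }

    interface ScalingTask {
        ScalingAction decide(int runningServers, int minReplicas, int totalPlayers);
    }

    // Example policy: keep at least minReplicas, add capacity when instances are
    // crowded, and shrink once the population no longer justifies the footprint.
    class PopulationScaler implements ScalingTask {
        private static final int PLAYERS_PER_SERVER = 60; // made-up threshold

        @Override
        public ScalingAction decide(int runningServers, int minReplicas, int totalPlayers) {
            int desired = Math.max(minReplicas,
                    (int) Math.ceil(totalPlayers / (double) PLAYERS_PER_SERVER));
            if (desired > runningServers) return ScalingAction.DEPLOY_A_SERVER;
            if (desired < runningServers) return ScalingAction.STOP_A_SERVER;
            return ScalingAction.DO_NOTHING;
        }
    }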
    All that being said, who knows what we will do in the future; if we add a matchmade system (looking at you megacade), who knows what solution we would pick for that. This whole change really just sets us up to be able to move faster going forward!
    I'm pretty excited to build on this further
  14. Oraceon Obsidian

    XP:
    18,535xp
    That's cool. Now please let me see more than 4 chunks at a time on MineZ like it used to be. Thank you.
  15. RisingThumb Regular Member

    XP:
    96,839xp
    Out of curiosity, why were Python Scripts used over Shell Scripts in the old deployment?
  16. lazertester Lead Developer

    XP:
    365,160xp
    Good question: Python is more fully featured than shell scripts and makes it easier to model things like our game mode abstraction, which is pictured above.
    Mistri likes this.
