Building a world that never sleeps (part 2)

We left off in part 1 talking about how chatty multiplayer dedicated servers can be, and how our “hive” system powers and scales out the realms without being nearly as chatty. But we never got to the no-sleep part, did we?

When you log into Parchment, a worker bee spawns just for you if you don’t already have one. This “actor,” as we will call it from here on out, persists after you sign out and helps answer questions about your civilization/empire/base while you’re not around.
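A minimal sketch of the spawn-once, keep-after-logout behavior described above. `ActorRegistry`, `PlayerActor`, and the `on_login`/`on_logout` hooks are hypothetical names, not Parchment’s actual code, and a real registry would live across many machines rather than in a single in-memory dict.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlayerActor:
    """Hypothetical per-player actor (the "worker bee")."""
    player_id: str
    online: bool = False


class ActorRegistry:
    """Hypothetical registry: spawns an actor on first login, keeps it after logout."""

    def __init__(self) -> None:
        self._actors: dict[str, PlayerActor] = {}

    def on_login(self, player_id: str) -> PlayerActor:
        # Spawn only if this player has never had an actor before.
        actor = self._actors.get(player_id)
        if actor is None:
            actor = PlayerActor(player_id)
            self._actors[player_id] = actor
        actor.online = True
        return actor

    def on_logout(self, player_id: str) -> None:
        # The actor is NOT removed; it keeps working while the player is away.
        self._actors[player_id].online = False

    def get(self, player_id: str) -> PlayerActor | None:
        return self._actors.get(player_id)
```

Logging back in returns the same actor that kept running in the meantime, rather than a fresh one.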

Persistent worlds

Because your actor hangs out after you’re gone, your civilization can keep growing, defending itself, and making automatic choices. The actor sends its authoritative choices to the queen bee, and the world keeps moving while you sleep. If you have resources you’re gathering, they keep being gathered. If you have technologies you’re researching, the research keeps progressing. Your actor keeps time and executes these actions because it has its own “update loop.”

I’m calling this internal update loop the economy tick. It runs at between 0.25 Hz and 2 Hz (once every four seconds to twice a second), depending on need and priority. The tick purposely competes with user input and never interleaves with it. Each tick of a player’s economy loop is treated like a player input, which means everything is processed in order, as it should be.
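One way to get “competes, never interleaves” is to put economy ticks and player inputs into the same first-in, first-out mailbox and handle one message at a time. The sketch below assumes that design; `MailboxActor`, `Message`, and the one-gold-per-tick income are hypothetical, and the timer that would enqueue ticks is reduced to a helper that clamps the rate to the 0.25 to 2 Hz range.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    kind: str          # hypothetical kinds: "tick" or "input"
    payload: str = ""


class MailboxActor:
    """Hypothetical per-player actor whose ticks and inputs share one FIFO mailbox."""

    def __init__(self) -> None:
        self.mailbox: deque[Message] = deque()
        self.gold = 0
        self.log: list[str] = []

    def send(self, msg: Message) -> None:
        # Ticks from the timer and inputs from the client land in the same queue.
        self.mailbox.append(msg)

    def drain(self) -> None:
        # Handle exactly one message at a time, in arrival order, so a tick can
        # never run halfway through an input (or vice versa).
        while self.mailbox:
            msg = self.mailbox.popleft()
            if msg.kind == "tick":
                self.gold += 1  # hypothetical income per economy tick
                self.log.append(f"tick gold={self.gold}")
            else:
                self.log.append(f"input {msg.payload} gold={self.gold}")


def tick_interval_seconds(hz: float) -> float:
    """Seconds between ticks, clamped to the 0.25-2 Hz range from the post."""
    return 1.0 / min(max(hz, 0.25), 2.0)
```

Because the tick is just another message, an input that arrives between two ticks always sees the state left by the first tick and never a half-applied one.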

Okay, yes, they sometimes nap

On occasion, a realm may find that there is very little to do. This can happen when a player hasn’t given much input recently, hasn’t signed in for some time, and/or hasn’t had their civilization interacted with by others in a while. To save on server capacity, we “park” those inactive actors/bees until they are needed again. When they reactivate, we fast-forward them to the present. This includes marking any resources as harvested (and adding them to the player’s currency), as well as marking any technology research that finished in the meantime as complete. Ultimately all of this is planned to be just math, so catching up should be cheap and fairly easy. If needed, though, we can always burst some compute for the catch-up.
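Since the catch-up is “just math,” a parked actor can be fast-forwarded in closed form instead of replaying every missed tick. This sketch assumes a constant gather rate while parked and a research queue that progresses one item at a time; `ParkedState`, `Research`, and `fast_forward` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Research:
    name: str
    seconds_remaining: float


@dataclass
class ParkedState:
    """Hypothetical snapshot taken when an actor is parked."""
    parked_at: float                 # seconds, e.g. Unix time
    currency: float
    gather_rate_per_sec: float       # hypothetical: total harvest rate
    research_queue: list[Research] = field(default_factory=list)
    completed: list[str] = field(default_factory=list)


def fast_forward(state: ParkedState, now: float) -> ParkedState:
    """Catch a parked actor up to `now` with arithmetic instead of replaying ticks."""
    elapsed = max(0.0, now - state.parked_at)

    # Resources: rate * time, credited to the player's currency in one step.
    currency = state.currency + state.gather_rate_per_sec * elapsed

    # Research: assume a sequential queue and spend elapsed time item by item.
    budget = elapsed
    remaining: list[Research] = []
    completed = list(state.completed)
    for item in state.research_queue:
        if budget >= item.seconds_remaining:
            budget -= item.seconds_remaining
            completed.append(item.name)
        else:
            remaining.append(Research(item.name, item.seconds_remaining - budget))
            budget = 0.0
    return ParkedState(now, currency, state.gather_rate_per_sec, remaining, completed)
```

The cost depends on the size of the research queue, not on how long the actor was parked, which is what keeps reactivation cheap.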

Resiliency and beta-realms

When a realm begins, it locks in several things:

The randomization seeds for procedural generation
The technology trees
The realm start and end time (realms are designed to last for a “season”)

This means each realm can effectively be its own flavorful, unique experience. There might be a realm that’s more prone to oceans, one that limits technology, or even one that introduces a new technology as a “beta realm.” Giving players the choice to further flavor their experience each time they play is something I’ve baked into the server tech.
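A minimal sketch of a realm snapshot locked at creation, assuming a frozen record plus a seeded random generator. Because map generation draws only from the realm’s own seed, every server running the same code produces the same world for that realm. `RealmConfig`, `ocean_ratio`, and the 90-day season default are hypothetical.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RealmConfig:
    """Hypothetical snapshot locked in when a realm is created."""
    realm_id: str
    world_seed: int
    tech_tree: tuple[str, ...]
    starts_at: datetime
    ends_at: datetime
    ocean_ratio: float = 0.3   # hypothetical "flavor" knob, e.g. an ocean-heavy realm


def generate_tiles(config: RealmConfig, width: int, height: int) -> list[str]:
    # A private RNG seeded from the realm: same seed, same map, on any server.
    rng = random.Random(config.world_seed)
    return [
        "ocean" if rng.random() < config.ocean_ratio else "land"
        for _ in range(width * height)
    ]


def new_realm(realm_id: str, seed: int, tech_tree: list[str],
              season_days: int = 90) -> RealmConfig:
    # Start and end times are fixed at creation; the realm lasts one "season".
    start = datetime.now(timezone.utc)
    return RealmConfig(realm_id, seed, tuple(tech_tree), start,
                       start + timedelta(days=season_days))
```

Freezing the record means later code changes can read a realm’s settings but can’t silently rewrite them.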

Because these configurations are baked into each realm at creation time, while every realm runs on the same code framework, I can continuously upgrade the game, both on the server and in game.exe on players’ machines, without needing everyone to be on the same version. The system is inherently backwards compatible, by design and from the start. Realms themselves can act as feature flags. This resiliency lets players enjoy their “playthrough” without fear of design patches changing it underneath them.
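Treating a realm as a feature flag can be as simple as having one shared code path consult the realm’s snapshotted settings. In this sketch, `RealmFeatures`, `ruleset_version`, and the tech names are hypothetical; the point is that an older realm keeps its original behavior while a newer or beta realm opts into more.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class RealmFeatures:
    """Hypothetical per-realm flags snapshotted at realm creation."""
    ruleset_version: int
    enabled: frozenset[str] = field(default_factory=frozenset)


def available_techs(features: RealmFeatures) -> list[str]:
    # One code path serves every realm; the realm's snapshot decides behavior.
    techs = ["bronze_working", "irrigation"]
    if features.ruleset_version >= 2:
        techs.append("masonry")        # hypothetical: added in ruleset v2
    if "beta_airships" in features.enabled:
        techs.append("airships")       # hypothetical beta-realm technology
    return techs
```

Shipping the v2 code doesn’t change a v1 realm, because the v1 realm still carries `ruleset_version=1` until the season ends.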

Bug patches can still be applied to realms if needed, however. The configuration data a realm has snapshotted can always be altered if something in it is causing a bug. Beyond this abstract resiliency, there are concrete safeguards as well. Every so often, your actor and the realm’s actor (the queen bee) take a snapshot and back everything up in case we need to restore to a specific point in time. These backups are kept in a rolling window, so the maximum storage a single realm can use over time is bounded and predictable.
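A rolling window of snapshots can be sketched with a bounded deque: once it is full, each new snapshot evicts the oldest, so storage per realm tops out at roughly `max_snapshots` times the snapshot size. `RollingBackups` and its methods are hypothetical, and a real system would write snapshots to durable storage rather than keep them in memory.

```python
from __future__ import annotations

import copy
from collections import deque
from dataclasses import dataclass
from typing import Any


@dataclass
class Snapshot:
    taken_at: float
    state: dict[str, Any]


class RollingBackups:
    """Hypothetical rolling window: keeps at most `max_snapshots`, oldest evicted first."""

    def __init__(self, max_snapshots: int) -> None:
        self._snapshots: deque[Snapshot] = deque(maxlen=max_snapshots)

    def take(self, taken_at: float, state: dict[str, Any]) -> None:
        # Deep-copy so later changes to live state don't alter the backup.
        self._snapshots.append(Snapshot(taken_at, copy.deepcopy(state)))

    def restore(self, point_in_time: float) -> dict[str, Any] | None:
        # Return the latest snapshot taken at or before the requested time.
        for snap in reversed(self._snapshots):
            if snap.taken_at <= point_in_time:
                return copy.deepcopy(snap.state)
        return None  # requested time is older than the window

    def __len__(self) -> int:
        return len(self._snapshots)
```

The trade-off of a fixed window is that a point in time older than the oldest kept snapshot can no longer be restored.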

Scaling in three dimensions

In the server-side industry, there are typically three ways to handle high volume:

Scaling Up - Giving servers more power, more muscle. More memory, more vCPUs, etc.
Scaling Out - Adding more servers. The box we carry weighs less the more of us that carry it.
Writing Performant Software - How your code performs at unfathomable scale, how it protects itself, and how well it accounts for what you’ve done for option 2 (multiple regions, shards, and so on).

Options 1 and 2 can be boiled down to a single option if you’d like: “Buying away the problem.” Throw more money at it and it goes away: more servers, better servers, or both. But unless you’ve coded for multiple servers carrying the same load, option 2 can be a nightmare. That’s why most dedicated-host game servers are stateful and self-contained. They live on a single VM or piece of hardware, and they service one simulation at a time. If you have more players and more demand, you simply add more dedicated servers that can each host a 5v5 match. MMOs don’t have that exact luxury and must be more intentional, executing option 2 by executing option 3 at the same time. They create shards and instances of responsibility to split up the work. You might be connected to a different game server if you’re on a different continent in the game.

Parchment’s server design takes that even further. We split up the work down to each individual player and spread it across the entire fleet of servers. Our “dedicated server” is not a single executable we throw onto a host VM that dies when a simulation ends. It is a set of logical, virtual nodes that exist across the entire fleet of cloud machines. Your realm might be ticking away in California while your worker bee/actor is ticking away in Texas and your rival is ticking away in Virginia. Cue the movie-esque globe network topology, nodes and all.

Imagine if those nodes weren’t dedicated servers… but instead… that entire picture is the dedicated server.
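The post doesn’t say how player and realm actors get assigned to machines. One common technique, swapped in here purely for illustration, is rendezvous (highest-random-weight) hashing: every node can independently compute which machine owns a given actor, and removing a machine only moves the actors that lived on it. The node names and key format below are hypothetical.

```python
from __future__ import annotations

import hashlib


def _score(node: str, key: str) -> int:
    # Stable hash (unlike Python's built-in hash(), which is salted per process).
    digest = hashlib.sha256(f"{node}|{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner_node(key: str, nodes: list[str]) -> str:
    """Rendezvous hashing: the node with the highest score for this key owns it."""
    if not nodes:
        raise ValueError("no nodes available")
    return max(nodes, key=lambda node: _score(node, key))
```

For example, `owner_node("realm:r1", fleet)` and `owner_node("player:alice", fleet)` can land on different machines, which is the California/Texas split described above.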

Because the code is also written with high performance in mind (reduced backpressure, bulkhead partitioning, virtual circuit breakers, enforced timeouts, exponential backoff retry policies, negative caching, etc.), we can safely scale our fleet of servers to match how busy it is, and everything else expands automatically. With less network I/O, we can host more realms on one “box,” or machine, and therefore more players on that machine.
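Two of the techniques named above, a circuit breaker and retries with exponential backoff, fit in a short sketch. The thresholds, cooldown, and delays are hypothetical, the breaker is single-threaded, and the random “full jitter” on each delay is an addition the post doesn’t mention. Timeouts, bulkheads, and negative caching aren’t shown.

```python
from __future__ import annotations

import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused without trying."""


class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures; allow a trial call after `cooldown` s."""

    def __init__(self, threshold: int = 5, cooldown: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; failing fast")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def retry_with_backoff(fn: Callable[[], T], attempts: int = 4, base: float = 0.1,
                       cap: float = 2.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry `fn`, sleeping a random delay up to base * 2**attempt (capped) between tries."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts - 1):
        try:
            return fn()
        except CircuitOpenError:
            raise  # don't keep hammering a dependency the breaker has given up on
        except Exception:
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return fn()  # final attempt: let any exception propagate to the caller
```

Composed as `retry_with_backoff(lambda: breaker.call(fetch))`, where `fetch` is a hypothetical remote call, transient errors get a few spaced-out retries, and once the breaker opens the retry loop stops immediately instead of piling more load onto a struggling node.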

okbye

There are a lot of things I didn’t cover, including the gotchas and pitfalls of this approach, but for now that concludes the network/server design discussion. Perhaps all of it could have been summarized as: “My dedicated server is a hipster,” or something. Oh well. ¯\_(ツ)_/¯