Multiplayer Game Programming: Architecting Networked Games (2016)

Chapter 13. Cloud Hosting Dedicated Servers

The changing cloudscape means even small studios can afford to host their own dedicated servers. No longer must the fate of a game rely on players with fast net connections hosting fairly administered servers. This chapter explores the pros, cons, and methods necessary to get your game’s servers running in the cloud.

To Host or Not To Host

In the early days of online gaming, hosting your own dedicated servers required the Herculean task of acquiring and maintaining large amounts of computer hardware, networking infrastructure, and IT staff. Any hardware ramp-up was a gamble at that. If you overestimated the number of players at launch, you’d end up with racks and racks of machines lying fallow. Worse, if you underestimated, your paying players would be unable to connect due to processing and bandwidth constraints. While you struggled to obtain last-minute equipment, your players would give up, write bad reviews, and warn their friends not to play your game.

Those days of terror are over. Thanks to the abundance of on-demand processing power available from giant cloud host providers like Amazon, Microsoft, and Google, gaming companies are able to spin up and down servers on a whim. Third-party services like Heroku and MongoLab make deployment even easier by providing server and database management services as needed.

With the huge barrier to entry gone, the proposition of hosting dedicated servers is one that every developer should consider, no matter how small the studio. Despite the lack of upfront server cost, there are still some potential drawbacks to consider:

Complexity. Running a dedicated fleet of servers is more complex than allowing players to host their own. Even though cloud hosts provide the infrastructure and some of the management software, you still need to write custom process and virtual machine management code, as described later in this chapter. You also have to interface with one or more cloud host providers, which means adapting to their changing APIs.

Cost. Even though the cloud decreases upfront and long-term cost significantly, it’s still not free. Increased player interest may cover the increased cost, but that’s not always the case.

Reliance on a third party. Hosting your game on Amazon or Microsoft’s servers means the entire fate of your game rests on Amazon or Microsoft’s shoulders. Although hosting companies offer service-level agreements that guarantee minimum uptime, these do little to console paying players when every server suddenly goes down at once.

Unexpected hardware changes. Hosting providers usually guarantee to provide hardware that meets certain minimum specifications. This does not prevent them from changing hardware without warning, as long as it is above the minimum specification. If they suddenly introduce a bizarre hardware configuration which you have not tested, it may cause issues.

Loss of player ownership. In the early days of multiplayer gaming, administering your own game server was a matter of pride. It was a way for players to be an important part of the game community, and it created alpha players that spread the gospel of whatever game they were hosting. Even today the culture still lives on in the myriad custom Minecraft servers hosted across the land. The intangible benefits of player ownership are lost when the responsibility of running servers moves to the cloud.

Although these downsides can be significant, the benefits often outweigh them:

Reliable, scalable, high-bandwidth servers. Upstream bandwidth comes at a premium, and there’s no guarantee that the right players will be hosting the right servers when your other players want to play. With cloud hosting and a good server management program, you can spin up whatever server is necessary, wherever and whenever you need it.

Cheat prevention. If you run all the servers, you can make sure they’re running unmodified, legitimate versions of the game. This means all players get a uniform experience not subject to the whims of player administrators. This enables not only reliable rankings and leaderboards, but also persistent player progress based on gameplay, as found in Call of Duty, for example.

Reasonable copy protection. Players have a lot of hate for intrusive copy protection and digital rights management (DRM). However, DRM can be a necessity for some types of games, especially those that rely on microtransactions for revenue, like League of Legends. Restricting your game to run on company-hosted, dedicated servers provides a de facto, nonintrusive form of DRM. You never have to release server executables to players, which makes it much harder for them to run cracked servers that illegally unlock content. It also allows you to check login credentials for every player, ensuring that they really should be playing your game.

As a multiplayer engineer, the choice of whether to host dedicated servers may be above your pay grade. However, given the value of full-stack engineers in the workforce, it is important to understand all the implications of the decision so you can weigh in with an informed opinion based on the specifics of the game your team is making.

Tools of the Trade

When working in a new environment, it is most efficient to work with tools tailored for that environment. Backend server development is a rapidly evolving field, with a rapidly evolving set of tools. There are many languages, platforms, and protocols designed to make life easier for the backend developer. At the time of this writing, there is a definite trend for services to use REST APIs, JSON data, and Node.JS. These are flexible and widely accepted tools for server development, and the examples in this chapter make use of them. You can choose different tools for your cloud server hosting development and the basic concepts will remain the same.

REST

REST stands for representational state transfer. A REST interface is one that supports the idea that all requests to a server should be self-contained and not rely on previous or future requests for interpretation. HTTP, the protocol that drives the web, is a perfect example of this, and thus typical REST APIs are built heavily around the use of HTTP requests to store, fetch, and modify server-side data. Requests are sent using the common HTTP methods GET and POST, and also the less common PUT, DELETE, and PATCH. Although various authors have proposed standards on exactly how these HTTP requests need to be structured to qualify as a REST interface, many engineers end up creating interfaces that are REST-flavored to best suit the needs of their users, but do not adhere strictly to any set of REST requirements. Generally, REST interfaces should use the HTTP methods in a fairly consistent manner: GET requests fetch data, POST requests create new pieces of data, PUT requests store data in a specific place, DELETE requests remove data, and PATCH requests edit data directly.
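
For example, a hypothetical match resource might map the methods like this (the /api/matches path and the resource itself are invented for illustration):

GET /api/matches/12 (fetch the current state of match 12)
POST /api/matches (create a new match)
PUT /api/matches/12 (store a complete match record as match 12)
PATCH /api/matches/12 (edit some fields of match 12)
DELETE /api/matches/12 (remove match 12)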

One major advantage of REST interfaces is that they are mostly plain text. Thus, they are human readable, discoverable, and debuggable. In addition, they employ HTTP, which itself uses TCP for transport, and thus they are reliable. The self-contained nature of REST requests makes them even easier to debug, cementing REST as the API style of choice for the backbone of today’s cloud services. More details on REST-style interfaces and proposed REST standards can be found in the resources listed in this chapter’s “Additional Readings” section.

JSON

In the late 1990s and early 2000s, XML was heralded as the universal data exchange format that would change the world. It started to change the world, but it had way too many angle brackets, equal signs, and closing element tags to last forever. These days JSON is the new darling for universal data exchange. Standing for JavaScript Object Notation, JSON is actually a subset of the JavaScript language. An object serialized to JSON is exactly the JavaScript that would be needed to recreate that object. It is text based, maintaining all the human readability of XML, but with fewer formatting and tag closing requirements. This makes it even more pleasant to read and debug. Additionally, because it is valid JavaScript, you can paste it directly into a JavaScript program to debug it.
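
For instance, a hypothetical match description might be serialized to JSON like this (the field names are invented for illustration):

{
    "level": "Summoner's Rift",
    "maxPlayers": 10,
    "ranked": true,
    "teams": ["blue", "red"]
}

Pasted into a JavaScript program, this text evaluates to an object with exactly these properties.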

JSON works well as a data format for REST queries. By specifying a Content-Type of application/json in the HTTP header, you can pass data to a POST, PATCH, or PUT request in JSON format, or return data from a GET request. It supports all the basic JavaScript datatypes, such as booleans, strings, numbers, arrays, and objects.

Node.JS

Built on Google’s V8 JavaScript engine, Node.js is an open-source runtime for building backend services in JavaScript. The idea behind the language choice was that it would facilitate development of AJAX-style websites that also used JavaScript on the frontend. By using the same language on both client and server, developers can write functions and easily switch or share them between layers as necessary. The idea caught on and a very rich community has grown up around Node. Part of its success is due to the vast number of open-source packages available for Node, easily installable through the Node package manager (npm). Almost all popular services with REST APIs have Node package wrappers, making it trivial to interface with the vast array of cloud service providers.

Node itself provides a single-threaded, event-driven JavaScript environment. An event loop runs on the main thread, much like in a video game, dispatching event handlers for any incoming events. These event handlers can in turn make long-running requests to the file system, or to external services like databases or REST servers, that execute as asynchronous jobs on non-JavaScript threads. While the jobs execute, the main thread returns to processing incoming events. When an asynchronous job completes, it sends an event to the main thread, so the event loop can invoke the appropriate JavaScript callback. In this way, Node provides an environment that prevents the pain of race conditions while still allowing for non-blocking asynchronous behavior. As such it is a prime candidate for building services to handle incoming REST requests.
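
The following minimal sketch shows this model in action; the file name is invented for illustration:

var fs = require('fs');

//request an asynchronous read; the file system job runs off the main
//thread, and the callback runs on the main thread when the job completes
fs.readFile('leaderboard.json', function(err, data)
{
    if(err) { console.log('read failed: ' + err); }
    else { console.log('read ' + data.length + ' bytes'); }
});

//meanwhile, the event loop is free to process other incoming events
console.log('read requested, not yet complete');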

Node ships with a simple built-in HTTP server, but the task of decoding incoming HTTP requests, headers and parameters, and routing them to the appropriate JavaScript functions is usually handled by one of several open-source Node packages dedicated to the purpose. Express JS is one such very popular package and the one used by the examples in this chapter. More information on Express JS and Node JS can be found in the resources listed in the “Additional Readings” section.
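
As a minimal sketch of how the listings in this chapter fit together, the following wires an Express router into an application. Mounting the router at /api is what gives routes like /processes/ their full /api/processes/ path; the /status route is invented for illustration:

var express = require('express');
var bodyParser = require('body-parser');
var app = express();
var router = express.Router();

//parse JSON request bodies into req.body for the route handlers
app.use(bodyParser.json());

//a trivial route; this chapter's process and heartbeat routes attach to
//this same router object
router.get('/status', function(req, res)
{
    res.send({msg: 'OK'});
});

app.use('/api', router);
app.listen(3000);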

Overview and Terminology

From the player’s perspective, the cloud server spin-up process should be transparent. When a player wants to join a game, the player’s client requests info on a match from the matchmaking service endpoint. The endpoint looks for an available match, and if it can’t find one, it should somehow trigger a new server to spin up. It then returns the IP address and port of the new server instance to the client. The client connects there automatically and the player joins the game.

Note that it can be tempting to combine the processes of matchmaking and dedicated server deployment into one giant blob of functionality. It saves on some redundant code and data, and can even aid performance a little. However, it is more useful to keep them separate for the simple fact that you may want to plug one or more third-party matchmaking solutions into your dedicated server system. Just because your studio hosts its own dedicated servers does not mean it can’t take advantage of third-party matchmaking solutions like Steam, Xbox Live, or PlayStation Network. In fact, depending on the platform for which you’re developing, it may be required. For this reason, it is sensible to keep the server deployment module clearly isolated from your matchmaking module.

When your deployment system finishes spinning up a new server, that server should simply register itself with the matchmaking system just as a player-hosted game server would. After that, the matchmaking system can take over matching players to server instances and your cloud deployment system can focus on what it does best—spinning up and down game instances as necessary.
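
The registration details depend entirely on the matchmaking service in use. As a sketch in this chapter’s Node style, assuming an invented matchmaker endpoint and payload, a freshly spawned server might announce itself like this (a C++ game server process would make the equivalent HTTP request):

var request = require('request');

function registerWithMatchmaker(inServerURL, inMaxPlayers)
{
    var options =
    {
        //hypothetical endpoint; a real service defines its own API
        url: 'https://matchmaker.example.com/api/servers',
        method: 'POST',
        json: {url: inServerURL, maxPlayers: inMaxPlayers}
    };
    request(options, function(error, response, body)
    {
        if(error) { console.log('registration failed: ' + error); }
    });
}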

Server Game Instance

Before going on, it is worthwhile to disambiguate some of the overloaded meanings of the word “server” when used in various contexts. Sometimes “server” refers to an instance of the class in code that simulates the one true version of the game world and replicates it to clients. Other times, it refers to the process listening for incoming connections, hosting that class instance. Still other times, it refers to the physical piece of hardware running that process, as in “check out all the servers I can fit on this rack.”

To avoid confusion, this chapter uses the term server game instance or just game instance to represent the entity that simulates the game world and replicates information to clients. The concept is an abstraction that represents a single reality shared by a group of players playing together. If your game supports 16-player battles, then a server game instance is a running 16-player battle. In League of Legends it is typically a 5 versus 5 game in the “Summoner’s Rift” level. In matchmaking terms, it is a single match.

Game Server Process

A game instance does not exist in a void. It lives inside a game server process, which updates it, manages its clients, interacts with the operating system, and does everything else a process typically does. It is the embodiment of your game, as far as the operating system is concerned. In all previous chapters, the concepts of game server process and game instance were not separated because there was a one-to-one mapping between them. Each game server process was responsible for maintaining only one game instance. However, in the world of dedicated server hosting, that can change.

In properly abstracted code, a single process can manage multiple game instances. As long as the process updates each instance, binds a unique port for each instance, and does not share mutable data between the instances, multiple game worlds can coexist peacefully in the same process.
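
A minimal sketch of that structure follows, with GameInstance standing in for whatever class wraps a single simulated world; the class, ports, and tick rate are invented for illustration:

//placeholder for a real game instance, which would bind its own unique
//port and load its world on construction
function GameInstance(inPort, inParams)
{
    this.port = inPort;
    this.params = inParams;
}
GameInstance.prototype.update = function()
{
    //step the simulation and replicate state to this instance's clients
};

var gInstances = [];
gInstances.push(new GameInstance(4000, {maxPlayers: 16}));
gInstances.push(new GameInstance(4001, {maxPlayers: 16}));

//tick every instance in a single loop
setInterval(function()
{
    gInstances.forEach(function(instance) { instance.update(); });
}, 33);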

Multiple instances per process can be an efficient way to host multiple games, because it allows sharing of large immutable resources like collision geometry, navigation meshes, and animation data. When multiple game instances run in their own processes, they each need copies of this data, which can cause unnecessary memory pressure. Games employing multiple instances per process also benefit from finer control of scheduling: by iterating through each instance each update, they can ensure a roughly regular update pattern across instances. With multiple processes on the same host, this is not necessarily the case, as the operating system scheduler decides which process is updated when. This is not always a problem, but finer-grained control can be useful at times.

The significant advantages of the multi-instance approach may seem compelling, but the disadvantages of the tactic are just as significant. If a single instance crashes it can bring down the entire process, with all of its contained game instances. This can be particularly nasty if an individual instance corrupts a shared, supposedly immutable resource. Alternatively, when each game instance runs in a dedicated process, a corrupted or crashing game instance can only bring down itself. In addition, single game instance processes are easier to maintain and test. Engineers developing server code commonly only need a single game instance at a time to test and debug code. If the process supports multiple instances and engineers aren’t running them, it leaves a large code path without regular development coverage. A good QA team with a solid test plan can partially compensate for this, but there is no substitute for engineers having full coverage of production code paths during development. For these reasons, it is most common for game server processes to contain a single game instance.

Game Server Machine

Just as a game instance needs to live in a game server process, a game server process needs to live on a game server machine, and just as a single process can host multiple instances, a single machine can host multiple processes. The choice of how many processes to run per machine should depend on the performance requirements of your specific game. For maximum performance, you can run a single process per machine. This ensures the machine’s full resources, including CPU, GPU, and RAM, are dedicated to your game process. However, it can be quite wasteful. Each machine needs an operating system and a typical OS is an immense consumer of resources.

Running a unique OS just for a single game process, especially one that contains only a single game instance, can be too expensive an endeavor. Luckily, operating systems are designed to support multiple processes, with features like protected memory to keep processes from interfering with each other. On a modern operating system, it is extremely unlikely that a crashing process can bring down another process on the same game server machine. Therefore, to be cost-efficient, it is typical to run multiple game server processes per server machine—often as many as the performance requirements will allow. Tweaking and tuning server performance and RAM use can pay off many times over if it allows more game processes to be hosted on the same server machine.

Hardware

In the cloud, a game server machine does not necessarily equate to a physical piece of hardware. Instead, virtual machines (VMs), instantiated from machine images, are spun up and down at will, sometimes residing alone on a physical machine, other times sharing resources with multiple other virtual machines on a physical machine of 16 cores or more. Depending on your cloud hosting provider, and your budget, you may not get to choose how your virtual machines are hosted. At lower price points, they must often share hardware, and are put to sleep when not used for a set amount of time. This can result in erratic performance. At higher price points, you can often specify the exact physical hardware configurations you desire.


Why Virtual Machines?

It may seem odd to have to pack your operating system of choice and game process into a virtual machine just to get hosted in the cloud. However, virtual machines provide an excellent way for cloud service providers to distribute the use of their hardware across their customer base. At Amazon, a single 16-core computer might be running four Call of Duty VMs, each requiring 4 cores. As demand for Call of Duty wanes at a certain time of day, Amazon might spin down two of those VMs, leaving an underutilized piece of hardware. When a request comes in from EA to spin up an 8-core SimCity machine, it can run that VM on the same hardware running the two Call of Duty VMs and make the most of its resources.

Virtual machines are also useful when dealing with hardware failure. Because virtual machine images contain the OS and application all as a single package, providers can recover from hardware failure very rapidly by just moving virtual machines from one physical piece of hardware to another.


Local Server Process Manager

A cloud server provisioning system needs a way to start up and monitor game server processes on game server machines. Server machines cannot simply launch the maximum number of game server processes at boot with the expectation that they will run for the uptime of the machine. One process could crash at any time, at which point the virtual machine would be underutilizing its resources. Also, even the most carefully engineered games can end up shipping with memory leaks. Sometimes ship dates are immovable and it is necessary to deploy servers that leak a few megabytes here or there. To keep small memory leaks from accumulating, and also to avoid the problem of resetting game state improperly, it is a good practice to shut down and restart server processes at the end of each match when possible.

If server processes can terminate, the virtual machine needs a way to start them back up. It also needs a way to configure them based on what kind of game players want to start. For all these reasons, a robust provisioning system needs a mechanism through which it can ask a given server machine to start up a server process configured in a specific way. To build such a system, you could hunt and peck around in the details of your operating system to see if there is a built-in way to remotely start and monitor processes. A more cross-platform and less fragile approach, however, is to build a local server process manager (LSPM).

The LSPM is itself a process that assumes the responsibility of listening for remote commands, spawning server processes as requested, and monitoring those processes to determine which processes the given machine is currently running. Listing 13.1 demonstrates initialization, launch, and kill routes for a simple Node.js/Express application to manage local server processes.

Listing 13.1 Initialization, Launch, and Kill


//assumed requires and helpers, not shown in the original listing
var express = require('express');
var childProcess = require('child_process');
var uuid = require('node-uuid');
var router = express.Router();

var gListenPort = process.env.LISTEN_PORT || 3000; //the examples use 3000
var gProcesses = {};
var gProcessCount = 0;
var gProcessPath = process.env.GAME_SERVER_PROCESS_PATH;
//environment variables are strings, so parse the count as an integer
var gMaxProcessCount = parseInt(process.env.MAX_PROCESS_COUNT, 10);

var eMachineState =
{
    empty: "empty",
    partial: "partial",
    full: "full",
    shuttingDown: "shuttingDown",
};
var gMachineState = eMachineState.empty;
var gSequenceIndex = 0;

//seconds since the epoch, used for heartbeat timestamps
function getUTCSecondsSince1970()
{
    return Math.floor(Date.now() / 1000);
}

router.post('/processes/', function(req, res)
{
    if(gMachineState === eMachineState.full)
    {
        res.send(
        {
            msg: 'Already Full',
            machineState: gMachineState,
            sequenceIndex: ++gSequenceIndex
        });
    }
    else if(gMachineState === eMachineState.shuttingDown)
    {
        res.send(
        {
            msg: 'Already Shutting Down',
            machineState: gMachineState,
            sequenceIndex: ++gSequenceIndex
        });
    }
    else
    {
        var processUUID = uuid.v1();
        var params = req.body.params;
        var child = childProcess.spawn(gProcessPath,
        [
            '--processUUID', processUUID,
            '--lspmURL', "http://127.0.0.1:" + gListenPort,
            '--json', JSON.stringify(params)
        ]);
        gProcesses[processUUID] =
        {
            child: child,
            params: params,
            state: 'starting',
            lastHeartbeat: getUTCSecondsSince1970()
        };
        ++gProcessCount;
        gMachineState = gProcessCount === gMaxProcessCount ?
            eMachineState.full : eMachineState.partial;
        child.stdout.on('data', function(data)
        {
            console.log('stdout: ' + data);
        });
        child.stderr.on('data', function(data)
        {
            console.log('stderr: ' + data);
        });
        child.on('close', function(code, signal)
        {
            console.log('child terminated by signal ' + signal);
            //were you at max process count?
            var oldMachineState = gMachineState;
            --gProcessCount;
            gMachineState = gProcessCount > 0 ?
                eMachineState.partial : eMachineState.empty;
            if(oldMachineState !== gMachineState)
            {
                console.log("Machine state changed to " + gMachineState);
            }
            delete gProcesses[processUUID];
        });
        res.send(
        {
            msg: 'OK',
            processUUID: processUUID,
            machineState: gMachineState,
            sequenceIndex: ++gSequenceIndex
        });
    }
});

router.post('/processes/:processUUID/kill', function(req, res)
{
    var processUUID = req.params.processUUID;
    console.log("attempting to kill process: " + processUUID);
    //named proc to avoid shadowing Node's global process object
    var proc = gProcesses[processUUID];
    if(proc)
    {
        //killing triggers the close event and removes from the process list
        proc.child.kill();
        res.sendStatus(200);
    }
    else
    {
        res.sendStatus(404);
    }
});


The LSPM starts by initializing some global variables. gProcesses holds a map of all the processes currently being managed, while gProcessCount tracks the count. gProcessPath and gMaxProcessCount are read in from environment variables so they can be easily configured on a machine-by-machine basis. gMachineState caches the state of the entire machine, regarding whether it has room for more processes, is full, or is shutting down. The variable holds values from the eMachineState object.

The LSPM supports creation of new processes through a POST request to the /api/processes/ endpoint. Specifically, if the LSPM is running locally and listening on port 3000, you can use the curl web request program to launch a new process configured to host four players with the command line:

curl -H "Content-Type: application/json" -X POST -d '{"params":{"maxPlayers":4}}' http://127.0.0.1:3000/api/processes

When the LSPM receives this request, it first checks that it is neither shutting down nor running the maximum number of processes allowed. If the check passes, it creates a new universally unique identifier for the pending process, and uses the Node.js child_process module to spawn a game server process. Through command line arguments, it passes the process both the unique ID and any configuration parameters posted by the requester.

Next, the LSPM stores a record of the spawned child process in its gProcesses map. The state variable is used to track whether the process is currently starting up, or is known to be running. The lastHeartbeat variable tracks the last time the LSPM heard from this process, and will come into play in the next section.

After recording the existence of the process, the LSPM sets up some event handlers to receive and log any output from the process. It also sets up a very important listener for the "close" event, which removes the process from the gProcesses map and reports on any change in gMachineState.

Finally, the LSPM responds to the request with the unique process ID and information regarding how many processes are currently running. Remember that the Node event model is single threaded, so there is no worry of a race condition changing the gProcessCount or the gProcesses hash map during the execution of the function.
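
For example, a successful launch might produce a response like this (the UUID and index are illustrative):

{
    "msg": "OK",
    "processUUID": "49b74f90-2d97-11e5-8de0-f3f32180aa49",
    "machineState": "partial",
    "sequenceIndex": 7
}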

With a copy of the unique process ID, the requester can then query information about the process by sending a GET request to the /processes/:processUUID endpoint (code not shown) or shut down a process by sending a POST to the /processes/:processUUID/kill endpoint.


Warning

When in production, you want to restrict who can launch and kill servers through your LSPM. One way to accomplish this is by whitelisting all IP addresses that are allowed to send requests directly to the LSPM, and then discarding any incoming requests not from those IP addresses. This will prevent mischievous players from sending process launch commands directly to your LSPM. Alternatively, you can add a security token in the request header and verify its presence before granting any request. Either way, you need to implement some level of security or run the risk of your provisioning system being disrupted.
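
As a sketch of the token approach, a small Express middleware registered before the routes can reject any request that lacks the shared secret; the header name and environment variable are assumptions for illustration:

var gSecurityToken = process.env.LSPM_SECURITY_TOKEN;

//runs before every route on this router, so unauthorized requests never
//reach the launch and kill handlers
router.use(function(req, res, next)
{
    if(gSecurityToken && req.get('x-lspm-token') === gSecurityToken)
    {
        next();
    }
    else
    {
        res.sendStatus(403);
    }
});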


Process Monitoring

Once the LSPM can launch processes, it needs a way to monitor them. It accomplishes this by listening for heartbeats from the processes. These are periodic messages from the processes indicating that they are still alive. If a set amount of time passes without the LSPM hearing from a particular process, the LSPM assumes that the process has halted, hung, slowed down, or broken in some unacceptable fashion, and it terminates the process. Listing 13.2 demonstrates.

Listing 13.2 Process Monitoring


var gMaxStartingHeartbeatAge = 20;
var gMaxRunningHeartbeatAge = 10;
var gHeartbeatCheckPeriod = 5000;

router.post('/processes/:processUUID/heartbeat', function(req, res)
{
    var processUUID = req.params.processUUID;
    console.log("heartbeat received for: " + processUUID);
    var proc = gProcesses[processUUID];
    if(proc)
    {
        proc.lastHeartbeat = getUTCSecondsSince1970();
        proc.state = 'running';
        res.sendStatus(200);
    }
    else
    {
        res.sendStatus(404);
    }
});

function checkHeartbeats()
{
    console.log("Checking for heartbeats...");
    var processesToKill = [], processUUID;
    var proc, heartbeatAge;
    var time = getUTCSecondsSince1970();
    for(processUUID in gProcesses)
    {
        proc = gProcesses[processUUID];
        heartbeatAge = time - proc.lastHeartbeat;
        if(heartbeatAge > gMaxStartingHeartbeatAge ||
            (heartbeatAge > gMaxRunningHeartbeatAge
            && proc.state !== 'starting'))
        {
            console.log("Process " + processUUID + " timeout!");
            processesToKill.push(proc.child);
        }
    }
    processesToKill.forEach(function(toKill)
    {
        toKill.kill();
    });
}

setInterval(checkHeartbeats, gHeartbeatCheckPeriod);


Sending a POST to the /processes/:processUUID/heartbeat endpoint registers a heartbeat for the given process ID. When a heartbeat comes in, the LSPM checks the current timestamp and updates the last received heartbeat time of the appropriate process. Once a process sends its first heartbeat, the LSPM changes its state from starting to running to mark that it has proof that the game process has started.

The checkHeartbeats function loops through all processes owned by the LSPM and checks to make sure it has received a recent enough heartbeat from each. If a process is still in the starting state, it may have a slow initialization process to complete, so the function allows it a little extra time to register its first heartbeat. After that, if the latest heartbeat for a process is not within gMaxRunningHeartbeatAge seconds of the current time, it means something terrible happened to the server process. To deal with this, the LSPM attempts to manually kill the child process, in case it is not dead yet. When the process dies, the close event registered earlier removes it from the list of processes. The LSPM calls the checkHeartbeats function every gHeartbeatCheckPeriod ms by means of the setInterval call at the bottom of the script.

To send a heartbeat to the LSPM, each process needs to make a POST request to its LSPM heartbeat endpoint at least once every gMaxRunningHeartbeatAge seconds. To send a REST request from a C++ program, you can build the HTTP request as a string and then send it to the appropriate LSPM’s port using the TCPSocket class described in Chapter 3. For example, if the LSPM, listening on port 3000, launched a process with the --processUUID command line parameter 49b74f902d9711e5-8de0f3f32180aa49, then the process can register heartbeats by sending the following string via TCP to port 3000:

POST /api/processes/49b74f902d9711e5-8de0f3f32180aa49/heartbeat HTTP/1.1\r\nHost: 127.0.0.1:3000\r\n\r\n

Notice the Host header, which HTTP/1.1 requires, and the two end line sequences in a row used to denote the end of the HTTP request. For more on the textual format of HTTP requests, see the “Additional Readings” section. Alternatively, for a more turn-key solution, you can integrate a third-party C++ REST library like Microsoft’s open-source, cross-platform C++ REST SDK. Listing 13.3 demonstrates how to send a heartbeat using the C++ REST SDK.

Listing 13.3 Sending a Heartbeat with the C++ REST SDK


#include <cpprest/http_client.h>
using namespace web::http;
using namespace web::http::client;

void sendHeartbeat(const std::string& inURL, const std::string& inProcessUUID)
{
    //to_string_t handles runtime strings; the U() macro is for literals only
    http_client client(utility::conversions::to_string_t(inURL));
    uri_builder builder(utility::conversions::to_string_t(
        "/api/processes/" + inProcessUUID + "/heartbeat"));
    client.request(methods::POST, builder.to_string());
}


To check on the results of the heartbeat, you can append continuation tasks to the task returned by the request invocation. The C++ REST SDK offers a rich library that provides not only asynchronous, task-based HTTP request functionality, but also server functionality, JSON parsing, WebSocket support, and more. For more on the C++ REST SDK and what it can do, refer to the resources listed in the “Additional Readings” section.


Note

REST requests are not the only way to send heartbeats to an LSPM. If you prefer, the LSPM can open a TCP or even UDP port directly in Node, and the server process can send very small heartbeat packets without the overhead of HTTP. Or, the game can just write heartbeat data to its log file and the LSPM can monitor that. However, given that your game will probably end up needing a REST API to talk to one or more other services, that REST data is easy to debug, and that the LSPM is already listening for incoming REST requests, it reduces complexity to just send heartbeats via REST.


Virtual Machine Manager

By facilitating remote startup and monitoring of an arbitrary number of processes on a virtual machine, the LSPM solves a significant portion of the cloud hosting problem. However, it does nothing to actually provision the machines themselves. To do this, you need a virtual machine manager (VMM). The VMM is responsible for tracking all the LSPMs, requesting LSPMs to spawn game processes when necessary, and spinning up and down entire virtual machines, with their associated LSPMs.

To provision a new virtual machine with a cloud provider, the VMM must identify what software to run on the machine. It does this by specifying a virtual machine image (VMI). The VMI represents the contents of the disk drive that the VM should boot. It contains the OS, the process executables, and any initialization scripts to run at boot. Each cloud host provider has a slightly different VMI format they prefer, and usually custom tools for creating images. To prepare for VM provisioning, you must create a VMI with your chosen OS, your compiled game server executable and data, your LSPM, and any necessary assets.


Note

Although each cloud provider has their own VMI format, many may soon be standardizing on the Docker Container format. For more on the Docker standard, see the “Additional Readings” section.


Asking a cloud hosting provider to spin up a VM from a VMI comes down to the details of the provider. Providers typically have a REST API for this purpose, with wrappers in common backend languages like JavaScript and Java. Because you may need to switch cloud host providers, or use multiple ones in multiple regions, it is a good idea to cleanly abstract the details of the provider API from your VMM code.

In addition to simply spinning up VMs when necessary, a VMM must be able to request new processes from the LSPM on each VM. It must also ask the cloud provider to shut down and deprovision any VMs no longer in use. Finally, it must monitor the health of all the VMs it manages to make sure none leak in case of error. Although Node is single threaded, the asynchronous interactions between requester, VMM, and LSPM present ample opportunity for a variety of race conditions. In addition, even though TCP is reliable, each REST request is on its own connection, which means communications can arrive out of order. Listing 13.4 shows the initialization and data structure of the VMM.

Listing 13.4 Initialization and Data Structures


var eMachineState =
{
    empty: "empty",
    partial: "partial",
    full: "full",
    pending: "pending",
    shuttingDown: "shuttingDown",
    recentLaunchUnknown: "recentLaunchUnknown"
};
var gVMs = {};
var gAvailableVMs = {};

function getFirstAvailableVM()
{
    for(var vmuuid in gAvailableVMs)
    {
        return gAvailableVMs[vmuuid];
    }
    return null;
}

function updateVMState(vm, newState)
{
    if(vm.machineState !== newState)
    {
        if(vm.machineState === eMachineState.partial)
        {
            delete gAvailableVMs[vm.uuid];
        }
        vm.machineState = newState;
        if(newState === eMachineState.partial)
        {
            gAvailableVMs[vm.uuid] = vm;
        }
    }
}


The core data of the VMM lives in two hash maps. The gVMs hash map contains all currently active VMs managed by the VMM. The gAvailableVMs map is the subset of VMs which are available for spawning a new process. That is, they are not shutting down, starting up, currently spawning a process, or already at max process count. Each VM object needs the following members:

machineState. Representing the current state of the VM, this holds one of the members of the eMachineState object. These states are a superset of the states the LSPM uses, containing a few more that are only relevant to the VMM.

uuid. This is the VMM-assigned unique identifier for the VM. When spawning the VM, the VMM passes the uuid to the LSPM so that the LSPM can tag any updates it sends the VMM.

url. The url stores the IP address and port of the LSPM on the VM. The IP and possibly the port are assigned by the cloud service provider whenever a VM is provisioned. The VMM must store it so it can communicate with the LSPM on the VM.

lastHeartbeat. Similar to how the LSPM listens for process heartbeats, the VMM listens for LSPM heartbeats. This stores the time the last heartbeat was received.

lastSequenceIndex. Because each REST request can come in on its own TCP connection, it’s possible for them to arrive out of their original order. To make sure the VMM ignores any stale updates from an LSPM, the LSPM tags each piece of communication with an increasing sequence index, and the VMM ignores any incoming data with a sequence index less than the lastSequenceIndex.

cloudProviderId. This stores the VM’s identity as far as the cloud service provider is concerned. The VMM uses this when asking the provider to deprovision the VM.

When it’s time to spawn a new process, the getFirstAvailableVM function finds the first available VM in the gAvailableVMs map and returns it. The updateVMState function is responsible for transitioning VMs into and out of the gAvailableVMs map as their state changes. For consistency, the VMM should only change the state of a VM via the updateVMState function. With the necessary data structures in place, Listing 13.5 shows the REST endpoint handler that actually spawns a process. It provisions a VM first if necessary.

Listing 13.5 Spawning a Process and Provisioning a VM


//assumed requires, not shown in the original listing;
//getUTCSecondsSince1970 is the same helper shown in Listing 13.1
var async = require('async');
var request = require('request');
var uuid = require('node-uuid');

router.post('/processes/', function(req, res)
{
    var params = req.body.params;
    var vm = getFirstAvailableVM();
    async.series(
    [
        function(callback)
        {
            if(!vm) //spin up if necessary
            {
                var vmUUID = uuid.v1();
                askCloudProviderForVM(vmUUID,
                    function(err, cloudProviderResponse)
                    {
                        if(err) { callback(err); }
                        else
                        {
                            vm =
                            {
                                lastSequenceIndex: 0,
                                machineState: eMachineState.pending,
                                uuid: vmUUID,
                                url: cloudProviderResponse.url,
                                cloudProviderId: cloudProviderResponse.id,
                                lastHeartbeat: getUTCSecondsSince1970()
                            };
                            gVMs[vm.uuid] = vm;
                            callback(null);
                        }
                    });
            }
            else
            {
                updateVMState(vm, eMachineState.pending);
                callback(null);
            }
        },
        //vm is valid and in the pending state so no other can touch it
        function(callback)
        {
            var options =
            {
                url: vm.url + "/api/processes/",
                method: 'POST',
                json: {params: params}
            };

            request(options, function(error, response, body)
            {
                if(!error && response.statusCode === 200)
                {
                    if(body.sequenceIndex > vm.lastSequenceIndex)
                    {
                        vm.lastSequenceIndex = body.sequenceIndex;
                        if(body.msg === 'OK')
                        {
                            updateVMState(vm, body.machineState);
                            callback(null);
                        }
                        else
                        {
                            callback(body.msg); //failure - probably full
                        }
                    }
                    else
                    {
                        callback("seq# out of order: can't trust state");
                    }
                }
                else
                {
                    callback("error from lspm: " + error);
                }
            });
        }
    ],
    function(err)
    {
        if(err)
        {
            //if vm is set, make sure it's not stuck in the pending state
            if(vm)
            {
                updateVMState(vm, eMachineState.recentLaunchUnknown);
            }
            res.send({msg: "Error starting server process: " + err});
        }
        else
        {
            res.send({msg: 'OK'});
        }
    });
});



Note

This endpoint handler makes use of the async.series function, which is a utility in the popular async JavaScript library. It takes an array of functions, and a final completion function, as parameters. It calls each of the functions in the array in order, waiting until each calls its respective callback function to proceed. When the series is done, async.series calls the completion function. If any one of the functions in the array passes an error to its callback function, series immediately passes the error to the completion function and aborts the calling of any more functions in the array. async contains many other useful higher-order asynchronous constructs and is one of the most depended-upon packages in the Node community.

The handler also makes use of the request library for making REST requests to the LSPM. request is a full-featured HTTP client library, similar in power and functionality to the curl command line utility. Like async, it is also a top library in the Node community and one worth learning. More information on both the async and request libraries can be found in the “Additional Readings” section.


Posting game parameters to the /processes/ endpoint of the VMM triggers the launch of a game process with those parameters. The handler has two main sections: the VM procurement and then the process spawn. First, the handler checks the gAvailableVMs map to see if there is a VM available to spawn a process. If there is not, it creates a unique ID for a new VM and asks the cloud provider to provision it. The function askCloudProviderForVM is highly dependent on the specific cloud provider used, and so is not listed here. It should call the cloud provider’s API for provisioning a VM, use the image that contains the game and the LSPM, and then start the LSPM, passing the VM identifier as a parameter.
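
As a provider-agnostic sketch, it might look something like the following, where cloudProvider is a hypothetical wrapper module around your host’s actual provisioning API, and the image name and boot command are likewise invented:

var cloudProvider = require('./cloudProvider'); //hypothetical wrapper

function askCloudProviderForVM(inVMUUID, inCallback)
{
    var options =
    {
        //a VMI containing the OS, the game server executable, and the LSPM
        image: 'game-server-vmi',
        //the boot command hands the LSPM the VMM-assigned identifier
        bootCommand: 'node lspm.js --vmUUID ' + inVMUUID
    };
    cloudProvider.provisionVM(options, function(err, instance)
    {
        if(err) { inCallback(err); }
        else
        {
            //return the fields the VMM stores in its vm record
            inCallback(null, {url: instance.url, id: instance.id});
        }
    });
}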

Whether the VM is started up fresh, or already available, the handler sets its state to pending. This makes sure that the VMM will not try to start up another process on it while there is one currently starting up. The single-threaded nature of Node prevents traditional race conditions, but because the endpoint handler uses asynchronous callbacks, it is possible another process-launch request might arrive before the current one is fulfilled. In that case, it is necessary for the request to be handled by a different VM to avoid overlapping state updates. To facilitate this, the change to the pending state removes the VM from the gAvailableVMs map.

With the VM in pending state, the handler sends a REST request to the VM’s LSPM to launch a game process. If the launch succeeds, the handler sets the VM state to the new state returned by the LSPM—it should be either partial or full, depending on how many game processes the VM is currently hosting. If there is a bad or missing response from the LSPM, the VMM cannot know the resultant state of the VM. It is possible that the process did not launch before the error was returned, or that the process did launch and the response was lost somewhere in the network. Even though TCP is reliable, HTTP clients and servers have timeouts. Loose network cables, persistent traffic spikes, or bad Wi-Fi signals can cause communication to time out. In the case of indeterminate error, the handler sets the VM’s state to recentLaunchUnknown. This removes the server from the pending state so that the heartbeat monitoring system, explained later, can either restore the VM to a known state or kill it. It also keeps the VM out of the gAvailableVMs map, because its availability is unknown.

If all goes well, the handler finally responds to the original request with the message “OK,” meaning the new game process on a remote VM has launched.

Virtual Machine Monitoring

Because an LSPM can hang or crash at any time, the VMM needs to monitor each LSPM for heartbeats. To ensure that the VMM’s perception of the LSPM state remains accurate, the LSPM can send state updates with each heartbeat, tagged with an increasing sequenceIndex to help the VMM ignore out-of-order heartbeats. When a heartbeat indicates that an LSPM is running no processes, the VMM initiates a shutdown handshake with the LSPM. The handshake prevents race conditions that might cause the LSPM to launch a process while the VMM is trying to shut it down. Due to both the shutdown handshake and the state included in the heartbeat, the system is somewhat more complicated than the one the LSPM uses to monitor processes. Listing 13.6 demonstrates the VMM heartbeat monitoring system.

Listing 13.6 VMM Heartbeat Monitoring


//assumed VMM constants; the values are illustrative
var gMaxRunningHeartbeatAge = 30;
var gHeartbeatCheckPeriodMS = 5000;

router.post('/vms/:vmUUID/heartbeat', function(req, res)
{
    var vmUUID = req.params.vmUUID;
    var sequenceIndex = req.body.sequenceIndex;
    var newState = req.body.machineState;
    var vm = gVMs[vmUUID];
    if(vm)
    {
        var oldState = vm.machineState;
        res.sendStatus(200); //send status now so lspm can close connection
        if(oldState !== eMachineState.pending &&
            oldState !== eMachineState.shuttingDown &&
            sequenceIndex > vm.lastSequenceIndex)
        {
            vm.lastHeartbeat = getUTCSecondsSince1970();
            vm.lastSequenceIndex = sequenceIndex;
            if(newState === eMachineState.empty)
            {
                var options = {url: vm.url + "/api/shutdown", method: 'POST'};
                request(options, function(error, response, body)
                {
                    if(!error && response.statusCode === 200)
                    {
                        //only parse the body once the request is known good
                        body = JSON.parse(body);
                        updateVMState(vm, body.machineState);
                        //does lspm still think it's okay to shut down?
                        if(body.machineState === eMachineState.shuttingDown)
                        {
                            shutdownVM(vm);
                        }
                    }
                });
            }
            else
            {
                updateVMState(vm, newState);
            }
        }
    }
    else
    {
        res.sendStatus(404);
    }
});

function shutdownVM(vm)
{
    updateVMState(vm, eMachineState.shuttingDown);
    askCloudProviderToKillVM(vm.cloudProviderId, function(err)
    {
        if(err)
        {
            console.log("Error closing vm " + vm.uuid);
            //we'll try again when heartbeat is missed
        }
        else
        {
            delete gVMs[vm.uuid]; //success...delete from everywhere
            delete gAvailableVMs[vm.uuid];
        }
    });
}

function checkHeartbeats()
{
    var vmsToKill = [], vmUUID, vm, heartbeatAge;
    var time = getUTCSecondsSince1970();
    for(vmUUID in gVMs)
    {
        vm = gVMs[vmUUID];
        heartbeatAge = time - vm.lastHeartbeat;
        if(heartbeatAge > gMaxRunningHeartbeatAge &&
            vm.machineState !== eMachineState.pending)
        {
            vmsToKill.push(vm);
        }
    }
    vmsToKill.forEach(shutdownVM);
}
setInterval(checkHeartbeats, gHeartbeatCheckPeriodMS);


The heartbeat endpoint handler ignores heartbeats for VMs that are in the pending or shuttingDown states. Pending VMs change state as soon as their launch request is answered, so any other state change during that time needs to be handled after the launch completes. VMs in the shuttingDown state are shutting down already so do not require monitoring updates. The handler also ignores heartbeats with out-of-order sequence indices. If a heartbeat is worth considering, the handler updates the lastSequenceIndex and lastHeartbeat properties of the VM. Then, if the state is empty, indicating there are no game processes running on the VM, the handler begins the shutdown process by sending a shutdown request to the LSPM. The LSPM’s shutdown handler checks its own gMachineState to make sure that it hasn’t changed since the empty heartbeat went out. If it did not, it changes its own state to shuttingDown and responds to the VMM that it has accepted the request to shut down. The VMM then marks the VM as shuttingDown and asks the cloud provider to completely deprovision the VM.
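
The LSPM’s side of this handshake is not shown in the earlier listings. A minimal sketch, reusing the state variables from Listing 13.1, might look like this:

router.post('/shutdown', function(req, res)
{
    if(gMachineState === eMachineState.empty)
    {
        //still empty, so accept the request to shut down
        gMachineState = eMachineState.shuttingDown;
    }
    //if a process spawned since the empty heartbeat went out, reporting
    //the current state tells the VMM to abort the shutdown
    res.send(
    {
        machineState: gMachineState,
        sequenceIndex: ++gSequenceIndex
    });
});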

The VMM checkHeartbeats function works like the LSPM function, but it ignores any timeouts for servers in the pending state. If a VM does time out, it means there is something wrong with the LSPM, so the VMM does not bother with the shutdown handshake. It instead immediately requests deprovisioning from the cloud service provider.

When the LSPM experiences a change in state due to a process shutting down, it does not need to wait for the predetermined heartbeat interval to notify the VMM. Instead, it can just send an extra heartbeat right away in response to the change. This is a simple way to give immediate feedback to the VMM and requires no extra functionality on the VMM’s part.
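
A sketch of the LSPM’s heartbeat sender follows; gVMMURL, gVMUUID, and gHeartbeatSendPeriodMS are assumed to be configured when the LSPM launches:

function sendVMMHeartbeat()
{
    var options =
    {
        url: gVMMURL + '/api/vms/' + gVMUUID + '/heartbeat',
        method: 'POST',
        json: {sequenceIndex: ++gSequenceIndex, machineState: gMachineState}
    };
    request(options, function(error, response, body)
    {
        if(error) { console.log('heartbeat failed: ' + error); }
    });
}

//report periodically; on a state change, the LSPM can also call
//sendVMMHeartbeat directly for immediate feedback
setInterval(sendVMMHeartbeat, gHeartbeatSendPeriodMS);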

This VMM implementation is functionally correct, prevents errors from race conditions, and is reasonably efficient. If many requests come in at once during the time it takes to provision a VM, though, it will end up provisioning one VM for each request. If the traffic is consistent this won’t be a problem, but in the case of an anomalous spike, this may end up spawning a wasteful number of VMs. A better implementation could detect this situation and throttle the VM provisioning requests. Similarly, the VMM is possibly inefficiently aggressive in its shutting down of empty VMs. Depending on the rate at which games are requested and exited, it might be beneficial to keep empty VMs alive for a certain duration before deprovisioning them. A more robust VMM would have a tweakable threshold for this. Improvement of the VMM is left as an exercise.


Tip

If a VMM needs to handle hundreds of requests per second, it may need a dynamic load balancer in front of it, and several Node instances to bear the brunt of the requests. In this case, the statuses of the VMs in the gVMs map need to be shared between instances, so instead of living in a single process’s local memory, they should live in a rapid access shared data store such as redis. For more on redis, see the “Additional Readings” section. Alternatively, if requests are this frequent, it may be better to shard players geographically, with a statically dedicated VMM for each region.


Summary

With the increased prevalence of cloud service providers, every studio building a multiplayer game should consider hosting dedicated servers in the cloud. Even though it is easier than ever before, hosting dedicated servers still costs more than having the players host the servers, and increases complexity as well. It also introduces a dependency on third-party cloud service providers and removes feelings of ownership from your players. The advantages of hosting dedicated servers often outweigh the drawbacks though. Hosted servers provide reliability, availability, high bandwidth, cheat prevention, and unobtrusive copy protection.

Hosting dedicated servers requires building a few backend utilities. The tools of backend development differ significantly from those of client-side game development. REST APIs provide a text-based, discoverable, and easily debuggable interface between services. JSON provides a clean and compact format for data exchange. Node JS provides an optimized, event-loop driven, JavaScript engine for rapid development.

There are several moving parts in a dedicated server infrastructure. The server game instance represents an instance of the game shared between players. There may be one or more game instances in a game server process, which represents the game to the OS. One or more game server processes may run on a game server machine. Typically game server machines are actually virtual machines, running with zero or more other virtual machines on the same physical machine.

To manage all of these parts, there is a local server process manager and a virtual machine manager. There is one LSPM per virtual machine, and it is responsible for spawning and monitoring processes on that machine, as well as reporting on its own health to the VMM. The VMM itself is the main entry point for process launch. When a matchmaking service decides that it needs a new game server launched, it sends a REST request to a VMM endpoint. The handler for that endpoint then either finds an underutilized VM or requests that the cloud service provider provision a new one. With a VM identified, it asks the VM’s LSPM to launch the new game server process.

All these pieces work in concert to provide a robust, dedicated server environment, capable of supporting a vast and scalable number of players with no upfront hardware cost.

Review Questions

1. What are the advantages and disadvantages of hosting dedicated servers? Why was hosting dedicated servers much harder in the past?

2. What are the pros and cons of supporting multiple game instances per game server process?

3. What is a virtual machine? Why does cloud hosting typically involve virtual machines?

4. What main functions does a local server process manager provide?

5. List multiple ways a server game process can provide feedback to a local server process manager.

6. What is a virtual machine manager and what purpose does it serve?

7. Explain how the VMM might sometimes provision more VMs than it needs. Implement an improvement.

8. Explain how the VMM might sometimes deprovision VMs sooner than it should. Implement an improvement.

Additional Readings

C++ REST SDK—Home. Retrieved from https://casablanca.codeplex.com. Accessed September 12, 2015.

Caolan/async. Retrieved from https://github.com/caolan/async. Accessed September 12, 2015.

Docker—Build, Ship, and Run Any App, Anywhere. Retrieved from https://www.docker.com. Accessed September 12, 2015.

Express—Node.js web application framework. Retrieved from http://expressjs.com. Accessed September 12, 2015.

Fielding, R., J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee. (1999, June). Hypertext Transfer Protocol—HTTP/1.1. Retrieved from http://www.w3.org/Protocols/rfc2616/rfc2616.html. Accessed September 12, 2015.

Introducing JSON. Retrieved from http://json.org. Accessed September 12, 2015.

Node.js. Retrieved from https://nodejs.org. Accessed September 12, 2015.

Redis. Retrieved from http://redis.io/documentation. Accessed September 12, 2015.

Request/request. Retrieved from https://github.com/request/request. Accessed September 12, 2015.

Rest. Retrieved from http://www.w3.org/2001/sw/wiki/REST. Accessed September 12, 2015.