What's my idea for a future multiplayer gaming architecture? The following image briefly outlines the core structure of the proposed model:
Note that the client side should hold next to no game state or data, and no audio/visual assets, as those are never supposed to leave the server side.
The following is the general flow of games using this architecture (all of these steps happen every frame):
You can also represent the aforementioned flow this way:
Do note that this differs from cloud gaming in the multiplayer case (although it's effectively the same for single player), because cloud gaming doesn't demand that games be specifically designed for it, while this architecture does. That difference means:
1. In cloud gaming, each player rents a remote machine that hosts the traditional client side of the game, which in turn communicates with the traditional server side running on a real server distinct from those middleman devices. That means up to 2 round trips per frame (between the client and the remote machine, and between the remote machine and the real server), so if the remote machines aren't physically close to the real server, and the players aren't physically close to the remote machines, the latency can rise to an absurd level
2. This architecture forces games complying with it to be designed differently from their traditional counterparts right from the start, so the client version (having minimal contents) can be installed directly onto each player's device and communicate directly with the server side of the game (which has almost everything) on the real server. This removes the need for a remote machine per player as a middleman, and hence the problems it creates (latency, plus the setup/maintenance cost of those remote machines)
3. The full cycle of the communications in cloud gaming is the following:
- The player machines send the raw input commands to the remote machines
- The remote machines convert those commands into new game states on the client side of the game there
- The client side of the game on those remote machines synchronizes with the server side of the game on the real server
- The remote machines draw new visuals and play new audio based on the latest client-side game states there
- The remote machines send that audio and visual information to the player machines
- The player machines replay that new audio and redraw those new visuals there
4. The full cycle of the communications of this architecture is the following:
- The player machines send the raw input commands directly to the real server
- The real server converts those commands into the new game states of the server side of the game there
- The real server sends new audio and visual information to the player machines based on the involved parts of the latest game states on the server side of the game there
- The player machines play that new audio and draw those new visuals there
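The four steps above can be sketched as a single per-frame server function. Everything here is a hypothetical stand-in for illustration (the `step` simulation, the `render_markup` format, the toy one-axis movement state), not a real engine API:

```python
import json
import zlib

def step(state, raw_inputs):
    # Stand-in simulation: each raw input just moves that player along x.
    new_state = dict(state)
    for pid, dx in raw_inputs.items():
        new_state[pid] = new_state.get(pid, 0) + dx
    return new_state

def render_markup(state, player_id):
    # Stand-in "rendered graphics markup": only what this player's camera
    # can see ever leaves the server (here, players within 100 units).
    visible = {pid: x for pid, x in state.items()
               if abs(x - state[player_id]) < 100}
    return json.dumps(visible).encode()

def server_frame(state, raw_inputs):
    """One frame: combine raw inputs, advance the authoritative state,
    then build each player a losslessly compressed markup of their view."""
    state = step(state, raw_inputs)
    frames = {pid: zlib.compress(render_markup(state, pid))
              for pid in raw_inputs}
    return state, frames

state, frames = server_frame({"a": 0, "b": 50}, {"a": 5, "b": -5})
```

The client's only job would be to decompress its frame and present it; no game logic, state, or assets ever reach it.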
Points 3 and 4 mean rendering actually happens twice in cloud gaming (once on the remote machines and once on the player machines), while it happens just once in this architecture (directly on the player machines). The redundant rendering in cloud gaming can contribute quite a lot to the end latency experienced by players, so this is another advantage of this architecture over cloud gaming.
In short, cloud gaming supports games that weren't built with cloud gaming in mind (and is thus backward compatible) but can suffer from insane latency and increased business costs (which will be passed on to players), while this architecture only supports games targeting it specifically (and is thus not backward compatible) but removes quite a few of the pains caused by the remote machines in cloud gaming (this architecture also has some other advantages over cloud gaming, but they'll be covered in the next section).
On a side note: if some cloud gaming platforms don't let their players join servers outside of them, that would remove the issue of having 3 entities instead of just 2 in the connection, but it would also be more restrictive than this architecture, because the latter only requires that all players play the same game using it.
Here are some advantages of the proposed architecture:
The disadvantages of this architecture at least include the following:
The advantages of this architecture would be unprecedented if it can ever be realized, while its disadvantages are all hardware limitations that will become less and less significant and eventually trivial.
So while this architecture won't become reality in the foreseeable future (at least several years from now), I still believe it'll arrive in the more distant future (probably in terms of decades).
For instance, let's say a player joins a server 300km away from his/her device (which is already a bit far) to play a game with a 1080p@120FPS setup using this architecture. The full latency would have to meet the following requirements for everything to be done within around 9ms, which is a bit more than the maximum frame time allowed at 120 FPS (1000ms / 120 ≈ 8.33ms):
1. The client will take around 1ms to capture and start sending the raw input commands from the player
2. The minimum ping, which is limited by the speed of light, will be 2 * 300km / 300,000km per second = around 2ms
3. The server will take around 1ms to receive and combine all raw input commands from all players
4. The server will take around 1ms to convert the current game state set with those raw input commands to form the new game state set
5. The server will take around 1ms to generate all rendered graphics markups (which are lossless, highly compressed and highly obfuscated) from the new camera states of all players
6. The server will take around 1ms to start sending those rendered graphics markups to all players
7. The client will take around 1ms to receive and decompress the rendered graphics markup of the corresponding player
8. The client will take around 1ms to render the decompressed rendered graphics markup as the end result being perceived by the player directly
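The budget above can be added up as a quick sanity check. The per-stage costs are the assumed 1ms figures from the list, not measurements:

```python
# Speed of light: ~300,000 km/s = 300 km per millisecond.
C_KM_PER_MS = 300.0

def round_trip_ms(distance_km):
    """Physical lower bound on ping for a given one-way distance."""
    return 2 * distance_km / C_KM_PER_MS

stages_ms = {
    "client: capture + start sending raw inputs": 1,
    "network: round trip at 300 km":              round_trip_ms(300),
    "server: receive + combine all raw inputs":   1,
    "server: advance the game state set":         1,
    "server: generate all graphics markups":      1,
    "server: start sending markups":              1,
    "client: receive + decompress markup":        1,
    "client: render the final image":             1,
}

total = sum(stages_ms.values())          # 9.0 ms in this example
frame_budget = 1000 / 120                # ≈ 8.33 ms per frame at 120 FPS
```

At 300km the light-speed round trip alone eats about a quarter of the 120 FPS frame budget, which is why every other stage is squeezed down to roughly 1ms.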
Do note that hardware limitations, like mouse and keyboard polling rate, as well as monitor response time, are ignored, because they'll always be there regardless of how a multiplayer game is designed and played.
Of course, the above numbers are outright impossible within years, especially when there are dozens of players on the same server, but they should become very real after a decade or 2, because by then the hardware we have should be much, much more powerful than what we have right now.
Similarly, for a 1080p@120FPS setup, if the rendering is lossless but not compressed at all, it'd need (1920 * 1080) pixels * 32 bits * 120 FPS + a little bandwidth for the raw input commands sent to the server = around 1GB/s per player, which is of course insane to the extreme right now. The numbers for 4K@240FPS and 8K@480FPS (assuming the latter will ever be a real thing) setups would be around 8GB/s and 64GB/s per player respectively, which are just incredibly ridiculous in the foreseeable future.
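The raw-bandwidth arithmetic generalizes into a small helper. Assumptions: 32-bit pixels, no compression at all, 1 GB = 10^9 bytes, and the higher-end setups shown here are the ones whose figures scale by the ×8 steps above (8GB/s and 64GB/s):

```python
def uncompressed_gbps(width, height, bits_per_pixel, fps):
    """Raw frame bandwidth in GB/s (decimal GB), ignoring input traffic."""
    return width * height * (bits_per_pixel / 8) * fps / 1e9

p1080 = uncompressed_gbps(1920, 1080, 32, 120)   # ≈ 1.0 GB/s
p4k   = uncompressed_gbps(3840, 2160, 32, 240)   # ≈ 8.0 GB/s
p8k   = uncompressed_gbps(7680, 4320, 32, 480)   # ≈ 63.7 GB/s
```

Each step multiplies pixel count by 4 and frame rate by 2, hence the ×8 jumps per player.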
However, as the rendering markups sent to the client should be highly compressed, the actual numbers shouldn't be this large. Even if the rendering isn't compressed at all, in the distant future, when 6G or even newer generations become the new norm, these numbers, while still quite something, should become practical enough for everyday gaming, and not just for enthusiasts.
Nevertheless, there might be an absolute limit on the screen resolution and/or FPS this architecture can support no matter how powerful the hardware is, so while I think this architecture will be the distant future (like after a decade or 2), it probably won't be the only way multiplayer games are written and played, because the other models will still have their value even by then.
If this architecture becomes the practical mainstream, the following will be at least some of the implications:
In the case of highly competitive e-sports, the server can even implement some kind of fuzzy logic, fine-tuned with a deep learning AI, to help report suspicious raw player input sets (consisting of keyboard presses, mouse clicks, etc) with a rating on how suspicious each is, which can be further broken down into more detailed components explaining why it's that suspicious.
This can only be done effectively and efficiently if the server has direct access to the raw player input set, one of the cornerstones of this architecture.
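As a toy illustration of what such screening could look like, here is a deliberately simple heuristic that flags unnaturally regular click timing. A real system would use far richer features and a trained model; every name and threshold here is hypothetical:

```python
from statistics import mean, pstdev

def suspicion_score(click_intervals_ms):
    """Score 0..1: near-constant intervals between clicks look scripted."""
    if len(click_intervals_ms) < 2:
        return 0.0
    avg = mean(click_intervals_ms)
    if avg == 0:
        return 1.0
    # Human timing is jittery; a coefficient of variation below ~5%
    # (an assumed cutoff, purely for illustration) is treated as suspicious.
    cv = pstdev(click_intervals_ms) / avg
    return max(0.0, 1.0 - cv / 0.05) if cv < 0.05 else 0.0

human = [183, 201, 167, 224, 190, 178]   # jittery, human-like timings
bot   = [100, 100, 101, 100, 100, 100]   # near-perfect cadence
```

Because the server receives the raw inputs directly, such scoring runs on data the player's machine never had a chance to tamper with.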
Combining this with traditional anti-cheat measures, like:
The above-mentioned measures make cheating next to impossible in major LAN events (which are also cut off from external connections), and infeasible and unrealistic everywhere else.
Games can also use a hybrid model, and this especially applies to multiplayer games also having single-player modes.
If a game supports single player, then the client side needs to have everything (and the piracy/plagiarism issues will be back); it's just that most of it won't be used in multiplayer if this architecture is used.
If the game is run as multiplayer, the hosting server can choose (before hosting the game) whether this architecture is used. Of course, only players with the full client-side package can join servers using the traditional counterpart, and only players with the server-side subscription can join servers using this architecture.
Alternatively, players can choose to play single-player modes with a dedicated server per player. Those servers are provided by the game company, letting players play otherwise extremely demanding games on a low-end machine. Players will need periodic subscriptions to access this kind of single-player mode.
On the business side, this means such games will have a client-side package, with a one-time price covering everything on the client side, and a server-side package, with a periodic subscription covering multiplayer and single player on a dedicated provided server. Players can buy either one or both, depending on their needs and wants.
If both technically and economically feasible, this hybrid model is perhaps the best model I can think of.
Previously published here.