As a few of you may have heard, we have been investigating the possibility of changing providers for our servers. This did not come from our current ones being insufficient, although the webserver was rather overloaded.
Instead, it came from Jeff mentioning that he saw a Hacker News post praising a datacenter company in Quebec called OVH.
After that, I had a look at what they offered (prices and specs) and at the reviews I could find on the internet.
The price for the specs was… pretty darn impressive: for machines a bit stronger than what we currently have, we would end up paying about half our current price. But specs aren’t everything, and the price felt too good to be true.
Nonetheless, we ordered a machine for a week so that we could test the quality of their network and hardware against our current machines.
On liquid we currently have:
- www:
  - AMD Phenom(tm) II X6 1055T Processor (6×2.8GHz)
  - 16GB DDR3 RAM at 1333MHz
  - 300GB SAS RAID 1 (15k rpm) for the database/webserver
  - 1TB 7200rpm SATA drive for the immediate backups (there are 2 other levels of backups hosted externally, this one is just for the quick backups every 30 minutes)
  - 100mbit connection
- liberty:
  - AMD Opteron(tm) Processor 6128 x4 (32×2GHz)
  - 16GB DDR3 RAM at 1333MHz
  - RAID 1 SATA 7200rpm for the main OS / source code
  - 1x SSD for the universe save (reduces the lag impact of the save to <2ms instead of the old 130ms that often occurred on a regular hard drive)
  - 1gbit connection (to ensure minimal latency to the game at all times, even with a lot of players)
The machines we are getting on OVH are:
- www:
  - Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz (6 cores, 12 threads with Hyper-Threading)
  - 128GB DDR3 RAM at 1600MHz
  - 300GB SSD RAID 1 for the database/webserver
  - 2TB SATA RAID 1 7200rpm for the immediate backups
  - 500GB of external backup storage (free with most dedicated servers at OVH)
  - 1gbit/500mbit out unlimited connection (1gbit burst up to 10 secs)
  - DDoS hardware protection from 100gbit/sec to 500gbit/sec depending on the specifics
- liberty:
  - Intel(R) Xeon(R) CPU E5-2670 v2 (20×2.5GHz, 40 logical threads with Hyper-Threading, turbo 2.9GHz and maintained at low temperature)
  - 256GB DDR3 RAM at 1600MHz
  - 3x 300GB SSD, in RAID 1 for the OS install, with a partition in RAID 0 for a higher-performance uni save
  - 500GB of external backup storage (free with most dedicated servers at OVH)
  - 1gbit/500mbit out unlimited connection (1gbit burst up to 10 secs)
  - DDoS hardware protection from 100gbit/sec to 500gbit/sec depending on the specifics
Now, on paper this means that the website should be a lot more responsive once we switch.
The database will have so much RAM available for caching that it should be a world apart from our current configuration, not to mention that the CPU is far superior, and so is the storage the database will sit on.
What this also means for the webserver is that we will be able to run test servers with a full universe without impacting performance all that much.
Currently the test server is rather limited for full uni testing: it cannot have both livetest and test running a full uni at the same time, or the free memory drops too low and the database ends up affected.
With this new server, we should in theory have no problem with 2 or maybe even 3-4 test servers (private test servers for the content devs to run their own separate player-related tests, etc.).
Of course the CPU is limited, but servers with 1-2 players on them use very little CPU, so there’s generally no issue here for a test server.
For liberty we are looking at a performance improvement of a little over 250% across the board. There is a slight drop in the number of physical cores (from 32 to 20), but the performance per core is quite a bit higher.
Both in a lag scenario (1500 AI fighting in a single galaxy) and in the current uni without players, the new server was always 2-3 times faster than the current liberty.
Now for the part that specs on paper rarely cover properly: connection reliability. As some of you know, it’s been a constant issue over the years.
Some people are unable to connect even though others can, parts of the world lose access and need to rely on a proxy / gateway temporarily, and some players get frequent disconnections even though their internet never dropped as far as they can tell.
It is extremely hard to properly test for those things due to the apparent randomness involved, so I came up with a test that I think should catch them as well as possible to allow for comparison.
No server is perfect and our network code is extremely fragile (it’s been a problem for a long time, the slightest corruption or interruption is often enough to cause a full disconnection). Therefore, having a network that experiences fewer of those interruptions sounds like a good improvement.
Now for the specifics: I split the tests into 3 categories, each with its own frequency. All tests were run for 5 days on both liberty and the test machine we took on OVH.
- A ping test: Every second, each server pinged 49 locations (7 per test region) around the world, looking for packet loss / disconnections as well as latency.
- A route test: Every 5 minutes, each server took a snapshot of the route used to reach all 49 locations, to compare with the previous snapshot and note any route changes.
- A bandwidth test: Every 2 hours, each server ran a speed test using the speedtest.net command line tool against 21 different speedtest nodes (3 per test region) spread around the world.
The ping test has up to 1 second of error per group of consecutive lost packets (1 lost packet is 0-1 sec, 2 lost packets in a row is 1-2 secs, 3 is 2-3 secs, etc.). Downtimes that were identical on both liberty and the OVH server were removed (the target itself was likely down, not the route to it).
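For the curious, the way those minimum/maximum downtimes are counted can be sketched like this. It is only a minimal illustration under my own assumptions, not the actual test tooling: the function names and the one-boolean-per-second input format are made up for the example, and dropping shared losses second by second is just one simple way to discard downtime seen by both servers.

```python
from itertools import groupby


def downtime_bounds(replies):
    """Return (min_seconds, max_seconds) of downtime for one target.

    `replies` is a hypothetical list with one boolean per second:
    True if the ping was answered, False if it was lost.
    A run of k consecutive losses counts as between k-1 and k seconds.
    """
    low = high = 0
    for answered, run in groupby(replies):
        if not answered:
            k = sum(1 for _ in run)
            low += k - 1
            high += k
    return low, high


def drop_shared_losses(a, b):
    """Ignore seconds lost by both servers (the target was likely down).

    Returns copies of the two reply lists in which any second lost on
    both sides is marked as answered.
    """
    shared = [not x and not y for x, y in zip(a, b)]
    a_clean = [x or s for x, s in zip(a, shared)]
    b_clean = [y or s for y, s in zip(b, shared)]
    return a_clean, b_clean


# One isolated loss (0-1 sec) plus three losses in a row (2-3 secs).
print(downtime_bounds([True, False, True, False, False, False, True]))
```

Summing the two bounds over all targets in a region gives a range like the per-region ones in this post.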
The new server had total disconnections across all regions of between 397 and 1489 seconds out of the 432000 seconds tested (0.09% to 0.34% of the time when not all the international routes were functional). Per region, with the average ping in parentheses:
- Africa Central: 193-835 seconds down (249ms)
- Asia: 112-154 seconds down (195ms)
- Australia: 32-157 seconds down (203ms)
- England: 27-117 seconds down (97ms)
- Germany: 23-106 seconds down (85ms)
- US south west: 6-78 seconds down (72ms)
- US east: 4-39 seconds down (19ms)
The current liberty had total disconnections across all regions of between 866 and 2309 seconds out of the 432000 seconds tested (0.20% to 0.53% of the time when not all the international routes were functional). Per region, with the average ping in parentheses:
- Africa Central: 393-1135 seconds down (280ms)
- Asia: 212-468 seconds down (195ms)
- Australia: 74-235 seconds down (207ms)
- England: 87-178 seconds down (120ms)
- Germany: 73-162 seconds down (105ms)
- US south west: 12-81 seconds down (60ms)
- US east: 15-50 seconds down (23ms)
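Expressed as a percentage of the 5 days tested, the two totals work out as follows. This is just the arithmetic written out as a small sketch; the function name is made up for the example.

```python
TESTED_SECONDS = 5 * 24 * 60 * 60  # 5 days = 432000 seconds


def downtime_percent(seconds_down, total=TESTED_SECONDS):
    """Share of the test window, in percent, with at least one route down."""
    return 100.0 * seconds_down / total


for name, low, high in [("new server", 397, 1489), ("liberty", 866, 2309)]:
    print(f"{name}: {downtime_percent(low):.2f}% to {downtime_percent(high):.2f}%")
# new server: 0.09% to 0.34%
# liberty: 0.20% to 0.53%
```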
Interestingly enough, the ping from the server based in Quebec is lower on average, with the exception of US south west, which is the worst case for the move (Asia came out the same on both). This was not expected.
But it can be explained by the effort OVH puts into their network: they have a lot of private routes / deals with other providers to ensure good connections whenever possible (which is, among other things, needed for their impressive DDoS protection).
All in all, the downtimes were a lot shorter on the OVH network than on liquid’s.
For the second test, there’s a little too much data to present clearly in a post (most routes tested are international and involve 10-20 gateways), but there was a pretty clear difference in the routes used by each server.
The current liberty changed route on average twice per hour, rarely a major change, just a different gateway here and there. Those changes often happened at the same time a disconnection was found during a ping, but not always.
On the new server, there were very few actual route changes, 4-5 per day, with the exception of central Africa, which saw about 2 every hour as well.
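Counting route changes from the 5-minute snapshots boils down to comparing each snapshot with the previous one. Here is a rough sketch of that idea; the function names and the list-of-gateways format are assumptions for the example, and the addresses are made-up documentation addresses, not real hops from the test.

```python
def count_route_changes(snapshots):
    """Count how often the hop list differs from the previous snapshot.

    `snapshots` is a hypothetical list of routes in time order, each route
    being the list of gateway addresses a traceroute-style tool reported.
    """
    return sum(1 for prev, cur in zip(snapshots, snapshots[1:]) if prev != cur)


def changes_per_hour(snapshots, interval_minutes=5):
    """Average number of route changes per hour over the snapshot window."""
    intervals = len(snapshots) - 1
    if intervals < 1:
        return 0.0
    return count_route_changes(snapshots) * 60 / (intervals * interval_minutes)


# Made-up example: one gateway swaps out and later swaps back.
route_a = ["192.0.2.1", "198.51.100.7", "203.0.113.9"]
route_b = ["192.0.2.1", "198.51.100.42", "203.0.113.9"]
snapshots = [route_a, route_a, route_b, route_b, route_a]
print(count_route_changes(snapshots))  # 2
```

A change that happens and reverts between two snapshots goes unnoticed, so counts obtained this way are a lower bound.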
The third test was the bandwidth test. While bandwidth doesn’t equate to performance when you’re looking at a video game (you don’t need 30mbit to play ss, especially not on download), it does give a fairly good picture of link quality and load. As a rule of thumb, if a link lets you transfer faster, it means it is able to take a whole lot more load.
Keep in mind that while these are the average speeds obtained using speedtest cli, a lot of the tests ended up throttled at one point or another along the line because there was only 1 source. When running multiple tests to similar locations at the same time, the total bandwidth nearly always approached 700/400 Mbit/sec DL/UL.
The current liberty network average for each region (DL/UL, Mbit/sec):
- Africa Central: 97/38
- Asia: 84/24
- Australia: 138/96
- England: 157/43
- Germany: 165/52
- US south west: 443/319
- US east: 409/294
The new server network average for each region (DL/UL, Mbit/sec):
- Africa Central: 122/44
- Asia: 124/74
- Australia: 238/126
- England: 357/213
- Germany: 365/252
- US south west: 343/189
- US east: 705/494
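To make the two lists easier to compare, the ratio between them can be computed per region. This is a small sketch using only the averages listed above; the dictionary and function names are made up for the example.

```python
# Average DL/UL in Mbit/sec per region, copied from the two lists above.
LIBERTY = {
    "Africa Central": (97, 38),
    "Asia": (84, 24),
    "Australia": (138, 96),
    "England": (157, 43),
    "Germany": (165, 52),
    "US south west": (443, 319),
    "US east": (409, 294),
}
NEW_SERVER = {
    "Africa Central": (122, 44),
    "Asia": (124, 74),
    "Australia": (238, 126),
    "England": (357, 213),
    "Germany": (365, 252),
    "US south west": (343, 189),
    "US east": (705, 494),
}


def speedup(region):
    """Return (download_ratio, upload_ratio) of the new server vs liberty."""
    old_dl, old_ul = LIBERTY[region]
    new_dl, new_ul = NEW_SERVER[region]
    return new_dl / old_dl, new_ul / old_ul


for region in LIBERTY:
    dl, ul = speedup(region)
    print(f"{region}: download x{dl:.2f}, upload x{ul:.2f}")
```

On these numbers every region comes out ahead on the new server except US south west, where liberty was faster in both directions.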
In a nutshell, try as I might to find a problem with the new server location, I could not. All I could find were more reasons why we should switch over. Liquid does have their incredible support team going for them, though (I have never had problems fixed as fast as with them).
OVH gives us quite a bit more control over the hardware, which should normally greatly reduce the need to rely on support (beyond hardware repairs).
Funnily enough, this is something quite a few of the reviews on the internet actually complain about: compared to most other services I have tried, their control panel is complex and dangerous.
The OVH servers end up being way cheaper than our current ones, a lot faster (2-3 times for liberty in my tests), and have a more reliable connection as far as I could test.