Netcup pauses all G11 Root Server orders and reduces performance by 50%
I was planning to buy exactly 453 of the Root Server G11 VDS (4 dedicated cores) from Netcup based on a good recommendation from a fellow LET user (0% steal on many Netcup VDS), to run Quilibrium crypto nodes, which I also learned about from the same LET user.
However, on my one G11 root server I saw performance drop by 50% in just one month after I ordered it for testing.
Others report the same performance decrease on Netcup's forums: https://forum.netcup.de/administration-of-a-server-vserver/vserver-server-kvm-server/15354-aktuelle-benchmarks-rs-vps-produkte/?pageNo=25
Netcup also halted all sales of their G11 root servers:
(Translated)
Root Server G11 currently unavailable:
We are trying to offer you our products again as soon as possible.
Thank you for your understanding.
This is the before and after on my G11 Root Server, which was never rebooted between tests and is idling (clean Ubuntu install):
Netcup RS G11 - April 20 2024
Running GB6 benchmark test...
Geekbench 6 Benchmark Test:
---------------------------------
Test        | Value
Single Core | 2069
Multi Core | 6804
Full Test | https://browser.geekbench.com/v6/cpu/5819654
Netcup RS G11 - May 28 2024
Running GB6 benchmark test... *cue elevator music*
Geekbench 6 Benchmark Test:
---------------------------------
Test        | Value
Single Core | 1071
Multi Core | 3249
Full Test | https://browser.geekbench.com/v6/cpu/6316670
Comments
Interesting. Seems Netcup can't handle the load of 100s of instances running Quickburn (or whatever it's called...) 24/7 either. I guess I see this with one eye laughing and one eye crying. On one hand it exposes a dishonest marketing scheme, but on the other hand it comes at the expense of people pushing their VMs hard, but still not that hard.
I've also noticed this significant loss of performance.
If you have a previous-generation offer, I advise you not to upgrade to the new gen.
So I'd end up with a more expensive offer and less disk space for comparable performance.
Not worth it.
It makes customers very sad when performance decreases so much after purchase, especially on VDS with dedicated cores. I'd expect a 10-20% drop at most from CPU downclocking, but not 50%.
I buy VDS to run only Geekbench to see pretty numbers appear on my terminal screen. But now my Netcup VDS shows not pretty numbers.
I will also need to buy 453 VDS with 4 dedicated cores somewhere else now! :/
Good thing I did not buy 12 month Netcup contract.
As you may know, a lot of people run Geekbench 5 and Geekbench 6 tests, and some people notice that the score doesn't exactly scale with the number of cores available in the system. For example, an 8-core system with a CPU that scores 1000 GB5 points single-threaded won't exactly score 8000 points multi-threaded. This is probably for a number of reasons, but I suspect that it's due to hyperthreading and all-core CPU clock speeds being lower.
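To illustrate that non-linear scaling with the hypothetical 8-core / 1000-point example above, the "fraction of ideal scaling" can be written as a one-liner (a sketch; the function name is mine, not anything Geekbench provides):

```python
def mt_efficiency(single_score: float, multi_score: float, cores: int) -> float:
    """Fraction of ideal linear scaling achieved:
    1.0 means multi-core == cores * single-core."""
    return multi_score / (single_score * cores)

# Hypothetical 8-core system from the example above: an ideal scaler
# would hit 8000 multi-core; real systems land well below that.
print(round(mt_efficiency(1000, 8000, 8), 2))  # ideal case -> 1.0
print(round(mt_efficiency(1000, 6400, 8), 2))  # more typical -> 0.8
```

Anything meaningfully below 1.0 is the combined effect of SMT contention, lower all-core clocks, and shared-resource limits.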
EPYC 7702 Testing (64 cores, 128 threads)
I have an EPYC 7702 for a short period of time, so I thought I would start with this processor in particular. Threads are basically equivalent to what most providers call vCPU cores.
I'll be testing with:
0 threads loaded (0%)
16 threads loaded (12.5%)
32 threads loaded (25%)
48 threads loaded (37.5%)
64 threads loaded (50%)
96 threads loaded (75%)
120 threads loaded (93.75%)
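The percentages above are just the loaded-thread count over the 128 hardware threads; a quick sketch to reproduce them (the 120-thread step works out to 93.75%):

```python
TOTAL_THREADS = 128  # EPYC 7702: 64 cores / 128 threads

def utilization(loaded: int, total: int = TOTAL_THREADS) -> float:
    """Percentage of hardware threads kept busy by the stress workload."""
    return 100.0 * loaded / total

for n in (0, 16, 32, 48, 64, 96, 120):
    print(f"{n:3d} threads loaded ({utilization(n):.2f}%)")
```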
The BIOS was set to default values, and the CPU is theoretically rated for a 200W TDP. In practice, I only saw up to 185W on the CPU.
In each test, a VM will be used to run a stress test using the stress -c command to load that number of threads. Then another VM with 8 threads allocated will run a Geekbench 5 test to see the changes in results.
It's important to note that this may not be representative of real-world results. The benchmarks are purely artificial, and most virtualization solutions typically schedule the VM onto certain threads rather than evenly distributing it across all of the cores.
In this case, we are testing with the following configuration:
AMD EPYC 7702 (64c/128t)
GIGABYTE MZ32-AR0 REV 1.0
16 x 64GB 2400 MHz DDR4 ECC RDIMM
4 x 3.84TB NVMe SSD + 2 x 512GB NVMe SSD + 1TB NVMe SSD
40G ConnectX-3 Pro NIC
Test VM:
Host Passthrough (8 vCPU Cores)
16GB DDR4 ECC Memory
60GB NVMe SSD Storage
0 threads loaded
Idle usage is approximately 145W-180W of power on the host node.
Single Core: 1037
Multi Core: 7892
16 threads loaded
Upon loading the test VM, the server has a power usage between 290-300W.
Single Core: 1047
Multi Core: 7519
Full Test: https://browser.geekbench.com/v5/cpu/22231694
Most of the cores that were loaded by the stress test seem to be at around 3.2 GHz.
32 threads loaded
Upon loading the test VM, the server has a power usage between 290-300W.
Single Core: 938
Multi Core: 6672
Full Test: https://browser.geekbench.com/v5/cpu/22231637
Most of the cores that were loaded by the stress test seem to be at around 3.05 GHz.
48 threads loaded
Upon loading the test VM, the server has a power usage around 300-305W.
Single Core: 860
Multi Core: 5606
Full Test: https://browser.geekbench.com/v5/cpu/22231663
Most of the cores that were loaded by the stress test seem to be at around 2.71 GHz.
64 threads loaded
Upon loading the test VM, the server has a power usage around 310W.
Single Core: 550
Multi Core: 4082
Full Test: https://browser.geekbench.com/v5/cpu/22231650
Most of the cores that were loaded by the stress test seem to be at around 2.45 GHz.
96 threads loaded
Upon loading the test VM, the server has a power usage around 315-320W.
Single Core: 544
Multi Core: 4139
Full Test: https://browser.geekbench.com/v5/cpu/22231672
Most of the cores that were loaded by the stress test seem to be at around 2.43 GHz.
120 threads loaded
Upon loading the test VM, the server has a power usage around 315-320W.
Single Core: 536
Multi Core: 3639
Full Test: https://browser.geekbench.com/v5/cpu/22231683
Most of the cores that were loaded by the stress test seem to be at around 2.39 GHz.
Conclusion
As you can see from the testing above, at the halfway point where 50% of the CPU is utilized, we see roughly a 50% reduction in the single-core and multi-core scores in the 8-core VM running GB5 tests. Despite almost 50% of the CPU being free and 64 threads (64 vCPU cores) literally sitting unutilized, we still see low results. There's a dramatic drop in clock speed to 2.45 GHz, and it seems to stay around there as more cores are loaded, which is probably the culprit behind the lower Geekbench results.
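To put exact numbers on that drop, here's a quick sketch computing each run's score relative to the idle baseline, using only the results posted above:

```python
# (loaded host threads, single-core score, multi-core score) from the runs above
RESULTS = [
    (0, 1037, 7892),
    (16, 1047, 7519),
    (32, 938, 6672),
    (48, 860, 5606),
    (64, 550, 4082),
    (96, 544, 4139),
    (120, 536, 3639),
]

base_sc, base_mc = RESULTS[0][1], RESULTS[0][2]
for loaded, sc, mc in RESULTS:
    print(f"{loaded:3d} threads: single {100 * sc / base_sc:5.1f}%  "
          f"multi {100 * mc / base_mc:5.1f}%")
```

At 64 threads loaded this works out to about 53% of the idle single-core score, i.e. the roughly 50% drop discussed here.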
@Advin how's this configured? Threads bound to VMs or big scheduler party? Not that it likely matters. The whole thing seems more like thermal throttling or hitting a power consumption limit (or maybe memory bandwidth, depending on how the GB test is constructed?). Well, at least I hope it's something like that, otherwise I'd call this a pretty shitty CPU. There is little sense in doubled thread counts if the overall performance doesn't increase.
Thanks for these numbers
It would be interesting to compare with pinned CPU cores, where the impact should be much less significant, since context switching wouldn't be as frequent (only thermal throttling / memory bandwidth should have an impact).
But there must be a reason why Netcup doesn't do it...
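For anyone curious what pinning looks like in practice, this is a sketch of a libvirt domain XML fragment that pins a 4-vCPU guest to four dedicated host threads (the cpuset values are made-up placeholders; real ones come from your host's topology):

```xml
<vcpu placement='static'>4</vcpu>
<cputune>
  <!-- hypothetical host thread IDs; pick real ones from your topology -->
  <vcpupin vcpu='0' cpuset='8'/>
  <vcpupin vcpu='1' cpuset='9'/>
  <vcpupin vcpu='2' cpuset='10'/>
  <vcpupin vcpu='3' cpuset='11'/>
</cputune>
```

The trade-off mentioned above applies: pinning guarantees which threads a VM gets, but it also prevents the scheduler from moving a busy VM onto currently idle physical cores.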
Surprisingly, I've never seen such a drop in performance with the previous generation.
Maybe it's a microcode "update"? Old x6xx Xeons used to be theoretically able to simultaneously turbo boost on all cores; it was just disabled in (microcode) software. There are some models where "faulty" microcode is floating around that allows unlocking it, and it's pretty crazy. If I remember correctly they are still bound by a power limit though, so even allowing boost on all cores still sees some throttling regardless of the amount of cooling applied. Modern CPUs often come with a stupid amount of artificial crippling.
Scheduler party
I can assure you that the CPU was not being thermal or power throttled manually, but the EPYC 7702 generally is a low-power 64 core SKU.
Keep in mind that the CPU has 128 threads. 64 of those threads are just a result of hyperthreading, so anything past 50% CPU usage will generally result in a bigger performance loss, as far as I'm aware. I've sometimes seen CPU steal even though there's still 30-40% of CPU left; that's why I generally like keeping my nodes under 50% usage.
I'll maybe redo the testing at some point with EPYC Milan since it can sustain clocks better with higher default TDP, I'll also experiment with distributing the load and pinning CPU cores.
Also, keep in mind that Netcup is likely facing the Quilibrium problem, meaning that I wouldn't be surprised if their nodes were sustaining large amounts of vCPU from VMs.
In our shared environments, even with overallocated vCPU cores, we generally see an average load of 40-50%.
At Netcup, I wouldn't be surprised if they were generally lower, especially since they don't overallocate on vCPU (supposedly), leading to better clock speeds (if it weren't for Quilibrium).
@Advin I'm curious, were both workloads pinned to their own independent group of cores (within same CCX for the Geekbench)?
I wonder if NUMA configuration or cross-CCX latency across the Infinity Fabric is affecting the performance as well.
I also see some claims that people were able to hit close to boost clock on all cores, even with AVX workloads, on EPYC Rome when they hit their max 200W TDP.
Seems like in your test it only hit 185W of the 200W TDP limit?
I doubt it. If that were the case, why not have the CPU cores pinned? o:)
Performance would be better.
This may make things a little more complex. I doubt that's the only reason...
To my best knowledge those kinds of "features" are sadly pretty much universally hardcoded these days. The thermal/power management done by the OS/BIOS is mostly just for tightening further (well, outside of a couple of intended or unintended knobs sometimes). A good example is GFX cards: one of the biggest overclocking hacks these days is manipulating the power draw sensors.
Ah, that would kind of explain it trying to avoid drawing tons of power.
Good point. I would have expected the HT cores to at least add like half the performance of a real core, though, and not just stagnate. Here it would also make quite a bit of sense to pin threads to specific VMs in my opinion, as it would guarantee a fair distribution of real and HT cores. The fact that not doing so allows VMs to draw from currently non-HT'd cores as long as the overall load is low might also be a reason not to do it, thereby allowing VMs to get more performance under ideal circumstances.
This is pretty much baffling me, but then again I know next to nothing about how steal is actually calculated.
Cool, it's pretty interesting.
According to this thread, some people are buying VDS/root servers in large quantities, and the CPU is occupied for a long time. I think these reasons lead to performance degradation and shortage.
I just let the Linux scheduler handle everything on stock settings, each Geekbench ran in a VM. Perhaps there could be some performance optimization done there.
I can't really find any all-core benchmarks for the EPYC 7702, so I can't really compare it. However, I do know that the lower core count EPYC Rome processors can sustain their clocks better, since there is more power for each core.
AMD only started rating and advertising their all core boost speeds starting from EPYC Genoa, and the Genoa 9634 that Netcup uses downgrades from 3.7 GHz to 3.1 GHz under load.
The EPYC 7702 is more of a worst case scenario since it's a lower end 64 core chip that focuses simply on power efficiency and core density. Perhaps Milan 7763 or Rome 7742 would be better for a 64 core chip, since there's a larger TDP. I know that my Milan processors will basically hit 280W under half load since they aggressively turbo and try to maintain clocks.
I don't exactly know why it stuck at 185W; it's something that I should probably look at more closely in further tests. Perhaps it could be because not every core may have been under load, or maybe the 7702 just doesn't turbo as aggressively?
:O I find it hard to believe! People are really trying to order hundreds of VDS at once???!!!
Yeah, I couldn't find too many benchmarks for the 7702 either, unfortunately. I wish EPYC Genoa was cheaper.
If you look at the Geekbench history for EPYC 9634, you can literally see the NetCup RS G11 score go down each week : https://browser.geekbench.com/search?page=1&q=9634+
Well, like 5-10 years back people emptied Hetzner's server auction to mine Monero. I know nothing about Guacamole but maybe it's as lucrative as the Monero farms back then and people tend to do crazy things for money.
A rated TDP of 200W doesn't necessarily mean it'll hit exactly that in real life. It might as well be maxing out at 185W.
Accurate, and usually less of a spread between base and boost theoretical max.
I've tested with the custom variants (7B12, 7B13) as well as 7742/7763 and it's definitely an improvement, however you are at max TDP for socket/coolers usually so a lot of it comes down to environmentals. A lot harder to keep that 280W TDP CPU cooled and boosted in a quad-node 2U versus a single chassis 2U with a big hunk of metal sitting on the CPU.
This is often where the closeted crypto miners get themselves into trouble. Pinned/dedicated or not CPUs, a lot of cloud or multi-node systems are not built assuming 100% load on every core 24/7 mining some memecoin. The miners that (at least here) let us know what they're doing are explained to what they actually need, what it costs, and if we can provide it.
Could have been cooling/headroom related, silicon lottery, etc.
It's not so much a shitty CPU, just that the difference between base and boost clock (2.0 GHz to 3.35 GHz) is massive. Without the benefit of boosting, it's doing exactly what you'd expect from a ~2 GHz Zen 2 core. The more static load, the closer you get to always being at base clock. I don't know the exact crossover (and this does depend on cooling, silicon, etc.) but I'd say @Advin's tests mirror closely what I saw. The drop-off is significant after ~50-60% of the actual physical cores are loaded up. Ignore HT, it definitely doesn't 'double' the cores/perf in the remotest sense, especially on synthetic benchmarks designed to load a system up. It's more usable in real-world scenarios though, with true varied workloads.
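That base-vs-boost framing is easy to sanity-check against the numbers in this thread: the 7702's rated clocks predict roughly the drop observed once boosting stops (a back-of-envelope sketch, not a rigorous model, since score rarely scales perfectly with clock):

```python
BASE_GHZ, BOOST_GHZ = 2.0, 3.35  # EPYC 7702 rated base and boost clocks

# If performance scales roughly with clock, losing boost entirely leaves
# about this fraction of the lightly-loaded performance:
clock_ratio = BASE_GHZ / BOOST_GHZ
print(f"base/boost clock ratio: {clock_ratio:.2f}")  # ~0.60

# Observed in the 7702 test above: single-core 1037 idle -> 550 at 64 threads
observed = 550 / 1037
print(f"observed score ratio:   {observed:.2f}")  # ~0.53
```

The observed ratio lands close to (slightly below) the pure clock ratio, which fits the "it's just running at base clock" explanation.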
I believe the threshold TDP measured and used by CPU governors is something that can be hit given the proper BIOS and environmental configuration as long as the CPU itself doesn't internally downclock prior to that threshold.
that other guy was also buying 453 VMs so maybe he stole your chicken power?
Yeah, when I wrote that I didn't know yet that it was supposed to be low power.
Well, ignoring the low-power aspect for a second, that's still not overly impressive given that (hacked) Xeons managed to boost on at least half their cores while keeping base clock on the others.
Sure, like I've said above, I didn't expect any doubling, but a 50% increase would have been nice, or at least not total stagnation. What's HT good for if it doesn't add any performance at all (again ignoring the likelihood of the CPU just hitting its power target... well, kind of: why not simply scale down if the power target prevents fully using the hardware anyway?).
HT/SMT is heavily dependent on the workload, with some use cases where there is no performance difference (or even a decrease) whether SMT is enabled or disabled. It's quite suitable for varied workloads like VM hosting. I run a lot of uniform workloads on Ryzens and EPYCs with little performance benefit from SMT enabled (0-5% increase).
@Advin
May I request you to run similar load test with HT turned off in bios?
Thanks and much appreciated
The CPU cache is a critical resource for performance. When multiple threads run on the same core, they share the same cache, which can lead to frequent cache evictions and cache misses. This can significantly increase the latency for memory accesses and degrade performance.
This is clearly evident when the active thread load is > 64 on a CPU with 64 cores (128 HT threads).
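Related to that cache-sharing point: on Linux you can see exactly which two logical CPUs share one physical core (and thus its caches) via sysfs. A small sketch parsing that topology (the sibling file format is e.g. "0,64" or "0-1"):

```python
def parse_siblings(text: str) -> list:
    """Parse a thread_siblings_list entry like '0,64' or '0-1' into CPU IDs."""
    ids = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = map(int, part.split("-"))
            ids.extend(range(lo, hi + 1))
        else:
            ids.append(int(part))
    return ids

# On a real host you would read, for CPU 0:
#   /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
print(parse_siblings("0,64"))  # -> [0, 64]
print(parse_siblings("0-1"))   # -> [0, 1]
```

Two IDs in the same list are SMT siblings: loading both means they contend for the same core's execution units and L1/L2 cache.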
Makes sense, it's more of just a standard with AMD since you're getting a lot of cores per watt. Not necessarily low power, unless you need to use them all at the same time. Kind of meta to this whole discussion if you really think about it! :D
I do think it's impressive when you consider ~200W for 64 cores in a single socket (especially at 2019 launch date) where even two beefy Xeons E5/G1 Scalable CPUs weren't getting to 64 cores while being double the wattage. It just comes down to use case. The EPYC will still boost well on half the cores (32 of 64 was still pretty close to unloaded score) but the further you push it the quicker it diminishes.
@Moopah explained it really well. I thought of a few examples but they all sucked, HT is just kind of hard to benchmark with the standard tools. They're not meant for that. It's just giving you the ability to use the 'unused' parts of a core (per cycle) by emulating 1 core as 2. Obviously if you're trying to do the same computations it'll have to wait, but it can parallelize a lot of things effectively. Plenty of use cases (especially in heavy compute) where you don't want the overhead (or SMT is actually detrimental) and turn it off, though.
Curious, what's special about 453?
And Quilibrium is basically like AWS? I buy some coins with some greenbacks, exchange them for some quota of compute/storage, and somehow the Quilibrium network deploys my code on some nasty lowgrade phpfriends or netcup oversold node, and Moopah makes a decent profit? Is that the essence of it? Anything standout, or just another shitcoin with the typical Discord group hyping it?
Sure, it's kind of an achievement. Still, in my opinion it's somewhat underwhelming. There's a lot of cores that don't do much once you start using them. Maybe they could do 1024 cores next. If one tries to actually use them all they degrade to C64 performance and run like a 15-year-old desktop PC, but the huge number will look nice on paper! ;)
Ordering exactly 453 instances of VDS nodes at one time against a single hosting provider with unrestricted automated provisioning is very important.
My research (reading LET threads) suggests this number is the most optimal for generating high quantities of DramaState™ on LET.
My Grafana and Prometheus metrics charts show that the level of DramaState™ on LET is unstable and needs to be maintained at a higher, healthy level.
DramaState™ mining on LET is an extremely compute-intensive workload.
Technically, AMD advertises cores and threads separately and doesn't market their threads as actual cores. It's primarily in the hosting space that "vCores" and "vCPUs" are used instead of "threads", which leads to confusion about the true performance of a dedicated "core" due to 2 vCPUs contending on a single physical core and sharing cache.
That is why it is recommended to pin both HT siblings to the VDS for optimal performance.
Just like @crunchbits VDS
But that means you can't make more $$$ by selling each pinned thread as a dedicated Core