New on LowEndTalk? Please Register and read our Community Rules.
All new Registrations are manually reviewed and approved, so a short delay after registration may occur before your account becomes active.
Comments
Not sure what chassis you're using, but check the inlet temperature. Even if the CPU temps are fine, if the inlet temperature on Dell servers is over 40°C, the CPU will be throttled heavily.
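A quick way to read the inlet sensor is via IPMI; this is only a sketch, assuming ipmitool is installed and the BMC is reachable (sensor names vary by vendor, and the remote hostname/credentials below are placeholders):

```shell
# Read all temperature sensors from the BMC; on Dell iDRAC the
# "Inlet Temp" reading is the one that matters for throttling (~40 C).
ipmitool sdr type temperature

# Remote variant (hypothetical host and credentials):
# ipmitool -I lanplus -H idrac.example.com -U root -P secret sdr type temperature
```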
Also check the actual CPU frequencies to see whether they're within the expected ranges; if they're not, that indicates throttling or something wrong with the CPU itself.
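A minimal sketch for checking current core clocks on Linux; note that the cpufreq sysfs files are often not exposed inside VMs, so it falls back to /proc/cpuinfo:

```shell
# Read per-core clocks from cpufreq sysfs (values are in kHz);
# if the driver isn't exposed, fall back to /proc/cpuinfo.
freq_info=$(cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq 2>/dev/null)
if [ -z "$freq_info" ]; then
  freq_info=$(grep -m4 'cpu MHz' /proc/cpuinfo)
fi
echo "${freq_info:-no frequency info exposed}"
```

Compare the numbers against the CPU's base clock: cores pinned far below base under load point to throttling.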
Do you TRIM the SSDs in the ZFS pool on a regular basis? If not, run a manual trim.
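A manual pass looks like this (a sketch only; "tank" is a placeholder pool name, and autotrim needs a reasonably recent OpenZFS):

```shell
# One-off manual TRIM of the whole pool:
zpool trim tank

# Watch TRIM progress per vdev:
zpool status -t tank

# Optionally let ZFS issue TRIMs continuously as blocks are freed:
zpool set autotrim=on tank
```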
Not sure I'm mentally strong, but that does describe my last two desktop PCs (for personal use) as well as my work PC at a previous job. I guess "no backup" isn't quite true, as I use source control, so I wouldn't actually lose anything if they failed other than a couple of hours work and the time to re-install.
Nowadays I've shifted to NVMe drives, but for compilation-heavy projects striped disks make a massive difference, and halving compile times is worth the risk of a day's downtime once every few years.
What does the "zpool list" command show?
Hipster grills meats over overheating server chips, with production workload still running.
Proof of improvement:
It's great that you fixed it, but whoever finds this thread on Google 5 years from now will probably appreciate more detail.
At the beginning of July, I played around in the BIOS and misconfigured the CPU's performance/power-consumption parameters. With those settings I had created a bottleneck that drove power draw to 500W in a simple YABS benchmark, without the temperature rising much.
I spotted the problem by watching power draw during a YABS run: I was very surprised to see it repeatedly drop from 400W to 10W, under conditions where it would normally never fall below 250W.
Check the temperature of the LSI controller. They throttle if the chip is over 105°C or so. That happens easily if there isn't airflow over the card.
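If it's a MegaRAID-family card, the ROC temperature can be read with storcli; a sketch assuming the controller is /c0 and the storcli64 binary is in PATH:

```shell
# Dump controller info and pull out the temperature lines;
# the "ROC temperature" value should stay well below ~105 C.
storcli64 /c0 show all | grep -i temperature
```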
Edit: I see now you blame the BIOS settings. That doesn't correlate with when the problem started, but whatever. Given the 1200 MHz, probably still wrong.
"HBA" is a tip off that it's not in RAID.
self-inflicted wounds! glad you found the prob man.
Maybe my brother @yoursunny can explain to you why 1200 MHz isn't wrong as long as it is in "schedutil" mode.
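The point being that schedutil scales clocks with load, so a low idle frequency is normal. A sketch for checking the active governor (the sysfs path is standard on Linux but often absent inside VMs):

```shell
# Read the cpufreq governor for cpu0; with "schedutil" an idle
# reading of 1200 MHz just means the core is clocked down on purpose.
gov_file=/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
if [ -r "$gov_file" ]; then
  governor=$(cat "$gov_file")
else
  governor="none (no cpufreq driver exposed; common inside VMs)"
fi
echo "governor: $governor"
```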
At the beginning you reported "Processor : Common KVM processor", then Haswell, and only at the end host passthrough. If you don't let your CPU accelerate the tasks people actually run, with at least AES-NI, you are going to hit a bottleneck. As a general rule I personally avoid all providers that don't pass the host CPU through. It's just gonna be a shit show.
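Whether the flag made it into the guest is easy to verify; a minimal sketch, assuming a Linux guest:

```shell
# Check whether the hypervisor passed AES-NI through to the guest;
# with a "Common KVM processor" model this flag is typically missing.
if grep -qw aes /proc/cpuinfo; then
  aes_status="AES-NI available: host CPU features passed through"
else
  aes_status="AES-NI missing: crypto falls back to slow software paths"
fi
echo "$aes_status"
```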
It's a 24/7 server, so that doesn't matter; you're using the wrong CPU if you have to manually throttle it. You should get the lower-power variants instead.
(The CPU scores are garbage, and in my experience there's no reason to run a VM with a score under 500 in 2022.)
I totally agree with you, this is a normal benchmark result for this CPU model.
Could be the issue. We had a motherboard model where the temperature reading was incorrectly offset: at a reported 66°C you would badly burn your finger on the chip, and the server would misbehave.
The symptom was that it got slower and slower over time; a reboot solved that temporarily, and then it started degrading again. Disk I/O was especially affected.
At about 75°C the server would outright crash, but the alarm threshold was set to 90 or 95°C in the Dell BIOS/firmware, so it never reached alert temperatures.
We had hundreds of nodes like this, and it took 1½ years to figure out the culprit. Tired of it, we simply tried overkill cooling on the chipset, and just like that the issue was gone.
Finnish retailers were immediately out of stock of 40mm fans, and even Mindfactory ran out for a while as we started ordering every single 40mm fan we could get our hands on.
That motherboard model had a dual chipset, so 2 fans per node. Hundreds and hundreds of fans later, the issues were gone.
The motherboard's original manufacturer, Tyan, used copper heatsinks for the chipset, but Dell Datacenter Services did some cost saving and used aluminium with terrible thermal paste. These chipsets must have been running at 110°C or more for prolonged periods, because the ABS mounting studs were all degraded; ABS starts to degrade at 110°C, and most of the pins would break at a gentle touch.
It's amazing the boards worked with the original heatsinks as long as they did, given how hot the chipsets had to run for that ABS degradation to happen. Amazing those chips did not outright fail.
So physically go and check that PCH chip, swap the thermal paste, etc. Very cheap and fast to test.
Read the rest of the comments -> the off time and reboot might have solved it temporarily. Issues caused by BIOS settings don't appear by themselves later on; they typically appear immediately.
Did you try to reboot the server before that? If not, it's that reboot which solved it.
It appears you did not fix the server itself, but moved customers to another node?
Half of the customers were moved (all of them with SSD storage).
Regarding rebooting: I rebooted our server multiple times, but the error didn't go away after a reboot, not even after a while (it idled for over 6 hours and the benchmark results were still miserable).
Thank you for your time!