Production server | What the heck?


Comments

  • @FlorinMarian said: 4. I checked the temperature of the processors, nothing abnormal (below 60 degrees Celsius)

    Not sure what chassis you're using, but check the inlet temperature. Even if the CPU temps are fine, if the inlet temp on Dells is over 40 degrees C, the CPU will be throttled heavily.
    Also check the actual CPU frequencies to see whether they're within the expected range; if they're not, that indicates throttling or something wrong with the CPU itself.
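    (A quick way to sanity-check both from the OS and the BMC; a rough sketch, and the sensor names grepped for below are just the usual suspects, they vary by vendor:)

    # Per-core clocks right now; if they sit far below the CPU's base clock under load, something is throttling
    grep 'cpu MHz' /proc/cpuinfo | sort -t: -k2 -n | head

    # Governor, driver and hardware frequency limits
    cpupower frequency-info

    # Inlet/outlet/PCH temperatures from the BMC
    ipmitool sdr type Temperature | grep -Ei 'inlet|outlet|pch'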

  • FlorinMarian Member, Host Rep

    @PulsedMedia said:
    Sounds a bit like some chip is overheating, likely the motherboard chipset or LSI controller. Are you able to turn off the host for, say, 30 minutes to let it cool down? Do you have local on-site access?

    I would remove the LSI and motherboard chipset heatsinks, replace paste and try again then.

    We've seen a lot of peculiar performance issues when a motherboard chipset overheats, but I don't recall an issue with LSI HBAs needing repasting.

    Then again, it might just be ZFS. We've had nothing but trouble with ZFS, to be honest. Though we gave it another try just a few weeks ago and had some very interesting and unexpected benchmark results. Perhaps ZFS is finally maturing.

     Sensor Name    Status  Current Reading
    SYS_PWR_Monitor All deasserted  0x8000
    SystemEvent All deasserted  0x8000
    SEL State   All deasserted  0x8000
    Watchdog    All deasserted  0x8000
    ME_PWR_Status   All deasserted  0x8000
    NMI_State   All deasserted  0x8000
    CPU_PROC_HOT    All deasserted  0x8000
    P12V    Normal  11.9 Volts
    P5V Normal  4.991 Volts
    P3V3    Normal  3.322 Volts
    P3V3_AUX    Normal  3.266 Volts
    P5V_AUX Normal  4.945 Volts
    P1V5_PCH    Normal  1.491 Volts
    P1V05_PCH   Normal  1.041 Volts
    P1V05_PCH_STBY  Normal  1.05 Volts
    PVDDQ_AB    Normal  1.24 Volts
    PVDDQ_CD    Normal  1.231 Volts
    PVDDQ_EF    Normal  1.24 Volts
    PVDDQ_GH    Normal  1.231 Volts
    PVCCIN_CPU0 Normal  1.813 Volts
    PVCCIN_CPU1 Normal  1.822 Volts
    PVCCIO  Normal  1.058 Volts
    PVPP_AB Normal  2.657 Volts
    PVPP_CD Normal  2.633 Volts
    PVPP_EF Normal  2.633 Volts
    PVPP_GH Normal  2.645 Volts
    CPU_0_DTS_TEMP  Normal  -35
    CPU_1_DTS_TEMP  Normal  -43
    PCH_TEMP    Normal  66 ° C
    AMB_Sensor_TEMP Normal  22 ° C
    MB_Inlet_TEMP   Normal  32 ° C
    MB_Outlet_TEMP  Normal  60 ° C
    DIMM_Inlet_TEMP Normal  30 ° C
    CPU0_CH0_DIMM0  Normal  40 ° C
    CPU0_CH0_DIMM1  Normal  41 ° C
    CPU0_CH1_DIMM0  Normal  43 ° C
    CPU0_CH1_DIMM1  Normal  45 ° C
    CPU0_CH2_DIMM0  Normal  42 ° C
    CPU0_CH2_DIMM1  Normal  42 ° C
    CPU0_CH3_DIMM0  Normal  42 ° C
    CPU0_CH3_DIMM1  Normal  46 ° C
    CPU1_CH0_DIMM0  Normal  33 ° C
    CPU1_CH0_DIMM1  Normal  34 ° C
    CPU1_CH1_DIMM0  Normal  33 ° C
    CPU1_CH1_DIMM1  Normal  35 ° C
    CPU1_CH2_DIMM0  Normal  35 ° C
    CPU1_CH2_DIMM1  Normal  33 ° C
    CPU1_CH3_DIMM0  Normal  34 ° C
    CPU1_CH3_DIMM1  Normal  34 ° C
    MEZZ_card_TEMP  Normal  Not Available
    Fan1    Normal  12400 RPM
    Fan2    Normal  12400 RPM
    Fan3    Normal  12400 RPM
    Fan4    Normal  12200 RPM
    Fan5    Normal  12200 RPM
    Fan6    Normal  12200 RPM
    Fan7    Normal  12200 RPM
    Fan8    Normal  12200 RPM
    PMB1Voltage Normal  12.1 Volts
    PMB1Current Normal  20.7 Amps
    PMB1Power   Normal  252 Watts
    PMB2Voltage Normal  Not Available
    PMB2Current Normal  Not Available
    PMB2Power   Normal  Not Available
    
  • Do you trim the SSDs in the ZFS pool on a regular basis? If not, run a manual trim.
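    (A minimal sketch, assuming an OpenZFS 0.8+ pool; "tank" below is a placeholder pool name:)

    # One-off manual TRIM of the whole pool
    zpool trim tank

    # Watch TRIM progress
    zpool status -t tank

    # Or let ZFS trim continuously in the background
    zpool set autotrim=on tank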

  • ralf Member
    edited August 2022

    @emg said:
    Mentally strong people run RAID 0 (striped) with many drives and no backup, right? :-o

    Not sure I'm mentally strong, but that does describe my last two desktop PCs (for personal use) as well as my work PC at a previous job. I guess "no backup" isn't quite true, as I use source control, so I wouldn't actually lose anything if they failed other than a couple of hours' work and the time to reinstall.

    Nowadays I've shifted to using NVMe drives, but for compilation-heavy projects striped disks make a massive difference, and it's worth the risk of a day's downtime once every few years in exchange for halving compilation times.
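    (For reference, that kind of scratch stripe is just plain mdadm RAID 0; a sketch, the device names and mount point are placeholders, and there is zero redundancy, so keep anything important elsewhere:)

    # Two-disk stripe for a throwaway build volume
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /build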

  • CyberCr33p Member
    edited August 2022

    What does the "zpool list" command show?
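    (For anyone landing here later: the interesting columns are capacity and fragmentation. A sketch, run against whatever the pool is called:)

    # CAP creeping towards 80-90% or a high FRAG value on SSD vdevs often explains sudden slowdowns
    zpool list -v

    # Quick health check; prints "all pools are healthy" when nothing is wrong
    zpool status -x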

  • yoursunny Member, IPv6 Advocate

    Hipster grills meats over overheating server chips, with production workload still running.

  • FlorinMarian Member, Host Rep
    Good evening!
    
    We are happy to announce that the problems that appeared approximately 48 hours ago on sv2.hazi.ro have been resolved. The server's performance has not only returned to normal but has actually improved, because all the KVM SSD servers that were on sv2.hazi.ro were moved to sv4.hazi.ro, where there are only enterprise SSDs.
    
    Thank you all for your patience and do not hesitate to ask for help if you find that you have problems.
    
    Yours, Florin.
    

    Proof of improvement:

    # ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## #
    #              Yet-Another-Bench-Script              #
    #                     v2022-08-20                    #
    # https://github.com/masonr/yet-another-bench-script #
    # ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## #
    
    Wed 31 Aug 2022 10:45:21 PM EEST
    
    Basic System Information:
    ---------------------------------
    Uptime     : 0 days, 0 hours, 1 minutes
    Processor  : Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
    CPU cores  : 64 @ 1200.000 MHz
    AES-NI     : ✔ Enabled
    VM-x/AMD-V : ✔ Enabled
    RAM        : 251.8 GiB
    Swap       : 0.0 KiB
    Disk       : 3.6 TiB
    Distro     : Debian GNU/Linux 11 (bullseye)
    Kernel     : 5.15.39-4-pve
    
    fio Disk Speed Tests (Mixed R/W 50/50):
    ---------------------------------
    Block Size | 4k            (IOPS) | 64k           (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 83.25 MB/s   (20.8k) | 2.85 GB/s    (44.5k)
    Write      | 83.47 MB/s   (20.8k) | 2.86 GB/s    (44.8k)
    Total      | 166.73 MB/s  (41.6k) | 5.72 GB/s    (89.4k)
               |                      |
    Block Size | 512k          (IOPS) | 1m            (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 3.27 GB/s     (6.3k) | 3.25 GB/s     (3.1k)
    Write      | 3.44 GB/s     (6.7k) | 3.46 GB/s     (3.3k)
    Total      | 6.72 GB/s    (13.1k) | 6.72 GB/s     (6.5k)
    
    iperf3 Network Speed Tests (IPv4):
    ---------------------------------
    Provider        | Location (Link)           | Send Speed      | Recv Speed
                    |                           |                 |
    Clouvider       | London, UK (10G)          | busy            | busy
    Online.net      | Paris, FR (10G)           | busy            | busy
    Hybula          | The Netherlands (40G)     | busy            | busy
    Uztelecom       | Tashkent, UZ (10G)        | busy            | busy
    Clouvider       | NYC, NY, US (10G)         | busy            | busy
    Clouvider       | Dallas, TX, US (10G)      | busy            | busy
    Clouvider       | Los Angeles, CA, US (10G) | busy            | busy
    
    Running GB5 benchmark test... *cue elevator music*
    Geekbench 5 Benchmark Test:
    ---------------------------------
    Test            | Value
                    |
    Single Core     | 738
    Multi Core      | 13495
    Full Test       | https://browser.geekbench.com/v5/cpu/16992170
    
  • jackb Member, Host Rep

    It's great that you fixed it but whoever finds the thread on Google 5 years from now will probably appreciate more detail

  • FlorinMarian Member, Host Rep

    @jackb said:
    It's great that you fixed it but whoever finds the thread on Google 5 years from now will probably appreciate more detail

    At the beginning of July, I played around a little in the BIOS and incorrectly adjusted the CPU's configurable performance/power-consumption parameters. With the settings I chose, I had created a bottleneck that led to a power draw of 500 W during a simple YABS benchmark, without the temperature rising much.
    I realized the problem by watching the power draw during a YABS run and was very surprised to see it repeatedly drop from 400 W to 10 W, when normally it would never fall below 250 W.
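    (For reference, this is the sort of reading meant above; a sketch assuming the BMC supports DCMI power readings:)

    # Instantaneous system power draw reported by the BMC
    ipmitool dcmi power reading

    # Or poll it during a benchmark run
    while true; do ipmitool dcmi power reading | grep Instantaneous; sleep 5; done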

  • TimboJones Member
    edited August 2022

    @FlorinMarian said:

    @jackb said:

    @FlorinMarian said:

    @jackb said:
    What model of ssds does the system have?

    Proxmox itself runs on a Samsung 860 EVO; customer VMs are on Samsung PM893 1.92TB drives.

    @cold said:
    do you guys here get paid for the support you offer for his hosting company ?

    But those like you who search on google for solutions to their problems and find solutions following the discussions of others, are they paid well?

    The PM893 probably won't get as bad, but the EVOs will drop to a crippling level of performance if you have allocated the whole drive and don't do TRIM passthrough (often not supported in software caching).

    To confirm or rule this out, assuming you've got a RAID configuration for your SSD caching: drop a single SSD from the array and secure erase it; once that's done, re-add it to the array. Then do the other drive.

    We have disk pools from 4 different categories:

    • 3TB HDDs
    • 4TB HDDs
    • EVO SSDs
    • Enterprise SSDs
      All are slow, and currently I suspect the "LSI SAS 9305-16i HBA - Full Height PCIe-x8 SAS Controller" card is to blame.

    Check the temperature of the LSI. They throttle if the controller is over 105°C or so. That can easily happen if there isn't airflow over the card.

    Edit: I see now you blame the BIOS settings. That doesn't correlate with when the problem started, but whatever. Given the 1200 MHz, probably still wrong.
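    (On the secure-erase suggestion quoted above, a rough sketch; /dev/sdX is a placeholder, the drive must already be out of the pool/array, and both approaches destroy its contents:)

    # Discard every block so the SSD controller can reset its flash mapping
    blkdiscard /dev/sdX

    # Or a classic ATA secure erase (the drive must not be "frozen"; check with hdparm -I /dev/sdX)
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX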

  • @amarc said:
    Why are you using ZFS with HW RAID (and how, to be honest?) when even ZFS advises against it?

    "HBA" is a tip-off that it's not in RAID.

    Thanked by 1Hxxx
  • @FlorinMarian said:

    @jackb said:
    It's great that you fixed it but whoever finds the thread on Google 5 years from now will probably appreciate more detail

    At the beginning of July, I played around a little in the BIOS and incorrectly adjusted the CPU's configurable performance/power-consumption parameters. With the settings I chose, I had created a bottleneck that led to a power draw of 500 W during a simple YABS benchmark, without the temperature rising much.
    I realized the problem by watching the power draw during a YABS run and was very surprised to see it repeatedly drop from 400 W to 10 W, when normally it would never fall below 250 W.

    self-inflicted wounds! glad you found the prob man.

    Thanked by 1FlorinMarian
  • FlorinMarian Member, Host Rep

    @TimboJones said:
    Edit: I see now you blame the BIOS settings. That doesn't correlate with when the problem started, but whatever. Given the 1200 MHz, probably still wrong.

    Maybe my brother @yoursunny can explain to you why 1200 MHz isn't wrong as long as it is in "schedutil" mode :smile:
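    (A sketch for checking or setting the governor on Linux; note this assumes the cpufreq driver actually exposes schedutil -- intel_pstate in active mode only offers performance/powersave:)

    # Current and available governors for core 0
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

    # Switch all cores to schedutil (or "performance" for a 24/7 virtualization host)
    cpupower frequency-set -g schedutil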

    Thanked by 1yoursunny
  • At the beginning, you gave "Processor : Common KVM processor", then Haswell, and only at the end host passthrough. If you don't let the CPU offload the tasks people actually run, AES-NI at the very least, you are going to hit a bottleneck. As a general rule I personally avoid all providers that don't pass host through. It's just gonna be a shit show.
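    (For the Proxmox setup discussed in this thread, a sketch; VM ID 100 is a placeholder:)

    # Expose the host CPU's real flags (AES-NI etc.) to the guest instead of "Common KVM processor"
    qm set 100 --cpu host

    # Verify inside the guest afterwards
    grep -m1 -o aes /proc/cpuinfo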

    Thanked by 1yoursunny
  • @FlorinMarian said:
    Maybe my brother @yoursunny can explain to you why 1200 MHz isn't wrong as long as it is in "schedutil" mode :smile:

    It's a 24/7 server. It doesn't matter; you're using the wrong CPU if you're manually throttling it. You should get the lower-power variants instead.

    (The CPU scores are garbage and in my experience, there's no reason to run a VM with a score less than 500 in 2022).

  • FlorinMarian Member, Host Rep

    @TimboJones said:
    It's a 24/7 server. It doesn't matter; you're using the wrong CPU if you're manually throttling it. You should get the lower-power variants instead.

    (The CPU scores are garbage and in my experience, there's no reason to run a VM with a score less than 500 in 2022).

    I totally agree with you; this is a normal benchmark result for this CPU model.

  • PulsedMedia Member, Patron Provider

    @FlorinMarian said: PCH_TEMP Normal 66 ° C

    Could be an issue. We had a motherboard model where this sensor was incorrectly offset: at a reported 66°C you would badly burn your finger on the chip, and the server would misbehave.

    The symptom was that it gets slower and slower over time; a reboot solves it temporarily and then it starts getting slower and slower again until the root cause is fixed. Disk I/O especially was affected by this.

    At about 75°C the server would outright crash, but the alarm temperature was set to 90 or 95°C in the Dell BIOS/firmware -- so it never reached alert thresholds.

    We had hundreds of nodes like this, and it took 1½ years to figure out the culprit. Tired of it, we simply attempted overkill cooling on the chipset, and just like that the issue was gone.

    Finnish retailers were immediately out of stock of 40mm fans, and even Mindfactory ran out for a while as we started ordering every single 40mm fan we could get our hands on :D

    That motherboard model had a dual chipset, so 2 fans per node. Hundreds and hundreds of fans later, the issues were gone.

    The original manufacturer of the motherboard, Tyan, used copper heatsinks for the chipset, but Dell Datacenter Services did some cost saving and used aluminium with terrible thermal paste. These chipsets must have been running at 110°C or more for prolonged periods, because the ABS mounting studs were all degraded; ABS starts to degrade at 110°C. Most of the pins would break from a gentle touch.

    It's amazing they worked with the original heatsinks as long as they did, considering the chipsets had to be running at 110+°C a lot of the time for that ABS degradation to happen. Amazing those chips did not outright fail.

    So physically go and check that PCH chip, swap the thermal paste, etc. Very cheap and fast to test.
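    (A quick way to watch that sensor over time from the OS; a sketch using the sensor name from the dump earlier in this thread:)

    # Log the PCH temperature once a minute and compare with finger-on-heatsink reality
    while true; do date +%T; ipmitool sdr get PCH_TEMP | grep 'Sensor Reading'; sleep 60; done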

    Thanked by 1FlorinMarian
  • PulsedMedia Member, Patron Provider

    Read the rest of the comments -> the off time and the reboot might've solved it temporarily. Issues caused by BIOS settings do not appear by themselves later on; they typically appear immediately.

    Did you try rebooting the server before that? If not, it's that reboot which solved it.

    It appears you did not fix the server itself, but moved customers to another node?

    Thanked by 1FlorinMarian
  • FlorinMarian Member, Host Rep

    @PulsedMedia said:
    Read the rest of the comments -> the off time and the reboot might've solved it temporarily. Issues caused by BIOS settings do not appear by themselves later on; they typically appear immediately.

    Did you try rebooting the server before that? If not, it's that reboot which solved it.

    It appears you did not fix the server itself, but moved customers to another node?

    Half of the customers were moved (all of them with SSD storage).
    Regarding the reboot: I rebooted the server multiple times, but the problem did not go away just from rebooting, nor even after waiting (it idled for over 6 hours and the benchmark results were still miserable).
    Thank you for your time!
