Production server | What the heck?

FlorinMarian Member, Host Rep

A new day, a new post.

I woke up this morning to 5 tickets telling me that the KVM servers were performing deplorably, and I still have to find out why (see the screenshots).



The big problem is that I was at the office (at work) all day and then traveled by train, so I will only get back home in the morning, in about 10 hours.

I know it is not ethical, professional, inspirational, or a good look, but in order to solve the situation as quickly as possible tomorrow, here is what I have already done without managing to find the source of the problem (the rough commands are sketched after the list):
1. I tested the RAM (nothing abnormal)
2. I checked the values in "smartctl" for each individual disk, there is no error on any of them
3. I checked the health status reported by proxmox for each ZFS pool, nothing abnormal there either.
4. I checked the temperature of the processors, nothing abnormal (below 60 degrees Celsius)
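For reference, this is roughly what checks 1-4 look like from the shell (a sketch, not the exact commands I ran; device names and the memtester size are placeholders):

    # 1. in-OS RAM test (a full memtest86+ pass needs a reboot)
    memtester 4G 1

    # 2. SMART attributes and overall health per disk
    for d in /dev/sd?; do echo "== $d"; smartctl -H -A "$d"; done

    # 3. ZFS pool health as Proxmox sees it
    zpool status -x
    zpool list

    # 4. CPU and package temperatures (needs lm-sensors)
    sensors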

Any idea is welcome. (I am trying to impact customer services as little as possible, since the situation is already critical.)

Thank you!


Comments

  • Hxxx Member

    Did you look at the list of processes?

  • yoursunny Member, IPv6 Advocate

    powersave strikes back.

  • FlorinMarian Member, Host Rep

    @Hxxx said:
    Did you look at the list of processes?

    Yes, nothing abnormal.
    I found that most of the VMs increased their usage at the same time, going from idle to 50-100%. I tried to find a pattern, such as a specific ZFS pool, but there wasn't one.

    @yoursunny said:
    powersave strikes back.

    The "performance" governor is already on.
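    (For reference, the governor can be double-checked on the host like this; a sketch using the standard cpufreq sysfs path:)

    cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | sort | uniq -c
    cpupower frequency-info | grep -i governor    # needs the linux-cpupower package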

  • Hxxx Member

    How is the IO overall?

  • FlorinMarian Member, Host Rep

    @Hxxx said:
    How is the IO overall?

    Quite low.
    Under 20Mbps according to iotop.
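    (For reference, per-pool throughput and latency can also be pulled straight from ZFS; a sketch:)

    zpool iostat -v 2     # throughput per pool and per vdev, every 2 seconds
    zpool iostat -l 2     # adds request latency columns (OpenZFS 0.8+)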

  • Hxxx Member

    Looking for info on the Proxmox forum, one solution was to restart pvestatd and udev daily.

    Others just attributed the issue to CPU hardware; replacing the CPUs fixed the problem.
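    (For completeness, a sketch of what that would look like on a Proxmox node; the cron entry is only an illustration, not a recommendation:)

    systemctl restart pvestatd systemd-udevd

    # optional daily restart at 04:00 via a cron.d file
    echo '0 4 * * * root systemctl restart pvestatd systemd-udevd' > /etc/cron.d/restart-pvestatd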

  • FlorinMarian Member, Host Rep

    @Hxxx said:
    Looking for info on the Proxmox forum, one solution was to restart pvestatd and udev daily.

    Others just attributed the issue to CPU hardware; replacing the CPUs fixed the problem.

    We had a node reboot today, but it didn't change anything.
    Tomorrow, when I get home, I'll take care of it.
    Thank you for your time!

  • Hxxx Member

    Is this a soft RAID? It could be a storage issue, maybe something not detected with smartctl, for example soft RAID degradation or syncing. If I'm not mistaken, soft RAID can use the CPU intensively, especially with HDDs.

  • FlorinMarian Member, Host Rep
    edited August 2022

    @Hxxx said:
    Is this a soft RAID? It could be a storage issue, maybe something not detected with smartctl, for example soft RAID degradation or syncing. If I'm not mistaken, soft RAID can use the CPU intensively, especially with HDDs.

    Good morning!
    With all VMs turned off, I discovered a rather odd thing: with only one VM turned on (regardless of whether it sits on an SSD or HDD pool), the benchmark no longer shows a few GB/s but no more than 400 MB/s read/write. In conclusion, for some reason the server no longer uses the ZFS cache (ARC), even though its minimum/maximum values remained the same (several tens of GB).
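    (Whether the ARC is actually being used can be checked on the host along these lines; a sketch assuming stock OpenZFS paths:)

    # current ARC size versus its configured min/max, in MiB
    awk '/^(size|c_min|c_max)[[:space:]]/ {printf "%-8s %.0f MiB\n", $1, $3/1048576}' /proc/spl/kstat/zfs/arcstats

    # module limits currently in force
    cat /sys/module/zfs/parameters/zfs_arc_min /sys/module/zfs/parameters/zfs_arc_max

    # fuller report (ships with zfsutils-linux)
    arc_summary | head -n 40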

    SSD pool:

    curl -sL yabs.sh | bash
    # ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## #
    #              Yet-Another-Bench-Script              #
    #                     v2022-08-20                    #
    # https://github.com/masonr/yet-another-bench-script #
    # ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## #
    
    miercuri 31 august 2022, 06:28:16 +0000
    
    Basic System Information:
    ---------------------------------
    Uptime     : 0 days, 0 hours, 17 minutes
    Processor  : Common KVM processor
    CPU cores  : 2 @ 2299.998 MHz
    AES-NI     : ❌ Disabled
    VM-x/AMD-V : ❌ Disabled
    RAM        : 1.8 GiB
    Swap       : 0.0 KiB
    Disk       : 40.0 GiB
    Distro     : CentOS Linux 7 (Core)
    Kernel     : 3.10.0-1160.71.1.el7.x86_64
    
    fio Disk Speed Tests (Mixed R/W 50/50):
    ---------------------------------
    Block Size | 4k            (IOPS) | 64k           (IOPS)
      ------   | ---            ----  | ----           ---- 
    Read       | 90.94 MB/s   (22.7k) | 159.98 MB/s   (2.4k)
    Write      | 91.18 MB/s   (22.7k) | 160.82 MB/s   (2.5k)
    Total      | 182.13 MB/s  (45.5k) | 320.81 MB/s   (5.0k)
               |                      |                     
    Block Size | 512k          (IOPS) | 1m            (IOPS)
      ------   | ---            ----  | ----           ---- 
    Read       | 121.49 MB/s    (237) | 108.20 MB/s    (105)
    Write      | 127.94 MB/s    (249) | 115.41 MB/s    (112)
    Total      | 249.43 MB/s    (486) | 223.62 MB/s    (217)
    
    iperf3 Network Speed Tests (IPv4):
    ---------------------------------
    Provider        | Location (Link)           | Send Speed      | Recv Speed     
                    |                           |                 |                
    Clouvider       | London, UK (10G)          | 650 Mbits/sec   | 886 Mbits/sec  
    Online.net      | Paris, FR (10G)           | busy            | 857 Mbits/sec  
    Hybula          | The Netherlands (40G)     | 836 Mbits/sec   | 882 Mbits/sec  
    Uztelecom       | Tashkent, UZ (10G)        | 611 Mbits/sec   | 510 Mbits/sec  
    Clouvider       | NYC, NY, US (10G)         | 658 Mbits/sec   | 294 Mbits/sec  
    Clouvider       | Dallas, TX, US (10G)      | 577 Mbits/sec   | 607 Mbits/sec  
    Clouvider       | Los Angeles, CA, US (10G) | 615 Mbits/sec   | 662 Mbits/sec  
    
    Geekbench 5 Benchmark Test:
    ---------------------------------
    Test            | Value                         
                    |                               
    Single Core     | 248                           
    Multi Core      | 439                           
    Full Test       | https://browser.geekbench.com/v5/cpu/16979355
    

    HDD pool:

  • jackb Member, Host Rep

    What model of SSDs does the system have?

  • cold Member

    Do you guys here get paid for the support you offer for his hosting company?

  • FlorinMarian Member, Host Rep

    @jackb said:
    What model of SSDs does the system have?

    Proxmox itself runs on a Samsung 860 EVO and customer VMs on Samsung PM893 1.92TB drives.

    @cold said:
    Do you guys here get paid for the support you offer for his hosting company?

    And those like you, who search Google for solutions to their problems and find them by following other people's discussions, are they paid well?

    Thanked by: Madcityservers
  • @cold said:
    Do you guys here get paid for the support you offer for his hosting company?

    It's the provider asking for guidance. I'm also interested in what could cause this, so it's a good topic.

  • Did you look at each virtual machine's graph to see if any of them started using a lot of CPU a few days ago (week graph)?

  • FlorinMarian Member, Host Rep

    @Harmony said:
    Did you look at each virtual machine's graph to see if any of them started using a lot of CPU a few days ago (week graph)?

    The whole problem appeared around 01:30 UTC+3, a day ago.
    Among the affected machines (more than 50% are affected) is one of my own git servers, which was idling both before and after the problem appeared, yet its CPU consumption suddenly and permanently reached 50-100%, without any processes consuming those resources.
    After running the same benchmark on the same VM, it is clear to me that the storage is 20-30 times slower; the ZFS cache no longer has an effect, but even without it I have too few IOPS (several hundred, while this is the only VM consuming storage).

  • jackb Member, Host Rep
    edited August 2022

    @FlorinMarian said:

    @jackb said:
    What model of SSDs does the system have?

    Proxmox itself runs on a Samsung 860 EVO and customer VMs on Samsung PM893 1.92TB drives.

    @cold said:
    Do you guys here get paid for the support you offer for his hosting company?

    And those like you, who search Google for solutions to their problems and find them by following other people's discussions, are they paid well?

    The PM893s probably won't degrade as badly, but the EVOs will drop to a crippling level of performance if you have allocated the whole drive and don't do TRIM passthrough (often not supported in software caching).

    To confirm or rule this out, assuming you've got a RAID configuration for your SSD caching, drop a single SSD from the array and secure erase it; once it's done, re-add it to the array. Then do the other drive.
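    (Concretely, for a ZFS mirror that could look something like the sketch below; pool and device names are placeholders, and blkdiscard is the gentler alternative to a full ATA secure erase:)

    # take one SSD out of the pool
    zpool offline tank /dev/disk/by-id/ata-Samsung_SSD_860_EVO_XXXX

    # option A: whole-device TRIM, usually enough to restore write performance
    blkdiscard /dev/disk/by-id/ata-Samsung_SSD_860_EVO_XXXX

    # option B: ATA secure erase (drive must not be "frozen"; check with hdparm -I)
    hdparm --user-master u --security-set-pass p /dev/disk/by-id/ata-Samsung_SSD_860_EVO_XXXX
    hdparm --user-master u --security-erase p /dev/disk/by-id/ata-Samsung_SSD_860_EVO_XXXX

    # put it back, let it resilver, then repeat with the other drive
    zpool replace tank /dev/disk/by-id/ata-Samsung_SSD_860_EVO_XXXX
    zpool status tank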

  • Levi Member

    Clearly someone has launched YABS on cron. On a serious note: are you deploying ZFS? If yes, read the documentation on proper debugging. Probably a scrub? Out of cache? It is your job to do.

    Thanked by: yoursunny
  • FlorinMarian Member, Host Rep

    @jackb said:

    @FlorinMarian said:

    @jackb said:
    What model of SSDs does the system have?

    Proxmox itself runs on a Samsung 860 EVO and customer VMs on Samsung PM893 1.92TB drives.

    @cold said:
    Do you guys here get paid for the support you offer for his hosting company?

    And those like you, who search Google for solutions to their problems and find them by following other people's discussions, are they paid well?

    The PM893s probably won't degrade as badly, but the EVOs will drop to a crippling level of performance if you have allocated the whole drive and don't do TRIM passthrough (often not supported in software caching).

    To confirm or rule this out, assuming you've got a RAID configuration for your SSD caching, drop a single SSD from the array and secure erase it; once it's done, re-add it to the array. Then do the other drive.

    We have disk pools from 4 different categories:

    • 3TB HDDs
    • 4TB HDDs
    • EVO SSDs
    • Enterprise SSDs

    All are slow, and currently I suspect the problem may be with the "LSI SAS 9305-16i HBA - Full Height PCIe-x8 SAS Controller" card.
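    (Things that can be checked remotely for the HBA, as a sketch; mpt3sas is the driver this controller family uses, the rest are generic paths:)

    # driver messages: resets, timeouts, faults
    dmesg -T | grep -iE 'mpt3sas|sas.*(reset|timeout|fault)'

    # negotiated link rate of every phy behind the controller
    grep . /sys/class/sas_phy/*/negotiated_linkrate

    # per-disk error logs
    for d in /dev/sd?; do echo "== $d"; smartctl -l error "$d"; done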
  • FlorinMarian Member, Host Rep
    edited August 2022

    @LTniger said:
    Clearly someone has launched YABS on cron. On a serious note: are you deploying ZFS? If yes, read the documentation on proper debugging. Probably a scrub? Out of cache? It is your job to do.

    No VMs running, freshly rebooted, 99% of RAM free, and we don't get more than 450 MB/s read/write via ZFS, whereas before we used to reach 10 GB/s with the same configuration.
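    (The scan line in zpool status shows whether a scrub or resilver is in flight; a quick sketch, pool name is a placeholder:)

    zpool status | grep -A 2 'scan:'
    zpool scrub -s tank    # stops an in-flight scrub on that pool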

  • ralf Member
    edited August 2022

    @FlorinMarian said:

    @Harmony said:
    Did you look at each virtual machine's graph to see if any of them started using a lot of CPU a few days ago (week graph)?

    The whole problem appeared around 01:30 UTC+3, a day ago.
    Among the affected machines (more than 50% are affected) is one of my own git servers, which was idling both before and after the problem appeared, yet its CPU consumption suddenly and permanently reached 50-100%, without any processes consuming those resources.

    Hmmmm, that actually sounds a little like the situation I have on my home router, which is the only place I run Proxmox. About every 4-5 days one of the Linux guests at random starts consuming 100% of its CPU; it still responds to pings but I can't ssh into it any longer, presumably because of the 100% CPU usage. I've tried all the combinations of host type, video type, and HDD type I can find, but it still happens all the time. Interestingly, the pfSense, which is a BSD guest, doesn't have these same issues (and yes, I've tried the exact same config as that for my Linux guests and they still failed with the 100% CPU).

    I've not had this issue with VMs anywhere else, but I use qemu/libvirt directly for all of those, so my suspicion is that it's some weird proxmox bug, but as it also uses qemu/libvirt under the hood, I'm not sure what it could be. So, I then put it down to just a hardware issue (this is the only VM host I have running on an Intel Atom) and live with rebooting those guests whenever I notice one is stuck.

    But it might be worth checking if you're up-to-date with proxmox patches and/or try downgrading if you've recently upgraded it. My proxmox that I'm having issues with has been running just under 2 months.

    After running the same benchmark on the same VM, it is clear to me that the storage is 20-30 times slower; the ZFS cache no longer has an effect, but even without it I have too few IOPS (several hundred, while this is the only VM consuming storage).

    I've read that ZFS, whilst being incredibly good in the normal case, can be quite hard to debug when something goes wrong. If you're new to ZFS, it's almost certainly worth considering it as the main suspect and trying to find out exactly what it's doing.

  • cold Member

    @FlorinMarian said:

    @jackb said:
    What model of SSDs does the system have?

    Proxmox itself runs on a Samsung 860 EVO and customer VMs on Samsung PM893 1.92TB drives.

    @cold said:
    Do you guys here get paid for the support you offer for his hosting company?

    And those like you, who search Google for solutions to their problems and find them by following other people's discussions, are they paid well?

    I mostly ask support if I have a problem...

  • This is the best time to buy a Proxmox subscription.

    Thanked by: Erisa
  • Check iptraf-ng to see if anyone is running a UDP flood.

  • FlorinMarian Member, Host Rep

    @lowendclient said:
    Check iptraf-ng to see if anyone is running a UDP flood.

    No UDP ports are open with all VMs stopped, and iotop doesn't report any abnormal read/write activity either.

  • amarc Veteran

    Why are you using ZFS with HW RAID (and how, to be honest?) when even ZFS advises against it?

  • yoursunny Member, IPv6 Advocate

    Mentally strong people disable RAID and use separate ext4 partition on each disk.

  • FlorinMarian Member, Host Rep

    @amarc said:
    Why are you using ZFS with HW RAID (and how, to be honest?) when even ZFS advises against it?

    I don't use a HW RAID configuration.

  • emg Veteran

    Mentally strong people run RAID 0 (striped) with many drives and no backup, right? :-o

    Thanked by: yoursunny
  • PulsedMedia Member, Patron Provider

    @Hxxx said:
    Is this a soft RAID? It could be a storage issue, maybe something not detected with smartctl, for example soft RAID degradation or syncing. If I'm not mistaken, soft RAID can use the CPU intensively, especially with HDDs.

    IOWAIT increases if there is a storage issue; here the IO delay (IOWait) has remained low.
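    (IO wait is easy to read from the host; a sketch using sysstat/procps tools:)

    iostat -x 2 3    # %iowait plus per-device await/util
    vmstat 2 3       # the "wa" column is the same number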

  • PulsedMedia Member, Patron Provider

    Sounds a bit like some chip is overheating, likely the motherboard chipset or the LSI controller. Are you able to turn off the host for, say, 30 minutes to let it cool down? Do you have local on-site access?

    I would remove the LSI and motherboard chipset heatsinks, replace the paste, and try again.

    We've seen a lot of peculiar performance issues when a motherboard chipset overheats, but I don't recall an issue with LSI HBAs needing repasting.

    Then again, it might just be ZFS. We've had nothing but trouble with ZFS, to be honest. Though we gave it another try just a few weeks ago and had some very interesting and unexpected benchmark results. Perhaps ZFS is finally maturing.

    Thanked by: FlorinMarian, ralf