Production server | What the heck?


Comments

  • @FlorinMarian said: 4. I checked the temperature of the processors, nothing abnormal (below 60 degrees Celsius)

    Not sure what chassis you're using, but check the inlet temperature. Even if the CPU temps are fine, if the inlet temp on Dells is over 40 degrees C, the CPU will be throttled heavily.
    Also check the actual CPU frequencies to see whether they're within the expected range; if they're not, that indicates throttling or something wrong with the CPU itself.
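    (A quick way to sanity-check both from the OS and the BMC; a rough sketch, and the sensor names grepped for below are just the usual suspects, they vary by vendor:)

    # Per-core clocks right now; if they sit far below the CPU's base clock under load, something is throttling
    grep 'cpu MHz' /proc/cpuinfo | sort -t: -k2 -n | head

    # Governor, driver and hardware frequency limits
    cpupower frequency-info

    # Inlet/outlet/PCH temperatures from the BMC
    ipmitool sdr type Temperature | grep -Ei 'inlet|outlet|pch'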

  • FlorinMarian Member, Host Rep

    @PulsedMedia said:
    Sounds a bit like some chip is overheating, likely the motherboard chipset or LSI controller. Are you able to turn off the host for, say, 30 minutes to let it cool down? Do you have local on-site access?

    I would remove the LSI and motherboard chipset heatsinks, replace paste and try again then.

    We've seen a lot of peculiar performance issues when a motherboard chipset overheats, but I don't recall an issue with LSI HBAs needing repasting.

    Then again, it might just be ZFS. We've had nothing but trouble with ZFS, to be honest. Though we gave it another try just a few weeks ago and had some very interesting and unexpected benchmark results. Perhaps ZFS is finally maturing.

     Sensor Name    Status  Current Reading
    SYS_PWR_Monitor All deasserted  0x8000
    SystemEvent All deasserted  0x8000
    SEL State   All deasserted  0x8000
    Watchdog    All deasserted  0x8000
    ME_PWR_Status   All deasserted  0x8000
    NMI_State   All deasserted  0x8000
    CPU_PROC_HOT    All deasserted  0x8000
    P12V    Normal  11.9 Volts
    P5V Normal  4.991 Volts
    P3V3    Normal  3.322 Volts
    P3V3_AUX    Normal  3.266 Volts
    P5V_AUX Normal  4.945 Volts
    P1V5_PCH    Normal  1.491 Volts
    P1V05_PCH   Normal  1.041 Volts
    P1V05_PCH_STBY  Normal  1.05 Volts
    PVDDQ_AB    Normal  1.24 Volts
    PVDDQ_CD    Normal  1.231 Volts
    PVDDQ_EF    Normal  1.24 Volts
    PVDDQ_GH    Normal  1.231 Volts
    PVCCIN_CPU0 Normal  1.813 Volts
    PVCCIN_CPU1 Normal  1.822 Volts
    PVCCIO  Normal  1.058 Volts
    PVPP_AB Normal  2.657 Volts
    PVPP_CD Normal  2.633 Volts
    PVPP_EF Normal  2.633 Volts
    PVPP_GH Normal  2.645 Volts
    CPU_0_DTS_TEMP  Normal  -35
    CPU_1_DTS_TEMP  Normal  -43
    PCH_TEMP    Normal  66 ° C
    AMB_Sensor_TEMP Normal  22 ° C
    MB_Inlet_TEMP   Normal  32 ° C
    MB_Outlet_TEMP  Normal  60 ° C
    DIMM_Inlet_TEMP Normal  30 ° C
    CPU0_CH0_DIMM0  Normal  40 ° C
    CPU0_CH0_DIMM1  Normal  41 ° C
    CPU0_CH1_DIMM0  Normal  43 ° C
    CPU0_CH1_DIMM1  Normal  45 ° C
    CPU0_CH2_DIMM0  Normal  42 ° C
    CPU0_CH2_DIMM1  Normal  42 ° C
    CPU0_CH3_DIMM0  Normal  42 ° C
    CPU0_CH3_DIMM1  Normal  46 ° C
    CPU1_CH0_DIMM0  Normal  33 ° C
    CPU1_CH0_DIMM1  Normal  34 ° C
    CPU1_CH1_DIMM0  Normal  33 ° C
    CPU1_CH1_DIMM1  Normal  35 ° C
    CPU1_CH2_DIMM0  Normal  35 ° C
    CPU1_CH2_DIMM1  Normal  33 ° C
    CPU1_CH3_DIMM0  Normal  34 ° C
    CPU1_CH3_DIMM1  Normal  34 ° C
    MEZZ_card_TEMP  Normal  Not Available
    Fan1    Normal  12400 RPM
    Fan2    Normal  12400 RPM
    Fan3    Normal  12400 RPM
    Fan4    Normal  12200 RPM
    Fan5    Normal  12200 RPM
    Fan6    Normal  12200 RPM
    Fan7    Normal  12200 RPM
    Fan8    Normal  12200 RPM
    PMB1Voltage Normal  12.1 Volts
    PMB1Current Normal  20.7 Amps
    PMB1Power   Normal  252 Watts
    PMB2Voltage Normal  Not Available
    PMB2Current Normal  Not Available
    PMB2Power   Normal  Not Available
    
  • Do you trim the SSDs in the ZFS pool on a regular basis? If not, run a manual trim.
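    (A minimal sketch, assuming an OpenZFS 0.8+ pool; "tank" below is a placeholder pool name:)

    # One-off manual TRIM of the whole pool
    zpool trim tank

    # Watch TRIM progress
    zpool status -t tank

    # Or let ZFS trim continuously in the background
    zpool set autotrim=on tank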

  • ralf Member
    edited August 2022

    @emg said:
    Mentally strong people run RAID 0 (striped) with many drives and no backup, right? :-o

    Not sure I'm mentally strong, but that does describe my last two desktop PCs (for personal use) as well as my work PC at a previous job. I guess "no backup" isn't quite true, as I use source control, so I wouldn't actually lose anything if they failed other than a couple of hours' work and the time to reinstall.

    Nowadays I've shifted to using NVMe drives, but for compilation-heavy projects striped disks make a massive difference, and it's worth the risk of a day's downtime once every few years in exchange for halving compilation times.
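    (For reference, that kind of scratch stripe is just plain mdadm RAID 0; a sketch, the device names and mount point are placeholders, and there is zero redundancy, so keep anything important elsewhere:)

    # Two-disk stripe for a throwaway build volume
    mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
    mkfs.ext4 /dev/md0
    mount /dev/md0 /build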

  • CyberCr33p Member
    edited August 2022

    What does the "zpool list" command show?
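    (For anyone landing here later: the interesting columns are capacity and fragmentation. A sketch, run against whatever the pool is called:)

    # CAP creeping towards 80-90% or a high FRAG value on SSD vdevs often explains sudden slowdowns
    zpool list -v

    # Quick health check; prints "all pools are healthy" when nothing is wrong
    zpool status -x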

  • yoursunny Member, IPv6 Advocate

    Hipster grills meats over overheating server chips, with production workload still running.

  • FlorinMarian Member, Host Rep
    Good evening!
    
    We are happy to announce that the problems that appeared approximately 48 hours ago on sv2.hazi.ro have been resolved. The server's performance has not only returned to normal but has actually improved, because all the KVM SSD servers that were on sv2.hazi.ro were moved to sv4.hazi.ro, where there are only enterprise SSDs.
    
    Thank you all for your patience and do not hesitate to ask for help if you find that you have problems.
    
    Yours, Florin.
    

    Proof of improvement:

    # ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## #
    #              Yet-Another-Bench-Script              #
    #                     v2022-08-20                    #
    # https://github.com/masonr/yet-another-bench-script #
    # ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## ## #
    
    Wed 31 Aug 2022 10:45:21 PM EEST
    
    Basic System Information:
    ---------------------------------
    Uptime     : 0 days, 0 hours, 1 minutes
    Processor  : Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
    CPU cores  : 64 @ 1200.000 MHz
    AES-NI     : ✔ Enabled
    VM-x/AMD-V : ✔ Enabled
    RAM        : 251.8 GiB
    Swap       : 0.0 KiB
    Disk       : 3.6 TiB
    Distro     : Debian GNU/Linux 11 (bullseye)
    Kernel     : 5.15.39-4-pve
    
    fio Disk Speed Tests (Mixed R/W 50/50):
    ---------------------------------
    Block Size | 4k            (IOPS) | 64k           (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 83.25 MB/s   (20.8k) | 2.85 GB/s    (44.5k)
    Write      | 83.47 MB/s   (20.8k) | 2.86 GB/s    (44.8k)
    Total      | 166.73 MB/s  (41.6k) | 5.72 GB/s    (89.4k)
               |                      |
    Block Size | 512k          (IOPS) | 1m            (IOPS)
      ------   | ---            ----  | ----           ----
    Read       | 3.27 GB/s     (6.3k) | 3.25 GB/s     (3.1k)
    Write      | 3.44 GB/s     (6.7k) | 3.46 GB/s     (3.3k)
    Total      | 6.72 GB/s    (13.1k) | 6.72 GB/s     (6.5k)
    
    iperf3 Network Speed Tests (IPv4):
    ---------------------------------
    Provider        | Location (Link)           | Send Speed      | Recv Speed
                    |                           |                 |
    Clouvider       | London, UK (10G)          | busy            | busy
    Online.net      | Paris, FR (10G)           | busy            | busy
    Hybula          | The Netherlands (40G)     | busy            | busy
    Uztelecom       | Tashkent, UZ (10G)        | busy            | busy
    Clouvider       | NYC, NY, US (10G)         | busy            | busy
    Clouvider       | Dallas, TX, US (10G)      | busy            | busy
    Clouvider       | Los Angeles, CA, US (10G) | busy            | busy
    
    Running GB5 benchmark test... *cue elevator music*
    Geekbench 5 Benchmark Test:
    ---------------------------------
    Test            | Value
                    |
    Single Core     | 738
    Multi Core      | 13495
    Full Test       | https://browser.geekbench.com/v5/cpu/16992170
    
  • jackb Member, Host Rep

    It's great that you fixed it but whoever finds the thread on Google 5 years from now will probably appreciate more detail

  • FlorinMarian Member, Host Rep

    @jackb said:
    It's great that you fixed it but whoever finds the thread on Google 5 years from now will probably appreciate more detail

    At the beginning of July, I played around a little in the BIOS and incorrectly adjusted the CPU's configurable performance/power-consumption parameters. With the settings I chose, I had created a bottleneck that led to a power draw of 500 W during a simple YABS benchmark, without the temperature rising much.
    I realized the problem by watching the power draw during a YABS run and was very surprised to see it repeatedly drop from 400 W to 10 W, when normally it would never fall below 250 W.
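    (For reference, this is the sort of reading meant above; a sketch assuming the BMC supports DCMI power readings:)

    # Instantaneous system power draw reported by the BMC
    ipmitool dcmi power reading

    # Or poll it during a benchmark run
    while true; do ipmitool dcmi power reading | grep Instantaneous; sleep 5; done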

  • TimboJones Member
    edited August 2022

    @FlorinMarian said:

    @jackb said:

    @FlorinMarian said:

    @jackb said:
    What model of ssds does the system have?

    Proxmox itself runs on a Samsung 860 EVO; customer VMs are on Samsung PM893 1.92TB drives.

    @cold said:
    do you guys here get paid for the support you offer for his hosting company ?

    But those like you who search on google for solutions to their problems and find solutions following the discussions of others, are they paid well?

    The PM893 probably won't get as bad, but the EVOs will drop to a crippling level of performance if you have allocated the whole drive and don't do TRIM passthrough (often not supported in software caching).

    To confirm or rule this out, assuming you've got a RAID configuration for your SSD caching: drop a single SSD from the array and secure erase it; once that's done, re-add it to the array. Then do the other drive.

    We have disk pools from 4 different categories:

    • 3TB HDDs
    • 4TB HDDs
    • EVO SSDs
    • Enterprise SSDs
      All are slow, and currently I suspect the "LSI SAS 9305-16i HBA - Full Height PCIe-x8 SAS Controller" card is to blame.

    Check the temperature of the LSI. They throttle if the controller is over 105°C or so. That can easily happen if there isn't airflow over the card.

    Edit: I see now you blame the BIOS settings. That doesn't correlate with when the problem started, but whatever. Given the 1200 MHz, probably still wrong.
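    (On the secure-erase suggestion quoted above, a rough sketch; /dev/sdX is a placeholder, the drive must already be out of the pool/array, and both approaches destroy its contents:)

    # Discard every block so the SSD controller can reset its flash mapping
    blkdiscard /dev/sdX

    # Or a classic ATA secure erase (the drive must not be "frozen"; check with hdparm -I /dev/sdX)
    hdparm --user-master u --security-set-pass p /dev/sdX
    hdparm --user-master u --security-erase p /dev/sdX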

  • @amarc said:
    Why are you using ZFS with HW RAID (and how, to be honest?) when even ZFS advises against it?

    "HBA" is a tip-off that it's not in RAID.

    Thanked by 1Hxxx
  • @FlorinMarian said:

    @jackb said:
    It's great that you fixed it but whoever finds the thread on Google 5 years from now will probably appreciate more detail

    At the beginning of July, I played around a little in the BIOS and incorrectly adjusted the CPU's configurable performance/power-consumption parameters. With the settings I chose, I had created a bottleneck that led to a power draw of 500 W during a simple YABS benchmark, without the temperature rising much.
    I realized the problem by watching the power draw during a YABS run and was very surprised to see it repeatedly drop from 400 W to 10 W, when normally it would never fall below 250 W.

    self-inflicted wounds! glad you found the prob man.

    Thanked by 1FlorinMarian
  • FlorinMarian Member, Host Rep

    @TimboJones said:
    Edit: I see now you blame the BIOS settings. That doesn't correlate with when the problem started, but whatever. Given the 1200 MHz, probably still wrong.

    Maybe my brother @yoursunny can explain to you why 1200 MHz isn't wrong as long as it is in "schedutil" mode :smile:
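    (A sketch for checking or setting the governor on Linux; note this assumes the cpufreq driver actually exposes schedutil -- intel_pstate in active mode only offers performance/powersave:)

    # Current and available governors for core 0
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

    # Switch all cores to schedutil (or "performance" for a 24/7 virtualization host)
    cpupower frequency-set -g schedutil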

    Thanked by 1yoursunny
  • At the beginning, you gave "Processor : Common KVM processor", then Haswell, and only at the end host passthrough. If you don't let the CPU offload the tasks people actually run, AES-NI at the very least, you are going to hit a bottleneck. As a general rule I personally avoid all providers that don't pass host through. It's just gonna be a shit show.
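    (For the Proxmox setup discussed in this thread, a sketch; VM ID 100 is a placeholder:)

    # Expose the host CPU's real flags (AES-NI etc.) to the guest instead of "Common KVM processor"
    qm set 100 --cpu host

    # Verify inside the guest afterwards
    grep -m1 -o aes /proc/cpuinfo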

    Thanked by 1yoursunny
  • @FlorinMarian said:
    Maybe my brother @yoursunny can explain to you why 1200 MHz isn't wrong as long as it is in "schedutil" mode :smile:

    It's a 24/7 server. It doesn't matter; you're using the wrong CPU if you're manually throttling it. You should get the lower-power variants instead.

    (The CPU scores are garbage and in my experience, there's no reason to run a VM with a score less than 500 in 2022).

  • FlorinMarian Member, Host Rep

    @TimboJones said:
    It's a 24/7 server. It doesn't matter; you're using the wrong CPU if you're manually throttling it. You should get the lower-power variants instead.

    (The CPU scores are garbage and in my experience, there's no reason to run a VM with a score less than 500 in 2022).

    I totally agree with you; this is a normal benchmark result for this CPU model.

  • PulsedMedia Member, Patron Provider

    @FlorinMarian said: PCH_TEMP Normal 66 ° C

    Could be an issue. We had a motherboard model where this sensor was incorrectly offset: at a reported 66°C you would badly burn your finger on the chip, and the server would misbehave.

    The symptom was that it gets slower and slower over time; a reboot solves it temporarily and then it starts getting slower and slower again until the root cause is fixed. Disk I/O especially was affected by this.

    At about 75°C the server would outright crash, but the alarm temperature was set to 90 or 95°C in the Dell BIOS/firmware -- so it never reached alert thresholds.

    We had hundreds of nodes like this, and it took 1½ years to figure out the culprit. Tired of it, we simply attempted overkill cooling on the chipset, and just like that the issue was gone.

    Finnish retailers were immediately out of stock of 40mm fans, and even Mindfactory ran out for a while as we started ordering every single 40mm fan we could get our hands on :D

    That motherboard model had a dual chipset, so 2 fans per node. Hundreds and hundreds of fans later, the issues were gone.

    The original manufacturer of the motherboard, Tyan, used copper heatsinks for the chipset, but Dell Datacenter Services did some cost saving and used aluminium with terrible thermal paste. These chipsets must have been running at 110°C or more for prolonged periods, because the ABS mounting studs were all degraded; ABS starts to degrade at 110°C. Most of the pins would break from a gentle touch.

    It's amazing they worked with the original heatsinks as long as they did, considering the chipsets had to be running at 110+°C a lot of the time for that ABS degradation to happen. Amazing those chips did not outright fail.

    So physically go and check that PCH chip, swap the thermal paste, etc. Very cheap and fast to test.
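    (A quick way to watch that sensor over time from the OS; a sketch using the sensor name from the dump earlier in this thread:)

    # Log the PCH temperature once a minute and compare with finger-on-heatsink reality
    while true; do date +%T; ipmitool sdr get PCH_TEMP | grep 'Sensor Reading'; sleep 60; done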

    Thanked by 1FlorinMarian
  • PulsedMedia Member, Patron Provider

    Read the rest of the comments -> the off time and the reboot might've solved it temporarily. Issues caused by BIOS settings do not appear by themselves later on; they typically appear immediately.

    Did you try rebooting the server before that? If not, it's that reboot which solved it.

    It appears you did not fix the server itself, but moved customers to another node?

    Thanked by 1FlorinMarian
  • FlorinMarian Member, Host Rep

    @PulsedMedia said:
    Read the rest of the comments -> the off time and the reboot might've solved it temporarily. Issues caused by BIOS settings do not appear by themselves later on; they typically appear immediately.

    Did you try rebooting the server before that? If not, it's that reboot which solved it.

    It appears you did not fix the server itself, but moved customers to another node?

    Half of the customers were moved (all of them with SSD storage).
    Regarding the reboot: I rebooted the server multiple times, but the problem did not go away just from rebooting, nor even after waiting (it idled for over 6 hours and the benchmark results were still miserable).
    Thank you for your time!
