unexplained high load on openvz hostnode

Comments

  • rincewind Member
    edited October 2016

    https://github.com/brendangregg/perf-tools has a shell script called iosnoop that traces IO. Examples: https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt

    EDIT: You need at least kernel 3.2

    Thanked by 1vimalware
  • ztk Member
    edited October 2016

    @rincewind

    Yeah, we are running CentOS 6 on 2.6.32

    Still looking for suggestions if anyone has any ideas. The load average is sometimes 5 but can go as high as 40 without any clear cause

  • ztk Member

    bump. anyone have any suggestions?

    load average: 19.14, 16.15, 15.08

  • ztk Member

    willing to paypal $50 if anyone can figure this out

  • I can't help you, but I'm very interested in the outcome. Do you have a lot of VMs there? Maybe enter each one and take a look at htop / top to search for infected scripts, WP attacks, etc.

  • ztk Member

    @Hxxx said:
    I can't help you, but I'm very interested in the outcome. Do you have a lot of VMs there? Maybe enter each one and take a look at htop / top to search for infected scripts, WP attacks, etc.

    this wouldn't be a good idea from a privacy and legal point of view, and we have no interest in doing this either.

    OpenVZ uses a shared kernel, so we wouldn't need to run htop/top in each container; htop/top on the hostnode shows the same processes.
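
A rough sketch of how per-container load could be compared from the hostnode without entering any container. It assumes the installed vzlist supports the `laverage` output field and prints it as `1min/5min/15min`; the function name `busiest_containers` is made up for illustration.

```shell
#!/bin/sh
# Rank containers by 1-minute load average.
# Reads "CTID 1min/5min/15min" lines on stdin and prints "1min CTID",
# highest first. The optional argument limits the number of rows (default 5).
busiest_containers() {
    awk '{ split($2, la, "/"); print la[1], $1 }' | sort -rn | head -n "${1:-5}"
}

# Assumed usage on the hostnode (not run here):
#   vzlist -H -o ctid,laverage | busiest_containers 5
```

This only narrows down which containers look busy. It does not show what inside them is generating the IO.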

  • Systemtap might work on your system. You'll find sample scripts here:
    https://sourceware.org/systemtap/examples/keyword-index.html#PROFILING

    Look for ones labeled IO, like disktop, iodevstats, iotop. If you can't find the offending process or container by IO activity, then profile kernel/user functions - profiling/pf*.stp, thread-times.stp.

    Thanked by 1ztk
  • A few questions/thoughts:

    1. how does the (bad) machine compare with other good ones when you compare the buffered reads? 22MB/s seems pretty low unless that's because it's very busy at the time you checked. You can try to check at a time when the disk is (hopefully) not very busy but the comparison should give you a clue.

    2. dd matches the buffered reads which is again on the low side.

    3. The high load seems legit if the disk/IO is indeed slow, because everything ends up blocked waiting for IO to complete. On Linux, tasks in uninterruptible IO wait count toward the load average alongside the run queue, so the load climbs. As an example, if you try to write (a lot) to a USB 2.0 device, you'll see a similar effect: the device will eventually start to throttle and the load on your system will shoot up.

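    4. Can you check the SMART attributes on the drives (smartctl --all on each of /dev/sda and /dev/sdb)? The main ones to look for are 5 (reallocated sectors), 187 (reported uncorrectable errors), 197 (current pending sectors) and 198 (offline uncorrectable).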
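
As a minimal sketch of the check in point 4: the filter below pulls just those four attribute IDs out of a smartctl attribute table. It assumes the usual `smartctl -A` column layout, where the raw value is the tenth column; the function name `smart_key_attrs` is hypothetical.

```shell
#!/bin/sh
# Reduce a smartctl attribute table (read from stdin) to attributes
# 5, 187, 197 and 198. Prints: ID, attribute name, first raw-value token.
smart_key_attrs() {
    awk '$1 ~ /^(5|187|197|198)$/ { print $1, $2, $10 }'
}

# Assumed usage on the hostnode (needs root and smartmontools, not run here):
#   for d in /dev/sda /dev/sdb; do
#       echo "== $d"
#       smartctl -A "$d" | smart_key_attrs
#   done
```

Non-zero raw values on any of these are worth comparing against the known-good machines.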

    My hunch is that the IO is slow, which explains all the high load numbers, but WHY it is slow is the real question...

    Hopefully the above values and comparisons (against a good system that you have referenced) will at least provide some clarity.

    HTH.

  • ztk Member

    @nullnothere

    thanks for the reply.

    1. the others can do 100MB/s most of the time. the bad machine can also do this when it's not busy but it's almost always busy at 90%+ util per drive.

    2. yep

    3. the disk I/O isn't slow normally, something is causing it to be slow and that's what i'm trying to find out

    4. the smartctl results for both drives are in the OP

  • ztk Member

    furthermore this command causes loads of 30+:

    dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync; unlink test

    and the result is:

    16384+0 records in
    16384+0 records out
    1073741824 bytes (1.1 GB) copied, 33.6937 s, 31.9 MB/s
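
For reference, the reported rate can be recomputed from the figures in that output. This is only a sanity check on the arithmetic, assuming dd's decimal megabytes (1 MB = 1,000,000 bytes).

```shell
#!/bin/sh
# Recompute the throughput dd reported for bs=64k count=16k.
bytes=$((64 * 1024 * 16 * 1024))   # block size times block count
secs=33.6937                       # elapsed time taken from the dd output
rate=$(awk -v b="$bytes" -v s="$secs" 'BEGIN { printf "%.1f", b / s / 1000000 }')
echo "$bytes bytes in $secs s = $rate MB/s"
```

That is roughly a third of the 100MB/s the other machines manage.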
    
  • @ztk,

    So you've clarified that the drives are normally OK (and can do 100+ MB/s) but because the system WAS slow when you ran the dd, the IO throughput is poor.

    Also, I checked the smartctl logs (missed the link in the OP) for sda and it looks normal.

    A couple more thoughts:

    1) You mentioned earlier (in response to @AshleyUK), that swap isn't really being used. A quick way to confirm that swap isn't the culprit is to disable swap (no reboot required, but of course assuming you have enough cached RAM to take the hit - which you have mentioned is the case). You can then check and reenable swap.

    2) What is your /proc/sys/vm/swappiness value (just to make sure that it is not weird). Hopefully it is 60 or less.

    3) As many others have suggested, have you been able to find which process (or guest container) is doing a lot of consistent IO? In case it helps, dstat has a blocked-io plugin which should at least give you a continuous list of processes blocked on IO and maybe from them you'll get a clue.

    4) Based on what container is having a lot of io throughput, suspending the container for as little as a few minutes should give you a clue if things improve. Of course there may not be a single container solely responsible for the peak IO but hopefully it is a few that you can at least try to guess/isolate (based on the ploop file for eg) and this will allow you to confirm the issue (from the OP, the top 4 IO processes have a ploop file that you can use to probably identify container and try things).
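
If dstat is not at hand, point 3 can be approximated with plain ps. The sketch below lists tasks in uninterruptible sleep (state D), which usually means blocked on IO. The function name `blocked_tasks` is hypothetical, and the sampling loop in the comments is only one way to use it.

```shell
#!/bin/sh
# Keep only tasks whose state starts with D (uninterruptible sleep).
# Reads "STATE PID COMMAND" lines on stdin and prints "PID COMMAND".
blocked_tasks() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Assumed usage on the hostnode: sample once a second, then count how often
# each task shows up blocked (not run here):
#   for i in 1 2 3 4 5 6 7 8 9 10; do
#       ps -eo state=,pid=,comm= | blocked_tasks
#       sleep 1
#   done | sort | uniq -c | sort -rn
#
# For points 1 and 2 (run as root, given enough free RAM):
#   swapoff -a                     # disable swap, then watch the load
#   swapon -a                      # re-enable swap afterwards
#   cat /proc/sys/vm/swappiness
```

Tasks that appear in most samples are the ones to trace back to a container or a ploop file.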

    HTH.

  • ztk Member

    @nullnothere said: So you've clarified that the drives are normally OK (and can do 100+ MB/s) but because the system WAS slow when you ran the dd, the IO throughput is poor.

    well, I'm not sure anymore. When I ran the dd test the load average was 5-6, and it spiked to 30 after the test completed with a throughput of 31.9MB/s

    nullnothere said: 1) You mentioned earlier (in response to @AshleyUK), that swap isn't really being used. A quick way to confirm that swap isn't the culprit is to disable swap (no reboot required, but of course assuming you have enough cached RAM to take the hit - which you have mentioned is the case). You can then check and reenable swap.

    will try this but I do not think swap is the problem, we have the same swap setup everywhere else.

    nullnothere said: 2) What is your /proc/sys/vm/swappiness value (just to make sure that it is not weird). Hopefully it is 60 or less.

    yes, it's 60

    nullnothere said: 3) As many others have suggested, have you been able to find which process (or guest container) is doing a lot of consistent IO? In case it helps, dstat has a blocked-io plugin which should at least give you a continuous list of processes blocked on IO and maybe from them you'll get a clue.

    iotop shows an inflated I/O percentage for very low R/W throughput (hence the high load averages), which is confusing as it isn't really abuse. I will check dstat, thanks.

    nullnothere said: 4) Based on what container is having a lot of io throughput, suspending the container for as little as a few minutes should give you a clue if things improve. Of course there may not be a single container solely responsible for the peak IO but hopefully it is a few that you can at least try to guess/isolate (based on the ploop file for eg) and this will allow you to confirm the issue (from the OP, the top 4 IO processes have a ploop file that you can use to probably identify container and try things).

    as said above, we can see which ploop files are using the most I/O, but the percentage is really inflated because the R/W in bytes is low

    thanks for the suggestions!

  • We had a similar issue with one of our nodes. We found that ploop was causing it; after we converted all VPSes to simfs, the issue was resolved. However, I would recommend this only as a last resort, since ploop may not be the cause here and some VPSes may get corrupted during the conversion from ploop to simfs. You could also try adding an SSD cache to the node, if the DC allows you to install a 120 GB SSD, and see whether that resolves the issue.
