
Comments
https://github.com/brendangregg/perf-tools has a shell script called iosnoop that traces IO. Examples: https://github.com/brendangregg/perf-tools/blob/master/examples/iosnoop_example.txt
EDIT: You need at least kernel 3.2.
@rincewind
Yeah, we are running CentOS 6 on 2.6.32.
Still looking for suggestions if anyone has any ideas, the lavg is sometimes 5 but can go as high as 40 without any clear cause
bump. anyone have any suggestions?
load average: 19.14, 16.15, 15.08
Willing to PayPal $50 if anyone can figure this out.
I can't help you, but I'm highly interested in the outcome. Do you have a lot of VMs there? Maybe enter each one and take a look at htop / top in search of any infected script, WP attacks, etc.
this wouldn't be a good idea from a privacy and legal point of view, and we have no interest in doing this either.
openvz is shared kernel so we wouldn't need to htop/top in each container, htop/top on the hostnode would show the same thing.
Systemtap might work on your system. You'll find sample scripts here:
https://sourceware.org/systemtap/examples/keyword-index.html#PROFILING
Look for ones labeled IO, like disktop, iodevstats, iotop. If you can't find the offending process or container by IO activity, then profile kernel/user functions - profiling/pf*.stp, thread-times.stp.
A few questions/thoughts:
how does the (bad) machine compare with other good ones when you compare the buffered reads? 22MB/s seems pretty low unless that's because it's very busy at the time you checked. You can try to check at a time when the disk is (hopefully) not very busy but the comparison should give you a clue.
dd matches the buffered reads which is again on the low side.
The high load seems legit if the disk/IO is indeed slow, because everything is going to block on the IO to complete (eventually). On Linux the load average counts tasks blocked in uninterruptible IO wait as well as runnable ones, so the load climbs. As an example, if you try to write (a lot) to a USB (2.0) device, you'll see a similar effect because the USB device will eventually start to throttle and the load will start to shoot up on your system.
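Can you check the SMART attributes on the drives (smartctl --all, run once for /dev/sda and once for /dev/sdb, since smartctl takes one device per invocation)? The main ones to look for are 5, 187, 197, 198 (reallocated sectors, reported uncorrectable errors, current pending sectors, offline uncorrectable).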
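To make the attribute check above repeatable, here is a minimal Python sketch that picks IDs 5, 187, 197 and 198 out of `smartctl -A` style text and reports any with a non-zero raw value. The function names and the SAMPLE table are made up for illustration (the numbers are not from the OP's drives), and the parser assumes the usual ten-column ATA attribute layout, which may differ for some drives.

```python
import re

# SMART attribute IDs named in the post above.
WATCHED_IDS = {5, 187, 197, 198}


def parse_smart_attributes(text):
    """Return {attribute_id: (name, raw_value)} from `smartctl -A` style text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            # The raw value may carry trailing notes; keep its leading integer.
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[int(fields[0])] = (fields[1], int(match.group()))
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {i: attrs[i] for i in sorted(WATCHED_IDS)
            if i in attrs and attrs[i][1] > 0}


# Hypothetical sample output, for illustration only.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9876
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
```

Feeding it the saved output of smartctl for each drive and calling suspicious() on the result gives a short list to compare between the bad node and a good one. A non-zero raw value is a hint to look closer, not proof of a failing disk.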
My hunch is that the IO is slow, which explains all the high load numbers, but WHY it is slow is the real question...
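One way to test that hunch is to count the processes sitting in uninterruptible sleep (state D), since on Linux those count toward the load average while they wait on IO. The following is a rough Python sketch that reads /proc/<pid>/stat. The function names are my own, and it only looks at one stat file per process rather than per thread, so it can undercount compared with what the kernel feeds into the load average.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or could not be read
        if state == "D":
            blocked.append((pid, comm))
    return blocked
```

Calling blocked_tasks() every second or so during a load spike and comparing the count with the load average should show whether the spike is mostly tasks waiting on the disks. On an OpenVZ host node the list includes container processes too, because the kernel is shared.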
Hopefully the above values and compares (against a good system that you have referenced) will at least provide some clarity.
HTH.
@nullnothere
thanks for the reply.
the others can do 100MB/s most of the time. the bad machine can also do this when it's not busy but it's almost always busy at 90%+ util per drive.
yep
the disk I/O isn't slow normally, something is causing it to be slow and that's what i'm trying to find out
the smartctl results for both drives are in the OP
furthermore this command causes loads of 30+:
dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync; unlink test
and the result is:
@ztk,
So you've clarified that the drives are normally OK (and can do 100+ MB/s) but because the system WAS slow when you ran the dd, the IO throughput is poor.
Also, I checked the smartctl logs (missed the link in the OP) for sda and it looks normal.
A couple more thoughts:
1) You mentioned earlier (in response to @AshleyUK), that swap isn't really being used. A quick way to confirm that swap isn't the culprit is to disable swap (no reboot required, but of course assuming you have enough cached RAM to take the hit - which you have mentioned is the case). You can then check and reenable swap.
2) What is your /proc/sys/vm/swappiness value (just to make sure that it is not weird)? Hopefully it is 60 or less.
3) As many others have suggested, have you been able to find which process (or guest container) is doing a lot of consistent IO? In case it helps, dstat has a blocked-io plugin which should at least give you a continuous list of processes blocked on IO and maybe from them you'll get a clue.
4) Once you know which container has a lot of IO throughput, suspending it for as little as a few minutes should tell you whether things improve. Of course there may not be a single container solely responsible for the peak IO, but hopefully it is a few that you can at least try to guess/isolate (based on the ploop file, for example), and this will allow you to confirm the issue. From the OP, the top 4 IO processes each have a ploop file that you can probably use to identify the container and try things.
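For points 1) and 2), a read-only check can come before actually disabling swap. This is a small Python sketch, assuming the standard /proc/meminfo layout; the helper names are hypothetical and it only reports numbers, it changes nothing on the node.

```python
def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo style text (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kb(info):
    """Swap currently in use, in kB (0 when no swap is configured)."""
    return info.get("SwapTotal", 0) - info.get("SwapFree", 0)


def read_text(path):
    """Read a small text file such as those under /proc."""
    with open(path) as fh:
        return fh.read()
```

On the node itself, swap_used_kb(parse_meminfo(read_text("/proc/meminfo"))) gives the swap in use and int(read_text("/proc/sys/vm/swappiness")) gives the swappiness value. If swap use stays near zero while the load spikes, swap is probably not the culprit.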
HTH.
Well, I'm not sure anymore. When I did the dd test the load average was 5-6, and it spiked to 30 after the test completed with a throughput of 31.9MB/s.
will try this but I do not think swap is the problem, we have the same swap setup everywhere else.
yes, it's 60
iotop shows inflated I/O for very low amounts of R/W (and thus high load averages) which is confusing as it isn't really abuse. I will check dstat thanks.
as said above, we can see the highest ploop files using the most I/O but it is a really inflated percent count because the R/W in bytes is low
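A high IO percentage with few bytes moved may simply mean lots of small, seek-heavy requests: as far as I know, the IO column in iotop is the share of time a process spent waiting on IO, not a throughput figure. The Python sketch below compares two /proc/diskstats snapshots and derives IOPS, throughput, average request size and utilisation for one device. The function names are my own, and it assumes the classic field order with 512-byte sectors.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors of 512 bytes


def parse_diskstats(text):
    """Return {device: counters} from /proc/diskstats style text."""
    stats = {}
    for line in text.splitlines():
        f = line.split()
        if len(f) < 14:
            continue  # skip short or malformed lines
        stats[f[2]] = {
            "ios": int(f[3]) + int(f[7]),      # reads + writes completed
            "sectors": int(f[5]) + int(f[9]),  # sectors read + written
            "busy_ms": int(f[12]),             # time spent doing IO
        }
    return stats


def io_profile(before, after, interval_s):
    """Summarise one device between two snapshots taken interval_s apart."""
    ios = after["ios"] - before["ios"]
    nbytes = (after["sectors"] - before["sectors"]) * SECTOR_BYTES
    busy_ms = after["busy_ms"] - before["busy_ms"]
    return {
        "iops": ios / interval_s,
        "mb_per_s": nbytes / interval_s / 1e6,
        "avg_request_kb": (nbytes / ios / 1024) if ios else 0.0,
        "util_pct": 100.0 * busy_ms / (interval_s * 1000),
    }
```

Take one snapshot of /proc/diskstats, sleep for the interval, take another, and pass the entries for the same device to io_profile(). A drive near 90%+ utilisation with an average request of only a few kB would point at random small IO rather than raw throughput abuse, which fits the low R/W byte counts seen here.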
thanks for the suggestions!
We had a similar issue with one of our nodes. We found that ploop was causing it, and after changing all VPSes to simfs the issues were resolved. However, I would recommend this only if you have no other options, as ploop may not be the cause here and some VPSes may get corrupted during the change from ploop to simfs. You could also try installing an SSD cache on the node and see if the issue gets resolved, if the DC allows you to install 1x 120 GB SSD.