unexplained high load on openvz hostnode

ztk Member
edited October 2016 in Help

hi lowendtalk,

I have an issue with one of my openvz hostnodes. I figured someone here could help me out as I've been scratching my head over this.

the dedicated server has 2 drives in software RAID1 (mdadm), and atop reports them as constantly busy, with load averages that fluctuate and sometimes reach 20-30. There is no high read/write throughput while this is happening, which is why I'm confused. smartctl reports both disks as PASSED.

 

load average: 32.37, 21.31, 14.76

 

There is a raid check going on but this also happens when there are no checks:

Personalities : [raid1]
md0 : active raid1 sdb1[2] sda1[3]
      511936 blocks super 1.0 [2/2] [UU]

md2 : active raid1 sdb3[2] sda3[3]
      1944481792 blocks super 1.1 [2/2] [UU]
      [========>............]  check = 44.8% (872383040/1944481792) finish=26473.8min speed=674K/sec
      bitmap: 9/15 pages [36KB], 65536KB chunk

md1 : active raid1 sdb2[2] sda2[3]
      8380416 blocks super 1.1 [2/2] [UU]
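
For reference, the finish estimate on the md2 line follows from the other numbers on that line. A minimal sketch of the arithmetic, assuming /proc/mdstat counts blocks in 1 KiB units and reports speed in KiB per second (the helper name is made up for illustration):

```python
def check_eta_minutes(done_blocks: int, total_blocks: int, speed_kib_s: float) -> float:
    """Minutes left for an md check, with blocks taken as 1 KiB units."""
    remaining_kib = total_blocks - done_blocks
    return remaining_kib / speed_kib_s / 60

# Values copied from the md2 line above.
eta_min = check_eta_minutes(872383040, 1944481792, 674)
eta_days = eta_min / (60 * 24)
print(f"{eta_min:.0f} min (~{eta_days:.1f} days)")
```

That comes out within a fraction of a percent of the 26473.8 min mdstat printed (the displayed speed is rounded), or roughly 18 days at the current rate, so the check is getting very little disk time.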


dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync; unlink test
16384+0 records in
16384+0 records out
1073741824 bytes (1.1 GB) copied, 47.2936 s, 22.7 MB/s




atop:

DSK |          sdb | busy     97% | read     738 | write    996 | avio 5.49 ms |
DSK |          sda | busy     95% | read     717 | write   1007 | avio 5.41 ms |
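
One way to read the two DSK lines: the busy percentage is roughly the number of requests multiplied by the average time per request, divided by the sampling interval. A rough sketch of that arithmetic, assuming atop's default 10-second interval (the interval is not shown in the output, so that part is an assumption, and the function name is invented here):

```python
INTERVAL_S = 10  # assumed: atop's default sampling interval

def disk_summary(reads: int, writes: int, avio_ms: float, interval_s: float = INTERVAL_S):
    """Return (requests per second, estimated busy fraction) for one DSK line."""
    requests = reads + writes
    iops = requests / interval_s
    busy = requests * avio_ms / 1000 / interval_s
    return iops, busy

for name, reads, writes, avio_ms in [("sdb", 738, 996, 5.49), ("sda", 717, 1007, 5.41)]:
    iops, busy = disk_summary(reads, writes, avio_ms)
    print(f"{name}: {iops:.0f} req/s, ~{busy:.0%} busy")
```

Under that assumption both drives land at around 170 requests per second and 93-95% busy, close to what atop printed. About 5 ms per request is seek-time territory for a spinning disk, so the drives can be nearly saturated by small scattered requests while moving very little data.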



hdparm -t:

/dev/sda:
 Timing buffered disk reads:  60 MB in  3.18 seconds =  18.89 MB/sec

/dev/sdb:
 Timing buffered disk reads:  52 MB in  3.14 seconds =  16.56 MB/sec

The drives are 2TB western digital RE4s.

SDA: http://termbin.com/2z5g

SDB: http://termbin.com/phcam

iotop:

Total DISK READ: 0.00 B/s | Total DISK WRITE: 1048.89 K/s
TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
341821 idle root          0.00 B     64.00 K  0.00 % 44.91 % [jbd2/ploop23591]
1004348 be/3 root          0.00 B     44.00 K  0.00 % 31.19 % [jbd2/p~op29098]
 5229 idle root          0.00 B     44.00 K  0.00 % 20.07 % [jbd2/ploop31426]
 5194 idle root          0.00 B      8.00 K  0.00 % 15.73 % [jbd2/ploop13867]
  972 be/3 root          0.00 B    116.00 K  0.00 % 15.48 % [jbd2/md2-8]
 2292 be/3 root          0.00 B     68.00 K  0.00 %  7.82 % auditd
543312 be/3 root          0.00 B     44.00 K  0.00 %  7.54 % [jbd2/ploop47686]
13863 be/3 root          0.00 B      4.00 K  0.00 %  6.52 % [jbd2/ploop52010]
 4520 be/3 root          0.00 B      0.00 B  0.00 %  6.23 % [jbd2/ploop45534]
729122 be/4 7796          0.00 B     64.00 K  0.00 %  5.87 % qmail-send
 4195 idle root          0.00 B     92.00 K  0.00 %  5.74 % [jbd2/ploop56464]
 5114 be/3 root          0.00 B     44.00 K  0.00 %  5.74 % [jbd2/ploop58038]
353581 be/4 110           8.00 K     24.00 K  0.00 %  5.71 % mysqld -~ort=3306
618746 be/3 root          0.00 B      0.00 B  0.00 %  5.23 % [jbd2/ploop17859]

top:

Cpu(s): 8.3%us, 2.9%sy, 0.0%ni, 70.4%id, 18.1%wa, 0.0%hi, 0.3%si, 0.0%s

 

is this normal? any pointers or assistance is appreciated.


Comments

  • MikeA Member, Patron Provider

    Sounds like a disk problem? Install sysstat, then run iostat (iostat -k -h -n 5 maybe?). Or try running htop and sorting processes by state so that those in state "D" show at the top.

    Could be numerous things.
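
A note on why state "D" matters: on Linux the load average counts tasks in uninterruptible sleep (usually waiting on disk) as well as runnable ones, so a host can show a high load while its CPUs are mostly idle. If htop is not handy, a minimal sketch that lists such processes straight from /proc (Linux only; the function names are made up for this example):

```python
import glob

def parse_stat(line: str):
    """Split a /proc/<pid>/stat line into (pid, command, state)."""
    # The command sits in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state

def d_state_processes():
    """Return (pid, command) for every process currently in state D."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were reading, or odd line
        if state == "D":
            found.append((pid, comm))
    return sorted(found)

if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

This only looks at processes, not the individual threads under /proc/<pid>/task, and it is a single snapshot, so it is worth running it a few times while the load is high.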

  • ztk Member
    edited October 2016

    @MikeA said:
    Sounds like a disk problem? Install sysstat, then run iostat (iostat -k -h -n 5 maybe?). Or try running htop and sorting processes by state so that those in state "D" show at the top.

    Could be numerous things.

    thanks for the reply. yeah I presumed it was disk. what i'm trying to find out is why i'm having this issue only on this machine and not others.

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               9.12    0.00    3.55    5.32    0.00   82.01
    
    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda             177.00   111.00  679.00   53.00 112776.00  1426.00   156.01     5.17    7.12    5.03   33.85   1.21  88.70
    sdb             192.00   117.00  666.00   47.00 113152.00  1426.00   160.70    12.07   16.99   15.26   41.53   1.40 100.10
    md1               0.00     0.00    1.00    0.00     8.00     0.00     8.00     0.00    0.00    0.00    0.00   0.00   0.00
    md2               0.00     0.00    0.00  158.00     0.00  1392.00     8.81     0.00    0.00    0.00    0.00   0.00   0.00
    md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    
    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
               8.99    0.00    4.35    4.81    0.00   81.84
    
    Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
    sda             245.00    79.00  388.00   52.00 81152.00  1064.00   186.85     5.93   13.64   12.53   21.94   2.08  91.60
    sdb             290.00    87.00  340.00   44.00 80768.00  1064.00   213.10     7.14   18.78   17.01   32.43   2.60  99.90
    md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
    md2               0.00     0.00    0.00  245.00     0.00  2056.00     8.39     0.00    0.00    0.00    0.00   0.00   0.00
    md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
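
To put the sector counts in more familiar units, here is a small sketch of the conversion, assuming iostat's usual 512-byte sectors. The figures are copied from the first sample; the helper names are invented for this example.

```python
SECTOR_BYTES = 512  # iostat reports rsec/s and wsec/s in 512-byte sectors

def mib_per_s(sectors_per_s: float) -> float:
    """Convert a sectors-per-second figure to MiB per second."""
    return sectors_per_s * SECTOR_BYTES / 2**20

def avg_request_kib(sectors_per_s: float, requests_per_s: float) -> float:
    """Average size of one request in KiB."""
    return sectors_per_s * SECTOR_BYTES / requests_per_s / 1024

# First sample: sda reads, and md2 writes.
print(f"sda reads:  {mib_per_s(112776.00):.1f} MiB/s")
print(f"md2 writes: {mib_per_s(1392.00):.2f} MiB/s, "
      f"{avg_request_kib(1392.00, 158.00):.1f} KiB per write")
```

On these numbers the member disks are reading about 55 MiB/s while md2 itself shows no reads, which would fit the RAID check (check traffic goes to the member disks rather than through the md device). The writes reaching md2 average only about 4.4 KiB each, which is in line with the jbd2 journal threads at the top of the iotop list in the first post. Both points are inferences from the figures, not something confirmed in the thread.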
    
  • seriesn

    Can you please run htop?

  • ztk Member

    @seriesn said:
    Can you please run htop?

    sure. what are you looking for specifically?

    there's only about 3-4 processes in D state

  • seriesn

    @ztk said:
    sure. what are you looking for specifically?

    there's only about 3-4 processes in D state

    If you have compiled htop with OpenVZ support, you should be able to see container-specific usage. Just one additional step for troubleshooting.

  • ztk Member

    @seriesn said:
    If you have compiled htop with OpenVZ support, you should be able to see container-specific usage. Just one additional step for troubleshooting.

    I've attempted to compile the latest htop tarball with the required dev tools but with no luck. How did you manage it?

  • MikeA Member, Patron Provider

    @ztk said:
    I've attempted to compile the latest htop tarball with the required dev tools but with no luck. How did you manage it?

    yum/apt-get install htop

    On CentOS you might need epel-release installed.

  • ztk Member

    @MikeA

    I'm aware of installing the binary from repo, I was referring to compiling it with openvz support as suggested by @seriesn

  • MikeA Member, Patron Provider

    @ztk said:
    @MikeA

    I'm aware of installing the binary from repo, I was referring to compiling it with openvz support as suggested by @seriesn

    Oh sorry, I should read. (I had a feeling I was answering a question that was too obvious)

    Thanked by 1: ztk
  • ztk Member

    @MikeA said: Oh sorry, I should read. (I had a feeling I was answering a question that was too obvious)

    no worries.

    still looking for suggestions on how to determine the cause of this high load issue.

  • AshleyUk

    This can sometimes be down to a single VM, for example one with 1 core assigned, maxing out its CPU to the point that the VM's load is in the 30s; with OpenVZ this passes through to the host.

    Have you checked the per-VM load when the host is sitting in the 30s?

  • ztk Member

    @AshleyUk said:
    This can sometimes be down to a single VM, for example one with 1 core assigned, maxing out its CPU to the point that the VM's load is in the 30s; with OpenVZ this passes through to the host.

    Have you checked the per-VM load when the host is sitting in the 30s?

    these are the 3 highest containers:

          CTID         LAVERAGE      NPROC
           519   1.86/2.06/2.09        417
           401   1.38/0.95/0.83         62
           496   1.07/0.78/0.79        179
    

    doesn't seem like anything that would cause 20-30 load avg on the host

    plus the CPUs are 80% idle as shown in the OP

  • ztk Member
    edited October 2016

    if anyone is interested, this is the command I used to list each container's load average and process count, sorted by highest load:

    vzlist -o vpsid,laverage,numproc -s -laverage

  • AshleyUk

    Have you tried pausing/cancelling the RAID check, waiting around 30 minutes, and then comparing the VM load values against the host's?

    Have you checked RAM use? If the VMs are eating heavily into swap, that will increase the load, especially during heavy I/O from the RAID check.

    Have you tried a full reboot of the node as a last resort? Obviously during a scheduled window.

  • ztk Member
    edited October 2016

    @AshleyUk said: Have you tried pausing/cancelling the RAID check, waiting around 30 minutes, and then comparing the VM load values against the host's?

    It's purely a coincidence that I'm posting while the RAID check is going on; I've seen it at 10-20 load avg while the array was healthy and no check was running.

    @AshleyUk said: Have you checked RAM use? If the VMs are eating heavily into swap, that will increase the load, especially during heavy I/O from the RAID check.

    RAM is mostly cache, and it doesn't look like swap is fully utilized:

                 total       used       free     shared    buffers     cached
    Mem:           70G        69G       1.1G       380M       9.8G        49G
    -/+ buffers/cache:        10G        60G
    Swap:         8.0G       5.8G       2.2G

    @AshleyUk said: Have you tried a full reboot of the node as a last resort? Obviously during a scheduled window.

    after a reboot the loads are 100+ until all the containers are booted, then it's back to 10-20. right now it's at 11-12.

  • rincewind

    Try Sysdig. Also run "perf top" (Perf); you might be able to recognize the system calls taking up CPU.

    Thanked by 1: vimalware
  • ztk Member

    @rincewind said:
    Try Sysdig. Also run "perf top" (Perf); you might be able to recognize the system calls taking up CPU.

    I highly doubt it's CPU causing this as the CPUs are 70-80% idle. This looks like iowait to me, and I cannot find the cause of it.

  • rincewind

    I know. You typically want to track down the kernel code path that is causing problems and identify the device driver: is it your RAID driver, ext4, etc.? Most kernels are sufficiently instrumented that you can guess the problem from the system call trace taken over time. If it's still hard to pin down, record some traces and generate a flame graph.

    Thanked by 1: vimalware
  • AnthonySmith Member, Host Rep

    seems to me more like a container or process is generating a huge amount of IOPS, which is why mdadm is crawling along and your sequential speed is so low.

    Thanked by 1: ztk
  • ztk Member

    @AnthonySmith said:
    seems to me more like a container or process is generating a huge amount of IOPS which is why mdadm is crawling along and your sequential speed is so low.

    yeah, this is a good assumption. it does look like one container might be causing a lot of writes or reads from the array.

    do you know the best way of tracking the number of IOPS on a per container basis?

  • ztk Member

    @rincewind said:
    I know. You typically want to track down the kernel code path that is causing problems and identify the device driver: is it your RAID driver, ext4, etc.? Most kernels are sufficiently instrumented that you can guess the problem from the system call trace taken over time. If it's still hard to pin down, record some traces and generate a flame graph.

    sounds a bit too complicated for me unfortunately. any guides on how to operate these tools to get the results I'm looking for?

  • AnthonySmith Member, Host Rep

    atop -d and vzpid is a good start
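
For a per-container total rather than per-process numbers, the two sources can also be combined by hand. The sketch below is only an approximation under stated assumptions: it assumes an OpenVZ kernel that exposes the container ID as an envID line in /proc/<pid>/status, and that per-process I/O accounting is readable in /proc/<pid>/io (root is needed for other users' processes). It counts bytes rather than IOPS, misses short-lived processes, and all function names are hypothetical.

```python
import glob
import time
from collections import defaultdict

def parse_fields(text: str) -> dict:
    """Parse 'key: value' lines as found in /proc/<pid>/status and /proc/<pid>/io."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def snapshot() -> dict:
    """Map /proc/<pid> -> (ctid, read_bytes, write_bytes) for readable processes."""
    snap = {}
    for status_path in glob.glob("/proc/[0-9]*/status"):
        proc_dir = status_path.rsplit("/", 1)[0]
        try:
            with open(status_path) as fh:
                status = parse_fields(fh.read())
            with open(proc_dir + "/io") as fh:
                io = parse_fields(fh.read())
            # envID is an assumption: OpenVZ kernels add it, mainline does not.
            ctid = int(status.get("envID", 0))
            snap[proc_dir] = (ctid, int(io["read_bytes"]), int(io["write_bytes"]))
        except (OSError, KeyError, ValueError):
            continue  # process gone, no permission, or fields missing
    return snap

def per_container_delta(before: dict, after: dict) -> dict:
    """Sum read/write byte deltas per container for processes in both snapshots."""
    totals = defaultdict(lambda: [0, 0])
    for key, (ctid, rd, wr) in after.items():
        if key in before:
            totals[ctid][0] += rd - before[key][1]
            totals[ctid][1] += wr - before[key][2]
    return {ctid: tuple(v) for ctid, v in totals.items()}

def report(interval_s: float = 5.0) -> None:
    """Print containers ordered by bytes read plus written during the interval."""
    first = snapshot()
    time.sleep(interval_s)
    deltas = per_container_delta(first, snapshot())
    for ctid, (rd, wr) in sorted(deltas.items(), key=lambda kv: -sum(kv[1])):
        print(f"CT {ctid}: read {rd} B, wrote {wr} B in {interval_s:g} s")
```

Calling report() on the host node prints one line per container. I/O issued by kernel threads, such as the jbd2 journal threads, would likely show up under container 0 (the host) here, so a busy journal can still hide behind a quiet-looking container.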

  • miamiconsultant Member, Host Rep

    ztk said: sounds a bit too complicated for me unfortunately, any guides on how to operate these tools to get the results i'm looking for?

    the binary htop should have openvz support, you just need to learn to add columns (like CTID and disk) and sort.

    atop looks cool too, you can do the -d switch as @anthonysmith said or just hit 'd' once you are in the tool.

  • ztk Member

    RAID check is finished:

    Personalities : [raid1]
    md0 : active raid1 sdb1[2] sda1[3]
          511936 blocks super 1.0 [2/2] [UU]
    
    md2 : active raid1 sdb3[2] sda3[3]
          1944481792 blocks super 1.1 [2/2] [UU]
          bitmap: 8/15 pages [32KB], 65536KB chunk
    
    md1 : active raid1 sdb2[2] sda2[3]
          8380416 blocks super 1.1 [2/2] [UU]
    
    unused devices: <none>
    

    Load:

    load average: 11.15, 13.31, 14.14

  • ztk Member

    @AnthonySmith said:
    atop -d and vzpid is a good start

    I have been using these already but no particular process seems abusive.

    @miamiconsultant said: atop looks cool too, you can do the -d switch as @anthonysmith said or just hit 'd' once you are in the tool.

    yep, I have been using this already coupled with vzpid to find the CTID of the process.

  • ztk Member
    edited October 2016

    @miamiconsultant said: the binary htop should have openvz support, you just need to learn to add columns (like CTID and disk) and sort.

    Just added CTID and disk R/W columns but nothing over 1MB/s is coming up after sorting it by I/O.

  • rincewind

    Install perf (for Ubuntu, I think it's the 'linux-tools' package).
    Run perf top. Watch the results for some time, maybe you will see a pattern.

    I haven't used Sysdig, but it has a GUI and container support. The installation is a bit intrusive and installs DKMS (Dynamic Kernel Module Support).

  • ztk Member

    @rincewind said:
    Install perf (for Ubuntu, I think it's the 'linux-tools' package).
    Run perf top. Watch the results for some time, maybe you will see a pattern.

    I haven't used Sysdig, but it has a GUI and container support. The installation is a bit intrusive and installs DKMS (Dynamic Kernel Module Support).

    tried installing the CentOS perf binary, and running it just spews errors all over the place

  • rincewind Member
    edited October 2016

    Maybe a version mismatch between kernel and perf. Perf source code is part of the linux kernel repo.

    Does CentOS have multiple (versioned) packages for perf?

    EDIT: perf is unstable for Linux kernel 2.6.x and CentOS 6.x
