Unable to Find Cause of Random CPU Spikes
Hey,
I’ve been trying to debug this problem for almost a week now. This is a cPanel server with CloudLinux and LiteSpeed. Randomly, the server load climbs above 200 (mostly around 250) and the server becomes unresponsive.
Things I’ve tried so far:
Disabling swap
Making sure everything’s up to date
Scanning accounts for malware/exploits (even with LVE enabled)
Things I’m considering trying:
Disabling the CloudLinux OOM killer
Trying kdump
The problem is that even when the load is high, there are no processes using a significant amount of CPU, so I suspect high I/O. Here is the monitor log from one such event:
!-------------------------------------------- top 50
top - 04:28:10 up 3 days, 1:59, 1 user, load average: 131.37, 139.55, 102.51
Tasks: 1206 total, 270 running, 934 sleeping, 1 stopped, 1 zombie
%Cpu(s): 17.1 us, 80.1 sy, 1.1 ni, 0.0 id, 0.2 wa, 0.0 hi, 1.5 si, 0.0 st
KiB Mem : 65768876 total, 1533392 free, 19082896 used, 45152588 buff/cache
KiB Swap: 10491896 total, 7776 free, 10484120 used. 34913172 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10245 mysql 26 6 18.7g 4.4g 3344 S 106.5 7.0 2586:18 /usr/sbin/mysqld --daemonize --pid-file=/var/run/mysqld/mys$
17068 root 20 0 271208 56792 1532 R 50.2 0.1 57:06.76 cxswatch - scanning
192 root 20 0 0 0 0 R 42.9 0.0 102:07.40 [kswapd0]
9242 mongod 20 0 1325612 249392 2772 S 37.7 0.4 367:50.58 /usr/local/jetapps/usr/bin/mongod --quiet -f /usr/local/jet$
28994 digikrea 20 0 338700 24912 3444 R 36.4 0.0 0:01.29 lsphp:/home/someacc/public_html/index.php
193 root 20 0 0 0 0 R 34.2 0.0 112:06.32 [kswapd1]
17067 root 20 0 272140 58868 1432 R 33.8 0.1 57:32.19 cxswatch - scanning
IOTOP:
!------------------------------------- iotop -b -n 3
Total DISK READ : 184.97 M/s | Total DISK WRITE : 2016.29 K/s
Actual DISK READ: 98.64 M/s | Actual DISK WRITE: 24.53 M/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO COMMAND
27604 be/4 anotheracc 0.00 B/s 0.00 B/s 0.00 % 96.25 % lsphp:/home/anotheracc/public_html/wp-login.php
28277 be/4 lyricsa1 20.52 K/s 0.00 B/s 0.00 % 53.80 % lsphp:e/anotheracc/public_html/anotheracc/index.php
28520 ?err {none} 182.79 K/s 0.00 B/s 0.00 % 10.18 % {no such process}
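(For reference, a load in the hundreds with no obvious CPU hogs is often driven by processes stuck in uninterruptible sleep on I/O or swap. A sketch of one way to catch them during a spike, not taken from the original logs:)

```shell
# Processes in D state (uninterruptible sleep) count toward the load
# average without burning CPU time; the wchan column hints at which
# kernel wait they are stuck in.
ps -eo state,pid,user,wchan:30,cmd --sort=state | awk '$1 == "D"'
```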
I have noticed several cron jobs and PHP (index.php) processes running when such an event occurred (hundreds of them), but the cron jobs are configured to run every x minutes in various accounts and have been like that for months.
Any pointers?
Comments
try to set up atop and take snapshots of the processes ... maybe that helps
https://haydenjames.io/use-atop-linux-server-performance-analysis/
The server has a cron job which logs this and some other details to a file every minute. I'll try this as well. Thanks!
It looks like it is caused by memory hogging and intensive swapping. Try increasing the amount of RAM on your server.
100% swap utilization! Your disks must be getting hit really hard.
Check the top values of … and … and use them to track down swap usage, for instance.
Peak swap use is 100%, but peak mem is only 32%. Looks like a memory leak. Is your system updated, especially MongoDB and MySQL? Also, does
cat /proc/sys/vm/swappiness
return a sane value (in the 30 to 40 range)? Don't disable swap or the OOM killer.

The memory spike is correlated with network usage increasing. Is this some remote backup script? A buggy backup script that takes forever to run would just eat up memory and free it only on termination.
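On the swappiness point: as a sketch, checking and adjusting it looks like this (the value 30 is illustrative, not a recommendation):

```shell
# Print the current swappiness (0-100; lower means the kernel prefers
# dropping page cache over swapping out anonymous memory).
cat /proc/sys/vm/swappiness

# Apply a new value immediately (not persistent across reboots):
# sysctl -w vm.swappiness=30
# To persist it, add "vm.swappiness = 30" to /etc/sysctl.conf.
```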
If there's a regular pattern in the memory spiking, it could be a cron job. That makes the culprit easier to pin down.
It’s possible your hosting provider is generating backups or performing some intensive tasks on the host node, which is resulting in these performance issues.
I've completely disabled swap for now. Let's see if this is a memory problem, because normally RAM usage stays around ~25%, and even during this spike it was around 24%, but swap, which is normally 0.x%, peaked at 52%. For some reason either swap is being preferred or swappiness is not being followed? It's not as if the server was running OOM and so started preferring swap - RAM was stable throughout.
@rincewind
They indeed are -
The kernel and other software are up to date. MySQL - 5.7
Swappiness was set to 10. Upon recommendation from cPanel and CloudLinux, I've disabled swap and the OOM killer for now; let's see how it goes tomorrow.
I think this was just a coincidence, as several other times network usage was normal. This specific log was written while backups were running as well.
Not a regular pattern, totally random.
I appreciate you all taking the time to write this.
P.S. I've configured atop, let's see what we find.
It's great seeing so many people offering suggestions to help you out!
Good luck finding that nasty bug/issue :P
Another thing to check is whether your tmpfs directories are getting full, i.e.,
df -kh | grep tmpfs
perf record -a
perf report
It will tell you which syscall is causing the 80% sy, and which process calls it (mongo, mysql, cxswatch).
Just wanted to update here: it’s been around a day and I haven’t noticed it reoccur. Here are the things I did:
Disabled SWAP
Disabled OOM killer
Disabled CXSWATCH (which was also spawning clamd)
If everything goes fine for another 24 hours, I’ll try enabling them one by one, except for swap. Let’s see.
Big thanks to everyone for the suggestions.
Thanks, they are almost empty.
This is really helpful and interesting. I'll play around with it in case something similar happens in the future.
Hey guys.
Since the issue was fixed, there have been a lot of unexplainable webserver downtimes. The server is working fine - PHP is working fine. LiteSpeed shows as running but does not respond to any requests. The page keeps loading. Give it a few minutes and it’s back up. Load drops to 1 in the meantime.
Today, even SSH was inaccessible.
If someone with a good reputation here could take a look (for a price/cost), I’d be grateful.
so it's not fixed at all. disabling swap and oomkiller isn't a fix at all ;-)
maybe provide more info on what exactly you consider to be the fix (other than disabling the above). you probably still have the same problem as before (memory leak or whatever), but now instead of filling up swap and hogging IO that way, you just see other odd behaviour as a result
just a guess.
Have a look in syslog; that should give you some pointers. It probably failed to allocate memory since you disabled swap, and as a result the server was actively killing processes to free memory (syslog can confirm this). In the meantime, no worker can do anything (not even accept your SSH connection), because it isn't getting any memory from the kernel.
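A quick way to confirm this from the logs (a sketch; log paths vary by distro, so both common locations are tried):

```shell
# Search the usual system logs for OOM-killer activity; on a RHEL-based
# cPanel box the messages typically land in /var/log/messages.
grep -i -e 'out of memory' -e 'oom-killer' \
    /var/log/messages /var/log/syslog 2>/dev/null | tail -20
```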
Disabling swap is not a solution, as pointed out by @Falzo. Though it will definitely help reduce the load on the disks, the issue is still there: a memory-hogging application.
Setting up atop is a good start. Basically, you need to find the PID that's doing the immense memory allocation and then dump that PID's procinfo to find additional information that can help you trace it down (e.g. the file path in the case of a PHP script).
I'd also like to point out that if you believe this could be caused by a PHP script but have not yet set up a sane memory limit, you should consider doing so (https://www.php.net/manual/en/ini.core.php#ini.memory-limit).
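For reference, the relevant directive in php.ini (the 256M value is purely illustrative, not a recommendation):

```ini
; Cap the memory a single PHP request may allocate; a runaway script
; then dies with a fatal error instead of pushing the box into swap.
memory_limit = 256M
```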
Re-enable swap/OOM and find out what is using swap.
Thanks.
@OP What are you hosting? Maybe your traffic/users generate spikes that cause the symptoms you describe.
Thank you @Falzo, @solaire, and others. The issue was with Transparent Huge Pages (THP). Disabling the OOM killer and swap might have mitigated the CPU overload THP was causing, which would explain why there was some improvement.
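For anyone landing here later, a sketch of checking THP status and disabling it at runtime (persisting it requires a boot-time hook or the transparent_hugepage=never kernel parameter):

```shell
# The bracketed word is the active mode, e.g. "always madvise [never]".
cat /sys/kernel/mm/transparent_hugepage/enabled

# Disable at runtime (requires root; the change is lost on reboot):
# echo never > /sys/kernel/mm/transparent_hugepage/enabled
```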
I think enabling both should be fine now.