Shell script to check cpu load on many servers
Pardon my noobness, but what are the tools to prevent runaway processes from eating 100% CPU, and how can that be done without running something constantly in the target server's memory?
Here is a shell script that, if put into cron, will periodically check servers via ssh:
#!/bin/bash
# crontab entry for checks every 15 minutes:
# */15 * * * * /path/to/this/script
# run this once first to suppress the ssh hello screen:
# touch ~/.hushlogin

# put your servers' IPs here
declare -a ips=("10.0.0.1" "10.0.0.2" "10.0.0.3" "10.0.0.4" "10.0.0.5")

for i in "${ips[@]}"; do
    result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
    # the result can be parsed and alerts issued or action taken
    dt=$(date +'%Y-%m-%d %H:%M')
    # append it to a per-server log file
    echo "$dt $result" >> /user/name/cpumon/"$i"
done
Comments
You don't. Daemons cost resources.
The challenge with your script is that when everything grinds to a halt, it takes just one server hanging its SSH connection to leave all your monitoring stuck on that one bad node.
The best architectures run lightweight metric exporters on each node, with a separate system to gather the metrics, analyse, alert, graph, etc. (e.g. Prometheus, Grafana).
i.e. some kind of tiny kernel module which would send alerting UDP packets to the monitoring servers when things are about to go south? It looks like something must be running on the server either way. The idea, then, is to make it as light as possible.
Yeah, you need to have something running "above" the ssh client that has a timeout alarm. When things reach that level of complexity, I'd rather work in perl or python, but it can be done in shell:
https://www.cyberciti.biz/faq/shell-scripting-run-command-under-alarmclock/
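For the pure-shell route, coreutils timeout(1) plus ssh's ConnectTimeout can provide that alarm without perl or python; a minimal sketch (the IPs and the 10-second limit are illustrative):

```shell
#!/bin/bash
# wrap each ssh check in timeout(1) so one hung host cannot
# stall the loop; IPs and limits here are illustrative
for i in 10.0.0.1 10.0.0.2; do
    # ConnectTimeout bounds the handshake; timeout(1) bounds the
    # whole command, including a hang after the connection is up
    result=$(timeout 10 ssh -o ConnectTimeout=5 -T "$i" "cat /proc/loadavg" 2>&1) \
        || result="CHECK FAILED OR TIMED OUT"
    echo "$(date +'%Y-%m-%d %H:%M') $i $result"
done
```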
A process using 100% of CPU is no big deal. It will just get slowed down a little bit when something else wants time, and both processes will just be slow. What really makes a system unresponsive is running out of memory (and other less common things). You can set some ulimits to automatically kill a process after using a certain amount of CPU or memory if you want to, but usually you put network/server monitoring system in place, that wakes you up when things are getting dangerous, before things get unresponsive.
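The ulimit approach mentioned above can be done per shell, so every process started from it inherits the caps; a sketch with illustrative values:

```shell
#!/bin/bash
# per-shell resource limits; the values below are illustrative
ulimit -t 60        # kill a process after 60 s of CPU time (SIGXCPU)
ulimit -v 524288    # cap virtual memory at ~512 MB (value is in KiB)
# any command launched from this shell now inherits these limits, e.g.:
# ./some-long-job   (hypothetical command)
```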
To address memory allocation, I changed the crazy default overcommit sysctl settings to something more meaningful:
vm.swappiness = 5
kernel.panic = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 120
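With vm.overcommit_memory=2, the kernel enforces a hard commit limit of swap plus overcommit_ratio% of RAM, so mallocs beyond that fail instead of triggering the OOM killer later. The effective ceiling and current usage can be checked at runtime:

```shell
# CommitLimit is the enforced ceiling under overcommit_memory=2;
# Committed_AS is the memory currently committed
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```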
So far I stick to the idea of a tiny kernel module that monitors load and sends UDP packets to the control center if the situation is abnormal. That's because the kernel will continue to run a bit longer even if all of userspace is dead, and that will be enough to send telemetry to remove the node from production. In addition, the node will be removed from production if it misses two consecutive "I'm alive" packets.
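A userspace sketch of that heartbeat protocol (the kernel-module version would send the same kind of datagram; the monitor address and port here are hypothetical):

```shell
#!/bin/bash
# fire-and-forget "I'm alive" heartbeat with the 1-minute load average;
# MONITOR_HOST/MONITOR_PORT are hypothetical, run this from cron or a loop
MONITOR_HOST=192.0.2.10
MONITOR_PORT=9999
load1=$(cut -d' ' -f1 /proc/loadavg)    # 1-minute load average
# bash's /dev/udp pseudo-device sends a single UDP datagram
echo "alive $(hostname) $load1" > "/dev/udp/$MONITOR_HOST/$MONITOR_PORT"
```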
Only collecting once every 15 minutes is not going to give you sufficient granularity for debugging most issues.
I'd recommend using Netdata, which collects data once per second, including per-process CPU and memory usage. You can aggregate multiple servers using Netdata Cloud or Prometheus (Prometheus is useful if you want to run queries across all the data).
Netdata is shiny! Installation & compilation is a masterpiece, but it has committed 380 MB of memory and used 50 MB, which is a lot for a 500 MB RAM VPS. I'll put it on a VPS with 1.5 GB RAM to study it more.
I don't know much about shell, but I did something similar in Perl just to check HTTP. It forks every check so that no single server can hold up the rest of the script from checking the others, so that is also a possibility.
Same here: one can add "&" to a command to run it as a parallel task and continue with the script at the same time.
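Applied to the original loop, the "&" approach might look like this (IPs and log path are illustrative):

```shell
#!/bin/bash
# run every check in a background subshell so one slow host
# doesn't delay the others; IPs and log path are illustrative
for i in 10.0.0.1 10.0.0.2 10.0.0.3; do
    (
        result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
        echo "$(date +'%Y-%m-%d %H:%M') $i $result" >> "/tmp/cpumon-$i"
    ) &
done
wait    # block until all background checks have finished
```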
Oh yeah. I forgot about &. I just know perl so much more.
External ssh monitoring is okay for infrequent tasks, like checking something once every 15 minutes. For more frequent monitoring it is quite clumsy unless one reuses the ssh connection to avoid repeated handshakes. A light agent running inside the monitored system is preferable in that case.
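OpenSSH's ControlMaster multiplexing is one way to avoid the per-check handshake: one master connection stays up and later ssh invocations ride on it. A sketch (host and socket path are illustrative):

```shell
#!/bin/bash
# reuse a single ssh connection for frequent checks via
# ControlMaster multiplexing; host and socket path are illustrative
HOST=10.0.0.1
SOCK="$HOME/.ssh/cm-%r@%h:%p"
# open a persistent master connection in the background,
# kept alive for 10 minutes after last use
ssh -o ControlMaster=yes -o ControlPath="$SOCK" -o ControlPersist=10m -Nf "$HOST"
# subsequent checks reuse the master and skip the handshake entirely
ssh -o ControlPath="$SOCK" "$HOST" "cat /proc/loadavg"
```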
Why not use hetrixtools?
Wow! Looks nice. UptimeRobot is also nice, but checks less frequently. UptimeRobot reveals that ping to my server in San Jose is 40 ms at night and 49 ms during the daytime, when traffic load is higher. This is common behaviour, but in Asia the day/night difference can be twice as bad.
Unless I'm missing something, hetrixtools is all you need.
What you can do is assign the processes you suspect might eat up your CPU to a cgroup. This scales down the amount of CPU time they get to a set level whenever CPU time is being contested between processes.
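A sketch of that on a cgroup v2 unified hierarchy (requires root; the group name and PID are illustrative). cpu.weight only kicks in under contention, matching the behaviour described above; an idle CPU stays fully available to the process:

```shell
#!/bin/bash
# confine a suspect process to a low-priority cgroup (v2 hierarchy);
# requires root, group name "suspect" and PID 1234 are illustrative
mkdir /sys/fs/cgroup/suspect
# cpu.weight default is 100; a lower value means a smaller CPU share,
# but only when the CPU is actually contested
echo 20 > /sys/fs/cgroup/suspect/cpu.weight
# move the suspect process into the group
echo 1234 > /sys/fs/cgroup/suspect/cgroup.procs
```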