Shell script to check cpu load on many servers
Pardon my noobness, but what are the tools to prevent runaway processes from eating 100% CPU, and how can that be done without running something constantly in the target server's memory?
Here is a shell script that, if put into cron, will periodically check servers via ssh:
#!/bin/bash
# crontab entry for checks every 15 minutes:
# */15 * * * * /path/to/this/script
# run this once first to suppress the ssh hello screen:
# touch ~/.hushlogin

# put your servers' IPs here
declare -a ips=("10.0.0.1" "10.0.0.2" "10.0.0.3" "10.0.0.4" "10.0.0.5")

for i in "${ips[@]}"; do
    result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
    # the result can be parsed and alerts issued or action taken
    dt=$(date +'%Y-%m-%d %H:%M')
    # append it to a per-server log file
    echo "$dt $result" >> /user/name/cpumon/"$i"
done
Comments
You don't. Daemons cost resources.
The challenge with your script is that when everything grinds to a halt, it takes just one server hanging its SSH connection to leave all your monitoring stuck on that one bad node.
The best architectures run lightweight metric exporters on each node, with a separate system to gather the metrics, analyse, alert, graph, etc. (e.g. Prometheus, Grafana).
i.e. some kind of tiny kernel module which would send alerting UDP packets to the monitoring servers when things are about to go south? It looks like something must be running on the server either way. The idea, then, is to make it as light as possible.
Yeah, you need to have something running "above" the ssh client that has a timeout alarm. When things reach that level of complexity, I'd rather work in perl or python, but it can be done in shell:
https://www.cyberciti.biz/faq/shell-scripting-run-command-under-alarmclock/
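For the pure-shell route, coreutils timeout(1) plus ssh's ConnectTimeout can provide that alarm without perl or python; a minimal sketch (the IPs and the 10-second limit are illustrative):

```shell
#!/bin/bash
# wrap each ssh check in timeout(1) so one hung host cannot
# stall the loop; IPs and limits here are illustrative
for i in 10.0.0.1 10.0.0.2; do
    # ConnectTimeout bounds the handshake; timeout(1) bounds the
    # whole command, including a hang after the connection is up
    result=$(timeout 10 ssh -o ConnectTimeout=5 -T "$i" "cat /proc/loadavg" 2>&1) \
        || result="CHECK FAILED OR TIMED OUT"
    echo "$(date +'%Y-%m-%d %H:%M') $i $result"
done
```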
A process using 100% of CPU is no big deal. It will just get slowed down a little bit when something else wants time, and both processes will just be slow. What really makes a system unresponsive is running out of memory (and other less common things). You can set some ulimits to automatically kill a process after using a certain amount of CPU or memory if you want to, but usually you put network/server monitoring system in place, that wakes you up when things are getting dangerous, before things get unresponsive.
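The ulimit approach mentioned above can be done per shell, so every process started from it inherits the caps; a sketch with illustrative values:

```shell
#!/bin/bash
# per-shell resource limits; the values below are illustrative
ulimit -t 60        # kill a process after 60 s of CPU time (SIGXCPU)
ulimit -v 524288    # cap virtual memory at ~512 MB (value is in KiB)
# any command launched from this shell now inherits these limits, e.g.:
# ./some-long-job   (hypothetical command)
```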
To address memory allocation, I changed the crazy default overcommit sysctl settings to something more meaningful:
vm.swappiness = 5
kernel.panic = 5
vm.overcommit_memory = 2
vm.overcommit_ratio = 120
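With vm.overcommit_memory=2, the kernel enforces a hard commit limit of swap plus overcommit_ratio% of RAM, so mallocs beyond that fail instead of triggering the OOM killer later. The effective ceiling and current usage can be checked at runtime:

```shell
# CommitLimit is the enforced ceiling under overcommit_memory=2;
# Committed_AS is the memory currently committed
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```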
So far I stick to the idea of a tiny kernel module that monitors load and sends UDP packets to the control center if the situation is abnormal. That's because the kernel will continue to run a bit longer even if all of userspace is dead, and that will be enough to send telemetry to remove the node from production. In addition, the node will be removed from production if it misses two consecutive "I'm alive" packets.
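A userspace sketch of that heartbeat protocol (the kernel-module version would send the same kind of datagram; the monitor address and port here are hypothetical):

```shell
#!/bin/bash
# fire-and-forget "I'm alive" heartbeat with the 1-minute load average;
# MONITOR_HOST/MONITOR_PORT are hypothetical, run this from cron or a loop
MONITOR_HOST=192.0.2.10
MONITOR_PORT=9999
load1=$(cut -d' ' -f1 /proc/loadavg)    # 1-minute load average
# bash's /dev/udp pseudo-device sends a single UDP datagram
echo "alive $(hostname) $load1" > "/dev/udp/$MONITOR_HOST/$MONITOR_PORT"
```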
Only collecting once every 15 minutes is not going to give you sufficient granularity for debugging most issues.
I'd recommend using Netdata, which collects data once per second, including per-process CPU and memory usage. You can aggregate multiple servers using Netdata Cloud or Prometheus (Prometheus is useful if you want to run queries across all the data).
Netdata is shiny! Installation & compilation is a masterpiece, but it has committed 380 MB of memory and used 50 MB, which is a lot for a 500 MB RAM VPS. I'll put it on a VPS with 1.5 GB RAM to study it more.
I don't know much about shell, but I did something similar in Perl just to check HTTP. It forks every check so that no single server can hold up the rest of the script from checking the others, so that is also a possibility.
Same here: one can add "&" to a command to run it as a parallel task and continue with the script at the same time.
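Applied to the original loop, the "&" approach might look like this (IPs and log path are illustrative):

```shell
#!/bin/bash
# run every check in a background subshell so one slow host
# doesn't delay the others; IPs and log path are illustrative
for i in 10.0.0.1 10.0.0.2 10.0.0.3; do
    (
        result=$(ssh -T "$i" "cat /proc/loadavg" 2>&1)
        echo "$(date +'%Y-%m-%d %H:%M') $i $result" >> "/tmp/cpumon-$i"
    ) &
done
wait    # block until all background checks have finished
```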
Oh yeah. I forgot about &. I just know perl so much more.
External ssh monitoring is okay for infrequent tasks, like checking something once every 15 minutes. For more frequent monitoring it is quite clumsy unless one reuses the ssh connection to avoid repeated handshakes. A light agent running inside the monitored system is preferable in that case.
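OpenSSH's ControlMaster multiplexing is one way to avoid the per-check handshake: one master connection stays up and later ssh invocations ride on it. A sketch (host and socket path are illustrative):

```shell
#!/bin/bash
# reuse a single ssh connection for frequent checks via
# ControlMaster multiplexing; host and socket path are illustrative
HOST=10.0.0.1
SOCK="$HOME/.ssh/cm-%r@%h:%p"
# open a persistent master connection in the background,
# kept alive for 10 minutes after last use
ssh -o ControlMaster=yes -o ControlPath="$SOCK" -o ControlPersist=10m -Nf "$HOST"
# subsequent checks reuse the master and skip the handshake entirely
ssh -o ControlPath="$SOCK" "$HOST" "cat /proc/loadavg"
```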
Why not use hetrixtools?
Wow! Looks nice. UptimeRobot is also nice, but checks less frequently. UptimeRobot reveals that ping to my server in San Jose is 40 ms at night and 49 ms during the daytime, when traffic load is higher. This is common behaviour, but in Asia the day/night difference can be twice as bad.
Unless I'm missing something, hetrixtools is all you need.
What you can do is assign the processes you suspect might eat up your CPU to a cgroup. This scales down the amount of CPU time they get to a set level whenever CPU time is being contested between processes.
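A sketch of that on a cgroup v2 unified hierarchy (requires root; the group name and PID are illustrative). cpu.weight only kicks in under contention, matching the behaviour described above; an idle CPU stays fully available to the process:

```shell
#!/bin/bash
# confine a suspect process to a low-priority cgroup (v2 hierarchy);
# requires root, group name "suspect" and PID 1234 are illustrative
mkdir /sys/fs/cgroup/suspect
# cpu.weight default is 100; a lower value means a smaller CPU share,
# but only when the CPU is actually contested
echo 20 > /sys/fs/cgroup/suspect/cpu.weight
# move the suspect process into the group
echo 1234 > /sys/fs/cgroup/suspect/cgroup.procs
```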