Monitoring

PacketVM · July 2012

I've noticed some people are starting to set up monitoring networks for all of their VPS's.
What do you guys use for monitoring your VPS's and what do you host it on?

Thanks

PacketVM · July 2012

Anyone? What do you guys use for monitoring?

jar · July 2012

PHP Server Monitor Plus

Toying with openstatus, but I think I keep messing something up and the uptime field is blank. Right now I host it on one of the VPSes, but I'm going to end up hosting it on my shared hosting server.

Robert · July 2012

At risk of repeating myself from other threads, I use Zabbix. I have a dedicated server set up with it on. I plan on allowing free accounts once I finish setting up a self service portal in PHP. If anyone wants a free account, let me know. I'll have to add the hosts manually, but you'll have access to the monitoring dashboard and email alerts.

miTgiB · July 2012

http://www.centreon.com/

PacketVM · July 2012

works for me?

sleddog · July 2012

@Jack said: I use openstatus;

http://status.xjack.me

When I click on a '1 hr' or '3hr' link (etc.) all I get is a spinner.

Taylor · July 2012

@Jack said: Broken link.

https://forums.dotvps.net/

Brokenish link.

PacketVM · July 2012

@Jack said: Broken link.

Ah, yea I did notice that. Sorry

Taylor · July 2012

@Jack said: I hope you're going to activate if I put one up haha!

I may be, if I get a MB worth of VPS every post :P

Taz · July 2012

Status2k.

sleddog · July 2012

@dominicl said: What do you guys use for monitoring your VPS's

Just a thought... "monitoring" is different things....

You can monitor availability, e.g., port 80 on server xyz, and track its status over time (as a % uptime/downtime).
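As a rough illustration of the "% uptime" idea, here is a minimal sketch that turns a history of pass/fail checks into a percentage. The function name and the sample history are made up for the example; nobody in the thread posted this.

```python
def uptime_percent(checks):
    """Return the share of successful checks as a percentage.

    `checks` is a list of booleans, True meaning the check succeeded.
    """
    if not checks:
        raise ValueError("no checks recorded")
    return 100.0 * sum(checks) / len(checks)


# Hypothetical history: 288 checks (one every 5 minutes for a day), one failed.
history = [True] * 287 + [False]
print(f"{uptime_percent(history):.2f}% of checks succeeded")
```

Note that this only counts checks, so the figure is only as good as the check interval and the accuracy of each individual check.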

Or you can monitor performance, e.g., load, memory, swap, and track those variables over time.
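For the performance side, a minimal sketch of sampling load, memory and swap might look like the following. It assumes a Linux box (it reads /proc/meminfo and uses os.getloadavg()); the function names and the returned field names are invented for the example.

```python
import os


def parse_meminfo(text):
    """Parse the contents of Linux /proc/meminfo into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def sample(meminfo_path="/proc/meminfo"):
    """Take one snapshot of load, memory use and swap use (Linux only)."""
    load1, _load5, _load15 = os.getloadavg()
    with open(meminfo_path) as f:
        mem = parse_meminfo(f.read())
    # Older kernels lack MemAvailable, so fall back to MemFree.
    available = mem.get("MemAvailable", mem["MemFree"])
    return {
        "load1": load1,
        "mem_used_kb": mem["MemTotal"] - available,
        "swap_used_kb": mem["SwapTotal"] - mem["SwapFree"],
    }
```

Calling sample() from cron every few minutes and appending the result to a file would give the "track those variables over time" part; tools like munin or cacti do this far more thoroughly.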

Some apps do one, others do the other, I guess some do both

So when you say "monitor" you need to think about what it is you're concerned about monitoring....

justinb · July 2012

http://observium.org/wiki/Main_Page for performance
pingdom on all hosts for uptime/latency

miTgiB · July 2012

@justinb said: pingdom on all hosts for uptime/latency

Only if you like false results

jar · July 2012

@miTgiB Let's be fair. Pingdom is extremely accurate. Until it isn't.

justinb · July 2012

@miTgiB said: Only if you like false results

Haven't gotten a false result yet.. it's a free service anyways

miTgiB · July 2012

@jarland said: Pingdom is extremely accurate. Until it isn't.

I have zero faith in it. I don't know how many tickets I get from people who depend on pingdom and claim I was down when there was no issue. Occasionally pingdom gets lucky with a correct down report, but those are rare.

KuJoe · July 2012

External Monitoring: BinaryCanary (waiting for our new server for a custom monitoring script) and scrd+status+munin (results).
Internal Monitoring: scrd+status+munin (results), custom monitoring script, and Observium.

camarg · July 2012

I use cacti along with nagios for some servers.

@miTgiB said: people

maybe that's the problem

sleddog · July 2012

Demo of my availability monitor: http://199.96.82.38/pung/

Ash_Hawkridge · July 2012

@Jack said: @sleddog looks nice how do i get it?

+1, that's some nice work.

HalfEatenPie · July 2012

@GetKVM_Ash said: @Jack said: @sleddog looks nice how do i get it?

comment Inception.

Anyways, yeah dude, that looks awesome.... Are you going to open source that?

sleddog · July 2012

Yes, OSS, WIR. I originally wrote it ~6 years ago for internal use. I thought I'd clean it up a bit for public use, but the "clean up" turned into a rewrite.

Meanwhile, guess who monitors for tcp connections with no data, and eventually blocks for an hour? I'm guessing WHT comes up again around 15:14 NDT

LV_Matt · July 2012

@miTgiB said: I have zero faith in it. I don't know how many tickets I get with people that depend on pingdom and claiming I was down when there was no issue. Occasionally pingdom gets lucky with a correct down report, but they are rare.

We are finding this increasingly: customers just assume that when pingdom says it's down, it's actually down, when in reality their site is still online.

@KuJoe said: BinaryCanary

We are using BinaryCanary too, and finding it much more reliable than Pingdom! Then we have our internal monitoring system, teamed with Cacti and Nagios.

jar · July 2012

Still not one false report from uptime robot in a year

sleddog · July 2012

I re-started my demo/test monitor, targeting providers' test IPs from the Offers forum. I'm interested in reducing/eliminating false positives, and I thought this might help -- with confirmation/denial from the provider regarding any downtime.

http://199.96.82.38/pung/

gsrdgrdghd · July 2012

@sleddog said: I'm interested in reducing/eliminating false positives

Maybe you should add distributed monitoring for that

sleddog · July 2012

@gsrdgrdghd said: Maybe you should add distributed monitoring for that

No. It would add little or no benefit, and be significantly more complex, which in turn creates a greater probability of errors.

Ash_Hawkridge · July 2012

I see one of our IPs there, thank you @sleddog

gsrdgrdghd · July 2012

@sleddog said: it would add little or no benefit

How's that?

sleddog · July 2012

@gsrdgrdghd said: How's that?

People generally look at scripts like this as an "uptime" monitor. It isn't, and I don't treat it as one. It's a point-to-point connection monitor. It tries to establish a TCP/IP connection across the Internet from Point A to multiple targets (Point B) on designated ports, and records the results.

In my script, an attempted connection can have one of three possible results:

Connection succeeded
Connection refused
Connection timed out

The meaning of "Connection succeeded" is obvious. "Connection refused" means that the target was reached but it refused the connection. This could be because a listening service has stopped (e.g., apache has crashed) or a firewall has rejected the connection.
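The three outcomes listed above map fairly directly onto socket errors. This is not sleddog's script, just a minimal sketch using Python's standard socket module; the function name, the result strings and the 5-second default timeout are assumptions made for the example.

```python
import socket


def check_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome."""
    try:
        # create_connection() performs the TCP handshake; we close straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return "succeeded"
    except ConnectionRefusedError:
        # Target was reached but nothing accepted the connection on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No response at all within the timeout.
        return "timed out"
    except OSError as exc:
        # DNS failure, unreachable network, etc. Kept separate on purpose.
        return f"error: {exc}"


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only address, used here as a placeholder.
    print(check_tcp("192.0.2.10", 80, timeout=3.0))
```

Keeping "refused" and "timed out" apart matters, because a refusal proves the route works while a timeout does not say where the problem is.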

"Connection timed out" means that the target offered no response. This is the most problematic. It could be because:

A. The target -- Point B -- is offline, or:
B. There is a networking issue somewhere on the route from Point A to Point B.

Obviously there's a third possibility:

C. There's a localized issue causing Point A to have lost Internet access.

But the script checks for that, logs it if it exists and exits quietly.
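How the script detects case C isn't described, so the following is only one plausible approach: before probing the real targets, probe a few reference hosts that are very unlikely to all be down, and skip the cycle if none of them answer. The function names, the `probe` callable (anything that returns "succeeded" on a good connection) and the `log` callable are all hypothetical.

```python
def local_link_ok(reference_targets, probe):
    """Return True if Point A can reach at least one reference target."""
    return any(probe(host, port) == "succeeded" for host, port in reference_targets)


def run_cycle(targets, reference_targets, probe, log):
    """Probe every target once, unless Point A itself looks offline.

    Returns a dict mapping (host, port) to the probe result, or None if the
    cycle was skipped because of a local connectivity problem.
    """
    if not local_link_ok(reference_targets, probe):
        log("local connectivity problem at Point A; skipping this cycle")
        return None
    return {(host, port): probe(host, port) for host, port in targets}
```

Skipping the cycle, rather than recording failures, keeps a local outage from showing up as a wall of red dots against every target.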

Adding one or more monitoring stations ("distributed" monitoring) might help clarify if the issue is B above, but only if the additional stations take different routes to Point B, and the issue doesn't lie along those alternate routes. But frankly this is something I'd prefer to investigate manually. Frequently repeated or extended timeouts say, "look into it but don't make assumptions."
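To make the trade-off concrete, here is a minimal sketch of how results from several stations could be combined. It is not part of sleddog's monitor; the function name, station names and wording of the verdicts are invented, and it inherits the caveat above that stations sharing a broken route will agree with each other and still be wrong.

```python
def interpret(results_by_station):
    """Combine per-station results for one target into a cautious verdict.

    `results_by_station` maps a station name to "succeeded", "refused"
    or "timed out".
    """
    if not results_by_station:
        raise ValueError("no station results")
    outcomes = set(results_by_station.values())
    if outcomes == {"succeeded"}:
        return "reachable from all stations"
    if "succeeded" in outcomes:
        # At least one route works, so the target itself is up.
        return "reachable from some stations: likely a routing issue (case B)"
    if "refused" in outcomes:
        # A refusal means the host answered, even if the service did not.
        return "host reached but connection refused: service or firewall issue"
    return "timed out from all stations: target offline (case A) or shared route problem"


print(interpret({"station-1": "timed out", "station-2": "succeeded"}))
```

Even in this toy form the last branch cannot tell case A from a shared route problem, which is the point being made about manual investigation.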

Again, it's not an "uptime" monitor. It's a point-to-point connection monitor with history. A red dot doesn't necessarily mean "OMG It's Down!" It means there was an issue establishing a connection between two points.

Of course monitoring won't happen (and may not be available for viewing via www) if the monitoring station (Point A) is down. That's why I put it either on a remote LEB with respectable uptime / network uptime (say at least 99%), or on a local box that I manage.
