Simple server monitoring - Feedback appreciated (Public beta)

Monsta_AU · June 2014

About the only thing missing is a ps display of what your server is doing, much like Elastictrace shows.

I have had issues with Elastictrace on two of my three VPSes. I seem to have resolved one, but on the other it just will not install properly. I believe it is an IPv6 issue, but trying to nail it down is difficult.

NodeQuery just works, and very well. I love the simple interface, but it does mean the hard data isn't there. Hopefully you keep adding to it Joe.

bdtech · July 2014

@Joe_NQ Joe I would look at reducing the nq-agent wget timeout from 60 seconds to 20 seconds. During the NQ outage on the morning of July 4th I had 3 servers with NQ OOM'ing and maxed out of memory/kernel errors. 2 of 3 servers needed to have MySQL started (process was killed), and the third server was entirely unresponsive (including console access) which required a hard reboot. The issues arose at:

Fri, Jul 4, 2014 at 5:28 AM (EST)

Fri, Jul 4, 2014 at 6:13 AM

Fri, Jul 4, 2014 at 6:38 AM

Jul 4 06:09:33 nye1 kernel: [4973466.827859] Out of memory: Kill process 4398 (wget) score 224 or sacrifice child

netrix · July 2014

Output from command bash /etc/nodequery/nq-agent.sh > /etc/nodequery/nq-cron.log ..

df: `/var/named/chroot/var/run/dbus': Permission denied
df: `/var/named/chroot/var/run/dbus': Permission denied
df: `/var/named/chroot/var/run/dbus': Permission denied

BotoX · July 2014

Disk usage is not working with ZFS (on Linux) by the way.

It'd be nice if you could add support for it.

Joe_NQ · July 2014

@bdtech said:
Joe_NQ Joe I would look at reducing the nq-agent wget timeout from 60 seconds to 20 seconds. During the NQ outage on the morning of July 4th I had 3 servers with NQ OOM'ing and maxed out of memory/kernel errors. 2 of 3 servers needed to have MySQL started (process was killed), and the third server was entirely unresponsive (including console access) which required a hard reboot. The issues arose at:

Fri, Jul 4, 2014 at 5:28 AM (EST)

Fri, Jul 4, 2014 at 6:13 AM

Fri, Jul 4, 2014 at 6:38 AM

Jul 4 06:09:33 nye1 kernel: [4973466.827859] Out of memory: Kill process 4398 (wget) score 224 or sacrifice child

Thank you very much for reporting this issue, that's definitely not good. I've seen it happening before but was certain it must have been an error on my part. I will lower the timeout value and include an additional process termination.

@netrix said:
Output from command bash /etc/nodequery/nq-agent.sh > /etc/nodequery/nq-cron.log ..
df: `/var/named/chroot/var/run/dbus': Permission denied
df: `/var/named/chroot/var/run/dbus': Permission denied
df: `/var/named/chroot/var/run/dbus': Permission denied

May I ask which distribution/version you are using? I hope to improve this function very soon. Right now I am focused on the installation script, alerting and our API.

@BotoX said:
Disk usage is not working with ZFS (on Linux) by the way.

It'd be nice if you could add support for it.

I will look into it, thank you.

bdtech · July 2014

@Joe_NQ

Great, thanks! FYI, here's how I pulled it up -> cd /var/log; grep wget messages messages.1 syslog syslog.1

messages:Jul 4 04:33:59 acad-dev kernel: [13177802.721740] [16841] 999 16841 21805 20404 0 0 0 wget
messages:Jul 4 04:33:59 acad-dev kernel: [13177802.723851] [17294] 999 17294 18423 17023 0 0 0 wget
messages:Jul 4 04:33:59 acad-dev kernel: [13177802.725902] [17756] 999 17756 13349 11948 0 0 0 wget
messages:Jul 4 04:33:59 acad-dev kernel: [13177802.727904] [18213] 999 18213 8276 6875 0 0 0 wget
messages:Jul 4 04:33:59 acad-dev kernel: [13177802.730453] [18674] 999 18674 3201 1796 0 0 0 wget
messages:Jul 4 05:29:25 acad-dev kernel: [13181129.125541] [24624] 999 24624 28569 27168 0 0 0 wget
messages:Jul 4 05:29:25 acad-dev kernel: [13181129.127669] [25107] 999 25107 23496 22095 0 0 0 wget
messages:Jul 4 05:29:25 acad-dev kernel: [13181129.129719] [25539] 999 25539 20114 18712 0 0 0 wget
messages:Jul 4 05:29:25 acad-dev kernel: [13181129.131790] [26025] 999 26025 15041 13641 0 0 0 wget
messages:Jul 4 05:29:25 acad-dev kernel: [13181129.134193] [26454] 999 26454 11658 10257 0 0 0 wget
messages:Jul 4 05:29:25 acad-dev kernel: [13181129.136189] [26937] 999 26937 6585 5183 0 0 0 wget
messages:Jul 4 05:32:38 acad-dev kernel: [13181321.200405] wget invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
messages:Jul 4 05:32:38 acad-dev kernel: [13181321.200974] wget cpuset=/ mems_allowed=0
messages:Jul 4 05:32:38 acad-dev kernel: [13181321.201239] Pid: 25539, comm: wget Not tainted 3.2.0-4-686-pae #1 Debian 3.2.41-2+deb7u2

bdtech · July 2014

@Joe_NQ Here's my fix in the meantime
timeout 15 wget -q -o /dev/null -O /etc/nodequery/nq-agent.log -T 10

Joe_NQ · July 2014

@bdtech said:
Joe_NQ Here's my fix in the meantime
timeout 15 wget -q -o /dev/null -O /etc/nodequery/nq-agent.log -T 10

Using 'timeout' is actually the best way to prevent this for now. I have implemented it in our script with slightly adjusted values and hope it won't be happening again. Thank you for your help.
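For anyone patching their own copy in the meantime, here is a minimal sketch of that approach (an illustration, not the agent's actual code): wrap the command in GNU coreutils `timeout` so a hung transfer is killed before the next cron run can stack another one on top of it. The helper name `run_limited` and the 5 second kill grace period are assumptions.

```shell
#!/bin/sh
# Sketch only: run_limited is a hypothetical helper, not part of nq-agent.
# Usage: run_limited SECONDS COMMAND [ARGS...]
run_limited() {
    limit="$1"
    shift
    status=0
    # timeout sends SIGTERM after $limit seconds; -k 5 follows up with
    # SIGKILL 5 seconds later if the process ignores the first signal.
    timeout -k 5 "$limit" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is the exit status timeout uses when the time limit was hit.
        echo "command timed out after ${limit}s" >&2
    fi
    return "$status"
}
```

It would be called as `run_limited 15 wget ...` with the same wget options as in the fix quoted above.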

We also introduced process monitoring with the latest update. Right now you can simply view the top processes sorted by resource usage during the last interval. As soon as we release our new notification system you will also have it attached to every resource usage alert.

Additionally, our API is almost done and will finally be released within the next few days. It will be read-only for now but extended in the future to support the creation of new servers.

Many thanks to everyone using our service.

BlaZe · July 2014

Using it to monitor all my servers. Really thanks for this. Amazing design.

bdtech · July 2014

@Joe_NQ Awesome! Can process monitoring be a feature that can be disabled? It can obviously leak usernames and potentially even passwords on the command line.

Joe_NQ · July 2014

@bdtech said:
Joe_NQ Awesome! Can process monitoring be a feature that can be disabled? It can obviously leak usernames and potentially even passwords on the command line.

We could easily add a function to disable process monitoring in the web application so it won't be saved to the database. However, we're currently considering a different output that will show the service name only instead of the full command which would offer more privacy.

For now, if you have a server with very sensitive data you can manually add processes_array=" " (mind the space) to line 74 to prevent process data from being transmitted.

tr1cky · July 2014

Feature request: Allow spaces in names!

Monsta_AU · July 2014

Just had all servers report failure when other monitors say they are up.

Not exactly inspiring confidence in the service.

Joe_NQ · July 2014

@Monsta_AU said:
Just had all servers report failure when other monitors say they are up.

Not exactly inspiring confidence in the service.

Indeed, we just had an unusually high number of alerts being processed, and 5 of our 11 test servers triggered notifications. Strangely, only Linode and DigitalOcean services were affected on our end.

Might have been a routing issue to our server - I will look into it.

theduncan · July 2014

My Kimsufi box popped up with an outage alert too.

noyle · July 2014

@Joe_NQ
Looks great, and thanks for the work! I'm currently using NodeQuery to watch my two VPSs. I would give it a try if there's a Pro plan in the future.

iceTwy · July 2014

@Joe_NQ: I'm digging the recent update! The new process monitoring feature is nice. I'm glad you've taken some time to work on the agent and NQ as a whole. NQ is a lot more reliable than it was a few months ago.

I'm just hoping that the rate of false alerts will lower in the future. For that matter, I'd suggest implementing cross-checking on a regional scale; that is, having 2 or more nodes per continent, with one/the others confirming that a server effectively does not respond if one node reports so. That way, an alert wouldn't be sent out if one single node wrongfully detects the server as being down.
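A rough sketch of that rule, assuming each check node hands back a plain `up` or `down` verdict. The function name and the quorum of 2 are my own illustration, not anything NQ exposes today.

```shell
#!/bin/sh
# Sketch only: confirmed_down is a hypothetical helper.
# Usage: confirmed_down VERDICT [VERDICT...]   (each verdict is "up" or "down")
# Succeeds only when at least two nodes agree that the server is down.
confirmed_down() {
    down=0
    for verdict in "$@"; do
        if [ "$verdict" = "down" ]; then
            down=$((down + 1))
        fi
    done
    # A single dissenting node is not enough to raise an alert.
    [ "$down" -ge 2 ]
}
```

With this, `confirmed_down down up up` fails (no alert), while `confirmed_down down down up` succeeds.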

Joe_NQ · July 2014

@bdtech said:
Joe_NQ Awesome! Can process monitoring be a feature that can be disabled? It can obviously leak usernames and potentially even passwords on the command line.

The newest update now only collects and displays the service names so passwords displayed as parameters should not be visible anymore.
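As a rough illustration of the difference (a sketch, not the agent's actual code): reducing a full command line to the bare program name drops any credentials passed as arguments. The helper name `command_name` is hypothetical, and the approach is naive in that it assumes the program path itself contains no spaces.

```shell
#!/bin/sh
# Sketch only: command_name is a hypothetical helper.
# Usage: command_name "FULL COMMAND LINE"
command_name() {
    first_word=${1%% *}                 # drop everything after the first space (the arguments)
    printf '%s\n' "${first_word##*/}"   # drop any leading directory path
}

# On most systems ps can report the name directly, without arguments:
#   ps -eo comm=
```

For example, `command_name '/usr/bin/mysql -u root -psecret'` prints only `mysql`.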

@iceTwy said:
Joe_NQ: I'm digging the recent update! The new process monitoring feature is nice. I'm glad you've taken some time to work on the agent and NQ as a whole. NQ is a lot more reliable than it was a few months ago.

I'm just hoping that the rate of false alerts will lower in the future. For that matter, I'd suggest implementing cross-checking on a regional scale; that is, having 2 or more nodes per continent, with one/the others confirming that a server effectively does not respond if one node reports so. That way, an alert wouldn't be sent out if one single node wrongfully detects the server as being down.

I am glad you like it. We've actually just implemented an additional ping check with our newest web application update last night. Now, before an alert is triggered, our system attempts to ping the given address, so make sure you allow incoming ICMP packets on monitored systems. We hope this will greatly reduce the possibility of false alerts. @Monsta_AU
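The idea, as a minimal sketch rather than our production code: when an agent report goes missing, probe the address first and only alert if the probe also fails. The function name `should_alert` and the `PROBE` override variable are assumptions for illustration; `-c` and `-W` are the Linux iputils ping options for packet count and reply wait time.

```shell
#!/bin/sh
# Sketch only: should_alert and PROBE are hypothetical names.
# Usage: should_alert HOST
# Succeeds (exit 0) when the host does not answer and an alert is justified.
should_alert() {
    host="$1"
    # Default probe: three echo requests, waiting up to 2 seconds for each reply.
    # PROBE can be set to another command, which is handy for testing.
    if ${PROBE:-ping -c 3 -W 2} "$host" >/dev/null 2>&1; then
        return 1   # host answered: treat the missing report as a false alarm
    fi
    return 0       # no answer: the outage is confirmed
}
```

If ICMP is blocked on the monitored system, the probe always fails and every missed report turns into an alert, which is why incoming ICMP needs to be allowed.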

I would also like to point out that we improved the compatibility of our agent script with the help of recent debug data. If you happen to have (or know someone who has) a 'Raspberry Pi' or 'Arduino Yún', we would greatly appreciate it if you could try the newest version and provide feedback. We believe we found the reason why they were not working with earlier versions.

If you want to know more about our recent update, head over to our blog:
https://nodequery.com/blog/1028/application-update-agent-076-release

Thanks again to everyone for using our services, your support and feedback are greatly appreciated.

a_chris · July 2014

It would be great to tag servers and list/filter them by tags. It could be used to tag a server profile (for example, to have every application server listed together and get the whole view at first sight, or to view only servers in one location).

Thank you for this project.

jpsj · July 2014

@Joe_NQ said:
Thanks again to everyone for using our services, your support and feedback are greatly appreciated.

Great work on the latest update. Can you let me know which IPs the ICMP checks will source from so I can add them to my ACL? Thanks.

Monsta_AU · July 2014

Joe_NQ said: I am glad you like it. We've actually just implemented an additional ping check with our newest web application update last night. Now, before an alert is triggered, our system attempts to ping the given address, so make sure you allow incoming ICMP packets on monitored systems. We hope this will greatly reduce the possibility of false alerts. @Monsta_AU

Thanks Joe, but it has been triggering various servers here and there for a while now, just for one update. Seems really strange, as all my servers are in various places, yet there are no issues with Uptime Robot at all.

After I posted the last one, all servers reported an error at the same time, but my two other monitoring services were up and running and reporting ping success.

It seems to have been better in the last 2 days or so, definitely seen fewer 'false positives' since late last week.

rethinkvps · July 2014

Can I put a suggestion in?

On the main page, where you can see the overview of servers, it should be possible to categorize them.

Ryan22 · July 2014

Why choose India? Better use Singapore for Asia ping

tr1cky · July 2014

@Ryan22 said:
Why choose India? Better use Singapore for Asia ping

India is cheaper.

Joe_NQ · July 2014

@tr1cky said:
Feature request: Allow spaces in names!

Done. We didn't allow spaces before because we simply used a rule for hostnames to validate the value.

@a_chris said:
It would be great to tag servers and list/filter them by tags. It could be used to tag a server profile (for example, to have every application server listed together and get the whole view at first sight, or to view only servers in one location).

Thank you for this project.

I've implemented a simple text field to filter the servers by name. Tagging servers will be implemented when we have a little more time on our hands. @rethinkvps

@jpsj said:
Great work on the latest update. Can you let me know which IPs the ICMP checks will source from so I can add them to my ACL? Thanks.

Right now the checks originate only from our web application server (nodequery.com). Additional checks will most likely be performed by our ping nodes (ping-eu.nodequery.com, ping-us.nodequery.com, ping-as.nodequery.com) so you can either use the hostnames or their IP addresses for whitelisting.

@Ryan22 said:
Why choose India? Better use Singapore for Asia ping

We will change the Asia ping node at some point. Right now it is more about the comfort of having one provider than about the costs.

If someone is interested, we've finally released the public API today. I will write a small tutorial for PHP very soon on how to create a status page for your servers that can easily be integrated in your sites.

Thanks everyone, have a great day.

wrox · July 2014

Joe_NQ said: If someone is interested, we've finally released the public API today. I will write a small tutorial for PHP very soon on how to create a status page for your servers that can easily be integrated in your sites.

I am definitely interested. Thank you for informing us!

a_chris · July 2014

Joe_NQ said: I've implemented a simple text field to filter the servers by name. Tagging servers will be implemented when we have a little more time on our hands. @rethinkvps

Very interesting.
If you could allow some chars in the name (for example "(", ")", "[", "]") and change the way names are shown it would be almost the same.

Now names in the list are filtered only if the filter term is currently shown in the list (e.g. only if they are in the first part of the name) while it would be great to filter by every part of the name.

This way I could set names like this:
hostname (LOCATION) ([tag1] [tag2] [tag3] [tag4])
hostname2 (LOCATION) ([tag2] [tag3] [tag4] [tag5])
hostname3 (LOCATION) ([tag1] [tag3] [tag4])

And filtering by "[tag2]" would do the trick.

Joe_NQ · July 2014

@a_chris said:

Thank you for the suggestion Chris, I will give it a thought. Spaces are allowed since the last update and the filter has been fixed as of today. You can now search for every part of the name even when cut off.

We have additionally updated the notification system and included two of the most requested features. You can now specify the number of intervals after which a loss notification should be triggered, and set separate resource usage thresholds for system load, RAM and disk usage.
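To illustrate the interval setting, here is a minimal sketch (not our actual implementation) that counts how many intervals in a row were missed at the end of a history, so a notification fires only once that streak reaches the configured number. The function name `missed_streak` and the `ok`/`miss` labels are assumptions for illustration.

```shell
#!/bin/sh
# Sketch only: missed_streak is a hypothetical helper.
# Usage: missed_streak RESULT [RESULT...]   (oldest first, each "ok" or "miss")
# Prints the number of consecutive misses at the end of the history.
missed_streak() {
    streak=0
    for result in "$@"; do
        if [ "$result" = "miss" ]; then
            streak=$((streak + 1))
        else
            streak=0   # any received report resets the count
        fi
    done
    printf '%s\n' "$streak"
}
```

With a setting of 3 intervals, a history of `ok miss miss` gives a streak of 2 and stays quiet, while one more miss would trigger the notification.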

I hope you'll find the changes useful. Thank you for your support.

D4X69 · July 2014

This is.. beautiful! I was using a series of bash scripts (that probably weren't written properly because I fail at bash) to monitor my shitty 123Systems servers. With this, I no longer have to tail downtime.log and dig through everything - all simplified, on one page! Excellent work!

I hope Beta users will get a discounted rate, but even if they don't, I will probably pay for this service regardless.

Also: Outlook sends your emails to Junk - don't know if someone's said this already, just wanted to make sure you knew.

Edit: @Joe_NQ - I don't know if it's been suggested (or if it's there already and I just missed it), but perhaps add a way to reset statistics for a server? I shut down one of my servers for testing purposes, but I don't want that to be reflected on the true availability, as the server is actually quite stable.

Another suggestion (and again, maybe it's been suggested or is already there, I didn't read all 7 pages) would be to add SMS alerts if a server is unresponsive.

OnraHost · July 2014

@Joe_NQ Great work! I signed up a while ago but just started testing it out recently. I will look out for any bugs I come across.
