Simple server monitoring - Feedback appreciated (Public beta)

Comments

  • About the only thing missing is a ps display of what your server is doing, much like elastictrace shows.

    I have had issues with Elastictrace on two of my three VPSes. It looks like I have resolved one, but the other just will not install properly. I believe it is an IPv6 issue, but trying to nail it down is difficult.

    NodeQuery just works, and very well. I love the simple interface, but it does mean the hard data isn't there. Hopefully you keep adding to it Joe.

  • bdtech Member
    edited July 2014

    @Joe_NQ Joe I would look at reducing the nq-agent wget timeout from 60 seconds to 20 seconds. During the NQ outage on the morning of July 4th I had 3 servers with NQ OOM'ing and maxed out of memory/kernel errors. 2 of 3 servers needed to have MySQL started (process was killed), and the third server was entirely unresponsive (including console access) which required a hard reboot. The issues arose at:

    Fri, Jul 4, 2014 at 5:28 AM (EST)

    Fri, Jul 4, 2014 at 6:13 AM

    Fri, Jul 4, 2014 at 6:38 AM

    Jul 4 06:09:33 nye1 kernel: [4973466.827859] Out of memory: Kill process 4398 (wget) score 224 or sacrifice child

    Thanked by 1 Joe_NQ
  • netrix Member

    Output from command bash /etc/nodequery/nq-agent.sh > /etc/nodequery/nq-cron.log ..

    df: `/var/named/chroot/var/run/dbus': Permission denied
    df: `/var/named/chroot/var/run/dbus': Permission denied
    df: `/var/named/chroot/var/run/dbus': Permission denied
    
    Thanked by 1 Joe_NQ
  • BotoX Member

    Disk usage is not working with ZFS (on Linux) by the way.

    It'd be nice if you could add support for it.

    Thanked by 1 Joe_NQ
  • Joe_NQ Member

    @bdtech said:
    Joe_NQ Joe I would look at reducing the nq-agent wget timeout from 60 seconds to 20 seconds. During the NQ outage on the morning of July 4th I had 3 servers with NQ OOM'ing and maxed out of memory/kernel errors. 2 of 3 servers needed to have MySQL started (process was killed), and the third server was entirely unresponsive (including console access) which required a hard reboot. The issues arose at:

    Fri, Jul 4, 2014 at 5:28 AM (EST)

    Fri, Jul 4, 2014 at 6:13 AM

    Fri, Jul 4, 2014 at 6:38 AM

    Jul 4 06:09:33 nye1 kernel: [4973466.827859] Out of memory: Kill process 4398 (wget) score 224 or sacrifice child

    Thank you very much for reporting this issue, that's definitely not good. I've seen it happening before but was certain it must have been an error on my part. I will lower the timeout value and include an additional process termination.

    @netrix said:
    Output from command bash /etc/nodequery/nq-agent.sh > /etc/nodequery/nq-cron.log ..

    df: `/var/named/chroot/var/run/dbus': Permission denied
    df: `/var/named/chroot/var/run/dbus': Permission denied
    df: `/var/named/chroot/var/run/dbus': Permission denied
    

    May I ask which distribution/version you are using? I hope to improve this function very soon. Right now I am focused on the installation script, alerting and our API.

    @BotoX said:
    Disk usage is not working with ZFS (on Linux) by the way.

    It'd be nice if you could add support for it.

    I will look into it, thank you.

  • bdtech Member
    edited July 2014

    @Joe_NQ

    Great, thanks! FYI, here's how I pulled it up: cd /var/log; grep wget messages messages.1 syslog syslog.1

    messages:Jul 4 04:33:59 acad-dev kernel: [13177802.721740] [16841] 999 16841 21805 20404 0 0 0 wget
    messages:Jul 4 04:33:59 acad-dev kernel: [13177802.723851] [17294] 999 17294 18423 17023 0 0 0 wget
    messages:Jul 4 04:33:59 acad-dev kernel: [13177802.725902] [17756] 999 17756 13349 11948 0 0 0 wget
    messages:Jul 4 04:33:59 acad-dev kernel: [13177802.727904] [18213] 999 18213 8276 6875 0 0 0 wget
    messages:Jul 4 04:33:59 acad-dev kernel: [13177802.730453] [18674] 999 18674 3201 1796 0 0 0 wget
    messages:Jul 4 05:29:25 acad-dev kernel: [13181129.125541] [24624] 999 24624 28569 27168 0 0 0 wget
    messages:Jul 4 05:29:25 acad-dev kernel: [13181129.127669] [25107] 999 25107 23496 22095 0 0 0 wget
    messages:Jul 4 05:29:25 acad-dev kernel: [13181129.129719] [25539] 999 25539 20114 18712 0 0 0 wget
    messages:Jul 4 05:29:25 acad-dev kernel: [13181129.131790] [26025] 999 26025 15041 13641 0 0 0 wget
    messages:Jul 4 05:29:25 acad-dev kernel: [13181129.134193] [26454] 999 26454 11658 10257 0 0 0 wget
    messages:Jul 4 05:29:25 acad-dev kernel: [13181129.136189] [26937] 999 26937 6585 5183 0 0 0 wget
    messages:Jul 4 05:32:38 acad-dev kernel: [13181321.200405] wget invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
    messages:Jul 4 05:32:38 acad-dev kernel: [13181321.200974] wget cpuset=/ mems_allowed=0
    messages:Jul 4 05:32:38 acad-dev kernel: [13181321.201239] Pid: 25539, comm: wget Not tainted 3.2.0-4-686-pae #1 Debian 3.2.41-2+deb7u2
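
    The same check can be wrapped in a small function. This is only a sketch: `count_oom_invocations` is a hypothetical name, and the two sample lines are copied from the output above into a temporary file so the example does not depend on the contents of /var/log.

    ```shell
    # count_oom_invocations PROCESS FILE...: hypothetical helper.
    # Prints how many "PROCESS invoked oom-killer" lines appear in the given
    # log files. Missing or unreadable files are skipped quietly.
    count_oom_invocations() {
        proc="$1"
        shift
        grep -h -- "$proc invoked oom-killer" "$@" 2>/dev/null | wc -l | tr -d ' '
    }

    # Sample data taken from the log excerpt in this post.
    sample=$(mktemp)
    printf '%s\n' \
        'Jul 4 05:29:25 acad-dev kernel: [13181129.136189] [26937] 999 26937 6585 5183 0 0 0 wget' \
        'Jul 4 05:32:38 acad-dev kernel: [13181321.200405] wget invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0' \
        > "$sample"

    oom_count=$(count_oom_invocations wget "$sample")
    echo "wget triggered the OOM killer $oom_count time(s)"
    rm -f "$sample"
    ```

    On a real server the file arguments would be the log files named in the grep command above.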

  • bdtech Member

    @Joe_NQ Here's my fix in the meantime
    timeout 15 wget -q -o /dev/null -O /etc/nodequery/nq-agent.log -T 10

  • Joe_NQ Member

    @bdtech said:
    Joe_NQ Here's my fix in the meantime
    timeout 15 wget -q -o /dev/null -O /etc/nodequery/nq-agent.log -T 10

    Using 'timeout' is actually the best way to prevent this for now. I have implemented it in our script with slightly adjusted values and hope it won't be happening again. Thank you for your help.
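
    A minimal sketch of what the `timeout` wrapper does, assuming GNU coreutils `timeout` is installed. The helper name `run_with_limit` is hypothetical and `sleep 5` only stands in for a hung wget call. This is not the code from the agent script.

    ```shell
    # run_with_limit SECONDS COMMAND...: hypothetical helper.
    # GNU timeout sends SIGTERM to COMMAND once SECONDS have passed and then
    # exits with status 124, so a stuck download cannot pile up in memory.
    run_with_limit() {
        limit="$1"
        shift
        timeout "$limit" "$@"
    }

    # "sleep 5" stands in for a wget call that never returns.
    run_with_limit 1 sleep 5
    status=$?

    if [ "$status" -eq 124 ]; then
        echo "command was stopped after hitting the time limit"
    fi
    ```

    Checking for status 124 lets the caller tell a forced stop apart from an ordinary failure of the wrapped command.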

    We also introduced process monitoring with the latest update. Right now you can simply view the top processes sorted by resource usage during the last interval. As soon as we release our new notification system you will also have it attached to every resource usage alert.

    Additionally, our API is almost done and will finally be released within the next few days. It will be read-only for now but extended in the future to support the creation of new servers.

    Many thanks to everyone using our service.

  • BlaZe Member, Host Rep

    Using it to monitor all my servers. Thanks a lot for this. Amazing design.

    Thanked by 1 Joe_NQ
  • bdtech Member
    edited July 2014

    @Joe_NQ Awesome! Can process monitoring be a feature that can be disabled? It can obviously leak usernames and potentially even passwords on the command line.

  • Joe_NQ Member

    @bdtech said:
    Joe_NQ Awesome! Can process monitoring be a feature that can be disabled? It can obviously leak usernames and potentially even passwords on the command line.

    We could easily add a function to disable process monitoring in the web application so it won't be saved to the database. However, we're currently considering a different output that will show the service name only instead of the full command which would offer more privacy.

    For now, if you have a server with very sensitive data you can manually add processes_array=" " (mind the space) to line 74 to prevent process data from being transmitted.

  • tr1cky Member

    Feature request: Allow spaces in names!

  • Just had all servers report failure when other monitors say they are up.

    Not exactly inspiring confidence in the service.

  • Joe_NQ Member

    @Monsta_AU said:
    Just had all servers report failure when other monitors say they are up.

    Not exactly inspiring confidence in the service.

    Indeed, we just had an unusually high number of alerts being processed, where 5 of our 11 test servers triggered notifications. Strangely, only Linode and DigitalOcean services were affected on our end.

    Might have been a routing issue to our server - I will look into it.

  • My Kimsufi box popped up with an outage alert too.

  • noyle Member

    @Joe_NQ
    Looks great, and thanks for the work! I'm just using NodeQuery to watch my two VPSs. I would give a Pro plan a try if there's one in the future.

    Thanked by 1 Joe_NQ
  • iceTwy Member
    edited July 2014

    @Joe_NQ: I'm digging the recent update! The new process monitoring feature is nice. I'm glad you've taken some time to work on the agent and NQ as a whole. NQ is a lot more reliable than it was a few months ago.

    I'm just hoping that the rate of false alerts will lower in the future. For that matter, I'd suggest implementing cross-checking on a regional scale; that is, having 2 or more nodes per continent, with one/the others confirming that a server effectively does not respond if one node reports so. That way, an alert wouldn't be sent out if one single node wrongfully detects the server as being down.
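
    The cross-checking idea can be sketched as a simple quorum rule. Everything here is an assumption for illustration: `should_alert` is a hypothetical helper, and each node's verdict is passed in as the word "up" or "down". It does not describe how NodeQuery actually works.

    ```shell
    # should_alert MIN_DOWN RESULT...: hypothetical helper.
    # Each RESULT is "up" or "down" as reported by one check node. Succeeds
    # (exit status 0) only when at least MIN_DOWN nodes report "down".
    should_alert() {
        min_down="$1"
        shift
        down=0
        for result in "$@"; do
            if [ "$result" = "down" ]; then
                down=$((down + 1))
            fi
        done
        [ "$down" -ge "$min_down" ]
    }

    # A single node disagreeing with the other two is not enough to alert.
    if should_alert 2 down up up; then
        echo "alert"
    else
        echo "no alert"
    fi
    ```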

  • Joe_NQ Member

    @bdtech said:
    Joe_NQ Awesome! Can process monitoring be a feature that can be disabled? It can obviously leak usernames and potentially even passwords on the command line.

    The newest update now only collects and displays the service names so passwords displayed as parameters should not be visible anymore.

    @iceTwy said:
    Joe_NQ: I'm digging the recent update! The new process monitoring feature is nice. I'm glad you've taken some time to work on the agent and NQ as a whole. NQ is a lot more reliable than it was a few months ago.

    I'm just hoping that the rate of false alerts will lower in the future. For that matter, I'd suggest implementing cross-checking on a regional scale; that is, having 2 or more nodes per continent, with one/the others confirming that a server effectively does not respond if one node reports so. That way, an alert wouldn't be sent out if one single node wrongfully detects the server as being down.

    I am glad you like it. We've actually just implemented an additional ping check with our newest web application update last night. Now, before an alert is triggered, our system attempts to ping the given address, so make sure you allow incoming ICMP packets on monitored systems. We hope this will greatly reduce the possibility of false alerts. @Monsta_AU
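
    As a rough sketch of a ping confirmation step, assuming the Linux iputils `ping` (where `-c 1` sends one packet and `-W 2` waits up to 2 seconds for a reply). The helper `confirm_down` and the `PING_CMD` override are hypothetical and only illustrate the idea.

    ```shell
    # confirm_down HOST: hypothetical helper.
    # Succeeds (exit status 0) only when HOST does not answer a single ICMP
    # echo request. PING_CMD can be set to another command for testing; it is
    # left unquoted on purpose so the default splits into separate words.
    confirm_down() {
        host="$1"
        if ${PING_CMD:-ping -c 1 -W 2} "$host" >/dev/null 2>&1; then
            return 1
        fi
        return 0
    }

    if confirm_down 127.0.0.1; then
        echo "no ping reply, the alert would go out"
    else
        echo "host answered the ping, the alert would be suppressed"
    fi
    ```

    A host that blocks ICMP never answers, so every alert for it would be treated as confirmed.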

    I would also like to point out that we improved the compatibility of our agent script with the help of recent debug data. If you happen to have (or know someone who has) a 'Raspberry Pi' or 'Arduino Yún', we would greatly appreciate it if you could try the newest version and provide feedback. We believe we found the reason why they were not working with earlier versions.

    If you want to know more about our recent update, head over to our blog:
    https://nodequery.com/blog/1028/application-update-agent-076-release

    Thanks again to everyone for using our services, your support and feedback is greatly appreciated.

  • It would be great to tag servers and list/filter them by tags. Tags could be used to mark a server's profile (for example, to list every application server together and get the whole view at first sight, or to view only servers in one location).

    Thank you for this project.

  • jpsj Member

    @Joe_NQ said:
    Thanks again to everyone for using our services, your support and feedback is greatly appreciated.

    Great work on the latest update. Can you let me know which IPs the ICMP checks will originate from so I can add them to my ACL? Thanks.

    Thanked by 1 M66B
  • Joe_NQ said: I am glad you like it. We've actually just implemented an additional ping check with our newest web application update last night. Now, before an alert is triggered, our system attempts to ping the given address, so make sure you allow incoming ICMP packets on monitored systems. We hope this will greatly reduce the possibility of false alerts. @Monsta_AU

    Thanks Joe, but it has been triggering various servers here and there for a while now, just for one update. It seems really strange, as all my servers are in various places, yet there are no issues with Uptime Robot at all.

    After I posted the last one, all servers reported an error at the same time, but my two other monitoring services were up and running and reporting ping success.

    It seems to have been better in the last 2 days or so, definitely seen fewer 'false positives' since late last week.

  • Can I put a suggestion in?

    On the main page, where you can see the overview of servers, it should be possible to categorize them.

  • Ryan22 Member

    Why choose India? Better use Singapore for Asia ping

  • tr1cky Member

    @Ryan22 said:
    Why choose India? Better use Singapore for Asia ping

    India is cheaper.

  • Joe_NQ Member

    @tr1cky said:
    Feature request: Allow spaces in names!

    Done. We didn't allow spaces before because we simply used a rule for hostnames to validate the value.

    @a_chris said:
    It would be great to tag servers and list/filter them by tags. Tags could be used to mark a server's profile (for example, to list every application server together and get the whole view at first sight, or to view only servers in one location).

    Thank you for this project.

    I've implemented a simple text field to filter the servers by name. Tagging servers will be implemented when we have a little more time on our hands. @rethinkvps

    @jpsj said:
    Great work on the latest update. Can you let me know which IPs the ICMP checks will originate from so I can add them to my ACL? Thanks.

    Right now the checks originate only from our web application server (nodequery.com). Additional checks will most likely be performed by our ping nodes (ping-eu.nodequery.com, ping-us.nodequery.com, ping-as.nodequery.com) so you can either use the hostnames or their IP addresses for whitelisting.
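
    For anyone whitelisting by hostname, a sketch that only prints candidate rules, assuming a firewall managed with `iptables`. The helper `print_icmp_allow_rules` is hypothetical, and the rules are echoed rather than applied so they can be reviewed first. Note that iptables resolves a hostname given to `-s` once, when the rule is added.

    ```shell
    # print_icmp_allow_rules SOURCE...: hypothetical helper.
    # Prints (does not run) one iptables rule per SOURCE that accepts
    # incoming ICMP echo requests from that host or address.
    print_icmp_allow_rules() {
        for src in "$@"; do
            echo "iptables -A INPUT -p icmp --icmp-type echo-request -s $src -j ACCEPT"
        done
    }

    # Hostnames as listed in this post.
    print_icmp_allow_rules \
        nodequery.com \
        ping-eu.nodequery.com \
        ping-us.nodequery.com \
        ping-as.nodequery.com
    ```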

    @Ryan22 said:
    Why choose India? Better use Singapore for Asia ping

    We will change the Asia ping node at some point; right now it is more about the comfort of having one provider than about the costs.

    If someone is interested, we've finally released the public API today. I will write a small tutorial for PHP very soon on how to create a status page for your servers that can easily be integrated in your sites.

    Thanks everyone, have a great day.

    Thanked by 2 tr1cky wrox
  • wrox Member
    edited July 2014

    Joe_NQ said: If someone is interested, we've finally released the public API today. I will write a small tutorial for PHP very soon on how to create a status page for your servers that can easily be integrated in your sites.

    I am definitely interested. Thank you for informing us!

    Thanked by 1 Joe_NQ
  • Joe_NQ said: I've implemented a simple text field to filter the servers by name. Tagging servers will be implemented when we have a little more time on our hands. @rethinkvps

    Very interesting.
    If you could allow some chars in the name (for example "(", ")", "[", "]") and change the way names are shown it would be almost the same.

    Right now, names in the list are matched only if the filter term is in the part currently shown in the list (e.g. only if it is in the first part of the name), while it would be great to filter by every part of the name.

    This way I could set names like this:
    hostname (LOCATION) ([tag1] [tag2] [tag3] [tag4])
    hostname2 (LOCATION) ([tag2] [tag3] [tag4] [tag5])
    hostname3 (LOCATION) ([tag1] [tag3] [tag4])

    And filtering by "[tag2]" would do the trick.
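
    The naming scheme above can be tried out locally with a fixed-string match. This is just a sketch: `filter_by_tag` is a hypothetical helper, and `grep -F` is used so that the square brackets are matched literally instead of being read as a bracket expression.

    ```shell
    # filter_by_tag TAG: hypothetical helper.
    # Reads server names on standard input and prints those that contain TAG
    # as a literal substring anywhere in the name.
    filter_by_tag() {
        grep -F -- "$1"
    }

    printf '%s\n' \
        'hostname (LOCATION) ([tag1] [tag2] [tag3] [tag4])' \
        'hostname2 (LOCATION) ([tag2] [tag3] [tag4] [tag5])' \
        'hostname3 (LOCATION) ([tag1] [tag3] [tag4])' \
        | filter_by_tag '[tag2]'
    ```

    Only the first two names carry `[tag2]`, so those are the lines that come back.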

  • Joe_NQ Member

    @a_chris said:

    Thank you for the suggestion Chris, I will give it a thought. Spaces are allowed since the last update and the filter has been fixed as of today. You can now search for every part of the name even when cut off.

    We have additionally updated the notification system and included two of the most requested features. You can now specify the number of intervals after which a loss notification should be triggered, and set separate resource usage thresholds for system load, RAM and disk usage.
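
    The interval setting can be pictured with a small sketch. The helper `consecutive_misses`, the "ok"/"miss" status words and the value of `required` are all hypothetical and only illustrate the idea of waiting for several missed updates in a row.

    ```shell
    # consecutive_misses STATUS...: hypothetical helper.
    # STATUS values are "ok" or "miss", oldest first. Prints how many updates
    # in a row have been missed at the end of the list; any "ok" resets it.
    consecutive_misses() {
        count=0
        for s in "$@"; do
            if [ "$s" = "miss" ]; then
                count=$((count + 1))
            else
                count=0
            fi
        done
        echo "$count"
    }

    required=3   # assumed setting: intervals to wait before notifying
    missed=$(consecutive_misses ok miss ok miss miss)

    if [ "$missed" -ge "$required" ]; then
        echo "loss notification"
    else
        echo "waiting ($missed of $required intervals missed)"
    fi
    ```

    A short blip that recovers within the chosen number of intervals never reaches the notification branch.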

    I hope you'll find the changes useful. Thank you for your support.

    Thanked by 1 Monsta_AU
  • D4X69 Member
    edited July 2014

    This is.. beautiful! I was using a series of bash scripts (that probably weren't written properly because I fail at bash) to monitor my shitty 123Systems servers. With this, I no longer have to tail downtime.log and dig through everything - all simplified, on one page! Excellent work!

    I hope Beta users will get a discounted rate, but even if they don't, I will probably pay for this service regardless.

    Also: Outlook sends your emails to Junk - don't know if someone's said this already, just wanted to make sure you knew.

    Edit: @Joe_NQ - I don't know if it's been suggested (or if it's there already and I just missed it), but perhaps add a way to reset statistics for a server? I shut down one of my servers for testing purposes, but I don't want that to be reflected on the true availability, as the server is actually quite stable.

    Another suggestion (and again, maybe it's been suggested or is already there, I didn't read all 7 pages) would be to add SMS alerts if a server is unresponsive.

  • @Joe_NQ Great work! I signed up a while ago but just started testing it out recently. I will look out for any bugs I come across.
