Comments
@Virmach is there a specific reason you pull statistics from ps instead of a hypervisor-specific method? There are many ways to pull more accurate statistics than ps.
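(For illustration only — a minimal sketch of one hypervisor-level approach using the libvirt Python bindings; "guest-name" is a placeholder and this is not necessarily how VirMach collects its numbers.)

```python
# Sketch: read a KVM guest's aggregate CPU time from libvirt instead of ps,
# then convert the delta over a sampling window into a percent-of-one-core figure.
import time
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest-name")  # placeholder domain name

def guest_cpu_ns(domain):
    # getCPUStats(True) returns one aggregate entry; cpu_time is in nanoseconds
    return domain.getCPUStats(True)[0]["cpu_time"]

t0, c0 = time.time(), guest_cpu_ns(dom)
time.sleep(60)
t1, c1 = time.time(), guest_cpu_ns(dom)

# 100 means the guest kept one host core fully busy for the whole minute
print(f"{(c1 - c0) / ((t1 - t0) * 1e9) * 100:.1f}%")
```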
Bharat and Miguel are everywhere... great!
Dat response though, beautiful O_o
THIS is how you deal with these threads. No promises, just a summary of actions taken.
Hats off to @virmach, although I hope you keep monitoring your automated system.
Just curious: are these guys @BharatB and @MikePT ?
Exactly
Anyhow, just to offer a normal-service data point:
I have 3 VPS, of which I use only 1 (sometimes) and the rest are idle.
And things are business as usual.
Simply put, things might not be bad for all of its users.
Mine was suspended repeatedly without any notice for maxing out my RAM. Sometimes I found out a week later because no one at Virmach sent out a notice. It was the 128MB $3/yr deal but I'm still glad I left. I was running the same services on host1free (remember them?? free 128MB VPS) and quite a few other LEBs. Virmach were the only one to suspend!
Seems like Virmach will sell you a VPS but doesn't want you to use it.
We do not track the processes within VMs; we only measure the utilization of the KVM guest itself.
I did include an RRD graph in both tickets; neither showed any significant utilization, especially not "for multiple hours".
Regardless, what do you expect to find in the local logs after shutting the box down?
https://pastebin.com/RqA4sJrn
Here is a raw data log (syntax epoch timestamp/data) from my monitoring showing CPU idle (in percent) over the past seven days, one datapoint per line/minute, collected using SNMP generic OIDs, so essentially what /proc/stat was reporting at the time of collection.
I wouldn't be surprised if the data I collect from within differs from what you see from without, but we are talking about an almost 100% discrepancy here, or at least so it seems.
If you need more/different metrics let me know.
Is 100 == 100% idle? Usually it is the other way around?
Ticks spent idle as reported by UCD-SNMP-MIB::ssCpuRawIdle.0, multiplied by 0.01 to get a percentage, stored as a float.
I don't save the raw counters, too much data.
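(Rough equivalent of what that OID reflects on the guest — a minimal sketch computing idle percent from /proc/stat tick deltas; the SNMP daemon's exact bookkeeping may differ slightly.)

```python
# Sketch: idle % over a one-minute window from /proc/stat tick deltas,
# i.e. roughly what ssCpuRawIdle.0 counts on the guest side.
import time

def cpu_ticks():
    with open("/proc/stat") as f:
        vals = list(map(int, f.readline().split()[1:]))  # aggregate "cpu" line
    return vals[3], sum(vals)  # (idle ticks, total ticks)

idle0, total0 = cpu_ticks()
time.sleep(60)
idle1, total1 = cpu_ticks()

print(f"idle: {100.0 * (idle1 - idle0) / (total1 - total0):.2f}%")
```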
This is indeed the case, so far. Since yesterday's reply we've collected further data and have come close to completing our investigation, and everything points to "business as usual" for nearly all our customers.
We did discover something new since yesterday.
Three nodes specifically (NYKVM36, LAKVM10, ATLKVM4) were de-synced, which caused the node to preserve some numbers it shouldn't have; this resulted in some of those "max" numbers being sent out. @MasonR was one of those lucky few. As I mentioned in my earlier response, the node you (@MasonR) were on had a normal level of load but some CPU spikes. In this specific case, the system was supposed to check the theoretical max for the VMs it suspected of abuse, and if the number was over the limit we previously set of (cores*120), the number should have been thrown out. This is the case for all of our other nodes, but unfortunately these nodes did not have that functionality working appropriately and were affected by recent spikes. This also solves any remaining mystery around the repeated "239.6" and "119.8" numbers.
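(Assumed logic only, not VirMach's actual code — a minimal sketch of the sanity check described above, where any reading over the cores*120 ceiling gets thrown out instead of triggering action.)

```python
# Sketch of the described sanity check: discard CPU readings above the
# theoretical per-VM ceiling of cores * 120 (values like 239.6 on a 2-core
# plan sit right at that ceiling; anything above it is considered bogus).
def within_ceiling(reported_pct: float, cores: int) -> bool:
    return reported_pct <= cores * 120

readings = [45.0, 119.8, 239.6, 860.0]
print([r for r in readings if within_ceiling(r, cores=2)])
# -> [45.0, 119.8, 239.6]; 860.0 would be discarded rather than acted on
```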
We ran a query on the database of all abuses, and this only happened between 07/15/2018 and 07/26/2018. There were 33 warnings without action, and 8 shutdowns, with 0 suspensions. These are all the people affected by a de-sync ever, and they will be e-mailed and credited. We have already patched this issue.
We have not found any other issues, but are still looking. We're going to attempt to run more useful queries and gather more data.
We will be replying to all incorrect tickets, and providing appropriate SLA credits. So far, things are looking better than expected. If we find any further groups of affected people we will do the same for those groups, or send an e-mail if the issue is more widespread. If you were sent an incorrect warning you will be credited (1) week SLA for the service warned. If you were shut down, you will be credited (1) month.
We also made another change -- lower-tier outsourced agents now have access to the abuse logs and they will be told to specifically check the logs before making a reply.
If you do not receive an e-mail and credits or this happened before July 15th, please contact us.
As for what we expect to find in terms of accessing your service, lower-tier support agents are trained to verify potential issues for the customer if they are not running their own monitoring. In these cases, customers usually request assistance in monitoring their usage and we help. In your case, it may not have been entirely warranted if you already had the logs. Either way though, if it got escalated to a system administrator, accessing your service would be helpful so we could run a stress test and compare numbers.
As usual, when you make a reply like this out of context I will just tell everyone to check this thread: https://www.lowendtalk.com/discussion/135460/virmach-vps-is-up-but-not-accessible-after-recent-maintenance
Hint: you were actually maxing everything out, to the point where your applications crashed and you blamed us for downtime.
Well i guess that settles it then.
Warning on 21st, shutdown on 24th
Awaiting my one month credit
You, I like you.
While you're fixing stuff, take a look at this:
< [A bunch of stuff I didn't read, blah blah blah sounds like being honest and nice]
50% off? wow, talk about showing some real love, going to have to turn this into a lesson to educate my GF
P.S. The conversation with my GF prob won't end well.
I just want to say; this thread is long.
I was using the resources I had paid for. NO other provider has had a problem with me running the same services, including Vultr, DO, Nephoscale, even host1free (which was a free 128MB VPS). You were the only ones to suspend, with your BS reasons. Worse still, you never informed me that my VPS was suspended. I only found out when I went to use the service and discovered my VPS had been suspended, for up to 5 days in one case. What kind of crap is that?