CPU Abuse Notices from VirMach

tsidhu · July 2018

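htop is way better as it displays the percentage per CPU (as you already know from your pic). By default, top runs in Irix mode and shows a process's usage summed across all CPU cores, so it can exceed 100%; toggling Irix mode off with "shift-i" divides that figure by the number of CPUs.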

This still doesn't quite explain 240% on a 2-core VPS. I believe that if the host is purposefully oversubscribed and hands out more vCPUs than it has physical threads, then the statistics will be skewed when the VMs' CPU load is checked from the underlying host system.
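To make the arithmetic concrete: a tool that sums usage across cores tops out at 100% per core, so a 2-core guest should normally max out at 200%. Here is a minimal sketch (the function name is made up for illustration) that turns a summed figure into a per-core average:

```python
def per_core_average(summed_percent: float, cores: int) -> float:
    """Convert a CPU percentage summed across cores into a per-core average."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return summed_percent / cores


# The 240% figure from the abuse notice, on a 2-vCPU VPS
print(per_core_average(240.0, 2))  # 120.0
```

A per-core figure above 100% can't come from what htop shows inside the guest, so the number is presumably measured somewhere else, such as on the host.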

AlyssaD · July 2018

I wonder if wait time is calculated in that mix.

stefeman · July 2018

@MasonR said:
Hey all,

I'll preface this post by saying that VirMach is a terrific provider and this thread is not meant to throw shade at them, just is intended to spark a healthy discussion on a possibly faulty system.

tl;dr: got a CPU abuse message, but VPS is mostly idle. Support insisted there is no chance their abuse detection system is wrong. Anyone else have same experience?

Long version:

I posted over on HostBalls about receiving a CPU abuse notice from VirMach and believing it to be a mistake on their part. To my surprise, I got quite a few replies from others saying they've gotten similar notices erroneously.

Here's a little background - this morning I received an automated email that one of my VPSes was using 240% CPU for many hours and to reduce CPU usage or I will have my service temporarily turned off. I monitor my array of VMs regularly (via live server stats) and was surprised to get this message. I immediately logged into the VPS that I was notified about and found the usage/load -

Low CPU usage and a 0.00 load avg as expected. I checked the access logs and there weren't any suspicious logins. Root login is disabled and fail2ban is also installed. Really the only thing this VPS runs is a small TeamSpeak server; otherwise it is idle, as seen in the screencap above.

I replied to the ticket that there must be a mistake and that my VPS is barely using any resources. Level 3 support replies that, "this is a system generated message so it won't be wrong," then instructs me how to use the task manager to monitor CPU usage. I reply that I'm using Linux and attach the screenshot above, then get instructions on how to use top/htop to monitor CPU usage (even though the screenshot was an htop cap).

Side note - how does one reach 240% CPU usage with only 2 vCPUs?

How many of you guys have encountered similar issues with VirMach or other providers? Anyone have their VPS temporarily suspended even though they weren't using lots of resources?

Are you using teamspeak.red or r4p3.net? If so, there's your answer. They're backdoored.

MikeA · July 2018

@stefeman said:

Are you using teamspeak.red or r4p3.net? If so, there's your answer. They're backdoored.

I doubt he's using one of those.. he's staff after all.

The joys of cracked Teamspeak though.. never ends.. (well, until the next update that prevents it!)

Ndha · July 2018

You're so lucky.
I got their 512 MB VPS as a gift from CM. I installed Debian 8 and Ubuntu 16, and the VM got stuck midway whenever I ran apt on it. I installed CentOS 7 and the same thing happened. When it gets stuck it shuts off automatically, so the gift sits idle, lol.

kendid · July 2018

SSD4G - same thing. I only had windows running, idling - hadn't had time to install any other software! I don't need it running all the time for what I'm using it for, so I just shut it down now when I am finished.

ricardo · July 2018

Does OVZ count things in kernel space as belonging to the user when measuring CPU? Does the container?

Just mentioning as I'd had warnings in the past from other providers, even when 'top' etc would not be reporting anything within the container. The program I was using made extensive use of event-based stuff with epoll, zero-copy with sendfile etc.

corbpie · July 2018

Gotta love a broken process

MasonR · July 2018

@greattomeetyou said:

MasonR said: using 240% CPU for many hours

They claimed many hours. You just have to prove otherwise? Do you happen to have logs or stats to prove it otherwise?

Unfortunately, no. I only do live monitoring and don't store the stats. The general purpose of this thread, though, is to see if this is a recurring theme and others are facing the same problem. If it was just me, then sure it might have been my VPS going bonkers. But since many others run into this as well with idle VMs, then something in the way they detect abuse is severely flawed.

@stefeman said:
Are you using teamspeak.red or r4p3.net? If so, there's your answer. They're backdoored.

It's just the small 32 slot server that you're permitted to run for free without a license.

saibal · July 2018

kendid said: I only had windows running, idling - hadn't had time to install any other software!

Windows updates are known to peg the CPU.

lemon · July 2018

@saibal said:

kendid said: I only had windows running, idling - hadn't had time to install any other software!

Windows updates are known to peg the CPU.

be virmach

offer windows server

suspend server for automatic updates

???

profit

corbpie · July 2018

@lemon said:

@saibal said:

kendid said: I only had windows running, idling - hadn't had time to install any other software!

Windows updates are known to peg the CPU.

be virmach

offer windows server

suspend server for automatic updates

???

profit

Virmach has pay-to-open tickets as well

r00t4bl3 · July 2018

It feels funny when I read their response message. The system won't be wrong? Who created the system in the first place?

jetchirag · July 2018

@r00t4bl3 said:
It feels funny when I read their response message. The system won't be wrong? Who created the system in the first place?

The mighty lord, GOD?

VirMach · July 2018

We're discussing this internally, and we'll update this thread with any findings. I encourage those negatively affected to message me the account details so we can look at each case. I also encourage anyone who believes the system to be making a mistake to request the ticket be escalated for evaluation. The system is not perfect, but it's had several improvements since inception. What the support agent is supposed to do is manually review the logs that the system outputs and make a decision. These should get thrown out by the agent if the node is facing any large level of abuse or load, and they do automatically get thrown out if the number doesn't look right.

False positives are rare, but they do still happen. The agent may have said the system is correct because in most cases it is, but that should not be the agent's default answer. If you have a problem with the agent's response please have it escalated to management. The agent should not be saying there's zero chance of the system being incorrect.

Suspensions are extremely rare, and our system is lenient. There are several steps required for the system to be able to take automatic action.

In 9 months, 0.6% of all services have been suspended by the automated system
80% of all "suspected" abuse gets thrown out by the system
Warnings reset every week, and only get sent out if the system outputs the same levels for several hours
Results to action may be delayed by 8 hours, so it's possible that the high usage took place 8 hours ago
We do not use top/htop to gather the data, and we do not use system load, but it's possible for I/O usage to affect the data we gather

MasonR · July 2018

@VirMach - I appreciate the response and the willingness to help find a solution for this and the details around the abuse system.

In my case, I'm not too worried about getting another abuse notice as I think the chances are pretty slim. Just was put off with support's response after I asked if there could be a mistake.

But I'm glad this dialog has opened up a channel to address this issue for your other clients, as that's a win-win in my eyes. Customers who are able to work out these issues with you and not get their services suspended will likely renew and keep business coming your way.

tsidhu · July 2018

@MasonR said: I appreciate the response and the willingness to help find a solution for this and the details around the abuse system.

+1 nice to see the vendor respond here and communicate to others. Makes a big difference to me when I consider hosts.

Edmond · July 2018

My friend's been having issues with CPU and automatically terminated processes. It's not as bad, since the server isn't suspended, but it does get annoying. He hasn't made a ticket yet; the service is one of those BF no-support ones, but I'm glad that the issue's being worked on.

vimalware · July 2018

@MasonR Was this a PyPatrol satellite node?

MasonR · July 2018

@vimalware said:
@MasonR Was this a PyPatrol satellite node?

Negative. That project still needs more work before going live, so I just spawn the process up when I do testing. And it's been awhile since I've done anything with that, unfortunately.

defkev · July 2018

VirMach said: We're discussing this internally

Still doesn't address the elephant in the room:

How do you aggregate the "for multiple hours" stats?

Why the sudden spike in, what seems to be, false-positives on CPU-util?

My own monitoring doesn't show anything (at all) justifying such claims, let alone that I haven't changed anything on the setup in over six months, yet all of a sudden I am supposedly violating your terms twice within 72 hours.

Just to put this into perspective, I probably caused more load in the 30 minutes of running a yum update after powering the box back on (after your system shut it down) than in the last three weeks combined.

Don't get me wrong, the service provided is more than reasonable for the price asked, and it's not in my interest to utilize any more than I paid for. But if you claim that I do, you actually have to provide me with something, anything, so I can address the problem at hand, instead of asking me to buy a dedicated core for $5 per month for a box worth $20 annually.

dragon2611 · July 2018

@MasonR said:

@greattomeetyou said:
When using VPS, how do you aggregate stats for your analysis purposes?

From the provider side or the client side?

I use https://github.com/BotoX/ServerStatus to get live server stats and plop it all on a status page. But you can also use HetrixTools or LibreNMS or something similar if you want to collect and store the usage history.

LibreNMS, or for Linux machines I've recently been playing with Netdata + Prometheus + Grafana (Netdata for the really detailed short-term metrics, with Prometheus polling the Netdata API and Grafana pulling from Prometheus for longer time periods).
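If you just want some history to attach to a ticket without setting up a full monitoring stack, a rough sketch like the one below could work. It assumes a Linux guest with the usual /proc/stat layout; the function names, log path, and interval are made up for illustration.

```python
import time
from typing import Tuple


def parse_cpu_line(stat_text: str) -> Tuple[int, int]:
    """Return (busy_jiffies, total_jiffies) from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            fields = [int(x) for x in line.split()[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
            total = sum(fields[:8])
            return total - idle, total
    raise ValueError("no aggregate cpu line found")


def cpu_percent(before: str, after: str) -> float:
    """CPU usage (0-100% of all cores combined) between two /proc/stat snapshots."""
    busy0, total0 = parse_cpu_line(before)
    busy1, total1 = parse_cpu_line(after)
    delta_total = total1 - total0
    if delta_total <= 0:
        return 0.0
    return 100.0 * (busy1 - busy0) / delta_total


def log_forever(path: str = "cpu_history.log", interval: int = 60) -> None:
    """Append 'unix_timestamp percent' lines to a log file until interrupted."""
    while True:
        with open("/proc/stat") as f:
            before = f.read()
        time.sleep(interval)
        with open("/proc/stat") as f:
            after = f.read()
        with open(path, "a") as log:
            log.write(f"{int(time.time())} {cpu_percent(before, after):.1f}\n")

# To start logging, call log_forever() from a screen/tmux session or a service.
```

Note that this figure is scaled to 0-100% of the whole guest, so multiply by the number of cores before comparing it with a summed number like the one in the abuse notice.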

liara · July 2018

Just throwing my own servers into the hat here. My storage server has been sitting idle for months and was the first to receive such a notice (120% on a single core server). I logged in and the CPU was barely being tickled. I cancelled the service as I don't use it.

Today, I received notice for my second VPS with 8 cores. They claim 958% usage (which is insane) as I'm sitting comfortably within their AUP (using less than half of the CPU). Load is at 3.1 for a 60 minute average and also well within their AUP for allowed maintained load average (70% of your logical cores).
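For anyone who wants to check the numbers, here is a quick sketch of the arithmetic. It assumes the AUP limit is simply 70% of logical cores, as paraphrased above, and that a summed CPU figure tops out at 100% per core; the function names are made up for illustration.

```python
def max_sustained_load(logical_cores: int, fraction: float = 0.70) -> float:
    """Highest sustained load average allowed, as a fraction of logical cores."""
    return logical_cores * fraction


def exceeds_nominal_capacity(summed_percent: float, logical_cores: int) -> bool:
    """True if a summed CPU figure is above 100% per core."""
    return summed_percent > logical_cores * 100


limit = max_sustained_load(8)
print(f"allowed load: {limit:.1f}, observed 3.1 within limit: {3.1 <= limit}")
print("958% above nominal 8-core capacity:", exceeds_nominal_capacity(958, 8))
```

On 8 cores the allowed load works out to 5.6, so 3.1 is well inside it, while 958% is more than the 800% that eight fully busy cores would add up to.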

I've been running this server with this load for months and have never had an issue until today. Something is certainly fishy about this, especially considering the first VPS in question was sitting idle until I shut the machine off.

greattomeetyou · July 2018

liara said: They claim 958% usage (which is insane)

Either you got hacked or they got hacked.

simonindia · July 2018

Add me to the club, I got one too. But in my case I think there is a Python update script or something from Ubuntu running which was generating some load, 40% to be exact. And they were professional about it, so no worries for now.

For a $6/year KVM, a notification is better than outright suspending me, but yes, I find the usage percentage intriguing. Overall I am surprised: after HostUS, these guys are the ones I've had my small boxes with for several years.

"We've noticed that your service has been using lots of processing power, or more specifically 119.8% CPU for multiple hours."

VirMach · July 2018

Firstly, I'd like to more thoroughly address the concerns on how we measure the usage. A lot of people in this thread were wondering how we get numbers above 100% per core and if there were any recent changes to our system. @AnthonySmith mentioned emulation overhead & drivers. @MikeA was concerned about an automated system versus manual checks. There was also concern about updates and re-installations/initial setups. @Kendid mentioned a potential issue related to Windows.

We use VirtIO drivers by default, for both network and disk. We enable CPU passthrough on request, although in this case it would have had no effect. In our testing we reached 220% usage on a 2 core virtual machine, using passthrough and exact matching.
Our system does not subtract the overhead when presenting the numbers, which is why the numbers could be above 100% per core, however, we do take it into account. This means that the system provides a level of leniency above the normal overhead, on top of the numbers permitted by our AUP.
The numbers obtained are similar across every tool (htop, top, atop, virt-top) and we specifically use the output from ps through a popular library used by many reputable technology companies. We measure CPU numbers every second for 6 hours.
Wait time is not calculated. Heavy load on the machine will always increase a variety of utilization percentages, but we try to throw out those numbers as much as possible. Currently, we throw out any number above 120% which includes a normal overhead, but this is not done frequently enough.
We provide a grace period with extreme leniency, as we do know the initial installation, update & setup of a VPS can impact the numbers. An update would not have a major effect on the numbers unless it lasts close to the entire 6 hour measurement period.
We used to manually handle all abuse. However, this is unfeasible due to the number of VMs we now host and our competitive price-points. In addition, a human cannot monitor a virtual server for 6 hours, but a system can and that usually ends up being more accurate & can be more lenient.
We don't believe there was necessarily a sudden spike in reports. We've reviewed the database for warnings and it seems to be fairly consistent for the past 2 months. In the specific case of @MasonR, the load for the 2xE5-2660 server was 12 to 18 during the entire week, but there were some CPU spikes.
We have not made any recent changes to the code that would negatively affect customers, and our load numbers have also highly improved over the months as we have more staff members dedicated to dealing with abuse & load spikes immediately where necessary.
We are aware of a current issue where a hung Windows VPS will max out its CPU usage; same for a booting one. It's also possible for spikes to affect these numbers, and we will be focusing on this for improvements.

There was a potential concern for privacy by @defkev

We need the root password if you do not have logs or any sort of measuring tool in place yourself to compare the internal activity of your virtual server.
We do not track the processes within VMs; we only measure the utilization of the KVM guest itself.

@greattomeetyou @AlyssaD mentioned logs

Logs would always help us in terms of weeding out false positives and improving the system
We do provide logs from our end on request and the average CPU percentage number on the ticket is directly from our logs
We do not install agents or snoop into the KVM machines without permission. This is why we do not have more specific logs

@corbpie @defkev and possibly others had concerns over pricing

We absolutely do not purposely send out warnings to make additional revenue from the high CPU options. We do not make a profit from the high CPU option.
The high CPU option is optional and a last resort for both parties. We do not want customers to leave; we want to alleviate the situation by providing an additional choice if they cannot reduce usage (which is our preference).
We barely recover any costs associated with dedicated 24/7 CPU usage, and while the pricing may be relatively high compared to a service's special/discount pricing, it's the approximate cost to us for a dedicated core.

Finally some miscellaneous comments I'd like to address individually

liara said: My storage server has been sitting idle for months and was the first to receive such a notice (120% on a single core server).

For storage servers, there's a higher possibility of load spikes due to I/O. However, most of the time, the high CPU spike is specifically caused by the same customer's high I/O usage.

For storage servers, we're more lenient on I/O, so it's possible (if you're using a lot of I/O) for the CPU message to be sent first instead of the I/O message. With everything considered this can get messy, so for now, we're going to completely disable the automated CPU system on storage servers. Thanks for the feedback.

Ndha said: You're so lucky. I got their 512 MB VPS as a gift from CM. I installed Debian 8 and Ubuntu 16, and the VM got stuck midway whenever I ran apt on it. I installed CentOS 7 and the same thing happened. When it gets stuck it shuts off automatically, so the gift sits idle, lol.

Edmond said: It's not as bad, since the server isn't suspended, but it does get annoying. He hasn't made a ticket yet; the service is one of those BF no-support ones

@Ndha if this was a while ago, I apologize. We do now have a system in place to ensure people are not flagged for a couple days after their purchase, as described above.

If anyone has a special that has no support, I still encourage you to contact us if you believe there to be a flaw in our system. You will not be billed for support. If the lower-tier agent is being difficult, please escalate it.

MasonR said: Support insisted there is no chance their abuse detection system is wrong

We've spoken with the agent to ensure these get escalated to system administrators when the customer has concerns about the accuracy. You may also escalate a ticket at any time to our in-house staff by clicking the "escalate ticket" button. We absolutely encourage everyone to do this if they're not satisfied with our outsourced support.

We have changed everyone's signature/roles accordingly to better represent each agent. Some outsourced agents will be moved down a tier to represent their possible lower level of expertise & authority.

Syed A. ~~Sr. Sys Admin & Support Agent~~ Sr. Sys Admin (In-House)

Bharat B. ~~Jr Sys Admin & Technical Support Agent~~ Jr. Sys Admin & Tier 3 (In-House)

Miguel V. ~~Sys. Admin~~ Sys Admin & Tier 3 (In-House)

Shawn H. ~~SysAdmin & Technical Support Agent~~ Sys Admin & Tier 3 (In-House)

Aviv M. ~~Technical Support & Billing~~ Tier 2 & Billing (In-House)

Abhijeet A. ~~Tier 1 Outsourced Agent~~ Tier 1 Support (Outsourced)

Vaibhav K. ~~Tier 1 Outsourced Agent~~ Tier 1 Support (Outsourced)

Mike A. ~~Tier 2 Outsourced Agent~~ Tier 1 Support (Outsourced)

Harshad A. ~~Tier 3 Technical Support Agent~~ Tier 2 Support (Outsourced)

Vikas M. ~~Tier 3 Technical Support Agent~~ Tier 2 Support (Outsourced)

Vilas K. ~~Tier 3 Technical Support Agent~~ Tier 2 Support (Outsourced)

Now, let's address some immediate changes we're working on. Please understand that our intentions will always be to accurately and fairly deal with abuse so we can keep prices low for everyone. We will also always be lenient in enforcing our policies, especially if it's an automated system. These false positives are definitely not planned, and we continue to improve the system to reduce them. Although it will still be possible for them to exist, we're pretty proud of where we're at right now with the system compared to how it was initially. We do apologize to anyone negatively affected by a false positive and encourage everyone to contact us, provide logs, and have tickets escalated to system administrators so we can deal with each case properly.

We're changing the period of measurement from 6 to 8 hours. This should allow us to be more lenient to longer periods of bursting.
We're changing the method of measuring that we use. In the past, we used a longstanding thread that simply measured for 6 hours and gave us an average. We will be doing increments of 10 seconds instead, where we filter each number first and throw out bad numbers, and then average what remains. So if there was a load spike on the node in that period, that value will not be used towards the final average.
We're throwing out more "bad" high CPU numbers for VMs by modifying the formula from (cores * 120) to (cores * 100) + 20. This means that, for example, a 4 core system's usable "ceiling" would be cut off at 420% instead of 480%.
We will be taking into account load volatility. We will take no action if there's a major change in load or if the server is overloading. These situations will be dealt with manually.
We will "soften" our messages/warnings and our support will have their eye out for anybody who was mistakenly flagged for abuse. In the message, we'll encourage users to escalate the ticket if they believe it's mistaken.
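For readers trying to picture how the ceiling change and the filter-then-average step listed above might fit together, here is a minimal sketch. It is only an illustration under stated assumptions, not our production code: the function names are hypothetical, and each sample is assumed to be a pair of (summed CPU percent for the VM, whether the node was spiking at the time).

```python
from statistics import mean
from typing import Iterable, Optional, Tuple


def old_ceiling(cores: int) -> int:
    """Previous cutoff: 120% per core."""
    return cores * 120


def new_ceiling(cores: int) -> int:
    """New cutoff: 100% per core plus a flat 20% allowance."""
    return cores * 100 + 20


def filtered_average(samples: Iterable[Tuple[float, bool]], cores: int) -> Optional[float]:
    """Average the CPU samples that survive filtering.

    A sample is dropped if it is above the new ceiling or if the node
    was spiking when it was taken. Returns None if nothing survives.
    """
    ceiling = new_ceiling(cores)
    kept = [cpu for cpu, node_spiking in samples
            if not node_spiking and cpu <= ceiling]
    return mean(kept) if kept else None


# Hypothetical 10-second samples for a 4-core VM
samples = [(50.0, False), (450.0, False), (70.0, False), (300.0, True)]
print(old_ceiling(4), new_ceiling(4))  # 480 420
print(filtered_average(samples, 4))    # 60.0
```

In this example the 450% sample is dropped for being above the 420% ceiling and the 300% sample is dropped because the node was spiking, so only the 50% and 70% samples count towards the average.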

Thanks to everyone for your input and helping us improve VirMach.

tl;dr: existing customers, use coupon code TLDRSORRY for 50% off our monthly SSD1G package.

corbpie · July 2018

VirMach said: tl;dr: existing customers, use coupon code TLDRSORRY for 50% off our monthly SSD1G package.

Will we also get 50% CPU usage on it

Foul · July 2018

VirMach said: For storage servers, we're more lenient on I/O, so it's possible (if you're using a lot of I/O) for the CPU message to be sent first instead of the I/O message. With everything considered this can get messy, so for now, we're going to completely disable the automated CPU system on storage servers. Thanks for the feedback.

Thanks! This is why i'm going to keep my storage server now!

Harambe · July 2018

@VirMach said: Thanks to everyone for your input and helping us improve VirMach.

Appreciate that you took the feedback to heart and are implementing changes to fix false positives. Rare sight around these parts...

FAT32 · July 2018

@VirMach said:

Thanks to everyone for your input and helping us improve VirMach.

The response in this thread is enough to prove how reliable and stable VirMach's service is. No random shit, just pure facts and solutions. Kudos to VirMach for the transparency.
