Steal time CPU
What is an acceptable level of steal time for a VM? I have Netdata installed and it is raising a warning because steal is over 16%.

Comments
Hi,
none?
Steal time means that you pay for X but you don't get X -- you get X - %steal...
Like buying a car with 100 horsepower but can only use 84...
I would say under 6%
Is it affecting your use case or is it just the alert?
Steal time is the percentage of time where your VPS is ready to run something on the CPU, but it's stuck waiting for the CPU to be available. 16% is pretty high and may mean the server is oversold too much (for example, the physical server has 32 cores, but 100+ VMs are running and everyone is trying to use the CPU). If it's persistent, you could ask the host to migrate your VPS to a different node and see if it helps.
For regular usage where the system isn't using much CPU (say less than 25% total), I usually aim for no more than 10% CPU steal on cheap VPSes that have "fair use" or "best effort" CPU, and no more than 3% for VPSes with dedicated CPU (like what HostHatch offers). My VPSes with HostHatch and GreenCloudVPS all have less than 2% steal.
That's the case only if you have dedicated CPU. If you don't, you're paying for some arbitrary percentage of the CPU that goes up and down based on how much CPU other people are using, and that's what you're getting
Hi,
no, not really.
You will only have steal time IF your VM wants to do something on the CPU at that moment and cannot, because the CPU is busy with something else.
So you are waiting for CPU time.
If the host system is not overbooked, you will not see steal time. This has nothing to do with dedicated CPUs.
The only question is whether the host system has enough CPU time to serve all guests when they need it. If yes, then it's yes, no matter if the CPUs are dedicated or shared.
In fact, if you have dedicated CPUs and your favorite provider oversold the CPU capacity to the point where your VM needs to wait for the CPU, then you will still have steal time.
I guess someone forgot to tell @DeluxHost about this because I recently opened a ticket asking if it was normal to have an average of 22% steal with peaks hitting 43%, and their support replied, "This usage is normal for a VPS having shared resources."
I mean, I wasn’t expecting much from a service that costs $0.70/month, but hey... I guess you get what you pay for, right? 😂
Generally this is not the case, and is why VPS hosts have "fair use" policies. They don't have enough CPU power to handle every VPS using 100% of their allocated CPU.
Hosts that have this capacity usually advertise it as such. That's what "guaranteed" or "dedicated" CPU usually means - the VPS can use some amount of CPU power 24/7 and it won't be an issue. Most of the hosts with guaranteed CPU aren't actually using CPU affinity / pinning, so technically it's not dedicated.
Hi,
you are somehow mixing technical and sales stuff.
Just because you bought a VM with 4 cores does not mean that you will always use all 4 cores.
The host uses this difference to sell more VMs on the "same" cores. As a result each single VM is cheaper. So it's somehow a win-win.
But this win-win only works if the host is not overdoing it, like having 20 physical cores but selling 80.
Again: technically, steal appears when the sum of the VMs' CPU requests is bigger than what the hardware can physically serve.
It does not matter if you bought a server with fair use, a VDS or whatever.
Technically, if a VM wants CPU time but does not get it, you have steal. Plain and simple.
The question is how such a situation happens. The usual primary reason is that the host overdid it with the overselling.
So the host created a situation where at some point in time the sum of CPU time the VMs request is bigger than what the hardware can serve. That's it.
And as a result:
IF you have CPU steal, THEN you should contact your provider, because THEN they are delivering less than you paid for, and they have to fix that.
It's no big drama anyway. Such situations happen sometimes. Especially with (ultra) low-cost offers, the host is forced to pack the hardware like hell; otherwise it's economic suicide. And sometimes customers' usage changes and they suddenly use more than before, and if the host did not leave enough reserve, you will have steal time.
@gowrann simply contact your provider, take a screenshot of the steal time you see and ask them to solve it.
Or you accept that you paid for CPU power X but actually get less. That's of course your choice.
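To put rough numbers on the overselling example above (20 physical cores, 80 sold), here is a minimal sketch. The function name and the "everything runnable gets an even share" assumption are mine, not how any particular hypervisor scheduler is documented to behave, so treat the output as an order-of-magnitude illustration only.

```python
def expected_steal_percent(physical_cores: float, demanded_cores: float) -> float:
    """Rough steal estimate when all VMs together want `demanded_cores` of CPU.

    Assumption (not from the thread): the host shares its cores evenly among
    everything that is runnable. Real schedulers are more complicated.
    """
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    if demanded_cores <= physical_cores:
        return 0.0  # enough CPU for everyone, nobody has to wait
    # Each VM receives physical/demanded of what it asked for; the rest is waiting.
    return 100.0 * (1.0 - physical_cores / demanded_cores)


# 20 physical cores with 80 sold: steal depends on how many are busy at once.
for busy_cores in (10, 20, 25, 40, 80):
    print(busy_cores, round(expected_steal_percent(20, busy_cores), 1))
```

Under this simplified model, steal stays at zero until the combined demand exceeds the physical cores, which matches the point made above that selling more cores than exist is only a problem once the VMs actually ask for them at the same time.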
Zero.
If you see STEAL, it means the node is totally oversold
Sadly I have seen this happen on Hivelocity VPSes every time we tried them
Also with Vultr
They just don't care, they just oversell.
I usually look for max 1% steal while basically idle, and max 4% while using all cores, but it's common that the steal is (way) higher when one writes much data to the disk, but I'm not really sure why, if anyone knows that I would be really interested.
How much percent steal would you consider acceptable/irrelevant? And do you happen to know why the steal is often (at different providers, also with the servers I have from you) very high during for example disk benchmarks, but way lower when the cores are used at 100%? (This isn't a complaint or something, I'm just curious because I have made that observation at multiple different providers.)
How do you find out CPU steal? Or is this something only VPS providers can reveal? I've checked top but it always seems to be 0 so I don't know if it's just that I'm not really doing anything intense.
If you don't see it in top, it could mean that there is either simply no steal, or that the provider has disabled steal time reporting. Generally the provider could theoretically just fake the steal values.
I recommend measuring steal over a longer period of time, if it is always exactly 0, the provider most likely doesn't report it to the VPS.
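One way to measure steal over a longer period is to sample the aggregate "cpu" line of /proc/stat at intervals and look at the change between samples. This is a minimal sketch, not the method top or Netdata use internally; the function names are made up for illustration, and it assumes a Linux guest whose kernel exposes the steal column.

```python
import time
from typing import Iterator, List


def parse_cpu_line(stat_text: str) -> List[int]:
    """Return the counters of the aggregate "cpu" line from /proc/stat content."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before: List[int], after: List[int]) -> float:
    """Steal as a percentage of all CPU time that passed between two samples."""
    if len(before) < 8 or len(after) < 8:
        raise ValueError("kernel does not expose a steal column")
    # Columns: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user/nice, so only the first 8 are summed.
    deltas = [new - old for old, new in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total > 0 else 0.0


def watch_steal(interval_s: float = 60.0, samples: int = 60,
                path: str = "/proc/stat") -> Iterator[float]:
    """Yield one steal percentage per interval, read from `path`."""
    with open(path) as f:
        previous = parse_cpu_line(f.read())
    for _ in range(samples):
        time.sleep(interval_s)
        with open(path) as f:
            current = parse_cpu_line(f.read())
        yield steal_percent(previous, current)
        previous = current
```

Inside the VPS, looping over `watch_steal()` and printing each value gives one reading per minute for an hour with the defaults. If every reading is exactly 0 even under load, that would be consistent with the suspicion above that steal is not being reported to the guest.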
False!
Even if I create a VPS on a server with no other VPS on it, and the hypervisor has 5% CPU load, there will be CPU steal. It might just be 0.1-2% though. It's normal. It does not cause performance issues unless it's very high.
Even on an AWS dedicated CPU VPS it spikes to 2%.
I recommend spinning up a Proxmox instance on a dedicated server and seeing for yourself
Not necessarily. Sometimes we have abusive clients, who hammer the CPU over prolonged periods of time - 6 cores being used in a dedicated way for hours. To avoid suspending them, we throttle them to avoid having an impact node wide while we work with the client, and within their VPS this shows as steal (due to the throttle). Sometimes, it has nothing to do with overselling
Hi,
I might potentially be digging my own grave now, but:
For example:
This here is a 7days2die server that I run privately on one of our High CPU (AMD Epyc) clusters. So that's 1:1 the same as what our customers receive from us.
While the 7days server is not running, I am doing some yabs tests:
In idle before yabs testing we have plain:
Sometimes "peak" to 0.1 st
Then during yabs:
fio:
networking:
geekbench:
(Is geekbench just running on one CPU core? o_O) I thought I would kill the CPU with that...
So at the time of running this, the host machine looks like:
My interpretation of the numbers is:
I assume that's because there is enough CPU power. The VM is idle, the host machine is 50% idle. So there is no point where steal time should come from.
I assume that's because we have a natural delay here between "I want CPU" and the host machine's scheduler "giving me CPU".
My assumption / theory:
1)
With tests like network and disk (iperf3 and fio) we have the situation that they try to get the maximum out of the hardware, but something limits that (either a software limit, like the IO limits we implement, or the hardware itself, "it's just not going faster", even while iperf3 tries to push it).
I can imagine that this "there is no more for you" is silently (also) interpreted as steal time, since I want more but don't get it. So from the perspective of the VM's CPU / OS scheduler, it tries, but just does not get more and is somehow slowed down.
2)
There is this natural delay I assume exists between requesting CPU time and being served CPU time. The busier the VM becomes (or a physical server, it should actually apply there too), the more frequently it asks for CPU time. This way the rate of requests rises and, thanks to the natural delay, the delays mathematically stack up (even if there is no real shortage, as there are enough resources). Just this organizational delay between the layers will always cause some minor steal.
So all in all, based on the numbers i saw now with the yabs:
Stealtime -- during benchmark!! -- of < 1% is to be considered normal.
Otherwise it should be very close to 0.1...
But: Disclaimer: I have only roughly read up on this topic. It's all pure theory, based on experience and what I see. No real knowledge... this could be answered better/more reliably by a Linux kernel dev...
Really, are you sure about that in the context of KVM? With containers I can perfectly imagine that, but with KVM... I wonder if such fine-grained control is possible via cgroups....
Yes, it is possible to hide or fake steal time. The value is provided by the host's kernel to the guest VM. I remember someone on LET already dug into the Linux kernel source code in another thread
If you get such response, ask for refund and move on.
Even for 0.70$, 43% cpu steal is nuts.
we have seen higher Steal
Yes, there are people in this fucking universe that sell you a VPS on an E5, and not the v4 where you can reach nearly 1k GB6.
Naaaaah, they sell you that VPS on the old slow E5.
With 500 GB6 tops.
And they fucking decide to put a CPU cap on that.
For these people, there's a special place in hell.
The YABS scored roughly 120 GB6.
And they admitted they throttled it. You know what I paid for this shit?
2€/m for that performance.
Do you know what I get for that money these days?
It isn't much but still and no I wasn't cpu mining.
I was using that box as a wg vpn for gaming, low bandwidth, low cpu usage, still got that fkn cap.
@layer7 Thanks for your measurements, quite interesting. Your assumptions also seem to be quite logical. Does the VM where you tested it use local or network storage?
If one wants to simply disable steal time reporting/accounting, one simply needs to boot with the linux kernel parameter no-steal-acc. If one wants to fake the steal time (for example display lower than actual), it's also not that difficult, but one needs to compile an own kernel and edit the function record_steal_time.
While the first can be detected rather easily from within the VPS, the latter is not really unambiguously detectable I think, but both are unfair to the customer IMHO.
But what I find interesting is the following (this wasn't done on a VPS from you, but I did it there as well and it was the same, just not as extreme):
Idle:
Seems to be quite normal, no steal, nearly no cpu usage.
stress -c 4 (simply loading all cores to 100% by calculating square roots):
Still nearly no steal, so the hypervisor most likely has enough unused cores to handle the load.
fio (the interesting thing):
Not much cpu usage, but seemingly a lot of steal!
This VPS uses ceph storage according to the provider, and the two Xeon VPSes from you where the same thing occurs are also using network storage, but interestingly, on the Threadripper VPS from you which (I think) uses local storage, this high-steal thing doesn't occur.
My theory regarding this is that as qemu is put into uninterruptible sleep by the linux scheduler (tested it by creating a VM, fsfreezing the root filesystem of the hypervisor and then the qemu process was in the state Dl+, which is uninterruptible sleep according to man ps) while it waits for the IO request to complete, and as the latency is higher when using network storage, the steal time increases as it is calculated by subtracting current->sched_info.run_delay (the time spent waiting on a runqueue, so basically time the process was blocked I think) from the previous value. (But I could also be completely wrong.)
Maybe someone who knows a bit more about the internals of linux or qemu can say something about this.
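The accumulation described above can be mimicked with a toy model. This is not kernel code: the class and method names are invented for illustration, and only the idea (steal grows by however much run_delay grew since the last update) follows the post. Whether IO wait actually ends up in run_delay is the open question and is not something this sketch answers.

```python
class StealAccounting:
    """Toy model: steal grows by the increase in run_delay since the last update."""

    def __init__(self) -> None:
        self.steal_ns = 0        # accumulated steal the guest would be shown
        self.last_run_delay = 0  # run_delay value seen at the previous update

    def record(self, run_delay_ns: int) -> int:
        # run_delay is cumulative, so only the difference to the last value is new.
        self.steal_ns += run_delay_ns - self.last_run_delay
        self.last_run_delay = run_delay_ns
        return self.steal_ns


accounting = StealAccounting()
for run_delay in (1_000_000, 1_500_000, 1_500_000, 4_000_000):
    print(run_delay, accounting.record(run_delay))
```

In this model any time that lands in run_delay, for whatever reason, shows up as steal in the guest, which is why the theory above would explain higher steal with slower storage if (and only if) the waiting really is counted there.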
Better to actually look at /proc/stat (that will show you total steal since system boot)
Cpu steal is common with budget providers. Get a dedicated if you want dedicated resources
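For the /proc/stat suggestion above, here is a small sketch that turns the since-boot counters into a steal percentage per CPU line. The function name is hypothetical and the sample text contains made-up numbers purely for demonstration; on a real VPS you would pass in the content of /proc/stat instead.

```python
from typing import Dict


def steal_since_boot(stat_text: str) -> Dict[str, float]:
    """Map each cpu line of /proc/stat to its steal share (percent) since boot."""
    result: Dict[str, float] = {}
    for line in stat_text.splitlines():
        fields = line.split()
        if not fields or not fields[0].startswith("cpu"):
            continue  # skip intr, ctxt, btime and other non-cpu lines
        counters = [int(value) for value in fields[1:9]]  # user .. steal
        if len(counters) < 8:
            continue  # very old kernel without a steal column
        total = sum(counters)
        result[fields[0]] = 100.0 * counters[7] / total if total else 0.0
    return result


# Made-up sample in /proc/stat layout, for illustration only.
SAMPLE = """cpu  4000 0 1000 14000 500 0 100 400 0 0
cpu0 2000 0 500 7000 250 0 50 200 0 0
cpu1 2000 0 500 7000 250 0 50 200 0 0
intr 12345
"""

for name, percent in steal_since_boot(SAMPLE).items():
    print(name, round(percent, 2))
```

Because these counters only ever grow, the result is an average over the whole uptime. A short burst of heavy steal will be diluted by long idle stretches, so this complements rather than replaces watching the live value.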
If you buy shit yes, usually even on decent budget providers you see next to none.
Hi,
that's High CPU in the FRA1 location. So that's a cluster setup with network storage.
And with ceph in general, depending on the provider's setup (hyperconverged?!?!), the "organizational delay / work" I was talking about in my post is (a lot) higher compared to simple GlusterFS, NFS, ... network storage (and local storage anyway).
So maybe this extra work and delay through the layers to make ceph actually give you IO will result in extra steal time.
Pure theory.... but in fact (and we have worked with ceph a lot, especially in the past), these super cool features (dynamic cluster stretching and reduction, self-healing, HA, object storage, crushmaps, ....) come at a very high price in general delay, in addition to the usual overhead that comes with network storage.
So if the price for such nice features has to be paid by the host the VM resides on (typical Proxmox setup), then I can perfectly imagine the steal time rising.
Again... theory.... but you have to admit, it's a nicely fitting story... hopefully not too much of a fairy tale
Less than 1% is ok.
Acceptable levels are below 5%. Ideally it should be below 1% or completely zero (if the server is not oversold)
Deluxe Oversell
kernel.sched_autogroup_enabled = 0 on hypervisor and now inside vps:
Who would have thought a simple setting would remove all steal on my idle hypervisor. Not that 0.2 mattered to begin with, but xd
I think he runs ceph on the same server as the compute. So when you blast the storage, the cpu usage goes up a ton. writing at 1 GB/s probably uses 400-600% cpu on the server itself.
Is the Calculation Method Correct?