The IncogNET thread - Discussion, news and updates.

forest · February 20

@MannDude said: 100% CPU usage for a long period and 150%+ BW usage, though that auto-throttles but still isn't "unlimited".

I'm showing only 50% CPU (60% 10 minute peak) usage over the last week of operation on my side, but I think I know the issue: The RAM is so low that it's constantly swapping at 10 MiB/s (30 MiB/s 10 minute peak). All that I/O is probably causing severe hypervisor overhead through all those vmexits that is not being accounted for within the guest.
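For anyone wanting to check their own swap rate the same way, here is a minimal sketch (not necessarily how the figures above were measured). It assumes a Linux guest where `/proc/vmstat` exposes the `pswpin` and `pswpout` page counters; the function names and the 4096-byte default page size are illustrative.

```python
import os


def parse_vmstat(text):
    """Return {counter_name: int_value} from /proc/vmstat-style text."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rate_mib_s(before, after, interval_s, page_size=4096):
    """Combined swap-in plus swap-out rate in MiB/s between two samples."""
    pages_in = after["pswpin"] - before["pswpin"]
    pages_out = after["pswpout"] - before["pswpout"]
    return (pages_in + pages_out) * page_size / interval_s / (1024 * 1024)


def read_vmstat(path="/proc/vmstat"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_vmstat(f.read())


def system_page_size():
    """Page size reported by the OS, used to turn pages into bytes."""
    return os.sysconf("SC_PAGE_SIZE")
```

To use it, call `read_vmstat()` twice a known number of seconds apart and pass both results, the interval, and `system_page_size()` to `swap_rate_mib_s`.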

Alternatively, maybe just hard-cap my network at 100 Mbps and set an IOPS limit on my VM to reduce swapping? That would reduce VM context switches and thus hypervisor overhead.

In the meantime, I'll configure Tor's bandwidth limit to not exceed 100 Mbps when it is back up.
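As a rough sketch of what a 100 Mbps cap works out to, the snippet below converts the rate to bytes per second and formats it as `BandwidthRate` / `BandwidthBurst` lines for torrc. The helper names and the choice of setting the burst equal to the rate are assumptions for illustration, not the actual configuration used here. Writing the value in explicit bytes avoids any ambiguity about unit suffixes.

```python
def mbps_to_bytes_per_second(mbps):
    """Convert megabits per second (10**6 bits) to bytes per second."""
    return int(mbps * 1_000_000 / 8)


def torrc_bandwidth_lines(mbps, burst_factor=1.0):
    """Format torrc lines capping Tor at the given rate (burst >= rate)."""
    rate = mbps_to_bytes_per_second(mbps)
    burst = int(rate * burst_factor)
    return [f"BandwidthRate {rate} bytes", f"BandwidthBurst {burst} bytes"]


def monthly_transfer_tb(mbps, days=30):
    """Decimal terabytes moved in one direction if the rate is sustained."""
    return mbps_to_bytes_per_second(mbps) * days * 86_400 / 1e12
```

At 100 Mbps this gives 12,500,000 bytes per second, or about 32.4 TB per direction over a 30-day month if the relay ran flat out, which shows how quickly a relay can eat through a monthly quota.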

@MannDude said: Only POP I've seen this in has been SE.

Could you move the server to BG? I'd be fine with that to help with load balancing.

forest · February 20

To limit memory usage and thus all the I/O causing hypervisor overhead, I've taken the following steps:

Installed Alpine Linux instead of Debian to significantly reduce base memory usage
Preloaded jemalloc for Tor (overriding the libc's default malloc), reducing memory fragmentation somewhat
Strictly limited Tor's MaxMemInQueues to further reduce RSS
Ensured all services run lightweight BusyBox variants
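One way to check whether steps like these actually lower Tor's resident memory is to read `VmRSS` from `/proc/<pid>/status` before and after each change. This is a minimal sketch assuming a Linux guest; the function names are illustrative and the PID has to be looked up separately.

```python
def rss_kib(status_text):
    """Return the VmRSS value in KiB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the resident set size of a running process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return rss_kib(f.read())


def rss_saving_percent(before_kib, after_kib):
    """Percentage reduction in RSS between two measurements."""
    return 100.0 * (before_kib - after_kib) / before_kib
```

Comparing a reading taken after a few hours of uptime under each configuration gives a fairer picture than a reading right after startup, since relay memory use grows as circuits build up.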

Since it didn't seem to be enough, would you be open to allowing me to buy more RAM? It would benefit us both, and the extra memory means I can enable zswap to further reduce I/O without worrying about the extra slab pressure it causes. It increases guest CPU usage slightly (due to compression/decompression overhead), but the overall CPU usage as reported by the hypervisor would surely fall (due to fewer vmexits).

MannDude · February 20

@forest said:

@MannDude said: 100% CPU usage for a long period and 150%+ BW usage, though that auto-throttles but still isn't "unlimited".

I'm showing only 50% CPU (60% 10 minute peak) usage over the last week of operation on my side, but I think I know the issue: The RAM is so low that it's constantly swapping at 10 MiB/s (30 MiB/s 10 minute peak). All that I/O is probably causing severe hypervisor overhead through vmexit/vmenter that is not being accounted for within the guest.

Alternatively, maybe just hard-cap my network at 100 Mbps and set an IOPS limit on my VM to reduce swapping? That would reduce VM context switches and thus hypervisor overhead.

In the meantime, I'll configure Tor's bandwidth limit to not exceed 100 Mbps when it is back up.

@MannDude said: Only POP I've seen this in has been SE.

Could you move the server to BG? I'd be fine with that to help with load balancing.

I hurried back to my desk to review some things real quick.

It's back online though I did temp-cap the CPU to 65% - I don't have any way to automate this but I do have written notes on my desk and was planning on un-capping tonight (SEA time) manually again.

Main concern here is that under normal usage, this hypervisor lingers around 25-35% CPU usage. I capped it weeks ago to prevent new VM creations as well. 48 cores, lots of room for individual VMs to burst to full usage for extended periods of time like we have in all of our other POPs and hypervisors. From our side, we've acted as we've always done in terms of hypervisor setup and capping of new VM creation after certain thresholds are met so as to maintain a quality service.

Last night was trying to settle down for the evening and just had the alerts going off again. Quickest way to restore service for everyone was to quick-cap the top offenders.

I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

I started digging into the VirtFusion documentation last night as well and trying to see if there is a better way to get notified of potential CPU abuse or issues from individual VMs via webhooks. I did find this as well, https://github.com/noxitylabs/virtfusion-cpu-abuse-detector .
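The detection logic behind this kind of alerting can be sketched independently of any panel. The example below is hypothetical: it does not use the VirtFusion API or the linked detector, and the sample format, threshold, and function name are all assumptions. It flags VMs whose CPU usage stays at or above a threshold for several consecutive samples, so short bursts are not reported.

```python
def sustained_offenders(samples, threshold=90.0, min_consecutive=3):
    """Return VM names whose CPU stayed >= threshold for a long enough run.

    samples: {vm_name: [cpu_percent, ...]} ordered oldest to newest.
    """
    flagged = []
    for vm, series in samples.items():
        run = longest = 0
        for value in series:
            # Extend the current run, or reset it on a sample below threshold.
            run = run + 1 if value >= threshold else 0
            longest = max(longest, run)
        if longest >= min_consecutive:
            flagged.append(vm)
    return sorted(flagged)
```

Requiring a run of consecutive samples rather than a single high reading is what separates a VM bursting briefly from one pinned at full usage for a long period.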

MannDude · February 20

@forest said:
To limit memory usage and thus all the I/O causing hypervisor overhead, I've taken the following steps:

Installed Alpine Linux instead of Debian to significantly reduce base memory usage

Preloaded jemalloc for Tor (overriding the libc's default malloc), reducing memory fragmentation somewhat

Strictly limited Tor's MaxMemInQueues to further reduce RSS

Ensured all services run lightweight BusyBox variants

Since it didn't seem to be enough, would you be open to allowing me to buy more RAM? It would benefit us both, and the extra memory means I can enable zswap to further reduce I/O without worrying about the extra slab pressure it causes. It increases guest CPU usage slightly (due to compression/decompression overhead), but the overall CPU usage as reported by the hypervisor would surely fall (due to fewer vmexits).

I'll just toss you more, no charge

Reboot for a surprise. I think that should help in your case.

forest · February 20

@MannDude said: I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

That makes sense. I promise I'm not misusing your services, of course. No port scanning, no crypto mining. Nothing like that. Just a middle relay to help promote privacy and freedom.

@MannDude said: You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

If the Bulgaria hypervisor has less load and thus would be able to better tolerate a Tor relay, then let's move it there. Otherwise let's keep it where it is.

@MannDude said: I'll just toss you more, no charge

Thank you! I'll go log in and reboot it now!

ServerBachelor · February 20

@MannDude said:

I hurried back to my desk to review some things real quick.

It's back online though I did temp-cap the CPU to 65% - I don't have any way to automate this but I do have written notes on my desk and was planning on un-capping tonight (SEA time) manually again.

Main concern here is that under normal usage, this hypervisor lingers around 25-35% CPU usage. I capped it weeks ago to prevent new VM creations as well. 48 cores, lots of room for individual VMs to burst to full usage for extended periods of time like we have in all of our other POPs and hypervisors. From our side, we've acted as we've always done in terms of hypervisor setup and capping of new VM creation after certain thresholds are met so as to maintain a quality service.

Last night was trying to settle down for the evening and just had the alerts going off again. Quickest way to restore service for everyone was to quick-cap the top offenders.

I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

I started digging into the VirtFusion documentation last night as well and trying to see if there is a better way to get notified of potential CPU abuse or issues from individual VMs via webhooks. I did find this as well, https://github.com/noxitylabs/virtfusion-cpu-abuse-detector .

Likewise, I have no issue leaving my VM offline until everything is resolved. But I am curious to know if there’s been any update re. what happened, given that I was barely using it and yet resource usage appeared hugely disproportionate.

ServerBachelor · February 20

@ServerBachelor said:

Likewise, I have no issue leaving my VM offline until everything is resolved. But I am curious to know if there’s been any update re. what happened, given that I was barely using it and yet resource usage appeared hugely disproportionate.

MannDude has already received the details via my ticket, but just to report publicly, I believe that I've taken sufficient measures to safeguard against future issue-causing processes of the same type.

zed · February 20

Does anyone know if Sweden is stable or are we waiting to see?

https://portal.incognet.io/serverstatus.php is still blank, maybe it's not the correct url.

JohnFilch123 · February 20

My SE VPS had a brief network outage (~5 mins) maybe 3 hours ago, but overall it looks pretty stable since the incident.

zed · February 21

well here comes the explosion again.

Radi · February 21

I bought a few 512 MB services and one 2 GB service from the lifetime promo for fun, and have configured most of them with the ideas I had in mind. I just wish I had bought a few more 512s in the other locations (for VPNs).

So far very happy with my Incognet experience, thanks @MannDude .

forest · February 23

@MannDude I'm getting about 80% CPU steal on the Sweden node, just fyi. Previously, it's always been well under 1%.

ServerBachelor · February 23

@forest said:
@MannDude I'm getting about 80% CPU steal on the Sweden node, just fyi. Previously, it's always been well under 1%.

Similar issue; %Cpu(s): 22.9 st for me at some points, sometimes higher (spikes up to 72%)
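For comparing steal numbers between VMs without eyeballing `top`, here is a minimal sketch that computes steal as a percentage of all CPU time between two readings of the aggregate `cpu` line in `/proc/stat`. It assumes a Linux guest; the function names are illustrative.

```python
def parse_cpu_line(stat_text):
    """Return the first eight counters of the aggregate 'cpu' line.

    Order: user, nice, system, idle, iowait, irq, softirq, steal.
    The guest fields are skipped because they are already included in user.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(x) for x in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before, after):
    """Steal time as a percentage of total CPU time between two samples."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[7] / total


def read_cpu_counters(path="/proc/stat"):
    """Read the live counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.read())
```

Sampling `read_cpu_counters()` a few seconds apart and passing both results to `steal_percent` should give a figure comparable to the `st` column in `top`.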

zed · February 23

just ban whoever got a vm a week ago when this shit started thx.

ServerBachelor · February 24

@MannDude I have 2 legacy VMs which cannot be controlled via the enduser panel (they are listed as "offline" on control.incogvps.com despite my being able to still use the stuff I've installed on them).

I was hoping I'd be able to rebuild the VPS (fresh Debian 12 installs on both), I assume I can't because of the migration to VirtFusion?

Nothing urgent, just wondering.

iriska · February 24

@ServerBachelor

Announcement Link: "Legacy VMs - Temporary disabled control panel access"

ServerBachelor · February 24

@iriska said:
@ServerBachelor

Announcement Link: "Legacy VMs - Temporary disabled control panel access"

Thanks for this.

@MannDude I made ticket #0224D59E9 to review at your convenience.

JohnFilch123 · February 24

@forest said: I'm getting about 80% CPU steal

@ServerBachelor said: %Cpu(s): 22.9

Mine had a bit of steal but quickly went back to normal

forest · February 24

@JohnFilch123 said:

@forest said: I'm getting about 80% CPU steal

@ServerBachelor said: %Cpu(s): 22.9

Mine had a bit of steal but quickly went back to normal

Mine is still pretty bad.

Rsfk · February 26

Hey @MannDude

My server is still suspended despite payment and a ticket submitted yesterday. It’s production critical. Could you please review when available? Thank you.

Ticket Number: #0225A32I7

ServerBachelor · February 26

I cannot SSH into either of my VMs in the Sweden location, despite them being marked as active in WHMCS and running in VirtFusion, so I can't get exact numbers on CPU steal or other metrics.

Last time I was able to check it, my.vmho.st said 9.8% of CPU was being used on one server, and 1.9% on the other.

This problem does not affect servers in any of the other locations.

ServerBachelor · February 26

I was able to SSH in again. CPU steal hovers around ~28% and spiked up to 88% while I was watching.

JohnFilch123 · February 26

Ya steal is crazy today.

forest · February 27

@ServerBachelor said: I cannot SSH into either of my VMs in the Sweden location, despite them being marked as active in WHMCS and running in VirtFusion, so I can't get exact numbers on CPU steal or other metrics.

It's around 20-30% right now. The inability to SSH might be periodic network downtime? This is a graph of CPU usage in the last 24 hours. The gaps represent times when my remote monitoring server was unable to reach it (network downtime):

[Graph: CPU usage over the last 24 hours, with gaps where monitoring could not reach the VM]
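Gaps like the ones in that graph can also be pulled out of the raw monitoring data. This is a minimal sketch under assumed inputs: a sorted list of sample timestamps in seconds and a known polling interval. The function name and the 1.5x tolerance are illustrative and not taken from any particular monitoring tool.

```python
def find_gaps(timestamps, expected_interval_s, tolerance=1.5):
    """Return (last_seen, next_seen) pairs where samples are missing.

    timestamps: sorted sample times in seconds.
    A gap is any spacing larger than expected_interval_s * tolerance.
    """
    gaps = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > expected_interval_s * tolerance:
            gaps.append((prev, cur))
    return gaps


def total_downtime_s(gaps, expected_interval_s):
    """Approximate unreachable time, excluding one normal interval per gap."""
    return sum((end - start) - expected_interval_s for start, end in gaps)
```

With one-minute polling, `find_gaps([0, 60, 120, 480, 540], 60)` reports a single gap between 120 and 480, which `total_downtime_s` counts as roughly 300 seconds of unreachable time.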

If it goes down again and I notice it, I'll connect via VNC (since SSH won't work of course) and see if I can troubleshoot.

ServerBachelor · February 28

@ServerBachelor said:

@iriska said:
@ServerBachelor

Announcement Link: "Legacy VMs - Temporary disabled control panel access"

Thanks for this.

@MannDude I made ticket #0224D59E9 to review at your convenience.

This was resolved.

MatthewM · February 28

BG has been up and down all afternoon, with the current outage at over an hour straight.

ServerBachelor · February 28

@MatthewM said:
BG has been up and down all afternoon, with the current outage at over an hour straight.

I can't SSH into my VM in Sofia, either.

No issues in Stockholm, other than that CPU steal is still spiking up to 30%.

oloke · February 28

ServerBachelor · February 28

@oloke said:

Bruh

ServerBachelor · February 28

@ServerBachelor said:

@MannDude said:
Correct, had a hypervisor issue in Sweden. Please also tag me (@MannDude) on LET if requesting LET updates since it'll go to an inbox that gives me a phone notification as well.

If you already have a ticket open about it, I'll be adding some credit for the issue. If you don't, please open a ticket and I'll make sure you're credited.

#0217Y55Y7 opened

Other updates while I'm here:

Singapore in the future? 👀 (At least for some limited offerings... VPN, DNS, Shared Hosting at minimum)

Hardware up in WA to begin legacy VPS migrations from Virtualizor to VirtFusion. ETA of completion... Probably a month, for just that one POP. Then will try to get PA done after that.

Hardware on order for NL legacy VPS migrations from Virtualizor to VirtFusion...

This just leaves KC as the only other POP with legacy VMs. We're going to do something a bit different for this migration but want to complete the other POPs first.

👀

Same ol' same ol'. Busy busy busy.

Btw @MannDude any update on ticket #0217Y55Y7?

Re. extra RAM as compensation for Stockholm issues. I understand the delay if the issue is still ongoing.
