
The IncogNET thread - Discussion, news and updates.


Comments

  • forestforest Member
    edited February 20

    @MannDude said: 100% CPU usage for a long period and 150%+ BW usage, though that auto-throttles but still isn't "unlimited".

    I'm showing only 50% CPU (60% 10 minute peak) usage over the last week of operation on my side, but I think I know the issue: The RAM is so low that it's constantly swapping at 10 MiB/s (30 MiB/s 10 minute peak). All that I/O is probably causing severe hypervisor overhead through all those vmexits that is not being accounted for within the guest.

    Alternatively, maybe just hard-cap my network at 100 Mbps and set an IOPS limit on my VM to reduce swapping? That would reduce VM context switches and thus hypervisor overhead.

    In the meantime, I'll configure Tor's bandwidth limit to not exceed 100 Mbps when it is back up.

    @MannDude said: Only POP I've seen this in has been SE.

    Could you move the server to BG? I'd be fine with that to help with load balancing.
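
For reference, the 100 Mbps cap mentioned above can be turned into a Tor setting with a little arithmetic. This is only a rough sketch: the helper names are made up for illustration, and it assumes decimal megabits and a rate written in plain bytes for Tor's `BandwidthRate` option, so that no unit suffix has to be interpreted.

```python
def mbps_to_bytes_per_sec(mbps: int) -> int:
    """Convert decimal megabits per second (10^6 bits) to bytes per second."""
    return mbps * 1_000_000 // 8


def torrc_bandwidth_line(mbps: int) -> str:
    """Render a BandwidthRate line in plain bytes (hypothetical helper)."""
    return f"BandwidthRate {mbps_to_bytes_per_sec(mbps)} bytes"


def sustained_transfer_tb(mbps: int, days: int) -> float:
    """Decimal terabytes moved in ONE direction at a constant rate."""
    return mbps_to_bytes_per_sec(mbps) * 86_400 * days / 1e12


if __name__ == "__main__":
    print(torrc_bandwidth_line(100))
    print(sustained_transfer_tb(100, 30))
```

Even when capped, a relay running flat out at this rate would move tens of terabytes per month in each direction, so a rate cap alone may not keep a VM inside a monthly transfer quota.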

  • To limit memory usage and thus all the I/O causing hypervisor overhead, I've taken the following steps:

    • Installed Alpine Linux instead of Debian to significantly reduce base memory usage
    • Preloaded jemalloc2 for Tor (overriding glibc's ptmalloc3), reducing memory fragmentation somewhat
    • Strictly limited Tor's MaxMemInQueues to further reduce RSS
    • Ensured all services run lightweight BusyBox variants

    Since it didn't seem to be enough, would you be open to allowing me to buy more RAM? It would benefit us both, and the extra memory means I can enable zswap to further reduce I/O without worrying about the extra slab pressure it causes. It increases guest CPU usage slightly (due to compression/decompression overhead), but the overall CPU usage as reported by the hypervisor would surely fall (due to fewer vmexits).

    Thanked by 1MannDude
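
One way to check whether steps like these actually reduce swap I/O is to sample the guest's swap counters twice and compute the rate. The sketch below assumes a Linux guest where `/proc/vmstat` exposes `pswpin` and `pswpout` (pages swapped in and out) and a 4 KiB page size; the function names are illustrative, not taken from any tool in this thread.

```python
def parse_vmstat(text: str) -> dict:
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rate_mib_s(before: str, after: str, seconds: float,
                    page_size: int = 4096) -> float:
    """Combined swap-in plus swap-out throughput in MiB/s between two samples.

    page_size=4096 is an assumption; check it on the actual system.
    """
    a, b = parse_vmstat(before), parse_vmstat(after)
    pages = (b["pswpin"] - a["pswpin"]) + (b["pswpout"] - a["pswpout"])
    return pages * page_size / seconds / (1024 * 1024)


def read_vmstat(path: str = "/proc/vmstat") -> str:
    """Read the live counters (Linux only)."""
    with open(path) as fh:
        return fh.read()
```

In practice one would call `read_vmstat()`, wait a known number of seconds, call it again, and pass both samples to `swap_rate_mib_s`.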
  • MannDudeMannDude Patron Provider, Veteran

    @forest said:

    @MannDude said: 100% CPU usage for a long period and 150%+ BW usage, though that auto-throttles but still isn't "unlimited".

    I'm showing only 50% CPU (60% 10 minute peak) usage over the last week of operation on my side, but I think I know the issue: The RAM is so low that it's constantly swapping at 10 MiB/s (30 MiB/s 10 minute peak). All that I/O is probably causing severe hypervisor overhead through vmexit/vmenter that is not being accounted for within the guest.

    Alternatively, maybe just hard-cap my network at 100 Mbps and set an IOPS limit on my VM to reduce swapping? That would reduce VM context switches and thus hypervisor overhead.

    In the meantime, I'll configure Tor's bandwidth limit to not exceed 100 Mbps when it is back up.

    @MannDude said: Only POP I've seen this in has been SE.

    Could you move the server to BG? I'd be fine with that to help with load balancing.

    I hurried back to my desk to review some things real quick.

    It's back online, though I did temp-cap the CPU to 65%. I don't have any way to automate this, but I do have written notes on my desk and was planning on un-capping it manually again tonight (SEA time).

    Main concern here is that under normal usage, this hypervisor lingers around 25-35% CPU usage. I capped it weeks ago to prevent new VM creations as well. 48 cores, lots of room for individual VMs to burst to full usage for extended periods of time like we have in all of our other POPs and hypervisors. From our side, we've acted as we've always done in terms of hypervisor setup and capping of new VM creation after certain thresholds are met so as to maintain a quality service.

    Last night was trying to settle down for the evening and just had the alerts going off again. Quickest way to restore service for everyone was to quick-cap the top offenders.

    I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

    You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

    I started digging into the VirtFusion documentation last night as well and trying to see if there is a better way to get notified of potential CPU abuse or issues from individual VMs via webhooks. I did find this as well, https://github.com/noxitylabs/virtfusion-cpu-abuse-detector .

    Thanked by 1ServerBachelor
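
To put the "160% of the monthly quota with two weeks left" figure in perspective, a simple linear projection shows where usage would land by the reset date. This is a back-of-the-envelope sketch that assumes a 30-day billing cycle and a constant usage rate, neither of which is stated in the thread.

```python
def projected_usage_pct(used_pct: float, days_elapsed: int,
                        cycle_days: int = 30) -> float:
    """Linearly extrapolate quota usage to the end of the billing cycle.

    cycle_days=30 is an assumption, not a documented IncogNET value.
    """
    if not 0 < days_elapsed <= cycle_days:
        raise ValueError("days_elapsed must be between 1 and cycle_days")
    return used_pct / days_elapsed * cycle_days


if __name__ == "__main__":
    # 160% used with 14 days left in a 30-day cycle means 16 days elapsed.
    print(projected_usage_pct(160, 16))
```

Under those assumptions the VM was on track for roughly three times its quota, which is consistent with it looking like abuse at a glance.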
  • MannDudeMannDude Patron Provider, Veteran

    @forest said:
    To limit memory usage and thus all the I/O causing hypervisor overhead, I've taken the following steps:

    • Installed Alpine Linux instead of Debian to significantly reduce base memory usage
    • Preloaded jemalloc2 for Tor (overriding glibc's ptmalloc3), reducing memory fragmentation somewhat
    • Strictly limited Tor's MaxMemInQueues to further reduce RSS
    • Ensured all services run lightweight BusyBox variants

    Since it didn't seem to be enough, would you be open to allowing me to buy more RAM? It would benefit us both, and the extra memory means I can enable zswap to further reduce I/O without worrying about the extra slab pressure it causes. It increases guest CPU usage slightly (due to compression/decompression overhead), but the overall CPU usage as reported by the hypervisor would surely fall (due to fewer vmexits).

    I'll just toss you more, no charge :)

    Reboot for a surprise. I think that should help in your case.

    Thanked by 2forest ShadowLurker
  • @MannDude said: I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

    That makes sense. I promise I'm not misusing your services, of course. No port scanning, no crypto mining. Nothing like that. Just a middle relay to help promote privacy and freedom. :smile:

    @MannDude said: You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

    If the Bulgaria hypervisor has less load and thus would be able to better tolerate a Tor relay, then let's move it there. Otherwise let's keep it where it is.

    @MannDude said: I'll just toss you more, no charge :)

    Thank you! I'll go log in and reboot it now!

    Thanked by 1MannDude
  • @MannDude said:

    @forest said:

    @MannDude said: 100% CPU usage for a long period and 150%+ BW usage, though that auto-throttles but still isn't "unlimited".

    I'm showing only 50% CPU (60% 10 minute peak) usage over the last week of operation on my side, but I think I know the issue: The RAM is so low that it's constantly swapping at 10 MiB/s (30 MiB/s 10 minute peak). All that I/O is probably causing severe hypervisor overhead through vmexit/vmenter that is not being accounted for within the guest.

    Alternatively, maybe just hard-cap my network at 100 Mbps and set an IOPS limit on my VM to reduce swapping? That would reduce VM context switches and thus hypervisor overhead.

    In the meantime, I'll configure Tor's bandwidth limit to not exceed 100 Mbps when it is back up.

    @MannDude said: Only POP I've seen this in has been SE.

    Could you move the server to BG? I'd be fine with that to help with load balancing.

    I hurried back to my desk to review some things real quick.

    It's back online, though I did temp-cap the CPU to 65%. I don't have any way to automate this, but I do have written notes on my desk and was planning on un-capping it manually again tonight (SEA time).

    Main concern here is that under normal usage, this hypervisor lingers around 25-35% CPU usage. I capped it weeks ago to prevent new VM creations as well. 48 cores, lots of room for individual VMs to burst to full usage for extended periods of time like we have in all of our other POPs and hypervisors. From our side, we've acted as we've always done in terms of hypervisor setup and capping of new VM creation after certain thresholds are met so as to maintain a quality service.

    Last night was trying to settle down for the evening and just had the alerts going off again. Quickest way to restore service for everyone was to quick-cap the top offenders.

    I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

    You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

    I started digging into the VirtFusion documentation last night as well and trying to see if there is a better way to get notified of potential CPU abuse or issues from individual VMs via webhooks. I did find this as well, https://github.com/noxitylabs/virtfusion-cpu-abuse-detector .

    Likewise, I have no issue leaving my VM offline until everything is resolved. But I am curious to know if there’s been any update re. what happened, given that I was barely using it and yet resource usage appeared hugely disproportionate.

  • edited February 20

    @ServerBachelor said:

    @MannDude said:

    @forest said:

    @MannDude said: 100% CPU usage for a long period and 150%+ BW usage, though that auto-throttles but still isn't "unlimited".

    I'm showing only 50% CPU (60% 10 minute peak) usage over the last week of operation on my side, but I think I know the issue: The RAM is so low that it's constantly swapping at 10 MiB/s (30 MiB/s 10 minute peak). All that I/O is probably causing severe hypervisor overhead through vmexit/vmenter that is not being accounted for within the guest.

    Alternatively, maybe just hard-cap my network at 100 Mbps and set an IOPS limit on my VM to reduce swapping? That would reduce VM context switches and thus hypervisor overhead.

    In the meantime, I'll configure Tor's bandwidth limit to not exceed 100 Mbps when it is back up.

    @MannDude said: Only POP I've seen this in has been SE.

    Could you move the server to BG? I'd be fine with that to help with load balancing.

    I hurried back to my desk to review some things real quick.

    It's back online, though I did temp-cap the CPU to 65%. I don't have any way to automate this, but I do have written notes on my desk and was planning on un-capping it manually again tonight (SEA time).

    Main concern here is that under normal usage, this hypervisor lingers around 25-35% CPU usage. I capped it weeks ago to prevent new VM creations as well. 48 cores, lots of room for individual VMs to burst to full usage for extended periods of time like we have in all of our other POPs and hypervisors. From our side, we've acted as we've always done in terms of hypervisor setup and capping of new VM creation after certain thresholds are met so as to maintain a quality service.

    Last night was trying to settle down for the evening and just had the alerts going off again. Quickest way to restore service for everyone was to quick-cap the top offenders.

    I think the only reason yours was suspended in full at the time and not just capped was because I saw the BW being 160% of the monthly quota with two weeks left before it resets + 100% CPU usage. At a glance that just screamed "abuse".

    You can stay in Sweden if you'd like, or we can move you to Bulgaria if you'd prefer. Up to you.

    I started digging into the VirtFusion documentation last night as well and trying to see if there is a better way to get notified of potential CPU abuse or issues from individual VMs via webhooks. I did find this as well, https://github.com/noxitylabs/virtfusion-cpu-abuse-detector .

    Likewise, I have no issue leaving my VM offline until everything is resolved. But I am curious to know if there’s been any update re. what happened, given that I was barely using it and yet resource usage appeared hugely disproportionate.

    MannDude has already received the details via my ticket, but just to report publicly, I believe that I've taken sufficient measures to safeguard against future issue-causing processes of the same type.

  • zedzed Member

    Does anyone know if Sweden is stable or are we waiting to see?

    https://portal.incognet.io/serverstatus.php is still blank, maybe it's not the correct url.

    Thanked by 1ServerBachelor
  • My SE VPS had a brief network outage (~5 mins) maybe 3 hours ago, but overall it has looked pretty stable since the incident.

  • zedzed Member

    well here comes the explosion again.

  • RadiRadi Host Rep, Veteran

    I bought a few 512 MB services and one 2 GB service from the lifetime promo for fun, and have configured most of them with the ideas I had in mind. I just wish I had bought a few more 512s in the other locations (for VPNs).

    So far very happy with my Incognet experience, thanks @MannDude :smile: .

  • @MannDude I'm getting about 80% CPU steal on the Sweden node, just fyi. Previously, it's always been well under 1%.

  • edited February 23

    @forest said:
    @MannDude I'm getting about 80% CPU steal on the Sweden node, just fyi. Previously, it's always been well under 1%.

    Similar issue; %Cpu(s): 22.9 st for me at some points, sometimes higher (spikes up to 72%)
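
For anyone wanting to log steal over time rather than eyeball `top`, the `st` figure can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. A minimal sketch, assuming a Linux guest whose kernel reports the steal field (the eighth value on that line); the function names are illustrative.

```python
def cpu_times(stat_text: str) -> list:
    """Return the first eight counters (user..steal) of the aggregate cpu line.

    guest and guest_nice are skipped because they are already included
    in user and nice.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def steal_percent(before: str, after: str) -> float:
    """Share of CPU time stolen by the hypervisor between two samples."""
    a, b = cpu_times(before), cpu_times(after)
    deltas = [y - x for x, y in zip(a, b)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total  # index 7 is the steal counter
```

Feeding it the contents of `/proc/stat` read a few seconds apart gives a number comparable to the `st` column quoted in this thread.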

  • zedzed Member

    just ban whoever got a vm a week ago when this shit started thx.

  • @MannDude I have 2 legacy VMs which cannot be controlled via the enduser panel (they are listed as "offline" on control.incogvps.com despite my being able to still use the stuff I've installed on them).

    I was hoping I'd be able to rebuild the VPS (fresh Debian 12 installs on both), I assume I can't because of the migration to Virtfusion?

    Nothing urgent, just wondering.

  • Thanks for this.

    @MannDude I made ticket #0224D59E9 to review at your convenience.

  • @forest said: I'm getting about 80% CPU steal

    @ServerBachelor said: %Cpu(s): 22.9

    Mine had a bit of steal but quickly went back to normal.

  • @JohnFilch123 said:

    @forest said: I'm getting about 80% CPU steal

    @ServerBachelor said: %Cpu(s): 22.9

    Mine had a bit of steal but quickly went back to normal.

    Mine is still pretty bad.

  • RsfkRsfk Member

    Hey @MannDude

    My server is still suspended despite payment and a ticket submitted yesterday. It’s production critical. Could you please review when available? Thank you.

    Ticket Number: #0225A32I7

  • I cannot SSH into either of my VMs in the Sweden location, despite them being marked as active in WHMCS and running in Virtfusion, so I can't get exact numbers on CPU steal or other metrics.

    Last time I was able to check it, my.vmho.st said 9.8% of CPU was being used on one server, and 1.9% on the other.

    This problem does not affect servers in any of the other locations.

  • I was able to SSH in again. CPU steal hovers around ~28% and spiked up to 88% while I was watching.

  • Ya steal is crazy today.

  • forestforest Member
    edited February 27

    @ServerBachelor said: I cannot SSH into either of my VMs in the Sweden location, despite them being marked as active in WHMCS and running in Virtfusion, so I can't get exact numbers on CPU steal or other metrics.

    It's around 20-30% right now. The inability to SSH might be due to periodic network downtime. This is a graph of CPU usage in the last 24 hours. The gaps represent times when my remote monitoring server was unable to reach it (network downtime):

    [Graph: CPU usage over the last 24 hours, with gaps during network downtime]

    If it goes down again and I notice it, I'll connect via VNC (since SSH won't work of course) and see if I can troubleshoot.

    Thanked by 1ServerBachelor
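
The gaps described above can also be pulled out of raw monitoring data instead of being read off a graph. The following is a small sketch under the assumption that the monitor records one Unix timestamp per successful poll at a fixed interval; the interval and tolerance values are placeholders, not details of forest's actual setup.

```python
def find_gaps(timestamps: list, interval: float,
              tolerance: float = 1.5) -> list:
    """Return (last_seen, next_seen) pairs where polls were missed.

    A gap is any spacing larger than interval * tolerance; both values
    are illustrative and should match the real polling schedule.
    """
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > interval * tolerance:
            gaps.append((prev, cur))
    return gaps


if __name__ == "__main__":
    # Polls every 60 s, with nothing received between t=120 and t=420.
    print(find_gaps([0, 60, 120, 420, 480], interval=60))
```

Each returned pair marks the last successful poll before an outage and the first one after it, which makes it easy to compare downtime windows between users on the same node.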
  • @ServerBachelor said:

    Thanks for this.

    @MannDude I made ticket #0224D59E9 to review at your convenience.

    This was resolved.

    Thanked by 1JohnnySac
  • BG has been up and down all afternoon, with the current outage at over an hour straight.

    Thanked by 3oloke iriska Nekopara
  • edited February 28

    @MatthewM said:
    BG has been up and down all afternoon, with the current outage at over an hour straight.

    I can't SSH into my VM in Sofia, either.

    No issues in Stockholm, other than that CPU steal is still spiking up to 30%.

    Thanked by 1oloke
  • olokeoloke Member, Host Rep

    image

    Thanked by 1ServerBachelor
  • @ServerBachelor said:

    @MannDude said:
    Correct, had a hypervisor issue in Sweden. Please also tag me (@MannDude) on LET if requesting LET updates since it'll go to an inbox that gives me a phone notification as well. :)

    If you already have a ticket open about it, I'll be adding some credit for the issue. If you don't, please open a ticket and I'll make sure you're credited.

    #0217Y55Y7 opened


    Other updates while I'm here:

    • Singapore in the future? 👀 (At least for some limited offerings... VPN, DNS, Shared Hosting at minimum)
    • Hardware up in WA to begin legacy VPS migrations from Virtualizor to VirtFusion. ETA of completion... Probably a month, for just that one POP. Then will try to get PA done after that.
    • Hardware on order for NL legacy VPS migrations from Virtualizor to VirtFusion...
    • This just leaves KC as the only other POP with legacy VMs. We're going to do something a bit different for this migration but want to complete the other POPs first.

    👀

    • Same ol' same ol'. Busy busy busy.

    Btw @MannDude any update on ticket #0217Y55Y7?

    Re. extra RAM as compensation for Stockholm issues. I understand the delay if the issue is still ongoing.
