KVM host on AlmaLinux 9 dropping packets
Hello!
I have a KVM host node on AlmaLinux 9 running Virtualizor.
Packet loss starts at random moments, whether traffic is around 30 Mbit/s or 150 Mbit/s, so the problem does not seem to depend on the amount of traffic.
I updated the kernel to the latest one from ELRepo.
I updated the i40e driver from https://github.com/intel/ethernet-linux-i40e
I tried changing the sysctl settings; the current values are:
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
fs.file-max = 65536
net.netfilter.nf_conntrack_max = 1048576
net.nf_conntrack_max = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.netdev_max_backlog = 250000
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 131072 16777216
net.ipv4.tcp_fastopen = 3
net.core.netdev_budget = 25000
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 600000
net.ipv4.tcp_max_syn_backlog = 600000
net.ipv4.tcp_sack = 0
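Values set in sysctl config files only take effect once they are loaded (for example with sysctl --system or a reboot), and the net.netfilter.nf_conntrack_* keys only exist while the nf_conntrack module is loaded. A minimal sketch, assuming read access to /proc/sys, that compares a subset of the values above with what the running kernel actually reports. It maps dots to slashes, so it would not handle interface names that contain dots (such as VLAN devices).

```python
from pathlib import Path
from typing import Dict, Tuple

# Subset of the values listed above; extend as needed.
DESIRED = {
    "net.core.netdev_max_backlog": "250000",
    "net.core.netdev_budget": "25000",
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.netfilter.nf_conntrack_max": "1048576",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
}

def sysctl_path(key: str) -> Path:
    """Map a dotted sysctl key to its /proc/sys file (dots become slashes)."""
    return Path("/proc/sys") / key.replace(".", "/")

def normalize(value: str) -> str:
    """Collapse whitespace: the kernel prints multi-value keys tab-separated."""
    return " ".join(value.split())

def check(desired: Dict[str, str]) -> Dict[str, Tuple[str, str]]:
    """Return {key: (expected, actual)} for every key whose live value differs."""
    mismatches = {}
    for key, expected in desired.items():
        path = sysctl_path(key)
        # A missing file usually means the owning module is not loaded.
        actual = normalize(path.read_text()) if path.exists() else "<missing>"
        if actual != normalize(expected):
            mismatches[key] = (expected, actual)
    return mismatches
```

On the host, check(DESIRED) returns an empty dict when everything is applied; any entry shows the expected value next to the live one.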
I built irqbalance 1.9.0 from source.
I increased txqueuelen to 10000 on all interfaces and changed the MTU to 5000.
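The ifconfig output below shows eno1 at MTU 5000 while the VPS interface stays at 1500, and a later comment recommends the same MTU on all interfaces. A small sketch that inventories MTU and txqueuelen from sysfs and flags interfaces whose MTU differs from the most common value on the host; using the majority MTU as the reference is an arbitrary choice for illustration.

```python
from collections import Counter
from pathlib import Path
from typing import Dict, List, Optional, Tuple

SYS_NET = Path("/sys/class/net")
Settings = Dict[str, Tuple[Optional[int], Optional[int]]]

def read_int(path: Path) -> Optional[int]:
    """Read an integer sysfs attribute; None if missing or unreadable."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None

def link_settings(root: Path = SYS_NET) -> Settings:
    """Return {iface: (mtu, tx_queue_len)} for every interface except loopback."""
    return {
        dev.name: (read_int(dev / "mtu"), read_int(dev / "tx_queue_len"))
        for dev in sorted(root.iterdir())
        if dev.name != "lo"  # loopback normally has MTU 65536; ignore it
    }

def mtu_outliers(settings: Settings) -> List[str]:
    """Interfaces whose MTU differs from the most common MTU on the host."""
    counts = Counter(mtu for mtu, _ in settings.values() if mtu is not None)
    if not counts:
        return []
    common = counts.most_common(1)[0][0]
    return [name for name, (mtu, _) in settings.items()
            if mtu is not None and mtu != common]
```

On a host with many 1500-MTU tap interfaces, mtu_outliers(link_settings()) would list eno1 and any bridge still at 5000.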
ethtool -i eno1
driver: i40e
version: 2.26.8
firmware-version: 3.31 0x80000cd9 1.1747.0
expansion-rom-version:
bus-info: 0000:60:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes
ifconfig
eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 5000
ether 3c:ec:ef:a0:f8:9c txqueuelen 10000 (Ethernet)
RX packets 26187939764 bytes 21613611194137 (19.6 TiB)
RX errors 1619921 dropped 2347665748 overruns 0 frame 0
TX packets 23537096937 bytes 20723141882148 (18.8 TiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ifconfig for a VPS interface, as an example:
viifv1193: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet6 fe80::fc16:3eff:feef:50ee prefixlen 64 scopeid 0x20<link>
ether fe:16:3e:ef:50:ee txqueuelen 10000 (Ethernet)
RX packets 2410349 bytes 388340121 (370.3 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 1692786688 bytes 118112058934 (110.0 GiB)
TX errors 0 dropped 1418299 overruns 0 carrier 0 collisions 0
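The cumulative counters above can be turned into a rough drop rate. A small sketch, assuming the dropped counters are kept separately from the packet counters (typical on Linux, but drivers differ), so the denominator is delivered plus dropped. Because the counters are totals since boot, this is an average and hides the bursts.

```python
def drop_ratio(dropped: int, packets: int) -> float:
    """Fraction of packets dropped, treating `packets` as delivered only."""
    total = packets + dropped
    return dropped / total if total else 0.0

# eno1 RX, from the ifconfig output above
print(f"eno1 RX dropped: {drop_ratio(2347665748, 26187939764):.1%}")  # 8.2%
# viifv1193 TX: on a tap device, TX is the host -> guest direction
print(f"viifv1193 TX dropped: {drop_ratio(1418299, 1692786688):.2%}")  # 0.08%
```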
dropwatch -l kas
5724 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
11 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
24 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
2 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
36 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
2 drops at ip_rcv_finish_core.constprop.0+1d7 (0xffffffffac3f9f17) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
2 drops at tcp_v4_rcv+80 (0xffffffffac430c80) [software]
13 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
13 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
eu-addr2line -f -k 0xffffffffc0b9d3b2
tun_net_xmit
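The address 0xffffffffc0b9d3b2 lies in the kernel's module area, which is most likely why dropwatch printed it relative to the unrelated __init_scratch_end symbol, while eu-addr2line -k resolved it to tun_net_xmit in the tun module. The same lookup can be sketched against /proc/kallsyms (real addresses are visible only with sufficient privileges, depending on kernel.kptr_restrict): find the highest symbol address at or below the target. The addresses in the test data are made up for illustration.

```python
import bisect
from typing import Iterable, List, Optional, Tuple

Symbol = Tuple[int, str]

def load_symbols(lines: Iterable[str]) -> List[Symbol]:
    """Parse /proc/kallsyms lines ('ADDR TYPE NAME [module]') into sorted (addr, name)."""
    syms = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue
        addr = int(parts[0], 16)
        if addr == 0:  # hidden by kptr_restrict: no usable address
            continue
        name = " ".join(parts[2:4])  # symbol plus optional [module]
        syms.append((addr, name))
    syms.sort()
    return syms

def resolve(addr: int, syms: List[Symbol]) -> Optional[str]:
    """Return 'name+0xoffset' for the nearest symbol at or below addr.

    Symbol sizes are unknown here, so an address past the end of the
    last function is still attributed to it.
    """
    addrs = [a for a, _ in syms]
    i = bisect.bisect_right(addrs, addr) - 1
    if i < 0:
        return None
    base, name = syms[i]
    return f"{name}+{addr - base:#x}"
```

On the host, pass an open /proc/kallsyms file to load_symbols. As for the meaning: tun_net_xmit runs when the host hands a packet to a guest's tap device, and it typically drops when that queue's ring (sized from txqueuelen) is full because vhost/the guest is not draining it fast enough, which is consistent with the TX dropped counter on viifv1193.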
There are no errors in dmesg or in the messages log, only: "HTB: quantum of class 10001 is big. Consider r2q change."
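This warning comes from the kernel's HTB scheduler: when a class has no explicit quantum, HTB derives it as the class rate in bytes per second divided by the qdisc's r2q (10 by default) and clamps the result to the 1000..200000 byte range, warning when it clamps. The class id is printed in hex, so "10001" means class 1:1. A sketch of that arithmetic:

```python
from typing import Tuple

DEFAULT_R2Q = 10       # tc htb default r2q
MIN_QUANTUM = 1000     # bytes; below this the kernel clamps and warns "small"
MAX_QUANTUM = 200000   # bytes; above this the kernel clamps and warns "big"

def htb_quantum(rate_bits_per_s: int, r2q: int = DEFAULT_R2Q) -> Tuple[int, str]:
    """Quantum HTB derives when none is set explicitly, and the warning it gives."""
    quantum = rate_bits_per_s // 8 // r2q  # rate in bytes/s divided by r2q
    if quantum < MIN_QUANTUM:
        return MIN_QUANTUM, "small"
    if quantum > MAX_QUANTUM:
        return MAX_QUANTUM, "big"
    return quantum, "ok"

def classid_from_log(hex_id: str) -> str:
    """Turn the hex id printed in the warning (e.g. '10001') into major:minor."""
    value = int(hex_id, 16)
    return f"{value >> 16:x}:{value & 0xFFFF:x}"
```

With the default r2q, any class faster than 16 Mbit/s (2,000,000 bytes/s / 10 = 200,000 bytes) triggers the "big" message. Setting quantum explicitly on the class, or raising r2q on the HTB qdisc, silences it; the quantum mainly affects how spare bandwidth is shared between classes, so this warning alone is unlikely to explain the drops.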
I can't find a solution to the packet loss. Maybe someone has run into this problem and can suggest a fix?

Comments
This should not happen at all. Can you show the output of ethtool -S eno1?
ethtool -S eno1:
Full output: https://pastebin.com/Ap7buUek
@Netify see this:
I suspect the issue is at the physical layer. Please check the ports on both the NIC and the switch, as well as the cable.
@tentor our guru
We changed the cable and the port, but it made no difference.
After changing the port/cable, is rx_errors still increasing, or only rx_dropped, during packet loss?
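Answering this requires sampling the counters while the loss is happening, since ifconfig shows totals since boot. A sketch that polls /sys/class/net/&lt;iface&gt;/statistics once per interval and prints only the counters that grew; rx_crc_errors is included because CRC errors would point back at the cable/port theory. The counter names are standard sysfs attributes, but how fully each driver fills them in varies.

```python
import time
from pathlib import Path
from typing import Dict

COUNTERS = ("rx_packets", "rx_errors", "rx_crc_errors", "rx_missed_errors",
            "rx_dropped", "tx_dropped")

def read_counters(iface: str, root: Path = Path("/sys/class/net")) -> Dict[str, int]:
    """Read the selected counters for one interface from sysfs."""
    stats = root / iface / "statistics"
    return {name: int((stats / name).read_text()) for name in COUNTERS}

def delta(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-counter increase between two snapshots."""
    return {name: after[name] - before[name] for name in before}

def watch(iface: str, interval: float = 1.0, samples: int = 10) -> None:
    """Print the packet rate and any error/drop counters that grew each interval."""
    prev = read_counters(iface)
    for _ in range(samples):
        time.sleep(interval)
        cur = read_counters(iface)
        d = delta(prev, cur)
        moved = {k: v for k, v in d.items() if k != "rx_packets" and v}
        print(f"{iface}: {d['rx_packets'] / interval:.0f} pkt/s, increased: {moved}")
        prev = cur
```

Running watch("eno1") on the host, and the same against a tap such as viifv1193, while ping shows loss would show which counter actually moves.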
Unfortunately, I can't say for sure. It seems that both values increased.
Yesterday I rebooted the server into the default AlmaLinux kernel (not the ELRepo one).
Now there are no errors on the interfaces:
but there are errors on the VPS:
When I ping the host node or a VPS, I see packet loss:
So it does not look like a hardware issue at the moment. The basic checks on both the virtual and the host machines, during packet loss, are:
- no single CPU core is loaded above 70%, and softirq time (si) stays below 50%;
- the conntrack table is not full;
- disable conntrack (temporarily, if possible);
- set the same MTU on all interfaces;
- disable offloads on the NICs.
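The first two checks can be scripted from /proc/stat (per-core jiffies: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) and the conntrack sysctls. A sketch that compares two /proc/stat snapshots and flags cores above the 70% busy / 50% softirq thresholds mentioned here. The guest columns are left out because the kernel already includes them in user/nice; counting iowait as idle is a choice made for this illustration.

```python
from pathlib import Path
from typing import Dict, List, Tuple

def parse_proc_stat(text: str) -> Dict[str, List[int]]:
    """Per-core counters from /proc/stat; the aggregate 'cpu' line is skipped."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("cpu") and parts[0][3:].isdigit():
            cores[parts[0]] = [int(x) for x in parts[1:]]
    return cores

def core_usage(before: Dict[str, List[int]],
               after: Dict[str, List[int]]) -> Dict[str, Tuple[float, float]]:
    """(busy %, softirq %) per core between two snapshots."""
    usage = {}
    for core, old in before.items():
        if core not in after:
            continue
        diff = [n - o for n, o in zip(after[core], old)]
        total = sum(diff[:8])     # user..steal; guest is already inside user/nice
        if total <= 0:
            continue
        idle = diff[3] + diff[4]  # idle + iowait
        usage[core] = (100.0 * (total - idle) / total, 100.0 * diff[6] / total)
    return usage

def hot_cores(usage: Dict[str, Tuple[float, float]],
              busy_limit: float = 70.0, si_limit: float = 50.0) -> List[str]:
    """Cores above either threshold, in numeric order."""
    return sorted((c for c, (busy, si) in usage.items()
                   if busy > busy_limit or si > si_limit),
                  key=lambda c: int(c[3:]))

def conntrack_fill(root: Path = Path("/proc/sys/net/netfilter")) -> float:
    """Fraction of the conntrack table in use (requires nf_conntrack loaded)."""
    count = int((root / "nf_conntrack_count").read_text())
    limit = int((root / "nf_conntrack_max").read_text())
    return count / limit
```

In practice, read Path("/proc/stat").read_text() twice about a second apart during the loss and pass both parsed snapshots to core_usage.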
During packet loss, some cores are at 100% load (2-4 out of 88).
Load average is 15-25, sometimes 30.
Disabling offloads on the NICs: done, but this also had no effect.
I reduced the channel count on eno1 from 88 to 30 with "ethtool -L eno1 combined 30". It is a little better now, but it doesn't solve the problem; packet loss is still growing.
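With only 2-4 of 88 cores saturated, it may help to check whether several NIC queue interrupts land on the same CPUs. A sketch that parses /proc/interrupts and reports, for each eno1 queue vector, the CPU column with the most interrupts so far. The vector name pattern i40e-eno1-TxRx-N is an assumption about how the i40e driver names its IRQs, so adjust it to what /proc/interrupts actually shows. The counts are cumulative; for the current pinning, read /proc/irq/&lt;N&gt;/smp_affinity_list instead.

```python
import re
from collections import Counter
from typing import Dict

def queue_irq_cpus(text: str, pattern: str = r"i40e-eno1-TxRx-\d+") -> Dict[str, str]:
    """Map each matching IRQ name to the CPU column with the highest count."""
    lines = text.splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        parts = line.split()
        # The IRQ name is the last field; skip anything not matching the pattern.
        if not parts or not re.fullmatch(pattern, parts[-1]):
            continue
        counts = [int(x) for x in parts[1:1 + len(cpus)]]
        busiest = max(range(len(counts)), key=counts.__getitem__)
        result[parts[-1]] = cpus[busiest]
    return result

def crowded_cpus(mapping: Dict[str, str]) -> Dict[str, int]:
    """CPUs that are the busiest target for more than one queue."""
    return {cpu: n for cpu, n in Counter(mapping.values()).items() if n > 1}
```

Calling crowded_cpus(queue_irq_cpus(open("/proc/interrupts").read())) lists any CPU that has ended up serving several eno1 queues.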
What else to try?