KVM host on AlmaLinux 9: dropped packets

Netify · October 2024

Hello!

I have a KVM host node on AlmaLinux 9 with Virtualizor.

Packet loss starts at random moments. Traffic at the time can be 30 Mbit/s or 150 Mbit/s, so the problem does not seem to depend on the amount of traffic.

I updated the kernel to the latest one from elrepo.

I updated the i40e driver from https://github.com/intel/ethernet-linux-i40e
I also tried changing the sysctl settings. Current values:

net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1

fs.file-max = 65536
net.netfilter.nf_conntrack_max = 1048576
net.nf_conntrack_max = 1048576

net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

net.core.netdev_max_backlog = 250000
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 131072 16777216
net.ipv4.tcp_fastopen = 3

net.core.netdev_budget = 25000
net.ipv4.tcp_mtu_probing = 1
net.ipv4.tcp_max_tw_buckets = 600000
net.ipv4.tcp_max_syn_backlog = 600000
net.ipv4.tcp_sack = 0

I built irqbalance from source (version 1.9.0).

I increased txqueuelen to 10000 on all interfaces and changed the MTU to 5000.

ethtool -i eno1

driver: i40e
version: 2.26.8
firmware-version: 3.31 0x80000cd9 1.1747.0
expansion-rom-version:
bus-info: 0000:60:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

ifconfig

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 5000
        ether 3c:ec:ef:a0:f8:9c  txqueuelen 10000  (Ethernet)
        RX packets 26187939764  bytes 21613611194137 (19.6 TiB)
        RX errors 1619921  dropped 2347665748  overruns 0  frame 0
        TX packets 23537096937  bytes 20723141882148 (18.8 TiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

ifconfig for one VPS interface, as an example:

viifv1193: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::fc16:3eff:feef:50ee  prefixlen 64  scopeid 0x20<link>
        ether fe:16:3e:ef:50:ee  txqueuelen 10000  (Ethernet)
        RX packets 2410349  bytes 388340121 (370.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1692786688  bytes 118112058934 (110.0 GiB)
        TX errors 0  dropped 1418299 overruns 0  carrier 0  collisions 0

dropwatch -l kas

5724 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
11 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
24 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
2 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
36 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
2 drops at ip_rcv_finish_core.constprop.0+1d7 (0xffffffffac3f9f17) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
2 drops at tcp_v4_rcv+80 (0xffffffffac430c80) [software]
13 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
12 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]
13 drops at __init_scratch_end+1119d3b2 (0xffffffffc0b9d3b2) [software]
1 drops at ip6_mc_input+270 (0xffffffffac4b34f0) [software]

eu-addr2line -f -k 0xffffffffc0b9d3b2
tun_net_xmit

There are no errors in dmesg or messages, only "HTB: quantum of class 10001 is big. Consider r2q change."

I can't find a solution to the packet loss. Maybe someone has encountered this problem and can suggest a solution?

tentor · October 2024

@Netify said: RX errors 1619921

This should not happen at all. Can you show ethtool -S eno1?

rx_errors

Total number of bad packets received on this network device. This counter must include events counted by rx_length_errors, rx_crc_errors, rx_frame_errors and other errors not otherwise counted.

Netify · October 2024

ethtool -S eno1:

NIC statistics:
     rx_packets: 26205850611
     tx_packets: 23553964510
     rx_bytes: 21629429891407
     tx_bytes: 20739691723156
     rx_errors: 1620466
     tx_errors: 0
     rx_dropped: 2351373149
     tx_dropped: 0
     collisions: 0
     rx_length_errors: 0
     rx_crc_errors: 0
     rx_unicast: 24648285503
     tx_unicast: 23553009332
     rx_multicast: 93357327
     tx_multicast: 761492
     rx_broadcast: 3815579106
     tx_broadcast: 194424
     rx_unknown_protocol: 0
     tx_linearize: 0
     tx_force_wb: 306291
     tx_busy: 0
     tx_stopped: 11810
     rx_alloc_fail: 0
     rx_pg_alloc_fail: 0
     rx_cache_reuse: 78590569232
     tx-0.packets: 330155907
     tx-0.bytes: 290108487439
     rx-0.packets: 367900097
     rx-0.bytes: 300674570329
     rx-0.xdp.pass: 0
     rx-0.xdp.drop: 0
     rx-0.xdp.tx: 0
     rx-0.xdp.unknown: 0
     rx-0.xdp.redirect: 0
     rx-0.xdp.redirect_fail: 0
     tx-1.packets: 308497665
     tx-1.bytes: 270639527599
     rx-1.packets: 302286670
     rx-1.bytes: 269126523155
     rx-1.xdp.pass: 0
     rx-1.xdp.drop: 0
     rx-1.xdp.tx: 0
     rx-1.xdp.unknown: 0
     rx-1.xdp.redirect: 0
     rx-1.xdp.redirect_fail: 0
     tx-2.packets: 325785358
     tx-2.bytes: 289565400435
     rx-2.packets: 312357771
     rx-2.bytes: 269494815006
     rx-2.xdp.pass: 0
     rx-2.xdp.drop: 0
     rx-2.xdp.tx: 0
     rx-2.xdp.unknown: 0
     rx-2.xdp.redirect: 0
     rx-2.xdp.redirect_fail: 0

Full output: https://pastebin.com/Ap7buUek
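
For anyone following along, the definition tentor quoted says rx_errors should be the sum of the more specific error counters, so it is worth checking how much of it those counters explain. A minimal sketch in Python, assuming the usual "name: value" layout of ethtool -S output. The function names are made up for illustration, and the sample reuses three values from the dump above:

```python
def parse_ethtool_stats(text):
    """Parse `ethtool -S` style output into a {counter_name: int} dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so the header line "NIC statistics:" is skipped.
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def unexplained_rx_errors(stats, parts=("rx_length_errors", "rx_crc_errors", "rx_frame_errors")):
    """Return rx_errors minus the listed sub-counters (missing ones count as 0)."""
    return stats.get("rx_errors", 0) - sum(stats.get(p, 0) for p in parts)


sample = """NIC statistics:
     rx_errors: 1620466
     rx_length_errors: 0
     rx_crc_errors: 0
"""
stats = parse_ethtool_stats(sample)
print(unexplained_rx_errors(stats))  # 1620466
```

Here none of the listed sub-counters account for rx_errors, so the source has to be one of the driver-specific counters in the full dump.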

tentor · October 2024

@Netify see this:

port.rx_csum_bad: 1620466

I guess the issue is at the physical layer. Please check that the ports on both the NIC and the switch are fine, as well as the cable.

emgh · October 2024

@tentor our guru

Netify · October 2024

@tentor said: I guess the issue is at the physical layer. Please check that the ports on both the NIC and the switch are fine, as well as the cable.

We changed the cable and port. But it did not affect the problem.

vsys_host · October 2024

After changing port/cable, is rx_errors still increasing, or only rx_dropped increasing during packetloss?
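
One way to answer this is to save ethtool -S eno1 output before and during a loss episode and subtract the two. A rough sketch in Python, assuming "name: value" lines. The function names are hypothetical, and the sample numbers are only illustrative inputs: the "before" pair is taken from the ifconfig output in the first post and the "after" pair from the later ethtool -S dump, so they are not two snapshots from the same tool.

```python
def parse_counters(text):
    """Turn `name: value` lines (as printed by `ethtool -S`) into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def counter_deltas(before_text, after_text, names=("rx_errors", "rx_dropped")):
    """Return how much each named counter grew between two snapshots."""
    before = parse_counters(before_text)
    after = parse_counters(after_text)
    return {n: after.get(n, 0) - before.get(n, 0) for n in names}


# Illustrative inputs only, see the note above about where they come from.
before = "rx_errors: 1619921\nrx_dropped: 2347665748\n"
after = "rx_errors: 1620466\nrx_dropped: 2351373149\n"
print(counter_deltas(before, after))  # {'rx_errors': 545, 'rx_dropped': 3707401}
```

On the host, the two inputs would be something like the contents of files written by "ethtool -S eno1 > before.txt" and "ethtool -S eno1 > after.txt". If only rx_dropped moves during packet loss, that points away from the cable and port.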

Netify · October 2024

@vsys_host said: After changing port/cable, is rx_errors still increasing, or only rx_dropped increasing during packetloss?

Unfortunately, I can't say for sure. It seems that both values increased.

I rebooted the server yesterday into the default AlmaLinux kernel (not the one from elrepo).

Now there are no errors on the host interfaces:

eno1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 5000
        RX packets 1364580517  bytes 974542368902 (907.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 1039997351  bytes 1045932782764 (974.1 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0


viifbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 5000
        RX packets 334674207  bytes 18290964723 (17.0 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 12816505  bytes 944801110 (901.0 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

But there are still drops on a VPS interface:

viifv1185: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 5000
        ether fe:16:3e:0f:93:26  txqueuelen 10000  (Ethernet)
        RX packets 1824967  bytes 1758177039 (1.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 324532263  bytes 20464029390 (19.0 GiB)
!!!        TX errors 0  dropped 24443 overruns 0  carrier 0  collisions 0

When I ping the host node or a VPS, I see packet loss:

5 packets transmitted, 2 received, 60% packet loss, time 5001ms
rtt min/avg/max/mdev = 35.636/35.663/35.691/0.190 ms

vsys_host · October 2024

So, it does not look like a hardware issue at the moment. The basic checks to run on both the virtual and host machines during packet loss are: no single CPU core is loaded above 70%, and SI (soft interrupts) is not above 50% on any core; the conntrack table is not full; disable conntrack (temporarily, if possible); set the same MTU on all interfaces; disable offloads on the NICs.
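
The per-core check can be scripted by sampling /proc/stat twice and comparing the deltas. A minimal sketch in Python, assuming the standard /proc/stat column order (user, nice, system, idle, iowait, irq, softirq, steal) with at least those eight columns per line. The function names are made up for illustration, the 70% and 50% limits are the ones from the paragraph above, and the sample snapshots are invented numbers, not data from this host:

```python
def parse_proc_stat(text):
    """Return {cpu_name: [tick counters]} for per-core lines (cpu0, cpu1, ...)."""
    cores = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and everything that is not a core line.
        if fields and fields[0].startswith("cpu") and fields[0] != "cpu":
            cores[fields[0]] = [int(v) for v in fields[1:9]]
    return cores


def overloaded_cores(before_text, after_text, busy_limit=70.0, softirq_limit=50.0):
    """Return {cpu: (busy_pct, softirq_pct)} for cores above either limit."""
    before = parse_proc_stat(before_text)
    after = parse_proc_stat(after_text)
    flagged = {}
    for cpu, new in after.items():
        old = before.get(cpu)
        if old is None:
            continue
        delta = [n - o for n, o in zip(new, old)]
        total = sum(delta)
        if total <= 0:
            continue
        idle = delta[3] + delta[4]  # idle + iowait
        busy_pct = 100.0 * (total - idle) / total
        softirq_pct = 100.0 * delta[6] / total
        if busy_pct > busy_limit or softirq_pct > softirq_limit:
            flagged[cpu] = (round(busy_pct, 1), round(softirq_pct, 1))
    return flagged


# Invented snapshots: cpu1 spends most of the interval in softirq.
before = "cpu  100 0 100 800 0 0 0 0 0 0\ncpu0 50 0 50 400 0 0 0 0 0 0\ncpu1 50 0 50 400 0 0 0 0 0 0\n"
after = "cpu  125 0 125 885 0 0 85 0 0 0\ncpu0 60 0 60 480 0 0 0 0 0 0\ncpu1 55 0 55 405 0 0 85 0 0 0\n"
print(overloaded_cores(before, after))  # {'cpu1': (95.0, 85.0)}
```

On a live host, the two inputs would be open("/proc/stat").read() taken a second or so apart while the packet loss is happening.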

Netify · October 2024

During packet loss, some cores are loaded at 100% (2-4 out of 88).
Load average is 15-25, sometimes 30.

I disabled offloads on the NICs, but this also had no effect.

I reduced the number of queues from 88 to 30 with "ethtool -L eno1 combined 30". It is a little better now, but it doesn't solve the problem. Packet loss is still growing.

What else to try?
