Providers or managers of large Linux server fleets

SplitIce · June 2021

Q. How often do you see non-fatal kernel oopses, warnings, or BUG alerts? Do you monitor for them (netconsole or otherwise)?

We recently increased the number of servers we manage quite significantly. On all systems we watch, log, and triage kernel messages.

One thing I've noticed is that the Linux kernel really isn't as defect-free as one might hope.

Q. Do you find there is significant benefit in high patch-number releases of LTS branches? Do you find them significantly more stable than, say, low (i.e. <20) releases?

SplitIce · June 2021

For those curious about what sparked this:

Just today I found a netconsole (or virtio?) bug. Non-fatal, but a crash risk for sure (for example, if an IRQ occurred during the op).

[...]
[194051.326140] ------------[ cut here ]------------
[194051.326271] netpoll_send_skb_on_dev(): eth0 enabled interrupts in poll (start_xmit+0x0/0x4b0 [virtio_net])
[194051.327739] WARNING: CPU: 0 PID: 9 at net/core/netpoll.c:351 netpoll_send_skb_on_dev+0x231/0x240
[194051.327740] Modules linked in: [...]
[194051.327810] CPU: 0 PID: 9 Comm: ksoftirqd/0 Tainted: G           O      5.7.5+ #22
[194051.327810] Hardware name: Vultr VC2, BIOS
[194051.327811] RIP: 0010:netpoll_send_skb_on_dev+0x231/0x240
[...]
[194051.327838] Call Trace:
[194051.327838]  netpoll_send_udp+0x2c4/0x3e6
[194051.327839]  write_msg+0xda/0xf0 [netconsole]
[194051.327839]  console_unlock+0x33b/0x4b0
[194051.327839]  vprintk_emit+0x17d/0x270
[194051.327840]  printk+0x58/0x6f
[...]

An unsafe printk (or in this case net_warn_ratelimited) is a scary idea.
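For anyone wanting to triage this kind of output across a fleet, here is a minimal sketch of flagging and counting lines like the WARNING above from collected kernel logs (netconsole captures, dmesg dumps, etc.). The marker patterns and the `classify`/`summarize` names are my own assumptions based on common kernel log conventions, not an exhaustive or official list, so treat it as a starting point rather than a finished tool.

```python
import re
from collections import Counter

# Assumed markers for non-fatal kernel alerts; extend for your own fleet.
PATTERNS = [
    ("warning", re.compile(r"\bWARNING: CPU: \d+ PID: \d+ at (?P<where>\S+)")),
    ("bug", re.compile(r"\b(?:kernel )?BUG(?: at|:) (?P<where>.+)")),
    ("oops", re.compile(r"\bOops: (?P<where>.+)")),
]

# Leading "[seconds.micros]" timestamp as printed by the kernel.
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def classify(line):
    """Return (kind, location) for a flagged log line, or None."""
    body = TIMESTAMP.sub("", line)
    for kind, pattern in PATTERNS:
        match = pattern.search(body)
        if match:
            return kind, match.group("where").strip()
    return None


def summarize(lines):
    """Count flagged lines, grouped by (kind, location)."""
    counts = Counter()
    for line in lines:
        hit = classify(line)
        if hit is not None:
            counts[hit] += 1
    return counts
```

Feeding it the log lines from each host and merging the counters gives a rough view of which warning sites recur, which is about what you need to decide what to triage first.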

TimboJones · July 2021

@SplitIce said:
One thing I've noticed is that the Linux kernel really isn't as defect-free as one might hope.

Correct. Anyone who thinks otherwise either isn't that experienced or does too little with it to know where and how often it shits the bed.
