[ceph-users] Avoid Ubuntu Linux kernel 4.15.0-36

Simon Leinen simon.leinen at switch.ch
Sun Oct 28 14:29:02 PDT 2018


As a little "heads-up":

If you are running Ubuntu Bionic 18.04, or Xenial 16.04 with "HWE"
kernels, and have systems running 4.15.0-36 - which was the
default between 2018-10-01 and 2018-10-22 - please consider upgrading to
the latest 4.15.0-38 ASAP (or downgrading to 4.15.0-34).
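
For anyone who wants to check a host quickly, here is a minimal sketch in
Python.  It assumes the running kernel's release string has the usual Ubuntu
form (e.g. "4.15.0-36-generic"); the function name is_affected is made up
for illustration, and the script only looks at the running kernel, not at
other kernels that happen to be installed.

```python
import platform

# Kernel ABI version named in this message as affected.
AFFECTED_ABI = "4.15.0-36"


def is_affected(release: str) -> bool:
    """Return True if a release string such as '4.15.0-36-generic'
    belongs to the affected 4.15.0-36 series."""
    # Require an exact match or the ABI followed by '-<flavour>', so that
    # a string like '4.15.0-360-generic' is not matched by accident.
    return release == AFFECTED_ABI or release.startswith(AFFECTED_ABI + "-")


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    if is_affected(release):
        print(f"{release}: affected, consider 4.15.0-38 or 4.15.0-34")
    else:
        print(f"{release}: not the 4.15.0-36 kernel")
```

Running this on every OSD host and client (e.g. via your configuration
management) should show which machines still need a reboot into another
kernel.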

4.15.0-36 has a TCP bug[1] that can occasionally slow a TCP connection
down to a trickle of 2.5 Kbytes/s (512-byte segments every 200ms).
Once a TCP connection is in this state, it never recovers.

This started happening within our Ceph clusters after we reinstalled a
few servers as part of our Bluestore migration.  The effect on our RBD
users (OpenStack VMs) was pretty terrible - a typical 4 MB transaction
would take about 27 MINUTES at this rate, causing timeouts and crashes.
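
The numbers above can be checked with a few lines of Python.  This is only
a back-of-the-envelope sketch: it assumes "4MB" means 4 MiB and ignores
protocol overhead, and the function names are made up for illustration.

```python
SEGMENT_BYTES = 512                   # one segment ...
INTERVAL_MS = 200                     # ... every 200 ms
TRANSACTION_BYTES = 4 * 1024 * 1024   # assumption: "4MB" means 4 MiB


def stalled_rate(segment_bytes: int, interval_ms: int) -> float:
    """Throughput in bytes/s of a connection sending one segment per interval."""
    return segment_bytes * 1000 / interval_ms


def transfer_minutes(size_bytes: int, rate_bytes_per_s: float) -> float:
    """Minutes needed to move size_bytes at the given rate."""
    return size_bytes / rate_bytes_per_s / 60


rate = stalled_rate(SEGMENT_BYTES, INTERVAL_MS)
minutes = transfer_minutes(TRANSACTION_BYTES, rate)
print(f"{rate:.0f} bytes/s = {rate / 1024:.1f} Kbytes/s")
print(f"4 MiB takes about {minutes:.1f} minutes")
```

That works out to 2560 bytes/s, i.e. 2.5 Kbytes/s, and a bit over 27
minutes for a single 4 MiB transfer.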

This was absolutely painful to diagnose, because it happened so rarely
and was hard to reproduce.  Fortunately the fix is easy - just don't run
this kernel.

I should note that our Ceph clusters run over IPv6; I'm not sure whether
the TCP bug can also hit with IPv4 (the bug report was for IPv6 as well),
although I see no reason why it shouldn't.
-- 
Simon.
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1796895

