[ceph-users] Intel P4600 3.2TB U.2 form factor NVMe firmware problems causing dead disks

Kjetil Joergensen kjetil at medallia.com
Mon Mar 4 17:23:31 PST 2019


Hi,

If QDV10130 pre-dates feb/march 2018, you may have suffered the same
firmware bug as existed on the DC S4600 series. I'm under NDA so I
can't bitch and moan about specifics, but your symptoms sound very
familiar.

It's entirely possible that there's *something* about bluestore whose
access patterns differ from "regular filesystems". We burnt ourselves
with the DC S4600, which had been burn-in tested (so I was told) - but
the burn-in testing was probably done through filesystems rather than
ceph/bluestore.

Previously discussed around here
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-January/023835.html
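For anyone wanting to check which revision their drives are on, the
firmware field can be read with nvme-cli or smartctl. A quick sketch
(/dev/nvme0 is a placeholder device name; substitute your own):

```shell
# Print the firmware revision ("fr") field from the NVMe controller
# identify data. /dev/nvme0 is a placeholder; adjust for your system.
nvme id-ctrl /dev/nvme0 | awk '/^fr / {print $3}'

# smartctl reports the same value as "Firmware Version":
smartctl -i /dev/nvme0 | grep -i 'firmware'
```

If that prints QDV10130 (or older) on a P4600, you may want to schedule
the upgrade sooner rather than later.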

On Mon, Feb 18, 2019 at 7:44 AM David Turner <drakonstein at gmail.com> wrote:
>
> We have 2 clusters of [1] these disks that have 2 Bluestore OSDs per disk (partitioned), 3 disks per node, 5 nodes per cluster.  The clusters are 12.2.4 running CephFS and RBDs.  So in total we have 15 NVMe's per cluster and 30 NVMe's in total.  They were all built at the same time and were running firmware version QDV10130.  On this firmware version we had 2 disk failures early on, 1 more a few months later, and then a month after that (just a few weeks ago) 7 disk failures in 1 week.
>
> The failures are such that the disk is no longer visible to the OS.  This holds true beyond server reboots as well as placing the failed disks into a new server.  With a firmware upgrade tool we got an error that pretty much said there's no way to get data back and to RMA the disk.  We upgraded all of our remaining disks' firmware to QDV101D1 and haven't had any problems since then.  Most of our failures happened while rebalancing the cluster after replacing dead disks and we tested rigorously around that use case after upgrading the firmware.  This firmware version seems to have resolved whatever the problem was.
>
> We have about 100 more of these scattered among database servers and other servers that have never had this problem while running the QDV10130 firmware as well as firmwares between this one and the one we upgraded to.  Bluestore on Ceph is the only use case we've had so far with this sort of failure.
>
> Has anyone else come across this issue before?  Our current theory is that Bluestore is accessing the disk in a way that is triggering a bug in the older firmware version that isn't triggered by more traditional filesystems.  We have a scheduled call with Intel to discuss this, but their preliminary searches into the bugfixes and known problems between firmware versions didn't indicate the bug that we triggered.  It would be good to have some more information about what those differences for disk accessing might be to hopefully get a better answer from them as to what the problem is.
>
>
> [1] https://www.intel.com/content/www/us/en/products/memory-storage/solid-state-drives/data-center-ssds/dc-p4600-series/dc-p4600-3-2tb-2-5inch-3d1.html
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Kjetil Joergensen <kjetil at medallia.com>
SRE, Medallia Inc