[ceph-users] New best practices for osds???

Anthony D'Atri aad at dreamsnake.net
Thu Jul 25 19:27:37 PDT 2019

> We run few hundred HDD OSDs for our backup cluster, we set one RAID 0 per HDD in order to be able
> to use -battery protected- write cache from the RAID controller. It really improves performance, for both
> bluestore and filestore OSDs.

Having run something like 6000 HDD-based FileStore OSDs with colocated journals on RAID HBAs, I’d like to offer some contrasting thoughts.

TL;DR:  Never again!  False economy.  ymmv.


* The implementation predated me and was carved in dogfood^H^H^H^H^H^H^Hstone; try as I might, I could not get it fixed.

* Single-drive RAID0 VDs were created to expose the underlying drives to the OS.  When the architecture was conceived, the HBAs in question didn’t have JBOD/passthrough, though a firmware update shortly thereafter did bring that ability.  It wasn’t known at the time that caching was a function of VDs.

* My sense was that the FBWC did offer some throughput improvement for at least some workloads, but at the cost of latency.

* Using a RAID-capable HBA in IR mode with FBWC meant having to monitor for the presence and status of the BBU/supercap.

* The utility needed for that monitoring, when invoked with ostensibly innocuous parameters, would lock up the HBA for several seconds.

* Traditional BBUs are rated for a lifespan of *only* one year.  FBWCs maybe for … three?  Significant cost to RMA or replace them:  time and karma wasted fighting with the system vendor CSO, engineer and remote hands time to take the system down and swap.  And then the connectors for the supercap were touchy; 15% of the time the system would come up and not see it at all.

* The RAID-capable HBA itself + FBWC + supercap cost … a couple or three hundred more than an IT / JBOD equivalent.

* There was a little-known flaw in secondary firmware that caused FBWC / supercap modules to be falsely reported bad.  The system vendor acted like I was making this up and washed their hands of it, even when I provided them the HBA vendors’ artifacts and documents.

* There were two design flaws that could and did result in cache data loss when a system rebooted or lost power.  There was a field notice for this, which required harvesting serial numbers and checking each.  The affected range of serials was quite a bit larger than what the validation tool admitted.  I had to manage the replacement of 302+ of these in production use, each needing engineer time to manage Ceph and do the hands work, plus the hassle of RMA paperwork.

* There was a firmware / utility design flaw that caused the HDD’s onboard volatile write cache to be silently turned on, despite an HBA config dump showing a setting that should have left it off.  Again data was lost when a node crashed hard or lost power.

* There was another firmware flaw that prevented booting if there was pinned / preserved cache data after a reboot / power loss if a drive failed or was yanked.  The HBA’s option ROM utility would block booting and wait for input on the console.  One could get in and tell it to discard that cache, but it would not actually do so, instead looping back to the same screen.  The only way to get the system to boot again was to replace and RMA the HBA.

* The VD layer lessened the usefulness of iostat data.  It also complicated OSD deployment / removal / replacement.  A smartctl hack to access SMART attributes below the VD layer would work on some systems but not others.

* The HBA model in question would work normally with a certain CPU generation, but not with slightly newer servers with the next CPU generation.  They would randomly, on roughly one boot out of five, negotiate PCIe gen3 which they weren’t capable of handling properly, and would silently run at about 20% of normal speed.  Granted this isn’t necessarily specific to an IR HBA.
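
Two of the failure modes above (a volatile write cache that is on when it should be off, and a PCIe link that negotiated the wrong speed) can at least be spot-checked from the OS on Linux.  What follows is a minimal sketch, not what we actually ran: the function names write_cache_modes and link_speed_gts are made up for illustration, and it assumes a kernel that exposes the sysfs attributes queue/write_cache and current_link_speed.  Note that behind a RAID VD the kernel only sees what the controller reports for the VD, which is exactly the layer that misled us, so the write-cache check is only trustworthy for drives presented directly (IT / JBOD).

```python
import re
from pathlib import Path


def write_cache_modes(sys_block="/sys/block"):
    """Map each block device name to the kernel's view of its write cache.

    Reads <sys_block>/<dev>/queue/write_cache, which holds "write back"
    (a volatile cache the kernel must flush) or "write through".
    Returns an empty dict if the directory does not exist.
    """
    root = Path(sys_block)
    modes = {}
    if not root.is_dir():
        return modes
    for dev in sorted(root.iterdir()):
        attr = dev / "queue" / "write_cache"
        if attr.is_file():
            modes[dev.name] = attr.read_text().strip()
    return modes


def link_speed_gts(pci_dev_dir):
    """Return the negotiated PCIe link speed in GT/s, or None if unknown.

    Reads <pci_dev_dir>/current_link_speed, whose text looks like
    "8.0 GT/s PCIe" on newer kernels or "8 GT/s" on older ones.
    """
    attr = Path(pci_dev_dir) / "current_link_speed"
    if not attr.is_file():
        return None
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*GT/s", attr.read_text())
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    # List every block device the kernel believes has a volatile cache enabled.
    for name, mode in write_cache_modes().items():
        if mode == "write back":
            print(f"{name}: volatile write cache appears to be ON")
```

For the link check, pass the HBA's directory under /sys/bus/pci/devices/ (the PCI address is whatever lspci shows for the card on your system) and compare the result against the speed you know the card handles properly: 2.5, 5.0 and 8.0 GT/s correspond to PCIe gen1, gen2 and gen3.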

Add it all up, and my assertion is that the money, time, karma, and user impact you save from NOT dealing with a RAID HBA *more than pays for* using SSDs for OSDs instead.
