[ceph-users] Huge latency spikes
alexander.v.litvak at gmail.com
Mon Nov 19 22:33:01 PST 2018
I went through a RAID controller firmware update and replaced a pair of SSDs with new ones. Nothing has changed. The controller card utility shows that no patrol read is running and that the battery
backup is in good shape. Cache policy is WriteBack. I am aware of the bad-battery effect, but that doesn't seem to be the case here unless the controller is lying to me.
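
One way to double-check what the controller reports is to compare the "Default Cache Policy" and "Current Cache Policy" lines quoted below. This is only a rough sketch: it assumes the utility output has the same text format as those two lines, and it assumes that a controller which has disabled its write cache shows something other than WriteBack as the first flag of the current policy. The function names are made up for illustration.

```python
def parse_cache_policies(report: str) -> dict:
    """Map 'Default' / 'Current' to the list of policy flags found in the report."""
    policies = {}
    for line in report.splitlines():
        # Tolerate email quote markers ("> ", ">> ") in front of the line.
        line = line.lstrip("> ").strip()
        for kind in ("Default", "Current"):
            prefix = f"{kind} Cache Policy:"
            if line.startswith(prefix):
                flags = line[len(prefix):].split(",")
                policies[kind] = [flag.strip() for flag in flags]
    return policies


def write_cache_degraded(report: str) -> bool:
    """True if the default policy is WriteBack but the current one is not.

    Assumption: the first flag on each line is the write policy.
    """
    policies = parse_cache_policies(report)
    return (policies["Default"][0] == "WriteBack"
            and policies["Current"][0] != "WriteBack")
```

Running this periodically against saved utility output would show whether the write cache drops out around the time of the latency spikes.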
On 11/19/2018 2:39 PM, Brendan Moloney wrote:
>> RAID card for journal disks is Perc H730 (MegaRAID), RAID 1, battery-backed cache is on
>> Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
>> Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
>> I have 2 other nodes with older Perc H710 and similar SSDs with slightly higher wear (6.3% vs 5.18%), but from observation they hardly hit 1.5 ms, and only on rare occasions
>> Cache, RAID, and battery situation is the same.
> I would take a closer look at the RAID card. Are you sure the BBU is OK? In the past I noticed the MegaRAID cards would do periodic battery tests that completely drain the battery and thus disable the write cache until it reaches some threshold of charge again. They can also do periodic "patrol reads" and "consistency checks" that can hurt performance. Or the card could just be failing; I have gone through almost more RAID cards than HDDs. The unreliability and black-box nature of hardware RAID cards is one of the things that first got me looking into Ceph (although even mdadm is a big improvement in my opinion).
> For journals you are better off putting half your OSDs on one SSD and half on the other instead of RAID1.
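
The split suggested above could be planned along these lines. This is a minimal sketch that only computes an assignment of OSD ids to journal devices; the device paths and OSD ids are hypothetical, and it does not create or move any journals.

```python
def split_journals(osd_ids, journal_devices):
    """Assign OSD ids round-robin across the given journal devices."""
    layout = {dev: [] for dev in journal_devices}
    for index, osd in enumerate(sorted(osd_ids)):
        dev = journal_devices[index % len(journal_devices)]
        layout[dev].append(osd)
    return layout


# Hypothetical node with 8 OSDs and two journal SSDs.
layout = split_journals(range(8), ["/dev/sda", "/dev/sdb"])
for dev, osds in layout.items():
    print(dev, osds)
```

With this layout, losing one SSD takes down only the OSDs whose journals live on it, while each SSD absorbs roughly half the journal writes instead of all of them as in RAID 1.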