[ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time

Willem Jan Withagen wjw at digiware.nl
Mon Nov 6 05:48:30 PST 2017

On 6-11-2017 14:05, Chris Jones wrote:
> I'll document the resolution here for anyone else who experiences
> similar issues.
> We have determined the root cause of the long boot time was a
> combination of factors having to do with ZFS version and tuning, in
> combination with how long filenames are handled.
> ## 1 ## Insufficient ARC cache size. 
> Dramatically increasing the arc_max and arc_meta_limit allowed better
> performance once the cache had time to populate. Previously, each call
> to getxattr took about 8ms (0.008 sec). Multiplied by millions of
> getxattr calls during OSD daemon startup, this was taking hours. This
> only became apparent when we upgraded to Jewel. Hammer does not appear
> to parse all of the extended attributes during startup; this appears to
> have been introduced in Jewel as part of the sortbitwise algorithm.
> Increasing the arc_max and arc_meta_limit allowed more of the meta data
> to be cached in memory. This reduced getxattr call duration to between
> 10 and 100 microseconds (0.00001 to 0.0001 sec). On average that is
> around 400x faster.
> ## 2 ## ZFS version and inability to store large amounts of
> meta info in the inode/dnode.
> My understanding is that the ability to use a larger dnode size to store
> meta was not introduced until ZFS version 0.7.x. In version
> this was causing large quantities of meta data to be stored in
> inefficient spill blocks, which were taking longer to access since they
> were not cached due to (previously) undersized ARC settings.
> ## Summary ##
> Increasing ARC cache settings improved performance, but performance will
> still be a concern if the ARC is purged/flushed, such as during a system
> reboot, until the cache rebuilds itself.
> Upgrading to ZFS version 0.7.x is one potential upgrade path to utilize
> larger dnode size. Another upgrade path is to switch to XFS, which is
> the recommended filesystem for CEPH. XFS does not appear to require any
> kind of meta cache due to different handling of meta info in the inode.
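
To put the numbers above in perspective, here is a rough back-of-the-envelope
sketch in Python. The call count is an assumption (the report only says
"millions"), and 20 microseconds is simply what the quoted 400x average
implies for an 8 ms starting point; neither value is a measurement.

```python
# Rough startup-time estimate: number of getxattr calls times per-call latency.
# CALLS is an assumed figure; the thread only says "millions".

def startup_seconds(calls, latency_seconds):
    """Total time spent in getxattr if the calls run one after another."""
    return calls * latency_seconds

CALLS = 2_000_000        # assumption, for illustration only
COLD_LATENCY = 8e-3      # 8 ms per call, as reported with the small ARC
WARM_LATENCY = 20e-6     # 20 us per call, implied by the ~400x average

cold = startup_seconds(CALLS, COLD_LATENCY)
warm = startup_seconds(CALLS, WARM_LATENCY)

print(f"cold ARC: {cold / 3600:.1f} hours")
print(f"warm ARC: {warm:.0f} seconds")
print(f"speedup:  {cold / warm:.0f}x")
```

With these assumed inputs the cold case lands in the range of hours and the
warm case in well under a minute, which matches the behaviour described.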

Hi Chris,

Thanx for the feedback, glad to see I was not completely off track.

I'm sort of failing to see how XFS could be (extremely) much faster than
ZFS when accessing data for the first time, especially if you are
accessing millions of attributes. But then again you are running the
tests, so it is what it is. And ATM I'm not in a position to test
this in my cluster running FreeBSD/ZFS.
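
For anyone who does want to compare cold and warm attribute access on their
own filesystem, a minimal Python sketch could look like the one below. It
assumes Linux, since os.listxattr and os.getxattr are only available there,
and the function name and the limit parameter are made up for this example.

```python
import os
import time

def time_xattr_reads(root, limit=10000):
    """Return per-call latencies (seconds) for getxattr on files under root."""
    latencies = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                attr_names = os.listxattr(path)
            except OSError:
                continue  # unreadable file, or xattrs not supported here
            for attr in attr_names:
                start = time.perf_counter()
                try:
                    os.getxattr(path, attr)
                except OSError:
                    continue  # attribute vanished or is not readable
                latencies.append(time.perf_counter() - start)
                if len(latencies) >= limit:
                    return latencies  # stop early once enough samples exist
    return latencies
```

Calling time_xattr_reads() on an OSD data directory twice in a row and
comparing the averages gives a feel for the difference: the first pass
after a reboot reads from disk, the second is mostly served from cache.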

On FreeBSD I believe there is work on keeping the ARC on SSD hot over
cold reboots. So that would mean that you can have a preloaded cache
after a system reboot. But I have not really looked into this at all.
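
Whether the metadata cache has refilled after a reboot can be read from the
ARC counters. Below is a small Python sketch for ZFS on Linux of that era,
where /proc/spl/kstat/zfs/arcstats lists "name type data" rows including
c_max, arc_meta_used and arc_meta_limit (on FreeBSD the same counters sit
under the kstat.zfs.misc.arcstats sysctl tree). The sample text and its
numbers are invented for illustration, and the helper names are mine.

```python
def parse_arcstats(text):
    """Parse kstat-style 'name type data' rows into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly three fields with a numeric value;
        # this check also skips the two header lines.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats

def meta_fill_ratio(stats):
    """Fraction of the metadata limit in use, or None if the counters are missing."""
    used = stats.get("arc_meta_used")
    limit = stats.get("arc_meta_limit")
    if used is None or not limit:
        return None
    return used / limit

# Invented sample in the arcstats layout; not output from a real system.
SAMPLE = """\
6 1 0x01 91 4368 1000000 2000000
name                            type data
c_max                           4    34359738368
arc_meta_used                   4    2147483648
arc_meta_limit                  4    8589934592
"""

stats = parse_arcstats(SAMPLE)
ratio = meta_fill_ratio(stats)
print(f"c_max: {stats['c_max']} bytes, metadata fill: {ratio:.0%}")
```

On a live system the text would come from reading the arcstats file instead
of SAMPLE. A low fill ratio shortly after boot suggests the cache is still
cold.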

And then again it looks like ZFSonLinux is lagging a bit in features.

