[ceph-users] NUMA and ceph

Mark Nelson mark.nelson at inktank.com
Thu Dec 12 20:01:22 PST 2013

On 12/12/2013 04:30 PM, Kyle Bader wrote:
> It seems that NUMA can be problematic for ceph-osd daemons in certain
> circumstances. Namely, if one NUMA zone is running low on memory due to
> uneven allocation, that zone can enter reclaim mode when threads/processes
> scheduled on a core in that zone request memory allocations greater than
> the zone's remaining memory. In order to satisfy those allocations, the
> kernel needs to page out some of the contents of the contended zone,
> which can have dramatic performance implications due to cache misses,
> etc. I see two ways an operator could alleviate these issues:
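For anyone wanting to check whether a box is actually hitting this, a rough sketch of how to inspect zone reclaim and per-node allocation on Linux (assumes the numactl package is installed; output will vary by platform):

```shell
# Non-zero means a zone that runs low will reclaim/page locally
# before falling back to allocating from a remote node.
cat /proc/sys/vm/zone_reclaim_mode

# Per-node allocation counters; large numa_miss / numa_foreign values
# or heavily skewed per-node totals suggest uneven allocation.
numastat

# Per-node CPU layout and free/total memory breakdown.
numactl --hardware
```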

Yes, quite possibly I think, though I'd be curious to see what impact 
testing would show on modern dual-socket Intel boxes.  I suspect this 
could especially be an issue on quad-socket AMD boxes, particularly 
the Magny-Cours era.

> Set the vm.zone_reclaim_mode sysctl to 0, and prefix the ceph-osd
> daemons with "numactl --interleave=all". This should probably be
> activated by a flag in /etc/default/ceph and modifying the
> ceph-osd.conf upstart script, along with adding a dependency on the
> "numactl" package to the ceph package's debian/rules file.
> The alternative is to use a cgroup for each ceph-osd daemon, pinning
> each one to cores in the same NUMA zone using cpuset.cpus and
> cpuset.mems. This would probably also live in /etc/default/ceph and
> the upstart scripts.
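The two options above might look roughly like this by hand (the OSD id, cgroup name, core range, and node number are all illustrative, and this assumes a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset):

```shell
# Option 1: disable zone reclaim globally and interleave each OSD's
# allocations across all NUMA nodes.
sysctl -w vm.zone_reclaim_mode=0
numactl --interleave=all /usr/bin/ceph-osd -i 0

# Option 2: confine an OSD to the cores and memory of a single node
# via a cpuset cgroup ("osd0", cores 0-5, node 0 here are examples).
mkdir /sys/fs/cgroup/cpuset/osd0
echo 0-5 > /sys/fs/cgroup/cpuset/osd0/cpuset.cpus
echo 0   > /sys/fs/cgroup/cpuset/osd0/cpuset.mems
echo $OSD_PID > /sys/fs/cgroup/cpuset/osd0/tasks
```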

Seems reasonable unless we are testing OSDs that we (eventually?) want 
to utilize cores on multiple sockets.  If possible, pinning the OSD 
to whatever CPU has the associated PCIe bus and NIC would be ideal, 
though there's no really good automated way to do that yet, AFAIK.
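Until something automated exists, that PCIe/NIC locality can at least be discovered and applied by hand; a sketch (the device names, PCI address, and node number are illustrative):

```shell
# Ask sysfs which NUMA node a device hangs off of; -1 means the
# platform doesn't report locality for that device.
cat /sys/class/net/eth0/device/numa_node
cat /sys/bus/pci/devices/0000:01:00.0/numa_node

# Then start the OSD bound to that node's CPUs and memory
# (node 0 here, matching the sysfs output above).
numactl --cpunodebind=0 --membind=0 /usr/bin/ceph-osd -i 0
```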
