[ceph-users] I/O stalls when doing fstrim on large RBD
moloney at ohsu.edu
Tue Nov 21 11:26:59 PST 2017
So I dug into this a bit. Apparently with XFS the fstrim command will ignore the provided "length" option once it hits a large contiguous block of free space and just keep going until it reaches a non-empty block. Most of my larger filesystems end up with an XFS allocation group size of 1TB, so the metadata at the start of the next allocation group ends up stopping the fstrim command at about the 1TB mark.
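For anyone wanting to reproduce the chunked approach, here is a minimal sketch of how the offset/length pairs could be generated. The helper names (plan_fstrim_chunks, fstrim_command) and the 2 TiB / 512 GiB figures in the example are illustrative assumptions, not part of any existing tool. It only builds the command lines and does not run them. Note that, per the behaviour described above, XFS may trim well past the requested length, so the chunk size is an upper bound on what was asked for, not on what actually gets trimmed.

```python
import shlex

GIB = 1024 ** 3


def plan_fstrim_chunks(fs_size_bytes, chunk_bytes):
    """Yield (offset, length) pairs covering the filesystem in fixed-size chunks."""
    offset = 0
    while offset < fs_size_bytes:
        # The last chunk may be shorter than chunk_bytes.
        length = min(chunk_bytes, fs_size_bytes - offset)
        yield offset, length
        offset += length


def fstrim_command(mount_point, offset, length):
    """Build the argv for one chunked fstrim call (the command is not executed here)."""
    return ["fstrim", "-v", "-o", str(offset), "-l", str(length), mount_point]


if __name__ == "__main__":
    # Hypothetical example: a 2 TiB filesystem trimmed in 512 GiB chunks.
    for offset, length in plan_fstrim_chunks(2048 * GIB, 512 * GIB):
        print(shlex.join(fstrim_command("/data/bulk", offset, length)))
```

A cron job could run these one at a time with a pause in between, although that only helps if each call really stays within its chunk.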
I did capture an fstrim call with blktrace and attached the results. I did this test on a smaller 2TB FS where the allocation groups are 512GB. I found an offset which hit a large contiguous block of empty space, so even though I only requested a length of 4GB it ended up trimming ~487GB.
# fstrim -v -o 549032275968 -l 4294967296 /data/bulk
/data/bulk: 487.3 GiB (523262476288 bytes) trimmed
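To put the numbers from that run side by side, a quick sketch of the arithmetic. The byte values are copied from the command and its output above; the variable and function names are just for illustration.

```python
GIB = 1024 ** 3

requested = 4294967296    # -l value passed to fstrim
trimmed = 523262476288    # bytes reported by fstrim -v


def to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB


print(f"requested: {to_gib(requested):.1f} GiB")
print(f"trimmed:   {to_gib(trimmed):.1f} GiB")
print(f"ratio:     {trimmed / requested:.1f}x the requested length")
```

So the single call discarded roughly two orders of magnitude more space than the length option asked for.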
Looking through the blktrace I see some CFQ-related entries, so maybe it is actually helping to reduce starvation for other processes?
These large fstrim runs can actually complete quite quickly (10-20 seconds for 1TB), but they can also be quite slow if the FS is busy (a few minutes).
I have heard that the ATA "trim" command can cause many problems because it is not "queueable". However, I understand that the SCSI "unmap" command does not have this shortcoming. Could the virtio-scsi driver and/or librbd be handling these better?
Thanks for the help!
From: Jason Dillaman [jdillama at redhat.com]
Sent: Saturday, November 18, 2017 5:08 AM
To: Brendan Moloney
Cc: ceph-users at lists.ceph.com
Subject: Re: [ceph-users] I/O stalls when doing fstrim on large RBD
Can you capture a blktrace while performing fstrim to record the discard
operations? A 1TB trim extent would cause a huge impact since it would
translate to approximately 262K IO requests to the OSDs (assuming 4MB
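The 262K figure can be checked with a short sketch. It assumes a 4 MiB object size and an extent aligned to object boundaries; the function name is hypothetical, and the real number of requests depends on how the discard is actually split on its way to the OSDs.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def object_requests(extent_bytes, object_size_bytes=4 * MIB):
    """Count the object-sized pieces an aligned discard extent spans (ceiling division)."""
    return -(-extent_bytes // object_size_bytes)


# A 1 TiB discard extent with 4 MiB objects.
print(object_requests(1 * TIB))
```

Under the same assumptions, the ~487 GiB trim reported earlier in the thread would cover on the order of 125K objects.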
On Fri, Nov 17, 2017 at 6:19 PM, Brendan Moloney <moloney at ohsu.edu> wrote:
> I guess this isn't strictly about Ceph, but I feel like other folks here
> must have run into the same issues.
> I am trying to keep my thinly provisioned RBD volumes thin. I use
> virtio-scsi to attach the RBD volumes to my VMs with the "discard=unmap"
> option. The RBD is formatted as XFS and some of them can be quite large
> (16TB+). I have a cron job that runs "fstrim" commands twice a week in the
> The issue is that I see massive I/O stalls on the VM during the fstrim. To
> the point where I am getting kernel panics from hung tasks and other
> timeouts. I have tried a number of things to lessen the impact:
> - Switching from deadline to CFQ (initially I thought this helped, but
> now I am not convinced)
> - Running fstrim with "ionice -c idle" (this doesn't seem to make a
> - Chunking the fstrim with the offset/length options (helps reduce worst
> case, but I can't trim less than 1TB at a time and that can still cause a
> pause for several minutes)
> Is there anything else I can do to avoid this issue?
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 52981 bytes