[ceph-users] Ceph cache pool full

Gregory Farnum gfarnum at redhat.com
Mon Oct 9 09:00:44 PDT 2017


On Fri, Oct 6, 2017 at 2:22 PM Shawfeng Dong <shaw at ucsc.edu> wrote:

> Here is a quick update. I found that a CephFS client process was accessing
> the big 1TB file, which I think had a lock on the file, preventing the
> flushing of objects to the underlying data pool. Once I killed that
> process, objects started to flush to the data pool automatically (with
> target_max_bytes & target_max_objects set); and I can force the flushing
> with 'rados -p cephfs_cache cache-flush-evict-all' as well. So David
> appears to be right in saying that "it can only hold full files and not
> flush partial files". This will be problematic if we want to transfer a
> file that is bigger in size than the cache pool!
>

Hmm. I can say that there is definitely no explicit locking of cache object
files by the filesystem; no such mechanism exists.
I also can't think of any activity that would be going on keeping the
objects active in cache.

However, if you had a CephFS client actively reading or writing from the
file, any objects it was looking at would certainly be kept in the cache; I
think there's a minimum time since last activity to prevent RADOS from
flushing out stuff that's in use. If that was the issue, you just have hot
data sets bigger than your cache size. And we know our cache tiering system
doesn't work in those cases.
-Greg
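
A rough way to sanity-check the "hot data set bigger than the cache" case is to compare the size of the actively used data with the point at which the tier starts evicting. This is only a minimal sketch. It assumes eviction begins at roughly target_max_bytes x cache_target_full_ratio, and it uses the values quoted further down in this thread (1099511627776 bytes and 0.8). The helper name fits_in_cache and the decimal reading of "1.3TB" are illustrative assumptions, not part of Ceph.

```shell
#!/bin/sh
# fits_in_cache HOT_BYTES TARGET_MAX_BYTES FULL_RATIO_PERCENT
# Prints "fits" if the hot data set is no larger than the assumed eviction
# threshold (target_max_bytes * full ratio), otherwise "too big".
# Hypothetical helper for illustration; it does not talk to a cluster.
fits_in_cache() {
    hot_bytes=$1
    target_max_bytes=$2
    full_ratio_percent=$3
    # Integer arithmetic: the ratio is passed as a percentage (0.8 -> 80).
    threshold=$(( target_max_bytes * full_ratio_percent / 100 ))
    if [ "$hot_bytes" -le "$threshold" ]; then
        echo "fits"
    else
        echo "too big"
    fi
}

# A 1.3 TB file (assumed decimal, 1.3 * 10^12 bytes) against a 1 TiB target.
fits_in_cache 1300000000000 1099511627776 80
```

The pool settings quoted below also show cache_min_flush_age and cache_min_evict_age at 0. Those look like the options closest to a "minimum time since last activity", but that mapping is a guess rather than something confirmed in this thread.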


>
> We did this whole scheme (EC data pool plus NVMe cache tier) just for
> experimentation. I've learned a lot from the experiment and from you guys.
> Thank you very much!
>
> For production, I think I'll simply use a replicated pool for data on the
> HDDs (with bluestore WAL and DB on the 1st NVMe), and a replicated pool for
> metadata on the 2nd NVMe.  Please let me know if you have any further
> advice / suggestion.
>
> Best,
> Shaw
>
>
>
> On Fri, Oct 6, 2017 at 10:07 AM, David Turner <drakonstein at gmail.com>
> wrote:
>
>> All of this data is test data, yeah?  I would start by removing the
>> cache-tier and pool, recreate it and attach it, configure all of the
>> settings including the maximums, and start testing things again.  I would
>> avoid doing the 1.3TB file test until after you've confirmed that the
>> smaller files are being flushed appropriately to the data pool (manually
>> flushing/evicting it) and then scale up your testing to the larger files.
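
The detach/re-attach sequence suggested above could look roughly like the following. This is a sketch that only prints the commands so they can be reviewed first. The subcommands are the ones I recall from the cache-tiering documentation. The function name print_tier_rebuild is made up for illustration, and deleting or recreating the pool itself is deliberately left out. Some releases ask for an extra confirmation flag or use a different mode name for the drain step, so check the docs for your version.

```shell
#!/bin/sh
# print_tier_rebuild DATA_POOL CACHE_POOL MAX_BYTES MAX_OBJECTS
# Prints (does not run) the commands for detaching a writeback cache tier
# and attaching it again with the limits set. Hypothetical helper.
print_tier_rebuild() {
    data_pool=$1
    cache_pool=$2
    max_bytes=$3
    max_objects=$4
    cat <<EOF
# 1. Stop new writes landing in the cache, then drain it.
#    (The mode used for draining differs between releases; check the docs.)
ceph osd tier cache-mode $cache_pool forward
rados -p $cache_pool cache-flush-evict-all
# 2. Detach the tier.
ceph osd tier remove-overlay $data_pool
ceph osd tier remove $data_pool $cache_pool
# 3. Re-attach it and set the limits before testing again.
ceph osd tier add $data_pool $cache_pool
ceph osd tier cache-mode $cache_pool writeback
ceph osd tier set-overlay $data_pool $cache_pool
ceph osd pool set $cache_pool hit_set_type bloom
ceph osd pool set $cache_pool target_max_bytes $max_bytes
ceph osd pool set $cache_pool target_max_objects $max_objects
EOF
}

# Pool names and limits as used elsewhere in this thread.
print_tier_rebuild cephfs_data cephfs_cache 1099511627776 1000000
```

Printing instead of executing keeps the sketch safe to try on a machine without a cluster. Piping the output to a shell is a separate, deliberate step.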
>> On Fri, Oct 6, 2017 at 12:54 PM Shawfeng Dong <shaw at ucsc.edu> wrote:
>>
>>> Curiously, it has been quite a while, but there is still no object in
>>> the underlying data pool:
>>> # rados -p cephfs_data ls
>>>
>>> Any advice?
>>>
>>> On Fri, Oct 6, 2017 at 9:45 AM, David Turner <drakonstein at gmail.com>
>>> wrote:
>>>
>>>> Notice in the URL for the documentation the use of "luminous".  When
>>>> you looked a few weeks ago, you might have been looking at the
>>>> documentation for a different version of Ceph.  You can change that to
>>>> jewel, hammer, kraken, master, etc depending on which version of Ceph you
>>>> are running or reading about.  Google gets confused and will pull up random
>>>> versions of the ceph documentation for a page. It's on us to make sure that
>>>> the url is pointing to the version of Ceph that we are using.
>>>>
>>>> While it's sitting there in the flush command, can you see if there are
>>>> any objects in the underlying data pool?  Hopefully the count will be
>>>> growing.
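
One way to watch whether the count is growing is to sample `rados -p <pool> ls` a few times and print the number of lines. This is a minimal sketch. The function names and the RADOS_BIN override are hypothetical and exist only so the loop can be tried without a cluster. Listing a very large pool this way can be slow.

```shell
#!/bin/sh
# count_objects POOL: number of object names that `rados ls` prints for POOL.
# RADOS_BIN is a hypothetical override; by default the real rados CLI is used.
count_objects() {
    "${RADOS_BIN:-rados}" -p "$1" ls | wc -l | tr -d ' '
}

# watch_objects POOL INTERVAL_SECONDS SAMPLES
# Prints one timestamped object count per sample.
watch_objects() {
    pool=$1
    interval=$2
    samples=$3
    i=0
    while [ "$i" -lt "$samples" ]; do
        echo "$(date +%T) $pool: $(count_objects "$pool") objects"
        i=$(( i + 1 ))
        # No need to sleep after the final sample.
        if [ "$i" -lt "$samples" ]; then
            sleep "$interval"
        fi
    done
}

# Example (needs a reachable cluster):
#   watch_objects cephfs_data 10 6
```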
>>>>
>>>> On Fri, Oct 6, 2017 at 12:39 PM Shawfeng Dong <shaw at ucsc.edu> wrote:
>>>>
>>>>> Hi Christian,
>>>>>
>>>>> I set those via CLI:
>>>>> # ceph osd pool set cephfs_cache target_max_bytes 1099511627776
>>>>> # ceph osd pool set cephfs_cache target_max_objects 1000000
>>>>>
>>>>> but manual flushing doesn't appear to work:
>>>>> # rados -p cephfs_cache cache-flush-evict-all
>>>>>         1000000046a.00000ca6
>>>>>
>>>>> it just gets stuck there for a long time.
>>>>>
>>>>> Any suggestion? Do I need to restart the daemons or reboot the nodes?
>>>>>
>>>>> Thanks,
>>>>> Shaw
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 6, 2017 at 9:31 AM, Christian Balzer <chibi at gol.com>
>>>>> wrote:
>>>>>
>>>>>> On Fri, 6 Oct 2017 09:14:40 -0700 Shawfeng Dong wrote:
>>>>>>
>>>>>> > I found the command: rados -p cephfs_cache cache-flush-evict-all
>>>>>> >
>>>>>> That's not what you want/need.
>>>>>> Though it will fix your current "full" issue.
>>>>>>
>>>>>> > The documentation (
>>>>>> > http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/)
>>>>>> has
>>>>>> > been improved a lot since I last checked it a few weeks ago!
>>>>>> >
>>>>>> The need to set max_bytes and max_objects has been documented for ages
>>>>>> (since Hammer).
>>>>>>
>>>>>> more below...
>>>>>>
>>>>>> > -Shaw
>>>>>> >
>>>>>> > On Fri, Oct 6, 2017 at 9:10 AM, Shawfeng Dong <shaw at ucsc.edu>
>>>>>> wrote:
>>>>>> >
>>>>>> > > Thanks, Luis.
>>>>>> > >
>>>>>> > > I've just set max_bytes and max_objects:
>>>>>> How?
>>>>>> Editing the conf file won't help until a restart.
>>>>>>
>>>>>> > > target_max_objects: 1000000 (1M)
>>>>>> > > target_max_bytes: 1099511627776 (1TB)
>>>>>> >
>>>>>> I'd lower that or the cache_target_full_ratio by another 10%.
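
To put numbers on that suggestion, here is a small worked example. It assumes "another 10%" means 90% of the current target_max_bytes, and that flushing and eviction start at roughly target_max_bytes times cache_target_dirty_ratio (0.4) and cache_target_full_ratio (0.8), the ratios shown in the pool settings in this thread. The helper scale_percent is made up for illustration.

```shell
#!/bin/sh
# scale_percent VALUE PERCENT: VALUE * PERCENT / 100 in integer arithmetic.
# Hypothetical helper; the result is rounded down.
scale_percent() {
    echo $(( $1 * $2 / 100 ))
}

target_max_bytes=1099511627776   # 1 TiB (2^40 bytes), as set in this thread
lowered=$(scale_percent "$target_max_bytes" 90)

echo "lowered target_max_bytes: $lowered"
echo "flushing starts near: $(scale_percent "$lowered" 40) dirty bytes"
echo "eviction starts near: $(scale_percent "$lowered" 80) bytes used"

# The lowered value would then be applied with something like:
#   ceph osd pool set cephfs_cache target_max_bytes "$lowered"
```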
>>>>>>
>>>>>> Christian
>>>>>> > >
>>>>>> > > but nothing appears to be happening. Is there a way to force
>>>>>> flushing?
>>>>>> > >
>>>>>> > > Thanks,
>>>>>> > > Shaw
>>>>>> > >
>>>>>> > > On Fri, Oct 6, 2017 at 8:55 AM, Luis Periquito <
>>>>>> periquito at gmail.com>
>>>>>> > > wrote:
>>>>>> > >
>>>>>> > >> Not looking at anything else, you didn't set the max_bytes or
>>>>>> > >> max_objects for it to start flushing...
>>>>>> > >>
>>>>>> > >> On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <shaw at ucsc.edu>
>>>>>> wrote:
>>>>>> > >> > Dear all,
>>>>>> > >> >
>>>>>> > >> > Thanks a lot for the very insightful comments/suggestions!
>>>>>> > >> >
>>>>>> > >> > There are 3 OSD servers in our pilot Ceph cluster, each with
>>>>>> 2x 1TB SSDs
>>>>>> > >> > (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use
>>>>>> the
>>>>>> > >> bluestore
>>>>>> > >> > backend, with the first NVMe as the WAL and DB devices for
>>>>>> OSDs on the
>>>>>> > >> HDDs.
>>>>>> > >> > And we try to create a cache tier out of the second NVMes.
>>>>>> > >> >
>>>>>> > >> > Here are the outputs of the commands suggested by David:
>>>>>> > >> >
>>>>>> > >> > 1) # ceph df
>>>>>> > >> > GLOBAL:
>>>>>> > >> >     SIZE     AVAIL     RAW USED     %RAW USED
>>>>>> > >> >     265T      262T        2847G          1.05
>>>>>> > >> > POOLS:
>>>>>> > >> >     NAME                ID     USED      %USED      MAX AVAIL     OBJECTS
>>>>>> > >> >     cephfs_data         1          0          0          248T           0
>>>>>> > >> >     cephfs_metadata     2      8515k          0          248T          24
>>>>>> > >> >     cephfs_cache        3      1381G     100.00             0      355385
>>>>>> > >> >
>>>>>> > >> > 2) # ceph osd df
>>>>>> > >> >  0   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 174
>>>>>> > >> >  1   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 169
>>>>>> > >> >  2   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
>>>>>> > >> >  3   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 159
>>>>>> > >> >  4   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
>>>>>> > >> >  5   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 162
>>>>>> > >> >  6   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 149
>>>>>> > >> >  7   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 179
>>>>>> > >> >  8   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 163
>>>>>> > >> >  9   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 194
>>>>>> > >> > 10   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 185
>>>>>> > >> > 11   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 168
>>>>>> > >> > 36  nvme 1.09149  1.00000 1117G  855G   262G 76.53 73.01  79
>>>>>> > >> > 12   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 180
>>>>>> > >> > 13   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 168
>>>>>> > >> > 14   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 178
>>>>>> > >> > 15   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 170
>>>>>> > >> > 16   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 149
>>>>>> > >> > 17   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 203
>>>>>> > >> > 18   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
>>>>>> > >> > 19   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 158
>>>>>> > >> > 20   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 154
>>>>>> > >> > 21   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 160
>>>>>> > >> > 22   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 167
>>>>>> > >> > 23   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 188
>>>>>> > >> > 37  nvme 1.09149  1.00000 1117G 1061G 57214M 95.00 90.63  98
>>>>>> > >> > 24   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 187
>>>>>> > >> > 25   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 200
>>>>>> > >> > 26   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 147
>>>>>> > >> > 27   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 171
>>>>>> > >> > 28   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 162
>>>>>> > >> > 29   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 152
>>>>>> > >> > 30   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 174
>>>>>> > >> > 31   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 176
>>>>>> > >> > 32   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 182
>>>>>> > >> > 33   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 155
>>>>>> > >> > 34   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 166
>>>>>> > >> > 35   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 176
>>>>>> > >> > 38  nvme 1.09149  1.00000 1117G  857G   260G 76.71 73.18  79
>>>>>> > >> >                     TOTAL  265T 2847G   262T  1.05
>>>>>> > >> > MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
>>>>>> > >> >
>>>>>> > >> > 3) # ceph osd tree
>>>>>> > >> > -1       265.29291 root default
>>>>>> > >> > -3        88.43097     host pulpo-osd01
>>>>>> > >> >  0   hdd   7.27829         osd.0            up  1.00000 1.00000
>>>>>> > >> >  1   hdd   7.27829         osd.1            up  1.00000 1.00000
>>>>>> > >> >  2   hdd   7.27829         osd.2            up  1.00000 1.00000
>>>>>> > >> >  3   hdd   7.27829         osd.3            up  1.00000 1.00000
>>>>>> > >> >  4   hdd   7.27829         osd.4            up  1.00000 1.00000
>>>>>> > >> >  5   hdd   7.27829         osd.5            up  1.00000 1.00000
>>>>>> > >> >  6   hdd   7.27829         osd.6            up  1.00000 1.00000
>>>>>> > >> >  7   hdd   7.27829         osd.7            up  1.00000 1.00000
>>>>>> > >> >  8   hdd   7.27829         osd.8            up  1.00000 1.00000
>>>>>> > >> >  9   hdd   7.27829         osd.9            up  1.00000 1.00000
>>>>>> > >> > 10   hdd   7.27829         osd.10           up  1.00000 1.00000
>>>>>> > >> > 11   hdd   7.27829         osd.11           up  1.00000 1.00000
>>>>>> > >> > 36  nvme   1.09149         osd.36           up  1.00000 1.00000
>>>>>> > >> > -5        88.43097     host pulpo-osd02
>>>>>> > >> > 12   hdd   7.27829         osd.12           up  1.00000 1.00000
>>>>>> > >> > 13   hdd   7.27829         osd.13           up  1.00000 1.00000
>>>>>> > >> > 14   hdd   7.27829         osd.14           up  1.00000 1.00000
>>>>>> > >> > 15   hdd   7.27829         osd.15           up  1.00000 1.00000
>>>>>> > >> > 16   hdd   7.27829         osd.16           up  1.00000 1.00000
>>>>>> > >> > 17   hdd   7.27829         osd.17           up  1.00000 1.00000
>>>>>> > >> > 18   hdd   7.27829         osd.18           up  1.00000 1.00000
>>>>>> > >> > 19   hdd   7.27829         osd.19           up  1.00000 1.00000
>>>>>> > >> > 20   hdd   7.27829         osd.20           up  1.00000 1.00000
>>>>>> > >> > 21   hdd   7.27829         osd.21           up  1.00000 1.00000
>>>>>> > >> > 22   hdd   7.27829         osd.22           up  1.00000 1.00000
>>>>>> > >> > 23   hdd   7.27829         osd.23           up  1.00000 1.00000
>>>>>> > >> > 37  nvme   1.09149         osd.37           up  1.00000 1.00000
>>>>>> > >> > -7        88.43097     host pulpo-osd03
>>>>>> > >> > 24   hdd   7.27829         osd.24           up  1.00000 1.00000
>>>>>> > >> > 25   hdd   7.27829         osd.25           up  1.00000 1.00000
>>>>>> > >> > 26   hdd   7.27829         osd.26           up  1.00000 1.00000
>>>>>> > >> > 27   hdd   7.27829         osd.27           up  1.00000 1.00000
>>>>>> > >> > 28   hdd   7.27829         osd.28           up  1.00000 1.00000
>>>>>> > >> > 29   hdd   7.27829         osd.29           up  1.00000 1.00000
>>>>>> > >> > 30   hdd   7.27829         osd.30           up  1.00000 1.00000
>>>>>> > >> > 31   hdd   7.27829         osd.31           up  1.00000 1.00000
>>>>>> > >> > 32   hdd   7.27829         osd.32           up  1.00000 1.00000
>>>>>> > >> > 33   hdd   7.27829         osd.33           up  1.00000 1.00000
>>>>>> > >> > 34   hdd   7.27829         osd.34           up  1.00000 1.00000
>>>>>> > >> > 35   hdd   7.27829         osd.35           up  1.00000 1.00000
>>>>>> > >> > 38  nvme   1.09149         osd.38           up  1.00000 1.00000
>>>>>> > >> >
>>>>>> > >> > 4) # ceph osd pool get cephfs_cache all
>>>>>> > >> > min_size: 2
>>>>>> > >> > crash_replay_interval: 0
>>>>>> > >> > pg_num: 128
>>>>>> > >> > pgp_num: 128
>>>>>> > >> > crush_rule: pulpo_nvme
>>>>>> > >> > hashpspool: true
>>>>>> > >> > nodelete: false
>>>>>> > >> > nopgchange: false
>>>>>> > >> > nosizechange: false
>>>>>> > >> > write_fadvise_dontneed: false
>>>>>> > >> > noscrub: false
>>>>>> > >> > nodeep-scrub: false
>>>>>> > >> > hit_set_type: bloom
>>>>>> > >> > hit_set_period: 14400
>>>>>> > >> > hit_set_count: 12
>>>>>> > >> > hit_set_fpp: 0.05
>>>>>> > >> > use_gmt_hitset: 1
>>>>>> > >> > auid: 0
>>>>>> > >> > target_max_objects: 0
>>>>>> > >> > target_max_bytes: 0
>>>>>> > >> > cache_target_dirty_ratio: 0.4
>>>>>> > >> > cache_target_dirty_high_ratio: 0.6
>>>>>> > >> > cache_target_full_ratio: 0.8
>>>>>> > >> > cache_min_flush_age: 0
>>>>>> > >> > cache_min_evict_age: 0
>>>>>> > >> > min_read_recency_for_promote: 0
>>>>>> > >> > min_write_recency_for_promote: 0
>>>>>> > >> > fast_read: 0
>>>>>> > >> > hit_set_grade_decay_rate: 0
>>>>>> > >> >
>>>>>> > >> > Do you see anything wrong? We had written some small files to
>>>>>> the CephFS
>>>>>> > >> > before we tried to write the big 1TB file. What is puzzling to
>>>>>> me is
>>>>>> > >> that no
>>>>>> > >> > data has been written back to the data pool.
>>>>>> > >> >
>>>>>> > >> > Best,
>>>>>> > >> > Shaw
>>>>>> > >> >
>>>>>> > >> > On Fri, Oct 6, 2017 at 6:46 AM, David Turner <
>>>>>> drakonstein at gmail.com>
>>>>>> > >> wrote:
>>>>>> > >> >>
>>>>>> > >> >>
>>>>>> > >> >>
>>>>>> > >> >> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi at gol.com>
>>>>>> wrote:
>>>>>> > >> >>>
>>>>>> > >> >>>
>>>>>> > >> >>> Hello,
>>>>>> > >> >>>
>>>>>> > >> >>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
>>>>>> > >> >>>
>>>>>> > >> >>> > You're missing most all of the important bits. What the
>>>>>> osds in your
>>>>>> > >> >>> > cluster look like, your tree, and your cache pool settings.
>>>>>> > >> >>> >
>>>>>> > >> >>> > ceph df
>>>>>> > >> >>> > ceph osd df
>>>>>> > >> >>> > ceph osd tree
>>>>>> > >> >>> > ceph osd pool get cephfs_cache all
>>>>>> > >> >>> >
>>>>>> > >> >>> Especially the last one.
>>>>>> > >> >>>
>>>>>> > >> >>> My money is on not having set target_max_objects and
>>>>>> target_max_bytes
>>>>>> > >> to
>>>>>> > >> >>> sensible values along with the ratios.
>>>>>> > >> >>> In short, not having read the (albeit spotty) documentation.
>>>>>> > >> >>>
>>>>>> > >> >>> > You have your writeback cache on 3 nvme drives. It looks
>>>>>> like you
>>>>>> > >> have
>>>>>> > >> >>> > 1.6TB available between them for the cache. I don't know
>>>>>> the
>>>>>> > >> behavior
>>>>>> > >> >>> > of a
>>>>>> > >> >>> > writeback cache tier on cephfs for large files, but I
>>>>>> would guess
>>>>>> > >> that
>>>>>> > >> >>> > it
>>>>>> > >> >>> > can only hold full files and not flush partial files.
>>>>>> > >> >>>
>>>>>> > >> >>> I VERY much doubt that, if so it would be a massive flaw.
>>>>>> > >> >>> One assumes that cache operations work on the RADOS object
>>>>>> level, no
>>>>>> > >> >>> matter what.
>>>>>> > >> >>
>>>>>> > >> >> I hope that it is on the rados level, but not a single object
>>>>>> had been
>>>>>> > >> >> flushed to the backing pool. So I hazarded a guess. Seeing his
>>>>>> > >> settings will
>>>>>> > >> >> shed more light.
>>>>>> > >> >>>
>>>>>> > >> >>>
>>>>>> > >> >>> > That would mean your
>>>>>> > >> >>> > cache needs to have enough space for any file being
>>>>>> written to the
>>>>>> > >> >>> > cluster.
>>>>>> > >> >>> > In this case a 1.3TB file with 3x replication would
>>>>>> require 3.9TB
>>>>>> > >> (more
>>>>>> > >> >>> > than double what you have available) of available space in
>>>>>> your
>>>>>> > >> >>> > writeback
>>>>>> > >> >>> > cache.
>>>>>> > >> >>> >
>>>>>> > >> >>> > There are very few use cases that benefit from a cache
>>>>>> tier. The
>>>>>> > >> docs
>>>>>> > >> >>> > for
>>>>>> > >> >>> > Luminous warn as much.
>>>>>> > >> >>> You keep repeating that like a broken record.
>>>>>> > >> >>>
>>>>>> > >> >>> And while certainly not false I for one wouldn't be able to
>>>>>> use
>>>>>> > >> (justify
>>>>>> > >> >>> using) Ceph w/o cache tiers in our main use case.
>>>>>> > >> >>>
>>>>>> > >> >>>
>>>>>> > >> >>> In this case I assume they were following an old cheat
>>>>>> sheet or such,
>>>>>> > >> >>> suggesting the previously required cache tier with EC pools.
>>>>>> > >> >>
>>>>>> > >> >>
>>>>>> > >> >>
>>>>>> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
>>>>>> > >> >>
>>>>>> > >> >> I know I keep repeating it, especially recently as there have
>>>>>> been a
>>>>>> > >> lot
>>>>>> > >> >> of people asking about it. The Luminous docs added a large
>>>>>> section
>>>>>> > >> about how
>>>>>> > >> >> it is probably not what you want. Like me, it is not saying
>>>>>> that there
>>>>>> > >> are
>>>>>> > >> >> no use cases for it. There was no information provided about
>>>>>> the use
>>>>>> > >> case
>>>>>> > >> >> and I made some suggestions/guesses. I'm also guessing that
>>>>>> they are
>>>>>> > >> >> following a guide where a writeback cache was necessary for
>>>>>> CephFS to
>>>>>> > >> use EC
>>>>>> > >> >> prior to Luminous. I also usually add that people should test
>>>>>> it out
>>>>>> > >> and
>>>>>> > >> >> find what works best for them. I will always defer to your
>>>>>> practical
>>>>>> > >> use of
>>>>>> > >> >> cache tiers as well, especially when using rbds.
>>>>>> > >> >>
>>>>>> > >> >> I manage a cluster that I intend to continue running a
>>>>>> writeback cache
>>>>>> > >> in
>>>>>> > >> >> front of CephFS on the same drives as the EC pool. The use
>>>>>> case
>>>>>> > >> receives a
>>>>>> > >> >> good enough benefit from the cache tier that it isn't even
>>>>>> required to
>>>>>> > >> use
>>>>>> > >> >> flash media to see it. It is used for video editing and the
>>>>>> files are
>>>>>> > >> >> usually modified and read within the first 24 hours and then
>>>>>> left in
>>>>>> > >> cold
>>>>>> > >> >> storage until deleted. I have the cache timed to keep
>>>>>> everything in it
>>>>>> > >> for
>>>>>> > >> >> 24 hours and then evict it by using a minimum time to flush
>>>>>> and evict
>>>>>> > >> at 24
>>>>>> > >> >> hours and a target max bytes of 0. All files are in there for
>>>>>> that
>>>>>> > >> time and
>>>>>> > >> >> then it never has to decide what to keep as it doesn't keep
>>>>>> anything
>>>>>> > >> longer
>>>>>> > >> >> than that. Luckily read performance from cold storage is not a
>>>>>> > >> requirement
>>>>>> > >> >> of this cluster as any read operation has to first read it
>>>>>> from EC
>>>>>> > >> storage,
>>>>>> > >> >> write it to replica storage, and then read it from replica
>>>>>> storage...
>>>>>> > >> Yuck.
>>>>>> > >> >>>
>>>>>> > >> >>>
>>>>>> > >> >>> Christian
>>>>>> > >> >>>
>>>>>> > >> >>> >What is your goal by implementing this cache? If the
>>>>>> > >> >>> > answer is to utilize extra space on the nvmes, then just
>>>>>> remove it
>>>>>> > >> and
>>>>>> > >> >>> > say
>>>>>> > >> >>> > thank you. The better use of nvmes in that case is as a
>>>>>> part of the
>>>>>> > >> >>> > bluestore stack and give your osds larger DB partitions.
>>>>>> Keeping
>>>>>> > >> your
>>>>>> > >> >>> > metadata pool on nvmes is still a good idea.
>>>>>> > >> >>> >
>>>>>> > >> >>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw at ucsc.edu>
>>>>>> wrote:
>>>>>> > >> >>> >
>>>>>> > >> >>> > > Dear all,
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > We just set up a Ceph cluster, running the latest stable
>>>>>> release
>>>>>> > >> Ceph
>>>>>> > >> >>> > > v12.2.0 (Luminous):
>>>>>> > >> >>> > > # ceph --version
>>>>>> > >> >>> > > ceph version 12.2.0
>>>>>> (32ce2a3ae5239ee33d6150705cdb24d43bab910c)
>>>>>> > >> >>> > > luminous
>>>>>> > >> >>> > > (rc)
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > The goal is to serve Ceph filesystem, for which we
>>>>>> created 3
>>>>>> > >> pools:
>>>>>> > >> >>> > > # ceph osd lspools
>>>>>> > >> >>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
>>>>>> > >> >>> > > where
>>>>>> > >> >>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which
>>>>>> is
>>>>>> > >> >>> > > erasure-coded;
>>>>>> > >> >>> > > * cephfs_metadata is the metadata pool
>>>>>> > >> >>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for
>>>>>> > >> cephfs_data.
>>>>>> > >> >>> > > The
>>>>>> > >> >>> > > cache-mode is writeback.
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > Everything had worked fine, until today when we tried to
>>>>>> copy a
>>>>>> > >> 1.3TB
>>>>>> > >> >>> > > file
>>>>>> > >> >>> > > to the CephFS.  We got the "No space left on device"
>>>>>> error!
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > 'ceph -s' says some OSDs are full:
>>>>>> > >> >>> > > # ceph -s
>>>>>> > >> >>> > >   cluster:
>>>>>> > >> >>> > >     id:     e18516bf-39cb-4670-9f13-88ccb7d19769
>>>>>> > >> >>> > >     health: HEALTH_ERR
>>>>>> > >> >>> > >             full flag(s) set
>>>>>> > >> >>> > >             1 full osd(s)
>>>>>> > >> >>> > >             1 pools have many more objects per pg than
>>>>>> average
>>>>>> > >> >>> > >
>>>>>> > >> >>> > >   services:
>>>>>> > >> >>> > >     mon: 3 daemons, quorum
>>>>>> pulpo-admin,pulpo-mon01,pulpo-mds01
>>>>>> > >> >>> > >     mgr: pulpo-mds01(active), standbys: pulpo-admin,
>>>>>> pulpo-mon01
>>>>>> > >> >>> > >     mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
>>>>>> > >> >>> > >     osd: 39 osds: 39 up, 39 in
>>>>>> > >> >>> > >          flags full
>>>>>> > >> >>> > >
>>>>>> > >> >>> > >   data:
>>>>>> > >> >>> > >     pools:   3 pools, 2176 pgs
>>>>>> > >> >>> > >     objects: 347k objects, 1381 GB
>>>>>> > >> >>> > >     usage:   2847 GB used, 262 TB / 265 TB avail
>>>>>> > >> >>> > >     pgs:     2176 active+clean
>>>>>> > >> >>> > >
>>>>>> > >> >>> > >   io:
>>>>>> > >> >>> > >     client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > And indeed the cache pool is full:
>>>>>> > >> >>> > > # rados df
>>>>>> > >> >>> > > POOL_NAME        USED OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED   RD_OPS    RD  WR_OPS     WR
>>>>>> > >> >>> > > cephfs_cache    1381G  355385      0 710770                  0       0        0 10004954 1522G 1398063  1611G
>>>>>> > >> >>> > > cephfs_data         0       0      0      0                  0       0        0        0     0       0      0
>>>>>> > >> >>> > > cephfs_metadata 8515k      24      0     72                  0       0        0        3  3072    3953 10541k
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > total_objects    355409
>>>>>> > >> >>> > > total_used       2847G
>>>>>> > >> >>> > > total_avail      262T
>>>>>> > >> >>> > > total_space      265T
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > However, the data pool is completely empty! So it seems
>>>>>> that data
>>>>>> > >> has
>>>>>> > >> >>> > > only
>>>>>> > >> >>> > > been written to the cache pool, but not written back to
>>>>>> the data
>>>>>> > >> >>> > > pool.
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > I am really at a loss whether this is due to a setup
>>>>>> error on my
>>>>>> > >> >>> > > part, or
>>>>>> > >> >>> > > a Luminous bug. Could anyone shed some light on this?
>>>>>> Please let
>>>>>> > >> me
>>>>>> > >> >>> > > know if
>>>>>> > >> >>> > > you need any further info.
>>>>>> > >> >>> > >
>>>>>> > >> >>> > > Best,
>>>>>> > >> >>> > > Shaw
>>>>>> > >> >>> > > _______________________________________________
>>>>>> > >> >>> > > ceph-users mailing list
>>>>>> > >> >>> > > ceph-users at lists.ceph.com
>>>>>> > >> >>> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>>> > >> >>> > >
>>>>>> > >> >>>
>>>>>> > >> >>>
>>>>>> > >> >>> --
>>>>>> > >> >>> Christian Balzer        Network/Systems Engineer
>>>>>> > >> >>> chibi at gol.com           Rakuten Communications
>>>>>> > >> >>
>>>>>> > >> >>
>>>>>> > >> >>
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >> >
>>>>>> > >>
>>>>>> > >
>>>>>> > >
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>> chibi at gol.com           Rakuten Communications
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>