[ceph-users] Ceph cache pool full

Shawfeng Dong shaw at ucsc.edu
Fri Oct 6 14:22:29 PDT 2017


Here is a quick update. I found that a CephFS client process was still
accessing the big 1TB file and, I think, holding a lock on it, which
prevented objects from being flushed to the underlying data pool. Once I
killed that process, objects started to flush to the data pool automatically
(with target_max_bytes & target_max_objects set); I can also force the
flushing with 'rados -p cephfs_cache cache-flush-evict-all'. So David
appears to be right in saying that "it can only hold full files and not
flush partial files". This will be problematic if we want to transfer a
file that is bigger than the cache pool!
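
(For the record, a handy way to see which clients hold sessions, and thus
possibly caps/locks, on the filesystem is to query the MDS admin socket on
the MDS host; something like the following should work, with pulpo-mds01
being our active MDS:
# ceph daemon mds.pulpo-mds01 session ls
The output should list each client's id and hostname, which makes it easier
to track down the offending process.)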

We did this whole scheme (EC data pool plus NVMe cache tier) just for
experimentation. I've learned a lot from the experiment and from you guys.
Thank you very much!

For production, I think I'll simply use a replicated pool for data on the
HDDs (with bluestore WAL and DB on the 1st NVMe), and a replicated pool for
metadata on the 2nd NVMe.  Please let me know if you have any further
advice or suggestions.
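
A rough sketch of what I have in mind (the crush rule names, PG counts and
filesystem name below are just placeholders that I'd still have to tune):
# ceph osd crush rule create-replicated replicated_hdd default host hdd
# ceph osd crush rule create-replicated replicated_nvme default host nvme
# ceph osd pool create cephfs_data 1024 1024 replicated replicated_hdd
# ceph osd pool create cephfs_metadata 64 64 replicated replicated_nvme
# ceph fs new pulpos cephfs_metadata cephfs_data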

Best,
Shaw



On Fri, Oct 6, 2017 at 10:07 AM, David Turner <drakonstein at gmail.com> wrote:

> All of this data is test data, yeah?  I would start by removing the
> cache-tier and pool, recreating and re-attaching it, configuring all of the
> settings including the maximums, and then testing things again.  I would
> avoid doing the 1.3TB file test until after you've confirmed that the
> smaller files are being flushed appropriately to the data pool (by manually
> flushing/evicting them), and then scale up your testing to the larger files.
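>
> A rough teardown/recreate sequence (just a sketch off the top of my head;
> double-check the exact flags against the docs for your version) would be
> something like:
> # ceph osd tier cache-mode cephfs_cache forward --yes-i-really-mean-it
> # rados -p cephfs_cache cache-flush-evict-all
> # ceph osd tier remove-overlay cephfs_data
> # ceph osd tier remove cephfs_data cephfs_cache
> # ceph osd pool delete cephfs_cache cephfs_cache --yes-i-really-really-mean-it
> Then recreate the pool, 'ceph osd tier add' / 'cache-mode writeback' /
> 'set-overlay' it again, and set target_max_bytes, target_max_objects and
> the ratios before writing any data.
>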
> On Fri, Oct 6, 2017 at 12:54 PM Shawfeng Dong <shaw at ucsc.edu> wrote:
>
>> Curiously, it has been quite a while, but there is still no object in the
>> underlying data pool:
>> # rados -p cephfs_data ls
>>
>> Any advice?
>>
>> On Fri, Oct 6, 2017 at 9:45 AM, David Turner <drakonstein at gmail.com>
>> wrote:
>>
>>> Notice in the URL for the documentation the use of "luminous".  When you
>>> looked a few weeks ago, you might have been looking at the documentation
>>> for a different version of Ceph.  You can change that to jewel, hammer,
>>> kraken, master, etc depending on which version of Ceph you are running or
>>> reading about.  Google gets confused and will pull up random versions of
>>> the ceph documentation for a page. It's on us to make sure that the url is
>>> pointing to the version of Ceph that we are using.
>>>
>>> While it's sitting there in the flush command, can you see if there are
>>> any objects in the underlying data pool?  Hopefully the count will be
>>> growing.
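>>> Something like
>>> # watch -n 10 'rados -p cephfs_data ls | wc -l'
>>> (or just re-running 'ceph df') should show whether it is.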
>>>
>>> On Fri, Oct 6, 2017 at 12:39 PM Shawfeng Dong <shaw at ucsc.edu> wrote:
>>>
>>>> Hi Christian,
>>>>
>>>> I set those via CLI:
>>>> # ceph osd pool set cephfs_cache target_max_bytes 1099511627776
>>>> # ceph osd pool set cephfs_cache target_max_objects 1000000
>>>>
>>>> but manual flushing doesn't appear to work:
>>>> # rados -p cephfs_cache cache-flush-evict-all
>>>>         1000000046a.00000ca6
>>>>
>>>> it just gets stuck there for a long time.
>>>>
>>>> Any suggestion? Do I need to restart the daemons or reboot the nodes?
>>>>
>>>> Thanks,
>>>> Shaw
>>>>
>>>>
>>>>
>>>> On Fri, Oct 6, 2017 at 9:31 AM, Christian Balzer <chibi at gol.com> wrote:
>>>>
>>>>> On Fri, 6 Oct 2017 09:14:40 -0700 Shawfeng Dong wrote:
>>>>>
>>>>> > I found the command: rados -p cephfs_cache cache-flush-evict-all
>>>>> >
>>>>> That's not what you want/need.
>>>>> Though it will fix your current "full" issue.
>>>>>
>>>>> > The documentation (
>>>>> > http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/)
>>>>> has
>>>>> > been improved a lot since I last checked it a few weeks ago!
>>>>> >
>>>>> The need to set max_bytes and max_objects has been documented for ages
>>>>> (since Hammer).
>>>>>
>>>>> more below...
>>>>>
>>>>> > -Shaw
>>>>> >
>>>>> > On Fri, Oct 6, 2017 at 9:10 AM, Shawfeng Dong <shaw at ucsc.edu> wrote:
>>>>> >
>>>>> > > Thanks, Luis.
>>>>> > >
>>>>> > > I've just set max_bytes and max_objects:
>>>>> How?
>>>>> Editing the conf file won't help until a restart.
>>>>>
>>>>> > > target_max_objects: 1000000 (1M)
>>>>> > > target_max_bytes: 1099511627776 (1TB)
>>>>> >
>>>>> I'd lower that or the cache_target_full_ratio by another 10%.
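>>>>> (e.g. something along the lines of
>>>>> 'ceph osd pool set cephfs_cache cache_target_full_ratio 0.7')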
>>>>>
>>>>> Christian
>>>>> > >
>>>>> > > but nothing appears to be happening. Is there a way to force
>>>>> flushing?
>>>>> > >
>>>>> > > Thanks,
>>>>> > > Shaw
>>>>> > >
>>>>> > > On Fri, Oct 6, 2017 at 8:55 AM, Luis Periquito <
>>>>> periquito at gmail.com>
>>>>> > > wrote:
>>>>> > >
>>>>> > >> Not looking at anything else, you didn't set the max_bytes or
>>>>> > >> max_objects for it to start flushing...
>>>>> > >>
>>>>> > >> On Fri, Oct 6, 2017 at 4:49 PM, Shawfeng Dong <shaw at ucsc.edu>
>>>>> wrote:
>>>>> > >> > Dear all,
>>>>> > >> >
>>>>> > >> > Thanks a lot for the very insightful comments/suggestions!
>>>>> > >> >
>>>>> > >> > There are 3 OSD servers in our pilot Ceph cluster, each with 2x
>>>>> 1TB SSDs
>>>>> > >> > (boot disks), 12x 8TB SATA HDDs and 2x 1.2TB NVMe SSDs. We use
>>>>> the
>>>>> > >> bluestore
>>>>> > >> > backend, with the first NVMe as the WAL and DB devices for OSDs
>>>>> on the
>>>>> > >> HDDs.
>>>>> > >> > And we try to create a cache tier out of the second NVMes.
>>>>> > >> >
>>>>> > >> > Here are the outputs of the commands suggested by David:
>>>>> > >> >
>>>>> > >> > 1) # ceph df
>>>>> > >> > GLOBAL:
>>>>> > >> >     SIZE     AVAIL     RAW USED     %RAW USED
>>>>> > >> >     265T      262T        2847G          1.05
>>>>> > >> > POOLS:
>>>>> > >> >     NAME                ID     USED      %USED      MAX AVAIL
>>>>> > >>  OBJECTS
>>>>> > >> >     cephfs_data         1          0          0          248T
>>>>> > >>  0
>>>>> > >> >     cephfs_metadata     2      8515k          0          248T
>>>>> > >> 24
>>>>> > >> >     cephfs_cache        3      1381G     100.00             0
>>>>> > >> 355385
>>>>> > >> >
>>>>> > >> > 2) # ceph osd df
>>>>> > >> >  0   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 174
>>>>> > >> >  1   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 169
>>>>> > >> >  2   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
>>>>> > >> >  3   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 159
>>>>> > >> >  4   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
>>>>> > >> >  5   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 162
>>>>> > >> >  6   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 149
>>>>> > >> >  7   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 179
>>>>> > >> >  8   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 163
>>>>> > >> >  9   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 194
>>>>> > >> > 10   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 185
>>>>> > >> > 11   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 168
>>>>> > >> > 36  nvme 1.09149  1.00000 1117G  855G   262G 76.53 73.01  79
>>>>> > >> > 12   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 180
>>>>> > >> > 13   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 168
>>>>> > >> > 14   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 178
>>>>> > >> > 15   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 170
>>>>> > >> > 16   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 149
>>>>> > >> > 17   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 203
>>>>> > >> > 18   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 173
>>>>> > >> > 19   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 158
>>>>> > >> > 20   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 154
>>>>> > >> > 21   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 160
>>>>> > >> > 22   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 167
>>>>> > >> > 23   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 188
>>>>> > >> > 37  nvme 1.09149  1.00000 1117G 1061G 57214M 95.00 90.63  98
>>>>> > >> > 24   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 187
>>>>> > >> > 25   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 200
>>>>> > >> > 26   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 147
>>>>> > >> > 27   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 171
>>>>> > >> > 28   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 162
>>>>> > >> > 29   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 152
>>>>> > >> > 30   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 174
>>>>> > >> > 31   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 176
>>>>> > >> > 32   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 182
>>>>> > >> > 33   hdd 7.27829  1.00000 7452G 2072M  7450G  0.03  0.03 155
>>>>> > >> > 34   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 166
>>>>> > >> > 35   hdd 7.27829  1.00000 7452G 2076M  7450G  0.03  0.03 176
>>>>> > >> > 38  nvme 1.09149  1.00000 1117G  857G   260G 76.71 73.18  79
>>>>> > >> >                     TOTAL  265T 2847G   262T  1.05
>>>>> > >> > MIN/MAX VAR: 0.03/90.63  STDDEV: 22.81
>>>>> > >> >
>>>>> > >> > 3) # ceph osd tree
>>>>> > >> > -1       265.29291 root default
>>>>> > >> > -3        88.43097     host pulpo-osd01
>>>>> > >> >  0   hdd   7.27829         osd.0            up  1.00000 1.00000
>>>>> > >> >  1   hdd   7.27829         osd.1            up  1.00000 1.00000
>>>>> > >> >  2   hdd   7.27829         osd.2            up  1.00000 1.00000
>>>>> > >> >  3   hdd   7.27829         osd.3            up  1.00000 1.00000
>>>>> > >> >  4   hdd   7.27829         osd.4            up  1.00000 1.00000
>>>>> > >> >  5   hdd   7.27829         osd.5            up  1.00000 1.00000
>>>>> > >> >  6   hdd   7.27829         osd.6            up  1.00000 1.00000
>>>>> > >> >  7   hdd   7.27829         osd.7            up  1.00000 1.00000
>>>>> > >> >  8   hdd   7.27829         osd.8            up  1.00000 1.00000
>>>>> > >> >  9   hdd   7.27829         osd.9            up  1.00000 1.00000
>>>>> > >> > 10   hdd   7.27829         osd.10           up  1.00000 1.00000
>>>>> > >> > 11   hdd   7.27829         osd.11           up  1.00000 1.00000
>>>>> > >> > 36  nvme   1.09149         osd.36           up  1.00000 1.00000
>>>>> > >> > -5        88.43097     host pulpo-osd02
>>>>> > >> > 12   hdd   7.27829         osd.12           up  1.00000 1.00000
>>>>> > >> > 13   hdd   7.27829         osd.13           up  1.00000 1.00000
>>>>> > >> > 14   hdd   7.27829         osd.14           up  1.00000 1.00000
>>>>> > >> > 15   hdd   7.27829         osd.15           up  1.00000 1.00000
>>>>> > >> > 16   hdd   7.27829         osd.16           up  1.00000 1.00000
>>>>> > >> > 17   hdd   7.27829         osd.17           up  1.00000 1.00000
>>>>> > >> > 18   hdd   7.27829         osd.18           up  1.00000 1.00000
>>>>> > >> > 19   hdd   7.27829         osd.19           up  1.00000 1.00000
>>>>> > >> > 20   hdd   7.27829         osd.20           up  1.00000 1.00000
>>>>> > >> > 21   hdd   7.27829         osd.21           up  1.00000 1.00000
>>>>> > >> > 22   hdd   7.27829         osd.22           up  1.00000 1.00000
>>>>> > >> > 23   hdd   7.27829         osd.23           up  1.00000 1.00000
>>>>> > >> > 37  nvme   1.09149         osd.37           up  1.00000 1.00000
>>>>> > >> > -7        88.43097     host pulpo-osd03
>>>>> > >> > 24   hdd   7.27829         osd.24           up  1.00000 1.00000
>>>>> > >> > 25   hdd   7.27829         osd.25           up  1.00000 1.00000
>>>>> > >> > 26   hdd   7.27829         osd.26           up  1.00000 1.00000
>>>>> > >> > 27   hdd   7.27829         osd.27           up  1.00000 1.00000
>>>>> > >> > 28   hdd   7.27829         osd.28           up  1.00000 1.00000
>>>>> > >> > 29   hdd   7.27829         osd.29           up  1.00000 1.00000
>>>>> > >> > 30   hdd   7.27829         osd.30           up  1.00000 1.00000
>>>>> > >> > 31   hdd   7.27829         osd.31           up  1.00000 1.00000
>>>>> > >> > 32   hdd   7.27829         osd.32           up  1.00000 1.00000
>>>>> > >> > 33   hdd   7.27829         osd.33           up  1.00000 1.00000
>>>>> > >> > 34   hdd   7.27829         osd.34           up  1.00000 1.00000
>>>>> > >> > 35   hdd   7.27829         osd.35           up  1.00000 1.00000
>>>>> > >> > 38  nvme   1.09149         osd.38           up  1.00000 1.00000
>>>>> > >> >
>>>>> > >> > 4) # ceph osd pool get cephfs_cache all
>>>>> > >> > min_size: 2
>>>>> > >> > crash_replay_interval: 0
>>>>> > >> > pg_num: 128
>>>>> > >> > pgp_num: 128
>>>>> > >> > crush_rule: pulpo_nvme
>>>>> > >> > hashpspool: true
>>>>> > >> > nodelete: false
>>>>> > >> > nopgchange: false
>>>>> > >> > nosizechange: false
>>>>> > >> > write_fadvise_dontneed: false
>>>>> > >> > noscrub: false
>>>>> > >> > nodeep-scrub: false
>>>>> > >> > hit_set_type: bloom
>>>>> > >> > hit_set_period: 14400
>>>>> > >> > hit_set_count: 12
>>>>> > >> > hit_set_fpp: 0.05
>>>>> > >> > use_gmt_hitset: 1
>>>>> > >> > auid: 0
>>>>> > >> > target_max_objects: 0
>>>>> > >> > target_max_bytes: 0
>>>>> > >> > cache_target_dirty_ratio: 0.4
>>>>> > >> > cache_target_dirty_high_ratio: 0.6
>>>>> > >> > cache_target_full_ratio: 0.8
>>>>> > >> > cache_min_flush_age: 0
>>>>> > >> > cache_min_evict_age: 0
>>>>> > >> > min_read_recency_for_promote: 0
>>>>> > >> > min_write_recency_for_promote: 0
>>>>> > >> > fast_read: 0
>>>>> > >> > hit_set_grade_decay_rate: 0
>>>>> > >> > crash_replay_interval: 0
>>>>> > >> >
>>>>> > >> > Do you see anything wrong? We had written some small files to
>>>>> the CephFS
>>>>> > >> > before we tried to write the big 1TB file. What is puzzling to
>>>>> me is
>>>>> > >> that no
>>>>> > >> > data has been written back to the data pool.
>>>>> > >> >
>>>>> > >> > Best,
>>>>> > >> > Shaw
>>>>> > >> >
>>>>> > >> > On Fri, Oct 6, 2017 at 6:46 AM, David Turner <
>>>>> drakonstein at gmail.com>
>>>>> > >> wrote:
>>>>> > >> >>
>>>>> > >> >>
>>>>> > >> >>
>>>>> > >> >> On Fri, Oct 6, 2017, 1:05 AM Christian Balzer <chibi at gol.com>
>>>>> wrote:
>>>>> > >> >>>
>>>>> > >> >>>
>>>>> > >> >>> Hello,
>>>>> > >> >>>
>>>>> > >> >>> On Fri, 06 Oct 2017 03:30:41 +0000 David Turner wrote:
>>>>> > >> >>>
>>>>> > >> >>> > You're missing most all of the important bits. What the
>>>>> osds in your
>>>>> > >> >>> > cluster look like, your tree, and your cache pool settings.
>>>>> > >> >>> >
>>>>> > >> >>> > ceph df
>>>>> > >> >>> > ceph osd df
>>>>> > >> >>> > ceph osd tree
>>>>> > >> >>> > ceph osd pool get cephfs_cache all
>>>>> > >> >>> >
>>>>> > >> >>> Especially the last one.
>>>>> > >> >>>
>>>>> > >> >>> My money is on not having set target_max_objects and
>>>>> target_max_bytes
>>>>> > >> to
>>>>> > >> >>> sensible values along with the ratios.
>>>>> > >> >>> In short, not having read the (albeit spotty) documentation.
>>>>> > >> >>>
>>>>> > >> >>> > You have your writeback cache on 3 nvme drives. It looks
>>>>> like you
>>>>> > >> have
>>>>> > >> >>> > 1.6TB available between them for the cache. I don't know the
>>>>> > >> behavior
>>>>> > >> >>> > of a
>>>>> > >> >>> > writeback cache tier on cephfs for large files, but I would
>>>>> guess
>>>>> > >> that
>>>>> > >> >>> > it
>>>>> > >> >>> > can only hold full files and not flush partial files.
>>>>> > >> >>>
>>>>> > >> >>> I VERY much doubt that, if so it would be a massive flaw.
>>>>> > >> >>> One assumes that cache operations work on the RADOS object
>>>>> level, no
>>>>> > >> >>> matter what.
>>>>> > >> >>
>>>>> > >> >> I hope that it is on the rados level, but not a single object
>>>>> had been
>>>>> > >> >> flushed to the backing pool. So I hazarded a guess. Seeing his
>>>>> > >> settings will
>>>>> > >> >> shed more light.
>>>>> > >> >>>
>>>>> > >> >>>
>>>>> > >> >>> > That would mean your
>>>>> > >> >>> > cache needs to have enough space for any file being written
>>>>> to the
>>>>> > >> >>> > cluster.
>>>>> > >> >>> > In this case a 1.3TB file with 3x replication would require
>>>>> 3.9TB
>>>>> > >> (more
>>>>> > >> >>> > than double what you have available) of available space in
>>>>> your
>>>>> > >> >>> > writeback
>>>>> > >> >>> > cache.
>>>>> > >> >>> >
>>>>> > >> >>> > There are very few use cases that benefit from a cache
>>>>> tier. The
>>>>> > >> docs
>>>>> > >> >>> > for
>>>>> > >> >>> > Luminous warn as much.
>>>>> > >> >>> You keep repeating that like a broken record.
>>>>> > >> >>>
>>>>> > >> >>> And while certainly not false I for one wouldn't be able to
>>>>> use
>>>>> > >> (justify
>>>>> > >> >>> using) Ceph w/o cache tiers in our main use case.
>>>>> > >> >>>
>>>>> > >> >>>
>>>>> > >> >>> In this case I assume they were following an old cheat sheet
>>>>> or such,
>>>>> > >> >>> suggesting the previously required cache tier with EC pools.
>>>>> > >> >>
>>>>> > >> >>
>>>>> > >> >> http://docs.ceph.com/docs/luminous/rados/operations/cache-tiering/
>>>>> > >> >>
>>>>> > >> >> I know I keep repeating it, especially recently as there have
>>>>> been a
>>>>> > >> lot
>>>>> > >> >> of people asking about it. The Luminous docs added a large
>>>>> section
>>>>> > >> about how
>>>>> > >> >> it is probably not what you want. Like me, it is not saying
>>>>> that there
>>>>> > >> are
>>>>> > >> >> no use cases for it. There was no information provided about
>>>>> the use
>>>>> > >> case
>>>>> > >> >> and I made some suggestions/guesses. I'm also guessing that
>>>>> they are
>>>>> > >> >> following a guide where a writeback cache was necessary for
>>>>> CephFS to
>>>>> > >> use EC
>>>>> > >> >> prior to Luminous. I also usually add that people should test
>>>>> it out
>>>>> > >> and
>>>>> > >> >> find what works best for them. I will always defer to your
>>>>> practical
>>>>> > >> use of
>>>>> > >> >> cache tiers as well, especially when using rbds.
>>>>> > >> >>
>>>>> > >> >> I manage a cluster that I intend to continue running a
>>>>> writeback cache
>>>>> > >> in
>>>>> > >> >> front of CephFS on the same drives as the EC pool. The use case
>>>>> > >> receives a
>>>>> > >> >> good enough benefit from the cache tier that it isn't even
>>>>> required to
>>>>> > >> use
>>>>> > >> >> flash media to see it. It is used for video editing and the
>>>>> files are
>>>>> > >> >> usually modified and read within the first 24 hours and then
>>>>> left in
>>>>> > >> cold
>>>>> > >> >> storage until deleted. I have the cache timed to keep
>>>>> everything in it
>>>>> > >> for
>>>>> > >> >> 24 hours and then evict it by using a minimum time to flush
>>>>> and evict
>>>>> > >> at 24
>>>>> > >> >> hours and a target max bytes of 0. All files are in there for
>>>>> that
>>>>> > >> time and
>>>>> > >> >> then it never has to decide what to keep as it doesn't keep
>>>>> anything
>>>>> > >> longer
>>>>> > >> >> than that. Luckily read performance from cold storage is not a
>>>>> > >> requirement
>>>>> > >> >> of this cluster as any read operation has to first read it
>>>>> from EC
>>>>> > >> storage,
>>>>> > >> >> write it to replica storage, and then read it from replica
>>>>> storage...
>>>>> > >> Yuck.
>>>>> > >> >>>
>>>>> > >> >>>
>>>>> > >> >>> Christian
>>>>> > >> >>>
>>>>> > >> >>> >What is your goal by implementing this cache? If the
>>>>> > >> >>> > answer is to utilize extra space on the nvmes, then just
>>>>> remove it
>>>>> > >> and
>>>>> > >> >>> > say
>>>>> > >> >>> > thank you. The better use of nvmes in that case are as a
>>>>> part of the
>>>>> > >> >>> > bluestore stack and give your osds larger DB partitions.
>>>>> Keeping
>>>>> > >> your
>>>>> > >> >>> > metadata pool on nvmes is still a good idea.
>>>>> > >> >>> >
>>>>> > >> >>> > On Thu, Oct 5, 2017, 7:45 PM Shawfeng Dong <shaw at ucsc.edu>
>>>>> wrote:
>>>>> > >> >>> >
>>>>> > >> >>> > > Dear all,
>>>>> > >> >>> > >
>>>>> > >> >>> > > We just set up a Ceph cluster, running the latest stable
>>>>> release
>>>>> > >> Ceph
>>>>> > >> >>> > > v12.2.0 (Luminous):
>>>>> > >> >>> > > # ceph --version
>>>>> > >> >>> > > ceph version 12.2.0 (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc)
>>>>> > >> >>> > >
>>>>> > >> >>> > > The goal is to serve Ceph filesystem, for which we
>>>>> created 3
>>>>> > >> pools:
>>>>> > >> >>> > > # ceph osd lspools
>>>>> > >> >>> > > 1 cephfs_data,2 cephfs_metadata,3 cephfs_cache,
>>>>> > >> >>> > > where
>>>>> > >> >>> > > * cephfs_data is the data pool (36 OSDs on HDDs), which is
>>>>> > >> >>> > > erasure-coded;
>>>>> > >> >>> > > * cephfs_metadata is the metadata pool
>>>>> > >> >>> > > * cephfs_cache is the cache tier (3 OSDs on NVMes) for
>>>>> > >> cephfs_data.
>>>>> > >> >>> > > The
>>>>> > >> >>> > > cache-mode is writeback.
>>>>> > >> >>> > >
>>>>> > >> >>> > > Everything had worked fine, until today when we tried to
>>>>> copy a
>>>>> > >> 1.3TB
>>>>> > >> >>> > > file
>>>>> > >> >>> > > to the CephFS.  We got the "No space left on device"
>>>>> error!
>>>>> > >> >>> > >
>>>>> > >> >>> > > 'ceph -s' says some OSDs are full:
>>>>> > >> >>> > > # ceph -s
>>>>> > >> >>> > >   cluster:
>>>>> > >> >>> > >     id:     e18516bf-39cb-4670-9f13-88ccb7d19769
>>>>> > >> >>> > >     health: HEALTH_ERR
>>>>> > >> >>> > >             full flag(s) set
>>>>> > >> >>> > >             1 full osd(s)
>>>>> > >> >>> > >             1 pools have many more objects per pg than
>>>>> average
>>>>> > >> >>> > >
>>>>> > >> >>> > >   services:
>>>>> > >> >>> > >     mon: 3 daemons, quorum pulpo-admin,pulpo-mon01,pulpo-mds01
>>>>> > >> >>> > >     mgr: pulpo-mds01(active), standbys: pulpo-admin, pulpo-mon01
>>>>> > >> >>> > >     mds: pulpos-1/1/1 up  {0=pulpo-mds01=up:active}
>>>>> > >> >>> > >     osd: 39 osds: 39 up, 39 in
>>>>> > >> >>> > >          flags full
>>>>> > >> >>> > >
>>>>> > >> >>> > >   data:
>>>>> > >> >>> > >     pools:   3 pools, 2176 pgs
>>>>> > >> >>> > >     objects: 347k objects, 1381 GB
>>>>> > >> >>> > >     usage:   2847 GB used, 262 TB / 265 TB avail
>>>>> > >> >>> > >     pgs:     2176 active+clean
>>>>> > >> >>> > >
>>>>> > >> >>> > >   io:
>>>>> > >> >>> > >     client:   19301 kB/s rd, 2935 op/s rd, 0 op/s wr
>>>>> > >> >>> > >
>>>>> > >> >>> > > And indeed the cache pool is full:
>>>>> > >> >>> > > # rados df
>>>>> > >> >>> > > POOL_NAME       USED  OBJECTS CLONES COPIES MISSING_ON_PRIMARY UNFOUND DEGRADED   RD_OPS    RD  WR_OPS     WR
>>>>> > >> >>> > > cephfs_cache    1381G  355385      0 710770                  0       0        0 10004954 1522G 1398063  1611G
>>>>> > >> >>> > > cephfs_data         0       0      0      0                  0       0        0        0     0       0      0
>>>>> > >> >>> > > cephfs_metadata 8515k      24      0     72                  0       0        0        3  3072    3953 10541k
>>>>> > >> >>> > >
>>>>> > >> >>> > > total_objects    355409
>>>>> > >> >>> > > total_used       2847G
>>>>> > >> >>> > > total_avail      262T
>>>>> > >> >>> > > total_space      265T
>>>>> > >> >>> > >
>>>>> > >> >>> > > However, the data pool is completely empty! So it seems
>>>>> that data
>>>>> > >> has
>>>>> > >> >>> > > only
>>>>> > >> >>> > > been written to the cache pool, but not written back to
>>>>> the data
>>>>> > >> >>> > > pool.
>>>>> > >> >>> > >
>>>>> > >> >>> > > I am really at a loss whether this is due to a setup
>>>>> error on my
>>>>> > >> >>> > > part, or
>>>>> > >> >>> > > a Luminous bug. Could anyone shed some light on this?
>>>>> Please let
>>>>> > >> me
>>>>> > >> >>> > > know if
>>>>> > >> >>> > > you need any further info.
>>>>> > >> >>> > >
>>>>> > >> >>> > > Best,
>>>>> > >> >>> > > Shaw
>>>>> > >> >>> > >
>>>>> > >> >>>
>>>>> > >> >>>
>>>>> > >> >>> --
>>>>> > >> >>> Christian Balzer        Network/Systems Engineer
>>>>> > >> >>> chibi at gol.com           Rakuten Communications
>>>>> > >> >>
>>>>> > >> >>
>>>>> > >> >>
>>>>> > >> >
>>>>> > >> >
>>>>> > >> >
>>>>> > >>
>>>>> > >
>>>>> > >
>>>>>
>>>>>
>>>>> --
>>>>> Christian Balzer        Network/Systems Engineer
>>>>> chibi at gol.com           Rakuten Communications
>>>>>
>>>>
>>>>
>>>
>>