[ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

Igor Fedotov ifedotov at suse.de
Mon Oct 1 07:32:07 PDT 2018



On 10/1/2018 5:03 PM, Sergey Malinin wrote:
> Before I received your response, I had already added 20GB to the OSD (by epanding LV followed by bluefs-bdev-expand) and ran "ceph-kvstore-tool bluestore-kv <path> compact", however it still needs more space.
> Is that because I didn't update DB size with set-label-key?
In mimic you need to run both "bluefs-bdev-expand" and "set-label-key" 
command to commit bluefs volume expansion.
Unfortunately the last command doesn't handle "size" label properly. 
That's why you might need to backport and rebuild with the mentioned 
commits.

> What exactly is the label-key that needs to be updated, as I couldn't find which one is related to DB:
>
> # ceph-bluestore-tool show-label --path /var/lib/ceph/osd/ceph-1
> inferring bluefs devices from bluestore path
> {
>      "/var/lib/ceph/osd/ceph-1/block": {
>          "osd_uuid": "f8f122ee-70a6-4c54-8eb0-9b42205b1ecc",
>          "size": 471305551872,
>          "btime": "2018-07-31 03:06:43.751243",
>          "description": "main",
>          "bluefs": "1",
>          "ceph_fsid": "7d320499-5b3f-453e-831f-60d4db9a4533",
>          "kv_backend": "rocksdb",
>          "magic": "ceph osd volume v026",
>          "mkfs_done": "yes",
>          "osd_key": "XXX",
>          "ready": "ready",
>          "whoami": "1"
>      }
> }
'size' label but your output is for block(aka slow) device.

It should return labels for db/wal devices as well (block.db and 
block.wal symlinks respectively). It works for me in master, can't 
verify with mimic at the moment though..
Here is output for master:

# bin/ceph-bluestore-tool show-label --path dev/osd0
inferring bluefs devices from bluestore path
{
     "dev/osd0/block": {
         "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
         "size": 21474836480,
         "btime": "2018-09-10 15:55:09.044039",
         "description": "main",
         "bluefs": "1",
         "ceph_fsid": "56eddc15-11b9-4e0b-9192-e391fbae551c",
         "kv_backend": "rocksdb",
         "magic": "ceph osd volume v026",
         "mkfs_done": "yes",
         "osd_key": "AQCsaZZbYTxXJBAAe3jJI4p6WbMjvA8CBBUJbA==",
         "ready": "ready",
         "whoami": "0"
     },
     "dev/osd0/block.wal": {
         "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
         "size": 1048576000,
         "btime": "2018-09-10 15:55:09.044985",
         "description": "bluefs wal"
     },
     "dev/osd0/block.db": {
         "osd_uuid": "404dcbe9-3f8d-4ef5-ac59-2582454a9a75",
         "size": 1048576000,
         "btime": "2018-09-10 15:55:09.044469",
         "description": "bluefs db"
     }
}


You can try --dev option instead of --path, e.g.
ceph-bluestore-tool show-label --dev <path-to-block.db>


>
>> On 1.10.2018, at 16:48, Igor Fedotov <ifedotov at suse.de> wrote:
>>
>> This looks like a sort of deadlock when BlueFS needs some additional space to replay the log left after the crash. Which happens during BlueFS open.
>>
>> But such a space (at slow device as DB is full) is gifted in background during bluefs rebalance procedure which will occur after the open.
>>
>> Hence OSDs stuck in permanent crashing..
>>
>> The only way to recover I can suggest for now is to expand DB volumes. You can do that with lvm tools if you have any spare space for that.
>>
>> Once resized you'll need ceph-bluestore-tool to indicate volume expansion to BlueFS (bluefs-bdev-expand command ) and finally update DB volume size label with  set-label-key command.
>>
>> The latter is a bit tricky for mimic - you might need to backport https://github.com/ceph/ceph/pull/22085/commits/ffac450da5d6e09cf14b8363b35f21819b48f38b
>>
>> and rebuild ceph-bluestore-tool. Alternatively you can backport https://github.com/ceph/ceph/pull/22085/commits/71c3b58da4e7ced3422bce2b1da0e3fa9331530b
>>
>> then bluefs expansion and label updates will occur in a single step.
>>
>> I'll do these backports in upstream but this will take some time to pass all the procedures and get into official mimic  release.
>>
>> Will fire a ticket to fix the original issue as well.
>>
>>
>> Thanks,
>>
>> Igor
>>
>>
>> On 10/1/2018 3:28 PM, Sergey Malinin wrote:
>>> These are LVM bluestore NVMe SSDs created with "ceph-volume --lvm prepare --bluestore /dev/nvme0n1p3" i.e. without specifying wal/db devices.
>>> OSDs were created with bluestore_min_alloc_size_ssd=4096, another modified setting is bluestore_cache_kv_max=1073741824
>>>
>>> DB/block usage collected by prometheus module for 3 failed and 1 survived OSDs:
>>>
>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.0"} 65493008384.0
>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.1"} 49013587968.0
>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.2"} 76834406400.0 --> this one has survived
>>> ceph_bluefs_db_total_bytes{ceph_daemon="osd.3"} 63726157824.0
>>>
>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.0"} 65217232896.0
>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.1"} 48944381952.0
>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.2"} 68093476864.0
>>> ceph_bluefs_db_used_bytes{ceph_daemon="osd.3"} 63632834560.0
>>>
>>> ceph_osd_stat_bytes{ceph_daemon="osd.0"} 471305551872.0
>>> ceph_osd_stat_bytes{ceph_daemon="osd.1"} 471305551872.0
>>> ceph_osd_stat_bytes{ceph_daemon="osd.2"} 471305551872.0
>>> ceph_osd_stat_bytes{ceph_daemon="osd.3"} 471305551872.0
>>>
>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.0"} 222328213504.0
>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.1"} 214472544256.0
>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.2"} 163603996672.0
>>> ceph_osd_stat_bytes_used{ceph_daemon="osd.3"} 212806815744.0
>>>
>>>
>>> First crashed OSD was doing DB compaction, others crashed shortly after during backfilling. Workload was "ceph-data-scan scan_inodes" filling metadata pool located on these OSDs at the rate close to 10k objects/second.
>>> Here is the log excerpt of the first crash occurrence:
>>>
>>> 2018-10-01 03:27:12.762 7fbf16dd6700  0 bluestore(/var/lib/ceph/osd/ceph-1) _balance_bluefs_freespace no allocate on 0x80000000 min_alloc_size 0x1000
>>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89741: 106356 keys, 68110589 bytes
>>> 2018-10-01 03:27:12.886 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632892744, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89741, "file_size": 68110589, "table_properties": {"data_size": 67112903, "index_size": 579319, "filter_size": 417316, "raw_key_size": 6733561, "raw_average_key_size": 63, "raw_value_size": 60994583, "raw_average_value_size": 573, "num_data_blocks": 16336, "num_entries": 106356, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "14444", "kMergeOperands": "0"}}
>>> 2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: [/build/ceph-13.2.2/src/rocksdb/db/compaction_job.cc:1166] [default] [JOB 24] Generated table #89742: 23214 keys, 16352315 bytes
>>> 2018-10-01 03:27:12.934 7fbf1e5e5700  4 rocksdb: EVENT_LOG_v1 {"time_micros": 1538353632938670, "cf_name": "default", "job": 24, "event": "table_file_creation", "file_number": 89742, "file_size": 16352315, "table_properties": {"data_size": 16116986, "index_size": 139894, "filter_size": 94386, "raw_key_size": 1470883, "raw_average_key_size": 63, "raw_value_size": 14775006, "raw_average_value_size": 636, "num_data_blocks": 3928, "num_entries": 23214, "filter_policy_name": "rocksdb.BuiltinBloomFilter", "kDeletedKeys": "90", "kMergeOperands": "0"}}
>>> 2018-10-01 03:27:13.042 7fbf1e5e5700  1 bluefs _allocate failed to allocate 0x4100000 on bdev 1, free 0x1a00000; fallback to bdev 2
>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _allocate failed to allocate 0x4100000 on bdev 2, dne
>>> 2018-10-01 03:27:13.042 7fbf1e5e5700 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0x40ea9f1
>>> 2018-10-01 03:27:13.046 7fbf1e5e5700 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7fbf1e5e5700 time 2018-10-01 03:27:13.048298
>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc")
>>>
>>>   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7fbf2d4fe5c2]
>>>   2: (()+0x26c787) [0x7fbf2d4fe787]
>>>   3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5619325114b4]
>>>   4: (BlueRocksWritableFile::Flush()+0x3d) [0x561932527c1d]
>>>   5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x56193271c399]
>>>   6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x56193271d42b]
>>>   7: (rocksdb::CompactionJob::FinishCompactionOutputFile(rocksdb::Status const&, rocksdb::CompactionJob::SubcompactionState*, rocksdb::RangeDelAggregator*, CompactionIterationStats*, rocksdb::Slice const*)+0x3db) [0x56193276098b]
>>>   8: (rocksdb::CompactionJob::ProcessKeyValueCompaction(rocksdb::CompactionJob::SubcompactionState*)+0x7d9) [0x561932763da9]
>>>   9: (rocksdb::CompactionJob::Run()+0x314) [0x561932765504]
>>>   10: (rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*)+0xc54) [0x5619325b5c44]
>>>   11: (rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority)+0x397) [0x5619325b8557]
>>>   12: (rocksdb::DBImpl::BGWorkCompaction(void*)+0x97) [0x5619325b8cd7]
>>>   13: (rocksdb::ThreadPoolImpl::Impl::BGThread(unsigned long)+0x266) [0x5619327a5e36]
>>>   14: (rocksdb::ThreadPoolImpl::Impl::BGThreadWrapper(void*)+0x47) [0x5619327a5fb7]
>>>   15: (()+0xbe733) [0x7fbf2b500733]
>>>   16: (()+0x76db) [0x7fbf2bbf86db]
>>>   17: (clone()+0x3f) [0x7fbf2abbc88f]
>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>>
>>>> On 1.10.2018, at 15:01, Igor Fedotov <ifedotov at suse.de> wrote:
>>>>
>>>> Hi Sergey,
>>>>
>>>> could you please provide more details on your OSDs ?
>>>>
>>>> What are sizes for DB/block devices?
>>>>
>>>> Do you have any modifications in BlueStore config settings?
>>>>
>>>> Can you share stats you're referring to?
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Igor
>>>>
>>>>
>>>> On 10/1/2018 12:29 PM, Sergey Malinin wrote:
>>>>> Hello,
>>>>> 3 of 4 NVME OSDs crashed at the same time on assert(0 == "bluefs enospc") and no longer start.
>>>>> Stats collected just before crash show that ceph_bluefs_db_used_bytes is 100% used. Although OSDs have over 50% of free space, it is not reallocated for DB usage.
>>>>>
>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240  1 bluefs _allocate failed to allocate 0x100000 on bdev 1, free 0x0; fallback to bdev 2
>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _allocate failed to allocate 0x100000 on bdev 2, dne
>>>>> 2018-10-01 12:18:06.744 7f1d6a04d240 -1 bluefs _flush_range allocated: 0x0 offset: 0x0 length: 0xa8700
>>>>> 2018-10-01 12:18:06.748 7f1d6a04d240 -1 /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_flush_range(BlueFS::FileWriter*, uint64_t, uint64_t)' thread 7f1d6a04d240 time 2018-10-01 12:18:06.746800
>>>>> /build/ceph-13.2.2/src/os/bluestore/BlueFS.cc: 1663: FAILED assert(0 == "bluefs enospc")
>>>>>
>>>>>   ceph version 13.2.2 (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)
>>>>>   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7f1d6146f5c2]
>>>>>   2: (()+0x26c787) [0x7f1d6146f787]
>>>>>   3: (BlueFS::_flush_range(BlueFS::FileWriter*, unsigned long, unsigned long)+0x1ab4) [0x5586b22684b4]
>>>>>   4: (BlueRocksWritableFile::Flush()+0x3d) [0x5586b227ec1d]
>>>>>   5: (rocksdb::WritableFileWriter::Flush()+0x1b9) [0x5586b2473399]
>>>>>   6: (rocksdb::WritableFileWriter::Sync(bool)+0x3b) [0x5586b247442b]
>>>>>   7: (rocksdb::BuildTable(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, rocksdb::Env*, rocksdb::ImmutableCFOptions const&, rocksdb::MutableCFOptions const&, rocksdb::EnvOptions const&, rock
>>>>> sdb::TableCache*, rocksdb::InternalIterator*, std::unique_ptr<rocksdb::InternalIterator, std::default_delete<rocksdb::InternalIterator> >, rocksdb::FileMetaData*, rocksdb::InternalKeyComparator const&, std::vector<std::unique_ptr<
>>>>> rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> >, std::allocator<std::unique_ptr<rocksdb::IntTblPropCollectorFactory, std::default_delete<rocksdb::IntTblPropCollectorFactory> > > > co
>>>>> nst*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<unsigned long, std::allocator<unsigned long> >, unsigned long, rocksdb::SnapshotChecker*, rocksdb::Compression
>>>>> Type, rocksdb::CompressionOptions const&, bool, rocksdb::InternalStats*, rocksdb::TableFileCreationReason, rocksdb::EventLogger*, int, rocksdb::Env::IOPriority, rocksdb::TableProperties*, int, unsigned long, unsigned long, rocksdb
>>>>> ::Env::WriteLifeTimeHint)+0x1e24) [0x5586b249ef94]
>>>>>   8: (rocksdb::DBImpl::WriteLevel0TableForRecovery(int, rocksdb::ColumnFamilyData*, rocksdb::MemTable*, rocksdb::VersionEdit*)+0xcb7) [0x5586b2321457]
>>>>>   9: (rocksdb::DBImpl::RecoverLogFiles(std::vector<unsigned long, std::allocator<unsigned long> > const&, unsigned long*, bool)+0x19de) [0x5586b232373e]
>>>>>   10: (rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool)+0x5d4) [0x5586b23242f4]
>>>>>   11: (rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescri
>>>>> ptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool)+0x68b) [0x5586b232559b]
>>>>>   12: (rocksdb::DB::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor
>>>>>>> const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**)+0x22) [0x5586b2326e72]
>>>>>   13: (RocksDBStore::do_open(std::ostream&, bool, std::vector<KeyValueDB::ColumnFamily, std::allocator<KeyValueDB::ColumnFamily> > const*)+0x170c) [0x5586b220219c]
>>>>>   14: (BlueStore::_open_db(bool, bool)+0xd8e) [0x5586b218ee1e]
>>>>>   15: (BlueStore::_mount(bool, bool)+0x4b7) [0x5586b21bf807]
>>>>>   16: (OSD::init()+0x295) [0x5586b1d673c5]
>>>>>   17: (main()+0x268d) [0x5586b1c554ed]
>>>>>   18: (__libc_start_main()+0xe7) [0x7f1d5ea2db97]
>>>>>   19: (_start()+0x2a) [0x5586b1d1d7fa]
>>>>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>>>
>>>>> _______________________________________________
>>>>> ceph-users mailing list
>>>>> ceph-users at lists.ceph.com
>>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



More information about the ceph-users mailing list