[ceph-users] FAILED assert(p.same_interval_since) and unusable cluster

Jon Light jon at jonlight.com
Thu Nov 2 14:23:39 PDT 2017


I followed the instructions in the GitHub repo for cloning and setting up
the build environment, checked out the 12.2.0 tag, modified OSD.cc with the
fix, and then tried to build with dpkg-buildpackage. I got the following
error:
"ceph/src/kv/RocksDBStore.cc:593:22: error: ‘perf_context’ is not a member
of ‘rocksdb’"
I guess some changes have been made to RocksDB since 12.2.0?
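One possible cause (an assumption, not confirmed in this thread): checking out a tag does not move the git submodules with it, so the in-tree rocksdb can end up newer than what the 12.2.0 sources compile against, producing errors like the missing `perf_context` member. A minimal sketch of resyncing the submodules after the checkout:

```shell
# After checking out the release tag, resync the submodules so the
# vendored rocksdb (and other bundled deps) match that tag instead of
# whatever a newer branch left behind in the working tree.
cd ceph
git checkout v12.2.0
git submodule sync
git submodule update --init --recursive
```

If the submodules were the mismatch, a clean rebuild after this should get past the RocksDBStore.cc error.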

Am I going about this the right way? Should I simply recompile the OSD
binary with the fix and then copy it to the nodes in my cluster? What's the
best way to get this fix applied to my current installation?
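For testing a one-file patch, one lighter-weight option than rebuilding full .deb packages is to build just the ceph-osd binary with the tree's cmake build and drop it onto the affected nodes. This is a sketch under the assumption that the nodes run the exact same release as the checkout; the target and output paths follow the Ceph source tree's cmake layout:

```shell
# Sketch: build only the patched ceph-osd binary from a 12.2.x checkout.
./do_cmake.sh            # configures an out-of-tree build in ./build
cd build
make -j"$(nproc)" ceph-osd
# The binary lands in build/bin/ceph-osd. Stop the OSD daemon on the
# target node before swapping the binary in, and keep a copy of the
# packaged original so it can be restored afterwards.
```

This avoids the dpkg-buildpackage path entirely, at the cost of an unpackaged binary that the next package upgrade will overwrite.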

Thanks

On Wed, Nov 1, 2017 at 11:39 AM, Jon Light <jon at jonlight.com> wrote:

> I'm currently running 12.2.0. How should I go about applying the patch?
> Should I upgrade to 12.2.1, apply the changes, and then recompile?
>
> I really appreciate the patch.
> Thanks
>
> On Wed, Nov 1, 2017 at 11:10 AM, David Zafman <dzafman at redhat.com> wrote:
>
>>
>> Jon,
>>
>>     If you are able please test my tentative fix for this issue which is
>> in https://github.com/ceph/ceph/pull/18673
>>
>>
>> Thanks
>>
>> David
>>
>>
>>
>> On 10/30/17 1:13 AM, Jon Light wrote:
>>
>>> Hello,
>>>
>>> I have three OSDs that are crashing on start with a FAILED
>>> assert(p.same_interval_since) error. I ran across a thread from a few
>>> days ago about the same issue, and a ticket was created here:
>>> http://tracker.ceph.com/issues/21833.
>>>
>>> A very overloaded node in my cluster OOM'd many times, which eventually
>>> led to the problematic PGs and then the failed assert.
>>>
>>> I currently have 49 pgs inactive, 33 pgs down, and 15 pgs incomplete,
>>> as well as 0.028% of objects unfound. Presumably because of this, I
>>> can't add any data to the FS or read some data, and just about any IO
>>> results in a good number of stuck requests.
>>>
>>> Hopefully a fix can come from the issue, but can anyone give me some
>>> suggestions or guidance to get the cluster in a working state in the
>>> meantime?
>>>
>>> Thanks
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>