I'm at scan_links now, will post an update once it has finished.
Have you reset the journal after fs recovery, as suggested in the doc?

quote:

If the damaged filesystem contains dirty journal data, it may be recovered next with:

cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
cephfs-journal-tool --rank recovery-fs:0 journal reset --force
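
(For concreteness, a minimal sketch of that step, assuming the original filesystem is named "cephfs" and the recovery pool/filesystem are called "recovery" and "recovery-fs" as in the doc -- substitute your own names:)

# recover dentries from the damaged filesystem's journal into the alternate metadata pool
cephfs-journal-tool --rank=cephfs:0 event recover_dentries list --alternate-pool recovery
# then reset the journal of the recovery filesystem
cephfs-journal-tool --rank=recovery-fs:0 journal reset --force
# sanity check
cephfs-journal-tool --rank=recovery-fs:0 journal inspect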
  
On 7.10.2018, at 00:36, Alfredo Daniel Rezinovsky <alfrenovsky@gmail.com> wrote:

  <div text="#000000" bgcolor="#FFFFFF" class=""><p class="">I did something wrong in the upgrade restart also...</p><p class="">after rescaning with:</p><p class="">cephfs-data-scan scan_extents cephfs_data (with threads)</p><p class="">cephfs-data-scan scan_inodes cephfs_data (with threads)</p>
    cephfs-data-scan scan_links<br class="">
    <br class="">

My MDS still crashes and won't replay:

 1: (()+0x3ec320) [0x55b0e2bd2320]
 2: (()+0x12890) [0x7fc3adce3890]
 3: (gsignal()+0xc7) [0x7fc3acddbe97]
 4: (abort()+0x141) [0x7fc3acddd801]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7fc3ae3cc080]
 6: (()+0x26c0f7) [0x7fc3ae3cc0f7]
 7: (()+0x21eb27) [0x55b0e2a04b27]
 8: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0xc0) [0x55b0e2a04d40]
 9: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0x91d) [0x55b0e2a6a0fd]
 10: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x39f) [0x55b0e2a3ca2f]
 11: (MDSIOContextBase::complete(int)+0x119) [0x55b0e2b54ab9]
 12: (Filer::C_Probe::finish(int)+0xe7) [0x55b0e2bd94e7]
 13: (Context::complete(int)+0x9) [0x55b0e28e9719]
 14: (Finisher::finisher_thread_entry()+0x12e) [0x7fc3ae3ca4ce]
 15: (()+0x76db) [0x7fc3adcd86db]
 16: (clone()+0x3f) [0x7fc3acebe88f]

Did you do something else before starting the MDSs again?
    <div class="moz-cite-prefix">On 05/10/18 21:17, Sergey Malinin
      wrote:<br class="">
    </div>
    <blockquote type="cite" cite="mid:28D8E32E-0FCA-4535-9C19-6E5ED0CD6CBD@newmail.com" class="">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252" class="">
      I ended up rescanning the entire fs using alternate metadata pool
      approach as in <a href="http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/" class="" moz-do-not-send="true">http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/</a>
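
(For reference, a condensed sketch of the setup that page describes, assuming the original filesystem is "cephfs" with data pool "cephfs_data"; the pg count is illustrative and exact flags may differ between releases:)

# create an alternate metadata pool and a recovery filesystem on top of it
ceph fs flag set enable_multiple true --yes-i-really-mean-it
ceph osd pool create recovery 64 replicated
ceph fs new recovery-fs recovery cephfs_data --allow-dangerous-metadata-overlay
cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
ceph fs reset recovery-fs --yes-i-really-mean-it
cephfs-table-tool recovery-fs:all reset session
cephfs-table-tool recovery-fs:all reset snap
cephfs-table-tool recovery-fs:all reset inode
# then scan the original data pool into the alternate metadata pool
cephfs-data-scan scan_extents --alternate-pool recovery --filesystem cephfs cephfs_data
cephfs-data-scan scan_inodes --alternate-pool recovery --filesystem cephfs --force-corrupt --force-init cephfs_data
cephfs-data-scan scan_links --filesystem recovery-fs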
      <div class="">The process has not competed yet because during the
        recovery our cluster encountered another problem with OSDs that
        I got fixed yesterday (thanks to Igor Fedotov @ SUSE).
        <div class="">The first stage (scan_extents) completed in 84
          hours (120M objects in data pool on 8 hdd OSDs on 4 hosts).
          The second (scan_inodes) was interrupted by OSDs failure so I
          have no timing stats but it seems to be runing 2-3 times
          faster than extents scan.</div>
        <div class="">As to root cause -- in my case I recall that
          during upgrade I had forgotten to restart 3 OSDs, one of which
          was holding metadata pool contents, before restarting MDS
          daemons and that seemed to had an impact on MDS journal
          corruption, because when I restarted those OSDs, MDS was able
          to start up but soon failed throwing lots of 'loaded dup
          inode' errors.</div>
      </div>
      <div class=""><br class="">
      </div>
      <div style="" class=""><br class="">
        <blockquote type="cite" class="">
          <div class="">On 6.10.2018, at 00:41, Alfredo Daniel
            Rezinovsky <<a href="mailto:alfrenovsky@gmail.com" class="" moz-do-not-send="true">alfrenovsky@gmail.com</a>>
            wrote:</div>
          <br class="Apple-interchange-newline">
          <div class="">
            <div class="">Same problem...<br class="">
              <br class="">
              # cephfs-journal-tool --journal=purge_queue journal
              inspect<br class="">
              2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object
              500.0000016c<br class="">
              Overall journal integrity: DAMAGED<br class="">
              Objects missing:<br class="">
                0x16c<br class="">
              Corrupt regions:<br class="">
                0x5b000000-ffffffffffffffff<br class="">
              <br class="">
              Just after upgrade to 13.2.2<br class="">
              <br class="">
              Did you fixed it?<br class="">
              <br class="">
              <br class="">

On 26/09/18 13:05, Sergey Malinin wrote:

              <blockquote type="cite" class="">Hello,<br class="">
                Followed standard upgrade procedure to upgrade from
                13.2.1 to 13.2.2.<br class="">
                After upgrade MDS cluster is down, mds rank 0 and
                purge_queue journal are damaged. Resetting purge_queue
                does not seem to work well as journal still appears to
                be damaged.<br class="">
                Can anybody help?<br class="">
                <br class="">
                mds log:<br class="">
                <br class="">
  -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to version 586 from mon.2
  -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now mds.0.583
  -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
  -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- successful recovery!
<skip>
   -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
   -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
   -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send down:damaged seq 137
   -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
   -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
<skip>
    -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
    -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
    -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
     0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
                <br class="">
                # cephfs-journal-tool --journal=purge_queue journal
                inspect<br class="">
                Overall journal integrity: DAMAGED<br class="">
                Corrupt regions:<br class="">
                  0x322ec65d9-ffffffffffffffff<br class="">
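
(Aside: before resetting, the disaster-recovery docs recommend exporting a backup of the journal so its contents can still be examined afterwards; a sketch, output filename arbitrary and assuming export is supported for the purge queue on this release:)

cephfs-journal-tool --journal=purge_queue journal export purge_queue-backup.bin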
                <br class="">
                # cephfs-journal-tool --journal=purge_queue journal
                reset<br class="">
                old journal was 13470819801~8463<br class="">
                new journal start will be 13472104448 (1276184 bytes
                past old end)<br class="">
                writing journal head<br class="">
                done<br class="">
                <br class="">
                # cephfs-journal-tool --journal=purge_queue journal
                inspect<br class="">
                2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object
                500.00000c8c<br class="">
                Overall journal integrity: DAMAGED<br class="">
                Objects missing:<br class="">
                  0xc8c<br class="">
                Corrupt regions:<br class="">
                  0x323000000-ffffffffffffffff<br class="">
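
(For completeness: after a journal reset the rank itself usually remains marked damaged in the MDS map until it is explicitly marked repaired; a sketch of a typical follow-up:)

# clear the damaged flag on rank 0 so an MDS daemon is allowed to take the rank again
ceph mds repaired 0
# then watch whether the MDS comes up or goes damaged again
ceph fs status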

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com