[ceph-users] Ceph mds is stuck in creating status

Kisik Jeong kisik.jeong at csl.skku.edu
Tue Oct 16 08:15:34 PDT 2018


Oh my god. The network configuration was the problem. After fixing the
network configuration, I successfully created CephFS. Thank you very much.

-Kisik
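
(For readers landing on this thread with the same symptom: the address
mismatch John describes below can be checked mechanically by comparing each
OSD's public_addr from "ceph osd dump --format=json-pretty" against the
intended public_network. A minimal sketch, assuming the dump's JSON layout;
the sample data is illustrative, not taken from this cluster:)

```python
import ipaddress

def find_misconfigured_osds(osd_dump, public_network):
    """Return IDs of OSDs whose public address falls outside public_network."""
    net = ipaddress.ip_network(public_network)
    bad = []
    for osd in osd_dump["osds"]:
        # public_addr in the dump looks like "192.168.10.18:6801/1002";
        # strip the port/nonce to get the bare IPv4 address.
        ip = osd["public_addr"].split(":")[0]
        if ipaddress.ip_address(ip) not in net:
            bad.append(osd["osd"])
    return bad

# Illustrative sample mirroring the situation in this thread:
sample = {"osds": [
    {"osd": 9,  "public_addr": "192.168.40.19:6800/1001"},
    {"osd": 10, "public_addr": "192.168.10.18:6801/1002"},
]}
print(find_misconfigured_osds(sample, "192.168.40.0/24"))  # [10]
```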

On Tue, Oct 16, 2018 at 9:58 PM John Spray <jspray at redhat.com> wrote:

> On Mon, Oct 15, 2018 at 7:15 PM Kisik Jeong <kisik.jeong at csl.skku.edu>
> wrote:
> >
> > I attached the osd & fs dumps. There are clearly two pools (cephfs_data,
> > cephfs_metadata) for CephFS. And this system's network is 40Gbps
> > ethernet for both public & cluster, so I don't think network speed is
> > the problem. Thank you.
>
> Ah, your pools do exist; I had just been looking at the start of the
> MDS log, where it hadn't seen the osdmap yet.
>
> Looking again at your original log together with your osdmap, I notice
> that your stuck operations are targeting OSDs 10,11,13,14,15, and all
> these OSDs have public addresses in the 192.168.10.x range rather than
> the 192.168.40.x range like the others.
>
> So my guess would be that you are intending your OSDs to be in the
> 192.168.40.x range, but are missing some config settings for certain
> daemons.
>
> John
>
>
> > On Tue, Oct 16, 2018 at 1:18 AM John Spray <jspray at redhat.com> wrote:
> >>
> >> On Mon, Oct 15, 2018 at 4:24 PM Kisik Jeong <kisik.jeong at csl.skku.edu>
> wrote:
> >> >
> >> > Thank you for your reply, John.
> >> >
> >> > I restarted my Ceph cluster and captured the mds logs.
> >> >
> >> > I found that the mds shows slow requests because some OSDs are laggy.
> >> >
> >> > I followed the Ceph MDS troubleshooting guide for 'mds slow request', but
> >> > there are no operations in flight:
> >> >
> >> > root at hpc1:~/iodc# ceph daemon mds.hpc1 dump_ops_in_flight
> >> > {
> >> >     "ops": [],
> >> >     "num_ops": 0
> >> > }
> >> >
> >> > Is there any other reason that the mds would show slow requests? Thank you.
> >>
> >> Those stuck requests seem to be stuck because they're targeting pools
> >> that don't exist.  Has something strange happened in the history of
> >> this cluster that might have left a filesystem referencing pools that
> >> no longer exist?  Ceph is not supposed to permit removal of pools in
> >> use by CephFS, but perhaps something went wrong.
> >>
> >> Check out the "ceph osd dump --format=json-pretty" and "ceph fs dump
> >> --format=json-pretty" outputs and how the pool IDs relate.  According
> >> to those logs, the data pool with ID 1 and the metadata pool with ID 2
> >> do not exist.
> >>
> >> John
> >>
> >> > -Kisik
> >> >
> >> > On Mon, Oct 15, 2018 at 11:43 PM John Spray <jspray at redhat.com> wrote:
> >> >>
> >> >> On Mon, Oct 15, 2018 at 3:34 PM Kisik Jeong <
> kisik.jeong at csl.skku.edu> wrote:
> >> >> >
> >> >> > Hello,
> >> >> >
> >> >> > I successfully deployed a Ceph cluster with 16 OSDs and created
> >> >> > CephFS before.
> >> >> > But after rebooting due to an mds slow request problem, when I create
> >> >> > CephFS, the Ceph mds enters 'creating' status and never changes.
> >> >> > Looking at the Ceph status, I don't see any other problem. Here is
> >> >> > the 'ceph -s' output:
> >> >>
> >> >> That's pretty strange.  Usually if an MDS is stuck in "creating",
> it's
> >> >> because an OSD operation is stuck, but in your case all your PGs are
> >> >> healthy.
> >> >>
> >> >> I would suggest setting "debug mds=20" and "debug objecter=10" on
> your
> >> >> MDS, restarting it and capturing those logs so that we can see where
> >> >> it got stuck.
> >> >>
> >> >> John
> >> >>
> >> >> > csl at hpc1:~$ ceph -s
> >> >> >   cluster:
> >> >> >     id:     1a32c483-cb2e-4ab3-ac60-02966a8fd327
> >> >> >     health: HEALTH_OK
> >> >> >
> >> >> >   services:
> >> >> >     mon: 1 daemons, quorum hpc1
> >> >> >     mgr: hpc1(active)
> >> >> >     mds: cephfs-1/1/1 up  {0=hpc1=up:creating}
> >> >> >     osd: 16 osds: 16 up, 16 in
> >> >> >
> >> >> >   data:
> >> >> >     pools:   2 pools, 640 pgs
> >> >> >     objects: 7 objects, 124B
> >> >> >     usage:   34.3GiB used, 116TiB / 116TiB avail
> >> >> >     pgs:     640 active+clean
> >> >> >
> >> >> > However, CephFS still works when only 8 OSDs are used.
> >> >> >
> >> >> > If there is any doubt of this phenomenon, please let me know.
> Thank you.
> >> >> >
> >> >> > PS. I attached my ceph.conf contents:
> >> >> >
> >> >> > [global]
> >> >> > fsid = 1a32c483-cb2e-4ab3-ac60-02966a8fd327
> >> >> > mon_initial_members = hpc1
> >> >> > mon_host = 192.168.40.10
> >> >> > auth_cluster_required = cephx
> >> >> > auth_service_required = cephx
> >> >> > auth_client_required = cephx
> >> >> >
> >> >> > public_network = 192.168.40.0/24
> >> >> > cluster_network = 192.168.40.0/24
> >> >> >
> >> >> > [osd]
> >> >> > osd journal size = 1024
> >> >> > osd max object name len = 256
> >> >> > osd max object namespace len = 64
> >> >> > osd mount options f2fs = active_logs=2
> >> >> >
> >> >> > [osd.0]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.1]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.40.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.2]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.3]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.40.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.4]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.5]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.40.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.6]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.7]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.40.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.8]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.40.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.9]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.40.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.10]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.10.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.11]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.10.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.12]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.10.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.13]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.10.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > [osd.14]
> >> >> > host = hpc9
> >> >> > public_addr = 192.168.10.18
> >> >> > cluster_addr = 192.168.40.18
> >> >> >
> >> >> > [osd.15]
> >> >> > host = hpc10
> >> >> > public_addr = 192.168.10.19
> >> >> > cluster_addr = 192.168.40.19
> >> >> >
> >> >> > --
> >> >> > Kisik Jeong
> >> >> > Ph.D. Student
> >> >> > Computer Systems Laboratory
> >> >> > Sungkyunkwan University
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list
> >> >> > ceph-users at lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> >
> >> >
> >> >
> >> > --
> >> > Kisik Jeong
> >> > Ph.D. Student
> >> > Computer Systems Laboratory
> >> > Sungkyunkwan University
> >
> >
> >
> > --
> > Kisik Jeong
> > Ph.D. Student
> > Computer Systems Laboratory
> > Sungkyunkwan University
>


-- 
Kisik Jeong
Ph.D. Student
Computer Systems Laboratory
Sungkyunkwan University