[ceph-users] Filestore to Bluestore migration question

Alfredo Deza adeza at redhat.com
Mon Nov 5 08:17:36 PST 2018


On Mon, Nov 5, 2018 at 10:43 AM Hayashida, Mami <mami.hayashida at uky.edu> wrote:
>
> Additional info -- I know that /var/lib/ceph/osd/ceph-{60..69} are not mounted at this point (i.e. "mount | grep ceph-60", and likewise for 61-69, returns nothing).  They don't show up when I run "df", either.
>
> On Mon, Nov 5, 2018 at 10:15 AM, Hayashida, Mami <mami.hayashida at uky.edu> wrote:
>>
>> Well, over the weekend the whole server went down and is now in emergency mode (I am running Ubuntu 16.04).  When I run "journalctl -p err -xb", I see the following:
>>
>> systemd[1]: Timed out waiting for device dev-sdh1.device.
>> -- Subject: Unit dev-sdh1.device has failed
>> -- Defined-By: systemd
>> -- Support: http://lists.freedesktop.org/....
>> --
>> -- Unit dev-sdh1.device has failed.
>>
>>
>> I see this for every single one of the newly-converted Bluestore OSD disks (/dev/sd{h..q}1).

This will happen with stale ceph-disk systemd units. You can disable those with:

ln -sf /dev/null /etc/systemd/system/ceph-disk@.service
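
A minimal sketch of that masking step, wrapped in a function so the unit directory can be overridden for a trial run. The function name `mask_ceph_disk_template` and its directory argument are illustrative only; the `ln -sf` line is the command given above, targeting the `ceph-disk@.service` template unit. Running `systemctl daemon-reload` afterwards is the usual way to make systemd re-read its unit files.

```shell
#!/bin/sh
# Mask the ceph-disk template unit by pointing it at /dev/null.
# unit_dir defaults to the real systemd directory; pass a scratch
# directory to try this out without touching the system.
# (Hypothetical helper, not part of Ceph.)
mask_ceph_disk_template() {
    unit_dir="${1:-/etc/systemd/system}"
    ln -sf /dev/null "${unit_dir}/ceph-disk@.service"
}

# On the OSD host, as root:
#   mask_ceph_disk_template && systemctl daemon-reload
```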


>>
>>
>> --
>>
>> On Mon, Nov 5, 2018 at 9:57 AM, Alfredo Deza <adeza at redhat.com> wrote:
>>>
>>> On Fri, Nov 2, 2018 at 5:04 PM Hayashida, Mami <mami.hayashida at uky.edu> wrote:
>>> >
>>> > I followed all the steps Hector suggested, and almost everything seems to have worked fine.  I say "almost" because one out of the 10 osds I was migrating could not be activated even though everything up to that point worked just as well for that osd as the other ones. Here is the output for that particular failure:
>>> >
>>> > *****
>>> > ceph-volume lvm activate --all
>>> > ...
>>> > --> Activating OSD ID 67 FSID 17cd6755-76f9-4160-906c-XXXXXX
>>> > Running command: mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-67
>>> > --> Absolute path not found for executable: restorecon
>>> > --> Ensure $PATH environment variable contains common executable locations
>>> > Running command: ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/hdd67/data67 --path /var/lib/ceph/osd/ceph-67
>>> >  stderr: failed to read label for /dev/hdd67/data67: (2) No such file or directory
>>> > -->  RuntimeError: command returned non-zero exit status:
>>>
>>> I wonder if the /dev/sdo device where hdd67/data67 is located is
>>> available, or if something else is missing. You could try poking
>>> around with `lvs` to see if that LV shows up. Also,
>>> `ceph-volume lvm list hdd67/data67` can help here because it
>>> groups LVs by OSD. If you run
>>> `ceph-volume lvm list --format=json hdd67/data67` you will also see
>>> all the metadata stored in it.
>>>
>>> Would be interesting to see that output to verify things exist and are
>>> usable for OSD activation.
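
A rough sketch of that first check, assuming the LV should appear as a device node under /dev/<vg>/<lv> (the path ceph-bluestore-tool was given in the failed activation). The helper name `lv_node_status` and its optional third argument are made up for illustration.

```shell
#!/bin/sh
# Report whether the device node for VG/LV exists.
# dev_root defaults to /dev; overriding it is only useful for testing.
lv_node_status() {
    vg="$1"; lv="$2"; dev_root="${3:-/dev}"
    if [ -e "${dev_root}/${vg}/${lv}" ]; then
        echo "present"
    else
        echo "missing"
    fi
}

# On the OSD host:
#   lv_node_status hdd67 data67
#   lvs hdd67
#   ceph-volume lvm list hdd67/data67
```

If the node is reported missing while `lvs` still lists the LV, the LV may simply not be active, but that is a guess until the output is checked.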
>>>
>>> >
>>> > *******
>>> > I then checked to see if the rest of the migrated OSDs were back in by calling the ceph osd tree command from the admin node.  Since they were not, I tried to restart the first of the 10 newly migrated Bluestore osds by calling
>>> >
>>> > *******
>>> > systemctl start ceph-osd@60
>>> >
>>> > At that point, not only did this particular service fail to start, but ALL the OSDs (daemons) on the entire node shut down!
>>> >
>>> > ******
>>> > root@osd1:~# systemctl status ceph-osd@60
>>> > ● ceph-osd@60.service - Ceph object storage daemon osd.60
>>> >    Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
>>> >    Active: inactive (dead) since Fri 2018-11-02 15:47:20 EDT; 1h 9min ago
>>> >   Process: 3473621 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
>>> >   Process: 3473147 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>> >  Main PID: 3473621 (code=exited, status=0/SUCCESS)
>>> >
>>> > Oct 29 15:57:53 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-10-29 15:57:53.868856 7f68adaece00 -1 osd.60 48106 log_to_monitors {default=true}
>>> > Oct 29 15:57:53 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-10-29 15:57:53.874373 7f68adaece00 -1 osd.60 48106 mon_cmd_maybe_osd_create fail: 'you must complete the upgrade and 'ceph osd require-osd-release luminous' before using crush device classes': (1) Operation not permitted
>>> > Oct 30 06:25:01 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-10-30 06:25:01.961720 7f687feb3700 -1 received  signal: Hangup from  PID: 3485955 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Oct 31 06:25:02 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-10-31 06:25:02.110898 7f687feb3700 -1 received  signal: Hangup from  PID: 3500945 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Nov 01 06:25:02 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-11-01 06:25:02.101548 7f687feb3700 -1 received  signal: Hangup from  PID: 3514774 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Nov 02 06:25:02 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-11-02 06:25:01.997557 7f687feb3700 -1 received  signal: Hangup from  PID: 3528128 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Nov 02 15:47:16 osd1.oxxxxx.uky.edu ceph-osd[3473621]: 2018-11-02 15:47:16.322229 7f687feb3700 -1 received  signal: Terminated from  PID: 1 task name: /lib/systemd/systemd --system --deserialize 20  UID: 0
>>> > Nov 02 15:47:16 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-11-02 15:47:16.322253 7f687feb3700 -1 osd.60 48504 *** Got signal Terminated ***
>>> > Nov 02 15:47:16 osd1.xxxxx.uky.edu ceph-osd[3473621]: 2018-11-02 15:47:16.676625 7f687feb3700 -1 osd.60 48504 shutdown
>>> > Nov 02 16:34:05 osd1.oxxxxx.uky.edu systemd[1]: Stopped Ceph object storage daemon osd.60.
>>> >
>>> > **********
>>> > And here is the output for one of the OSDs (osd.70, still using Filestore) that shut down right when I tried to start osd.60
>>> >
>>> > ********
>>> >
>>> > root@osd1:~# systemctl status ceph-osd@70
>>> > ● ceph-osd@70.service - Ceph object storage daemon osd.70
>>> >    Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
>>> >    Active: inactive (dead) since Fri 2018-11-02 16:34:08 EDT; 2min 6s ago
>>> >   Process: 3473629 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=0/SUCCESS)
>>> >   Process: 3473153 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>>> >  Main PID: 3473629 (code=exited, status=0/SUCCESS)
>>> >
>>> > Oct 29 15:57:51 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-10-29 15:57:51.300563 7f530eec2e00 -1 osd.70 pg_epoch: 48095 pg[68.ces1( empty local-lis/les=47489/47489 n=0 ec=6030/6030 lis/c 47488/47488 les/c/f 47489/47489/0 47485/47488/47488) [138,70,203]p138(0) r=1 lpr=0 crt=0'0 unknown NO
>>> > Oct 30 06:25:01 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-10-30 06:25:01.961743 7f52d8e44700 -1 received  signal: Hangup from  PID: 3485955 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Oct 31 06:25:02 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-10-31 06:25:02.110920 7f52d8e44700 -1 received  signal: Hangup from  PID: 3500945 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Nov 01 06:25:02 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-11-01 06:25:02.101568 7f52d8e44700 -1 received  signal: Hangup from  PID: 3514774 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Nov 02 06:25:02 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-11-02 06:25:01.997633 7f52d8e44700 -1 received  signal: Hangup from  PID: 3528128 task name: killall -q -1 ceph-mon ceph-mgr ceph-mds ceph-osd ceph-fuse radosgw  UID: 0
>>> > Nov 02 16:34:05 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-11-02 16:34:05.607714 7f52d8e44700 -1 received  signal: Terminated from  PID: 1 task name: /lib/systemd/systemd --system --deserialize 20  UID: 0
>>> > Nov 02 16:34:05 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-11-02 16:34:05.607738 7f52d8e44700 -1 osd.70 48535 *** Got signal Terminated ***
>>> > Nov 02 16:34:05 osd1.xxxx.uky.edu systemd[1]: Stopping Ceph object storage daemon osd.70...
>>> > Nov 02 16:34:05 osd1.xxxx.uky.edu ceph-osd[3473629]: 2018-11-02 16:34:05.677348 7f52d8e44700 -1 osd.70 48535 shutdown
>>> > Nov 02 16:34:08 osd1.xxxx.uky.edu systemd[1]: Stopped Ceph object storage daemon osd.70.
>>> >
>>> > **************
>>> >
>>> > So, at this point, ALL the OSDs on that node have been shut down.
>>> >
>>> > For your information this is the output of lsblk command (selection)
>>> > *****
>>> > root@osd1:~# lsblk
>>> > NAME           MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
>>> > sda              8:0    0 447.1G  0 disk
>>> > ├─ssd0-db60    252:0    0    40G  0 lvm
>>> > ├─ssd0-db61    252:1    0    40G  0 lvm
>>> > ├─ssd0-db62    252:2    0    40G  0 lvm
>>> > ├─ssd0-db63    252:3    0    40G  0 lvm
>>> > ├─ssd0-db64    252:4    0    40G  0 lvm
>>> > ├─ssd0-db65    252:5    0    40G  0 lvm
>>> > ├─ssd0-db66    252:6    0    40G  0 lvm
>>> > ├─ssd0-db67    252:7    0    40G  0 lvm
>>> > ├─ssd0-db68    252:8    0    40G  0 lvm
>>> > └─ssd0-db69    252:9    0    40G  0 lvm
>>> > sdb              8:16   0 447.1G  0 disk
>>> > ├─sdb1           8:17   0    40G  0 part
>>> > ├─sdb2           8:18   0    40G  0 part
>>> >
>>> > .....
>>> >
>>> > sdh              8:112  0   3.7T  0 disk
>>> > └─hdd60-data60 252:10   0   3.7T  0 lvm
>>> > sdi              8:128  0   3.7T  0 disk
>>> > └─hdd61-data61 252:11   0   3.7T  0 lvm
>>> > sdj              8:144  0   3.7T  0 disk
>>> > └─hdd62-data62 252:12   0   3.7T  0 lvm
>>> > sdk              8:160  0   3.7T  0 disk
>>> > └─hdd63-data63 252:13   0   3.7T  0 lvm
>>> > sdl              8:176  0   3.7T  0 disk
>>> > └─hdd64-data64 252:14   0   3.7T  0 lvm
>>> > sdm              8:192  0   3.7T  0 disk
>>> > └─hdd65-data65 252:15   0   3.7T  0 lvm
>>> > sdn              8:208  0   3.7T  0 disk
>>> > └─hdd66-data66 252:16   0   3.7T  0 lvm
>>> > sdo              8:224  0   3.7T  0 disk
>>> > └─hdd67-data67 252:17   0   3.7T  0 lvm
>>> > sdp              8:240  0   3.7T  0 disk
>>> > └─hdd68-data68 252:18   0   3.7T  0 lvm
>>> > sdq             65:0    0   3.7T  0 disk
>>> > └─hdd69-data69 252:19   0   3.7T  0 lvm
>>> > sdr             65:16   0   3.7T  0 disk
>>> > └─sdr1          65:17   0   3.7T  0 part /var/lib/ceph/osd/ceph-70
>>> > .....
>>> >
>>> > As a Ceph novice, I am totally clueless about the next step at this point.  Any help would be appreciated.
>>> >
>>> > On Thu, Nov 1, 2018 at 3:16 PM, Hayashida, Mami <mami.hayashida at uky.edu> wrote:
>>> >>
>>> >> Thank you, both of you.  I will try this out very soon.
>>> >>
>>> >> On Wed, Oct 31, 2018 at 8:48 AM, Alfredo Deza <adeza at redhat.com> wrote:
>>> >>>
>>> >>> On Wed, Oct 31, 2018 at 8:28 AM Hayashida, Mami <mami.hayashida at uky.edu> wrote:
>>> >>> >
>>> >>> > Thank you for your replies. So, if I use the method Hector suggested (by creating PVs, VGs.... etc. first), can I add the --osd-id parameter to the command as in
>>> >>> >
>>> >>> > ceph-volume lvm prepare --bluestore --data hdd0/data0 --block.db ssd/db0  --osd-id 0
>>> >>> > ceph-volume lvm prepare --bluestore --data hdd1/data1 --block.db ssd/db1  --osd-id 1
>>> >>> >
>>> >>> > so that Filestore -> Bluestore migration will not change the osd ID on each disk?
>>> >>>
>>> >>> That looks correct.
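
For a range of OSDs, the same pattern can be generated in a loop. This is a sketch that only prints the commands for review; the function name `print_prepare_cmds` is hypothetical, and the hddN/dataN and ssd/dbN names follow the naming in the example above and would need to match the actual VG/LV names on the host.

```shell
#!/bin/sh
# Print one "ceph-volume lvm prepare" command per OSD ID in [first, last].
# Nothing is executed; run the printed commands only after reviewing them.
print_prepare_cmds() {
    first="$1"; last="$2"; ssd_vg="${3:-ssd}"
    i="$first"
    while [ "$i" -le "$last" ]; do
        echo "ceph-volume lvm prepare --bluestore --data hdd${i}/data${i} --block.db ${ssd_vg}/db${i} --osd-id ${i}"
        i=$((i + 1))
    done
}

print_prepare_cmds 0 1
```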
>>> >>>
>>> >>> >
>>> >>> > And one more question.  Are there any changes I need to make to the ceph.conf file?  I did comment out this line that was probably used for creating Filestore (using ceph-deploy):  osd journal size = 40960
>>> >>>
>>> >>> Since you've pre-created the LVs the commented out line will not
>>> >>> affect anything.
>>> >>>
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > On Wed, Oct 31, 2018 at 7:03 AM, Alfredo Deza <adeza at redhat.com> wrote:
>>> >>> >>
>>> >>> >> On Wed, Oct 31, 2018 at 5:22 AM Hector Martin <hector at marcansoft.com> wrote:
>>> >>> >> >
>>> >>> >> > On 31/10/2018 05:55, Hayashida, Mami wrote:
>>> >>> >> > > I am relatively new to Ceph and need some advice on Bluestore migration.
>>> >>> >> > > I tried migrating a few of our test cluster nodes from Filestore to
>>> >>> >> > > Bluestore by following this
>>> >>> >> > > (http://docs.ceph.com/docs/luminous/rados/operations/bluestore-migration/)
>>> >>> >> > > as the cluster is currently running 12.2.9. The cluster, originally set
>>> >>> >> > > up by my predecessors, was running Jewel until I upgraded it recently to
>>> >>> >> > > Luminous.
>>> >>> >> > >
>>> >>> >> > > The OSDs in each OSD host are set up in such a way that for every 10 data HDD
>>> >>> >> > > disks, there is one SSD drive that holds their journals.  For
>>> >>> >> > > example, osd.0 data is on /dev/sdh and its Filestore journal is on a
>>> >>> >> > > partition of /dev/sda. So, lsblk shows something like
>>> >>> >> > >
>>> >>> >> > > sda       8:0    0 447.1G  0 disk
>>> >>> >> > > ├─sda1    8:1    0    40G  0 part # journal for osd.0
>>> >>> >> > >
>>> >>> >> > > sdh       8:112  0   3.7T  0 disk
>>> >>> >> > > └─sdh1    8:113  0   3.7T  0 part /var/lib/ceph/osd/ceph-0
>>> >>> >> > >
>>> >>> >> >
>>> >>> >> > The BlueStore documentation states that the wal will automatically use
>>> >>> >> > the db volume if it fits, so if you're using a single SSD I think
>>> >>> >> > there's no good reason to split out the wal, if I'm understanding it
>>> >>> >> > correctly.
>>> >>> >>
>>> >>> >> This is correct, no need for wal in this case.
>>> >>> >>
>>> >>> >> >
>>> >>> >> > You should be using ceph-volume, since ceph-disk is deprecated. If
>>> >>> >> > you're sharing the SSD as wal/db for a bunch of OSDs, I think you're
>>> >>> >> > going to have to create the LVs yourself first. The data HDDs should be
>>> >>> >> > PVs (I don't think it matters if they're partitions or whole disk PVs as
>>> >>> >> > long as LVM discovers them) each part of a separate VG (e.g. hdd0-hdd9)
>>> >>> >> > containing a single LV. Then the SSD should itself be an LV for a
>>> >>> >> > separate shared SSD VG (e.g. ssd).
>>> >>> >> >
>>> >>> >> > So something like (assuming sda is your wal/db SSD and sdb and onwards are
>>> >>> >> > your OSD HDDs):
>>> >>> >> > pvcreate /dev/sda
>>> >>> >> > pvcreate /dev/sdb
>>> >>> >> > pvcreate /dev/sdc
>>> >>> >> > ...
>>> >>> >> >
>>> >>> >> > vgcreate ssd /dev/sda
>>> >>> >> > vgcreate hdd0 /dev/sdb
>>> >>> >> > vgcreate hdd1 /dev/sdc
>>> >>> >> > ...
>>> >>> >> >
>>> >>> >> > lvcreate -L 40G -n db0 ssd
>>> >>> >> > lvcreate -L 40G -n db1 ssd
>>> >>> >> > ...
>>> >>> >> >
>>> >>> >> > lvcreate -l 100%VG -n data0 hdd0
>>> >>> >> > lvcreate -l 100%VG -n data1 hdd1
>>> >>> >> > ...
>>> >>> >> >
>>> >>> >> > ceph-volume lvm prepare --bluestore --data hdd0/data0 --block.db ssd/db0
>>> >>> >> > ceph-volume lvm prepare --bluestore --data hdd1/data1 --block.db ssd/db1
>>> >>> >> > ...
>>> >>> >> >
>>> >>> >> > ceph-volume lvm activate --all
>>> >>> >> >
>>> >>> >> > I think it might be possible to just let ceph-volume create the PV/VG/LV
>>> >>> >> > for the data disks and only manually create the DB LVs, but it shouldn't
>>> >>> >> > hurt to do it on your own and just give ready-made LVs to ceph-volume
>>> >>> >> > for everything.
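
The LVM layout above can also be written as a loop. This is a sketch that prints the commands instead of running them; the function name, its arguments, and passing the DB size as a parameter are illustrative. It uses `-l 100%VG` (lowercase, extents) for the data LVs because lvcreate accepts percentages with `-l`, not with `-L`.

```shell
#!/bin/sh
# Print LVM commands for one shared DB SSD and a list of data HDDs.
# Usage: print_lvm_plan SSD_DEV DB_SIZE HDD_DEV...
# Nothing is executed; review the output before running any of it.
print_lvm_plan() {
    ssd_dev="$1"; db_size="$2"; shift 2
    echo "pvcreate ${ssd_dev}"
    echo "vgcreate ssd ${ssd_dev}"
    n=0
    for hdd in "$@"; do
        echo "pvcreate ${hdd}"
        echo "vgcreate hdd${n} ${hdd}"
        echo "lvcreate -L ${db_size} -n db${n} ssd"
        echo "lvcreate -l 100%VG -n data${n} hdd${n}"
        n=$((n + 1))
    done
}

print_lvm_plan /dev/sda 40G /dev/sdb /dev/sdc
```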
>>> >>> >>
>>> >>> >> Another alternative here is to use the new `lvm batch` subcommand to
>>> >>> >> do all of this in one go:
>>> >>> >>
>>> >>> >> ceph-volume lvm batch /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh
>>> >>> >>
>>> >>> >> This will detect that sda is an SSD and create the LVs for
>>> >>> >> block.db on it (one for each spinning disk). Each spinning disk will
>>> >>> >> hold the data for one OSD.
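
To preview which devices the kernel reports as non-rotational before running `lvm batch`, the sysfs `queue/rotational` flag can be read directly (0 means non-rotational, i.e. an SSD). This is a sketch of a manual check only: it is an assumption that it matches exactly what ceph-volume looks at, and the helper name `disk_kind` and its optional sysfs root argument are illustrative.

```shell
#!/bin/sh
# Print "ssd" or "hdd" for a block device name such as sda, based on
# /sys/block/<dev>/queue/rotational. Prints "unknown" and returns 1
# if the flag cannot be read. sys_root is overridable for testing.
disk_kind() {
    dev="$1"; sys_root="${2:-/sys}"
    flag_file="${sys_root}/block/${dev}/queue/rotational"
    [ -r "$flag_file" ] || { echo "unknown"; return 1; }
    if [ "$(cat "$flag_file")" = "0" ]; then
        echo "ssd"
    else
        echo "hdd"
    fi
}

# On the OSD host:
#   for d in sda sdb sdc; do echo "$d: $(disk_kind "$d")"; done
```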
>>> >>> >>
>>> >>> >> The one caveat is that you no longer control OSD IDs, and they are
>>> >>> >> created with whatever the monitors are giving out.
>>> >>> >>
>>> >>> >> This operation is not supported from ceph-deploy either.
>>> >>> >> >
>>> >>> >> > --
>>> >>> >> > Hector Martin (hector at marcansoft.com)
>>> >>> >> > Public Key: https://marcan.st/marcan.asc
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > --
>>> >>> > Mami Hayashida
>>> >>> > Research Computing Associate
>>> >>> >
>>> >>> > Research Computing Infrastructure
>>> >>> > University of Kentucky Information Technology Services
>>> >>> > 301 Rose Street | 102 James F. Hardymon Building
>>> >>> > Lexington, KY 40506-0495
>>> >>> > mami.hayashida at uky.edu
>>> >>> > (859)323-7521

