[ceph-users] VM Data corruption shortly after Luminous Upgrade

James Forde jimf at mninc.net
Mon Nov 6 08:23:56 PST 2017


A weird but very bad problem hit my test cluster 2-3 weeks after upgrading to Luminous.
All 7 running VMs are corrupted and unbootable: 6 Windows and 1 CentOS 7. The Windows error is "unmountable boot volume". CentOS 7 will only boot to emergency mode.
The 3 VMs that were off during the event work as expected: 2 Windows and 1 Ubuntu.

History:
7-node cluster: 5 OSD, 3 MON (1 is MON-OSD), plus 2 KVM nodes.

The system originally ran Jewel on old tower servers. I migrated it to all rackmount servers, then upgraded to Kraken. Kraken added the MGR servers.

On the 13th or 14th of October I upgraded to Luminous. The upgrade went smoothly. ceph versions showed all nodes running 12.2.1, and health was HEALTH_OK. I even checked out the Ceph Dashboard.
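
For anyone wanting to repeat the post-upgrade check described above, here is a minimal sketch. The function name and the CEPH override variable are hypothetical, added only for illustration. The require_osd_release line is an extra check not mentioned in the original post; it is included on the assumption that it helps confirm the upgrade was fully finalized.

```shell
# CEPH is a hypothetical override so the helper can be pointed at a wrapper or stub.
CEPH="${CEPH:-ceph}"

# Summarize daemon versions, cluster health, and the OSD release flag.
check_upgrade_state() {
    "$CEPH" versions                              # per-daemon version counts
    "$CEPH" health detail                         # HEALTH_OK or the list of warnings
    "$CEPH" osd dump | grep require_osd_release   # assumption: useful to confirm "luminous" is set
}
```

Running check_upgrade_state on a MON or admin node right after the upgrade, and again after any daemon restarts, gives two snapshots to compare.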

Then around the 20th I created a master for cloning, spun off a clone, mucked around with it, flattened it so it was standalone, and shut both it and the master off.

Problem:
On November 1st I started the clone and got the following error.

"failed to start domain internal error: qemu unexpectedly closed the monitor vice virtio-balloon"



To resolve (restarting the MONs one at a time):

Restarted 1st MON. Tried to start the clone. Same error.

Restarted 2nd MON. All 7 running VMs shut off!

Restarted 3rd MON. Clone now runs. Tried to start any of the 7 VMs that had been running: "Unmountable Boot Volume".



I pulled the logs on all nodes and am going through them.
So far I have found this:

"terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
terminate called recursively
2017-11-01 19:41:48.814+0000: shutting down, reason=crashed"

Possible monmap corruption?
Any insight would be greatly appreciated.


Hints?
After the Luminous upgrade, ceph osd tree showed nothing in the CLASS column. After restarting the MONs, the MON-OSD node had "hdd" on each OSD.
After restarting the entire cluster, all OSD servers had "hdd" in the CLASS column. Not sure why this did not happen right after the upgrade.
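
A minimal sketch for inspecting the device classes mentioned above, and for setting one by hand on an OSD that shows none. The function names and the CEPH override variable are hypothetical; the underlying ceph osd crush subcommands are the ones that arrived with Luminous device classes. Whether assigning a class by hand is appropriate here is an open question, so treat the second helper as illustration only.

```shell
# CEPH is a hypothetical override so the helpers can be pointed at a wrapper or stub.
CEPH="${CEPH:-ceph}"

# List the device classes CRUSH knows about, then the OSD tree with its CLASS column.
show_device_classes() {
    "$CEPH" osd crush class ls
    "$CEPH" osd tree
}

# Assign a class to a single OSD, e.g.: set_device_class hdd osd.3
set_device_class() {
    "$CEPH" osd crush set-device-class "$1" "$2"
}
```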

Also, after the restart the mgr servers failed to start: "key for mgr.HOST exists but cap mds does not match"
Solved per https://www.seekhole.io/?p=12
$ ceph auth caps mgr.HOST mon 'allow profile mgr' mds 'allow *' osd 'allow *'
Again, not sure why this would not have manifested itself at the upgrade when all servers were restarted.
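
A minimal sketch for checking what caps a mgr key currently holds, before or after applying the fix above. The function names and the CEPH override variable are hypothetical. The grep pattern assumes the usual keyring-style output of ceph auth get, where each cap appears on its own line as caps mds = "...".

```shell
# CEPH is a hypothetical override so the helpers can be pointed at a wrapper or stub.
CEPH="${CEPH:-ceph}"

# Print the key and caps stored for one mgr daemon, e.g.: show_mgr_caps HOST
show_mgr_caps() {
    "$CEPH" auth get "mgr.$1"
}

# Succeed only if the stored entry already has an mds cap line.
# Assumes keyring-style output with a 'caps mds = ...' line.
mgr_has_mds_cap() {
    show_mgr_caps "$1" | grep -q 'caps mds'
}
```

For example, mgr_has_mds_cap HOST || echo "mgr.HOST has no mds cap" would flag a key that still needs the caps update.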

-Jim

