[ceph-users] Cluster hang

Matteo Dacrema mdacrema at enter.eu
Thu Nov 9 08:20:14 PST 2017


Update: I noticed that one PG remained in the scrubbing state from the day I first found the issue until I rebooted the node, at which point the problem disappeared.
Could this cause the behaviour I described before?
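
A minimal sketch of how a long-running scrub like this could be spotted, not a definitive tool. It assumes the per-PG records from `ceph pg dump --format json` are available as a list of objects with "pgid" and "state" fields; the exact JSON layout may differ between releases, and the sample PG ids below are made up for illustration. The function names are hypothetical.

```python
import json


def scrubbing_pgs(pg_stats):
    """Return {pgid: state} for every PG whose state contains the scrubbing flag."""
    hits = {}
    for pg in pg_stats:
        # A PG state is a '+'-joined list of flags, e.g. "active+clean+scrubbing+deep".
        flags = pg["state"].split("+")
        if "scrubbing" in flags:
            hits[pg["pgid"]] = pg["state"]
    return hits


def still_scrubbing(earlier, later):
    """Return the sorted PG ids that were scrubbing in both snapshots."""
    return sorted(set(scrubbing_pgs(earlier)) & set(scrubbing_pgs(later)))


# Hypothetical snapshots, taken some hours apart (assumed record layout).
snapshot_1 = json.loads("""[
    {"pgid": "1.2f", "state": "active+clean+scrubbing+deep"},
    {"pgid": "1.30", "state": "active+clean+scrubbing"},
    {"pgid": "1.31", "state": "active+clean"}
]""")
snapshot_2 = json.loads("""[
    {"pgid": "1.2f", "state": "active+clean+scrubbing+deep"},
    {"pgid": "1.30", "state": "active+clean"},
    {"pgid": "1.31", "state": "active+clean"}
]""")

print(still_scrubbing(snapshot_1, snapshot_2))
```

A PG that shows up in both snapshots is not necessarily stuck, since a deep scrub of a large PG can legitimately take a long time, but one that stays in the list for days is worth a closer look.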


> On 9 Nov 2017, at 15:55, Matteo Dacrema <mdacrema at enter.eu> wrote:
> 
> Hi all,
> 
> I’ve experienced a strange issue with my cluster.
> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journals each, plus 4 SSD nodes with 5 SSDs each.
> All the nodes are behind 3 monitors and 2 different crush maps.
> The whole cluster is on Ceph 10.2.7 (Jewel).
> 
> About 20 days ago I started to notice that long backups hang with "task jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
> A few days ago another VM started to show high iowait without doing any IOPS, also on the HDD crush map.
> 
> Today about a hundred VMs weren't able to read from or write to many volumes, all of them on the HDD crush map. Ceph health was OK and no significant log entries were found.
> Not all the VMs experienced this problem, and in the meantime the IOPS on the journals and HDDs were very low, even though I was able to do significant IOPS on the working VMs.
> 
> After two hours of debugging I decided to reboot one of the OSD nodes and the cluster started to respond again. Now the OSD node is back in the cluster and the problem has disappeared.
> 
> Can someone help me to understand what happened?
> I see strange entries in the log files like:
> 
> accept replacing existing (lossy) channel (new one lossy=1)
> fault with nothing to send, going to standby
> leveldb manual compact 
> 
> I can share all the logs that can help to identify the issue.
> 
> Thank you.
> Regards,
> 
> Matteo