[ceph-users] Cluster hang
mdacrema at enter.eu
Thu Nov 9 08:20:14 PST 2017
Update: I noticed that one PG remained in the scrubbing state from the first day I saw the issue until I rebooted the node, at which point the problem disappeared.
Can this cause the behaviour I described before?
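For anyone who wants to check for the same condition, here is a minimal sketch that picks out PGs whose state includes scrubbing from a brief PG listing. It assumes whitespace-separated text with the PG id in the first column and the state in the second, which is how I understand the plain output of `ceph pg dump pgs_brief` to look. The function name and the sample rows are made up for illustration and are not taken from this cluster.

```python
def find_scrubbing_pgs(pgs_brief_text):
    """Return (pgid, state) pairs whose state includes the 'scrubbing' flag.

    Assumes each data row looks like: <pgid> <state> <up> <up_primary> ...
    """
    hits = []
    for line in pgs_brief_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row; PG ids look like "1.2f".
        if len(fields) < 2 or "." not in fields[0]:
            continue
        pgid, state = fields[0], fields[1]
        # States are '+'-joined flags, e.g. "active+clean+scrubbing+deep".
        if "scrubbing" in state.split("+"):
            hits.append((pgid, state))
    return hits


# Hypothetical sample input, for illustration only.
SAMPLE = """\
pg_stat state up up_primary acting acting_primary
1.0 active+clean [3,7,12] 3 [3,7,12] 3
1.1 active+clean+scrubbing+deep [5,9,14] 5 [5,9,14] 5
2.3 active+clean+scrubbing [1,4,8] 1 [1,4,8] 1
"""
```

Calling `find_scrubbing_pgs(SAMPLE)` returns the two scrubbing rows. A single snapshot only shows which PGs are scrubbing right now, so the listing would have to be captured repeatedly over time to spot a PG that never leaves that state. Whether such a PG explains the hang is a separate question that this sketch does not answer.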
> On 09 Nov 2017, at 15:55, Matteo Dacrema <mdacrema at enter.eu> wrote:
> Hi all,
> I’ve experienced a strange issue with my cluster.
> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal devices each, plus 4 SSD nodes with 5 SSDs each.
> All the nodes are behind 3 monitors and 2 different crush maps.
> The whole cluster runs Ceph 10.2.7.
> About 20 days ago I started to notice that long backups hang with "task jbd2/vdc1-8:555 blocked for more than 120 seconds" on the HDD crush map.
> A few days ago another VM, also on the HDD crush map, started to show high iowait without doing any IOPS.
> Today about a hundred VMs were unable to read from or write to many volumes, all of them on the HDD crush map. Ceph health was OK and no significant log entries were found.
> Not all the VMs experienced this problem, and in the meantime the IOPS on the journals and HDDs were very low, even though I was able to do significant IOPS on the working VMs.
> After two hours of debugging I decided to reboot one of the OSD nodes, and the cluster started to respond again. Now the OSD node is back in the cluster and the problem has disappeared.
> Can someone help me to understand what happened?
> I see strange entries in the log files like:
> accept replacing existing (lossy) channel (new one lossy=1)
> fault with nothing to send, going to standby
> leveldb manual compact
> I can share all the logs that can help to identify the issue.
> Thank you.