[ceph-users] Cluster hang (deep scrub bug? "waiting for scrub")

Matteo Dacrema mdacrema at enter.eu
Mon Nov 13 00:30:29 PST 2017


I’ve seen that only once, and noticed there’s a bug fixed in 10.2.10 (http://tracker.ceph.com/issues/20041).
Yes I use snapshots.

In my case the PG had been scrubbing for 20 days, but I only have 7 days of logs, so I’m not able to identify the affected PG.
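(One way to spot a long-running scrub without the logs is to ask the cluster directly, e.g. `ceph pg dump --format json`, and filter on each PG’s state string. A minimal sketch; the field names `pg_stats`, `pgid` and `state` are what Jewel emits, but treat the exact JSON layout as an assumption since it varies by release:

```python
import json

def scrubbing_pgs(pg_dump_json):
    # Keep pg_stats entries whose state string mentions a scrub,
    # e.g. "active+clean+scrubbing+deep".
    stats = json.loads(pg_dump_json).get("pg_stats", [])
    return [pg["pgid"] for pg in stats if "scrub" in pg["state"]]

# Hypothetical sample shaped like `ceph pg dump --format json` on Jewel:
sample = json.dumps({"pg_stats": [
    {"pgid": "4.1a", "state": "active+clean+scrubbing+deep"},
    {"pgid": "4.2b", "state": "active+clean"},
]})
print(scrubbing_pgs(sample))  # ['4.1a']
```

A PG that shows up in that list across many runs, days apart, would be the suspect.)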


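(The `dump_blocked_ops` output Peter quotes below can also be scanned programmatically, one OSD at a time. A rough sketch, assuming the ops sit under an `ops` key and the event list is the third element of `type_data`, as in his paste:

```python
import json

def ops_waiting_for_scrub(blocked_ops_json):
    # Return (description, age) for every blocked op whose most recent
    # event is "waiting for scrub".
    stuck = []
    for op in json.loads(blocked_ops_json).get("ops", []):
        events = op["type_data"][2]  # ["delayed", {client, tid}, [events...]]
        if events and events[-1]["event"] == "waiting for scrub":
            stuck.append((op["description"], op["age"]))
    return stuck

# Trimmed-down sample mirroring the quoted output:
sample = json.dumps({"ops": [{
    "description": "osd_op(client.6480719.0:2000419292 ...)",
    "age": 49315.666393,
    "type_data": [
        "delayed",
        {"client": "client.6480719", "tid": 2000419292},
        [{"time": "2017-09-12 20:04:27.987814", "event": "initiated"},
         {"time": "2017-09-12 20:04:28.004219", "event": "waiting for scrub"}],
    ],
}]})
print(ops_waiting_for_scrub(sample))  # one op, stuck for ~49315 s
```

Any OSD that reports hits would be the one to restart.)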

> On 10 Nov 2017, at 14:05, Peter Maloney <peter.maloney at brockmann-consult.de> wrote:
> 
> I have often seen a problem where a single OSD stuck in an eternal deep scrub
> will hang any client trying to connect. Stopping or restarting that
> single OSD fixes the problem.
> 
> Do you use snapshots?
> 
> Here's what the scrub bug looks like (where that many seconds is 14 hours):
> 
>> ceph daemon "osd.$osd_number" dump_blocked_ops
> 
>>      {
>>          "description": "osd_op(client.6480719.0:2000419292 4.a27969ae
>> rbd_data.46820b238e1f29.000000000000aa70 [set-alloc-hint object_size
>> 524288 write_size 524288,write 0~4096] snapc 16ec0=[16ec0]
>> ack+ondisk+write+known_if_redirected e148441)",
>>          "initiated_at": "2017-09-12 20:04:27.987814",
>>          "age": 49315.666393,
>>          "duration": 49315.668515,
>>          "type_data": [
>>              "delayed",
>>              {
>>                  "client": "client.6480719",
>>                  "tid": 2000419292
>>              },
>>              [
>>                  {
>>                      "time": "2017-09-12 20:04:27.987814",
>>                      "event": "initiated"
>>                  },
>>                  {
>>                      "time": "2017-09-12 20:04:27.987862",
>>                      "event": "queued_for_pg"
>>                  },
>>                  {
>>                      "time": "2017-09-12 20:04:28.004142",
>>                      "event": "reached_pg"
>>                  },
>>                  {
>>                      "time": "2017-09-12 20:04:28.004219",
>>                      "event": "waiting for scrub"
>>                  }
>>              ]
>>          ]
>>      }
> 
> 
> 
> 
> 
> 
> On 11/09/17 17:20, Matteo Dacrema wrote:
>> Update: I noticed there was a PG that remained scrubbing from the first day I found the issue until I rebooted the node and the problem disappeared.
>> Could this cause the behaviour I described before?
>> 
>> 
>>> On 9 Nov 2017, at 15:55, Matteo Dacrema <mdacrema at enter.eu> wrote:
>>> 
>>> Hi all,
>>> 
>>> I’ve experienced a strange issue with my cluster.
>>> The cluster is composed of 10 HDD nodes with 20 HDDs + 4 journal drives each, plus 4 SSD nodes with 5 SSDs each.
>>> All the nodes are behind 3 monitors and split across 2 different crush maps.
>>> The whole cluster is on 10.2.7.
>>> 
>>> About 20 days ago I started to notice that long backups were hanging with “task jbd2/vdc1-8:555 blocked for more than 120 seconds” on the HDD crush map.
>>> A few days ago another VM started to show high iowait while doing almost no IOPS, also on the HDD crush map.
>>> 
>>> Today about a hundred VMs weren’t able to read/write from many volumes, all of them on the HDD crush map. Ceph health was OK and no significant log entries were found.
>>> Not all the VMs experienced this problem, and meanwhile IOPS on the journals and HDDs were very low even though I was able to drive significant IOPS from the working VMs.
>>> 
>>> After two hours of debugging I decided to reboot one of the OSD nodes, and the cluster started to respond again. Now the OSD node is back in the cluster and the problem has disappeared.
>>> 
>>> Can someone help me to understand what happened?
>>> I see strange entries in the log files like:
>>> 
>>> accept replacing existing (lossy) channel (new one lossy=1)
>>> fault with nothing to send, going to standby
>>> leveldb manual compact 
>>> 
>>> I can share all the logs that can help to identify the issue.
>>> 
>>> Thank you.
>>> Regards,
>>> 
>>> Matteo
>>> 
>>> 
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users at lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> 
> 
> 
> -- 
> 
> --------------------------------------------
> Peter Maloney
> Brockmann Consult
> Max-Planck-Str. 2
> 21502 Geesthacht
> Germany
> Tel: +49 4152 889 300
> Fax: +49 4152 889 333
> E-mail: peter.maloney at brockmann-consult.de
> Internet: http://www.brockmann-consult.de
> --------------------------------------------
> 
> 
> 
> 


