* All RBD IO stuck after flapping OSD's
@ 2021-04-14  8:51 Robin Geuze
  2021-04-14 17:00 ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-04-14  8:51 UTC (permalink / raw)
  To: Ceph Development

Hey,

We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSDs flapping (in our case because of a network card issue that caused the LACP bond to constantly flap), which is logged in dmesg:

Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
(...)

After a while of that, these errors start being spammed in dmesg:

Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16

(In this case for two different RBD mounts)

At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is that we use noshare. Unfortunately, unmounting the other devices is no longer possible, which means we cannot migrate our VMs to another HV, since the only way to make the messages go away is to reboot the server.
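
(For context: the images are mapped with the noshare option, i.e. something along the lines of "rbd map -o noshare <pool>/<image>", so each mapping gets its own client instance.)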

All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't; it just stays stuck, and the longer we wait before rebooting, the worse the issue gets.

We've seen this multiple times on various machines and with various clusters with differing problem types, so it's not a freak incident. Does anyone have any ideas on how we can potentially solve this?

Regards,

Robin Geuze


* Re: All RBD IO stuck after flapping OSD's
  2021-04-14  8:51 All RBD IO stuck after flapping OSD's Robin Geuze
@ 2021-04-14 17:00 ` Ilya Dryomov
       [not found]   ` <8eb12c996e404870803e9a7c77e508d6@nl.team.blue>
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-04-14 17:00 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey,
>
> We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
>
> Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> (...)
>
> After a while of that we start getting these errors being spammed in dmesg:
>
> Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
>
> (In this case for two different RBD mounts)
>
> At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate our VM's to another HV, since to make the messages go away we have to reboot the server.

Hi Robin,

Do these messages appear even if no I/O is issued to /dev/rbd14 or only
if you attempt to write?

>
> All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.

Please explain how it's getting worse.

I think the problem is that the object map isn't locked.  What
probably happened is that the kernel client lost its watch on the image
and for some reason can't get it back.  The flapping has likely
triggered some edge condition in the watch/notify code.

To confirm:

- paste the contents of /sys/bus/rbd/devices/14/client_addr

- paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
  for /dev/rbd14.  If you are using noshare, you will have multiple
  client instances with the same cluster id.  The one you need can be
  identified with /sys/bus/rbd/devices/14/client_id.

- paste the output of "rbd status <rbd14 image>" (image name can be
  identified from "rbd showmapped")

I'm also curious who actually has the lock on the header object and the
object map object.  Paste the output of

$ ID=$(rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
$ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
$ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
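
If it's easier, the sysfs/debugfs bits can be collected in one go with
something like this (just a sketch; the debugfs glob assumes the
client_id value matches the suffix of the directory name):

$ cat /sys/bus/rbd/devices/14/client_addr
$ cat /sys/kernel/debug/ceph/*.$(cat /sys/bus/rbd/devices/14/client_id)/osdc
$ rbd showmapped          # to find the pool/image behind /dev/rbd14
$ rbd status <rbd14 pool>/<rbd14 image>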

Thanks,

                Ilya


* Re: All RBD IO stuck after flapping OSD's
       [not found]   ` <8eb12c996e404870803e9a7c77e508d6@nl.team.blue>
@ 2021-04-19 12:40     ` Ilya Dryomov
  2021-06-16 11:56       ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-04-19 12:40 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.

No, I don't see anything to go on there.  Next time, enable logging for
both libceph and rbd modules and make sure that at least one instance of
the error (i.e. "pre object map update failed: -16") makes it into the
attached log.
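
For example, with dynamic debug (assuming debugfs is mounted at
/sys/kernel/debug):

$ echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
$ echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control

(and the same with '-p' to turn it back off afterwards).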

>
> Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
>
> As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).

One obvious workaround is to unmap, disable the object-map and
exclusive-lock features with "rbd feature disable", and map back.  You
would lose the benefits of object map, but if it is affecting customer
workloads it is probably the best course of action until this thing is
root caused.
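
Roughly something like this (a sketch; substitute the real pool/image,
and note that fast-diff, if enabled, has to go together with object-map):

$ rbd unmap /dev/rbd14
$ rbd feature disable <rbd14 pool>/<rbd14 image> fast-diff object-map exclusive-lock
$ rbd map <rbd14 pool>/<rbd14 image>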

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 14 April 2021 19:00:20
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey,
> >
> > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> >
> > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > (...)
> >
> > After a while of that we start getting these errors being spammed in dmesg:
> >
> > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> >
> > (In this case for two different RBD mounts)
> >
> > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate our  VM's to another HV, since to make the messages go away we have to reboot the server.
>
> Hi Robin,
>
> Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> if you attempt to write?
>
> >
> > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
>
> Please explain how it's getting worse.
>
> I think the problem is that the object map isn't locked.  What
> probably happened is the kernel client lost its watch on the image
> and for some reason can't get it back.   The flapping has likely
> trigged some edge condition in the watch/notify code.
>
> To confirm:
>
> - paste the contents of /sys/bus/rbd/devices/14/client_addr
>
> - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
>   for /dev/rbd14.  If you are using noshare, you will have multiple
>   client instances with the same cluster id.  The one you need can be
>   identified with /sys/bus/rbd/devices/14/client_id.
>
> - paste the output of "rbd status <rbd14 image>" (image name can be
>   identified from "rbd showmapped")
>
> I'm also curious who actually has the lock on the header object and the
> object map object.  Paste the output of
>
> $ ID=$(bin/rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
>
> Thanks,
>
>                 Ilya
>


* Re: All RBD IO stuck after flapping OSD's
  2021-04-19 12:40     ` Ilya Dryomov
@ 2021-06-16 11:56       ` Robin Geuze
  2021-06-17  8:36         ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-06-16 11:56 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. It's really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2

We also got some stack traces; those are in there as well.

The way we reproduce it is that on one of the two ceph machines in the cluster (it's a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds, then put them back down, wait another 40 seconds and then put them back up.

Exact command line I used on the ceph machine:
ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
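
Or, written out a bit more readably (the missing ';' after the two
"sleep 1" calls above look like a paste artifact; the intended sequence
is):

ip l set ens785f1 down; sleep 1; ip l set ens785f0 down
sleep 40
ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
sleep 15
ip l set ens785f1 down; sleep 1; ip l set ens785f0 down
sleep 40
ip l set ens785f1 up; sleep 5; ip l set ens785f0 up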

Regards,

Robin Geuze 
  
From: Ilya Dryomov <idryomov@gmail.com>
Sent: 19 April 2021 14:40:00
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.

No, I don't see anything to go on there.  Next time, enable logging for
both libceph and rbd modules and make sure that at least one instance of
the error (i.e. "pre object map update failed: -16") makes it into the
attached log.

>
> Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
>
> As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).

One obvious workaround workaround is to unmap, disable object-map and
exclusive-lock features with "rbd feature disable", and map back.  You
would lose the benefits of object map, but if it is affecting customer
workloads it is probably the best course of action until this thing is
root caused.

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 14 April 2021 19:00:20
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey,
> >
> > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> >
> > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > (...)
> >
> > After a while of that we start getting these errors being spammed in dmesg:
> >
> > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> >
> > (In this case for two different RBD mounts)
> >
> > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate  our  VM's to another HV, since to make the messages go away we have to reboot the server.
>
> Hi Robin,
>
> Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> if you attempt to write?
>
> >
> > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
>
> Please explain how it's getting worse.
>
> I think the problem is that the object map isn't locked.  What
> probably happened is the kernel client lost its watch on the image
> and for some reason can't get it back.   The flapping has likely
> trigged some edge condition in the watch/notify code.
>
> To confirm:
>
> - paste the contents of /sys/bus/rbd/devices/14/client_addr
>
> - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
>   for /dev/rbd14.  If you are using noshare, you will have multiple
>   client instances with the same cluster id.  The one you need can be
>   identified with /sys/bus/rbd/devices/14/client_id.
>
> - paste the output of "rbd status <rbd14 image>" (image name can be
>   identified from "rbd showmapped")
>
> I'm also curious who actually has the lock on the header object and the
> object map object.  Paste the output of
>
> $ ID=$(bin/rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
>
> Thanks,
>
>                 Ilya
>
    


* Re: All RBD IO stuck after flapping OSD's
  2021-06-16 11:56       ` Robin Geuze
@ 2021-06-17  8:36         ` Ilya Dryomov
  2021-06-17  8:42           ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-06-17  8:36 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. Its really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
>
> We also got some stack traces those are in there as well.
>
> The way we reproduce it is that on one of the two ceph machines in the cluster (its a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds and  then put them back up.
>
> Exact command line I used on the ceph machine:
> ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up

Hi Robin,

This looks very similar to https://tracker.ceph.com/issues/42757.
I don't see the offending writer thread among stuck threads in
stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
a short 13-second window, so it's not there either -- the problem, at
least the one I'm suspecting, would have occurred before 13:00:00).

If you can reproduce reliably, try again without verbose logging but
do capture all stack traces -- once the system locks up, let it stew
for ten minutes and attach "blocked for more than X seconds" splats.

Additionally, an "echo w >/proc/sysrq-trigger" dump would be good if
SysRq is not disabled on your servers.
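
For example (assuming the kernel.sysrq sysctl allows it):

$ echo w > /proc/sysrq-trigger
$ dmesg -T > blocked-tasks.txt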

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 19 April 2021 14:40:00
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.
>
> No, I don't see anything to go on there.  Next time, enable logging for
> both libceph and rbd modules and make sure that at least one instance of
> the error (i.e. "pre object map update failed: -16") makes it into the
> attached log.
>
> >
> > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> >
> > As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).
>
> One obvious workaround workaround is to unmap, disable object-map and
> exclusive-lock features with "rbd feature disable", and map back.  You
> would lose the benefits of object map, but if it is affecting customer
> workloads it is probably the best course of action until this thing is
> root caused.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 14 April 2021 19:00:20
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey,
> > >
> > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > >
> > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > (...)
> > >
> > > After a while of that we start getting these errors being spammed in dmesg:
> > >
> > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > >
> > > (In this case for two different RBD mounts)
> > >
> > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate  our  VM's to another HV, since to make the messages go away we have to reboot the server.
> >
> > Hi Robin,
> >
> > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > if you attempt to write?
> >
> > >
> > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> >
> > Please explain how it's getting worse.
> >
> > I think the problem is that the object map isn't locked.  What
> > probably happened is the kernel client lost its watch on the image
> > and for some reason can't get it back.   The flapping has likely
> > trigged some edge condition in the watch/notify code.
> >
> > To confirm:
> >
> > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> >
> > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> >   for /dev/rbd14.  If you are using noshare, you will have multiple
> >   client instances with the same cluster id.  The one you need can be
> >   identified with /sys/bus/rbd/devices/14/client_id.
> >
> > - paste the output of "rbd status <rbd14 image>" (image name can be
> >   identified from "rbd showmapped")
> >
> > I'm also curious who actually has the lock on the header object and the
> > object map object.  Paste the output of
> >
> > $ ID=$(bin/rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
> >
> > Thanks,
> >
> >                 Ilya
> >
>


* Re: All RBD IO stuck after flapping OSD's
  2021-06-17  8:36         ` Ilya Dryomov
@ 2021-06-17  8:42           ` Robin Geuze
  2021-06-17  9:40             ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-06-17  8:42 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

We triggered the issue at roughly 13:05, so the problem cannot have occurred before 13:00.

We've also (in the wild, haven't reproduced that exact case yet) seen this occur without any stack traces or stuck threads. The only "common" factor is that we see the watch errors, always at least twice within 1 or 2 minutes when it's broken.

Regards,

Robin Geuze
  
From: Ilya Dryomov <idryomov@gmail.com>
Sent: 17 June 2021 10:36:33
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. Its really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
>
> We also got some stack traces those are in there as well.
>
> The way we reproduce it is that on one of the two ceph machines in the cluster (its a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds and   then put them back up.
>
> Exact command line I used on the ceph machine:
> ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up

Hi Robin,

This looks very similar to https://tracker.ceph.com/issues/42757.
I don't see the offending writer thread among stuck threads in
stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
a short 13-second period of time so it's not there either because the
problem, at least the one I'm suspecting, would have occurred before
13:00:00).

If you can reproduce reliably, try again without verbose logging but
do capture all stack traces -- once the system locks up, let it stew
for ten minutes and attach "blocked for more than X seconds" splats.

Additionally, a "echo w >/proc/sysrq-trigger" dump would be good if
SysRq is not disabled on your servers.

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 19 April 2021 14:40:00
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.
>
> No, I don't see anything to go on there.  Next time, enable logging for
> both libceph and rbd modules and make sure that at least one instance of
> the error (i.e. "pre object map update failed: -16") makes it into the
> attached log.
>
> >
> > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> >
> > As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).
>
> One obvious workaround workaround is to unmap, disable object-map and
> exclusive-lock features with "rbd feature disable", and map back.  You
> would lose the benefits of object map, but if it is affecting customer
> workloads it is probably the best course of action until this thing is
> root caused.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 14 April 2021 19:00:20
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey,
> > >
> > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > >
> > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > (...)
> > >
> > > After a while of that we start getting these errors being spammed in dmesg:
> > >
> > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > >
> > > (In this case for two different RBD mounts)
> > >
> > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate   our  VM's to another HV, since to make the messages go away we have to reboot the server.
> >
> > Hi Robin,
> >
> > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > if you attempt to write?
> >
> > >
> > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> >
> > Please explain how it's getting worse.
> >
> > I think the problem is that the object map isn't locked.  What
> > probably happened is the kernel client lost its watch on the image
> > and for some reason can't get it back.   The flapping has likely
> > trigged some edge condition in the watch/notify code.
> >
> > To confirm:
> >
> > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> >
> > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> >   for /dev/rbd14.  If you are using noshare, you will have multiple
> >   client instances with the same cluster id.  The one you need can be
> >   identified with /sys/bus/rbd/devices/14/client_id.
> >
> > - paste the output of "rbd status <rbd14 image>" (image name can be
> >   identified from "rbd showmapped")
> >
> > I'm also curious who actually has the lock on the header object and the
> > object map object.  Paste the output of
> >
> > $ ID=$(bin/rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
> >
> > Thanks,
> >
> >                 Ilya
> >
>
    


* Re: All RBD IO stuck after flapping OSD's
  2021-06-17  8:42           ` Robin Geuze
@ 2021-06-17  9:40             ` Ilya Dryomov
  2021-06-17 10:17               ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-06-17  9:40 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Thu, Jun 17, 2021 at 10:42 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We triggered the issue at roughly 13:05, so the problem cannot have occurred before 13:00.
>
> We've also (in the wild, haven't reproduced that exact case yet) seen this occur without any stacktraces or stuck threads. The only "common" factor is that we see the watch errors, always at least twice within 1 or 2 minutes if its broken.

Ah, I guess I got confused by timestamps in stuck_kthreads.md.
I grepped for "pre object map update" errors that you reported
initially and didn't see any.

With any sort of networking issues, watch errors are expected.

I'll take a deeper look at syslog_stuck_krbd_shrinked.

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 17 June 2021 10:36:33
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. Its really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
> >
> > We also got some stack traces those are in there as well.
> >
> > The way we reproduce it is that on one of the two ceph machines in the cluster (its a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds and   then put them back up.
> >
> > Exact command line I used on the ceph machine:
> > ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
>
> Hi Robin,
>
> This looks very similar to https://tracker.ceph.com/issues/42757.
> I don't see the offending writer thread among stuck threads in
> stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
> a short 13-second period of time so it's not there either because the
> problem, at least the one I'm suspecting, would have occurred before
> 13:00:00).
>
> If you can reproduce reliably, try again without verbose logging but
> do capture all stack traces -- once the system locks up, let it stew
> for ten minutes and attach "blocked for more than X seconds" splats.
>
> Additionally, a "echo w >/proc/sysrq-trigger" dump would be good if
> SysRq is not disabled on your servers.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 19 April 2021 14:40:00
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey Ilya,
> > >
> > > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.
> >
> > No, I don't see anything to go on there.  Next time, enable logging for
> > both libceph and rbd modules and make sure that at least one instance of
> > the error (i.e. "pre object map update failed: -16") makes it into the
> > attached log.
> >
> > >
> > > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> > >
> > > As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).
> >
> > One obvious workaround workaround is to unmap, disable object-map and
> > exclusive-lock features with "rbd feature disable", and map back.  You
> > would lose the benefits of object map, but if it is affecting customer
> > workloads it is probably the best course of action until this thing is
> > root caused.
> >
> > Thanks,
> >
> >                 Ilya
> >
> > >
> > > Regards,
> > >
> > > Robin Geuze
> > >
> > > From: Ilya Dryomov <idryomov@gmail.com>
> > > Sent: 14 April 2021 19:00:20
> > > To: Robin Geuze
> > > Cc: Ceph Development
> > > Subject: Re: All RBD IO stuck after flapping OSD's
> > >
> > > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > >
> > > > Hey,
> > > >
> > > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > > >
> > > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > > (...)
> > > >
> > > > After a while of that we start getting these errors being spammed in dmesg:
> > > >
> > > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > > >
> > > > (In this case for two different RBD mounts)
> > > >
> > > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate   our  VM's to another HV, since to make the messages go away we have to reboot the server.
> > >
> > > Hi Robin,
> > >
> > > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > > if you attempt to write?
> > >
> > > >
> > > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> > >
> > > Please explain how it's getting worse.
> > >
> > > I think the problem is that the object map isn't locked.  What
> > > probably happened is the kernel client lost its watch on the image
> > > and for some reason can't get it back.   The flapping has likely
> > > trigged some edge condition in the watch/notify code.
> > >
> > > To confirm:
> > >
> > > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> > >
> > > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> > >   for /dev/rbd14.  If you are using noshare, you will have multiple
> > >   client instances with the same cluster id.  The one you need can be
> > >   identified with /sys/bus/rbd/devices/14/client_id.
> > >
> > > - paste the output of "rbd status <rbd14 image>" (image name can be
> > >   identified from "rbd showmapped")
> > >
> > > I'm also curious who actually has the lock on the header object and the
> > > object map object.  Paste the output of
> > >
> > > $ ID=$(bin/rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> >
>


* Re: All RBD IO stuck after flapping OSD's
  2021-06-17  9:40             ` Ilya Dryomov
@ 2021-06-17 10:17               ` Robin Geuze
  2021-06-17 11:09                 ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-06-17 10:17 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

We've added some extra debug info to the fileshare from before, including the sysrq-trigger output.

Regards,

Robin Geuze
  
From: Ilya Dryomov <idryomov@gmail.com>
Sent: 17 June 2021 11:40:54
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Thu, Jun 17, 2021 at 10:42 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We triggered the issue at roughly 13:05, so the problem cannot have occurred before 13:00.
>
> We've also (in the wild, haven't reproduced that exact case yet) seen this occur without any stacktraces or stuck threads. The only "common" factor is that we see the watch errors, always at least twice within 1 or 2 minutes if its broken.

Ah, I guess I got confused by timestamps in stuck_kthreads.md.
I grepped for "pre object map update" errors that you reported
initially and didn't see any.

With any sort of networking issues, watch errors are expected.

I'll take a deeper look at syslog_stuck_krbd_shrinked.

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 17 June 2021 10:36:33
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. Its really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
> >
> > We also got some stack traces those are in there as well.
> >
> > The way we reproduce it is that on one of the two ceph machines in the cluster (its a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds  and   then put them back up.
> >
> > Exact command line I used on the ceph machine:
> > ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
>
> Hi Robin,
>
> This looks very similar to https://tracker.ceph.com/issues/42757.
> I don't see the offending writer thread among stuck threads in
> stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
> a short 13-second period of time so it's not there either because the
> problem, at least the one I'm suspecting, would have occurred before
> 13:00:00).
>
> If you can reproduce reliably, try again without verbose logging but
> do capture all stack traces -- once the system locks up, let it stew
> for ten minutes and attach "blocked for more than X seconds" splats.
>
> Additionally, a "echo w >/proc/sysrq-trigger" dump would be good if
> SysRq is not disabled on your servers.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 19 April 2021 14:40:00
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey Ilya,
> > >
> > > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.
> >
> > No, I don't see anything to go on there.  Next time, enable logging for
> > both libceph and rbd modules and make sure that at least one instance of
> > the error (i.e. "pre object map update failed: -16") makes it into the
> > attached log.
> >
> > >
> > > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> > >
> > > As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).
> >
> > One obvious workaround workaround is to unmap, disable object-map and
> > exclusive-lock features with "rbd feature disable", and map back.  You
> > would lose the benefits of object map, but if it is affecting customer
> > workloads it is probably the best course of action until this thing is
> > root caused.
> >
> > Thanks,
> >
> >                 Ilya
> >
> > >
> > > Regards,
> > >
> > > Robin Geuze
> > >
> > > From: Ilya Dryomov <idryomov@gmail.com>
> > > Sent: 14 April 2021 19:00:20
> > > To: Robin Geuze
> > > Cc: Ceph Development
> > > Subject: Re: All RBD IO stuck after flapping OSD's
> > >
> > > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > >
> > > > Hey,
> > > >
> > > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > > >
> > > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > > (...)
> > > >
> > > > After a while of that we start getting these errors being spammed in dmesg:
> > > >
> > > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > > >
> > > > (In this case for two different RBD mounts)
> > > >
> > > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate    our  VM's to another HV, since to make the messages go away we have to reboot the server.
> > >
> > > Hi Robin,
> > >
> > > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > > if you attempt to write?
> > >
> > > >
> > > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> > >
> > > Please explain how it's getting worse.
> > >
> > > I think the problem is that the object map isn't locked.  What
> > > probably happened is the kernel client lost its watch on the image
> > > and for some reason can't get it back.   The flapping has likely
> > > trigged some edge condition in the watch/notify code.
> > >
> > > To confirm:
> > >
> > > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> > >
> > > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> > >   for /dev/rbd14.  If you are using noshare, you will have multiple
> > >   client instances with the same cluster id.  The one you need can be
> > >   identified with /sys/bus/rbd/devices/14/client_id.
> > >
> > > - paste the output of "rbd status <rbd14 image>" (image name can be
> > >   identified from "rbd showmapped")
> > >
> > > I'm also curious who actually has the lock on the header object and the
> > > object map object.  Paste the output of
> > >
> > > $ ID=$(bin/rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> >
>
    


* Re: All RBD IO stuck after flapping OSD's
  2021-06-17 10:17               ` Robin Geuze
@ 2021-06-17 11:09                 ` Ilya Dryomov
  2021-06-17 11:12                   ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-06-17 11:09 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Thu, Jun 17, 2021 at 12:17 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We've added some extra debug info to the fileshare from before, including the sysrq-trigger output.

Yup, seems to be exactly https://tracker.ceph.com/issues/42757.
Here are the relevant tasks listed in the same order as in the
ticket (you have two tasks in ceph_con_workfn() instead of one):

kworker/5:1     D    0 161820      2 0x80004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 rbd_img_handle_request+0x40/0x1a0 [rbd]
 ? __rbd_obj_handle_request+0x61/0x2f0 [rbd]
 rbd_obj_handle_request+0x34/0x40 [rbd]
 rbd_osd_req_callback+0x44/0x80 [rbd]
 __complete_request+0x28/0x80 [libceph]
 handle_reply+0x2b6/0x460 [libceph]
 ? ceph_crypt+0x1d/0x30 [libceph]
 ? calc_signature+0xdf/0x100 [libceph]
 ? ceph_x_check_message_signature+0x5e/0xd0 [libceph]
 dispatch+0x34/0xb0 [libceph]
 ? dispatch+0x34/0xb0 [libceph]
 try_read+0x566/0x8c0 [libceph]
 ceph_con_workfn+0x130/0x620 [libceph]
 ? __queue_delayed_work+0x8a/0x90
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
kworker/26:1    D    0 226056      2 0x80004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 rbd_img_handle_request+0x40/0x1a0 [rbd]
 ? __rbd_obj_handle_request+0x61/0x2f0 [rbd]
 rbd_obj_handle_request+0x34/0x40 [rbd]
 rbd_osd_req_callback+0x44/0x80 [rbd]
 __complete_request+0x28/0x80 [libceph]
 handle_reply+0x2b6/0x460 [libceph]
 ? ceph_crypt+0x1d/0x30 [libceph]
 ? calc_signature+0xdf/0x100 [libceph]
 ? ceph_x_check_message_signature+0x5e/0xd0 [libceph]
 dispatch+0x34/0xb0 [libceph]
 ? dispatch+0x34/0xb0 [libceph]
 try_read+0x566/0x8c0 [libceph]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to+0x7f/0x470
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ceph_con_workfn+0x130/0x620 [libceph]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

kworker/u112:2  D    0 277829      2 0x80004000
Workqueue: rbd3-tasks rbd_reregister_watch [rbd]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 ? wait_for_completion_interruptible+0xb8/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 rbd_quiesce_lock+0xa1/0xe0 [rbd]
 rbd_reregister_watch+0x109/0x1b0 [rbd]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

kworker/u112:3  D    0 284466      2 0x80004000
Workqueue: ceph-watch-notify do_watch_error [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 ? wake_up_klogd.part.0+0x34/0x40
 ? sched_clock+0x9/0x10
 schedule+0x42/0xb0
 rwsem_down_write_slowpath+0x244/0x4d0
 down_write+0x41/0x50
 rbd_watch_errcb+0x2a/0x92 [rbd]
 do_watch_error+0x41/0xc0 [libceph]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

Not your original issue but closely related since it revolves around
exclusive-lock (which object-map depends on) and watches.

Would you be able to install a custom kernel on this node to test the
fix once I have it?

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 17 June 2021 11:40:54
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Thu, Jun 17, 2021 at 10:42 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > We triggered the issue at roughly 13:05, so the problem cannot have occurred before 13:00.
> >
> > We've also (in the wild, haven't reproduced that exact case yet) seen this occur without any stacktraces or stuck threads. The only "common" factor is that we see the watch errors, always at least twice within 1 or 2 minutes if its broken.
>
> Ah, I guess I got confused by timestamps in stuck_kthreads.md.
> I grepped for "pre object map update" errors that you reported
> initially and didn't see any.
>
> With any sort of networking issues, watch errors are expected.
>
> I'll take a deeper look at syslog_stuck_krbd_shrinked.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 17 June 2021 10:36:33
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey Ilya,
> > >
> > > Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. Its really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
> > >
> > > We also got some stack traces those are in there as well.
> > >
> > > The way we reproduce it is that on one of the two ceph machines in the cluster (its a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds  and   then put them back up.
> > >
> > > Exact command line I used on the ceph machine:
> > > ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1 ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
> >
> > Hi Robin,
> >
> > This looks very similar to https://tracker.ceph.com/issues/42757.
> > I don't see the offending writer thread among stuck threads in
> > stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
> > a short 13-second period of time so it's not there either because the
> > problem, at least the one I'm suspecting, would have occurred before
> > 13:00:00).
> >
> > If you can reproduce reliably, try again without verbose logging but
> > do capture all stack traces -- once the system locks up, let it stew
> > for ten minutes and attach "blocked for more than X seconds" splats.
> >
> > Additionally, a "echo w >/proc/sysrq-trigger" dump would be good if
> > SysRq is not disabled on your servers.
> >
> > Thanks,
> >
> >                 Ilya
> >
> > >
> > > Regards,
> > >
> > > Robin Geuze
> > >
> > > From: Ilya Dryomov <idryomov@gmail.com>
> > > Sent: 19 April 2021 14:40:00
> > > To: Robin Geuze
> > > Cc: Ceph Development
> > > Subject: Re: All RBD IO stuck after flapping OSD's
> > >
> > > On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > >
> > > > Hey Ilya,
> > > >
> > > > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurence, maybe that can help? I've attached it to this email.
> > >
> > > No, I don't see anything to go on there.  Next time, enable logging for
> > > both libceph and rbd modules and make sure that at least one instance of
> > > the error (i.e. "pre object map update failed: -16") makes it into the
> > > attached log.
> > >
> > > >
> > > > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> > > >
> > > > As for how its getting worse, if you try any management operations (eg unmap) on any of the RBD mounts that aren't affected, they hang and more often than not the IO for that one also stalls (not always though).
> > >
> > > One obvious workaround workaround is to unmap, disable object-map and
> > > exclusive-lock features with "rbd feature disable", and map back.  You
> > > would lose the benefits of object map, but if it is affecting customer
> > > workloads it is probably the best course of action until this thing is
> > > root caused.
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> > > >
> > > > Regards,
> > > >
> > > > Robin Geuze
> > > >
> > > > From: Ilya Dryomov <idryomov@gmail.com>
> > > > Sent: 14 April 2021 19:00:20
> > > > To: Robin Geuze
> > > > Cc: Ceph Development
> > > > Subject: Re: All RBD IO stuck after flapping OSD's
> > > >
> > > > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > > >
> > > > > Hey,
> > > > >
> > > > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > > > >
> > > > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > > > (...)
> > > > >
> > > > > After a while of that we start getting these errors being spammed in dmesg:
> > > > >
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > > > >
> > > > > (In this case for two different RBD mounts)
> > > > >
> > > > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate    our  VM's to another HV, since to make the messages go away we have to reboot the server.
> > > >
> > > > Hi Robin,
> > > >
> > > > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > > > if you attempt to write?
> > > >
> > > > >
> > > > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> > > >
> > > > Please explain how it's getting worse.
> > > >
> > > > I think the problem is that the object map isn't locked.  What
> > > > probably happened is the kernel client lost its watch on the image
> > > > and for some reason can't get it back.   The flapping has likely
> > > > trigged some edge condition in the watch/notify code.
> > > >
> > > > To confirm:
> > > >
> > > > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> > > >
> > > > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> > > >   for /dev/rbd14.  If you are using noshare, you will have multiple
> > > >   client instances with the same cluster id.  The one you need can be
> > > >   identified with /sys/bus/rbd/devices/14/client_id.
> > > >
> > > > - paste the output of "rbd status <rbd14 image>" (image name can be
> > > >   identified from "rbd showmapped")
> > > >
> > > > I'm also curious who actually has the lock on the header object and the
> > > > object map object.  Paste the output of
> > > >
> > > > $ ID=$(rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > > > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > > > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
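
A related check, with the same placeholders, is whether any client still holds a watch on the header object at all; rados can list the watchers directly:

$ rados -p <rbd14 pool> listwatchers rbd_header.$ID
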
> > > >
> > > > Thanks,
> > > >
> > > >                 Ilya
> > > >
> > >
> >
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-06-17 11:09                 ` Ilya Dryomov
@ 2021-06-17 11:12                   ` Robin Geuze
  2021-06-29  8:39                     ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-06-17 11:12 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

Yes, we can install a custom kernel, or we can apply a patch to the current 5.4 kernel if you prefer (we have a build pipeline for the Ubuntu kernel set up, so it's not a lot of effort).

Regards,

Robin Geuze

From: Ilya Dryomov <idryomov@gmail.com>
Sent: 17 June 2021 13:09
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Thu, Jun 17, 2021 at 12:17 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We've added some extra debug info to the fileshare from before, including the sysrq-trigger output.

Yup, seems to be exactly https://tracker.ceph.com/issues/42757.
Here are the relevant tasks listed in the same order as in the
ticket (you have two tasks in ceph_con_workfn() instead of one):

kworker/5:1     D    0 161820      2 0x80004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 rbd_img_handle_request+0x40/0x1a0 [rbd]
 ? __rbd_obj_handle_request+0x61/0x2f0 [rbd]
 rbd_obj_handle_request+0x34/0x40 [rbd]
 rbd_osd_req_callback+0x44/0x80 [rbd]
 __complete_request+0x28/0x80 [libceph]
 handle_reply+0x2b6/0x460 [libceph]
 ? ceph_crypt+0x1d/0x30 [libceph]
 ? calc_signature+0xdf/0x100 [libceph]
 ? ceph_x_check_message_signature+0x5e/0xd0 [libceph]
 dispatch+0x34/0xb0 [libceph]
 ? dispatch+0x34/0xb0 [libceph]
 try_read+0x566/0x8c0 [libceph]
 ceph_con_workfn+0x130/0x620 [libceph]
 ? __queue_delayed_work+0x8a/0x90
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
kworker/26:1    D    0 226056      2 0x80004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 rbd_img_handle_request+0x40/0x1a0 [rbd]
 ? __rbd_obj_handle_request+0x61/0x2f0 [rbd]
 rbd_obj_handle_request+0x34/0x40 [rbd]
 rbd_osd_req_callback+0x44/0x80 [rbd]
 __complete_request+0x28/0x80 [libceph]
 handle_reply+0x2b6/0x460 [libceph]
 ? ceph_crypt+0x1d/0x30 [libceph]
 ? calc_signature+0xdf/0x100 [libceph]
 ? ceph_x_check_message_signature+0x5e/0xd0 [libceph]
 dispatch+0x34/0xb0 [libceph]
 ? dispatch+0x34/0xb0 [libceph]
 try_read+0x566/0x8c0 [libceph]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to+0x7f/0x470
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ceph_con_workfn+0x130/0x620 [libceph]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

kworker/u112:2  D    0 277829      2 0x80004000
Workqueue: rbd3-tasks rbd_reregister_watch [rbd]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 ? wait_for_completion_interruptible+0xb8/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 rbd_quiesce_lock+0xa1/0xe0 [rbd]
 rbd_reregister_watch+0x109/0x1b0 [rbd]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

kworker/u112:3  D    0 284466      2 0x80004000
Workqueue: ceph-watch-notify do_watch_error [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 ? wake_up_klogd.part.0+0x34/0x40
 ? sched_clock+0x9/0x10
 schedule+0x42/0xb0
 rwsem_down_write_slowpath+0x244/0x4d0
 down_write+0x41/0x50
 rbd_watch_errcb+0x2a/0x92 [rbd]
 do_watch_error+0x41/0xc0 [libceph]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

Not your original issue but closely related since it revolves around
exclusive-lock (which object-map depends on) and watches.

Would you be able to install a custom kernel on this node to test the
fix once I have it?

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 17 June 2021 11:40:54
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Thu, Jun 17, 2021 at 10:42 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > We triggered the issue at roughly 13:05, so the problem cannot have occurred before 13:00.
> >
> > We've also (in the wild, haven't reproduced that exact case yet) seen this occur without any stack traces or stuck threads. The only "common" factor is that we see the watch errors, always at least twice within 1 or 2 minutes when it's broken.
>
> Ah, I guess I got confused by timestamps in stuck_kthreads.md.
> I grepped for "pre object map update" errors that you reported
> initially and didn't see any.
>
> With any sort of networking issues, watch errors are expected.
>
> I'll take a deeper look at syslog_stuck_krbd_shrinked.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 17 June 2021 10:36:33
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey Ilya,
> > >
> > > Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. It's really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
> > >
> > > We also got some stack traces; those are in there as well.
> > >
> > > The way we reproduce it is that on one of the two ceph machines in the cluster (it's a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds and then put them back up.
> > >
> > > Exact command line I used on the ceph machine:
> > > ip l set ens785f1 down; sleep 1; ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1; ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
> >
> > Hi Robin,
> >
> > This looks very similar to https://tracker.ceph.com/issues/42757.
> > I don't see the offending writer thread among stuck threads in
> > stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
> > a short 13-second period of time so it's not there either because the
> > problem, at least the one I'm suspecting, would have occurred before
> > 13:00:00).
> >
> > If you can reproduce reliably, try again without verbose logging but
> > do capture all stack traces -- once the system locks up, let it stew
> > for ten minutes and attach "blocked for more than X seconds" splats.
> >
> > Additionally, a "echo w >/proc/sysrq-trigger" dump would be good if
> > SysRq is not disabled on your servers.
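
As a concrete sketch of capturing that (run as root; enabling SysRq globally with 1 is just one option, a more restrictive bitmask also works):

$ echo 1 > /proc/sys/kernel/sysrq                 # make sure SysRq is enabled
$ echo w > /proc/sysrq-trigger                    # dump all blocked (D state) tasks to the kernel log
$ dmesg | grep -A 30 "blocked for more than"      # any hung-task splats that have fired so far
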
> >
> > Thanks,
> >
> >                 Ilya
> >
> > >
> > > Regards,
> > >
> > > Robin Geuze
> > >
> > > From: Ilya Dryomov <idryomov@gmail.com>
> > > Sent: 19 April 2021 14:40:00
> > > To: Robin Geuze
> > > Cc: Ceph Development
> > > Subject: Re: All RBD IO stuck after flapping OSD's
> > >
> > > On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > >
> > > > Hey Ilya,
> > > >
> > > > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurrence; maybe that can help? I've attached it to this email.
> > >
> > > No, I don't see anything to go on there.  Next time, enable logging for
> > > both libceph and rbd modules and make sure that at least one instance of
> > > the error (i.e. "pre object map update failed: -16") makes it into the
> > > attached log.
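
Both modules log through dynamic debug, so enabling their messages is typically a matter of the following (assuming debugfs is mounted at /sys/kernel/debug; the same lines with -p turn the logging back off):

$ echo 'module libceph +p' > /sys/kernel/debug/dynamic_debug/control
$ echo 'module rbd +p' > /sys/kernel/debug/dynamic_debug/control
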
> > >
> > > >
> > > > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> > > >
> > > > As for how it's getting worse: if you try any management operations (e.g. unmap) on any of the RBD mounts that aren't affected, they hang, and more often than not the IO for that one also stalls (not always, though).
> > >
> > > One obvious workaround is to unmap, disable object-map and
> > > exclusive-lock features with "rbd feature disable", and map back.  You
> > > would lose the benefits of object map, but if it is affecting customer
> > > workloads it is probably the best course of action until this thing is
> > > root caused.
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> > > >
> > > > Regards,
> > > >
> > > > Robin Geuze
> > > >
> > > > From: Ilya Dryomov <idryomov@gmail.com>
> > > > Sent: 14 April 2021 19:00:20
> > > > To: Robin Geuze
> > > > Cc: Ceph Development
> > > > Subject: Re: All RBD IO stuck after flapping OSD's
> > > >
> > > > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > > >
> > > > > Hey,
> > > > >
> > > > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > > > >
> > > > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > > > (...)
> > > > >
> > > > > After a while of that we start getting these errors being spammed in dmesg:
> > > > >
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > > > >
> > > > > (In this case for two different RBD mounts)
> > > > >
> > > > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate our VM's to another HV, since to make the messages go away we have to reboot the server.
> > > >
> > > > Hi Robin,
> > > >
> > > > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > > > if you attempt to write?
> > > >
> > > > >
> > > > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> > > >
> > > > Please explain how it's getting worse.
> > > >
> > > > I think the problem is that the object map isn't locked.  What
> > > > probably happened is the kernel client lost its watch on the image
> > > > and for some reason can't get it back.   The flapping has likely
> > > > triggered some edge condition in the watch/notify code.
> > > >
> > > > To confirm:
> > > >
> > > > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> > > >
> > > > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> > > >   for /dev/rbd14.  If you are using noshare, you will have multiple
> > > >   client instances with the same cluster id.  The one you need can be
> > > >   identified with /sys/bus/rbd/devices/14/client_id.
> > > >
> > > > - paste the output of "rbd status <rbd14 image>" (image name can be
> > > >   identified from "rbd showmapped")
> > > >
> > > > I'm also curious who actually has the lock on the header object and the
> > > > object map object.  Paste the output of
> > > >
> > > > $ ID=$(rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > > > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > > > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
> > > >
> > > > Thanks,
> > > >
> > > >                 Ilya
> > > >
> > >
> >
>
    

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-06-17 11:12                   ` Robin Geuze
@ 2021-06-29  8:39                     ` Robin Geuze
  2021-06-29 10:07                       ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-06-29  8:39 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

Do you have any idea on the cause of this bug yet? I tried to dig around a bit myself in the source, but the logic around this locking is very complex, so I couldn't figure out where the problem is.

Regards,

Robin Geuze
 
From: Robin Geuze
Sent: 17 June 2021 13:12:23
To: Ilya Dryomov
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
Hey Ilya,

Yes, we can install a custom kernel, or we can apply a patch to the current 5.4 kernel if you prefer (we have a build pipeline for the Ubuntu kernel set up, so it's not a lot of effort).

Regards,

Robin Geuze

From: Ilya Dryomov <idryomov@gmail.com>
Sent: 17 June 2021 13:09
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Thu, Jun 17, 2021 at 12:17 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> We've added some extra debug info to the fileshare from before, including the sysrq-trigger output.

Yup, seems to be exactly https://tracker.ceph.com/issues/42757.
Here are the relevant tasks listed in the same order as in the
ticket (you have two tasks in ceph_con_workfn() instead of one):

kworker/5:1     D    0 161820      2 0x80004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 rbd_img_handle_request+0x40/0x1a0 [rbd]
 ? __rbd_obj_handle_request+0x61/0x2f0 [rbd]
 rbd_obj_handle_request+0x34/0x40 [rbd]
 rbd_osd_req_callback+0x44/0x80 [rbd]
 __complete_request+0x28/0x80 [libceph]
 handle_reply+0x2b6/0x460 [libceph]
 ? ceph_crypt+0x1d/0x30 [libceph]
 ? calc_signature+0xdf/0x100 [libceph]
 ? ceph_x_check_message_signature+0x5e/0xd0 [libceph]
 dispatch+0x34/0xb0 [libceph]
 ? dispatch+0x34/0xb0 [libceph]
 try_read+0x566/0x8c0 [libceph]
 ceph_con_workfn+0x130/0x620 [libceph]
 ? __queue_delayed_work+0x8a/0x90
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40
kworker/26:1    D    0 226056      2 0x80004000
Workqueue: ceph-msgr ceph_con_workfn [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 rwsem_down_read_slowpath+0x16c/0x4a0
 down_read+0x85/0xa0
 rbd_img_handle_request+0x40/0x1a0 [rbd]
 ? __rbd_obj_handle_request+0x61/0x2f0 [rbd]
 rbd_obj_handle_request+0x34/0x40 [rbd]
 rbd_osd_req_callback+0x44/0x80 [rbd]
 __complete_request+0x28/0x80 [libceph]
 handle_reply+0x2b6/0x460 [libceph]
 ? ceph_crypt+0x1d/0x30 [libceph]
 ? calc_signature+0xdf/0x100 [libceph]
 ? ceph_x_check_message_signature+0x5e/0xd0 [libceph]
 dispatch+0x34/0xb0 [libceph]
 ? dispatch+0x34/0xb0 [libceph]
 try_read+0x566/0x8c0 [libceph]
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ? __switch_to_asm+0x40/0x70
 ? __switch_to+0x7f/0x470
 ? __switch_to_asm+0x40/0x70
 ? __switch_to_asm+0x34/0x70
 ceph_con_workfn+0x130/0x620 [libceph]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

kworker/u112:2  D    0 277829      2 0x80004000
Workqueue: rbd3-tasks rbd_reregister_watch [rbd]
Call Trace:
 __schedule+0x2e3/0x740
 schedule+0x42/0xb0
 schedule_timeout+0x10e/0x160
 ? wait_for_completion_interruptible+0xb8/0x160
 wait_for_completion+0xb1/0x120
 ? wake_up_q+0x70/0x70
 rbd_quiesce_lock+0xa1/0xe0 [rbd]
 rbd_reregister_watch+0x109/0x1b0 [rbd]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

kworker/u112:3  D    0 284466      2 0x80004000
Workqueue: ceph-watch-notify do_watch_error [libceph]
Call Trace:
 __schedule+0x2e3/0x740
 ? wake_up_klogd.part.0+0x34/0x40
 ? sched_clock+0x9/0x10
 schedule+0x42/0xb0
 rwsem_down_write_slowpath+0x244/0x4d0
 down_write+0x41/0x50
 rbd_watch_errcb+0x2a/0x92 [rbd]
 do_watch_error+0x41/0xc0 [libceph]
 process_one_work+0x1eb/0x3b0
 worker_thread+0x4d/0x400
 kthread+0x104/0x140
 ? process_one_work+0x3b0/0x3b0
 ? kthread_park+0x90/0x90
 ret_from_fork+0x35/0x40

Not your original issue but closely related since it revolves around
exclusive-lock (which object-map depends on) and watches.

Would you be able to install a custom kernel on this node to test the
fix once I have it?

Thanks,

                Ilya

>
> Regards,
>
> Robin Geuze
>
> From: Ilya Dryomov <idryomov@gmail.com>
> Sent: 17 June 2021 11:40:54
> To: Robin Geuze
> Cc: Ceph Development
> Subject: Re: All RBD IO stuck after flapping OSD's
>
> On Thu, Jun 17, 2021 at 10:42 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > We triggered the issue at roughly 13:05, so the problem cannot have occurred before 13:00.
> >
> > We've also (in the wild, haven't reproduced that exact case yet) seen this occur without any stack traces or stuck threads. The only "common" factor is that we see the watch errors, always at least twice within 1 or 2 minutes when it's broken.
>
> Ah, I guess I got confused by timestamps in stuck_kthreads.md.
> I grepped for "pre object map update" errors that you reported
> initially and didn't see any.
>
> With any sort of networking issues, watch errors are expected.
>
> I'll take a deeper look at syslog_stuck_krbd_shrinked.
>
> Thanks,
>
>                 Ilya
>
> >
> > Regards,
> >
> > Robin Geuze
> >
> > From: Ilya Dryomov <idryomov@gmail.com>
> > Sent: 17 June 2021 10:36:33
> > To: Robin Geuze
> > Cc: Ceph Development
> > Subject: Re: All RBD IO stuck after flapping OSD's
> >
> > On Wed, Jun 16, 2021 at 1:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > >
> > > Hey Ilya,
> > >
> > > Sorry for the long delay, but we've finally managed to somewhat reliably reproduce this issue and produced a bunch of debug data. It's really big, so you can find the files here: https://dionbosschieter.stackstorage.com/s/RhM3FHLD28EcVJJ2
> > >
> > > We also got some stack traces; those are in there as well.
> > >
> > > The way we reproduce it is that on one of the two ceph machines in the cluster (it's a test cluster) we toggle both the bond NIC ports down, sleep 40 seconds, put them back up, wait another 15 seconds and then put them back down, wait another 40 seconds and then put them back up.
> > >
> > > Exact command line I used on the ceph machine:
> > > ip l set ens785f1 down; sleep 1; ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up; sleep 15; ip l set ens785f1 down; sleep 1; ip l set ens785f0 down; sleep 40; ip l set ens785f1 up; sleep 5; ip l set ens785f0 up
> >
> > Hi Robin,
> >
> > This looks very similar to https://tracker.ceph.com/issues/42757.
> > I don't see the offending writer thread among stuck threads in
> > stuck_kthreads.md though (and syslog_stuck_krbd_shrinked covers only
> > a short 13-second period of time so it's not there either because the
> > problem, at least the one I'm suspecting, would have occurred before
> > 13:00:00).
> >
> > If you can reproduce reliably, try again without verbose logging but
> > do capture all stack traces -- once the system locks up, let it stew
> > for ten minutes and attach "blocked for more than X seconds" splats.
> >
> > Additionally, a "echo w >/proc/sysrq-trigger" dump would be good if
> > SysRq is not disabled on your servers.
> >
> > Thanks,
> >
> >                 Ilya
> >
> > >
> > > Regards,
> > >
> > > Robin Geuze
> > >
> > > From: Ilya Dryomov <idryomov@gmail.com>
> > > Sent: 19 April 2021 14:40:00
> > > To: Robin Geuze
> > > Cc: Ceph Development
> > > Subject: Re: All RBD IO stuck after flapping OSD's
> > >
> > > On Thu, Apr 15, 2021 at 2:21 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > >
> > > > Hey Ilya,
> > > >
> > > > We had to reboot the machine unfortunately, since we had customers unable to work with their VM's. We did manage to make a dynamic debugging dump of an earlier occurrence; maybe that can help? I've attached it to this email.
> > >
> > > No, I don't see anything to go on there.  Next time, enable logging for
> > > both libceph and rbd modules and make sure that at least one instance of
> > > the error (i.e. "pre object map update failed: -16") makes it into the
> > > attached log.
> > >
> > > >
> > > > Those messages constantly occur, even after we kill the VM using the mount, I guess because there is pending IO which cannot be flushed.
> > > >
> > > > As for how it's getting worse: if you try any management operations (e.g. unmap) on any of the RBD mounts that aren't affected, they hang, and more often than not the IO for that one also stalls (not always, though).
> > >
> > > One obvious workaround is to unmap, disable object-map and
> > > exclusive-lock features with "rbd feature disable", and map back.  You
> > > would lose the benefits of object map, but if it is affecting customer
> > > workloads it is probably the best course of action until this thing is
> > > root caused.
> > >
> > > Thanks,
> > >
> > >                 Ilya
> > >
> > > >
> > > > Regards,
> > > >
> > > > Robin Geuze
> > > >
> > > > From: Ilya Dryomov <idryomov@gmail.com>
> > > > Sent: 14 April 2021 19:00:20
> > > > To: Robin Geuze
> > > > Cc: Ceph Development
> > > > Subject: Re: All RBD IO stuck after flapping OSD's
> > > >
> > > > On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> > > > >
> > > > > Hey,
> > > > >
> > > > > We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
> > > > >
> > > > > Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> > > > > Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> > > > > Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> > > > > (...)
> > > > >
> > > > > After a while of that we start getting these errors being spammed in dmesg:
> > > > >
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> > > > > Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
> > > > >
> > > > > (In this case for two different RBD mounts)
> > > > >
> > > > > At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate our VM's to another HV, since to make the messages go away we have to reboot the server.
> > > >
> > > > Hi Robin,
> > > >
> > > > Do these messages appear even if no I/O is issued to /dev/rbd14 or only
> > > > if you attempt to write?
> > > >
> > > > >
> > > > > All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue get.
> > > >
> > > > Please explain how it's getting worse.
> > > >
> > > > I think the problem is that the object map isn't locked.  What
> > > > probably happened is the kernel client lost its watch on the image
> > > > and for some reason can't get it back.   The flapping has likely
> > > > triggered some edge condition in the watch/notify code.
> > > >
> > > > To confirm:
> > > >
> > > > - paste the contents of /sys/bus/rbd/devices/14/client_addr
> > > >
> > > > - paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
> > > >   for /dev/rbd14.  If you are using noshare, you will have multiple
> > > >   client instances with the same cluster id.  The one you need can be
> > > >   identified with /sys/bus/rbd/devices/14/client_id.
> > > >
> > > > - paste the output of "rbd status <rbd14 image>" (image name can be
> > > >   identified from "rbd showmapped")
> > > >
> > > > I'm also curious who actually has the lock on the header object and the
> > > > object map object.  Paste the output of
> > > >
> > > > $ ID=$(rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
> > > > $ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
> > > > $ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq
> > > >
> > > > Thanks,
> > > >
> > > >                 Ilya
> > > >
> > >
> >
>
        

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-06-29  8:39                     ` Robin Geuze
@ 2021-06-29 10:07                       ` Ilya Dryomov
  2021-07-06 17:21                         ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-06-29 10:07 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Tue, Jun 29, 2021 at 10:39 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> Do you have any idea on the cause of this bug yet? I tried to dig around a bit myself in the source, but the logic around this locking is very complex, so I couldn't figure out where the problem is.

I do.  The proper fix would indeed be large and not backportable but
I have a workaround in mind that should be simple enough to backport
all the way to 5.4.  The trick is making sure that the workaround is
fine from the exclusive lock protocol POV.

I'll try to flesh it out by the end of this week and report back
early next week.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-06-29 10:07                       ` Ilya Dryomov
@ 2021-07-06 17:21                         ` Ilya Dryomov
  2021-07-07  7:35                           ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Ilya Dryomov @ 2021-07-06 17:21 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Tue, Jun 29, 2021 at 12:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Jun 29, 2021 at 10:39 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > Do you have any idea on the cause of this bug yet? I tried to dig around a bit myself in the source, but the logic around this locking is very complex, so I couldn't figure out where the problem is.
>
> I do.  The proper fix would indeed be large and not backportable but
> I have a workaround in mind that should be simple enough to backport
> all the way to 5.4.  The trick is making sure that the workaround is
> fine from the exclusive lock protocol POV.
>
> I'll try to flesh it out by the end of this week and report back
> early next week.

Hi Robin,

I CCed you on the patches.  They should apply to 5.4 cleanly.  You
mentioned you have a build farm set up, please take them for a spin.
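
As a rough sketch of applying and building them (the tree path, patch file names, and packaging target below are placeholders, not a description of the actual pipeline):

$ cd linux-5.4                                # the kernel tree the build pipeline is based on
$ git am 0001-rbd-*.patch 0002-rbd-*.patch    # hypothetical file names for the posted patches
$ make olddefconfig && make -j"$(nproc)" bindeb-pkg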

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-07-06 17:21                         ` Ilya Dryomov
@ 2021-07-07  7:35                           ` Robin Geuze
  2021-07-20 12:04                             ` Robin Geuze
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-07-07  7:35 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

Thanks so much for the patches. We are planning to test them either this afternoon or tomorrow at the latest; I will let you know the results.

Regards,

Robin Geuze

From: Ilya Dryomov <idryomov@gmail.com>
Sent: 06 July 2021 19:21
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Tue, Jun 29, 2021 at 12:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Jun 29, 2021 at 10:39 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > Do you have any idea on the cause of this bug yet? I tried to dig around a bit myself in the source, but the logic around this locking is very complex, so I couldn't figure out where the problem is.
>
> I do.  The proper fix would indeed be large and not backportable but
> I have a workaround in mind that should be simple enough to backport
> all the way to 5.4.  The trick is making sure that the workaround is
> fine from the exclusive lock protocol POV.
>
> I'll try to flesh it out by the end of this week and report back
> early next week.

Hi Robin,

I CCed you on the patches.  They should apply to 5.4 cleanly.  You
mentioned you have a build farm set up, please take them for a spin.

Thanks,

                Ilya
    

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-07-07  7:35                           ` Robin Geuze
@ 2021-07-20 12:04                             ` Robin Geuze
  2021-07-20 16:42                               ` Ilya Dryomov
  0 siblings, 1 reply; 16+ messages in thread
From: Robin Geuze @ 2021-07-20 12:04 UTC (permalink / raw)
  To: Ilya Dryomov; +Cc: Ceph Development

Hey Ilya,

It took a bit longer than expected, but we finally got around to testing the patches. They seem to do the trick. We did have one stuck rbd device; however, after the 60-second hung task timeout expired, that one also continued working. Great work. We ended up testing it on the Ubuntu 20.04 HWE 5.8-based kernel, by the way, not 5.4.
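
The timeout referred to here is the hung task detector's threshold, which can be checked per host with:

$ sysctl kernel.hung_task_timeout_secs    # seconds a task may sit in D state before a splat is logged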

Regards,

Robin Geuze

From: Robin Geuze
Sent: 07 July 2021 09:35
To: Ilya Dryomov
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
Hey Ilya,

Thanks so much for the patches. We are planning to test them either this afternoon or tomorrow at the latest; I will let you know the results.

Regards,

Robin Geuze

From: Ilya Dryomov <idryomov@gmail.com>
Sent: 06 July 2021 19:21
To: Robin Geuze
Cc: Ceph Development
Subject: Re: All RBD IO stuck after flapping OSD's
    
On Tue, Jun 29, 2021 at 12:07 PM Ilya Dryomov <idryomov@gmail.com> wrote:
>
> On Tue, Jun 29, 2021 at 10:39 AM Robin Geuze <robin.geuze@nl.team.blue> wrote:
> >
> > Hey Ilya,
> >
> > Do you have any idea on the cause of this bug yet? I tried to dig around a bit myself in the source, but the logic around this locking is very complex, so I couldn't figure out where the problem is.
>
> I do.  The proper fix would indeed be large and not backportable but
> I have a workaround in mind that should be simple enough to backport
> all the way to 5.4.  The trick is making sure that the workaround is
> fine from the exclusive lock protocol POV.
>
> I'll try to flesh it out by the end of this week and report back
> early next week.

Hi Robin,

I CCed you on the patches.  They should apply to 5.4 cleanly.  You
mentioned you have a build farm set up, please take them for a spin.

Thanks,

                Ilya
        

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: All RBD IO stuck after flapping OSD's
  2021-07-20 12:04                             ` Robin Geuze
@ 2021-07-20 16:42                               ` Ilya Dryomov
  0 siblings, 0 replies; 16+ messages in thread
From: Ilya Dryomov @ 2021-07-20 16:42 UTC (permalink / raw)
  To: Robin Geuze; +Cc: Ceph Development

On Tue, Jul 20, 2021 at 2:05 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey Ilya,
>
> It took a bit longer than expected, but we finally got around to testing the patches. They seem to do the trick. We did have one stuck rbd device; however, after the 60-second hung task timeout expired, that one also continued working. Great work. We ended up testing it on the Ubuntu 20.04 HWE 5.8-based kernel, by the way, not 5.4.

Hi Robin,

Thanks for testing!  I'll get these patches into 5.14-rc3 and have them
backported from there.

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-07-20 16:45 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-14  8:51 All RBD IO stuck after flapping OSD's Robin Geuze
2021-04-14 17:00 ` Ilya Dryomov
     [not found]   ` <8eb12c996e404870803e9a7c77e508d6@nl.team.blue>
2021-04-19 12:40     ` Ilya Dryomov
2021-06-16 11:56       ` Robin Geuze
2021-06-17  8:36         ` Ilya Dryomov
2021-06-17  8:42           ` Robin Geuze
2021-06-17  9:40             ` Ilya Dryomov
2021-06-17 10:17               ` Robin Geuze
2021-06-17 11:09                 ` Ilya Dryomov
2021-06-17 11:12                   ` Robin Geuze
2021-06-29  8:39                     ` Robin Geuze
2021-06-29 10:07                       ` Ilya Dryomov
2021-07-06 17:21                         ` Ilya Dryomov
2021-07-07  7:35                           ` Robin Geuze
2021-07-20 12:04                             ` Robin Geuze
2021-07-20 16:42                               ` Ilya Dryomov
