Re: All RBD IO stuck after flapping OSD's

From: Ilya Dryomov <idryomov@gmail.com>
To: Robin Geuze <robin.geuze@nl.team.blue>
Cc: Ceph Development <ceph-devel@vger.kernel.org>
Subject: Re: All RBD IO stuck after flapping OSD's
Date: Wed, 14 Apr 2021 19:00:20 +0200
Message-ID: <CAOi1vP-moRXtL4gKXQF8+NwbPgE11_LoxfSYqYBbJfYYQ7Sv_g@mail.gmail.com>
In-Reply-To: <47f0a04ce6664116a11cfdb5a458e252@nl.team.blue>

On Wed, Apr 14, 2021 at 4:56 PM Robin Geuze <robin.geuze@nl.team.blue> wrote:
>
> Hey,
>
> We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSD's flapping (in our case because of a network card issue which caused the LACP to constantly flap), which is logged in dmesg:
>
> Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
> Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
> Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
> (...)
>
> After a while of that we start getting these errors being spammed in dmesg:
>
> Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
> Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
> Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16
>
> (In this case for two different RBD mounts)
>
> At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is because we use noshare. Unfortunately unmounting the other devices is no longer possible, which means we cannot migrate our VM's to another HV, since to make the messages go away we have to reboot the server.

Hi Robin,

Do these messages appear even if no I/O is issued to /dev/rbd14 or only
if you attempt to write?

>
> All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't, it just keeps being stuck, and the longer we wait with rebooting this the worse the issue gets.

Please explain how it's getting worse.

I think the problem is that the object map isn't locked.  What
probably happened is the kernel client lost its watch on the image
and for some reason can't get it back.  The flapping has likely
triggered some edge condition in the watch/notify code.

To confirm:

- paste the contents of /sys/bus/rbd/devices/14/client_addr

- paste the contents of /sys/kernel/debug/ceph/<cluster id>.client<id>/osdc
  for /dev/rbd14.  If you are using noshare, you will have multiple
  client instances with the same cluster id.  The one you need can be
  identified with /sys/bus/rbd/devices/14/client_id.

- paste the output of "rbd status <rbd14 image>" (image name can be
  identified from "rbd showmapped")
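
If it helps with the second item, here is a rough sketch of picking the
right osdc file when noshare leaves several client instances around.  It
assumes that client_id in sysfs holds a string of the form "client<id>"
and that the debugfs directory name ends in that same string.  The
osdc_path helper and the SYSFS_ROOT / DEBUGFS_ROOT overrides are made up
for this example and are not part of any Ceph tooling:

```shell
# Print the path of the debugfs osdc file that belongs to a mapped rbd device.
# Usage: osdc_path <rbd device id>, e.g. osdc_path 14
# SYSFS_ROOT and DEBUGFS_ROOT are hypothetical overrides, present only so the
# helper can be tried against a fake directory tree.
osdc_path() {
    dev_id=$1
    sysfs=${SYSFS_ROOT:-/sys/bus/rbd/devices}
    debugfs=${DEBUGFS_ROOT:-/sys/kernel/debug/ceph}

    # Assumed to contain something like "client12345".
    client_id=$(cat "$sysfs/$dev_id/client_id") || return 1

    # One directory per client instance, named <cluster id>.client<id>;
    # match on the client part so the cluster id does not need to be known.
    for dir in "$debugfs"/*."$client_id"; do
        if [ -f "$dir/osdc" ]; then
            printf '%s\n' "$dir/osdc"
            return 0
        fi
    done
    return 1
}
```

Something like cat "$(osdc_path 14)" would then print the file.  debugfs
is normally readable only by root, so run it as root.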

I'm also curious who actually has the lock on the header object and the
object map object.  Paste the output of

$ ID=$(rbd info --format=json <rbd14 pool>/<rbd14 image> | jq -r .id)
$ rados -p <rbd14 pool> lock info rbd_header.$ID rbd_lock | jq
$ rados -p <rbd14 pool> lock info rbd_object_map.$ID rbd_lock | jq

Thanks,

                Ilya