ceph-devel.vger.kernel.org archive mirror
From: Robin Geuze <robin.geuze@nl.team.blue>
To: Ceph Development <ceph-devel@vger.kernel.org>
Subject: All RBD IO stuck after flapping OSD's
Date: Wed, 14 Apr 2021 08:51:21 +0000
Message-ID: <47f0a04ce6664116a11cfdb5a458e252@nl.team.blue>

Hey,

We've encountered a weird issue when using the kernel RBD module. It starts with a bunch of OSDs flapping (in our case because of a network card issue that caused the LACP to flap constantly), which is logged in dmesg:

Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down
Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down
Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down
(...)
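
To get a rough picture of how badly the OSDs were flapping, the libceph lines can be tallied per OSD. The following is a minimal sketch, not a definitive tool: the function name `count_osd_events` is hypothetical, the "down" pattern is taken from the log excerpt above, and matching "up" lines in the same format is an assumption.

```python
import re
from collections import Counter

# "osdN down" is the format seen in the excerpt above; "osdN up" is assumed
# to follow the same format.
OSD_EVENT = re.compile(r"libceph: osd(\d+) (down|up)\b")


def count_osd_events(lines):
    """Return a Counter keyed by (osd_id, state) for libceph up/down lines."""
    counts = Counter()
    for line in lines:
        match = OSD_EVENT.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


sample_osd_log = [
    "Apr 14 05:45:02 hv1 kernel: [647677.112461] libceph: osd56 down",
    "Apr 14 05:45:03 hv1 kernel: [647678.114962] libceph: osd54 down",
    "Apr 14 05:45:05 hv1 kernel: [647680.127329] libceph: osd50 down",
]

osd_counts = count_osd_events(sample_osd_log)
for (osd_id, state), n in sorted(osd_counts.items()):
    print(f"osd{osd_id} {state}: {n}")
```

An OSD that shows many "down" events in a short window is a likely flapping candidate.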

After a while of that, dmesg starts getting spammed with these errors:

Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16
Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16
Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16

(In this case for two different RBD mounts)
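
The trailing number in these messages is a negated Linux errno value, so -16 corresponds to EBUSY. As a minimal sketch for triage, assuming the messages keep the format shown above, the affected devices and error names can be pulled out of a dmesg dump like this (the function name `rbd_errors` is hypothetical):

```python
import errno
import re

# Captures the rbd device name and the trailing negative result code.
RBD_ERROR = re.compile(r"rbd: (rbd\d+): .*?(-\d+)\s*$")


def rbd_errors(lines):
    """Return a dict mapping each rbd device to the set of errno names seen."""
    seen = {}
    for line in lines:
        match = RBD_ERROR.search(line)
        if not match:
            continue
        device = match.group(1)
        code = -int(match.group(2))  # the kernel logs -errno
        name = errno.errorcode.get(code, f"errno {code}")
        seen.setdefault(device, set()).add(name)
    return seen


sample_rbd_log = [
    "Apr 14 05:47:35 hv1 kernel: [647830.671263] rbd: rbd14: pre object map update failed: -16",
    "Apr 14 05:47:35 hv1 kernel: [647830.671268] rbd: rbd14: write at objno 192 2564096~2048 result -16",
    "Apr 14 05:47:35 hv1 kernel: [647830.671271] rbd: rbd14: write result -16",
]

rbd_summary = rbd_errors(sample_rbd_log)
for device, names in sorted(rbd_summary.items()):
    print(device, ", ".join(sorted(names)))
```

This only decodes what is already in the log; it says nothing about why the object map update returns EBUSY.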

At this point the IO for these two mounts is completely gone, and the only reason we can still perform IO on the other RBD devices is that we use noshare. Unfortunately, unmounting the other devices is no longer possible, which means we cannot migrate our VMs to another HV; the only way to make the messages go away is to reboot the server.

All of this wouldn't be such a big issue if it recovered once the cluster started behaving normally again, but it doesn't. It just stays stuck, and the longer we wait before rebooting, the worse the issue gets.

We've seen this multiple times on different machines and with different clusters with differing problem types, so it's not a freak incident. Does anyone have any ideas on how we could solve this?

Regards,

Robin Geuze

