From: Jeff Layton <jlayton@kernel.org>
To: Ilya Dryomov <idryomov@gmail.com>, ceph-devel@vger.kernel.org
Subject: Re: [PATCH] libceph: clear con->out_msg on Policy::stateful_server faults
Date: Thu, 08 Oct 2020 13:29:18 -0400
Message-ID: <a84ecb3297c19a92122846f22e38d932aedccb6b.camel@kernel.org>
In-Reply-To: <20201008165800.9494-1-idryomov@gmail.com>
On Thu, 2020-10-08 at 18:58 +0200, Ilya Dryomov wrote:
> con->out_msg must be cleared on Policy::stateful_server
> (!CEPH_MSG_CONNECT_LOSSY) faults. Not doing so botches the
> reconnection attempt, because after writing the banner the
> messenger moves on to writing the data section of that message
> (either from where it got interrupted by the connection reset or
> from the beginning) instead of writing struct ceph_msg_connect.
> This results in a bizarre error message because the server
> sends CEPH_MSGR_TAG_BADPROTOVER but we think we wrote struct
> ceph_msg_connect:
>
> libceph: mds0 (1)172.21.15.45:6828 socket error on write
> ceph: mds0 reconnect start
> libceph: mds0 (1)172.21.15.45:6829 socket closed (con state OPEN)
> libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch, my 32 != server's 32
> libceph: mds0 (1)172.21.15.45:6829 protocol version mismatch
>
> AFAICT this bug goes back to the dawn of the kernel client.
> The reason it survived for so long is that only MDS sessions
> are stateful and only two MDS messages have a data section:
> CEPH_MSG_CLIENT_RECONNECT (always, but reconnecting is rare)
> and CEPH_MSG_CLIENT_REQUEST (only when xattrs are involved).
> The connection has to get reset precisely when such message
> is being sent -- in this case it was the former.
>
> Cc: stable@vger.kernel.org
> Link: https://tracker.ceph.com/issues/47723
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---
> net/ceph/messenger.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index e9e2763a255f..c1f1f85545c3 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -2998,6 +2998,11 @@ static void con_fault(struct ceph_connection *con)
>  		ceph_msg_put(con->in_msg);
>  		con->in_msg = NULL;
>  	}
> +	if (con->out_msg) {
> +		BUG_ON(con->out_msg->con != con);
> +		ceph_msg_put(con->out_msg);
> +		con->out_msg = NULL;
> +	}
>  
>  	/* Requeue anything that hasn't been acked */
>  	list_splice_init(&con->out_sent, &con->out_queue);
Nice catch, Ilya.
It might be nice to make a common helper that both reset_connection and
con_fault can call to drop the in_msg/out_msg, but keeping this small
for a stable patch is reasonable. Maybe that can be done as part of the
msgr2 work?
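
For illustration only, a rough userspace sketch of what such a helper
might look like. It is modeled on the hunk quoted above and is not code
from messenger.c: the mock_* types, mock_msg_put(), con_drop_msg() and
con_drop_msgs() are all hypothetical stand-ins, and assert() takes the
place of BUG_ON().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types for illustration only; not the real libceph structures. */
struct mock_con;

struct mock_msg {
	struct mock_con *con;	/* connection that owns this message */
	int refcount;
};

struct mock_con {
	struct mock_msg *in_msg;
	struct mock_msg *out_msg;
};

/* Stand-in for ceph_msg_put(): drop one reference. */
static void mock_msg_put(struct mock_msg *msg)
{
	msg->refcount--;
}

/* Hypothetical helper: release one in-flight message slot, if set. */
static void con_drop_msg(struct mock_con *con, struct mock_msg **slot)
{
	if (*slot) {
		assert((*slot)->con == con);	/* BUG_ON() in the kernel */
		mock_msg_put(*slot);
		*slot = NULL;
	}
}

/* Hypothetical common helper for both the fault and the reset path. */
static void con_drop_msgs(struct mock_con *con)
{
	con_drop_msg(con, &con->in_msg);
	con_drop_msg(con, &con->out_msg);
}
```

Calling con_drop_msgs() twice in a row is harmless in this sketch,
because each slot is set to NULL after its reference is dropped.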
Reviewed-by: Jeff Layton <jlayton@kernel.org>