From: Xiubo Li <xiubli@redhat.com>
To: Jeff Layton <jlayton@kernel.org>, Ilya Dryomov <idryomov@gmail.com>
Cc: Ceph Development <ceph-devel@vger.kernel.org>,
Patrick Donnelly <pdonnell@redhat.com>,
"Yan, Zheng" <ukernel@gmail.com>
Subject: Re: [PATCH] ceph: retransmit REQUEST_CLOSE every second if we don't get a response
Date: Mon, 12 Oct 2020 21:31:37 +0800 [thread overview]
Message-ID: <41f84b63-e517-cdb7-c76a-548f9bf0fe96@redhat.com> (raw)
In-Reply-To: <2be66f15a6e9217863b4c0bb4004c9263b768e26.camel@kernel.org>
On 2020/10/12 21:17, Jeff Layton wrote:
> On Mon, 2020-10-12 at 20:41 +0800, Xiubo Li wrote:
>> On 2020/10/12 19:52, Jeff Layton wrote:
>>> On Mon, 2020-10-12 at 14:52 +0800, Xiubo Li wrote:
>>>> On 2020/10/11 2:49, Ilya Dryomov wrote:
>>>>> On Thu, Oct 8, 2020 at 8:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>>>>>> On Thu, 2020-10-08 at 19:27 +0200, Ilya Dryomov wrote:
>>>>>>> On Tue, Sep 29, 2020 at 12:03 AM Jeff Layton <jlayton@kernel.org> wrote:
>>>>>>>> Patrick reported a case where the MDS and client had racing
>>>>>>>> session messages to one another. The MDS was sending caps to the client
>>>>>>>> and the client was sending a CEPH_SESSION_REQUEST_CLOSE message in order
>>>>>>>> to unmount.
>>>>>>>>
>>>>>>>> Because they were sending at the same time, the REQUEST_CLOSE had too
>>>>>>>> old a sequence number, and the MDS dropped it on the floor. On the
>>>>>>>> client, this would have probably manifested as a 60s hang during umount.
>>>>>>>> The MDS ended up blocklisting the client.
>>>>>>>>
>>>>>>>> Once we've decided to issue a REQUEST_CLOSE, we're finished with the
>>>>>>>> session, so just keep sending them until the MDS acknowledges that.
>>>>>>>>
>>>>>>>> Change the code to retransmit a REQUEST_CLOSE every second if the
>>>>>>>> session hasn't changed state yet. Give up and throw a warning after
>>>>>>>> mount_timeout elapses if we haven't gotten a response.
>>>>>>>>
>>>>>>>> URL: https://tracker.ceph.com/issues/47563
>>>>>>>> Reported-by: Patrick Donnelly <pdonnell@redhat.com>
>>>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>>>>> ---
>>>>>>>> fs/ceph/mds_client.c | 53 ++++++++++++++++++++++++++------------------
>>>>>>>> 1 file changed, 32 insertions(+), 21 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>> index b07e7adf146f..d9cb74e3d5e3 100644
>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>> @@ -1878,7 +1878,7 @@ static int request_close_session(struct ceph_mds_session *session)
>>>>>>>> static int __close_session(struct ceph_mds_client *mdsc,
>>>>>>>> struct ceph_mds_session *session)
>>>>>>>> {
>>>>>>>> - if (session->s_state >= CEPH_MDS_SESSION_CLOSING)
>>>>>>>> + if (session->s_state > CEPH_MDS_SESSION_CLOSING)
>>>>>>>> return 0;
>>>>>>>> session->s_state = CEPH_MDS_SESSION_CLOSING;
>>>>>>>> return request_close_session(session);
>>>>>>>> @@ -4692,38 +4692,49 @@ static bool done_closing_sessions(struct ceph_mds_client *mdsc, int skipped)
>>>>>>>> return atomic_read(&mdsc->num_sessions) <= skipped;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +static bool umount_timed_out(unsigned long timeo)
>>>>>>>> +{
>>>>>>>> + if (time_before(jiffies, timeo))
>>>>>>>> + return false;
>>>>>>>> + pr_warn("ceph: unable to close all sessions\n");
>>>>>>>> + return true;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> /*
>>>>>>>> * called after sb is ro.
>>>>>>>> */
>>>>>>>> void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc)
>>>>>>>> {
>>>>>>>> - struct ceph_options *opts = mdsc->fsc->client->options;
>>>>>>>> struct ceph_mds_session *session;
>>>>>>>> - int i;
>>>>>>>> - int skipped = 0;
>>>>>>>> + int i, ret;
>>>>>>>> + int skipped;
>>>>>>>> + unsigned long timeo = jiffies +
>>>>>>>> + ceph_timeout_jiffies(mdsc->fsc->client->options->mount_timeout);
>>>>>>>>
>>>>>>>> dout("close_sessions\n");
>>>>>>>>
>>>>>>>> /* close sessions */
>>>>>>>> - mutex_lock(&mdsc->mutex);
>>>>>>>> - for (i = 0; i < mdsc->max_sessions; i++) {
>>>>>>>> - session = __ceph_lookup_mds_session(mdsc, i);
>>>>>>>> - if (!session)
>>>>>>>> - continue;
>>>>>>>> - mutex_unlock(&mdsc->mutex);
>>>>>>>> - mutex_lock(&session->s_mutex);
>>>>>>>> - if (__close_session(mdsc, session) <= 0)
>>>>>>>> - skipped++;
>>>>>>>> - mutex_unlock(&session->s_mutex);
>>>>>>>> - ceph_put_mds_session(session);
>>>>>>>> + do {
>>>>>>>> + skipped = 0;
>>>>>>>> mutex_lock(&mdsc->mutex);
>>>>>>>> - }
>>>>>>>> - mutex_unlock(&mdsc->mutex);
>>>>>>>> + for (i = 0; i < mdsc->max_sessions; i++) {
>>>>>>>> + session = __ceph_lookup_mds_session(mdsc, i);
>>>>>>>> + if (!session)
>>>>>>>> + continue;
>>>>>>>> + mutex_unlock(&mdsc->mutex);
>>>>>>>> + mutex_lock(&session->s_mutex);
>>>>>>>> + if (__close_session(mdsc, session) <= 0)
>>>>>>>> + skipped++;
>>>>>>>> + mutex_unlock(&session->s_mutex);
>>>>>>>> + ceph_put_mds_session(session);
>>>>>>>> + mutex_lock(&mdsc->mutex);
>>>>>>>> + }
>>>>>>>> + mutex_unlock(&mdsc->mutex);
>>>>>>>>
>>>>>>>> - dout("waiting for sessions to close\n");
>>>>>>>> - wait_event_timeout(mdsc->session_close_wq,
>>>>>>>> - done_closing_sessions(mdsc, skipped),
>>>>>>>> - ceph_timeout_jiffies(opts->mount_timeout));
>>>>>>>> + dout("waiting for sessions to close\n");
>>>>>>>> + ret = wait_event_timeout(mdsc->session_close_wq,
>>>>>>>> + done_closing_sessions(mdsc, skipped), HZ);
>>>>>>>> + } while (!ret && !umount_timed_out(timeo));
>>>>>>>>
>>>>>>>> /* tear down remaining sessions */
>>>>>>>> mutex_lock(&mdsc->mutex);
>>>>>>>> --
>>>>>>>> 2.26.2
>>>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> This seems wrong to me, at least conceptually. Is the same patch
>>>>>>> getting applied to ceph-fuse?
>>>>>>>
>>>>>> It's a grotesque workaround, I will grant you. I'm not sure what we want
>>>>>> to do for ceph-fuse yet but it does seem to have the same issue.
>>>>>> Probably, we should plan to do a similar fix there once we settle on the
>>>>>> right approach.
>>>>>>
>>>>>>> Pretending to not know anything about the client <-> MDS protocol,
>>>>>>> two questions immediately come to mind. Why is MDS allowed to drop
>>>>>>> REQUEST_CLOSE?
>>>>>> It really seems like a protocol design flaw.
>>>>>>
>>>>>> IIUC, the idea overall with the low-level ceph protocol seems to be that
>>>>>> the client should retransmit (or reevaluate, in the case of caps) calls
>>>>>> that were in flight when the seq number changes.
>>>>>>
>>>>>> The REQUEST_CLOSE handling seems to have followed suit on the MDS side,
>>>>>> but it doesn't really make a lot of sense for that, IMO.
>>>>> (edit of my reply to https://github.com/ceph/ceph/pull/37619)
>>>>>
>>>>> After taking a look at the MDS code, it really seemed like it
>>>>> had been written with the expectation that REQUEST_CLOSE would be
>>>>> resent, so I dug around. I don't fully understand these "push"
>>>>> sequence numbers yet, but there is probably some race that requires
>>>>> the client to confirm that it saw the sequence number, even if the
>>>>> session is about to go. Sage is probably the only one who might
>>>>> remember at this point.
>>>>>
>>>>> The kernel client already has the code to retry REQUEST_CLOSE, only
>>>>> every five seconds instead of every second. See check_session_state()
>>>>> which is called from delayed_work() in mds_client.c. It looks like
>>>>> it got broken by Xiubo's commit fa9967734227 ("ceph: fix potential
>>>>> mdsc use-after-free crash") which conditioned delayed_work() on
>>>>> mdsc->stopping -- hence the misbehaviour.
Even without that commit it would hit this issue. The old umount code
tries to close the sessions asynchronously and then cancels the delayed
work, during which the last queued delayed_work() timer might still
fire. That commit just makes the issue easier to reproduce.
>>>>
>>> Fixing the potential races to ensure that this is retransmitted is an
>>> option, but I'm not sure it's the best one. Here's what I think we
>>> probably ought to do:
>>>
>>> 1/ fix the MDS to just ignore the sequence number on REQUEST_CLOSE. I
>>> don't see that the sequence number has any value on that call, as it's
>>> an indicator that the client is finished with the session, and it's
>>> never going to change its mind and do something different if the
>>> sequence is wrong. I have a PR for that here:
>>>
>>> https://github.com/ceph/ceph/pull/37619
>>>
>>> 2/ fix the clients to not wait on the REQUEST_CLOSE reply. As soon as
>>> the call is sent, tear down the session and proceed with unmounting. The
>>> client doesn't really care what the MDS has to say after that point, so
>>> we may as well not wait on it before proceeding.
>>>
>>> Thoughts?
>> I am thinking we could just check the session's state when the
>> client receives a request from the MDS that increases the s_seq number:
>> if the session is in the CLOSING state, the client resends the
>> REQUEST_CLOSE request.
>>
> That could be done, but that means adding extra complexity to the
> session handling code, which could really stand to be simplified
> instead.
>
> mdsc->stopping and session->s_state seem to be protected by the
> mdsc->mutex, but session->s_seq is protected by the session->s_mutex.
>
> There are 4 types of messages that increment the s_seq -- caps, leases,
> quotas and snaps. All of those would need to be changed to check for and
> retransmit REQUEST_CLOSE if one is outstanding.
How about deferring the resend of the CLOSE request in that case?
> So yeah, that could be done on the client side. If we were to do that,
> should we couple it with the MDS side fix to make it ignore the seq on
> REQUEST_CLOSE?
Thread overview: 12+ messages
2020-09-28 22:03 [PATCH] ceph: retransmit REQUEST_CLOSE every second if we don't get a response Jeff Layton
2020-10-08 17:27 ` Ilya Dryomov
2020-10-08 18:14 ` Jeff Layton
2020-10-10 18:49 ` Ilya Dryomov
2020-10-12 6:52 ` Xiubo Li
2020-10-12 11:52 ` Jeff Layton
2020-10-12 12:41 ` Xiubo Li
2020-10-12 13:16 ` Ilya Dryomov
2020-10-12 13:17 ` Jeff Layton
2020-10-12 13:31 ` Xiubo Li [this message]
2020-10-12 13:49 ` Jeff Layton
2020-10-12 13:52 ` Xiubo Li