From: Xiubo Li <xiubli@redhat.com>
To: Jeff Layton <jlayton@kernel.org>, Ilya Dryomov <idryomov@gmail.com>
Cc: Ceph Development <ceph-devel@vger.kernel.org>,
	Patrick Donnelly <pdonnell@redhat.com>,
	"Yan, Zheng" <ukernel@gmail.com>
Subject: Re: [PATCH] ceph: retransmit REQUEST_CLOSE every second if we don't get a response
Date: Mon, 12 Oct 2020 21:31:37 +0800	[thread overview]
Message-ID: <41f84b63-e517-cdb7-c76a-548f9bf0fe96@redhat.com> (raw)
In-Reply-To: <2be66f15a6e9217863b4c0bb4004c9263b768e26.camel@kernel.org>

On 2020/10/12 21:17, Jeff Layton wrote:
> On Mon, 2020-10-12 at 20:41 +0800, Xiubo Li wrote:
>> On 2020/10/12 19:52, Jeff Layton wrote:
>>> On Mon, 2020-10-12 at 14:52 +0800, Xiubo Li wrote:
>>>> On 2020/10/11 2:49, Ilya Dryomov wrote:
>>>>> On Thu, Oct 8, 2020 at 8:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>>>>>> On Thu, 2020-10-08 at 19:27 +0200, Ilya Dryomov wrote:
>>>>>>> On Tue, Sep 29, 2020 at 12:03 AM Jeff Layton <jlayton@kernel.org> wrote:
>>>>>>>> Patrick reported a case where the MDS and the client had racing
>>>>>>>> session messages to one another. The MDS was sending caps to the client
>>>>>>>> and the client was sending a CEPH_SESSION_REQUEST_CLOSE message in order
>>>>>>>> to unmount.
>>>>>>>>
>>>>>>>> Because they were sending at the same time, the REQUEST_CLOSE had too
>>>>>>>> old a sequence number, and the MDS dropped it on the floor. On the
>>>>>>>> client, this would have probably manifested as a 60s hang during umount.
>>>>>>>> The MDS ended up blocklisting the client.
>>>>>>>>
>>>>>>>> Once we've decided to issue a REQUEST_CLOSE, we're finished with the
>>>>>>>> session, so just keep sending them until the MDS acknowledges that.
>>>>>>>>
>>>>>>>> Change the code to retransmit a REQUEST_CLOSE every second if the
>>>>>>>> session hasn't changed state yet. Give up and throw a warning after
>>>>>>>> mount_timeout elapses if we haven't gotten a response.
>>>>>>>>
>>>>>>>> URL: https://tracker.ceph.com/issues/47563
>>>>>>>> Reported-by: Patrick Donnelly <pdonnell@redhat.com>
>>>>>>>> Signed-off-by: Jeff Layton <jlayton@kernel.org>
>>>>>>>> ---
>>>>>>>>     fs/ceph/mds_client.c | 53 ++++++++++++++++++++++++++------------------
>>>>>>>>     1 file changed, 32 insertions(+), 21 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
>>>>>>>> index b07e7adf146f..d9cb74e3d5e3 100644
>>>>>>>> --- a/fs/ceph/mds_client.c
>>>>>>>> +++ b/fs/ceph/mds_client.c
>>>>>>>> @@ -1878,7 +1878,7 @@ static int request_close_session(struct ceph_mds_session *session)
>>>>>>>>     static int __close_session(struct ceph_mds_client *mdsc,
>>>>>>>>                             struct ceph_mds_session *session)
>>>>>>>>     {
>>>>>>>> -       if (session->s_state >= CEPH_MDS_SESSION_CLOSING)
>>>>>>>> +       if (session->s_state > CEPH_MDS_SESSION_CLOSING)
>>>>>>>>                    return 0;
>>>>>>>>            session->s_state = CEPH_MDS_SESSION_CLOSING;
>>>>>>>>            return request_close_session(session);
>>>>>>>> @@ -4692,38 +4692,49 @@ static bool done_closing_sessions(struct ceph_mds_client *mdsc, int skipped)
>>>>>>>>            return atomic_read(&mdsc->num_sessions) <= skipped;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> +static bool umount_timed_out(unsigned long timeo)
>>>>>>>> +{
>>>>>>>> +       if (time_before(jiffies, timeo))
>>>>>>>> +               return false;
>>>>>>>> +       pr_warn("ceph: unable to close all sessions\n");
>>>>>>>> +       return true;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>     /*
>>>>>>>>      * called after sb is ro.
>>>>>>>>      */
>>>>>>>>     void ceph_mdsc_close_sessions(struct ceph_mds_client *mdsc)
>>>>>>>>     {
>>>>>>>> -       struct ceph_options *opts = mdsc->fsc->client->options;
>>>>>>>>            struct ceph_mds_session *session;
>>>>>>>> -       int i;
>>>>>>>> -       int skipped = 0;
>>>>>>>> +       int i, ret;
>>>>>>>> +       int skipped;
>>>>>>>> +       unsigned long timeo = jiffies +
>>>>>>>> +                             ceph_timeout_jiffies(mdsc->fsc->client->options->mount_timeout);
>>>>>>>>
>>>>>>>>            dout("close_sessions\n");
>>>>>>>>
>>>>>>>>            /* close sessions */
>>>>>>>> -       mutex_lock(&mdsc->mutex);
>>>>>>>> -       for (i = 0; i < mdsc->max_sessions; i++) {
>>>>>>>> -               session = __ceph_lookup_mds_session(mdsc, i);
>>>>>>>> -               if (!session)
>>>>>>>> -                       continue;
>>>>>>>> -               mutex_unlock(&mdsc->mutex);
>>>>>>>> -               mutex_lock(&session->s_mutex);
>>>>>>>> -               if (__close_session(mdsc, session) <= 0)
>>>>>>>> -                       skipped++;
>>>>>>>> -               mutex_unlock(&session->s_mutex);
>>>>>>>> -               ceph_put_mds_session(session);
>>>>>>>> +       do {
>>>>>>>> +               skipped = 0;
>>>>>>>>                    mutex_lock(&mdsc->mutex);
>>>>>>>> -       }
>>>>>>>> -       mutex_unlock(&mdsc->mutex);
>>>>>>>> +               for (i = 0; i < mdsc->max_sessions; i++) {
>>>>>>>> +                       session = __ceph_lookup_mds_session(mdsc, i);
>>>>>>>> +                       if (!session)
>>>>>>>> +                               continue;
>>>>>>>> +                       mutex_unlock(&mdsc->mutex);
>>>>>>>> +                       mutex_lock(&session->s_mutex);
>>>>>>>> +                       if (__close_session(mdsc, session) <= 0)
>>>>>>>> +                               skipped++;
>>>>>>>> +                       mutex_unlock(&session->s_mutex);
>>>>>>>> +                       ceph_put_mds_session(session);
>>>>>>>> +                       mutex_lock(&mdsc->mutex);
>>>>>>>> +               }
>>>>>>>> +               mutex_unlock(&mdsc->mutex);
>>>>>>>>
>>>>>>>> -       dout("waiting for sessions to close\n");
>>>>>>>> -       wait_event_timeout(mdsc->session_close_wq,
>>>>>>>> -                          done_closing_sessions(mdsc, skipped),
>>>>>>>> -                          ceph_timeout_jiffies(opts->mount_timeout));
>>>>>>>> +               dout("waiting for sessions to close\n");
>>>>>>>> +               ret = wait_event_timeout(mdsc->session_close_wq,
>>>>>>>> +                                        done_closing_sessions(mdsc, skipped), HZ);
>>>>>>>> +       } while (!ret && !umount_timed_out(timeo));
>>>>>>>>
>>>>>>>>            /* tear down remaining sessions */
>>>>>>>>            mutex_lock(&mdsc->mutex);
>>>>>>>> --
>>>>>>>> 2.26.2
>>>>>>>>
>>>>>>> Hi Jeff,
>>>>>>>
>>>>>>> This seems wrong to me, at least conceptually.  Is the same patch
>>>>>>> getting applied to ceph-fuse?
>>>>>>>
>>>>>> It's a grotesque workaround, I will grant you. I'm not sure what we want
>>>>>> to do for ceph-fuse yet but it does seem to have the same issue.
>>>>>> Probably, we should plan to do a similar fix there once we settle on the
>>>>>> right approach.
>>>>>>
>>>>>>> Pretending to not know anything about the client <-> MDS protocol,
>>>>>>> two questions immediately come to mind.  Why is MDS allowed to drop
>>>>>>> REQUEST_CLOSE?
>>>>>> It really seems like a protocol design flaw.
>>>>>>
>>>>>> IIUC, the idea overall with the low-level ceph protocol seems to be that
>>>>>> the client should retransmit (or reevaluate, in the case of caps) calls
>>>>>> that were in flight when the seq number changes.
>>>>>>
>>>>>> The REQUEST_CLOSE handling seems to have followed suit on the MDS side,
>>>>>> but it doesn't really make a lot of sense for that, IMO.
>>>>> (edit of my reply to https://github.com/ceph/ceph/pull/37619)
>>>>>
>>>>> After taking a look at the MDS code, it really seemed like it
>>>>> had been written with the expectation that REQUEST_CLOSE would be
>>>>> resent, so I dug around.  I don't fully understand these "push"
>>>>> sequence numbers yet, but there is probably some race that requires
>>>>> the client to confirm that it saw the sequence number, even if the
>>>>> session is about to go.  Sage is probably the only one who might
>>>>> remember at this point.
>>>>>
>>>>> The kernel client already has the code to retry REQUEST_CLOSE, only
>>>>> every five seconds instead of every second.  See check_session_state()
>>>>> which is called from delayed_work() in mds_client.c.  It looks like
>>>>> it got broken by Xiubo's commit fa9967734227 ("ceph: fix potential
>>>>> mdsc use-after-free crash") which conditioned delayed_work() on
>>>>> mdsc->stopping -- hence the misbehaviour.
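
For reference, the path being described is roughly this (my paraphrase of
mds_client.c, not the exact code):

    static void delayed_work(struct work_struct *work)
    {
            ...
            if (mdsc->stopping)             /* added by fa9967734227 */
                    return;
            ...
            /* for each session: */
            check_session_state(s);
            ...
            /* re-arm the delayed work to run again ~5 seconds later */
    }

    bool check_session_state(struct ceph_mds_session *s)
    {
            ...
            if (s->s_state == CEPH_MDS_SESSION_CLOSING) {
                    /* the existing 5-second REQUEST_CLOSE retry */
                    request_close_session(s);
                    return false;
            }
            ...
    }
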
>>>> Without this commit it would hit this issue too. The old umount code
>>>> tries to close the sessions asynchronously and then cancel the delayed
>>>> work; during that window the last queued delayed_work() timer might
>>>> fire. This commit just makes the issue easier to reproduce.
>>>>
>>> Fixing the potential races to ensure that this is retransmitted is an
>>> option, but I'm not sure it's the best one. Here's what I think we
>>> probably ought to do:
>>>
>>> 1/ fix the MDS to just ignore the sequence number on REQUEST_CLOSE. I
>>> don't see that the sequence number has any value on that call, as it's
>>> an indicator that the client is finished with the session, and it's
>>> never going to change its mind and do something different if the
>>> sequence is wrong. I have a PR for that here:
>>>
>>>       https://github.com/ceph/ceph/pull/37619
>>>
>>> 2/ fix the clients to not wait on the REQUEST_CLOSE reply. As soon as
>>> the call is sent, tear down the session and proceed with unmounting. The
>>> client doesn't really care what the MDS has to say after that point, so
>>> we may as well not wait on it before proceeding.
>>>
>>> Thoughts?
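
For (2), I guess ceph_mdsc_close_sessions() would become a single
fire-and-forget pass over the sessions, roughly like this (sketch only,
based on the loop in the patch above, error handling omitted):

    /* send REQUEST_CLOSE once per session and carry on with the
     * unmount without waiting for the MDS reply */
    mutex_lock(&mdsc->mutex);
    for (i = 0; i < mdsc->max_sessions; i++) {
            session = __ceph_lookup_mds_session(mdsc, i);
            if (!session)
                    continue;
            mutex_unlock(&mdsc->mutex);
            mutex_lock(&session->s_mutex);
            __close_session(mdsc, session);
            mutex_unlock(&session->s_mutex);
            ceph_put_mds_session(session);
            mutex_lock(&mdsc->mutex);
    }
    mutex_unlock(&mdsc->mutex);
    /* no wait_event_timeout() on session_close_wq here -- go straight
     * on to tearing down the remaining sessions */
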
>> I am thinking we could just check the session's state whenever the
>> client receives a request from the MDS that increases the s_seq number;
>> if the session is in the CLOSING state, the client would resend the
>> REQUEST_CLOSE request.
>>
> That could be done, but that means adding extra complexity to the
> session handling code, which could really stand to be simplified
> instead.
>
> mdsc->stopping and session->s_state seem to be protected by the
> mdsc->mutex, but session->s_seq is protected by the session->s_mutex.
>
> There are 4 types of messages that increment the s_seq -- caps, leases,
> quotas and snaps. All of those would need to be changed to check for and
> retransmit REQUEST_CLOSE if one is outstanding.

How about deferring the resend of the CLOSE request in the above case?
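
i.e. handle the incoming message as usual and only afterwards, in one
place, do something like this (just a rough sketch of the idea -- the
helper name is made up):

    /* hypothetical helper, not existing code; locking glossed over */
    static void maybe_resend_close(struct ceph_mds_session *session)
    {
            /* a cap/lease/quota/snap message just bumped s_seq; if we
             * are still waiting for the MDS to close the session,
             * resend the REQUEST_CLOSE so it carries the new seq */
            if (session->s_state == CEPH_MDS_SESSION_CLOSING)
                    request_close_session(session);
    }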


> So yeah, that could be done on the client side. If we were to do that,
> should we couple it with the MDS side fix to make it ignore the seq on
> REQUEST_CLOSE?


