linux-nfs.vger.kernel.org archive mirror
From: Chuck Lever <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@hammerspace.com>
Cc: "linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v4 06/30] xprtrdma: Don't wake pending tasks until disconnect is done
Date: Mon, 17 Dec 2018 14:00:11 -0500
Message-ID: <6A4F0F5B-2D29-4A51-BC21-AF23B64D6ED7@oracle.com>
In-Reply-To: <c639c203e1c9a5c0016a46e95ceb690744480583.camel@hammerspace.com>



> On Dec 17, 2018, at 1:55 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
> 
> On Mon, 2018-12-17 at 13:37 -0500, Chuck Lever wrote:
>>> On Dec 17, 2018, at 12:28 PM, Trond Myklebust <trondmy@hammerspace.com> wrote:
>>> 
>>> On Mon, 2018-12-17 at 11:39 -0500, Chuck Lever wrote:
>>>> Transport disconnect processing does a "wake pending tasks" at
>>>> various points.
>>>> 
>>>> Suppose an RPC Reply is being processed. The RPC task that Reply
>>>> goes with is waiting on the pending queue. If a disconnect wake-up
>>>> happens before reply processing is done, that reply, even if it is
>>>> good, is thrown away, and the RPC has to be sent again.
>>>> 
>>>> This window apparently does not exist for socket transports because
>>>> there is a lock held while a reply is being received which prevents
>>>> the wake-up call until after reply processing is done.
>>>> 
>>>> To resolve this, all RPC replies being processed on an RPC-over-RDMA
>>>> transport have to complete before pending tasks are awoken due to a
>>>> transport disconnect.
>>>> 
>>>> Callers that already hold the transport write lock may invoke
>>>> ->ops->close directly. Others use a generic helper that schedules
>>>> a close when the write lock can be taken safely.
>>>> 
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>> include/linux/sunrpc/xprt.h                |    1 +
>>>> net/sunrpc/xprt.c                          |   19 +++++++++++++++++++
>>>> net/sunrpc/xprtrdma/backchannel.c          |   13 +++++++------
>>>> net/sunrpc/xprtrdma/svc_rdma_backchannel.c |    8 +++++---
>>>> net/sunrpc/xprtrdma/transport.c            |   16 ++++++++++------
>>>> net/sunrpc/xprtrdma/verbs.c                |    5 ++---
>>>> 6 files changed, 44 insertions(+), 18 deletions(-)
>>>> 
>>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>>> index a4ab4f8..ee94ed0 100644
>>>> --- a/include/linux/sunrpc/xprt.h
>>>> +++ b/include/linux/sunrpc/xprt.h
>>>> @@ -401,6 +401,7 @@ static inline __be32 *xprt_skip_transport_header(struct rpc_xprt *xprt, __be32 *
>>>> bool			xprt_request_get_cong(struct rpc_xprt *xprt, struct rpc_rqst *req);
>>>> void			xprt_disconnect_done(struct rpc_xprt *xprt);
>>>> void			xprt_force_disconnect(struct rpc_xprt *xprt);
>>>> +void			xprt_disconnect_nowake(struct rpc_xprt *xprt);
>>>> void			xprt_conditional_disconnect(struct rpc_xprt *xprt, unsigned int cookie);
>>>> 
>>>> bool			xprt_lock_connect(struct rpc_xprt *, struct rpc_task *, void *);
>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>> index ce92700..afe412e 100644
>>>> --- a/net/sunrpc/xprt.c
>>>> +++ b/net/sunrpc/xprt.c
>>>> @@ -685,6 +685,25 @@ void xprt_force_disconnect(struct rpc_xprt *xprt)
>>>> }
>>>> EXPORT_SYMBOL_GPL(xprt_force_disconnect);
>>>> 
>>>> +/**
>>>> + * xprt_disconnect_nowake - force a call to xprt->ops->close
>>>> + * @xprt: transport to disconnect
>>>> + *
>>>> + * The caller must ensure that xprt_wake_pending_tasks() is
>>>> + * called later.
>>>> + */
>>>> +void xprt_disconnect_nowake(struct rpc_xprt *xprt)
>>>> +{
>>>> +       /* Don't race with the test_bit() in xprt_clear_locked() */
>>>> +       spin_lock_bh(&xprt->transport_lock);
>>>> +       set_bit(XPRT_CLOSE_WAIT, &xprt->state);
>>>> +       /* Try to schedule an autoclose RPC call */
>>>> +       if (test_and_set_bit(XPRT_LOCKED, &xprt->state) == 0)
>>>> +               queue_work(xprtiod_workqueue, &xprt->task_cleanup);
>>>> +       spin_unlock_bh(&xprt->transport_lock);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(xprt_disconnect_nowake);
>>>> +
>>> 
>>> We shouldn't need both xprt_disconnect_nowake() and
>>> xprt_force_disconnect() to be exported given that you can build the
>>> latter from the former + xprt_wake_pending_tasks() (which is also
>>> already exported).
>> 
>> Thanks for your review!
>> 
>> I can get rid of xprt_disconnect_nowake. There are some variations,
>> depending on why wake_pending_tasks is protected by
>> xprt->transport_lock.
> 
> I'm having some second thoughts about the patch that Scott sent out
> last week to fix the issue that Dave and he were seeing. I think that
> what we really need to do to fix his issue is to call
> xprt_disconnect_done() after we've released the TCP socket.
> 
> Given that realisation, I think that we can fix up
> xprt_force_disconnect() to only wake up the task that holds the
> XPRT_LOCKED instead of doing a thundering herd wakeup like we do today.
> That should (I think) obviate the need for a separate
> xprt_disconnect_nowake().

For RPC-over-RDMA, there really is a dangerous race between the waking
task(s) and work being done by the deferred RPC completion handler. IMO
the only safe thing RPC-over-RDMA can do is not wake anything until the
deferred queue is well and truly drained.

As you observed when we last spoke, socket transports are already
protected from this race by the socket lock.... RPC-over-RDMA is going
to have to be more careful.
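
To make the ordering concrete, here is a minimal user-space sketch of
the race. It is not kernel code and not the patch itself: every model_*
name is hypothetical, the "deferred queue" is reduced to a flag per
task, and there is no real concurrency. It only compares waking pending
tasks before and after the deferred replies have been drained.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NTASKS 3

/* Hypothetical task states; these are not the kernel's rpc_task states. */
enum model_state { MODEL_PENDING, MODEL_COMPLETED, MODEL_RETRANSMIT };

struct model_xprt {
	enum model_state task[MODEL_NTASKS];	/* tasks on the pending queue */
	bool reply_deferred[MODEL_NTASKS];	/* good reply received, not yet processed */
};

/* Deferred completion handler: finish every reply that has already arrived. */
static void model_drain_deferred(struct model_xprt *x)
{
	for (size_t i = 0; i < MODEL_NTASKS; i++) {
		if (x->reply_deferred[i] && x->task[i] == MODEL_PENDING)
			x->task[i] = MODEL_COMPLETED;
		x->reply_deferred[i] = false;
	}
}

/* Disconnect wake-up: any task still pending must send its RPC again. */
static void model_wake_pending(struct model_xprt *x)
{
	for (size_t i = 0; i < MODEL_NTASKS; i++)
		if (x->task[i] == MODEL_PENDING)
			x->task[i] = MODEL_RETRANSMIT;
}

/* Returns how many tasks have to retransmit after a disconnect. */
static size_t model_disconnect(bool drain_first)
{
	/* Tasks 0 and 1 have good replies queued; task 2 has none. */
	struct model_xprt x = {
		.task = { MODEL_PENDING, MODEL_PENDING, MODEL_PENDING },
		.reply_deferred = { true, true, false },
	};
	size_t retransmits = 0;

	if (drain_first)
		model_drain_deferred(&x);
	model_wake_pending(&x);
	model_drain_deferred(&x);	/* too late for tasks already woken */

	for (size_t i = 0; i < MODEL_NTASKS; i++)
		if (x.task[i] == MODEL_RETRANSMIT)
			retransmits++;
	return retransmits;
}
```

In this model, model_disconnect(false) throws away both good replies
and all three tasks retransmit, while model_disconnect(true) leaves
only the task that never received a reply to retransmit.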


> A patch is forthcoming later today. I'll make sure you are Cced so you
> can comment.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com

--
Chuck Lever



