From: Chuck Lever <chuck.lever@oracle.com>
To: Trond Myklebust <trondmy@primarydata.com>
Cc: Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/3] SUNRPC: Don't hold the transport lock across socket copy operations
Date: Tue, 15 Aug 2017 11:05:06 -0400
Message-ID: <AB962563-977C-4903-A188-3AD32B9F0086@oracle.com>
In-Reply-To: <1502809254.7593.19.camel@primarydata.com>


> On Aug 15, 2017, at 11:00 AM, Trond Myklebust <trondmy@primarydata.com> wrote:
> 
> On Tue, 2017-08-15 at 10:23 -0400, Chuck Lever wrote:
>>> On Aug 14, 2017, at 9:18 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
>>> 
>>> On Mon, 2017-08-14 at 18:25 -0400, Chuck Lever wrote:
>>>>> On Aug 14, 2017, at 4:38 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
>>>>> 
>>>>> On Mon, 2017-08-14 at 16:23 -0400, Chuck Lever wrote:
>>>>>>> On Aug 14, 2017, at 4:07 PM, Trond Myklebust <trondmy@primarydata.com> wrote:
>>>>>>> 
>>>>>>> On Mon, 2017-08-14 at 15:28 -0400, Chuck Lever wrote:
>>>>>>>>> On Aug 14, 2017, at 3:16 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>>>>>>>> 
>>>>>>>>> Instead add a mechanism to ensure that the request doesn't
>>>>>>>>> disappear from underneath us while copying from the socket. We
>>>>>>>>> do this by preventing xprt_release() from freeing the XDR
>>>>>>>>> buffers until the flag RPC_TASK_MSG_RECV has been cleared from
>>>>>>>>> the request.
>>>>>>>>> 
>>>>>>>>> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>>>>>>>>> ---
>>>>>>>>> include/linux/sunrpc/sched.h |  2 ++
>>>>>>>>> include/linux/sunrpc/xprt.h  |  2 ++
>>>>>>>>> net/sunrpc/xprt.c            | 42 ++++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>> net/sunrpc/xprtsock.c        | 27 ++++++++++++++++++++++-----
>>>>>>>>> 4 files changed, 68 insertions(+), 5 deletions(-)
>>>>>>>>> 
>>>>>>>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h
>>>>>>>>> index 15bc1cd6ee5c..e972d9e05426 100644
>>>>>>>>> --- a/include/linux/sunrpc/sched.h
>>>>>>>>> +++ b/include/linux/sunrpc/sched.h
>>>>>>>>> @@ -141,6 +141,8 @@ struct rpc_task_setup {
>>>>>>>>> #define RPC_TASK_ACTIVE		2
>>>>>>>>> #define RPC_TASK_MSG_XMIT	3
>>>>>>>>> #define RPC_TASK_MSG_XMIT_WAIT	4
>>>>>>>>> +#define RPC_TASK_MSG_RECV	5
>>>>>>>>> +#define RPC_TASK_MSG_RECV_WAIT	6
>>>>>>>>> 
>>>>>>>>> #define RPC_IS_RUNNING(t)	test_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>>>>>>>>> #define rpc_set_running(t)	set_bit(RPC_TASK_RUNNING, &(t)->tk_runstate)
>>>>>>>>> 
>>>>>>>>> diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
>>>>>>>>> index eab1c749e192..65b9e0224753 100644
>>>>>>>>> --- a/include/linux/sunrpc/xprt.h
>>>>>>>>> +++ b/include/linux/sunrpc/xprt.h
>>>>>>>>> @@ -372,6 +372,8 @@ void			xprt_write_space(struct rpc_xprt *xprt);
>>>>>>>>> void			xprt_adjust_cwnd(struct rpc_xprt *xprt, struct rpc_task *task, int result);
>>>>>>>>> struct rpc_rqst *	xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid);
>>>>>>>>> void			xprt_complete_rqst(struct rpc_task *task, int copied);
>>>>>>>>> +void			xprt_pin_rqst(struct rpc_rqst *req);
>>>>>>>>> +void			xprt_unpin_rqst(struct rpc_rqst *req);
>>>>>>>>> void			xprt_release_rqst_cong(struct rpc_task *task);
>>>>>>>>> void			xprt_disconnect_done(struct rpc_xprt *xprt);
>>>>>>>>> void			xprt_force_disconnect(struct rpc_xprt *xprt);
>>>>>>>>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>>>>>>>>> index 788c1b6138c2..62c379865f7c 100644
>>>>>>>>> --- a/net/sunrpc/xprt.c
>>>>>>>>> +++ b/net/sunrpc/xprt.c
>>>>>>>>> @@ -844,6 +844,47 @@ struct rpc_rqst *xprt_lookup_rqst(struct rpc_xprt *xprt, __be32 xid)
>>>>>>>>> }
>>>>>>>>> EXPORT_SYMBOL_GPL(xprt_lookup_rqst);
>>>>>>>>> 
>>>>>>>>> +/**
>>>>>>>>> + * xprt_pin_rqst - Pin a request on the transport receive list
>>>>>>>>> + * @req: Request to pin
>>>>>>>>> + *
>>>>>>>>> + * Caller must ensure this is atomic with the call to xprt_lookup_rqst()
>>>>>>>>> + * so should be holding the xprt transport lock.
>>>>>>>>> + */
>>>>>>>>> +void xprt_pin_rqst(struct rpc_rqst *req)
>>>>>>>>> +{
>>>>>>>>> +	set_bit(RPC_TASK_MSG_RECV, &req->rq_task->tk_runstate);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +/**
>>>>>>>>> + * xprt_unpin_rqst - Unpin a request on the transport receive list
>>>>>>>>> + * @req: Request to unpin
>>>>>>>>> + *
>>>>>>>>> + * Caller should be holding the xprt transport lock.
>>>>>>>>> + */
>>>>>>>>> +void xprt_unpin_rqst(struct rpc_rqst *req)
>>>>>>>>> +{
>>>>>>>>> +	struct rpc_task *task = req->rq_task;
>>>>>>>>> +
>>>>>>>>> +	clear_bit(RPC_TASK_MSG_RECV, &task->tk_runstate);
>>>>>>>>> +	if (test_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate))
>>>>>>>>> +		wake_up_bit(&task->tk_runstate, RPC_TASK_MSG_RECV);
>>>>>>>>> +}
>>>>>>>>> +
>>>>>>>>> +static void xprt_wait_on_pinned_rqst(struct rpc_rqst *req)
>>>>>>>>> +__must_hold(&req->rq_xprt->transport_lock)
>>>>>>>>> +{
>>>>>>>>> +	struct rpc_task *task = req->rq_task;
>>>>>>>>> +	if (test_bit(RPC_TASK_MSG_RECV, &task->tk_runstate)) {
>>>>>>>>> +		spin_unlock_bh(&req->rq_xprt->transport_lock);
>>>>>>>>> +		set_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
>>>>>>>>> +		wait_on_bit(&task->tk_runstate, RPC_TASK_MSG_RECV,
>>>>>>>>> +				TASK_UNINTERRUPTIBLE);
>>>>>>>>> +		clear_bit(RPC_TASK_MSG_RECV_WAIT, &task->tk_runstate);
>>>>>>>>> +		spin_lock_bh(&req->rq_xprt->transport_lock);
>>>>>>>>> +	}
>>>>>>>>> +}
>> 
>> Looking at xprt_wait_on_pinned_rqst, I'm wondering
>> though what happens if a soft timeout occurs but the
>> RPC Reply never arrives because the server died. Will
>> it wait forever in that case?
>> 
>> I can't tell if wait-for-pinned always waits for the
>> Reply, or if it can allow the RPC to exit cleanly if
>> the Reply hasn't arrived.
> 
> Oh... I think I see what your concern is.
> 
> So my assumption is that the bit will always be set and released in
> xs_tcp_read_reply() (and equivalent). It should never be set while
> waiting for data. It is only set when doing _non-blocking_ copying of
> data from the socket to the buffers.
> 
> IOW: wait_for_pinned will wait only if new reply data arrives on the
> socket in the instance when the specific request to which it is being
> delivered is being torn down. If there is no data for the request in
> the socket, then there is no wait. It doesn't matter if there has been
> a partial delivery of data previously (or no delivery); unless
> xs_tcp_read_reply() is actually running and pinning the request, there
> is no wait.

Ah, of course.
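
Just to be sure I have the model right, the receive side would follow
this pattern (my own sketch, abbreviated from my reading of the series,
not the literal patch text):

	spin_lock_bh(&xprt->transport_lock);
	req = xprt_lookup_rqst(xprt, xid);
	if (!req)
		goto out_unlock;
	xprt_pin_rqst(req);	/* sets RPC_TASK_MSG_RECV */
	spin_unlock_bh(&xprt->transport_lock);

	/* non-blocking copy from the socket into req->rq_private_buf;
	 * xprt_release() cannot free the buffers while the bit is set */

	spin_lock_bh(&xprt->transport_lock);
	xprt_complete_rqst(req->rq_task, copied);
	xprt_unpin_rqst(req);	/* wakes xprt_wait_on_pinned_rqst() */
	spin_unlock_bh(&xprt->transport_lock);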

Reviewed-by: Chuck Lever <chuck.lever@oracle.com>


>>>>>>>>> +
>>>>>>>>> static void xprt_update_rtt(struct rpc_task *task)
>>>>>>>>> {
>>>>>>>>> 	struct rpc_rqst *req = task->tk_rqstp;
>>>>>>>>> @@ -1301,6 +1342,7 @@ void xprt_release(struct rpc_task *task)
>>>>>>>>> 		list_del(&req->rq_list);
>>>>>>>>> 	xprt->last_used = jiffies;
>>>>>>>>> 	xprt_schedule_autodisconnect(xprt);
>>>>>>>>> +	xprt_wait_on_pinned_rqst(req);
>>>>>>>>> 	spin_unlock_bh(&xprt->transport_lock);
>>>>>>>> 
>>>>>>>> Is it OK to call wait_on_bit(TASK_UNINTERRUPTIBLE) while
>>>>>>>> holding
>>>>>>>> a BH spin lock? This could be prone to deadlock.
>>>>>>> 
>>>>>>> We drop the lock inside xprt_wait_on_pinned_rqst() if we need to
>>>>>>> wait.
>>>>>>> 
>>>>>>> The reason why we want to hold the lock there is to avoid a
>>>>>>> use-after-free in xprt_unpin_rqst(). Without the lock, there is
>>>>>>> a risk that xprt_release() could complete, and the task get
>>>>>>> freed before we have completed wake_up_bit().
>>>>>> 
>>>>>> Agree with the last bit. I had that challenge with the order of
>>>>>> memory invalidation (which waits) and
>>>>>> xprt_lookup_rqst/complete_rqst in rpcrdma_reply_handler.
>>>>>> 
>>>>>> However my concern is that there are many other users of the
>>>>>> transport_lock. Does it become more tricky not to introduce a
>>>>>> problem by changing any one of the sites that take
>>>>>> transport_lock?
>>>>> 
>>>>> What is your concern? The main reason for taking the transport
>>>>> lock in the current code is twofold:
>>>>> 
>>>>> 1) Perform the lookup of the request on the receive list
>>>>> 2) Ensure the request doesn't disappear while we're writing to it.
>>>> 
>>>> Yes, I think we see the same problem space.
>>>> 
>>>> I missed, though, that xprt_wait_on_pinned_rqst always releases
>>>> transport_lock before invoking wait_on_bit.
>>>> 
>>>> 
>>>>>> Despite the deadlock concern, I don't think it makes sense to
>>>>>> call wait_on_bit while holding a spin lock simply because spin
>>>>>> locks are supposed to be for things that are quick, like updating
>>>>>> a data structure.
>>>>>> 
>>>>>> wait_on_bit can take microseconds: prepare_to_wait and
>>>>>> finish_wait can both take an irqsave spin_lock, for example.
>>>>>> bit_wait is the action that is done if the bit is not ready, and
>>>>>> that calls schedule(). On a busy system, that kind of context
>>>>>> switch can take dozens of microseconds.
>>>>> 
>>>>> Agreed, but we should only very rarely be hitting the wait case.
>>>>> In all successful RPC calls, the receive will have happened long
>>>>> before we call xprt_release(). In fact, the only 2 cases where
>>>>> this can happen would be when the RPC call itself is interrupted.
>>>>> Either:
>>>>> 
>>>>> 1) The task is synchronous and was interrupted by the user.
>>>>> 2) The task is terminating early due to a soft timeout.
>>>>> 
>>>>> In both those cases, we're in a non-performance path. Furthermore,
>>>>> they would currently end up spinning against the transport lock.
>>>> 
>>>> OK, no argument that there is any performance concern.
>>>> 
>>>> I can hit this case 2 or 3 times out of 5 with generic/029 on
>>>> NFSv4/RDMA. These days, the worst that happens is the client drops
>>>> the RDMA connection, because I've spent a very long time immersed
>>>> in these code paths, trying to make the two cases you list above
>>>> work 100% without deadlock or crash. See commits 04d25b7d5d1b
>>>> through 431af645cf66, and commit 68791649a725.
>>>> 
>>>> It makes me uncertain that waiting for anything in xprt_release
>>>> is safe, even if the transport_lock is not being held. xprtrdma
>>>> used to perform synchronous memory invalidation in ->buf_free.
>>>> It doesn't any more for this reason.
>>> 
>>> 
>>> Full disclosure: I didn't actually enable this code for RDMA.
>> 
>> Noted that; I assumed I would be responsible for it
>> if it could be applied to xprtrdma.
>> 
>> 
>>> The reason is firstly that I won't pretend to understand the
>>> locking in rpcrdma_reply_handler(), but furthermore that RDMA
>>> doesn't really have a problem with bulk copy in the same way that
>>> the socket based code does: the direct placement into the reply
>>> buffer means the bulk of the code in rpcrdma_reply_handler() is all
>>> about checking the alignment (with, admittedly, some scary bits in
>>> rpcrdma_inline_fixup()).
>> 
>> rpcrdma_reply_handler actually can use xprt_pin_rqst.
>> 
>> It's not doing a data copy (except for the pull-up in
>> rpcrdma_inline_fixup), but it is performing a possibly
>> long-running synchronous memory invalidation. During
>> that invalidation, the transport_lock cannot be held.
>> 
>> With xprt_pin_rqst, rpcrdma_reply_handler would
>> probably work like this:
>> 
>> 
>> spin_lock(transport_lock)
>> xprt_lookup_rqst
>> xprt_pin_rqst
>> spin_unlock(transport_lock)
>> 
>> invalidate memory associated with this rqst, and wait
>> for it to complete
>> 
>> reconstruct the Reply in rqst->rq_rcv_buf, including
>> maybe a substantial pull-up if the Reply is large and
>> inline
>> 
>> spin_lock(transport_lock)
>> xprt_complete_rqst
>> xprt_unpin_rqst
>> spin_unlock(transport_lock)
>> 
>> 
>> This would have to co-ordinate with xprt_rdma_free. If
>> there are still memory regions registered when
>> xprt_rdma_free is called, that means the RPC was
>> terminated before the Reply arrived. They still need to
>> be invalidated, but these functions have to ensure that
>> invalidation does not occur twice.
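>> 
>> A sketch of the coordination I have in mind (illustrative only: the
>> flag name and the unmap call here are stand-ins I am making up, not
>> existing xprtrdma code):
>> 
>> 	/* Called from both the reply handler and xprt_rdma_free();
>> 	 * test_and_clear_bit() guarantees that only the first caller
>> 	 * performs the invalidation. */
>> 	if (test_and_clear_bit(RPCRDMA_REQ_F_REGISTERED, &req->rl_flags))
>> 		r_xprt->rx_ia.ri_ops->ro_unmap_sync(r_xprt, req);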
>> 
>> 
>>>>>> IMO it would be a lot nicer if we had an FSM state available
>>>>>> that could allow another sleep while an RPC task is in
>>>>>> xprt_release.
>>>>> 
>>>>> Why would that be preferable, if the argument holds that this
>>>>> only happens in these 2 rare error cases?
>>>> 
>>>> Because FSM steps are the way this code manages sleeps. That
>>>> makes it easier to understand and less brittle. I don't think
>>>> it would be more overhead than test_bit().
>>> 
>>> The problem is that it would add new states to a number of
>>> functions that currently call xprt_release(). Right now, that would
>>> mean
>>> 
>>> call_reserveresult()
>>> rpc_verify_header()
>>> rpc_exit_task()
>> 
>> Right. That's why I wasn't bold enough to actually try adding
>> the FSM myself.
>> 
>> Your solution is helped somewhat by checking to see if a Call
>> was actually sent and a Reply is therefore expected. In some
>> of these cases, that means waiting can be avoided entirely
>> when the RPC exits before call_transmit.
>> 
>> 
>>> In addition, we'd need a separate solution for rpc_put_task(), and
>>> probably also for rpc_release_task() (since these are both called
>>> from outside the main state machine loop).
>>> 
>>> I'm getting a little worried about your reply above: the RPC engine
>>> offers no real-time guarantees, so if the RDMA code is really this
>>> latency-sensitive, then what will happen if we enable pre-emption
>>> and/or we just have a very congested workqueue?
>> 
>> The sensitivity is only performance-related. On CPU-intensive
>> workloads, RPC execution time can increase considerably. The
>> RDMA transport doesn't rely on timely execution for proper
>> operation, but maybe I don't understand your worry.
>> 
>> 
>>> Also, how would a state machine solution help? I'd expect that to
>>> make the latency worse, not better, since the work item would end
>>> up getting requeued. If the worker thread then gets rescheduled as
>>> well...
>>> 
>>> One solution may be to simply say we're never going to use this
>>> mechanism for RDMA. I suspect that if we just replace the use of
>>> xprt->transport_lock when protecting the receive queue with a new
>>> lock that wouldn't need to be bh-safe, then that might suffice to
>>> deal with the RDMA latency issues. I have less confidence that
>>> would work for the others due to the bulk data copy issue.
>> 
>> I would like a separate non-BH lock, if that can be made to
>> work. That's in fact how xprtrdma currently works. xprtrdma
>> has its own receive list at the moment so that it can find
>> memory associated with a rqst _without_ needing to call
>> xprt_lookup_rqst. A non-BH lock is used to protect that
>> list.
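>> 
>> In other words, something like this on the receive path (a sketch
>> only; "recv_lock" is a name I am inventing here):
>> 
>> 	spin_lock(&xprt->recv_lock);
>> 	req = xprt_lookup_rqst(xprt, xid);
>> 	if (req)
>> 		xprt_pin_rqst(req);
>> 	spin_unlock(&xprt->recv_lock);
>> 
>> with no bh-disabling needed around the lookup.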
> 
> That can definitely be added. I see that as a separate issue.
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.myklebust@primarydata.com

--
Chuck Lever




Thread overview: 17+ messages
2017-08-14 19:16 [PATCH 0/3] RPC client latency fixes Trond Myklebust
2017-08-14 19:16 ` [PATCH 1/3] SUNRPC: Don't hold the transport lock across socket copy operations Trond Myklebust
2017-08-14 19:16   ` [PATCH 2/3] SUNRPC: Don't loop forever in xs_tcp_data_receive() Trond Myklebust
2017-08-14 19:16     ` [PATCH 3/3] SUNRPC: Cleanup xs_tcp_read_common() Trond Myklebust
2017-08-14 19:28   ` [PATCH 1/3] SUNRPC: Don't hold the transport lock across socket copy operations Chuck Lever
2017-08-14 20:07     ` Trond Myklebust
2017-08-14 20:23       ` Chuck Lever
2017-08-14 20:38         ` Trond Myklebust
2017-08-14 22:25           ` Chuck Lever
2017-08-15  1:18             ` Trond Myklebust
2017-08-15 14:23               ` Chuck Lever
2017-08-15 15:00                 ` Trond Myklebust
2017-08-15 15:05                   ` Chuck Lever [this message]
2017-08-15 16:13                     ` Trond Myklebust
2017-08-15 17:54                       ` Chuck Lever
2017-08-16 15:45   ` Anna Schumaker
2017-08-16 16:05     ` Trond Myklebust
