From: Chuck Lever <chuck.lever@oracle.com>
To: Trond Myklebust <trond.myklebust@primarydata.com>
Cc: Chuck Lever <chuck.lever@oracle.com>,
Anna Schumaker <Anna.Schumaker@Netapp.com>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH v3] SUNRPC: Fix a race in the receive code path
Date: Wed, 13 Dec 2017 18:42:30 -0500 [thread overview]
Message-ID: <C5351D4D-2E01-4B6D-AF19-EE6BABD46CCF@oracle.com> (raw)
In-Reply-To: <17932F80-EF03-48BA-AD6C-90A441B286EF@oracle.com>
> On Dec 13, 2017, at 11:14 AM, Chuck Lever <chuck.lever@oracle.com> wrote:
> 
> Hi Trond-
> 
>> On Dec 3, 2017, at 7:17 PM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>> 
>> We must ensure that the call to rpc_sleep_on() in xprt_transmit() cannot
>> race with the call to xprt_complete_rqst().
>> 
>> Reported-by: Chuck Lever <chuck.lever@oracle.com>
>> Link: https://bugzilla.linux-nfs.org/show_bug.cgi?id=317
>> Fixes: ce7c252a8c74 ("SUNRPC: Add a separate spinlock to protect..")
>> Cc: stable@vger.kernel.org # 4.14+
>> Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
>> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>> ---
>> net/sunrpc/xprt.c | 26 +++++++++++++++++---------
>> 1 file changed, 17 insertions(+), 9 deletions(-)
>> 
>> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
>> index 333b9d697ae5..5e4278e9ce37 100644
>> --- a/net/sunrpc/xprt.c
>> +++ b/net/sunrpc/xprt.c
>> @@ -1024,6 +1024,7 @@ void xprt_transmit(struct rpc_task *task)
>>  	} else if (!req->rq_bytes_sent)
>>  		return;
>> 
>> +	req->rq_connect_cookie = xprt->connect_cookie;
>>  	req->rq_xtime = ktime_get();
>>  	status = xprt->ops->send_request(task);
>>  	trace_xprt_transmit(xprt, req->rq_xid, status);
>> @@ -1047,20 +1048,27 @@ void xprt_transmit(struct rpc_task *task)
>>  		xprt->stat.bklog_u += xprt->backlog.qlen;
>>  		xprt->stat.sending_u += xprt->sending.qlen;
>>  		xprt->stat.pending_u += xprt->pending.qlen;
>> +	spin_unlock_bh(&xprt->transport_lock);
>> 
>> -	/* Don't race with disconnect */
>> -	if (!xprt_connected(xprt))
>> -		task->tk_status = -ENOTCONN;
>> -	else {
>> +	if (rpc_reply_expected(task) && !READ_ONCE(req->rq_reply_bytes_recvd)) {
>>  		/*
>> -		 * Sleep on the pending queue since
>> -		 * we're expecting a reply.
>> +		 * Sleep on the pending queue if we're expecting a reply.
>> +		 * The spinlock ensures atomicity between the test of
>> +		 * req->rq_reply_bytes_recvd, and the call to rpc_sleep_on().
>>  		 */
>> -		if (!req->rq_reply_bytes_recvd && rpc_reply_expected(task))
>> +		spin_lock(&xprt->recv_lock);
>> +		if (!req->rq_reply_bytes_recvd) {
>>  			rpc_sleep_on(&xprt->pending, task, xprt_timer);
>> -		req->rq_connect_cookie = xprt->connect_cookie;
>> +			/*
>> +			 * Send an extra queue wakeup call if the
>> +			 * connection was dropped in case the call to
>> +			 * rpc_sleep_on() raced.
>> +			 */
>> +			if (!xprt_connected(xprt))
>> +				xprt_wake_pending_tasks(xprt, -ENOTCONN);
>> +		}
>> +		spin_unlock(&xprt->recv_lock);
>>  	}
>> -	spin_unlock_bh(&xprt->transport_lock);
>>  }
>> 
>>  static void xprt_add_backlog(struct rpc_xprt *xprt, struct rpc_task *task)
>> -- 
>> 2.14.3
> 
> I've run into a problem with this version of the patch (on v4.15-rc3).
>=20
> With dbench on NFSv4.0 (on RDMA) I no longer see constant throughput
> with the occasional temporarily climbing latency followed by a
> recovery; instead, this can happen:
> 
> releasing clients
> 16 584 322.55 MB/sec warmup 1 sec latency 807.121 ms
> 16 584 161.28 MB/sec warmup 2 sec latency 1807.175 ms
> 16 584 107.52 MB/sec warmup 3 sec latency 2807.198 ms
> 16 584 80.64 MB/sec warmup 4 sec latency 3807.217 ms
> 16 584 64.51 MB/sec warmup 5 sec latency 4807.235 ms
> 16 584 53.76 MB/sec warmup 6 sec latency 5807.251 ms
> 16 584 46.08 MB/sec warmup 7 sec latency 6807.269 ms
> 16 584 40.32 MB/sec warmup 8 sec latency 7807.285 ms
> 16 584 35.84 MB/sec warmup 9 sec latency 8807.303 ms
> 16 584 32.26 MB/sec warmup 10 sec latency 9807.318 ms
> 16 584 29.32 MB/sec warmup 11 sec latency 10807.334 ms
> 16 584 26.88 MB/sec warmup 12 sec latency 11807.355 ms
> 16 584 24.81 MB/sec warmup 13 sec latency 12807.372 ms
> 16 584 23.04 MB/sec warmup 14 sec latency 13807.387 ms
> 16 584 21.50 MB/sec warmup 15 sec latency 14807.406 ms
> 16 584 20.16 MB/sec warmup 16 sec latency 15807.423 ms
> 
> No recovery. The latency number keeps climbing, and throughput
> drops.
>=20
> "kernel: nfs: server klimt-ib not responding, still trying"
> appears in the client's /var/log/messages.
>=20
> Without this patch applied, the NFSv4.0 behavior is the same as I
> reported with NFSv3: every once in a while, one RPC completion is
> lost, but when the timer fires, the client notices the reply
> actually did arrive and the RPC completes normally.
> 
> I'm looking into it.
Root cause:

The transport is temporarily out of resources and returns -ENOBUFS
from ->send_request.

The FSM schedules a delay, then calls call_transmit again, which
invokes xprt_prepare_transmit.

On NFSv3 mounts, xprt_prepare_transmit tries to acquire the
transport write lock.

On NFSv4 mounts, RPC_TASK_NO_RETRANS_TIMEOUT is set. This causes
xprt_prepare_transmit to see that the transport is connected and
rq_connect_cookie is the same as xprt->connect_cookie, so it
queues the task on ->pending and returns false.

The transport never sends the Call, but still waits for a Reply,
forever. This is not a bug in your patch; it is a pre-existing
issue.
There's a simple fix, which is to set

    rq_connect_cookie = xprt->connect_cookie - 1;

if RPC-over-RDMA header marshaling fails. This is a one-line fix,
appropriate for -rc and for backport to stable (back to at least v4.8).
A more robust fix might be to have ->send_request return -EAGAIN
instead, and let xprtrdma call xprt_write_space when the resource
exhaustion is relieved. That would be more involved, though, and is
thus better suited as future work.
I'm investigating both fixes.
--
Chuck Lever
Thread overview: 12+ messages
2017-12-04 0:17 [PATCH v3] SUNRPC: Fix a race in the receive code path Trond Myklebust
2017-12-13 16:14 ` Chuck Lever
2017-12-13 23:42 ` Chuck Lever [this message]
2017-12-14 1:13 ` Chuck Lever
2017-12-14 12:16 ` Trond Myklebust
2017-12-14 15:49 ` Chuck Lever
2017-12-14 19:03 ` Trond Myklebust
2017-12-14 19:22 ` Chuck Lever
2017-12-14 20:37 ` Trond Myklebust
2017-12-14 20:59 ` Chuck Lever
2017-12-14 21:33 ` Trond Myklebust
2017-12-15 2:16 ` Chuck Lever