* [PATCH 00/11] SUNRPC server scalability improvements
@ 2014-08-03 17:03 Trond Myklebust
  2014-08-03 17:03 ` [PATCH 01/11] SUNRPC: Reduce contention in svc_xprt_enqueue() Trond Myklebust
  2014-08-04 13:36 ` [PATCH 00/11] SUNRPC server scalability improvements Bruce Fields
  0 siblings, 2 replies; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Hi Bruce,

When testing knfsd on 10GigE and 40GigE networks, we've been hitting a few
scalability issues that are seriously affecting performance. The main issue
was scalability of the pool->sp_lock, but we also hit a couple of things
like the use of ioctls in the receive fast path, as well as some issues
with heuristic tests being performed twice in the same path.
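
To illustrate the main culprit: svc_xprt_enqueue() takes the per-pool
spinlock even when the transport turns out to be busy already, and the
callers visible in the perf trace in patch 01 (svc_reserve,
svc_xprt_received, svc_tcp_data_ready) all funnel through it. Here is a
rough sketch of the pre-series fast path, reconstructed from the hunks
this series removes (stats, printks and the actual thread wakeup are
elided, so treat it as illustrative rather than a literal copy):

void svc_xprt_enqueue(struct svc_xprt *xprt)
{
        struct svc_pool *pool;
        int cpu;

        if (!svc_xprt_has_something_to_do(xprt))
                return;

        cpu = get_cpu();
        pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
        put_cpu();

        spin_lock_bh(&pool->sp_lock);   /* contended by every caller */

        /* ... only now do we discover that the transport is busy ... */
        if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
                goto out_unlock;

        /* ... hand the transport to an idle nfsd thread, or queue it
         * on pool->sp_sockets for later ... */
out_unlock:
        spin_unlock_bh(&pool->sp_lock);
}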

This is not urgent, and can definitely be delayed until the 3.18 merge
window, however since the performance gains were significant over NFSv3,
I thought I'd share now.

Cheers
  Trond

Trond Myklebust (11):
  SUNRPC: Reduce contention in svc_xprt_enqueue()
  SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurely
  SUNRPC: Allow svc_reserve() to notify TCP socket that space has been
    freed
  SUNRPC: Do not override wspace tests in svc_handle_xprt
  lockd: Ensure that lockd_start_svc sets the server rq_task...
  nfs: Ensure that nfs_callback_start_svc sets the server rq_task...
  SUNRPC: Do not grab pool->sp_lock unnecessarily in svc_get_next_xprt
  SUNRPC: get rid of the request wait queue
  SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt
  SUNRPC: More optimisations of svc_xprt_enqueue()
  SUNRPC: Optimise away svc_recv_available

 fs/lockd/svc.c                  |   2 +
 fs/nfs/callback.c               |   1 +
 include/linux/sunrpc/svc.h      |   1 -
 include/linux/sunrpc/svc_xprt.h |   1 +
 net/sunrpc/svc.c                |   2 -
 net/sunrpc/svc_xprt.c           | 112 ++++++++++++++++++++--------------------
 net/sunrpc/svcsock.c            |  71 ++++++++++++-------------
 7 files changed, 96 insertions(+), 94 deletions(-)

-- 
1.9.3



* [PATCH 01/11] SUNRPC: Reduce contention in svc_xprt_enqueue()
  2014-08-03 17:03 [PATCH 00/11] SUNRPC server scalability improvements Trond Myklebust
@ 2014-08-03 17:03 ` Trond Myklebust
  2014-08-03 17:03   ` [PATCH 02/11] SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurely Trond Myklebust
  2014-08-04 13:36 ` [PATCH 00/11] SUNRPC server scalability improvements Bruce Fields
  1 sibling, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Ensure that all calls to svc_xprt_enqueue(), except the one made from
svc_xprt_received(), check the value of XPT_BUSY before attempting to
grab spinlocks etc.
This is to avoid situations such as the following "perf" trace,
which shows heavy contention on the pool spinlock:

    54.15%            nfsd  [kernel.kallsyms]        [k] _raw_spin_lock_bh
                      |
                      --- _raw_spin_lock_bh
                         |
                         |--71.43%-- svc_xprt_enqueue
                         |          |
                         |          |--50.31%-- svc_reserve
                         |          |
                         |          |--31.35%-- svc_xprt_received
                         |          |
                         |          |--18.34%-- svc_tcp_data_ready
...
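
The resulting split, condensed from the diff below: ordinary callers now
go through a cheap lockless gate, while svc_xprt_received(), which has
just cleared XPT_BUSY itself, calls the worker function directly.

void svc_xprt_enqueue(struct svc_xprt *xprt)
{
        /* If someone already owns the transport (XPT_BUSY is set),
         * enqueueing it again is a no-op, so don't touch the pool
         * spinlock at all. The authoritative test_and_set_bit() still
         * happens inside svc_xprt_do_enqueue().
         */
        if (test_bit(XPT_BUSY, &xprt->xpt_flags))
                return;
        svc_xprt_do_enqueue(xprt);
}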

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svc_xprt.c | 25 +++++++++++++++++--------
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index b4737fbdec13..54a761fa6351 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -23,6 +23,7 @@ static int svc_deferred_recv(struct svc_rqst *rqstp);
 static struct cache_deferred_req *svc_defer(struct cache_req *req);
 static void svc_age_temp_xprts(unsigned long closure);
 static void svc_delete_xprt(struct svc_xprt *xprt);
+static void svc_xprt_do_enqueue(struct svc_xprt *xprt);
 
 /* apparently the "standard" is that clients close
  * idle connections after 5 minutes, servers after
@@ -222,11 +223,12 @@ static void svc_xprt_received(struct svc_xprt *xprt)
 	if (!test_bit(XPT_BUSY, &xprt->xpt_flags))
 		return;
 	/* As soon as we clear busy, the xprt could be closed and
-	 * 'put', so we need a reference to call svc_xprt_enqueue with:
+	 * 'put', so we need a reference to call svc_xprt_do_enqueue with:
 	 */
 	svc_xprt_get(xprt);
+	smp_mb__before_clear_bit();
 	clear_bit(XPT_BUSY, &xprt->xpt_flags);
-	svc_xprt_enqueue(xprt);
+	svc_xprt_do_enqueue(xprt);
 	svc_xprt_put(xprt);
 }
 
@@ -335,12 +337,7 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
 	return false;
 }
 
-/*
- * Queue up a transport with data pending. If there are idle nfsd
- * processes, wake 'em up.
- *
- */
-void svc_xprt_enqueue(struct svc_xprt *xprt)
+static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 {
 	struct svc_pool *pool;
 	struct svc_rqst	*rqstp;
@@ -398,6 +395,18 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 out_unlock:
 	spin_unlock_bh(&pool->sp_lock);
 }
+
+/*
+ * Queue up a transport with data pending. If there are idle nfsd
+ * processes, wake 'em up.
+ *
+ */
+void svc_xprt_enqueue(struct svc_xprt *xprt)
+{
+	if (test_bit(XPT_BUSY, &xprt->xpt_flags))
+		return;
+	svc_xprt_do_enqueue(xprt);
+}
 EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
 
 /*
-- 
1.9.3



* [PATCH 02/11] SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurely
  2014-08-03 17:03 ` [PATCH 01/11] SUNRPC: Reduce contention in svc_xprt_enqueue() Trond Myklebust
@ 2014-08-03 17:03   ` Trond Myklebust
  2014-08-03 17:03     ` [PATCH 03/11] SUNRPC: Allow svc_reserve() to notify TCP socket that space has been freed Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

If requests are queued in the socket's input buffer waiting for an
svc_tcp_has_wspace() requirement to be satisfied, then we do not want
to clear the SOCK_NOSPACE flag until we've satisfied that requirement.
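
Condensed from the diff below, the net effect is that
svc_tcp_write_space() re-runs the reservation heuristic before touching
the flag, instead of clearing it as soon as the stream becomes writeable:

static void svc_tcp_write_space(struct sock *sk)
{
        struct svc_sock *svsk = (struct svc_sock *)(sk->sk_user_data);
        struct socket *sock = sk->sk_socket;

        if (!sk_stream_is_writeable(sk) || !sock)
                return;
        /* leave SOCK_NOSPACE set until svc_tcp_has_wspace() is happy */
        if (!svsk || svc_tcp_has_wspace(&svsk->sk_xprt))
                clear_bit(SOCK_NOSPACE, &sock->flags);
        svc_write_space(sk);
}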

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svcsock.c | 39 +++++++++++++++++++++------------------
 1 file changed, 21 insertions(+), 18 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index b507cd327d9b..7322ea1164fd 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -446,11 +446,31 @@ static void svc_write_space(struct sock *sk)
 	}
 }
 
+static int svc_tcp_has_wspace(struct svc_xprt *xprt)
+{
+	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
+	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
+	int required;
+
+	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
+		return 1;
+	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+	if (sk_stream_wspace(svsk->sk_sk) >= required ||
+	    (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
+	     atomic_read(&xprt->xpt_reserved) == 0))
+		return 1;
+	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+	return 0;
+}
+
 static void svc_tcp_write_space(struct sock *sk)
 {
+	struct svc_sock *svsk = (struct svc_sock *)(sk->sk_user_data);
 	struct socket *sock = sk->sk_socket;
 
-	if (sk_stream_is_writeable(sk) && sock)
+	if (!sk_stream_is_writeable(sk) || !sock)
+		return;
+	if (!svsk || svc_tcp_has_wspace(&svsk->sk_xprt))
 		clear_bit(SOCK_NOSPACE, &sock->flags);
 	svc_write_space(sk);
 }
@@ -1197,23 +1217,6 @@ static void svc_tcp_prep_reply_hdr(struct svc_rqst *rqstp)
 	svc_putnl(resv, 0);
 }
 
-static int svc_tcp_has_wspace(struct svc_xprt *xprt)
-{
-	struct svc_sock *svsk =	container_of(xprt, struct svc_sock, sk_xprt);
-	struct svc_serv *serv = svsk->sk_xprt.xpt_server;
-	int required;
-
-	if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
-		return 1;
-	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
-	if (sk_stream_wspace(svsk->sk_sk) >= required ||
-	    (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
-	     atomic_read(&xprt->xpt_reserved) == 0))
-		return 1;
-	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
-	return 0;
-}
-
 static struct svc_xprt *svc_tcp_create(struct svc_serv *serv,
 				       struct net *net,
 				       struct sockaddr *sa, int salen,
-- 
1.9.3



* [PATCH 03/11] SUNRPC: Allow svc_reserve() to notify TCP socket that space has been freed
  2014-08-03 17:03   ` [PATCH 02/11] SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurely Trond Myklebust
@ 2014-08-03 17:03     ` Trond Myklebust
  2014-08-03 17:03       ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 include/linux/sunrpc/svc_xprt.h | 1 +
 net/sunrpc/svc_xprt.c           | 2 ++
 net/sunrpc/svcsock.c            | 9 +++++++++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index 7235040a19b2..8f241ac6934b 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -25,6 +25,7 @@ struct svc_xprt_ops {
 	void		(*xpo_detach)(struct svc_xprt *);
 	void		(*xpo_free)(struct svc_xprt *);
 	int		(*xpo_secure_port)(struct svc_rqst *);
+	void		(*xpo_adjust_wspace)(struct svc_xprt *);
 };
 
 struct svc_xprt_class {
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 54a761fa6351..32647b2a6a34 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -448,6 +448,8 @@ void svc_reserve(struct svc_rqst *rqstp, int space)
 		atomic_sub((rqstp->rq_reserved - space), &xprt->xpt_reserved);
 		rqstp->rq_reserved = space;
 
+		if (xprt->xpt_ops->xpo_adjust_wspace)
+			xprt->xpt_ops->xpo_adjust_wspace(xprt);
 		svc_xprt_enqueue(xprt);
 	}
 }
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 7322ea1164fd..72597d7fe60a 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -475,6 +475,14 @@ static void svc_tcp_write_space(struct sock *sk)
 	svc_write_space(sk);
 }
 
+static void svc_tcp_adjust_wspace(struct svc_xprt *xprt)
+{
+	struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
+
+	if (svc_tcp_has_wspace(xprt))
+		clear_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
+}
+
 /*
  * See net/ipv6/ip_sockglue.c : ip_cmsg_recv_pktinfo
  */
@@ -1288,6 +1296,7 @@ static struct svc_xprt_ops svc_tcp_ops = {
 	.xpo_has_wspace = svc_tcp_has_wspace,
 	.xpo_accept = svc_tcp_accept,
 	.xpo_secure_port = svc_sock_secure_port,
+	.xpo_adjust_wspace = svc_tcp_adjust_wspace,
 };
 
 static struct svc_xprt_class svc_tcp_class = {
-- 
1.9.3



* [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt
  2014-08-03 17:03     ` [PATCH 03/11] SUNRPC: Allow svc_reserve() to notify TCP socket that space has been freed Trond Myklebust
@ 2014-08-03 17:03       ` Trond Myklebust
  2014-08-03 17:03         ` [PATCH 05/11] lockd: Ensure that lockd_start_svc sets the server rq_task Trond Myklebust
  2014-08-12 19:16         ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Bruce Fields
  0 siblings, 2 replies; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

We already determined that there was enough wspace when we
called svc_xprt_enqueue.

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svc_xprt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 32647b2a6a34..b731077b6a30 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -744,7 +744,7 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
 			svc_add_new_temp_xprt(serv, newxpt);
 		else
 			module_put(xprt->xpt_class->xcl_owner);
-	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+	} else {
 		/* XPT_DATA|XPT_DEFERRED case: */
 		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
 			rqstp, rqstp->rq_pool->sp_id, xprt,
-- 
1.9.3



* [PATCH 05/11] lockd: Ensure that lockd_start_svc sets the server rq_task...
  2014-08-03 17:03       ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Trond Myklebust
@ 2014-08-03 17:03         ` Trond Myklebust
  2014-08-03 17:03           ` [PATCH 06/11] nfs: Ensure that nfs_callback_start_svc " Trond Myklebust
  2014-08-12 19:16         ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Bruce Fields
  1 sibling, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 fs/lockd/svc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 8f27c93f8d2e..b416b3355807 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -313,6 +313,8 @@ static int lockd_start_svc(struct svc_serv *serv)
 			"lockd_up: kthread_run failed, error=%d\n", error);
 		goto out_task;
 	}
+	nlmsvc_rqst->rq_task = nlmsvc_task;
+
 	dprintk("lockd_up: service started\n");
 	return 0;
 
-- 
1.9.3



* [PATCH 06/11] nfs: Ensure that nfs_callback_start_svc sets the server rq_task...
  2014-08-03 17:03         ` [PATCH 05/11] lockd: Ensure that lockd_start_svc sets the server rq_task Trond Myklebust
@ 2014-08-03 17:03           ` Trond Myklebust
  2014-08-03 17:03             ` [PATCH 07/11] SUNRPC: Do not grab pool->sp_lock unnecessarily in svc_get_next_xprt Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 fs/nfs/callback.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 073b4cf67ed9..5350bc23afe4 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -244,6 +244,7 @@ static int nfs_callback_start_svc(int minorversion, struct rpc_xprt *xprt,
 		cb_info->task = NULL;
 		return ret;
 	}
+	rqstp->rq_task = cb_info->task;
 	dprintk("nfs_callback_up: service started\n");
 	return 0;
 }
-- 
1.9.3



* [PATCH 07/11] SUNRPC: Do not grab pool->sp_lock unnecessarily in svc_get_next_xprt
  2014-08-03 17:03           ` [PATCH 06/11] nfs: Ensure that nfs_callback_start_svc " Trond Myklebust
@ 2014-08-03 17:03             ` Trond Myklebust
  2014-08-03 17:03               ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svc_xprt.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index b731077b6a30..2c30193c7a13 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -651,8 +651,8 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	} else {
 		if (pool->sp_task_pending) {
 			pool->sp_task_pending = 0;
-			spin_unlock_bh(&pool->sp_lock);
-			return ERR_PTR(-EAGAIN);
+			xprt = ERR_PTR(-EAGAIN);
+			goto out;
 		}
 		/* No data pending. Go to sleep */
 		svc_thread_enqueue(pool, rqstp);
@@ -672,8 +672,8 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 		 */
 		if (kthread_should_stop()) {
 			set_current_state(TASK_RUNNING);
-			spin_unlock_bh(&pool->sp_lock);
-			return ERR_PTR(-EINTR);
+			xprt = ERR_PTR(-EINTR);
+			goto out;
 		}
 
 		add_wait_queue(&rqstp->rq_wait, &wait);
@@ -683,8 +683,12 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 
 		try_to_freeze();
 
-		spin_lock_bh(&pool->sp_lock);
 		remove_wait_queue(&rqstp->rq_wait, &wait);
+		xprt = rqstp->rq_xprt;
+		if (xprt != NULL)
+			return xprt;
+
+		spin_lock_bh(&pool->sp_lock);
 		if (!time_left)
 			pool->sp_stats.threads_timedout++;
 
@@ -699,6 +703,7 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 				return ERR_PTR(-EAGAIN);
 		}
 	}
+out:
 	spin_unlock_bh(&pool->sp_lock);
 	return xprt;
 }
-- 
1.9.3



* [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-03 17:03             ` [PATCH 07/11] SUNRPC: Do not grab pool->sp_lock unnecessarily in svc_get_next_xprt Trond Myklebust
@ 2014-08-03 17:03               ` Trond Myklebust
  2014-08-03 17:03                 ` [PATCH 09/11] SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt Trond Myklebust
  2014-08-12 19:53                 ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Bruce Fields
  0 siblings, 2 replies; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

We're always _only_ waking up tasks from within the sp_threads list, so
we know that they are enqueued and alive. The rq_wait waitqueue is just
a distraction with extra atomic semantics.
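
The waker/waiter pair after this change, condensed from the hunks below
(locking and error paths elided): the waker pins the transport, wakes
the thread, and only then publishes rq_xprt, so the waiter may test
rq_xprt without holding the pool lock.

        /* waker side, in svc_xprt_do_enqueue(), under pool->sp_lock */
        svc_xprt_get(xprt);
        wake_up_process(rqstp->rq_task);
        rqstp->rq_xprt = xprt;

        /* waiter side, in svc_get_next_xprt(), after dropping the lock */
        time_left = schedule_timeout(timeout);
        __set_current_state(TASK_RUNNING);
        try_to_freeze();
        xprt = rqstp->rq_xprt;                  /* lockless check */
        if (xprt != NULL)
                return xprt;
        spin_lock_bh(&pool->sp_lock);           /* timeout or early wakeup */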

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 include/linux/sunrpc/svc.h |  1 -
 net/sunrpc/svc.c           |  2 --
 net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
 3 files changed, 17 insertions(+), 18 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 1bc7cd05b22e..3ec769b65c7f 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -280,7 +280,6 @@ struct svc_rqst {
 	int			rq_splice_ok;   /* turned off in gss privacy
 						 * to prevent encrypting page
 						 * cache pages */
-	wait_queue_head_t	rq_wait;	/* synchronization */
 	struct task_struct	*rq_task;	/* service thread */
 };
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 5de6801cd924..dfb78c4f3031 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	if (!rqstp)
 		goto out_enomem;
 
-	init_waitqueue_head(&rqstp->rq_wait);
-
 	serv->sv_nrthreads++;
 	spin_lock_bh(&pool->sp_lock);
 	pool->sp_nrthreads++;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 2c30193c7a13..438e91c12851 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 
 	cpu = get_cpu();
 	pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
-	put_cpu();
-
 	spin_lock_bh(&pool->sp_lock);
 
 	if (!list_empty(&pool->sp_threads) &&
@@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 			printk(KERN_ERR
 				"svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
 				rqstp, rqstp->rq_xprt);
-		rqstp->rq_xprt = xprt;
+		/* Note the order of the following 3 lines:
+		 * We want to assign xprt to rqstp->rq_xprt only _after_
+		 * we've woken up the process, so that we don't race with
+		 * the lockless check in svc_get_next_xprt().
+		 */
 		svc_xprt_get(xprt);
+		wake_up_process(rqstp->rq_task);
+		rqstp->rq_xprt = xprt;
 		pool->sp_stats.threads_woken++;
-		wake_up(&rqstp->rq_wait);
 	} else {
 		dprintk("svc: transport %p put into queue\n", xprt);
 		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
@@ -394,6 +397,7 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 
 out_unlock:
 	spin_unlock_bh(&pool->sp_lock);
+	put_cpu();
 }
 
 /*
@@ -509,7 +513,7 @@ void svc_wake_up(struct svc_serv *serv)
 			svc_thread_dequeue(pool, rqstp);
 			rqstp->rq_xprt = NULL;
 			 */
-			wake_up(&rqstp->rq_wait);
+			wake_up_process(rqstp->rq_task);
 		} else
 			pool->sp_task_pending = 1;
 		spin_unlock_bh(&pool->sp_lock);
@@ -628,7 +632,6 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 {
 	struct svc_xprt *xprt;
 	struct svc_pool		*pool = rqstp->rq_pool;
-	DECLARE_WAITQUEUE(wait, current);
 	long			time_left;
 
 	/* Normally we will wait up to 5 seconds for any required
@@ -654,15 +657,15 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 			xprt = ERR_PTR(-EAGAIN);
 			goto out;
 		}
-		/* No data pending. Go to sleep */
-		svc_thread_enqueue(pool, rqstp);
-
 		/*
 		 * We have to be able to interrupt this wait
 		 * to bring down the daemons ...
 		 */
 		set_current_state(TASK_INTERRUPTIBLE);
 
+		/* No data pending. Go to sleep */
+		svc_thread_enqueue(pool, rqstp);
+
 		/*
 		 * checking kthread_should_stop() here allows us to avoid
 		 * locking and signalling when stopping kthreads that call
@@ -676,14 +679,13 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 			goto out;
 		}
 
-		add_wait_queue(&rqstp->rq_wait, &wait);
 		spin_unlock_bh(&pool->sp_lock);
 
 		time_left = schedule_timeout(timeout);
+		__set_current_state(TASK_RUNNING);
 
 		try_to_freeze();
 
-		remove_wait_queue(&rqstp->rq_wait, &wait);
 		xprt = rqstp->rq_xprt;
 		if (xprt != NULL)
 			return xprt;
@@ -786,10 +788,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 		printk(KERN_ERR
 			"svc_recv: service %p, transport not NULL!\n",
 			 rqstp);
-	if (waitqueue_active(&rqstp->rq_wait))
-		printk(KERN_ERR
-			"svc_recv: service %p, wait queue active!\n",
-			 rqstp);
+
+	/* Make sure the task pointer is set! */
+	if (WARN_ON_ONCE(!rqstp->rq_task))
+		rqstp->rq_task = current_task;
 
 	err = svc_alloc_arg(rqstp);
 	if (err)
-- 
1.9.3



* [PATCH 09/11] SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt
  2014-08-03 17:03               ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Trond Myklebust
@ 2014-08-03 17:03                 ` Trond Myklebust
  2014-08-03 17:03                   ` [PATCH 10/11] SUNRPC: More optimisations of svc_xprt_enqueue() Trond Myklebust
  2014-08-12 19:53                 ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Bruce Fields
  1 sibling, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

We should definitely not be exiting svc_get_next_xprt() with the
thread still enqueued. Fix this by ensuring that we always fall through
to the dequeue code.
Also move the test itself outside the spinlock-protected section.
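
Condensed from the diff below, the reworked tail of svc_get_next_xprt():
a thread that has been signalled or asked to stop skips the sleep, but
still retakes the pool lock and falls through, so it gets dequeued from
pool->sp_threads before the function returns.

        /* after svc_thread_enqueue() and spin_unlock_bh(&pool->sp_lock) */
        if (!(signalled() || kthread_should_stop())) {
                time_left = schedule_timeout(timeout);
                __set_current_state(TASK_RUNNING);
                try_to_freeze();
                xprt = rqstp->rq_xprt;
                if (xprt != NULL)
                        return xprt;
        } else
                __set_current_state(TASK_RUNNING);      /* skip the sleep... */

        spin_lock_bh(&pool->sp_lock);   /* ...but still fall through and
                                         * dequeue from pool->sp_threads */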

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svc_xprt.c | 31 ++++++++++---------------------
 1 file changed, 10 insertions(+), 21 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 438e91c12851..b334cf71be3a 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -632,7 +632,7 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 {
 	struct svc_xprt *xprt;
 	struct svc_pool		*pool = rqstp->rq_pool;
-	long			time_left;
+	long			time_left = 0;
 
 	/* Normally we will wait up to 5 seconds for any required
 	 * cache information to be provided.
@@ -665,30 +665,19 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 
 		/* No data pending. Go to sleep */
 		svc_thread_enqueue(pool, rqstp);
-
-		/*
-		 * checking kthread_should_stop() here allows us to avoid
-		 * locking and signalling when stopping kthreads that call
-		 * svc_recv. If the thread has already been woken up, then
-		 * we can exit here without sleeping. If not, then it
-		 * it'll be woken up quickly during the schedule_timeout
-		 */
-		if (kthread_should_stop()) {
-			set_current_state(TASK_RUNNING);
-			xprt = ERR_PTR(-EINTR);
-			goto out;
-		}
-
 		spin_unlock_bh(&pool->sp_lock);
 
-		time_left = schedule_timeout(timeout);
-		__set_current_state(TASK_RUNNING);
+		if (!(signalled() || kthread_should_stop())) {
+			time_left = schedule_timeout(timeout);
+			__set_current_state(TASK_RUNNING);
 
-		try_to_freeze();
+			try_to_freeze();
 
-		xprt = rqstp->rq_xprt;
-		if (xprt != NULL)
-			return xprt;
+			xprt = rqstp->rq_xprt;
+			if (xprt != NULL)
+				return xprt;
+		} else
+			__set_current_state(TASK_RUNNING);
 
 		spin_lock_bh(&pool->sp_lock);
 		if (!time_left)
-- 
1.9.3



* [PATCH 10/11] SUNRPC: More optimisations of svc_xprt_enqueue()
  2014-08-03 17:03                 ` [PATCH 09/11] SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt Trond Myklebust
@ 2014-08-03 17:03                   ` Trond Myklebust
  2014-08-03 17:03                     ` [PATCH 11/11] SUNRPC: Optimise away svc_recv_available Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

Just move the transport locking (the XPT_BUSY handling) out of the
spinlock-protected area altogether.
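
Condensed from the diff below: the XPT_BUSY handoff is now decided with
a single atomic operation before the pool is even looked up, so a
transport that is already being handled never touches sp_lock at all.

        if (!svc_xprt_has_something_to_do(xprt))
                return;
        if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags))
                return;         /* already enqueued or being handled */

        cpu = get_cpu();
        pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
        spin_lock_bh(&pool->sp_lock);
        pool->sp_stats.packets++;
        /* ... wake an idle thread or queue the transport ... */
        spin_unlock_bh(&pool->sp_lock);
        put_cpu();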

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svc_xprt.c | 21 +++++++--------------
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index b334cf71be3a..7eff6938f814 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -346,18 +346,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 	if (!svc_xprt_has_something_to_do(xprt))
 		return;
 
-	cpu = get_cpu();
-	pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
-	spin_lock_bh(&pool->sp_lock);
-
-	if (!list_empty(&pool->sp_threads) &&
-	    !list_empty(&pool->sp_sockets))
-		printk(KERN_ERR
-		       "svc_xprt_enqueue: "
-		       "threads and transports both waiting??\n");
-
-	pool->sp_stats.packets++;
-
 	/* Mark transport as busy. It will remain in this state until
 	 * the provider calls svc_xprt_received. We update XPT_BUSY
 	 * atomically because it also guards against trying to enqueue
@@ -366,9 +354,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 	if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
 		/* Don't enqueue transport while already enqueued */
 		dprintk("svc: transport %p busy, not enqueued\n", xprt);
-		goto out_unlock;
+		return;
 	}
 
+	cpu = get_cpu();
+	pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
+	spin_lock_bh(&pool->sp_lock);
+
+	pool->sp_stats.packets++;
+
 	if (!list_empty(&pool->sp_threads)) {
 		rqstp = list_entry(pool->sp_threads.next,
 				   struct svc_rqst,
@@ -395,7 +389,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
 		pool->sp_stats.sockets_queued++;
 	}
 
-out_unlock:
 	spin_unlock_bh(&pool->sp_lock);
 	put_cpu();
 }
-- 
1.9.3



* [PATCH 11/11] SUNRPC: Optimise away svc_recv_available
  2014-08-03 17:03                   ` [PATCH 10/11] SUNRPC: More optimisations of svc_xprt_enqueue() Trond Myklebust
@ 2014-08-03 17:03                     ` Trond Myklebust
  0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2014-08-03 17:03 UTC (permalink / raw)
  To: Bruce Fields; +Cc: linux-nfs

We really do not want to do ioctls in the server's fast path. Instead, let's
use the fact that we managed to read a full record as the indicator that
we should try to read the socket again.
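
Condensed from the diff below: the TIOCINQ probe goes away, and
svc_recvfrom() itself re-sets XPT_DATA whenever the read filled the
whole buffer, which on a stream socket is the hint that more data may
already be queued.

        clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
        len = kernel_recvmsg(svsk->sk_sock, &msg, iov, nr, buflen,
                                msg.msg_flags);
        /* a full read suggests there is more to come (stream sockets only) */
        if (len == buflen)
                set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);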

Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
---
 net/sunrpc/svcsock.c | 23 ++++++-----------------
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 72597d7fe60a..949bb2bb8f07 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -312,19 +312,6 @@ static int svc_one_sock_name(struct svc_sock *svsk, char *buf, int remaining)
 }
 
 /*
- * Check input queue length
- */
-static int svc_recv_available(struct svc_sock *svsk)
-{
-	struct socket	*sock = svsk->sk_sock;
-	int		avail, err;
-
-	err = kernel_sock_ioctl(sock, TIOCINQ, (unsigned long) &avail);
-
-	return (err >= 0)? avail : err;
-}
-
-/*
  * Generic recvfrom routine.
  */
 static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr,
@@ -339,8 +326,14 @@ static int svc_recvfrom(struct svc_rqst *rqstp, struct kvec *iov, int nr,
 
 	rqstp->rq_xprt_hlen = 0;
 
+	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	len = kernel_recvmsg(svsk->sk_sock, &msg, iov, nr, buflen,
 				msg.msg_flags);
+	/* If we read a full record, then assume there may be more
+	 * data to read (stream based sockets only!)
+	 */
+	if (len == buflen)
+		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 
 	dprintk("svc: socket %p recvfrom(%p, %Zu) = %d\n",
 		svsk, iov[0].iov_base, iov[0].iov_len, len);
@@ -979,8 +972,6 @@ static int svc_tcp_recv_record(struct svc_sock *svsk, struct svc_rqst *rqstp)
 	unsigned int want;
 	int len;
 
-	clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
-
 	if (svsk->sk_tcplen < sizeof(rpc_fraghdr)) {
 		struct kvec	iov;
 
@@ -1072,8 +1063,6 @@ static int copy_pages_to_kvecs(struct kvec *vec, struct page **pages, int len)
 static void svc_tcp_fragment_received(struct svc_sock *svsk)
 {
 	/* If we have more data, signal svc_xprt_enqueue() to try again */
-	if (svc_recv_available(svsk) > sizeof(rpc_fraghdr))
-		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 	dprintk("svc: TCP %s record (%d bytes)\n",
 		svc_sock_final_rec(svsk) ? "final" : "nonfinal",
 		svc_sock_reclen(svsk));
-- 
1.9.3



* Re: [PATCH 00/11] SUNRPC server scalability improvements
  2014-08-03 17:03 [PATCH 00/11] SUNRPC server scalability improvements Trond Myklebust
  2014-08-03 17:03 ` [PATCH 01/11] SUNRPC: Reduce contention in svc_xprt_enqueue() Trond Myklebust
@ 2014-08-04 13:36 ` Bruce Fields
  2014-08-05 20:42   ` Bruce Fields
  1 sibling, 1 reply; 24+ messages in thread
From: Bruce Fields @ 2014-08-04 13:36 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Sun, Aug 03, 2014 at 01:03:02PM -0400, Trond Myklebust wrote:
> When testing knfsd on 10GigE and 40GigE networks, we've been hitting a few
> scalability issues that are seriously affecting performance. The main issue
> was scalability of the pool->sp_lock, but we also hit a couple of things
> like the use of ioctls in the receive fast path, as well as some issues
> with heuristic tests being performed twice in the same path.
> 
> This is not urgent, and can definitely be delayed until the 3.18 merge
> window, however since the performance gains were significant over NFSv3,
> I thought I'd share now.

Thanks!  On a quick skim these and the DRC patches look good.

My first priority is to see if I can merge the last of the state locking
patches which are already in nfsd-next, so yes I'll probably queue these
up for 3.18 later.

--b.


* Re: [PATCH 00/11] SUNRPC server scalability improvements
  2014-08-04 13:36 ` [PATCH 00/11] SUNRPC server scalability improvements Bruce Fields
@ 2014-08-05 20:42   ` Bruce Fields
  2014-08-05 21:24     ` Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Bruce Fields @ 2014-08-05 20:42 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Mon, Aug 04, 2014 at 09:36:41AM -0400, Bruce Fields wrote:
> On Sun, Aug 03, 2014 at 01:03:02PM -0400, Trond Myklebust wrote:
> > When testing knfsd on 10GigE and 40GigE networks, we've been hitting a few
> > scalability issues that are seriously affecting performance. The main issue
> > was scalability of the pool->sp_lock, but we also hit a couple of things
> > like the use of ioctls in the receive fast path, as well as some issues
> > with heuristic tests being performed twice in the same path.
> > 
> > This is not urgent, and can definitely be delayed until the 3.18 merge
> > window, however since the performance gains were significant over NFSv3,
> > I thought I'd share now.
> 
> Thanks!  On a quick skim these and the DRC patches look good.
> 
> My first priority is to see if I can merge the last of the state locking
> patches which are already in nfsd-next, so yes I'll probably queue these
> up for 3.18 later.

Though note I already applied the first three--I'm assuming the versions
of those posted here are the same as the ones previously applied.  (See
git://linux-nfs.org/~bfields/linux.git for-3.17.)

--b.


* Re: [PATCH 00/11] SUNRPC server scalability improvements
  2014-08-05 20:42   ` Bruce Fields
@ 2014-08-05 21:24     ` Trond Myklebust
  0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2014-08-05 21:24 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List

On Tue, Aug 5, 2014 at 4:42 PM, Bruce Fields <bfields@fieldses.org> wrote:
> On Mon, Aug 04, 2014 at 09:36:41AM -0400, Bruce Fields wrote:
>> On Sun, Aug 03, 2014 at 01:03:02PM -0400, Trond Myklebust wrote:
>> > When testing knfsd on 10GigE and 40GigE networks, we've been hitting a few
>> > scalability issues that are seriously affecting performance. The main issue
>> > was scalability of the pool->sp_lock, but we also hit a couple of things
>> > like the use of ioctls in the receive fast path, as well as some issues
>> > with heuristic tests being performed twice in the same path.
>> >
>> > This is not urgent, and can definitely be delayed until the 3.18 merge
>> > window, however since the performance gains were significant over NFSv3,
>> > I thought I'd share now.
>>
>> Thanks!  On a quick skim these and the DRC patches look good.
>>
>> My first priority is to see if I can merge the last of the state locking
>> patches which are already in nfsd-next, so yes I'll probably queue these
>> up for 3.18 later.
>
> Though note I already applied the first three--I'm assuming the versions
> of those posted here are the same as the ones previously applied.  (See
> git://linux-nfs.org/~bfields/linux.git for-3.17.)
>

Yes. Those first 3 should be identical to the patches that I posted earlier.

-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com


* Re: [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt
  2014-08-03 17:03       ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Trond Myklebust
  2014-08-03 17:03         ` [PATCH 05/11] lockd: Ensure that lockd_start_svc sets the server rq_task Trond Myklebust
@ 2014-08-12 19:16         ` Bruce Fields
  2014-08-12 19:31           ` Trond Myklebust
  1 sibling, 1 reply; 24+ messages in thread
From: Bruce Fields @ 2014-08-12 19:16 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Sun, Aug 03, 2014 at 01:03:06PM -0400, Trond Myklebust wrote:
> We already determined that there was enough wspace when we
> called svc_xprt_enqueue.

So xpo_has_wspace may have returned true then, but I don't see what
guarantees it still would now.  Couldn't another server thread have also
run svc_recv() and the atomic_add(rqstp->rq_reserved,
&xprt->xpt_reserved) between the svc_xprt_enqueue call and now?

--b.

> 
> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> ---
>  net/sunrpc/svc_xprt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 32647b2a6a34..b731077b6a30 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -744,7 +744,7 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
>  			svc_add_new_temp_xprt(serv, newxpt);
>  		else
>  			module_put(xprt->xpt_class->xcl_owner);
> -	} else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> +	} else {
>  		/* XPT_DATA|XPT_DEFERRED case: */
>  		dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
>  			rqstp, rqstp->rq_pool->sp_id, xprt,
> -- 
> 1.9.3
> 


* Re: [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt
  2014-08-12 19:16         ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Bruce Fields
@ 2014-08-12 19:31           ` Trond Myklebust
  2014-08-12 20:04             ` Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-12 19:31 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 3:16 PM, Bruce Fields <bfields@fieldses.org> wrote:
> On Sun, Aug 03, 2014 at 01:03:06PM -0400, Trond Myklebust wrote:
>> We already determined that there was enough wspace when we
>> called svc_xprt_enqueue.
>
> So xpo_has_wspace may have returned true then, but I don't see what
> guarantees it still would now.  Couldn't another server thread have also
> run svc_recv() and the atomic_add(rqstp->rq_reserved,
> &xprt->xpt_reserved) between the svc_xprt_enqueue call and now?

The point is that all this is just a heuristic: the TCP send window
can collapse at any time. So rather than waste cycles doing multiple
redundant tests that ultimately mean nothing when you call sendmsg(),
do it once and be damned...


> --b.
>
>>
>> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>> ---
>>  net/sunrpc/svc_xprt.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index 32647b2a6a34..b731077b6a30 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -744,7 +744,7 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
>>                       svc_add_new_temp_xprt(serv, newxpt);
>>               else
>>                       module_put(xprt->xpt_class->xcl_owner);
>> -     } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
>> +     } else {
>>               /* XPT_DATA|XPT_DEFERRED case: */
>>               dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
>>                       rqstp, rqstp->rq_pool->sp_id, xprt,
>> --
>> 1.9.3
>>



-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com


* Re: [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-03 17:03               ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Trond Myklebust
  2014-08-03 17:03                 ` [PATCH 09/11] SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt Trond Myklebust
@ 2014-08-12 19:53                 ` Bruce Fields
  2014-08-12 20:09                   ` Trond Myklebust
  1 sibling, 1 reply; 24+ messages in thread
From: Bruce Fields @ 2014-08-12 19:53 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: linux-nfs

On Sun, Aug 03, 2014 at 01:03:10PM -0400, Trond Myklebust wrote:
> We're always _only_ waking up tasks from within the sp_threads list, so
> we know that they are enqueued and alive. The rq_wait waitqueue is just
> a distraction with extra atomic semantics.
> 
> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> ---
>  include/linux/sunrpc/svc.h |  1 -
>  net/sunrpc/svc.c           |  2 --
>  net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
>  3 files changed, 17 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 1bc7cd05b22e..3ec769b65c7f 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -280,7 +280,6 @@ struct svc_rqst {
>  	int			rq_splice_ok;   /* turned off in gss privacy
>  						 * to prevent encrypting page
>  						 * cache pages */
> -	wait_queue_head_t	rq_wait;	/* synchronization */
>  	struct task_struct	*rq_task;	/* service thread */
>  };
>  
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 5de6801cd924..dfb78c4f3031 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  	if (!rqstp)
>  		goto out_enomem;
>  
> -	init_waitqueue_head(&rqstp->rq_wait);
> -
>  	serv->sv_nrthreads++;
>  	spin_lock_bh(&pool->sp_lock);
>  	pool->sp_nrthreads++;
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 2c30193c7a13..438e91c12851 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>  
>  	cpu = get_cpu();
>  	pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
> -	put_cpu();
> -
>  	spin_lock_bh(&pool->sp_lock);
>  
>  	if (!list_empty(&pool->sp_threads) &&
> @@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>  			printk(KERN_ERR
>  				"svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
>  				rqstp, rqstp->rq_xprt);
> -		rqstp->rq_xprt = xprt;
> +		/* Note the order of the following 3 lines:
> +		 * We want to assign xprt to rqstp->rq_xprt only _after_
> +		 * we've woken up the process, so that we don't race with
> +		 * the lockless check in svc_get_next_xprt().

Sorry, I'm not following this: what exactly is the race?

--b.

> +		 */
>  		svc_xprt_get(xprt);
> +		wake_up_process(rqstp->rq_task);
> +		rqstp->rq_xprt = xprt;
>  		pool->sp_stats.threads_woken++;
> -		wake_up(&rqstp->rq_wait);
>  	} else {
>  		dprintk("svc: transport %p put into queue\n", xprt);
>  		list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> @@ -394,6 +397,7 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>  
>  out_unlock:
>  	spin_unlock_bh(&pool->sp_lock);
> +	put_cpu();
>  }
>  
>  /*
> @@ -509,7 +513,7 @@ void svc_wake_up(struct svc_serv *serv)
>  			svc_thread_dequeue(pool, rqstp);
>  			rqstp->rq_xprt = NULL;
>  			 */
> -			wake_up(&rqstp->rq_wait);
> +			wake_up_process(rqstp->rq_task);
>  		} else
>  			pool->sp_task_pending = 1;
>  		spin_unlock_bh(&pool->sp_lock);
> @@ -628,7 +632,6 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_xprt *xprt;
>  	struct svc_pool		*pool = rqstp->rq_pool;
> -	DECLARE_WAITQUEUE(wait, current);
>  	long			time_left;
>  
>  	/* Normally we will wait up to 5 seconds for any required
> @@ -654,15 +657,15 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  			xprt = ERR_PTR(-EAGAIN);
>  			goto out;
>  		}
> -		/* No data pending. Go to sleep */
> -		svc_thread_enqueue(pool, rqstp);
> -
>  		/*
>  		 * We have to be able to interrupt this wait
>  		 * to bring down the daemons ...
>  		 */
>  		set_current_state(TASK_INTERRUPTIBLE);
>  
> +		/* No data pending. Go to sleep */
> +		svc_thread_enqueue(pool, rqstp);
> +
>  		/*
>  		 * checking kthread_should_stop() here allows us to avoid
>  		 * locking and signalling when stopping kthreads that call
> @@ -676,14 +679,13 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  			goto out;
>  		}
>  
> -		add_wait_queue(&rqstp->rq_wait, &wait);
>  		spin_unlock_bh(&pool->sp_lock);
>  
>  		time_left = schedule_timeout(timeout);
> +		__set_current_state(TASK_RUNNING);
>  
>  		try_to_freeze();
>  
> -		remove_wait_queue(&rqstp->rq_wait, &wait);
>  		xprt = rqstp->rq_xprt;
>  		if (xprt != NULL)
>  			return xprt;
> @@ -786,10 +788,10 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>  		printk(KERN_ERR
>  			"svc_recv: service %p, transport not NULL!\n",
>  			 rqstp);
> -	if (waitqueue_active(&rqstp->rq_wait))
> -		printk(KERN_ERR
> -			"svc_recv: service %p, wait queue active!\n",
> -			 rqstp);
> +
> +	/* Make sure the task pointer is set! */
> +	if (WARN_ON_ONCE(!rqstp->rq_task))
> +		rqstp->rq_task = current_task;
>  
>  	err = svc_alloc_arg(rqstp);
>  	if (err)
> -- 
> 1.9.3
> 


* Re: [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt
  2014-08-12 19:31           ` Trond Myklebust
@ 2014-08-12 20:04             ` Bruce Fields
  0 siblings, 0 replies; 24+ messages in thread
From: Bruce Fields @ 2014-08-12 20:04 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 03:31:43PM -0400, Trond Myklebust wrote:
> On Tue, Aug 12, 2014 at 3:16 PM, Bruce Fields <bfields@fieldses.org> wrote:
> > On Sun, Aug 03, 2014 at 01:03:06PM -0400, Trond Myklebust wrote:
> >> We already determined that there was enough wspace when we
> >> called svc_xprt_enqueue.
> >
> > So xpo_has_wspace may have returned true then, but I don't see what
> > guarantees it still would now.  Couldn't another server thread have also
> > run svc_recv() and the atomic_add(rqstp->rq_reserved,
> > &xprt->xpt_reserved) between the svc_xprt_enqueue call and now?
> 
> The point is that all this is just a heuristic: the TCP send window
> can collapse at any time. So rather than waste cycles doing multiple
> redundant tests that ultimately mean nothing when you call sendmsg(),
> do it once and be damned...

OK, makes sense.

--b.


* Re: [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-12 19:53                 ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Bruce Fields
@ 2014-08-12 20:09                   ` Trond Myklebust
  2014-08-12 20:23                     ` Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-12 20:09 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 3:53 PM, Bruce Fields <bfields@fieldses.org> wrote:
> On Sun, Aug 03, 2014 at 01:03:10PM -0400, Trond Myklebust wrote:
>> We're always _only_ waking up tasks from within the sp_threads list, so
>> we know that they are enqueued and alive. The rq_wait waitqueue is just
>> a distraction with extra atomic semantics.
>>
>> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>> ---
>>  include/linux/sunrpc/svc.h |  1 -
>>  net/sunrpc/svc.c           |  2 --
>>  net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
>>  3 files changed, 17 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> index 1bc7cd05b22e..3ec769b65c7f 100644
>> --- a/include/linux/sunrpc/svc.h
>> +++ b/include/linux/sunrpc/svc.h
>> @@ -280,7 +280,6 @@ struct svc_rqst {
>>       int                     rq_splice_ok;   /* turned off in gss privacy
>>                                                * to prevent encrypting page
>>                                                * cache pages */
>> -     wait_queue_head_t       rq_wait;        /* synchronization */
>>       struct task_struct      *rq_task;       /* service thread */
>>  };
>>
>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> index 5de6801cd924..dfb78c4f3031 100644
>> --- a/net/sunrpc/svc.c
>> +++ b/net/sunrpc/svc.c
>> @@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>       if (!rqstp)
>>               goto out_enomem;
>>
>> -     init_waitqueue_head(&rqstp->rq_wait);
>> -
>>       serv->sv_nrthreads++;
>>       spin_lock_bh(&pool->sp_lock);
>>       pool->sp_nrthreads++;
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index 2c30193c7a13..438e91c12851 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>>
>>       cpu = get_cpu();
>>       pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
>> -     put_cpu();
>> -
>>       spin_lock_bh(&pool->sp_lock);
>>
>>       if (!list_empty(&pool->sp_threads) &&
>> @@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>>                       printk(KERN_ERR
>>                               "svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
>>                               rqstp, rqstp->rq_xprt);
>> -             rqstp->rq_xprt = xprt;
>> +             /* Note the order of the following 3 lines:
>> +              * We want to assign xprt to rqstp->rq_xprt only _after_
>> +              * we've woken up the process, so that we don't race with
>> +              * the lockless check in svc_get_next_xprt().
>
> Sorry, I'm not following this: what exactly is the race?

There are 2 potential races, both due to the lockless check for rqstp->rq_xprt.

1) You need to ensure that the reference count in xprt is bumped
before assigning to rqstp->rq_xprt, because svc_get_next_xprt() does a
lockless check for rqstp->rq_xprt != NULL, so there is no lock to
ensure that the refcount bump occurs before whoever called
svc_get_next_xprt() calls svc_xprt_put()...

2) You want to ensure that you don't call wake_up_process() after
exiting svc_get_next_xprt() since you would no longer be guaranteed
that the task still exists. By calling wake_up_process() before
setting rqstp->rq_xprt, you ensure that if the task wakes up before
calling wake_up_process(), then the test for rqstp->rq_xprt == NULL
forces you into the spin_lock_bh() protected region, where it is safe
to test rqstp->rq_xprt again and decide whether or not the task is
still queued on &pool->sp_threads.
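
Putting the two points side by side with the relevant lines from patch
08 (the code is unchanged from the patch; only the annotations are
added here):

        /* waker, under pool->sp_lock in svc_xprt_do_enqueue() */
        svc_xprt_get(xprt);             /* race 1: take the reference
                                         * before rq_xprt becomes visible */
        wake_up_process(rqstp->rq_task);
        rqstp->rq_xprt = xprt;          /* race 2: publish only after the
                                         * wakeup */

        /* waiter, lockless part of svc_get_next_xprt() */
        xprt = rqstp->rq_xprt;
        if (xprt != NULL)               /* reference is already held for us */
                return xprt;
        spin_lock_bh(&pool->sp_lock);   /* saw NULL: recheck under the lock */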


-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com


* Re: [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-12 20:09                   ` Trond Myklebust
@ 2014-08-12 20:23                     ` Bruce Fields
  2014-08-12 20:54                       ` Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Bruce Fields @ 2014-08-12 20:23 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 04:09:52PM -0400, Trond Myklebust wrote:
> On Tue, Aug 12, 2014 at 3:53 PM, Bruce Fields <bfields@fieldses.org> wrote:
> > On Sun, Aug 03, 2014 at 01:03:10PM -0400, Trond Myklebust wrote:
> >> We're always _only_ waking up tasks from within the sp_threads list, so
> >> we know that they are enqueued and alive. The rq_wait waitqueue is just
> >> a distraction with extra atomic semantics.
> >>
> >> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> >> ---
> >>  include/linux/sunrpc/svc.h |  1 -
> >>  net/sunrpc/svc.c           |  2 --
> >>  net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
> >>  3 files changed, 17 insertions(+), 18 deletions(-)
> >>
> >> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> >> index 1bc7cd05b22e..3ec769b65c7f 100644
> >> --- a/include/linux/sunrpc/svc.h
> >> +++ b/include/linux/sunrpc/svc.h
> >> @@ -280,7 +280,6 @@ struct svc_rqst {
> >>       int                     rq_splice_ok;   /* turned off in gss privacy
> >>                                                * to prevent encrypting page
> >>                                                * cache pages */
> >> -     wait_queue_head_t       rq_wait;        /* synchronization */
> >>       struct task_struct      *rq_task;       /* service thread */
> >>  };
> >>
> >> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> >> index 5de6801cd924..dfb78c4f3031 100644
> >> --- a/net/sunrpc/svc.c
> >> +++ b/net/sunrpc/svc.c
> >> @@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >>       if (!rqstp)
> >>               goto out_enomem;
> >>
> >> -     init_waitqueue_head(&rqstp->rq_wait);
> >> -
> >>       serv->sv_nrthreads++;
> >>       spin_lock_bh(&pool->sp_lock);
> >>       pool->sp_nrthreads++;
> >> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >> index 2c30193c7a13..438e91c12851 100644
> >> --- a/net/sunrpc/svc_xprt.c
> >> +++ b/net/sunrpc/svc_xprt.c
> >> @@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
> >>
> >>       cpu = get_cpu();
> >>       pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
> >> -     put_cpu();
> >> -
> >>       spin_lock_bh(&pool->sp_lock);
> >>
> >>       if (!list_empty(&pool->sp_threads) &&
> >> @@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
> >>                       printk(KERN_ERR
> >>                               "svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
> >>                               rqstp, rqstp->rq_xprt);
> >> -             rqstp->rq_xprt = xprt;
> >> +             /* Note the order of the following 3 lines:
> >> +              * We want to assign xprt to rqstp->rq_xprt only _after_
> >> +              * we've woken up the process, so that we don't race with
> >> +              * the lockless check in svc_get_next_xprt().
> >
> > Sorry, I'm not following this: what exactly is the race?
> 
> There are 2 potential races, both due to the lockless check for rqstp->rq_xprt.
> 
> 1) You need to ensure that the reference count in xprt is bumped
> before assigning to rqstp->rq_xprt, because svc_get_next_xprt() does a
> lockless check for rqstp->rq_xprt != NULL, so there is no lock to
> ensure that the refcount bump occurs before whoever called
> svc_get_next_xprt() calls svc_xprt_put()...
> 
> 2) You want to ensure that you don't call wake_up_process() after
> exiting svc_get_next_xprt() since you would no longer be guaranteed
> that the task still exists. By calling wake_up_process() before
> setting rqstp->rq_xprt, you ensure that if the task wakes up before
> calling wake_up_process(), then the test for rqstp->rq_xprt == NULL
> forces you into the spin_lock_bh() protected region, where it is safe
> to test rqstp->rq_xprt again and decide whether or not the task is
> still queued on &pool->sp_threads.

Got it.  So how about making that comment:

	/*
	 * Once we assign rq_xprt, a concurrent task in
	 * svc_get_next_xprt() could put the xprt, or could exit.
	 * Therefore, the get and the wake_up need to come first.
	 */

Is that close enough?

--b.


* Re: [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-12 20:23                     ` Bruce Fields
@ 2014-08-12 20:54                       ` Trond Myklebust
  2014-08-12 21:29                         ` Bruce Fields
  0 siblings, 1 reply; 24+ messages in thread
From: Trond Myklebust @ 2014-08-12 20:54 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 4:23 PM, Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Aug 12, 2014 at 04:09:52PM -0400, Trond Myklebust wrote:
>> On Tue, Aug 12, 2014 at 3:53 PM, Bruce Fields <bfields@fieldses.org> wrote:
>> > On Sun, Aug 03, 2014 at 01:03:10PM -0400, Trond Myklebust wrote:
>> >> We're always _only_ waking up tasks from within the sp_threads list, so
>> >> we know that they are enqueued and alive. The rq_wait waitqueue is just
>> >> a distraction with extra atomic semantics.
>> >>
>> >> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>> >> ---
>> >>  include/linux/sunrpc/svc.h |  1 -
>> >>  net/sunrpc/svc.c           |  2 --
>> >>  net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
>> >>  3 files changed, 17 insertions(+), 18 deletions(-)
>> >>
>> >> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> >> index 1bc7cd05b22e..3ec769b65c7f 100644
>> >> --- a/include/linux/sunrpc/svc.h
>> >> +++ b/include/linux/sunrpc/svc.h
>> >> @@ -280,7 +280,6 @@ struct svc_rqst {
>> >>       int                     rq_splice_ok;   /* turned off in gss privacy
>> >>                                                * to prevent encrypting page
>> >>                                                * cache pages */
>> >> -     wait_queue_head_t       rq_wait;        /* synchronization */
>> >>       struct task_struct      *rq_task;       /* service thread */
>> >>  };
>> >>
>> >> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> >> index 5de6801cd924..dfb78c4f3031 100644
>> >> --- a/net/sunrpc/svc.c
>> >> +++ b/net/sunrpc/svc.c
>> >> @@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>> >>       if (!rqstp)
>> >>               goto out_enomem;
>> >>
>> >> -     init_waitqueue_head(&rqstp->rq_wait);
>> >> -
>> >>       serv->sv_nrthreads++;
>> >>       spin_lock_bh(&pool->sp_lock);
>> >>       pool->sp_nrthreads++;
>> >> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> >> index 2c30193c7a13..438e91c12851 100644
>> >> --- a/net/sunrpc/svc_xprt.c
>> >> +++ b/net/sunrpc/svc_xprt.c
>> >> @@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>> >>
>> >>       cpu = get_cpu();
>> >>       pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
>> >> -     put_cpu();
>> >> -
>> >>       spin_lock_bh(&pool->sp_lock);
>> >>
>> >>       if (!list_empty(&pool->sp_threads) &&
>> >> @@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>> >>                       printk(KERN_ERR
>> >>                               "svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
>> >>                               rqstp, rqstp->rq_xprt);
>> >> -             rqstp->rq_xprt = xprt;
>> >> +             /* Note the order of the following 3 lines:
>> >> +              * We want to assign xprt to rqstp->rq_xprt only _after_
>> >> +              * we've woken up the process, so that we don't race with
>> >> +              * the lockless check in svc_get_next_xprt().
>> >
>> > Sorry, I'm not following this: what exactly is the race?
>>
>> There are 2 potential races, both due to the lockless check for rqstp->rq_xprt.
>>
>> 1) You need to ensure that the reference count in xprt is bumped
>> before assigning to rqstp->rq_xprt, because svc_get_next_xprt() does a
>> lockless check for rqstp->rq_xprt != NULL, so there is no lock to
>> ensure that the refcount bump occurs before whoever called
>> svc_get_next_xprt() calls svc_xprt_put()...
>>
>> 2) You want to ensure that you don't call wake_up_process() after
>> exiting svc_get_next_xprt() since you would no longer be guaranteed
>> that the task still exists. By calling wake_up_process() before
>> setting rqstp->rq_xprt, you ensure that if the task wakes up before
>> calling wake_up_process(), then the test for rqstp->rq_xprt == NULL
>> forces you into the spin_lock_bh() protected region, where it is safe
>> to test rqstp->rq_xprt again and decide whether or not the task is
>> still queued on &pool->sp_threads.
>
> Got it.  So how about making that comment:
>
>         /*
>          * Once we assign rq_xprt, a concurrent task in
>          * svc_get_next_xprt() could put the xprt, or could exit.
>          * Therefore, the get and the wake_up need to come first.
>          */
>
> Is that close enough?
>

It's close: it's not so much any concurrent task, it's specifically
the one that we're trying to wake up that may find itself waking up
prematurely due to the schedule_timeout(), and then racing due to the
lockless check.



-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 24+ messages in thread
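
The waiter side being described can be pictured roughly as follows. This is a
simplified sketch of the pattern in svc_get_next_xprt(), not the actual
function body (which is not quoted in this thread); names such as rq_list and
the exact locking and error handling are assumptions made for illustration:

	spin_lock_bh(&pool->sp_lock);
	list_add(&rqstp->rq_list, &pool->sp_threads);	/* queue this thread for the enqueue path to find */
	set_current_state(TASK_INTERRUPTIBLE);
	spin_unlock_bh(&pool->sp_lock);

	schedule_timeout(timeout);	/* may return before wake_up_process() is ever called */

	xprt = rqstp->rq_xprt;		/* lockless check */
	if (xprt != NULL)
		return xprt;		/* the enqueue side already did the get and the wake_up */

	spin_lock_bh(&pool->sp_lock);	/* slow path: re-check under the pool lock */
	xprt = rqstp->rq_xprt;
	if (xprt == NULL)
		list_del(&rqstp->rq_list);	/* nobody picked us; dequeue ourselves */
	spin_unlock_bh(&pool->sp_lock);

If the early-woken thread observes rq_xprt == NULL, it cannot tell whether an
enqueue is in flight, so it must retake sp_lock before deciding whether it is
still on sp_threads - which is exactly the protected region where the test can
be repeated safely, as described above.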

* Re: [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-12 20:54                       ` Trond Myklebust
@ 2014-08-12 21:29                         ` Bruce Fields
  2014-08-12 21:34                           ` Trond Myklebust
  0 siblings, 1 reply; 24+ messages in thread
From: Bruce Fields @ 2014-08-12 21:29 UTC (permalink / raw)
  To: Trond Myklebust; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 04:54:53PM -0400, Trond Myklebust wrote:
> On Tue, Aug 12, 2014 at 4:23 PM, Bruce Fields <bfields@fieldses.org> wrote:
> > On Tue, Aug 12, 2014 at 04:09:52PM -0400, Trond Myklebust wrote:
> >> On Tue, Aug 12, 2014 at 3:53 PM, Bruce Fields <bfields@fieldses.org> wrote:
> >> > On Sun, Aug 03, 2014 at 01:03:10PM -0400, Trond Myklebust wrote:
> >> >> We're always _only_ waking up tasks from within the sp_threads list, so
> >> >> we know that they are enqueued and alive. The rq_wait waitqueue is just
> >> >> a distraction with extra atomic semantics.
> >> >>
> >> >> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
> >> >> ---
> >> >>  include/linux/sunrpc/svc.h |  1 -
> >> >>  net/sunrpc/svc.c           |  2 --
> >> >>  net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
> >> >>  3 files changed, 17 insertions(+), 18 deletions(-)
> >> >>
> >> >> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> >> >> index 1bc7cd05b22e..3ec769b65c7f 100644
> >> >> --- a/include/linux/sunrpc/svc.h
> >> >> +++ b/include/linux/sunrpc/svc.h
> >> >> @@ -280,7 +280,6 @@ struct svc_rqst {
> >> >>       int                     rq_splice_ok;   /* turned off in gss privacy
> >> >>                                                * to prevent encrypting page
> >> >>                                                * cache pages */
> >> >> -     wait_queue_head_t       rq_wait;        /* synchronization */
> >> >>       struct task_struct      *rq_task;       /* service thread */
> >> >>  };
> >> >>
> >> >> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> >> >> index 5de6801cd924..dfb78c4f3031 100644
> >> >> --- a/net/sunrpc/svc.c
> >> >> +++ b/net/sunrpc/svc.c
> >> >> @@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >> >>       if (!rqstp)
> >> >>               goto out_enomem;
> >> >>
> >> >> -     init_waitqueue_head(&rqstp->rq_wait);
> >> >> -
> >> >>       serv->sv_nrthreads++;
> >> >>       spin_lock_bh(&pool->sp_lock);
> >> >>       pool->sp_nrthreads++;
> >> >> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >> >> index 2c30193c7a13..438e91c12851 100644
> >> >> --- a/net/sunrpc/svc_xprt.c
> >> >> +++ b/net/sunrpc/svc_xprt.c
> >> >> @@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
> >> >>
> >> >>       cpu = get_cpu();
> >> >>       pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
> >> >> -     put_cpu();
> >> >> -
> >> >>       spin_lock_bh(&pool->sp_lock);
> >> >>
> >> >>       if (!list_empty(&pool->sp_threads) &&
> >> >> @@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
> >> >>                       printk(KERN_ERR
> >> >>                               "svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
> >> >>                               rqstp, rqstp->rq_xprt);
> >> >> -             rqstp->rq_xprt = xprt;
> >> >> +             /* Note the order of the following 3 lines:
> >> >> +              * We want to assign xprt to rqstp->rq_xprt only _after_
> >> >> +              * we've woken up the process, so that we don't race with
> >> >> +              * the lockless check in svc_get_next_xprt().
> >> >
> >> > Sorry, I'm not following this: what exactly is the race?
> >>
> >> There are 2 potential races, both due to the lockless check for rqstp->rq_xprt.
> >>
> >> 1) You need to ensure that the reference count in xprt is bumped
> >> before assigning to rqstp->rq_xprt, because svc_get_next_xprt() does a
> >> lockless check for rqstp->rq_xprt != NULL, so there is no lock to
> >> ensure that the refcount bump occurs before whoever called
> >> svc_get_next_xprt() calls svc_xprt_put()...
> >>
> >> 2) You want to ensure that you don't call wake_up_process() after
> >> exiting svc_get_next_xprt() since you would no longer be guaranteed
> >> that the task still exists. By calling wake_up_process() before
> >> setting rqstp->rq_xprt, you ensure that if the task wakes up before
> >> calling wake_up_process(), then the test for rqstp->rq_xprt == NULL
> >> forces you into the spin_lock_bh() protected region, where it is safe
> >> to test rqstp->rq_xprt again and decide whether or not the task is
> >> still queued on &pool->sp_threads.
> >
> > Got it.  So how about making that comment:
> >
> >         /*
> >          * Once we assign rq_xprt, a concurrent task in
> >          * svc_get_next_xprt() could put the xprt, or could exit.
> >          * Therefore, the get and the wake_up need to come first.
> >          */
> >
> > Is that close enough?
> >
> 
> It's close: it's not so much any concurrent task, it's specifically
> the one that we're trying to wake up that may find itself waking up
> prematurely due to the schedule_timeout(), and then racing due to the
> lockless check.


Got it.  I'll make it:

	/*
	 * The task rq_task could be concurrently executing the lockless
	 * check of rq_xprt in svc_get_next_xprt().  Once we set
	 * rq_xprt, that task could return from svc_get_next_xprt(), and
	 * then could put the last reference to the xprt, or could exit.
	 * Therefore, our get and wake_up need to come first:
	 */

--b.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 08/11] SUNRPC: get rid of the request wait queue
  2014-08-12 21:29                         ` Bruce Fields
@ 2014-08-12 21:34                           ` Trond Myklebust
  0 siblings, 0 replies; 24+ messages in thread
From: Trond Myklebust @ 2014-08-12 21:34 UTC (permalink / raw)
  To: Bruce Fields; +Cc: Linux NFS Mailing List

On Tue, Aug 12, 2014 at 5:29 PM, Bruce Fields <bfields@fieldses.org> wrote:
> On Tue, Aug 12, 2014 at 04:54:53PM -0400, Trond Myklebust wrote:
>> On Tue, Aug 12, 2014 at 4:23 PM, Bruce Fields <bfields@fieldses.org> wrote:
>> > On Tue, Aug 12, 2014 at 04:09:52PM -0400, Trond Myklebust wrote:
>> >> On Tue, Aug 12, 2014 at 3:53 PM, Bruce Fields <bfields@fieldses.org> wrote:
>> >> > On Sun, Aug 03, 2014 at 01:03:10PM -0400, Trond Myklebust wrote:
>> >> >> We're always _only_ waking up tasks from within the sp_threads list, so
>> >> >> we know that they are enqueued and alive. The rq_wait waitqueue is just
>> >> >> a distraction with extra atomic semantics.
>> >> >>
>> >> >> Signed-off-by: Trond Myklebust <trond.myklebust@primarydata.com>
>> >> >> ---
>> >> >>  include/linux/sunrpc/svc.h |  1 -
>> >> >>  net/sunrpc/svc.c           |  2 --
>> >> >>  net/sunrpc/svc_xprt.c      | 32 +++++++++++++++++---------------
>> >> >>  3 files changed, 17 insertions(+), 18 deletions(-)
>> >> >>
>> >> >> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>> >> >> index 1bc7cd05b22e..3ec769b65c7f 100644
>> >> >> --- a/include/linux/sunrpc/svc.h
>> >> >> +++ b/include/linux/sunrpc/svc.h
>> >> >> @@ -280,7 +280,6 @@ struct svc_rqst {
>> >> >>       int                     rq_splice_ok;   /* turned off in gss privacy
>> >> >>                                                * to prevent encrypting page
>> >> >>                                                * cache pages */
>> >> >> -     wait_queue_head_t       rq_wait;        /* synchronization */
>> >> >>       struct task_struct      *rq_task;       /* service thread */
>> >> >>  };
>> >> >>
>> >> >> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>> >> >> index 5de6801cd924..dfb78c4f3031 100644
>> >> >> --- a/net/sunrpc/svc.c
>> >> >> +++ b/net/sunrpc/svc.c
>> >> >> @@ -612,8 +612,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>> >> >>       if (!rqstp)
>> >> >>               goto out_enomem;
>> >> >>
>> >> >> -     init_waitqueue_head(&rqstp->rq_wait);
>> >> >> -
>> >> >>       serv->sv_nrthreads++;
>> >> >>       spin_lock_bh(&pool->sp_lock);
>> >> >>       pool->sp_nrthreads++;
>> >> >> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> >> >> index 2c30193c7a13..438e91c12851 100644
>> >> >> --- a/net/sunrpc/svc_xprt.c
>> >> >> +++ b/net/sunrpc/svc_xprt.c
>> >> >> @@ -348,8 +348,6 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>> >> >>
>> >> >>       cpu = get_cpu();
>> >> >>       pool = svc_pool_for_cpu(xprt->xpt_server, cpu);
>> >> >> -     put_cpu();
>> >> >> -
>> >> >>       spin_lock_bh(&pool->sp_lock);
>> >> >>
>> >> >>       if (!list_empty(&pool->sp_threads) &&
>> >> >> @@ -382,10 +380,15 @@ static void svc_xprt_do_enqueue(struct svc_xprt *xprt)
>> >> >>                       printk(KERN_ERR
>> >> >>                               "svc_xprt_enqueue: server %p, rq_xprt=%p!\n",
>> >> >>                               rqstp, rqstp->rq_xprt);
>> >> >> -             rqstp->rq_xprt = xprt;
>> >> >> +             /* Note the order of the following 3 lines:
>> >> >> +              * We want to assign xprt to rqstp->rq_xprt only _after_
>> >> >> +              * we've woken up the process, so that we don't race with
>> >> >> +              * the lockless check in svc_get_next_xprt().
>> >> >
>> >> > Sorry, I'm not following this: what exactly is the race?
>> >>
>> >> There are 2 potential races, both due to the lockless check for rqstp->rq_xprt.
>> >>
>> >> 1) You need to ensure that the reference count in xprt is bumped
>> >> before assigning to rqstp->rq_xprt, because svc_get_next_xprt() does a
>> >> lockless check for rqstp->rq_xprt != NULL, so there is no lock to
>> >> ensure that the refcount bump occurs before whoever called
>> >> svc_get_next_xprt() calls svc_xprt_put()...
>> >>
>> >> 2) You want to ensure that you don't call wake_up_process() after
>> >> exiting svc_get_next_xprt() since you would no longer be guaranteed
>> >> that the task still exists. By calling wake_up_process() before
>> >> setting rqstp->rq_xprt, you ensure that if the task wakes up before
>> >> calling wake_up_process(), then the test for rqstp->rq_xprt == NULL
>> >> forces you into the spin_lock_bh() protected region, where it is safe
>> >> to test rqstp->rq_xprt again and decide whether or not the task is
>> >> still queued on &pool->sp_threads.
>> >
>> > Got it.  So how about making that comment:
>> >
>> >         /*
>> >          * Once we assign rq_xprt, a concurrent task in
>> >          * svc_get_next_xprt() could put the xprt, or could exit.
>> >          * Therefore, the get and the wake_up need to come first.
>> >          */
>> >
>> > Is that close enough?
>> >
>>
>> It's close: it's not so much any concurrent task, it's specifically
>> the one that we're trying to wake up that may find itself waking up
>> prematurely due to the schedule_timeout(), and then racing due to the
>> lockless check.
>
>
> Got it.  I'll make it:
>
>         /*
>          * The task rq_task could be concurrently executing the lockless
>          * check of rq_xprt in svc_get_next_xprt().  Once we set
>          * rq_xprt, that task could return from svc_get_next_xprt(), and
>          * then could put the last reference to the xprt, or could exit.
>          * Therefore, our get and wake_up need to come first:
>          */
>

Looks good.


-- 
Trond Myklebust

Linux NFS client maintainer, PrimaryData

trond.myklebust@primarydata.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-08-12 21:34 UTC | newest]

Thread overview: 24+ messages
2014-08-03 17:03 [PATCH 00/11] SUNRPC server scalability improvements Trond Myklebust
2014-08-03 17:03 ` [PATCH 01/11] SUNRPC: Reduce contention in svc_xprt_enqueue() Trond Myklebust
2014-08-03 17:03   ` [PATCH 02/11] SUNRPC: svc_tcp_write_space: don't clear SOCK_NOSPACE prematurely Trond Myklebust
2014-08-03 17:03     ` [PATCH 03/11] SUNRPC: Allow svc_reserve() to notify TCP socket that space has been freed Trond Myklebust
2014-08-03 17:03       ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Trond Myklebust
2014-08-03 17:03         ` [PATCH 05/11] lockd: Ensure that lockd_start_svc sets the server rq_task Trond Myklebust
2014-08-03 17:03           ` [PATCH 06/11] nfs: Ensure that nfs_callback_start_svc " Trond Myklebust
2014-08-03 17:03             ` [PATCH 07/11] SUNRPC: Do not grab pool->sp_lock unnecessarily in svc_get_next_xprt Trond Myklebust
2014-08-03 17:03               ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Trond Myklebust
2014-08-03 17:03                 ` [PATCH 09/11] SUNRPC: Fix broken kthread_should_stop test in svc_get_next_xprt Trond Myklebust
2014-08-03 17:03                   ` [PATCH 10/11] SUNRPC: More optimisations of svc_xprt_enqueue() Trond Myklebust
2014-08-03 17:03                     ` [PATCH 11/11] SUNRPC: Optimise away svc_recv_available Trond Myklebust
2014-08-12 19:53                 ` [PATCH 08/11] SUNRPC: get rid of the request wait queue Bruce Fields
2014-08-12 20:09                   ` Trond Myklebust
2014-08-12 20:23                     ` Bruce Fields
2014-08-12 20:54                       ` Trond Myklebust
2014-08-12 21:29                         ` Bruce Fields
2014-08-12 21:34                           ` Trond Myklebust
2014-08-12 19:16         ` [PATCH 04/11] SUNRPC: Do not override wspace tests in svc_handle_xprt Bruce Fields
2014-08-12 19:31           ` Trond Myklebust
2014-08-12 20:04             ` Bruce Fields
2014-08-04 13:36 ` [PATCH 00/11] SUNRPC server scalability improvements Bruce Fields
2014-08-05 20:42   ` Bruce Fields
2014-08-05 21:24     ` Trond Myklebust
