* [PATCH v2 0/9] SUNRPC service thread scheduler optimizations
@ 2023-07-04  0:07 Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 1/9] SUNRPC: Deduplicate thread wake-up code Chuck Lever
                   ` (9 more replies)
  0 siblings, 10 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:07 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

Walking a linked list to find an idle thread is not CPU cache-
friendly, and in fact I've noted palpable per-request latency
impacts as the number of nfsd threads on the server increases.

After discussing some possible improvements with Jeff at LSF/MM,
I've been experimenting with the following series. I've measured an
order of magnitude latency improvement in the thread lookup time,
and have managed to keep the whole thing lockless.

After some offline discussion with Neil, I tried out using just
the xarray plus its spinlock to mark threads idle or busy. This
worked as well as the bitmap for lower thread counts, but got
predictably bad as the thread count went past several hundred. For
the moment I'm sticking with the wait-free bitmap lookup.

Also, the maximum thread count is now 4096. I'm still willing to try
an RCU-based bitmap resizing mechanism if we believe this is still
too small a maximum.
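
To make the idea concrete, the wait-free lookup in patches 8/9 and 9/9
boils down to roughly the sketch below. This is a simplified
illustration, not the patch code: the helper name is made up, error
handling and memory barriers are omitted, and it assumes the
sp_idle_map and sp_thread_xa fields added later in the series.

    #include <linux/bitops.h>
    #include <linux/xarray.h>
    #include <linux/sunrpc/svc.h>

    /* Claim an idle thread without taking pool->sp_lock. Each bit in
     * sp_idle_map corresponds to one pool thread; a set bit means
     * "idle". test_and_clear_bit() is atomic, so only one caller can
     * successfully claim a given thread.
     */
    static struct svc_rqst *claim_idle_thread(struct svc_pool *pool)
    {
        unsigned long bit;

        for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
            if (!test_and_clear_bit(bit, pool->sp_idle_map))
                continue;  /* lost the race; try the next bit */
            return xa_load(&pool->sp_thread_xa, bit);
        }
        return NULL;  /* no idle thread; the pool is congested */
    }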


Changes since RFC:
* Add a counter for ingress RPC messages
* Add a few documenting comments
* Move the more controversial patches to the end of the series
* Clarify the refactoring of svc_wake_up() 
* Increase the value of RPCSVC_MAXPOOLTHREADS to 4096 (and tested with that many threads)
* Optimize the loop in svc_pool_wake_idle_thread()
* Optimize marking a thread "idle" in svc_get_next_xprt()

---

Chuck Lever (9):
      SUNRPC: Deduplicate thread wake-up code
      SUNRPC: Report when no service thread is available.
      SUNRPC: Split the svc_xprt_dequeue tracepoint
      SUNRPC: Count ingress RPC messages per svc_pool
      SUNRPC: Clean up svc_set_num_threads
      SUNRPC: Replace dprintk() call site in __svc_create()
      SUNRPC: Don't disable BH's when taking sp_lock
      SUNRPC: Replace sp_threads_all with an xarray
      SUNRPC: Convert RQ_BUSY into a per-pool bitmap


 fs/nfsd/nfssvc.c              |   3 +-
 include/linux/sunrpc/svc.h    |  18 ++--
 include/trace/events/sunrpc.h | 159 +++++++++++++++++++++++++++----
 net/sunrpc/svc.c              | 174 ++++++++++++++++++++++------------
 net/sunrpc/svc_xprt.c         | 114 +++++++++++-----------
 5 files changed, 328 insertions(+), 140 deletions(-)

--
Chuck Lever



* [PATCH v2 1/9] SUNRPC: Deduplicate thread wake-up code
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
@ 2023-07-04  0:07 ` Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 2/9] SUNRPC: Report when no service thread is available Chuck Lever
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:07 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

Refactor: Extract the loop that finds an idle service thread from
svc_xprt_enqueue() and svc_wake_up(). Both functions do just about
the same thing.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc.c           |   28 ++++++++++++++++++++++++++
 net/sunrpc/svc_xprt.c      |   48 +++++++++++++++-----------------------------
 3 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index f8751118c122..dc2d90a655e2 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -427,6 +427,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
 
 void		   svc_wake_up(struct svc_serv *);
 void		   svc_reserve(struct svc_rqst *rqstp, int space);
+struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_pool *pool);
 struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 const char *	   svc_proc_name(const struct svc_rqst *rqstp);
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 587811a002c9..e81ce5f76abd 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -689,6 +689,34 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	return rqstp;
 }
 
+/**
+ * svc_pool_wake_idle_thread - wake an idle thread in @pool
+ * @pool: service thread pool
+ *
+ * Returns an idle service thread (now marked BUSY), or NULL
+ * if no service threads are available. Finding an idle service
+ * thread and marking it BUSY is atomic with respect to other
+ * calls to svc_pool_wake_idle_thread().
+ */
+struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool)
+{
+	struct svc_rqst	*rqstp;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+			continue;
+
+		rcu_read_unlock();
+		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
+		wake_up_process(rqstp->rq_task);
+		percpu_counter_inc(&pool->sp_threads_woken);
+		return rqstp;
+	}
+	rcu_read_unlock();
+	return NULL;
+}
+
 /*
  * Choose a pool in which to create a new thread, for svc_set_num_threads
  */
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 62c7919ea610..89302bf09b77 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -455,8 +455,8 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
  */
 void svc_xprt_enqueue(struct svc_xprt *xprt)
 {
+	struct svc_rqst	*rqstp;
 	struct svc_pool *pool;
-	struct svc_rqst	*rqstp = NULL;
 
 	if (!svc_xprt_ready(xprt))
 		return;
@@ -476,20 +476,10 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
 	spin_unlock_bh(&pool->sp_lock);
 
-	/* find a thread for this xprt */
-	rcu_read_lock();
-	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
-		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
-			continue;
-		percpu_counter_inc(&pool->sp_threads_woken);
-		rqstp->rq_qtime = ktime_get();
-		wake_up_process(rqstp->rq_task);
-		goto out_unlock;
-	}
-	set_bit(SP_CONGESTED, &pool->sp_flags);
-	rqstp = NULL;
-out_unlock:
-	rcu_read_unlock();
+	rqstp = svc_pool_wake_idle_thread(pool);
+	if (!rqstp)
+		set_bit(SP_CONGESTED, &pool->sp_flags);
+
 	trace_svc_xprt_enqueue(xprt, rqstp);
 }
 EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
@@ -581,7 +571,10 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
 	svc_xprt_put(xprt);
 }
 
-/*
+/**
+ * svc_wake_up - Wake up a service thread for non-transport work
+ * @serv: RPC service
+ *
  * Some svc_serv's will have occasional work to do, even when a xprt is not
  * waiting to be serviced. This function is there to "kick" a task in one of
  * those services so that it can wake up and do that work. Note that we only
@@ -590,27 +583,18 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
  */
 void svc_wake_up(struct svc_serv *serv)
 {
+	struct svc_pool *pool = &serv->sv_pools[0];
 	struct svc_rqst	*rqstp;
-	struct svc_pool *pool;
 
-	pool = &serv->sv_pools[0];
-
-	rcu_read_lock();
-	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
-		/* skip any that aren't queued */
-		if (test_bit(RQ_BUSY, &rqstp->rq_flags))
-			continue;
-		rcu_read_unlock();
-		wake_up_process(rqstp->rq_task);
-		trace_svc_wake_up(rqstp->rq_task->pid);
+	rqstp = svc_pool_wake_idle_thread(pool);
+	if (!rqstp) {
+		set_bit(SP_TASK_PENDING, &pool->sp_flags);
+		smp_wmb();
+		trace_svc_wake_up(0);
 		return;
 	}
-	rcu_read_unlock();
 
-	/* No free entries available */
-	set_bit(SP_TASK_PENDING, &pool->sp_flags);
-	smp_wmb();
-	trace_svc_wake_up(0);
+	trace_svc_wake_up(rqstp->rq_task->pid);
 }
 EXPORT_SYMBOL_GPL(svc_wake_up);
 




* [PATCH v2 2/9] SUNRPC: Report when no service thread is available.
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 1/9] SUNRPC: Deduplicate thread wake-up code Chuck Lever
@ 2023-07-04  0:07 ` Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 3/9] SUNRPC: Split the svc_xprt_dequeue tracepoint Chuck Lever
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:07 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

Count and record thread pool starvation. Administrators can take
action by increasing thread count or decreasing workload.
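
As a purely hypothetical illustration (the numbers are invented), the
per-pool stats file (for nfsd, /proc/fs/nfsd/pool_stats) gains a sixth
column with this patch:

    # pool packets-arrived xprts-enqueued threads-woken threads-timedout starved
    0 212876 212876 198442 31 57

(With this patch the first two columns still report the same counter;
they diverge in patch 4/9.) A "starved" value that keeps climbing means
wake-ups are routinely finding no idle thread, which is the signal to
add threads or reduce the offered load.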

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h    |    5 +++-
 include/trace/events/sunrpc.h |   49 ++++++++++++++++++++++++++++++++++-------
 net/sunrpc/svc.c              |    9 +++++++-
 net/sunrpc/svc_xprt.c         |   22 ++++++++++--------
 4 files changed, 64 insertions(+), 21 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index dc2d90a655e2..fbfe6ea737c8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -22,7 +22,6 @@
 #include <linux/pagevec.h>
 
 /*
- *
  * RPC service thread pool.
  *
  * Pool of threads and temporary sockets.  Generally there is only
@@ -42,6 +41,7 @@ struct svc_pool {
 	struct percpu_counter	sp_sockets_queued;
 	struct percpu_counter	sp_threads_woken;
 	struct percpu_counter	sp_threads_timedout;
+	struct percpu_counter	sp_threads_starved;
 
 #define	SP_TASK_PENDING		(0)		/* still work to do even if no
 						 * xprt is queued. */
@@ -427,7 +427,8 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
 
 void		   svc_wake_up(struct svc_serv *);
 void		   svc_reserve(struct svc_rqst *rqstp, int space);
-struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_pool *pool);
+struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_serv *serv,
+					     struct svc_pool *pool);
 struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
 const char *	   svc_proc_name(const struct svc_rqst *rqstp);
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 43711753616a..9b70fc1c698a 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1994,21 +1994,21 @@ TRACE_EVENT(svc_xprt_create_err,
 TRACE_EVENT(svc_xprt_enqueue,
 	TP_PROTO(
 		const struct svc_xprt *xprt,
-		const struct svc_rqst *rqst
+		const struct svc_rqst *wakee
 	),
 
-	TP_ARGS(xprt, rqst),
+	TP_ARGS(xprt, wakee),
 
 	TP_STRUCT__entry(
 		SVC_XPRT_ENDPOINT_FIELDS(xprt)
 
-		__field(int, pid)
+		__field(pid_t, pid)
 	),
 
 	TP_fast_assign(
 		SVC_XPRT_ENDPOINT_ASSIGNMENTS(xprt);
 
-		__entry->pid = rqst? rqst->rq_task->pid : 0;
+		__entry->pid = wakee->rq_task->pid;
 	),
 
 	TP_printk(SVC_XPRT_ENDPOINT_FORMAT " pid=%d",
@@ -2039,6 +2039,39 @@ TRACE_EVENT(svc_xprt_dequeue,
 		SVC_XPRT_ENDPOINT_VARARGS, __entry->wakeup)
 );
 
+#define show_svc_pool_flags(x)						\
+	__print_flags(x, "|",						\
+		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
+		{ BIT(SP_CONGESTED),		"CONGESTED" })
+
+TRACE_EVENT(svc_pool_starved,
+	TP_PROTO(
+		const struct svc_serv *serv,
+		const struct svc_pool *pool
+	),
+
+	TP_ARGS(serv, pool),
+
+	TP_STRUCT__entry(
+		__string(name, serv->sv_name)
+		__field(int, pool_id)
+		__field(unsigned int, nrthreads)
+		__field(unsigned long, flags)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, serv->sv_name);
+		__entry->pool_id = pool->sp_id;
+		__entry->nrthreads = pool->sp_nrthreads;
+		__entry->flags = pool->sp_flags;
+	),
+
+	TP_printk("service=%s pool=%d flags=%s nrthreads=%u",
+		__get_str(name), __entry->pool_id,
+		show_svc_pool_flags(__entry->flags), __entry->nrthreads
+	)
+);
+
 DECLARE_EVENT_CLASS(svc_xprt_event,
 	TP_PROTO(
 		const struct svc_xprt *xprt
@@ -2109,16 +2142,16 @@ TRACE_EVENT(svc_xprt_accept,
 );
 
 TRACE_EVENT(svc_wake_up,
-	TP_PROTO(int pid),
+	TP_PROTO(const struct svc_rqst *wakee),
 
-	TP_ARGS(pid),
+	TP_ARGS(wakee),
 
 	TP_STRUCT__entry(
-		__field(int, pid)
+		__field(pid_t, pid)
 	),
 
 	TP_fast_assign(
-		__entry->pid = pid;
+		__entry->pid = wakee->rq_task->pid;
 	),
 
 	TP_printk("pid=%d", __entry->pid)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index e81ce5f76abd..04151e22ec44 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -516,6 +516,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
+		percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
 	}
 
 	return serv;
@@ -591,6 +592,7 @@ svc_destroy(struct kref *ref)
 		percpu_counter_destroy(&pool->sp_sockets_queued);
 		percpu_counter_destroy(&pool->sp_threads_woken);
 		percpu_counter_destroy(&pool->sp_threads_timedout);
+		percpu_counter_destroy(&pool->sp_threads_starved);
 	}
 	kfree(serv->sv_pools);
 	kfree(serv);
@@ -691,6 +693,7 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 
 /**
  * svc_pool_wake_idle_thread - wake an idle thread in @pool
+ * @serv: RPC service
  * @pool: service thread pool
  *
  * Returns an idle service thread (now marked BUSY), or NULL
@@ -698,7 +701,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
  * thread and marking it BUSY is atomic with respect to other
  * calls to svc_pool_wake_idle_thread().
  */
-struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool)
+struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
+					   struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
 
@@ -714,6 +718,9 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_pool *pool)
 		return rqstp;
 	}
 	rcu_read_unlock();
+
+	trace_svc_pool_starved(serv, pool);
+	percpu_counter_inc(&pool->sp_threads_starved);
 	return NULL;
 }
 
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 89302bf09b77..a1ed6fb69793 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -455,7 +455,7 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
  */
 void svc_xprt_enqueue(struct svc_xprt *xprt)
 {
-	struct svc_rqst	*rqstp;
+	struct svc_rqst *rqstp;
 	struct svc_pool *pool;
 
 	if (!svc_xprt_ready(xprt))
@@ -476,9 +476,11 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
 	spin_unlock_bh(&pool->sp_lock);
 
-	rqstp = svc_pool_wake_idle_thread(pool);
-	if (!rqstp)
+	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
+	if (!rqstp) {
 		set_bit(SP_CONGESTED, &pool->sp_flags);
+		return;
+	}
 
 	trace_svc_xprt_enqueue(xprt, rqstp);
 }
@@ -584,17 +586,16 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
 void svc_wake_up(struct svc_serv *serv)
 {
 	struct svc_pool *pool = &serv->sv_pools[0];
-	struct svc_rqst	*rqstp;
+	struct svc_rqst *rqstp;
 
-	rqstp = svc_pool_wake_idle_thread(pool);
+	rqstp = svc_pool_wake_idle_thread(serv, pool);
 	if (!rqstp) {
 		set_bit(SP_TASK_PENDING, &pool->sp_flags);
 		smp_wmb();
-		trace_svc_wake_up(0);
 		return;
 	}
 
-	trace_svc_wake_up(rqstp->rq_task->pid);
+	trace_svc_wake_up(rqstp);
 }
 EXPORT_SYMBOL_GPL(svc_wake_up);
 
@@ -1436,16 +1437,17 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)
 	struct svc_pool *pool = p;
 
 	if (p == SEQ_START_TOKEN) {
-		seq_puts(m, "# pool packets-arrived sockets-enqueued threads-woken threads-timedout\n");
+		seq_puts(m, "# pool packets-arrived xprts-enqueued threads-woken threads-timedout starved\n");
 		return 0;
 	}
 
-	seq_printf(m, "%u %llu %llu %llu %llu\n",
+	seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
 		pool->sp_id,
 		percpu_counter_sum_positive(&pool->sp_sockets_queued),
 		percpu_counter_sum_positive(&pool->sp_sockets_queued),
 		percpu_counter_sum_positive(&pool->sp_threads_woken),
-		percpu_counter_sum_positive(&pool->sp_threads_timedout));
+		percpu_counter_sum_positive(&pool->sp_threads_timedout),
+		percpu_counter_sum_positive(&pool->sp_threads_starved));
 
 	return 0;
 }




* [PATCH v2 3/9] SUNRPC: Split the svc_xprt_dequeue tracepoint
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 1/9] SUNRPC: Deduplicate thread wake-up code Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 2/9] SUNRPC: Report when no service thread is available Chuck Lever
@ 2023-07-04  0:07 ` Chuck Lever
  2023-07-04  0:07 ` [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool Chuck Lever
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:07 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

Distinguish between the case where new work was picked up just by
looking at the transport queue and the case where the thread had to
be awoken. This gives us better visibility into how well-utilized the
thread pool is.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/sunrpc.h |   47 +++++++++++++++++++++++++++++++----------
 net/sunrpc/svc_xprt.c         |    9 +++++---
 2 files changed, 41 insertions(+), 15 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 9b70fc1c698a..2e83887b58cd 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -2015,34 +2015,57 @@ TRACE_EVENT(svc_xprt_enqueue,
 		SVC_XPRT_ENDPOINT_VARARGS, __entry->pid)
 );
 
-TRACE_EVENT(svc_xprt_dequeue,
+#define show_svc_pool_flags(x)						\
+	__print_flags(x, "|",						\
+		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
+		{ BIT(SP_CONGESTED),		"CONGESTED" })
+
+DECLARE_EVENT_CLASS(svc_pool_scheduler_class,
 	TP_PROTO(
-		const struct svc_rqst *rqst
+		const struct svc_rqst *rqstp
 	),
 
-	TP_ARGS(rqst),
+	TP_ARGS(rqstp),
 
 	TP_STRUCT__entry(
-		SVC_XPRT_ENDPOINT_FIELDS(rqst->rq_xprt)
+		SVC_XPRT_ENDPOINT_FIELDS(rqstp->rq_xprt)
 
+		__string(name, rqstp->rq_server->sv_name)
+		__field(int, pool_id)
+		__field(unsigned int, nrthreads)
+		__field(unsigned long, pool_flags)
 		__field(unsigned long, wakeup)
 	),
 
 	TP_fast_assign(
-		SVC_XPRT_ENDPOINT_ASSIGNMENTS(rqst->rq_xprt);
+		struct svc_pool *pool = rqstp->rq_pool;
+		SVC_XPRT_ENDPOINT_ASSIGNMENTS(rqstp->rq_xprt);
 
+		__assign_str(name, rqstp->rq_server->sv_name);
+		__entry->pool_id = pool->sp_id;
+		__entry->nrthreads = pool->sp_nrthreads;
+		__entry->pool_flags = pool->sp_flags;
 		__entry->wakeup = ktime_to_us(ktime_sub(ktime_get(),
-							rqst->rq_qtime));
+							rqstp->rq_qtime));
 	),
 
-	TP_printk(SVC_XPRT_ENDPOINT_FORMAT " wakeup-us=%lu",
-		SVC_XPRT_ENDPOINT_VARARGS, __entry->wakeup)
+	TP_printk(SVC_XPRT_ENDPOINT_FORMAT
+		" service=%s pool=%d pool_flags=%s nrthreads=%u wakeup-us=%lu",
+		SVC_XPRT_ENDPOINT_VARARGS, __get_str(name), __entry->pool_id,
+		show_svc_pool_flags(__entry->pool_flags), __entry->nrthreads,
+		__entry->wakeup
+	)
 );
 
-#define show_svc_pool_flags(x)						\
-	__print_flags(x, "|",						\
-		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
-		{ BIT(SP_CONGESTED),		"CONGESTED" })
+#define DEFINE_SVC_POOL_SCHEDULER_EVENT(name) \
+	DEFINE_EVENT(svc_pool_scheduler_class, svc_pool_##name, \
+			TP_PROTO( \
+				const struct svc_rqst *rqstp \
+			), \
+			TP_ARGS(rqstp))
+
+DEFINE_SVC_POOL_SCHEDULER_EVENT(polled);
+DEFINE_SVC_POOL_SCHEDULER_EVENT(awoken);
 
 TRACE_EVENT(svc_pool_starved,
 	TP_PROTO(
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index a1ed6fb69793..7ee095d03996 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -744,8 +744,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	WARN_ON_ONCE(rqstp->rq_xprt);
 
 	rqstp->rq_xprt = svc_xprt_dequeue(pool);
-	if (rqstp->rq_xprt)
+	if (rqstp->rq_xprt) {
+		trace_svc_pool_polled(rqstp);
 		goto out_found;
+	}
 
 	/*
 	 * We have to be able to interrupt this wait
@@ -767,8 +769,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	set_bit(RQ_BUSY, &rqstp->rq_flags);
 	smp_mb__after_atomic();
 	rqstp->rq_xprt = svc_xprt_dequeue(pool);
-	if (rqstp->rq_xprt)
+	if (rqstp->rq_xprt) {
+		trace_svc_pool_awoken(rqstp);
 		goto out_found;
+	}
 
 	if (!time_left)
 		percpu_counter_inc(&pool->sp_threads_timedout);
@@ -784,7 +788,6 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 		rqstp->rq_chandle.thread_wait = 5*HZ;
 	else
 		rqstp->rq_chandle.thread_wait = 1*HZ;
-	trace_svc_xprt_dequeue(rqstp);
 	return rqstp->rq_xprt;
 }
 




* [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (2 preceding siblings ...)
  2023-07-04  0:07 ` [PATCH v2 3/9] SUNRPC: Split the svc_xprt_dequeue tracepoint Chuck Lever
@ 2023-07-04  0:07 ` Chuck Lever
  2023-07-04  0:45   ` NeilBrown
  2023-07-04  0:08 ` [PATCH v2 5/9] SUNRPC: Clean up svc_set_num_threads Chuck Lever
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:07 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

To get a sense of the average number of transport enqueue operations
needed to process an incoming RPC message, re-use the "packets" pool
stat. Track the number of complete RPC messages processed by each
thread pool.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h |    1 +
 net/sunrpc/svc.c           |    2 ++
 net/sunrpc/svc_xprt.c      |    3 ++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index fbfe6ea737c8..74ea13270679 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -38,6 +38,7 @@ struct svc_pool {
 	struct list_head	sp_all_threads;	/* all server threads */
 
 	/* statistics on pool operation */
+	struct percpu_counter	sp_messages_arrived;
 	struct percpu_counter	sp_sockets_queued;
 	struct percpu_counter	sp_threads_woken;
 	struct percpu_counter	sp_threads_timedout;
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 04151e22ec44..ccc7ff3142dd 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -513,6 +513,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		INIT_LIST_HEAD(&pool->sp_all_threads);
 		spin_lock_init(&pool->sp_lock);
 
+		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
@@ -589,6 +590,7 @@ svc_destroy(struct kref *ref)
 	for (i = 0; i < serv->sv_nrpools; i++) {
 		struct svc_pool *pool = &serv->sv_pools[i];
 
+		percpu_counter_destroy(&pool->sp_messages_arrived);
 		percpu_counter_destroy(&pool->sp_sockets_queued);
 		percpu_counter_destroy(&pool->sp_threads_woken);
 		percpu_counter_destroy(&pool->sp_threads_timedout);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7ee095d03996..ecbccf0d89b9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -897,6 +897,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 
 	if (serv->sv_stats)
 		serv->sv_stats->netcnt++;
+	percpu_counter_inc(&rqstp->rq_pool->sp_messages_arrived);
 	rqstp->rq_stime = ktime_get();
 	return len;
 out_release:
@@ -1446,7 +1447,7 @@ static int svc_pool_stats_show(struct seq_file *m, void *p)
 
 	seq_printf(m, "%u %llu %llu %llu %llu %llu\n",
 		pool->sp_id,
-		percpu_counter_sum_positive(&pool->sp_sockets_queued),
+		percpu_counter_sum_positive(&pool->sp_messages_arrived),
 		percpu_counter_sum_positive(&pool->sp_sockets_queued),
 		percpu_counter_sum_positive(&pool->sp_threads_woken),
 		percpu_counter_sum_positive(&pool->sp_threads_timedout),




* [PATCH v2 5/9] SUNRPC: Clean up svc_set_num_threads
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (3 preceding siblings ...)
  2023-07-04  0:07 ` [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool Chuck Lever
@ 2023-07-04  0:08 ` Chuck Lever
  2023-07-04  0:08 ` [PATCH v2 6/9] SUNRPC: Replace dprintk() call site in __svc_create() Chuck Lever
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:08 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

Document the API contract and remove stale or obvious comments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svc.c |   60 +++++++++++++++++++++++-------------------------------
 1 file changed, 25 insertions(+), 35 deletions(-)

diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ccc7ff3142dd..42d38cd99f29 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -726,23 +726,14 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
 	return NULL;
 }
 
-/*
- * Choose a pool in which to create a new thread, for svc_set_num_threads
- */
-static inline struct svc_pool *
-choose_pool(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+static struct svc_pool *
+svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 {
-	if (pool != NULL)
-		return pool;
-
-	return &serv->sv_pools[(*state)++ % serv->sv_nrpools];
+	return pool ? pool : &serv->sv_pools[(*state)++ % serv->sv_nrpools];
 }
 
-/*
- * Choose a thread to kill, for svc_set_num_threads
- */
-static inline struct task_struct *
-choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+static struct task_struct *
+svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 {
 	unsigned int i;
 	struct task_struct *task = NULL;
@@ -750,7 +741,6 @@ choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 	if (pool != NULL) {
 		spin_lock_bh(&pool->sp_lock);
 	} else {
-		/* choose a pool in round-robin fashion */
 		for (i = 0; i < serv->sv_nrpools; i++) {
 			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
 			spin_lock_bh(&pool->sp_lock);
@@ -765,21 +755,15 @@ choose_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 	if (!list_empty(&pool->sp_all_threads)) {
 		struct svc_rqst *rqstp;
 
-		/*
-		 * Remove from the pool->sp_all_threads list
-		 * so we don't try to kill it again.
-		 */
 		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
 		set_bit(RQ_VICTIM, &rqstp->rq_flags);
 		list_del_rcu(&rqstp->rq_all);
 		task = rqstp->rq_task;
 	}
 	spin_unlock_bh(&pool->sp_lock);
-
 	return task;
 }
 
-/* create new threads */
 static int
 svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 {
@@ -791,13 +775,12 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 
 	do {
 		nrservs--;
-		chosen_pool = choose_pool(serv, pool, &state);
-
+		chosen_pool = svc_pool_next(serv, pool, &state);
 		node = svc_pool_map_get_node(chosen_pool->sp_id);
+
 		rqstp = svc_prepare_thread(serv, chosen_pool, node);
 		if (IS_ERR(rqstp))
 			return PTR_ERR(rqstp);
-
 		task = kthread_create_on_node(serv->sv_threadfn, rqstp,
 					      node, "%s", serv->sv_name);
 		if (IS_ERR(task)) {
@@ -816,15 +799,6 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 	return 0;
 }
 
-/*
- * Create or destroy enough new threads to make the number
- * of threads the given number.  If `pool' is non-NULL, applies
- * only to threads in that pool, otherwise round-robins between
- * all pools.  Caller must ensure that mutual exclusion between this and
- * server startup or shutdown.
- */
-
-/* destroy old threads */
 static int
 svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 {
@@ -832,9 +806,8 @@ svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 	struct task_struct *task;
 	unsigned int state = serv->sv_nrthreads-1;
 
-	/* destroy old threads */
 	do {
-		task = choose_victim(serv, pool, &state);
+		task = svc_pool_victim(serv, pool, &state);
 		if (task == NULL)
 			break;
 		rqstp = kthread_data(task);
@@ -846,6 +819,23 @@ svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 	return 0;
 }
 
+/**
+ * svc_set_num_threads - adjust number of threads per RPC service
+ * @serv: RPC service to adjust
+ * @pool: Specific pool from which to choose threads, or NULL
+ * @nrservs: New number of threads for @serv (0 or less means kill all threads)
+ *
+ * Create or destroy threads to make the number of threads for @serv the
+ * given number. If @pool is non-NULL, change only threads in that pool;
+ * otherwise, round-robin between all pools for @serv. @serv's
+ * sv_nrthreads is adjusted for each thread created or destroyed.
+ *
+ * Caller must ensure mutual exclusion between this and server startup or
+ * shutdown.
+ *
+ * Returns zero on success or a negative errno if an error occurred while
+ * starting a thread.
+ */
 int
 svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 {




* [PATCH v2 6/9] SUNRPC: Replace dprintk() call site in __svc_create()
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (4 preceding siblings ...)
  2023-07-04  0:08 ` [PATCH v2 5/9] SUNRPC: Clean up svc_set_num_threads Chuck Lever
@ 2023-07-04  0:08 ` Chuck Lever
  2023-07-04  0:08 ` [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock Chuck Lever
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:08 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

Done as part of converting SunRPC observability from printk-style to
tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/trace/events/sunrpc.h |   23 +++++++++++++++++++++++
 net/sunrpc/svc.c              |    5 ++---
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 2e83887b58cd..60c8e03268d4 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1918,6 +1918,29 @@ TRACE_EVENT(svc_stats_latency,
 		__get_str(procedure), __entry->execute)
 );
 
+TRACE_EVENT(svc_pool_init,
+	TP_PROTO(
+		const struct svc_serv *serv,
+		const struct svc_pool *pool
+	),
+
+	TP_ARGS(serv, pool),
+
+	TP_STRUCT__entry(
+		__string(name, serv->sv_name)
+		__field(int, pool_id)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, serv->sv_name);
+		__entry->pool_id = pool->sp_id;
+	),
+
+	TP_printk("service=%s pool=%d",
+		__get_str(name), __entry->pool_id
+	)
+);
+
 #define show_svc_xprt_flags(flags)					\
 	__print_flags(flags, "|",					\
 		{ BIT(XPT_BUSY),		"BUSY" },		\
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 42d38cd99f29..9b6701a38e71 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -505,9 +505,6 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 	for (i = 0; i < serv->sv_nrpools; i++) {
 		struct svc_pool *pool = &serv->sv_pools[i];
 
-		dprintk("svc: initialising pool %u for %s\n",
-				i, serv->sv_name);
-
 		pool->sp_id = i;
 		INIT_LIST_HEAD(&pool->sp_sockets);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
@@ -518,6 +515,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		percpu_counter_init(&pool->sp_threads_woken, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_threads_timedout, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_threads_starved, 0, GFP_KERNEL);
+
+		trace_svc_pool_init(serv, pool);
 	}
 
 	return serv;




* [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (5 preceding siblings ...)
  2023-07-04  0:08 ` [PATCH v2 6/9] SUNRPC: Replace dprintk() call site in __svc_create() Chuck Lever
@ 2023-07-04  0:08 ` Chuck Lever
  2023-07-04  0:56   ` NeilBrown
  2023-07-04  0:08 ` [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray Chuck Lever
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:08 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

Consumers of sp_lock now all run in process context. We can avoid
the overhead of disabling bottom halves when taking this lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svc_xprt.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index ecbccf0d89b9..8ced7591ce07 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -472,9 +472,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 	pool = svc_pool_for_cpu(xprt->xpt_server);
 
 	percpu_counter_inc(&pool->sp_sockets_queued);
-	spin_lock_bh(&pool->sp_lock);
+	spin_lock(&pool->sp_lock);
 	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
-	spin_unlock_bh(&pool->sp_lock);
+	spin_unlock(&pool->sp_lock);
 
 	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
 	if (!rqstp) {
@@ -496,14 +496,14 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
 	if (list_empty(&pool->sp_sockets))
 		goto out;
 
-	spin_lock_bh(&pool->sp_lock);
+	spin_lock(&pool->sp_lock);
 	if (likely(!list_empty(&pool->sp_sockets))) {
 		xprt = list_first_entry(&pool->sp_sockets,
 					struct svc_xprt, xpt_ready);
 		list_del_init(&xprt->xpt_ready);
 		svc_xprt_get(xprt);
 	}
-	spin_unlock_bh(&pool->sp_lock);
+	spin_unlock(&pool->sp_lock);
 out:
 	return xprt;
 }
@@ -1116,15 +1116,15 @@ static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
 	for (i = 0; i < serv->sv_nrpools; i++) {
 		pool = &serv->sv_pools[i];
 
-		spin_lock_bh(&pool->sp_lock);
+		spin_lock(&pool->sp_lock);
 		list_for_each_entry_safe(xprt, tmp, &pool->sp_sockets, xpt_ready) {
 			if (xprt->xpt_net != net)
 				continue;
 			list_del_init(&xprt->xpt_ready);
-			spin_unlock_bh(&pool->sp_lock);
+			spin_unlock(&pool->sp_lock);
 			return xprt;
 		}
-		spin_unlock_bh(&pool->sp_lock);
+		spin_unlock(&pool->sp_lock);
 	}
 	return NULL;
 }




* [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (6 preceding siblings ...)
  2023-07-04  0:08 ` [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock Chuck Lever
@ 2023-07-04  0:08 ` Chuck Lever
  2023-07-04  1:11   ` NeilBrown
  2023-07-04  0:08 ` [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap Chuck Lever
  2023-07-05  1:08 ` [PATCH v2 0/9] SUNRPC service thread scheduler optimizations NeilBrown
  9 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:08 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

We want a thread lookup operation that can be done with RCU only,
but also we want to avoid the linked-list walk, which does not scale
well in the number of pool threads.

BH-disabled locking is no longer necessary because we're no longer
sharing the pool's sp_lock to protect either the xarray or the
pool's thread count. sp_lock also protects transport activity. There
are no callers of svc_set_num_threads() that run outside of process
context.
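
For reference, the core xarray pattern the patch relies on is sketched
below. This is a standalone illustration, not the patch code itself:
the patch uses __xa_alloc() under xa_lock() so that sp_nrthreads can be
updated atomically with the insertion, and the helper names here are
made up.

    #include <linux/xarray.h>
    #include <linux/sunrpc/svc.h>

    /* XA_FLAGS_ALLOC lets the xarray hand out thread IDs itself. */
    static DEFINE_XARRAY_ALLOC(thread_xa);

    static int sketch_add_thread(struct svc_rqst *rqstp)
    {
        /* Store rqstp at the lowest free index in [0, U32_MAX] and
         * record that index as the thread's ID.
         */
        return xa_alloc(&thread_xa, &rqstp->rq_thread_id, rqstp,
                        XA_LIMIT(0, U32_MAX), GFP_KERNEL);
    }

    static struct svc_rqst *sketch_find_thread(u32 id)
    {
        /* The lookup is RCU-safe: no spinlock, no BH disabling. */
        return xa_load(&thread_xa, id);
    }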

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/nfssvc.c              |    3 +-
 include/linux/sunrpc/svc.h    |   11 +++----
 include/trace/events/sunrpc.h |   47 ++++++++++++++++++++++++++++-
 net/sunrpc/svc.c              |   67 ++++++++++++++++++++++++-----------------
 net/sunrpc/svc_xprt.c         |    2 +
 5 files changed, 93 insertions(+), 37 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 2154fa63c5f2..d42b2a40c93c 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
  * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
  * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
  * nn->keep_active is set).  That number of nfsd threads must
- * exist and each must be listed in ->sp_all_threads in some entry of
- * ->sv_pools[].
+ * exist and each must be listed in some entry of ->sv_pools[].
  *
  * Each active thread holds a counted reference on nn->nfsd_serv, as does
  * the nn->keep_active flag and various transient calls to svc_get().
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 74ea13270679..6f8bfcd44250 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -32,10 +32,10 @@
  */
 struct svc_pool {
 	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
-	spinlock_t		sp_lock;	/* protects all fields */
+	spinlock_t		sp_lock;	/* protects sp_sockets */
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
-	struct list_head	sp_all_threads;	/* all server threads */
+	struct xarray		sp_thread_xa;
 
 	/* statistics on pool operation */
 	struct percpu_counter	sp_messages_arrived;
@@ -195,7 +195,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  * processed.
  */
 struct svc_rqst {
-	struct list_head	rq_all;		/* all threads list */
 	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
 	struct svc_xprt *	rq_xprt;	/* transport ptr */
 
@@ -240,10 +239,10 @@ struct svc_rqst {
 #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
 						 * to prevent encrypting page
 						 * cache pages */
-#define	RQ_VICTIM	(5)			/* about to be shut down */
-#define	RQ_BUSY		(6)			/* request is busy */
-#define	RQ_DATA		(7)			/* request has data */
+#define	RQ_BUSY		(5)			/* request is busy */
+#define	RQ_DATA		(6)			/* request has data */
 	unsigned long		rq_flags;	/* flags field */
+	u32			rq_thread_id;	/* xarray index */
 	ktime_t			rq_qtime;	/* enqueue time */
 
 	void *			rq_argp;	/* decoded arguments */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index 60c8e03268d4..ea43c6059bdb 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
 	svc_rqst_flag(USEDEFERRAL)					\
 	svc_rqst_flag(DROPME)						\
 	svc_rqst_flag(SPLICE_OK)					\
-	svc_rqst_flag(VICTIM)						\
 	svc_rqst_flag(BUSY)						\
 	svc_rqst_flag_end(DATA)
 
@@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
 	)
 );
 
+DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
+	TP_PROTO(
+		const struct svc_serv *serv,
+		const struct svc_pool *pool,
+		const struct svc_rqst *rqstp
+	),
+
+	TP_ARGS(serv, pool, rqstp),
+
+	TP_STRUCT__entry(
+		__string(name, serv->sv_name)
+		__field(int, pool_id)
+		__field(unsigned int, nrthreads)
+		__field(unsigned long, pool_flags)
+		__field(u32, thread_id)
+		__field(const void *, rqstp)
+	),
+
+	TP_fast_assign(
+		__assign_str(name, serv->sv_name);
+		__entry->pool_id = pool->sp_id;
+		__entry->nrthreads = pool->sp_nrthreads;
+		__entry->pool_flags = pool->sp_flags;
+		__entry->thread_id = rqstp->rq_thread_id;
+		__entry->rqstp = rqstp;
+	),
+
+	TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
+		__get_str(name), __entry->pool_id,
+		show_svc_pool_flags(__entry->pool_flags),
+		__entry->nrthreads, __entry->thread_id
+	)
+);
+
+#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
+	DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
+			TP_PROTO( \
+				const struct svc_serv *serv, \
+				const struct svc_pool *pool, \
+				const struct svc_rqst *rqstp \
+			), \
+			TP_ARGS(serv, pool, rqstp))
+
+DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
+DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
+
 DECLARE_EVENT_CLASS(svc_xprt_event,
 	TP_PROTO(
 		const struct svc_xprt *xprt
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 9b6701a38e71..ef350f0d8925 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 
 		pool->sp_id = i;
 		INIT_LIST_HEAD(&pool->sp_sockets);
-		INIT_LIST_HEAD(&pool->sp_all_threads);
 		spin_lock_init(&pool->sp_lock);
+		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
 
 		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -594,6 +594,8 @@ svc_destroy(struct kref *ref)
 		percpu_counter_destroy(&pool->sp_threads_woken);
 		percpu_counter_destroy(&pool->sp_threads_timedout);
 		percpu_counter_destroy(&pool->sp_threads_starved);
+
+		xa_destroy(&pool->sp_thread_xa);
 	}
 	kfree(serv->sv_pools);
 	kfree(serv);
@@ -674,7 +676,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
 static struct svc_rqst *
 svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 {
+	static const struct xa_limit limit = {
+		.max = U32_MAX,
+	};
 	struct svc_rqst	*rqstp;
+	int ret;
 
 	rqstp = svc_rqst_alloc(serv, pool, node);
 	if (!rqstp)
@@ -685,15 +691,25 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	serv->sv_nrthreads += 1;
 	spin_unlock_bh(&serv->sv_lock);
 
-	spin_lock_bh(&pool->sp_lock);
+	xa_lock(&pool->sp_thread_xa);
+	ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
+			 limit, GFP_KERNEL);
+	if (ret) {
+		xa_unlock(&pool->sp_thread_xa);
+		goto out_free;
+	}
 	pool->sp_nrthreads++;
-	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
-	spin_unlock_bh(&pool->sp_lock);
+	xa_unlock(&pool->sp_thread_xa);
+	trace_svc_pool_thread_init(serv, pool, rqstp);
 	return rqstp;
+
+out_free:
+	svc_rqst_free(rqstp);
+	return ERR_PTR(ret);
 }
 
 /**
- * svc_pool_wake_idle_thread - wake an idle thread in @pool
+ * svc_pool_wake_idle_thread - Find and wake an idle thread in @pool
  * @serv: RPC service
  * @pool: service thread pool
  *
@@ -706,19 +722,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
 					   struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
+	unsigned long index;
 
-	rcu_read_lock();
-	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
+	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
 		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
 			continue;
 
-		rcu_read_unlock();
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
 		wake_up_process(rqstp->rq_task);
 		percpu_counter_inc(&pool->sp_threads_woken);
 		return rqstp;
 	}
-	rcu_read_unlock();
 
 	trace_svc_pool_starved(serv, pool);
 	percpu_counter_inc(&pool->sp_threads_starved);
@@ -734,32 +748,31 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 static struct task_struct *
 svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 {
-	unsigned int i;
 	struct task_struct *task = NULL;
+	struct svc_rqst *rqstp;
+	unsigned long zero = 0;
+	unsigned int i;
 
 	if (pool != NULL) {
-		spin_lock_bh(&pool->sp_lock);
+		xa_lock(&pool->sp_thread_xa);
 	} else {
 		for (i = 0; i < serv->sv_nrpools; i++) {
 			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
-			spin_lock_bh(&pool->sp_lock);
-			if (!list_empty(&pool->sp_all_threads))
+			xa_lock(&pool->sp_thread_xa);
+			if (!xa_empty(&pool->sp_thread_xa))
 				goto found_pool;
-			spin_unlock_bh(&pool->sp_lock);
+			xa_unlock(&pool->sp_thread_xa);
 		}
 		return NULL;
 	}
 
 found_pool:
-	if (!list_empty(&pool->sp_all_threads)) {
-		struct svc_rqst *rqstp;
-
-		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
-		set_bit(RQ_VICTIM, &rqstp->rq_flags);
-		list_del_rcu(&rqstp->rq_all);
+	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
+	if (rqstp) {
+		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
 		task = rqstp->rq_task;
 	}
-	spin_unlock_bh(&pool->sp_lock);
+	xa_unlock(&pool->sp_thread_xa);
 	return task;
 }
 
@@ -841,9 +854,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 	if (pool == NULL) {
 		nrservs -= serv->sv_nrthreads;
 	} else {
-		spin_lock_bh(&pool->sp_lock);
+		xa_lock(&pool->sp_thread_xa);
 		nrservs -= pool->sp_nrthreads;
-		spin_unlock_bh(&pool->sp_lock);
+		xa_unlock(&pool->sp_thread_xa);
 	}
 
 	if (nrservs > 0)
@@ -930,11 +943,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
 	struct svc_serv	*serv = rqstp->rq_server;
 	struct svc_pool	*pool = rqstp->rq_pool;
 
-	spin_lock_bh(&pool->sp_lock);
+	xa_lock(&pool->sp_thread_xa);
 	pool->sp_nrthreads--;
-	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
-		list_del_rcu(&rqstp->rq_all);
-	spin_unlock_bh(&pool->sp_lock);
+	__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
+	xa_unlock(&pool->sp_thread_xa);
+	trace_svc_pool_thread_exit(serv, pool, rqstp);
 
 	spin_lock_bh(&serv->sv_lock);
 	serv->sv_nrthreads -= 1;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 8ced7591ce07..7709120b45c1 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
 
 /* SMP locking strategy:
  *
- *	svc_pool->sp_lock protects most of the fields of that pool.
+ *	svc_pool->sp_lock protects sp_sockets.
  *	svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
  *	when both need to be taken (rare), svc_serv->sv_lock is first.
  *	The "service mutex" protects svc_serv->sv_nrthread.




* [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (7 preceding siblings ...)
  2023-07-04  0:08 ` [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray Chuck Lever
@ 2023-07-04  0:08 ` Chuck Lever
  2023-07-04  1:26   ` NeilBrown
  2023-07-05  1:08 ` [PATCH v2 0/9] SUNRPC service thread scheduler optimizations NeilBrown
  9 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  0:08 UTC (permalink / raw)
  To: linux-nfs; +Cc: Chuck Lever, lorenzo, neilb, jlayton, david

From: Chuck Lever <chuck.lever@oracle.com>

I've noticed that client-observed server request latency goes up
simply when the nfsd thread count is increased.

List walking is known to be memory-inefficient. On a busy server
with many threads, enqueuing a transport will walk the "all threads"
list quite frequently. This also pulls in the cache lines for some
hot fields in each svc_rqst (namely, rq_flags).

The svc_xprt_enqueue() call that concerns me most is the one in
svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
down completion handling limits the total throughput per RDMA
connection.

So, avoid walking the "all threads" list to find an idle thread to
wake. Instead, set up an idle bitmap and use find_next_bit, which
should work the same way as RQ_BUSY but it will touch only the
cachelines that the bitmap is in. Stick with atomic bit operations
to avoid taking the pool lock.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 include/linux/sunrpc/svc.h    |    6 ++++--
 include/trace/events/sunrpc.h |    1 -
 net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
 net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
 4 files changed, 49 insertions(+), 15 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 6f8bfcd44250..27ffcf7371d0 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -35,6 +35,7 @@ struct svc_pool {
 	spinlock_t		sp_lock;	/* protects sp_sockets */
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
+	unsigned long		*sp_idle_map;	/* idle threads */
 	struct xarray		sp_thread_xa;
 
 	/* statistics on pool operation */
@@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
 #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
 				+ 2 + 1)
 
+#define RPCSVC_MAXPOOLTHREADS	(4096)
+
 /*
  * The context of a single thread, including the request currently being
  * processed.
@@ -239,8 +242,7 @@ struct svc_rqst {
 #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
 						 * to prevent encrypting page
 						 * cache pages */
-#define	RQ_BUSY		(5)			/* request is busy */
-#define	RQ_DATA		(6)			/* request has data */
+#define	RQ_DATA		(5)			/* request has data */
 	unsigned long		rq_flags;	/* flags field */
 	u32			rq_thread_id;	/* xarray index */
 	ktime_t			rq_qtime;	/* enqueue time */
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index ea43c6059bdb..c07824a254bf 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
 	svc_rqst_flag(USEDEFERRAL)					\
 	svc_rqst_flag(DROPME)						\
 	svc_rqst_flag(SPLICE_OK)					\
-	svc_rqst_flag(BUSY)						\
 	svc_rqst_flag_end(DATA)
 
 #undef svc_rqst_flag
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ef350f0d8925..d0278e5190ba 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		INIT_LIST_HEAD(&pool->sp_sockets);
 		spin_lock_init(&pool->sp_lock);
 		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
+		/* All threads initially marked "busy" */
+		pool->sp_idle_map =
+			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
+					   svc_pool_map_get_node(i));
+		if (!pool->sp_idle_map)
+			return NULL;
 
 		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
 		percpu_counter_destroy(&pool->sp_threads_starved);
 
 		xa_destroy(&pool->sp_thread_xa);
+		bitmap_free(pool->sp_idle_map);
+		pool->sp_idle_map = NULL;
 	}
 	kfree(serv->sv_pools);
 	kfree(serv);
@@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
 
 	folio_batch_init(&rqstp->rq_fbatch);
 
-	__set_bit(RQ_BUSY, &rqstp->rq_flags);
 	rqstp->rq_server = serv;
 	rqstp->rq_pool = pool;
 
@@ -677,7 +684,7 @@ static struct svc_rqst *
 svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 {
 	static const struct xa_limit limit = {
-		.max = U32_MAX,
+		.max = RPCSVC_MAXPOOLTHREADS,
 	};
 	struct svc_rqst	*rqstp;
 	int ret;
@@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
 					   struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
-	unsigned long index;
+	unsigned long bit;
 
-	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
-		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
+	/* Check the pool's idle bitmap locklessly so that multiple
+	 * idle searches can proceed concurrently.
+	 */
+	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
+		if (!test_and_clear_bit(bit, pool->sp_idle_map))
 			continue;
 
+		rqstp = xa_load(&pool->sp_thread_xa, bit);
+		if (!rqstp)
+			break;
+
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
 		wake_up_process(rqstp->rq_task);
 		percpu_counter_inc(&pool->sp_threads_woken);
@@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
 	}
 
 found_pool:
-	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
+	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
+			XA_PRESENT);
 	if (rqstp) {
 		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
 		task = rqstp->rq_task;
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 7709120b45c1..2844b32c16ea 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
 	return true;
 }
 
+static void svc_pool_thread_mark_idle(struct svc_pool *pool,
+				      struct svc_rqst *rqstp)
+{
+	smp_mb__before_atomic();
+	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
+	smp_mb__after_atomic();
+}
+
+/*
+ * Note: If we were awoken, then this rqstp has already been marked busy.
+ */
+static void svc_pool_thread_mark_busy(struct svc_pool *pool,
+				      struct svc_rqst *rqstp)
+{
+	smp_mb__before_atomic();
+	clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
+	smp_mb__after_atomic();
+}
+
 static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 {
 	struct svc_pool		*pool = rqstp->rq_pool;
@@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	set_current_state(TASK_INTERRUPTIBLE);
 	smp_mb__before_atomic();
 	clear_bit(SP_CONGESTED, &pool->sp_flags);
-	clear_bit(RQ_BUSY, &rqstp->rq_flags);
-	smp_mb__after_atomic();
 
-	if (likely(rqst_should_sleep(rqstp)))
+	if (likely(rqst_should_sleep(rqstp))) {
+		svc_pool_thread_mark_idle(pool, rqstp);
 		time_left = schedule_timeout(timeout);
-	else
+	} else
 		__set_current_state(TASK_RUNNING);
 
 	try_to_freeze();
 
-	set_bit(RQ_BUSY, &rqstp->rq_flags);
-	smp_mb__after_atomic();
+	svc_pool_thread_mark_busy(pool, rqstp);
+
 	rqstp->rq_xprt = svc_xprt_dequeue(pool);
 	if (rqstp->rq_xprt) {
 		trace_svc_pool_awoken(rqstp);




* Re: [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool
  2023-07-04  0:07 ` [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool Chuck Lever
@ 2023-07-04  0:45   ` NeilBrown
  2023-07-04  1:47     ` Chuck Lever
  0 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-04  0:45 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> To get a sense of the average number of transport enqueue operations
> needed to process an incoming RPC message, re-use the "packets" pool
> stat. Track the number of complete RPC messages processed by each
> thread pool.

If I understand this correctly, then I would say it differently.

I think there are either zero or one transport enqueue operations for
each incoming RPC message.  Is that correct?  So the average would be in
(0,1].
Wouldn't it be more natural to talk about the average number of incoming
RPC messages processed per enqueue operation?  This would be typically
be around 1 on a lightly loaded server and would climb up as things get
busy. 

Was there a reason you wrote it that way around?

NeilBrown


* Re: [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock
  2023-07-04  0:08 ` [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock Chuck Lever
@ 2023-07-04  0:56   ` NeilBrown
  2023-07-04  1:48     ` Chuck Lever
  0 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-04  0:56 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Consumers of sp_lock now all run in process context. We can avoid
> the overhead of disabling bottom halves when taking this lock.

"now" suggests that something has recently changed so that sp_lock isn't
taken in bh context.  What was that change - I don't see it.

I think svc_data_ready() is called in bh context, and that calls
svc_xprt_enqueue(), which takes sp_lock to add the xprt to ->sp_sockets. 

Is that not still the case?

NeilBrown


> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/svc_xprt.c |   14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index ecbccf0d89b9..8ced7591ce07 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -472,9 +472,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>  	pool = svc_pool_for_cpu(xprt->xpt_server);
>  
>  	percpu_counter_inc(&pool->sp_sockets_queued);
> -	spin_lock_bh(&pool->sp_lock);
> +	spin_lock(&pool->sp_lock);
>  	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> -	spin_unlock_bh(&pool->sp_lock);
> +	spin_unlock(&pool->sp_lock);
>  
>  	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
>  	if (!rqstp) {
> @@ -496,14 +496,14 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
>  	if (list_empty(&pool->sp_sockets))
>  		goto out;
>  
> -	spin_lock_bh(&pool->sp_lock);
> +	spin_lock(&pool->sp_lock);
>  	if (likely(!list_empty(&pool->sp_sockets))) {
>  		xprt = list_first_entry(&pool->sp_sockets,
>  					struct svc_xprt, xpt_ready);
>  		list_del_init(&xprt->xpt_ready);
>  		svc_xprt_get(xprt);
>  	}
> -	spin_unlock_bh(&pool->sp_lock);
> +	spin_unlock(&pool->sp_lock);
>  out:
>  	return xprt;
>  }
> @@ -1116,15 +1116,15 @@ static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
>  	for (i = 0; i < serv->sv_nrpools; i++) {
>  		pool = &serv->sv_pools[i];
>  
> -		spin_lock_bh(&pool->sp_lock);
> +		spin_lock(&pool->sp_lock);
>  		list_for_each_entry_safe(xprt, tmp, &pool->sp_sockets, xpt_ready) {
>  			if (xprt->xpt_net != net)
>  				continue;
>  			list_del_init(&xprt->xpt_ready);
> -			spin_unlock_bh(&pool->sp_lock);
> +			spin_unlock(&pool->sp_lock);
>  			return xprt;
>  		}
> -		spin_unlock_bh(&pool->sp_lock);
> +		spin_unlock(&pool->sp_lock);
>  	}
>  	return NULL;
>  }
> 
> 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray
  2023-07-04  0:08 ` [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray Chuck Lever
@ 2023-07-04  1:11   ` NeilBrown
  2023-07-04  1:52     ` Chuck Lever
  0 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-04  1:11 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> We want a thread lookup operation that can be done with RCU only,
> but also we want to avoid the linked-list walk, which does not scale
> well in the number of pool threads.
> 
> BH-disabled locking is no longer necessary because we're no longer
> sharing the pool's sp_lock to protect either the xarray or the
> pool's thread count. sp_lock also protects transport activity. There
> are no callers of svc_set_num_threads() that run outside of process
> context.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/nfssvc.c              |    3 +-
>  include/linux/sunrpc/svc.h    |   11 +++----
>  include/trace/events/sunrpc.h |   47 ++++++++++++++++++++++++++++-
>  net/sunrpc/svc.c              |   67 ++++++++++++++++++++++++-----------------
>  net/sunrpc/svc_xprt.c         |    2 +
>  5 files changed, 93 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 2154fa63c5f2..d42b2a40c93c 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
>   * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
>   * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
>   * nn->keep_active is set).  That number of nfsd threads must
> - * exist and each must be listed in ->sp_all_threads in some entry of
> - * ->sv_pools[].
> + * exist and each must be listed in some entry of ->sv_pools[].
>   *
>   * Each active thread holds a counted reference on nn->nfsd_serv, as does
>   * the nn->keep_active flag and various transient calls to svc_get().
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 74ea13270679..6f8bfcd44250 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -32,10 +32,10 @@
>   */
>  struct svc_pool {
>  	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
> -	spinlock_t		sp_lock;	/* protects all fields */
> +	spinlock_t		sp_lock;	/* protects sp_sockets */
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> -	struct list_head	sp_all_threads;	/* all server threads */
> +	struct xarray		sp_thread_xa;
>  
>  	/* statistics on pool operation */
>  	struct percpu_counter	sp_messages_arrived;
> @@ -195,7 +195,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   * processed.
>   */
>  struct svc_rqst {
> -	struct list_head	rq_all;		/* all threads list */
>  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
>  	struct svc_xprt *	rq_xprt;	/* transport ptr */
>  
> @@ -240,10 +239,10 @@ struct svc_rqst {
>  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
>  						 * to prevent encrypting page
>  						 * cache pages */
> -#define	RQ_VICTIM	(5)			/* about to be shut down */
> -#define	RQ_BUSY		(6)			/* request is busy */
> -#define	RQ_DATA		(7)			/* request has data */
> +#define	RQ_BUSY		(5)			/* request is busy */
> +#define	RQ_DATA		(6)			/* request has data */
>  	unsigned long		rq_flags;	/* flags field */
> +	u32			rq_thread_id;	/* xarray index */
>  	ktime_t			rq_qtime;	/* enqueue time */
>  
>  	void *			rq_argp;	/* decoded arguments */
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index 60c8e03268d4..ea43c6059bdb 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>  	svc_rqst_flag(USEDEFERRAL)					\
>  	svc_rqst_flag(DROPME)						\
>  	svc_rqst_flag(SPLICE_OK)					\
> -	svc_rqst_flag(VICTIM)						\
>  	svc_rqst_flag(BUSY)						\
>  	svc_rqst_flag_end(DATA)
>  
> @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
>  	)
>  );
>  
> +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> +	TP_PROTO(
> +		const struct svc_serv *serv,
> +		const struct svc_pool *pool,
> +		const struct svc_rqst *rqstp
> +	),
> +
> +	TP_ARGS(serv, pool, rqstp),
> +
> +	TP_STRUCT__entry(
> +		__string(name, serv->sv_name)
> +		__field(int, pool_id)
> +		__field(unsigned int, nrthreads)
> +		__field(unsigned long, pool_flags)
> +		__field(u32, thread_id)
> +		__field(const void *, rqstp)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_str(name, serv->sv_name);
> +		__entry->pool_id = pool->sp_id;
> +		__entry->nrthreads = pool->sp_nrthreads;
> +		__entry->pool_flags = pool->sp_flags;
> +		__entry->thread_id = rqstp->rq_thread_id;
> +		__entry->rqstp = rqstp;
> +	),
> +
> +	TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> +		__get_str(name), __entry->pool_id,
> +		show_svc_pool_flags(__entry->pool_flags),
> +		__entry->nrthreads, __entry->thread_id
> +	)
> +);
> +
> +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> +	DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> +			TP_PROTO( \
> +				const struct svc_serv *serv, \
> +				const struct svc_pool *pool, \
> +				const struct svc_rqst *rqstp \
> +			), \
> +			TP_ARGS(serv, pool, rqstp))
> +
> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> +
>  DECLARE_EVENT_CLASS(svc_xprt_event,
>  	TP_PROTO(
>  		const struct svc_xprt *xprt
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 9b6701a38e71..ef350f0d8925 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  
>  		pool->sp_id = i;
>  		INIT_LIST_HEAD(&pool->sp_sockets);
> -		INIT_LIST_HEAD(&pool->sp_all_threads);
>  		spin_lock_init(&pool->sp_lock);
> +		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>  
>  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -594,6 +594,8 @@ svc_destroy(struct kref *ref)
>  		percpu_counter_destroy(&pool->sp_threads_woken);
>  		percpu_counter_destroy(&pool->sp_threads_timedout);
>  		percpu_counter_destroy(&pool->sp_threads_starved);
> +
> +		xa_destroy(&pool->sp_thread_xa);
>  	}
>  	kfree(serv->sv_pools);
>  	kfree(serv);
> @@ -674,7 +676,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
>  static struct svc_rqst *
>  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  {
> +	static const struct xa_limit limit = {
> +		.max = U32_MAX,
> +	};
>  	struct svc_rqst	*rqstp;
> +	int ret;
>  
>  	rqstp = svc_rqst_alloc(serv, pool, node);
>  	if (!rqstp)
> @@ -685,15 +691,25 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  	serv->sv_nrthreads += 1;
>  	spin_unlock_bh(&serv->sv_lock);
>  
> -	spin_lock_bh(&pool->sp_lock);
> +	xa_lock(&pool->sp_thread_xa);
> +	ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> +			 limit, GFP_KERNEL);
> +	if (ret) {
> +		xa_unlock(&pool->sp_thread_xa);
> +		goto out_free;
> +	}
>  	pool->sp_nrthreads++;
> -	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> -	spin_unlock_bh(&pool->sp_lock);
> +	xa_unlock(&pool->sp_thread_xa);
> +	trace_svc_pool_thread_init(serv, pool, rqstp);
>  	return rqstp;
> +
> +out_free:
> +	svc_rqst_free(rqstp);
> +	return ERR_PTR(ret);
>  }
>  
>  /**
> - * svc_pool_wake_idle_thread - wake an idle thread in @pool
> + * svc_pool_wake_idle_thread - Find and wake an idle thread in @pool
>   * @serv: RPC service
>   * @pool: service thread pool
>   *
> @@ -706,19 +722,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>  					   struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> +	unsigned long index;
>  
> -	rcu_read_lock();
> -	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> +	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
>  		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
>  			continue;
>  
> -		rcu_read_unlock();
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>  		wake_up_process(rqstp->rq_task);
>  		percpu_counter_inc(&pool->sp_threads_woken);
>  		return rqstp;
>  	}
> -	rcu_read_unlock();
>  
>  	trace_svc_pool_starved(serv, pool);
>  	percpu_counter_inc(&pool->sp_threads_starved);
> @@ -734,32 +748,31 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>  static struct task_struct *
>  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>  {
> -	unsigned int i;
>  	struct task_struct *task = NULL;
> +	struct svc_rqst *rqstp;
> +	unsigned long zero = 0;
> +	unsigned int i;
>  
>  	if (pool != NULL) {
> -		spin_lock_bh(&pool->sp_lock);
> +		xa_lock(&pool->sp_thread_xa);
>  	} else {
>  		for (i = 0; i < serv->sv_nrpools; i++) {
>  			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> -			spin_lock_bh(&pool->sp_lock);
> -			if (!list_empty(&pool->sp_all_threads))
> +			xa_lock(&pool->sp_thread_xa);
> +			if (!xa_empty(&pool->sp_thread_xa))
>  				goto found_pool;
> -			spin_unlock_bh(&pool->sp_lock);
> +			xa_unlock(&pool->sp_thread_xa);
>  		}
>  		return NULL;
>  	}
>  
>  found_pool:
> -	if (!list_empty(&pool->sp_all_threads)) {
> -		struct svc_rqst *rqstp;
> -
> -		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> -		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> -		list_del_rcu(&rqstp->rq_all);
> +	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> +	if (rqstp) {
> +		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);

This bothers me.  We always delete the earliest thread in the xarray.
So if we create 128 threads, then reduce the number to 64, the remaining
threads will be numbers 64 to 127.
This means searching in the bitmap will be (slightly) slower than
necessary.

xa doesn't have a "find last" interface, but we "know" how many entries
are in the array - or we would if we decremented the counter when we
removed an entry.
Currently we only decrement the counter when the thread exits - at which
time it removes the entry - which may already have been removed.
If we change that code to check if it is still present in the xarray and
to only erase/decrement if it is, then we can decrement the counter here
and always reliably be able to find the "last" entry.

... though I think we wait for a thread to exit before finding the next
    victim, so maybe all we need to do is start the xa_find from
    ->sp_nrthreads-1 rather than from zero ??
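
    Roughly, just a sketch (this assumes sp_nrthreads > 0, that the live
    entries stay packed at indexes 0..sp_nrthreads-1, and that the erase
    still runs under xa_lock as in the patch):

	unsigned long last = pool->sp_nrthreads - 1;

	rqstp = xa_find(&pool->sp_thread_xa, &last, U32_MAX, XA_PRESENT);
	if (rqstp) {
		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
		task = rqstp->rq_task;
	}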

Is it worth it?  I don't know.  But it bothers me.

NeilBrown


>  		task = rqstp->rq_task;
>  	}
> -	spin_unlock_bh(&pool->sp_lock);
> +	xa_unlock(&pool->sp_thread_xa);
>  	return task;
>  }
>  
> @@ -841,9 +854,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>  	if (pool == NULL) {
>  		nrservs -= serv->sv_nrthreads;
>  	} else {
> -		spin_lock_bh(&pool->sp_lock);
> +		xa_lock(&pool->sp_thread_xa);
>  		nrservs -= pool->sp_nrthreads;
> -		spin_unlock_bh(&pool->sp_lock);
> +		xa_unlock(&pool->sp_thread_xa);
>  	}
>  
>  	if (nrservs > 0)
> @@ -930,11 +943,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
>  	struct svc_serv	*serv = rqstp->rq_server;
>  	struct svc_pool	*pool = rqstp->rq_pool;
>  
> -	spin_lock_bh(&pool->sp_lock);
> +	xa_lock(&pool->sp_thread_xa);
>  	pool->sp_nrthreads--;
> -	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> -		list_del_rcu(&rqstp->rq_all);
> -	spin_unlock_bh(&pool->sp_lock);
> +	__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> +	xa_unlock(&pool->sp_thread_xa);
> +	trace_svc_pool_thread_exit(serv, pool, rqstp);
>  
>  	spin_lock_bh(&serv->sv_lock);
>  	serv->sv_nrthreads -= 1;
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 8ced7591ce07..7709120b45c1 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
>  
>  /* SMP locking strategy:
>   *
> - *	svc_pool->sp_lock protects most of the fields of that pool.
> + *	svc_pool->sp_lock protects sp_sockets.
>   *	svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
>   *	when both need to be taken (rare), svc_serv->sv_lock is first.
>   *	The "service mutex" protects svc_serv->sv_nrthread.
> 
> 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04  0:08 ` [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap Chuck Lever
@ 2023-07-04  1:26   ` NeilBrown
  2023-07-04  2:02     ` Chuck Lever
  0 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-04  1:26 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> I've noticed that client-observed server request latency goes up
> simply when the nfsd thread count is increased.
> 
> List walking is known to be memory-inefficient. On a busy server
> with many threads, enqueuing a transport will walk the "all threads"
> list quite frequently. This also pulls in the cache lines for some
> hot fields in each svc_rqst (namely, rq_flags).

I think this text could usefully be re-written.  By this point in the
series we aren't list walking.

I'd also be curious to know what latency difference you get for just this
change.

> 
> The svc_xprt_enqueue() call that concerns me most is the one in
> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> down completion handling limits the total throughput per RDMA
> connection.
> 
> So, avoid walking the "all threads" list to find an idle thread to
> wake. Instead, set up an idle bitmap and use find_next_bit, which
> should work the same way as RQ_BUSY but it will touch only the
> cachelines that the bitmap is in. Stick with atomic bit operations
> to avoid taking the pool lock.
> 
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  include/linux/sunrpc/svc.h    |    6 ++++--
>  include/trace/events/sunrpc.h |    1 -
>  net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
>  net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
>  4 files changed, 49 insertions(+), 15 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 6f8bfcd44250..27ffcf7371d0 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -35,6 +35,7 @@ struct svc_pool {
>  	spinlock_t		sp_lock;	/* protects sp_sockets */
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> +	unsigned long		*sp_idle_map;	/* idle threads */
>  	struct xarray		sp_thread_xa;
>  
>  	/* statistics on pool operation */
> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>  #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
>  				+ 2 + 1)
>  
> +#define RPCSVC_MAXPOOLTHREADS	(4096)
> +
>  /*
>   * The context of a single thread, including the request currently being
>   * processed.
> @@ -239,8 +242,7 @@ struct svc_rqst {
>  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
>  						 * to prevent encrypting page
>  						 * cache pages */
> -#define	RQ_BUSY		(5)			/* request is busy */
> -#define	RQ_DATA		(6)			/* request has data */
> +#define	RQ_DATA		(5)			/* request has data */

Might this be a good opportunity to convert this to an enum ??

>  	unsigned long		rq_flags;	/* flags field */
>  	u32			rq_thread_id;	/* xarray index */
>  	ktime_t			rq_qtime;	/* enqueue time */
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index ea43c6059bdb..c07824a254bf 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>  	svc_rqst_flag(USEDEFERRAL)					\
>  	svc_rqst_flag(DROPME)						\
>  	svc_rqst_flag(SPLICE_OK)					\
> -	svc_rqst_flag(BUSY)						\
>  	svc_rqst_flag_end(DATA)
>  
>  #undef svc_rqst_flag
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ef350f0d8925..d0278e5190ba 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  		INIT_LIST_HEAD(&pool->sp_sockets);
>  		spin_lock_init(&pool->sp_lock);
>  		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> +		/* All threads initially marked "busy" */
> +		pool->sp_idle_map =
> +			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> +					   svc_pool_map_get_node(i));
> +		if (!pool->sp_idle_map)
> +			return NULL;
>  
>  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
>  		percpu_counter_destroy(&pool->sp_threads_starved);
>  
>  		xa_destroy(&pool->sp_thread_xa);
> +		bitmap_free(pool->sp_idle_map);
> +		pool->sp_idle_map = NULL;
>  	}
>  	kfree(serv->sv_pools);
>  	kfree(serv);
> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>  
>  	folio_batch_init(&rqstp->rq_fbatch);
>  
> -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
>  	rqstp->rq_server = serv;
>  	rqstp->rq_pool = pool;
>  
> @@ -677,7 +684,7 @@ static struct svc_rqst *
>  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  {
>  	static const struct xa_limit limit = {
> -		.max = U32_MAX,
> +		.max = RPCSVC_MAXPOOLTHREADS,
>  	};
>  	struct svc_rqst	*rqstp;
>  	int ret;
> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>  					   struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> -	unsigned long index;
> +	unsigned long bit;
>  
> -	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> +	/* Check the pool's idle bitmap locklessly so that multiple
> +	 * idle searches can proceed concurrently.
> +	 */
> +	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> +		if (!test_and_clear_bit(bit, pool->sp_idle_map))
>  			continue;

I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
and so get all the required memory barriers.
What we are doing here is locking a particular thread for a task, so
"lock" is an appropriate description of what is happening.
See also svc_pool_thread_mark_* below.

>  
> +		rqstp = xa_load(&pool->sp_thread_xa, bit);
> +		if (!rqstp)
> +			break;
> +
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>  		wake_up_process(rqstp->rq_task);
>  		percpu_counter_inc(&pool->sp_threads_woken);
> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
>  	}
>  
>  found_pool:
> -	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> +	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> +			XA_PRESENT);
>  	if (rqstp) {
>  		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>  		task = rqstp->rq_task;
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 7709120b45c1..2844b32c16ea 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  	return true;
>  }
>  
> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> +				      struct svc_rqst *rqstp)
> +{
> +	smp_mb__before_atomic();
> +	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> +	smp_mb__after_atomic();
> +}

The memory barriers above and below bother me.  There is no comment
telling me what they are protecting against.
I would rather svc_pool_thread_mark_idle - which unlocks the thread -
were

    clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);

and  that svc_pool_thread_mark_busy were

   test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);

Then it would be more obvious what was happening.
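
With that, the search in svc_pool_wake_idle_thread() could become
something like this (sketch only, with the bitmap inverted and renamed
sp_busy_map, initialised full):

	for_each_clear_bit(bit, pool->sp_busy_map, pool->sp_nrthreads) {
		if (test_and_set_bit_lock(bit, pool->sp_busy_map))
			continue;	/* raced with another waker */

		rqstp = xa_load(&pool->sp_thread_xa, bit);
		if (!rqstp)
			break;
		/* ... wake rqstp as in the patch ... */
	}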

Thanks,
NeilBrown

> +
> +/*
> + * Note: If we were awoken, then this rqstp has already been marked busy.
> + */
> +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> +				      struct svc_rqst *rqstp)
> +{
> +	smp_mb__before_atomic();
> +	clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> +	smp_mb__after_atomic();
> +}
> +
>  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_pool		*pool = rqstp->rq_pool;
> @@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	set_current_state(TASK_INTERRUPTIBLE);
>  	smp_mb__before_atomic();
>  	clear_bit(SP_CONGESTED, &pool->sp_flags);
> -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
>  
> -	if (likely(rqst_should_sleep(rqstp)))
> +	if (likely(rqst_should_sleep(rqstp))) {
> +		svc_pool_thread_mark_idle(pool, rqstp);
>  		time_left = schedule_timeout(timeout);
> -	else
> +	} else
>  		__set_current_state(TASK_RUNNING);
>  
>  	try_to_freeze();
>  
> -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
> +	svc_pool_thread_mark_busy(pool, rqstp);
> +
>  	rqstp->rq_xprt = svc_xprt_dequeue(pool);
>  	if (rqstp->rq_xprt) {
>  		trace_svc_pool_awoken(rqstp);
> 
> 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool
  2023-07-04  0:45   ` NeilBrown
@ 2023-07-04  1:47     ` Chuck Lever
  2023-07-04  2:30       ` NeilBrown
  2023-07-10  0:41       ` [PATCH/RFC] sunrpc: constant-time code to wake idle thread NeilBrown
  0 siblings, 2 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  1:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, Jul 04, 2023 at 10:45:20AM +1000, NeilBrown wrote:
> On Tue, 04 Jul 2023, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > To get a sense of the average number of transport enqueue operations
> > needed to process an incoming RPC message, re-use the "packets" pool
> > stat. Track the number of complete RPC messages processed by each
> > thread pool.
> 
> If I understand this correctly, then I would say it differently.
> 
> I think there are either zero or one transport enqueue operations for
> each incoming RPC message.  Is that correct?  So the average would be in
> (0,1].
> Wouldn't it be more natural to talk about the average number of incoming
> RPC messages processed per enqueue operation?  This would be typically
> be around 1 on a lightly loaded server and would climb up as things get
> busy. 
> 
> Was there a reason you wrote it the way around that you did?

Yes: more than one enqueue is done per incoming RPC. For example,
svc_data_ready() enqueues, and so does svc_xprt_receive().

If the RPC requires more than one call to ->recvfrom() to complete
the receive operation, each one of those calls does an enqueue.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock
  2023-07-04  0:56   ` NeilBrown
@ 2023-07-04  1:48     ` Chuck Lever
  0 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  1:48 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, Jul 04, 2023 at 10:56:08AM +1000, NeilBrown wrote:
> On Tue, 04 Jul 2023, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > Consumers of sp_lock now all run in process context. We can avoid
> > the overhead of disabling bottom halves when taking this lock.
> 
> "now" suggests that something has recently changed so that sp_lock isn't
> taken in bh context.  What was that change - I don't see it.
> 
> I think svc_data_ready() is called in bh, and that calls
> svc_xprt_enqueue(), which takes sp_lock to add the xprt to ->sp_sockets.
> 
> Is that not still the case?

Darn, I forgot about that one. I'll drop this patch.

> NeilBrown
> 
> 
> > 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  net/sunrpc/svc_xprt.c |   14 +++++++-------
> >  1 file changed, 7 insertions(+), 7 deletions(-)
> > 
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index ecbccf0d89b9..8ced7591ce07 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -472,9 +472,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> >  	pool = svc_pool_for_cpu(xprt->xpt_server);
> >  
> >  	percpu_counter_inc(&pool->sp_sockets_queued);
> > -	spin_lock_bh(&pool->sp_lock);
> > +	spin_lock(&pool->sp_lock);
> >  	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> > -	spin_unlock_bh(&pool->sp_lock);
> > +	spin_unlock(&pool->sp_lock);
> >  
> >  	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
> >  	if (!rqstp) {
> > @@ -496,14 +496,14 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
> >  	if (list_empty(&pool->sp_sockets))
> >  		goto out;
> >  
> > -	spin_lock_bh(&pool->sp_lock);
> > +	spin_lock(&pool->sp_lock);
> >  	if (likely(!list_empty(&pool->sp_sockets))) {
> >  		xprt = list_first_entry(&pool->sp_sockets,
> >  					struct svc_xprt, xpt_ready);
> >  		list_del_init(&xprt->xpt_ready);
> >  		svc_xprt_get(xprt);
> >  	}
> > -	spin_unlock_bh(&pool->sp_lock);
> > +	spin_unlock(&pool->sp_lock);
> >  out:
> >  	return xprt;
> >  }
> > @@ -1116,15 +1116,15 @@ static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
> >  	for (i = 0; i < serv->sv_nrpools; i++) {
> >  		pool = &serv->sv_pools[i];
> >  
> > -		spin_lock_bh(&pool->sp_lock);
> > +		spin_lock(&pool->sp_lock);
> >  		list_for_each_entry_safe(xprt, tmp, &pool->sp_sockets, xpt_ready) {
> >  			if (xprt->xpt_net != net)
> >  				continue;
> >  			list_del_init(&xprt->xpt_ready);
> > -			spin_unlock_bh(&pool->sp_lock);
> > +			spin_unlock(&pool->sp_lock);
> >  			return xprt;
> >  		}
> > -		spin_unlock_bh(&pool->sp_lock);
> > +		spin_unlock(&pool->sp_lock);
> >  	}
> >  	return NULL;
> >  }
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray
  2023-07-04  1:11   ` NeilBrown
@ 2023-07-04  1:52     ` Chuck Lever
  2023-07-04  2:32       ` NeilBrown
  0 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  1:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, Jul 04, 2023 at 11:11:57AM +1000, NeilBrown wrote:
> On Tue, 04 Jul 2023, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > We want a thread lookup operation that can be done with RCU only,
> > but also we want to avoid the linked-list walk, which does not scale
> > well in the number of pool threads.
> > 
> > BH-disabled locking is no longer necessary because we're no longer
> > sharing the pool's sp_lock to protect either the xarray or the
> > pool's thread count. sp_lock also protects transport activity. There
> > are no callers of svc_set_num_threads() that run outside of process
> > context.
> > 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  fs/nfsd/nfssvc.c              |    3 +-
> >  include/linux/sunrpc/svc.h    |   11 +++----
> >  include/trace/events/sunrpc.h |   47 ++++++++++++++++++++++++++++-
> >  net/sunrpc/svc.c              |   67 ++++++++++++++++++++++++-----------------
> >  net/sunrpc/svc_xprt.c         |    2 +
> >  5 files changed, 93 insertions(+), 37 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > index 2154fa63c5f2..d42b2a40c93c 100644
> > --- a/fs/nfsd/nfssvc.c
> > +++ b/fs/nfsd/nfssvc.c
> > @@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
> >   * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> >   * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> >   * nn->keep_active is set).  That number of nfsd threads must
> > - * exist and each must be listed in ->sp_all_threads in some entry of
> > - * ->sv_pools[].
> > + * exist and each must be listed in some entry of ->sv_pools[].
> >   *
> >   * Each active thread holds a counted reference on nn->nfsd_serv, as does
> >   * the nn->keep_active flag and various transient calls to svc_get().
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 74ea13270679..6f8bfcd44250 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -32,10 +32,10 @@
> >   */
> >  struct svc_pool {
> >  	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
> > -	spinlock_t		sp_lock;	/* protects all fields */
> > +	spinlock_t		sp_lock;	/* protects sp_sockets */
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> > -	struct list_head	sp_all_threads;	/* all server threads */
> > +	struct xarray		sp_thread_xa;
> >  
> >  	/* statistics on pool operation */
> >  	struct percpu_counter	sp_messages_arrived;
> > @@ -195,7 +195,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >   * processed.
> >   */
> >  struct svc_rqst {
> > -	struct list_head	rq_all;		/* all threads list */
> >  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
> >  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> >  
> > @@ -240,10 +239,10 @@ struct svc_rqst {
> >  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> >  						 * to prevent encrypting page
> >  						 * cache pages */
> > -#define	RQ_VICTIM	(5)			/* about to be shut down */
> > -#define	RQ_BUSY		(6)			/* request is busy */
> > -#define	RQ_DATA		(7)			/* request has data */
> > +#define	RQ_BUSY		(5)			/* request is busy */
> > +#define	RQ_DATA		(6)			/* request has data */
> >  	unsigned long		rq_flags;	/* flags field */
> > +	u32			rq_thread_id;	/* xarray index */
> >  	ktime_t			rq_qtime;	/* enqueue time */
> >  
> >  	void *			rq_argp;	/* decoded arguments */
> > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > index 60c8e03268d4..ea43c6059bdb 100644
> > --- a/include/trace/events/sunrpc.h
> > +++ b/include/trace/events/sunrpc.h
> > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> >  	svc_rqst_flag(USEDEFERRAL)					\
> >  	svc_rqst_flag(DROPME)						\
> >  	svc_rqst_flag(SPLICE_OK)					\
> > -	svc_rqst_flag(VICTIM)						\
> >  	svc_rqst_flag(BUSY)						\
> >  	svc_rqst_flag_end(DATA)
> >  
> > @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> >  	)
> >  );
> >  
> > +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> > +	TP_PROTO(
> > +		const struct svc_serv *serv,
> > +		const struct svc_pool *pool,
> > +		const struct svc_rqst *rqstp
> > +	),
> > +
> > +	TP_ARGS(serv, pool, rqstp),
> > +
> > +	TP_STRUCT__entry(
> > +		__string(name, serv->sv_name)
> > +		__field(int, pool_id)
> > +		__field(unsigned int, nrthreads)
> > +		__field(unsigned long, pool_flags)
> > +		__field(u32, thread_id)
> > +		__field(const void *, rqstp)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__assign_str(name, serv->sv_name);
> > +		__entry->pool_id = pool->sp_id;
> > +		__entry->nrthreads = pool->sp_nrthreads;
> > +		__entry->pool_flags = pool->sp_flags;
> > +		__entry->thread_id = rqstp->rq_thread_id;
> > +		__entry->rqstp = rqstp;
> > +	),
> > +
> > +	TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> > +		__get_str(name), __entry->pool_id,
> > +		show_svc_pool_flags(__entry->pool_flags),
> > +		__entry->nrthreads, __entry->thread_id
> > +	)
> > +);
> > +
> > +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> > +	DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> > +			TP_PROTO( \
> > +				const struct svc_serv *serv, \
> > +				const struct svc_pool *pool, \
> > +				const struct svc_rqst *rqstp \
> > +			), \
> > +			TP_ARGS(serv, pool, rqstp))
> > +
> > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> > +
> >  DECLARE_EVENT_CLASS(svc_xprt_event,
> >  	TP_PROTO(
> >  		const struct svc_xprt *xprt
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index 9b6701a38e71..ef350f0d8925 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  
> >  		pool->sp_id = i;
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> > -		INIT_LIST_HEAD(&pool->sp_all_threads);
> >  		spin_lock_init(&pool->sp_lock);
> > +		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> >  
> >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> >  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > @@ -594,6 +594,8 @@ svc_destroy(struct kref *ref)
> >  		percpu_counter_destroy(&pool->sp_threads_woken);
> >  		percpu_counter_destroy(&pool->sp_threads_timedout);
> >  		percpu_counter_destroy(&pool->sp_threads_starved);
> > +
> > +		xa_destroy(&pool->sp_thread_xa);
> >  	}
> >  	kfree(serv->sv_pools);
> >  	kfree(serv);
> > @@ -674,7 +676,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> >  static struct svc_rqst *
> >  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  {
> > +	static const struct xa_limit limit = {
> > +		.max = U32_MAX,
> > +	};
> >  	struct svc_rqst	*rqstp;
> > +	int ret;
> >  
> >  	rqstp = svc_rqst_alloc(serv, pool, node);
> >  	if (!rqstp)
> > @@ -685,15 +691,25 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  	serv->sv_nrthreads += 1;
> >  	spin_unlock_bh(&serv->sv_lock);
> >  
> > -	spin_lock_bh(&pool->sp_lock);
> > +	xa_lock(&pool->sp_thread_xa);
> > +	ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> > +			 limit, GFP_KERNEL);
> > +	if (ret) {
> > +		xa_unlock(&pool->sp_thread_xa);
> > +		goto out_free;
> > +	}
> >  	pool->sp_nrthreads++;
> > -	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> > -	spin_unlock_bh(&pool->sp_lock);
> > +	xa_unlock(&pool->sp_thread_xa);
> > +	trace_svc_pool_thread_init(serv, pool, rqstp);
> >  	return rqstp;
> > +
> > +out_free:
> > +	svc_rqst_free(rqstp);
> > +	return ERR_PTR(ret);
> >  }
> >  
> >  /**
> > - * svc_pool_wake_idle_thread - wake an idle thread in @pool
> > + * svc_pool_wake_idle_thread - Find and wake an idle thread in @pool
> >   * @serv: RPC service
> >   * @pool: service thread pool
> >   *
> > @@ -706,19 +722,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> >  					   struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > +	unsigned long index;
> >  
> > -	rcu_read_lock();
> > -	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> > +	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> >  		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> >  			continue;
> >  
> > -		rcu_read_unlock();
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >  		wake_up_process(rqstp->rq_task);
> >  		percpu_counter_inc(&pool->sp_threads_woken);
> >  		return rqstp;
> >  	}
> > -	rcu_read_unlock();
> >  
> >  	trace_svc_pool_starved(serv, pool);
> >  	percpu_counter_inc(&pool->sp_threads_starved);
> > @@ -734,32 +748,31 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> >  static struct task_struct *
> >  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> >  {
> > -	unsigned int i;
> >  	struct task_struct *task = NULL;
> > +	struct svc_rqst *rqstp;
> > +	unsigned long zero = 0;
> > +	unsigned int i;
> >  
> >  	if (pool != NULL) {
> > -		spin_lock_bh(&pool->sp_lock);
> > +		xa_lock(&pool->sp_thread_xa);
> >  	} else {
> >  		for (i = 0; i < serv->sv_nrpools; i++) {
> >  			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> > -			spin_lock_bh(&pool->sp_lock);
> > -			if (!list_empty(&pool->sp_all_threads))
> > +			xa_lock(&pool->sp_thread_xa);
> > +			if (!xa_empty(&pool->sp_thread_xa))
> >  				goto found_pool;
> > -			spin_unlock_bh(&pool->sp_lock);
> > +			xa_unlock(&pool->sp_thread_xa);
> >  		}
> >  		return NULL;
> >  	}
> >  
> >  found_pool:
> > -	if (!list_empty(&pool->sp_all_threads)) {
> > -		struct svc_rqst *rqstp;
> > -
> > -		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> > -		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > -		list_del_rcu(&rqstp->rq_all);
> > +	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > +	if (rqstp) {
> > +		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> 
> This bothers me.  We always delete the earliest thread in the xarray.
> So if we create 128 threads, then reduce the number to 64, the remaining
> threads will be numbers 64 to 127.
> This means searching in the bitmap will be (slightly) slower than
> necessary.
> 
> xa doesn't have a "find last" interface, but we "know" how many entries
> are in the array - or we would if we decremented the counter when we
> removed an entry.
> Currently we only decrement the counter when the thread exits - at which
> time it removes the entry - which may already have been removed.
> If we change that code to check if it is still present in the xarray and
> to only erase/decrement if it is, then we can decrement the counter here
> and always reliably be able to find the "last" entry.
> 
> ... though I think we wait for a thread to exit before finding the next
>     victim, so maybe all we need to do is start the xa_find from
>     ->sp_nrthreads-1 rather than from zero ??
> 
> Is it worth it?  I don't know.  But it bothers me.

Well it would be straightforward to change the "pool_wake_idle_thread"
search into a find_next_bit over a range. Store the lowest thread_id
in the svc_pool and use that as the starting point for the loop.
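
Roughly (sketch only; "sp_min_thread_id" is a made-up field name standing
in for whatever we record as the lowest live thread_id):

	unsigned long bit = READ_ONCE(pool->sp_min_thread_id);

	for_each_set_bit_from(bit, pool->sp_idle_map, RPCSVC_MAXPOOLTHREADS) {
		if (!test_and_clear_bit(bit, pool->sp_idle_map))
			continue;
		/* ... wake the thread as before ... */
	}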


> NeilBrown
> 
> 
> >  		task = rqstp->rq_task;
> >  	}
> > -	spin_unlock_bh(&pool->sp_lock);
> > +	xa_unlock(&pool->sp_thread_xa);
> >  	return task;
> >  }
> >  
> > @@ -841,9 +854,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> >  	if (pool == NULL) {
> >  		nrservs -= serv->sv_nrthreads;
> >  	} else {
> > -		spin_lock_bh(&pool->sp_lock);
> > +		xa_lock(&pool->sp_thread_xa);
> >  		nrservs -= pool->sp_nrthreads;
> > -		spin_unlock_bh(&pool->sp_lock);
> > +		xa_unlock(&pool->sp_thread_xa);
> >  	}
> >  
> >  	if (nrservs > 0)
> > @@ -930,11 +943,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
> >  	struct svc_serv	*serv = rqstp->rq_server;
> >  	struct svc_pool	*pool = rqstp->rq_pool;
> >  
> > -	spin_lock_bh(&pool->sp_lock);
> > +	xa_lock(&pool->sp_thread_xa);
> >  	pool->sp_nrthreads--;
> > -	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> > -		list_del_rcu(&rqstp->rq_all);
> > -	spin_unlock_bh(&pool->sp_lock);
> > +	__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > +	xa_unlock(&pool->sp_thread_xa);
> > +	trace_svc_pool_thread_exit(serv, pool, rqstp);
> >  
> >  	spin_lock_bh(&serv->sv_lock);
> >  	serv->sv_nrthreads -= 1;
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 8ced7591ce07..7709120b45c1 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
> >  
> >  /* SMP locking strategy:
> >   *
> > - *	svc_pool->sp_lock protects most of the fields of that pool.
> > + *	svc_pool->sp_lock protects sp_sockets.
> >   *	svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
> >   *	when both need to be taken (rare), svc_serv->sv_lock is first.
> >   *	The "service mutex" protects svc_serv->sv_nrthread.
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04  1:26   ` NeilBrown
@ 2023-07-04  2:02     ` Chuck Lever
  2023-07-04  2:17       ` NeilBrown
  2023-07-04 16:03       ` Chuck Lever III
  0 siblings, 2 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-04  2:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> On Tue, 04 Jul 2023, Chuck Lever wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > I've noticed that client-observed server request latency goes up
> > simply when the nfsd thread count is increased.
> > 
> > List walking is known to be memory-inefficient. On a busy server
> > with many threads, enqueuing a transport will walk the "all threads"
> > list quite frequently. This also pulls in the cache lines for some
> > hot fields in each svc_rqst (namely, rq_flags).
> 
> I think this text could usefully be re-written.  By this point in the
> series we aren't list walking.
> 
> I'd also be curious to know what latency difference you get for just this
> change.

Not much of a latency difference at lower thread counts.

The difference I notice is that with the spinlock version of
pool_wake_idle_thread, there is significant lock contention as
the thread count increases, and the throughput result of my fio
test is lower (outside the result variance).


> > The svc_xprt_enqueue() call that concerns me most is the one in
> > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > down completion handling limits the total throughput per RDMA
> > connection.
> > 
> > So, avoid walking the "all threads" list to find an idle thread to
> > wake. Instead, set up an idle bitmap and use find_next_bit, which
> > should work the same way as RQ_BUSY but it will touch only the
> > cachelines that the bitmap is in. Stick with atomic bit operations
> > to avoid taking the pool lock.
> > 
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  include/linux/sunrpc/svc.h    |    6 ++++--
> >  include/trace/events/sunrpc.h |    1 -
> >  net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
> >  net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
> >  4 files changed, 49 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 6f8bfcd44250..27ffcf7371d0 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -35,6 +35,7 @@ struct svc_pool {
> >  	spinlock_t		sp_lock;	/* protects sp_sockets */
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> > +	unsigned long		*sp_idle_map;	/* idle threads */
> >  	struct xarray		sp_thread_xa;
> >  
> >  	/* statistics on pool operation */
> > @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >  #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> >  				+ 2 + 1)
> >  
> > +#define RPCSVC_MAXPOOLTHREADS	(4096)
> > +
> >  /*
> >   * The context of a single thread, including the request currently being
> >   * processed.
> > @@ -239,8 +242,7 @@ struct svc_rqst {
> >  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> >  						 * to prevent encrypting page
> >  						 * cache pages */
> > -#define	RQ_BUSY		(5)			/* request is busy */
> > -#define	RQ_DATA		(6)			/* request has data */
> > +#define	RQ_DATA		(5)			/* request has data */
> 
> Might this be a good opportunity to convert this to an enum ??
> 
> >  	unsigned long		rq_flags;	/* flags field */
> >  	u32			rq_thread_id;	/* xarray index */
> >  	ktime_t			rq_qtime;	/* enqueue time */
> > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > index ea43c6059bdb..c07824a254bf 100644
> > --- a/include/trace/events/sunrpc.h
> > +++ b/include/trace/events/sunrpc.h
> > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> >  	svc_rqst_flag(USEDEFERRAL)					\
> >  	svc_rqst_flag(DROPME)						\
> >  	svc_rqst_flag(SPLICE_OK)					\
> > -	svc_rqst_flag(BUSY)						\
> >  	svc_rqst_flag_end(DATA)
> >  
> >  #undef svc_rqst_flag
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index ef350f0d8925..d0278e5190ba 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> >  		spin_lock_init(&pool->sp_lock);
> >  		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > +		/* All threads initially marked "busy" */
> > +		pool->sp_idle_map =
> > +			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > +					   svc_pool_map_get_node(i));
> > +		if (!pool->sp_idle_map)
> > +			return NULL;
> >  
> >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> >  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> >  		percpu_counter_destroy(&pool->sp_threads_starved);
> >  
> >  		xa_destroy(&pool->sp_thread_xa);
> > +		bitmap_free(pool->sp_idle_map);
> > +		pool->sp_idle_map = NULL;
> >  	}
> >  	kfree(serv->sv_pools);
> >  	kfree(serv);
> > @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  
> >  	folio_batch_init(&rqstp->rq_fbatch);
> >  
> > -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
> >  	rqstp->rq_server = serv;
> >  	rqstp->rq_pool = pool;
> >  
> > @@ -677,7 +684,7 @@ static struct svc_rqst *
> >  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  {
> >  	static const struct xa_limit limit = {
> > -		.max = U32_MAX,
> > +		.max = RPCSVC_MAXPOOLTHREADS,
> >  	};
> >  	struct svc_rqst	*rqstp;
> >  	int ret;
> > @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> >  					   struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > -	unsigned long index;
> > +	unsigned long bit;
> >  
> > -	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > +	/* Check the pool's idle bitmap locklessly so that multiple
> > +	 * idle searches can proceed concurrently.
> > +	 */
> > +	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> > +		if (!test_and_clear_bit(bit, pool->sp_idle_map))
> >  			continue;
> 
> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> and so get all the required memory barriers.
> What we are doing here is locking a particular thread for a task, so
> "lock" is an appropriate description of what is happening.
> See also svc_pool_thread_mark_* below.
> 
> >  
> > +		rqstp = xa_load(&pool->sp_thread_xa, bit);
> > +		if (!rqstp)
> > +			break;
> > +
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >  		wake_up_process(rqstp->rq_task);
> >  		percpu_counter_inc(&pool->sp_threads_woken);
> > @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> >  	}
> >  
> >  found_pool:
> > -	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > +	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> > +			XA_PRESENT);
> >  	if (rqstp) {
> >  		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> >  		task = rqstp->rq_task;
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 7709120b45c1..2844b32c16ea 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  	return true;
> >  }
> >  
> > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > +				      struct svc_rqst *rqstp)
> > +{
> > +	smp_mb__before_atomic();
> > +	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > +	smp_mb__after_atomic();
> > +}
> 
> The memory barriers above and below bother me.  There is no comment
> telling me what they are protecting against.
> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> were
> 
>     clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> 
> and  that svc_pool_thread_mark_busy were
> 
>    test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> 
> Then it would be more obvious what was happening.

Not obvious to me, but that's very likely because I'm not clear what
clear_bit_unlock() does. :-)

I'll try this change for the next version of the series.


> Thanks,
> NeilBrown
> 
> > +
> > +/*
> > + * Note: If we were awoken, then this rqstp has already been marked busy.
> > + */
> > +static void svc_pool_thread_mark_busy(struct svc_pool *pool,
> > +				      struct svc_rqst *rqstp)
> > +{
> > +	smp_mb__before_atomic();
> > +	clear_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > +	smp_mb__after_atomic();
> > +}
> > +
> >  static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  {
> >  	struct svc_pool		*pool = rqstp->rq_pool;
> > @@ -756,18 +775,17 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  	set_current_state(TASK_INTERRUPTIBLE);
> >  	smp_mb__before_atomic();
> >  	clear_bit(SP_CONGESTED, &pool->sp_flags);
> > -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > -	smp_mb__after_atomic();
> >  
> > -	if (likely(rqst_should_sleep(rqstp)))
> > +	if (likely(rqst_should_sleep(rqstp))) {
> > +		svc_pool_thread_mark_idle(pool, rqstp);
> >  		time_left = schedule_timeout(timeout);
> > -	else
> > +	} else
> >  		__set_current_state(TASK_RUNNING);
> >  
> >  	try_to_freeze();
> >  
> > -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> > -	smp_mb__after_atomic();
> > +	svc_pool_thread_mark_busy(pool, rqstp);
> > +
> >  	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> >  	if (rqstp->rq_xprt) {
> >  		trace_svc_pool_awoken(rqstp);
> > 
> > 
> > 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04  2:02     ` Chuck Lever
@ 2023-07-04  2:17       ` NeilBrown
  2023-07-04 15:14         ` Chuck Lever III
  2023-07-04 16:03       ` Chuck Lever III
  1 sibling, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-04  2:17 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> > On Tue, 04 Jul 2023, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > I've noticed that client-observed server request latency goes up
> > > simply when the nfsd thread count is increased.
> > > 
> > > List walking is known to be memory-inefficient. On a busy server
> > > with many threads, enqueuing a transport will walk the "all threads"
> > > list quite frequently. This also pulls in the cache lines for some
> > > hot fields in each svc_rqst (namely, rq_flags).
> > 
> > I think this text could usefully be re-written.  By this point in the
> > series we aren't list walking.
> > 
> > I'd also be curious to know what latency difference you get for just this
> > change.
> 
> Not much of a latency difference at lower thread counts.
> 
> The difference I notice is that with the spinlock version of
> pool_wake_idle_thread, there is significant lock contention as
> the thread count increases, and the throughput result of my fio
> test is lower (outside the result variance).
> 
> 
> > > The svc_xprt_enqueue() call that concerns me most is the one in
> > > svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> > > down completion handling limits the total throughput per RDMA
> > > connection.
> > > 
> > > So, avoid walking the "all threads" list to find an idle thread to
> > > wake. Instead, set up an idle bitmap and use find_next_bit, which
> > > should work the same way as RQ_BUSY but it will touch only the
> > > cachelines that the bitmap is in. Stick with atomic bit operations
> > > to avoid taking the pool lock.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  include/linux/sunrpc/svc.h    |    6 ++++--
> > >  include/trace/events/sunrpc.h |    1 -
> > >  net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
> > >  net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
> > >  4 files changed, 49 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index 6f8bfcd44250..27ffcf7371d0 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -35,6 +35,7 @@ struct svc_pool {
> > >  	spinlock_t		sp_lock;	/* protects sp_sockets */
> > >  	struct list_head	sp_sockets;	/* pending sockets */
> > >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> > > +	unsigned long		*sp_idle_map;	/* idle threads */
> > >  	struct xarray		sp_thread_xa;
> > >  
> > >  	/* statistics on pool operation */
> > > @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > >  #define RPCSVC_MAXPAGES		((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> > >  				+ 2 + 1)
> > >  
> > > +#define RPCSVC_MAXPOOLTHREADS	(4096)
> > > +
> > >  /*
> > >   * The context of a single thread, including the request currently being
> > >   * processed.
> > > @@ -239,8 +242,7 @@ struct svc_rqst {
> > >  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> > >  						 * to prevent encrypting page
> > >  						 * cache pages */
> > > -#define	RQ_BUSY		(5)			/* request is busy */
> > > -#define	RQ_DATA		(6)			/* request has data */
> > > +#define	RQ_DATA		(5)			/* request has data */
> > 
> > Might this be a good opportunity to convert this to an enum ??
> > 
> > >  	unsigned long		rq_flags;	/* flags field */
> > >  	u32			rq_thread_id;	/* xarray index */
> > >  	ktime_t			rq_qtime;	/* enqueue time */
> > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > index ea43c6059bdb..c07824a254bf 100644
> > > --- a/include/trace/events/sunrpc.h
> > > +++ b/include/trace/events/sunrpc.h
> > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > >  	svc_rqst_flag(USEDEFERRAL)					\
> > >  	svc_rqst_flag(DROPME)						\
> > >  	svc_rqst_flag(SPLICE_OK)					\
> > > -	svc_rqst_flag(BUSY)						\
> > >  	svc_rqst_flag_end(DATA)
> > >  
> > >  #undef svc_rqst_flag
> > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > index ef350f0d8925..d0278e5190ba 100644
> > > --- a/net/sunrpc/svc.c
> > > +++ b/net/sunrpc/svc.c
> > > @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > >  		INIT_LIST_HEAD(&pool->sp_sockets);
> > >  		spin_lock_init(&pool->sp_lock);
> > >  		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > > +		/* All threads initially marked "busy" */
> > > +		pool->sp_idle_map =
> > > +			bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> > > +					   svc_pool_map_get_node(i));
> > > +		if (!pool->sp_idle_map)
> > > +			return NULL;
> > >  
> > >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > >  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> > >  		percpu_counter_destroy(&pool->sp_threads_starved);
> > >  
> > >  		xa_destroy(&pool->sp_thread_xa);
> > > +		bitmap_free(pool->sp_idle_map);
> > > +		pool->sp_idle_map = NULL;
> > >  	}
> > >  	kfree(serv->sv_pools);
> > >  	kfree(serv);
> > > @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> > >  
> > >  	folio_batch_init(&rqstp->rq_fbatch);
> > >  
> > > -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
> > >  	rqstp->rq_server = serv;
> > >  	rqstp->rq_pool = pool;
> > >  
> > > @@ -677,7 +684,7 @@ static struct svc_rqst *
> > >  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > >  {
> > >  	static const struct xa_limit limit = {
> > > -		.max = U32_MAX,
> > > +		.max = RPCSVC_MAXPOOLTHREADS,
> > >  	};
> > >  	struct svc_rqst	*rqstp;
> > >  	int ret;
> > > @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > >  					   struct svc_pool *pool)
> > >  {
> > >  	struct svc_rqst	*rqstp;
> > > -	unsigned long index;
> > > +	unsigned long bit;
> > >  
> > > -	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > > -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > > +	/* Check the pool's idle bitmap locklessly so that multiple
> > > +	 * idle searches can proceed concurrently.
> > > +	 */
> > > +	for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> > > +		if (!test_and_clear_bit(bit, pool->sp_idle_map))
> > >  			continue;
> > 
> > I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> > Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> > and so get all the required memory barriers.
> > What we are doing here is locking a particular thread for a task, so
> > "lock" is an appropriate description of what is happening.
> > See also svc_pool_thread_mark_* below.
> > 
> > >  
> > > +		rqstp = xa_load(&pool->sp_thread_xa, bit);
> > > +		if (!rqstp)
> > > +			break;
> > > +
> > >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > >  		wake_up_process(rqstp->rq_task);
> > >  		percpu_counter_inc(&pool->sp_threads_woken);
> > > @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> > >  	}
> > >  
> > >  found_pool:
> > > -	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > > +	rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> > > +			XA_PRESENT);
> > >  	if (rqstp) {
> > >  		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > >  		task = rqstp->rq_task;
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 7709120b45c1..2844b32c16ea 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> > >  	return true;
> > >  }
> > >  
> > > +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> > > +				      struct svc_rqst *rqstp)
> > > +{
> > > +	smp_mb__before_atomic();
> > > +	set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> > > +	smp_mb__after_atomic();
> > > +}
> > 
> > There memory barriers above and below bother me.  There is no comment
> > telling me what they are protecting against.
> > I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> > were
> > 
> >     clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> > 
> > and  that svc_pool_thread_mark_busy were
> > 
> >    test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> > 
> > Then it would be more obvious what was happening.
> 
> Not obvious to me, but that's very likely because I'm not clear what
> clear_bit_unlock() does. :-)

In general, any "lock" operation (mutex, spin, whatever) is (and must
be) an "acquire" type operation, which imposes a memory barrier so that
read requests *after* the lock cannot be satisfied with data from
*before* the lock.  The read must access data after the lock.
Conversely, any "unlock" operation is a "release" type operation, which
imposes a memory barrier so that any write request *before* the unlock
must not be delayed until *after* the unlock.  The write must complete
before the unlock.

This is exactly what you would expect of locking - it creates a closed
code region that is properly ordered w.r.t comparable closed regions.

test_and_set_bit_lock() and clear_bit_unlock() provide these expected
semantics for bit operations.

New code should (almost?) never have explicit memory barriers like
smp_mb__after_atomic().
It should use one of the many APIs with _acquire or _release suffixes,
or with the more explicit _lock or _unlock.
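
To make that concrete, the two helpers would collapse to something like
the sketch below (untested, and "sp_busy_map" plus the helper names are
only my suggestion, not anything in your series):

    /* Claim (lock) this thread.  The _lock variant gives acquire
     * semantics, so nothing we read after claiming the thread can be
     * stale.  Returns true if the thread was idle and is now ours.
     */
    static bool svc_pool_thread_mark_busy(struct svc_pool *pool,
                                          struct svc_rqst *rqstp)
    {
            return !test_and_set_bit_lock(rqstp->rq_thread_id,
                                          pool->sp_busy_map);
    }

    /* Release (unlock) this thread back to the idle set.  The _unlock
     * variant gives release semantics, so all of the thread's earlier
     * stores are visible before the bit reads as clear to any waker.
     */
    static void svc_pool_thread_mark_idle(struct svc_pool *pool,
                                          struct svc_rqst *rqstp)
    {
            clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
    }

No explicit smp_mb__*() calls are needed at all.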

> 
> I'll try this change for the next version of the series.
> 

Thanks.

NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool
  2023-07-04  1:47     ` Chuck Lever
@ 2023-07-04  2:30       ` NeilBrown
  2023-07-04 15:04         ` Chuck Lever III
  2023-07-10  0:41       ` [PATCH/RFC] sunrpc: constant-time code to wake idle thread NeilBrown
  1 sibling, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-04  2:30 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> On Tue, Jul 04, 2023 at 10:45:20AM +1000, NeilBrown wrote:
> > On Tue, 04 Jul 2023, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > To get a sense of the average number of transport enqueue operations
> > > needed to process an incoming RPC message, re-use the "packets" pool
> > > stat. Track the number of complete RPC messages processed by each
> > > thread pool.
> > 
> > If I understand this correctly, then I would say it differently.
> > 
> > I think there are either zero or one transport enqueue operations for
> > each incoming RPC message.  Is that correct?  So the average would be in
> > (0,1].
> > Wouldn't it be more natural to talk about the average number of incoming
> > RPC messages processed per enqueue operation?  This would be typically
> > be around 1 on a lightly loaded server and would climb up as things get
> > busy. 
> > 
> > Was there a reason you wrote it the way around that you did?
> 
> Yes: more than one enqueue is done per incoming RPC. For example,
> svc_data_ready() enqueues, and so does svc_xprt_receive().
> 
> If the RPC requires more than one call to ->recvfrom() to complete
> the receive operation, each one of those calls does an enqueue.
> 

Ahhhh - that makes sense.  Thanks.
So its really that a number of transport enqueue operations are needed
to *receive* the message.  Once it is received, it is then processed
with no more enqueuing.

I was partly thrown by the fact that the series is mostly about the
queue of threads, but this is about the queue of transports.
I guess the more times a transport is queued, the more times a thread
needs to be woken?

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray
  2023-07-04  1:52     ` Chuck Lever
@ 2023-07-04  2:32       ` NeilBrown
  0 siblings, 0 replies; 41+ messages in thread
From: NeilBrown @ 2023-07-04  2:32 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 04 Jul 2023, Chuck Lever wrote:
> On Tue, Jul 04, 2023 at 11:11:57AM +1000, NeilBrown wrote:
> > On Tue, 04 Jul 2023, Chuck Lever wrote:
> > > From: Chuck Lever <chuck.lever@oracle.com>
> > > 
> > > We want a thread lookup operation that can be done with RCU only,
> > > but also we want to avoid the linked-list walk, which does not scale
> > > well in the number of pool threads.
> > > 
> > > BH-disabled locking is no longer necessary because we're no longer
> > > sharing the pool's sp_lock to protect either the xarray or the
> > > pool's thread count. sp_lock also protects transport activity. There
> > > are no callers of svc_set_num_threads() that run outside of process
> > > context.
> > > 
> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > > ---
> > >  fs/nfsd/nfssvc.c              |    3 +-
> > >  include/linux/sunrpc/svc.h    |   11 +++----
> > >  include/trace/events/sunrpc.h |   47 ++++++++++++++++++++++++++++-
> > >  net/sunrpc/svc.c              |   67 ++++++++++++++++++++++++-----------------
> > >  net/sunrpc/svc_xprt.c         |    2 +
> > >  5 files changed, 93 insertions(+), 37 deletions(-)
> > > 
> > > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > > index 2154fa63c5f2..d42b2a40c93c 100644
> > > --- a/fs/nfsd/nfssvc.c
> > > +++ b/fs/nfsd/nfssvc.c
> > > @@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
> > >   * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> > >   * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> > >   * nn->keep_active is set).  That number of nfsd threads must
> > > - * exist and each must be listed in ->sp_all_threads in some entry of
> > > - * ->sv_pools[].
> > > + * exist and each must be listed in some entry of ->sv_pools[].
> > >   *
> > >   * Each active thread holds a counted reference on nn->nfsd_serv, as does
> > >   * the nn->keep_active flag and various transient calls to svc_get().
> > > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > > index 74ea13270679..6f8bfcd44250 100644
> > > --- a/include/linux/sunrpc/svc.h
> > > +++ b/include/linux/sunrpc/svc.h
> > > @@ -32,10 +32,10 @@
> > >   */
> > >  struct svc_pool {
> > >  	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
> > > -	spinlock_t		sp_lock;	/* protects all fields */
> > > +	spinlock_t		sp_lock;	/* protects sp_sockets */
> > >  	struct list_head	sp_sockets;	/* pending sockets */
> > >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> > > -	struct list_head	sp_all_threads;	/* all server threads */
> > > +	struct xarray		sp_thread_xa;
> > >  
> > >  	/* statistics on pool operation */
> > >  	struct percpu_counter	sp_messages_arrived;
> > > @@ -195,7 +195,6 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> > >   * processed.
> > >   */
> > >  struct svc_rqst {
> > > -	struct list_head	rq_all;		/* all threads list */
> > >  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
> > >  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> > >  
> > > @@ -240,10 +239,10 @@ struct svc_rqst {
> > >  #define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> > >  						 * to prevent encrypting page
> > >  						 * cache pages */
> > > -#define	RQ_VICTIM	(5)			/* about to be shut down */
> > > -#define	RQ_BUSY		(6)			/* request is busy */
> > > -#define	RQ_DATA		(7)			/* request has data */
> > > +#define	RQ_BUSY		(5)			/* request is busy */
> > > +#define	RQ_DATA		(6)			/* request has data */
> > >  	unsigned long		rq_flags;	/* flags field */
> > > +	u32			rq_thread_id;	/* xarray index */
> > >  	ktime_t			rq_qtime;	/* enqueue time */
> > >  
> > >  	void *			rq_argp;	/* decoded arguments */
> > > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > > index 60c8e03268d4..ea43c6059bdb 100644
> > > --- a/include/trace/events/sunrpc.h
> > > +++ b/include/trace/events/sunrpc.h
> > > @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> > >  	svc_rqst_flag(USEDEFERRAL)					\
> > >  	svc_rqst_flag(DROPME)						\
> > >  	svc_rqst_flag(SPLICE_OK)					\
> > > -	svc_rqst_flag(VICTIM)						\
> > >  	svc_rqst_flag(BUSY)						\
> > >  	svc_rqst_flag_end(DATA)
> > >  
> > > @@ -2118,6 +2117,52 @@ TRACE_EVENT(svc_pool_starved,
> > >  	)
> > >  );
> > >  
> > > +DECLARE_EVENT_CLASS(svc_thread_lifetime_class,
> > > +	TP_PROTO(
> > > +		const struct svc_serv *serv,
> > > +		const struct svc_pool *pool,
> > > +		const struct svc_rqst *rqstp
> > > +	),
> > > +
> > > +	TP_ARGS(serv, pool, rqstp),
> > > +
> > > +	TP_STRUCT__entry(
> > > +		__string(name, serv->sv_name)
> > > +		__field(int, pool_id)
> > > +		__field(unsigned int, nrthreads)
> > > +		__field(unsigned long, pool_flags)
> > > +		__field(u32, thread_id)
> > > +		__field(const void *, rqstp)
> > > +	),
> > > +
> > > +	TP_fast_assign(
> > > +		__assign_str(name, serv->sv_name);
> > > +		__entry->pool_id = pool->sp_id;
> > > +		__entry->nrthreads = pool->sp_nrthreads;
> > > +		__entry->pool_flags = pool->sp_flags;
> > > +		__entry->thread_id = rqstp->rq_thread_id;
> > > +		__entry->rqstp = rqstp;
> > > +	),
> > > +
> > > +	TP_printk("service=%s pool=%d pool_flags=%s nrthreads=%u thread_id=%u",
> > > +		__get_str(name), __entry->pool_id,
> > > +		show_svc_pool_flags(__entry->pool_flags),
> > > +		__entry->nrthreads, __entry->thread_id
> > > +	)
> > > +);
> > > +
> > > +#define DEFINE_SVC_THREAD_LIFETIME_EVENT(name) \
> > > +	DEFINE_EVENT(svc_thread_lifetime_class, svc_pool_##name, \
> > > +			TP_PROTO( \
> > > +				const struct svc_serv *serv, \
> > > +				const struct svc_pool *pool, \
> > > +				const struct svc_rqst *rqstp \
> > > +			), \
> > > +			TP_ARGS(serv, pool, rqstp))
> > > +
> > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_init);
> > > +DEFINE_SVC_THREAD_LIFETIME_EVENT(thread_exit);
> > > +
> > >  DECLARE_EVENT_CLASS(svc_xprt_event,
> > >  	TP_PROTO(
> > >  		const struct svc_xprt *xprt
> > > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > > index 9b6701a38e71..ef350f0d8925 100644
> > > --- a/net/sunrpc/svc.c
> > > +++ b/net/sunrpc/svc.c
> > > @@ -507,8 +507,8 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> > >  
> > >  		pool->sp_id = i;
> > >  		INIT_LIST_HEAD(&pool->sp_sockets);
> > > -		INIT_LIST_HEAD(&pool->sp_all_threads);
> > >  		spin_lock_init(&pool->sp_lock);
> > > +		xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> > >  
> > >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > >  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > > @@ -594,6 +594,8 @@ svc_destroy(struct kref *ref)
> > >  		percpu_counter_destroy(&pool->sp_threads_woken);
> > >  		percpu_counter_destroy(&pool->sp_threads_timedout);
> > >  		percpu_counter_destroy(&pool->sp_threads_starved);
> > > +
> > > +		xa_destroy(&pool->sp_thread_xa);
> > >  	}
> > >  	kfree(serv->sv_pools);
> > >  	kfree(serv);
> > > @@ -674,7 +676,11 @@ EXPORT_SYMBOL_GPL(svc_rqst_alloc);
> > >  static struct svc_rqst *
> > >  svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > >  {
> > > +	static const struct xa_limit limit = {
> > > +		.max = U32_MAX,
> > > +	};
> > >  	struct svc_rqst	*rqstp;
> > > +	int ret;
> > >  
> > >  	rqstp = svc_rqst_alloc(serv, pool, node);
> > >  	if (!rqstp)
> > > @@ -685,15 +691,25 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> > >  	serv->sv_nrthreads += 1;
> > >  	spin_unlock_bh(&serv->sv_lock);
> > >  
> > > -	spin_lock_bh(&pool->sp_lock);
> > > +	xa_lock(&pool->sp_thread_xa);
> > > +	ret = __xa_alloc(&pool->sp_thread_xa, &rqstp->rq_thread_id, rqstp,
> > > +			 limit, GFP_KERNEL);
> > > +	if (ret) {
> > > +		xa_unlock(&pool->sp_thread_xa);
> > > +		goto out_free;
> > > +	}
> > >  	pool->sp_nrthreads++;
> > > -	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> > > -	spin_unlock_bh(&pool->sp_lock);
> > > +	xa_unlock(&pool->sp_thread_xa);
> > > +	trace_svc_pool_thread_init(serv, pool, rqstp);
> > >  	return rqstp;
> > > +
> > > +out_free:
> > > +	svc_rqst_free(rqstp);
> > > +	return ERR_PTR(ret);
> > >  }
> > >  
> > >  /**
> > > - * svc_pool_wake_idle_thread - wake an idle thread in @pool
> > > + * svc_pool_wake_idle_thread - Find and wake an idle thread in @pool
> > >   * @serv: RPC service
> > >   * @pool: service thread pool
> > >   *
> > > @@ -706,19 +722,17 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > >  					   struct svc_pool *pool)
> > >  {
> > >  	struct svc_rqst	*rqstp;
> > > +	unsigned long index;
> > >  
> > > -	rcu_read_lock();
> > > -	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> > > +	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> > >  		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > >  			continue;
> > >  
> > > -		rcu_read_unlock();
> > >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > >  		wake_up_process(rqstp->rq_task);
> > >  		percpu_counter_inc(&pool->sp_threads_woken);
> > >  		return rqstp;
> > >  	}
> > > -	rcu_read_unlock();
> > >  
> > >  	trace_svc_pool_starved(serv, pool);
> > >  	percpu_counter_inc(&pool->sp_threads_starved);
> > > @@ -734,32 +748,31 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > >  static struct task_struct *
> > >  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> > >  {
> > > -	unsigned int i;
> > >  	struct task_struct *task = NULL;
> > > +	struct svc_rqst *rqstp;
> > > +	unsigned long zero = 0;
> > > +	unsigned int i;
> > >  
> > >  	if (pool != NULL) {
> > > -		spin_lock_bh(&pool->sp_lock);
> > > +		xa_lock(&pool->sp_thread_xa);
> > >  	} else {
> > >  		for (i = 0; i < serv->sv_nrpools; i++) {
> > >  			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> > > -			spin_lock_bh(&pool->sp_lock);
> > > -			if (!list_empty(&pool->sp_all_threads))
> > > +			xa_lock(&pool->sp_thread_xa);
> > > +			if (!xa_empty(&pool->sp_thread_xa))
> > >  				goto found_pool;
> > > -			spin_unlock_bh(&pool->sp_lock);
> > > +			xa_unlock(&pool->sp_thread_xa);
> > >  		}
> > >  		return NULL;
> > >  	}
> > >  
> > >  found_pool:
> > > -	if (!list_empty(&pool->sp_all_threads)) {
> > > -		struct svc_rqst *rqstp;
> > > -
> > > -		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> > > -		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > > -		list_del_rcu(&rqstp->rq_all);
> > > +	rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> > > +	if (rqstp) {
> > > +		__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > 
> > This bothers me.  We always delete the earliest thread in the xarray.
> > So if we create 128 threads, then reduce the number to 64, the remaining
> > threads will be numbers 64 to 127.
> > This means searching in the bitmap will be (slightly) slower than
> > necessary.
> > 
> > xa doesn't have a "find last" interface, but we "know" how many entries
> > are in the array - or we would if we decremented the counter when we
> > removed an entry.
> > Currently we only decrement the counter when the thread exits - at which
> > time it is removed the entry - which may already have been removed.
> > If we change that code to check if it is still present in the xarray and
> > to only erase/decrement if it is, then we can decrement the counter here
> > and always reliably be able to find the "last" entry.
> > 
> > ... though I think we wait for a thread to exit before finding the next
> >     victim, so maybe all we need to do is start the xa_find from
> >     ->sp_nrthreads-1 rather than from zero ??
> > 
> > Is it worth it?  I don't know.  But it bothers me.
> 
> Well it would be straightforward to change the "pool_wake_idle_thread"
> search into a find_next_bit over a range. Store the lowest thread_id
> in the svc_pool and use that as the starting point for the loop.

Then if you add another 32 threads they will go at the front, so you
have 32 then a gap of 32, then 64 threads in the first 128 slots.

It's obviously not a big deal, but it just feels more tidy to keep them
dense at the start of the array.

Thanks,
NeilBrown


> 
> 
> > NeilBrown
> > 
> > 
> > >  		task = rqstp->rq_task;
> > >  	}
> > > -	spin_unlock_bh(&pool->sp_lock);
> > > +	xa_unlock(&pool->sp_thread_xa);
> > >  	return task;
> > >  }
> > >  
> > > @@ -841,9 +854,9 @@ svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> > >  	if (pool == NULL) {
> > >  		nrservs -= serv->sv_nrthreads;
> > >  	} else {
> > > -		spin_lock_bh(&pool->sp_lock);
> > > +		xa_lock(&pool->sp_thread_xa);
> > >  		nrservs -= pool->sp_nrthreads;
> > > -		spin_unlock_bh(&pool->sp_lock);
> > > +		xa_unlock(&pool->sp_thread_xa);
> > >  	}
> > >  
> > >  	if (nrservs > 0)
> > > @@ -930,11 +943,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
> > >  	struct svc_serv	*serv = rqstp->rq_server;
> > >  	struct svc_pool	*pool = rqstp->rq_pool;
> > >  
> > > -	spin_lock_bh(&pool->sp_lock);
> > > +	xa_lock(&pool->sp_thread_xa);
> > >  	pool->sp_nrthreads--;
> > > -	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> > > -		list_del_rcu(&rqstp->rq_all);
> > > -	spin_unlock_bh(&pool->sp_lock);
> > > +	__xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> > > +	xa_unlock(&pool->sp_thread_xa);
> > > +	trace_svc_pool_thread_exit(serv, pool, rqstp);
> > >  
> > >  	spin_lock_bh(&serv->sv_lock);
> > >  	serv->sv_nrthreads -= 1;
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 8ced7591ce07..7709120b45c1 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -46,7 +46,7 @@ static LIST_HEAD(svc_xprt_class_list);
> > >  
> > >  /* SMP locking strategy:
> > >   *
> > > - *	svc_pool->sp_lock protects most of the fields of that pool.
> > > + *	svc_pool->sp_lock protects sp_sockets.
> > >   *	svc_serv->sv_lock protects sv_tempsocks, sv_permsocks, sv_tmpcnt.
> > >   *	when both need to be taken (rare), svc_serv->sv_lock is first.
> > >   *	The "service mutex" protects svc_serv->sv_nrthread.
> > > 
> > > 
> > > 
> > 
> 


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool
  2023-07-04  2:30       ` NeilBrown
@ 2023-07-04 15:04         ` Chuck Lever III
  0 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever III @ 2023-07-04 15:04 UTC (permalink / raw)
  To: Neil Brown
  Cc: Chuck Lever, Linux NFS Mailing List, Lorenzo Bianconi,
	Jeff Layton, david



> On Jul 3, 2023, at 10:30 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Tue, 04 Jul 2023, Chuck Lever wrote:
>> On Tue, Jul 04, 2023 at 10:45:20AM +1000, NeilBrown wrote:
>>> On Tue, 04 Jul 2023, Chuck Lever wrote:
>>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>> 
>>>> To get a sense of the average number of transport enqueue operations
>>>> needed to process an incoming RPC message, re-use the "packets" pool
>>>> stat. Track the number of complete RPC messages processed by each
>>>> thread pool.
>>> 
>>> If I understand this correctly, then I would say it differently.
>>> 
>>> I think there are either zero or one transport enqueue operations for
>>> each incoming RPC message.  Is that correct?  So the average would be in
>>> (0,1].
>>> Wouldn't it be more natural to talk about the average number of incoming
>>> RPC messages processed per enqueue operation?  This would be typically
>>> be around 1 on a lightly loaded server and would climb up as things get
>>> busy. 
>>> 
>>> Was there a reason you wrote it the way around that you did?
>> 
>> Yes: more than one enqueue is done per incoming RPC. For example,
>> svc_data_ready() enqueues, and so does svc_xprt_receive().
>> 
>> If the RPC requires more than one call to ->recvfrom() to complete
>> the receive operation, each one of those calls does an enqueue.
>> 
> 
> Ahhhh - that makes sense.  Thanks.
> So it's really that a number of transport enqueue operations are needed
> to *receive* the message.  Once it is received, it is then processed
> with no more enqueuing.
> 
> I was partly thrown by the fact that the series is mostly about the
> queue of threads, but this is about the queue of transports.
> I guess the more times a transport is queued, the more times a thread
> needs to be woken?

Yes: this new metric is an indirect measure of the workload
on the new thread wake-up mechanism.


--
Chuck Lever



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04  2:17       ` NeilBrown
@ 2023-07-04 15:14         ` Chuck Lever III
  2023-07-05  0:34           ` NeilBrown
  0 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever III @ 2023-07-04 15:14 UTC (permalink / raw)
  To: Neil Brown
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, Jeff Layton, david



> On Jul 3, 2023, at 10:17 PM, NeilBrown <neilb@suse.de> wrote:
> 
> On Tue, 04 Jul 2023, Chuck Lever wrote:
>> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
>>> On Tue, 04 Jul 2023, Chuck Lever wrote:
>>>> From: Chuck Lever <chuck.lever@oracle.com>
>>>> 
>>>> I've noticed that client-observed server request latency goes up
>>>> simply when the nfsd thread count is increased.
>>>> 
>>>> List walking is known to be memory-inefficient. On a busy server
>>>> with many threads, enqueuing a transport will walk the "all threads"
>>>> list quite frequently. This also pulls in the cache lines for some
>>>> hot fields in each svc_rqst (namely, rq_flags).
>>> 
>>> I think this text could usefully be re-written.  By this point in the
>>> series we aren't list walking.
>>> 
>>> I'd also be curious to know what latency difference you get for just this
>>> change.
>> 
>> Not much of a latency difference at lower thread counts.
>> 
>> The difference I notice is that with the spinlock version of
>> pool_wake_idle_thread, there is significant lock contention as
>> the thread count increases, and the throughput result of my fio
>> test is lower (outside the result variance).
>> 
>> 
>>>> The svc_xprt_enqueue() call that concerns me most is the one in
>>>> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
>>>> down completion handling limits the total throughput per RDMA
>>>> connection.
>>>> 
>>>> So, avoid walking the "all threads" list to find an idle thread to
>>>> wake. Instead, set up an idle bitmap and use find_next_bit, which
>>>> should work the same way as RQ_BUSY but it will touch only the
>>>> cachelines that the bitmap is in. Stick with atomic bit operations
>>>> to avoid taking the pool lock.
>>>> 
>>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
>>>> ---
>>>> include/linux/sunrpc/svc.h    |    6 ++++--
>>>> include/trace/events/sunrpc.h |    1 -
>>>> net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
>>>> net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
>>>> 4 files changed, 49 insertions(+), 15 deletions(-)
>>>> 
>>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
>>>> index 6f8bfcd44250..27ffcf7371d0 100644
>>>> --- a/include/linux/sunrpc/svc.h
>>>> +++ b/include/linux/sunrpc/svc.h
>>>> @@ -35,6 +35,7 @@ struct svc_pool {
>>>> spinlock_t sp_lock; /* protects sp_sockets */
>>>> struct list_head sp_sockets; /* pending sockets */
>>>> unsigned int sp_nrthreads; /* # of threads in pool */
>>>> + unsigned long *sp_idle_map; /* idle threads */
>>>> struct xarray sp_thread_xa;
>>>> 
>>>> /* statistics on pool operation */
>>>> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>>>> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
>>>> + 2 + 1)
>>>> 
>>>> +#define RPCSVC_MAXPOOLTHREADS (4096)
>>>> +
>>>> /*
>>>>  * The context of a single thread, including the request currently being
>>>>  * processed.
>>>> @@ -239,8 +242,7 @@ struct svc_rqst {
>>>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
>>>> * to prevent encrypting page
>>>> * cache pages */
>>>> -#define RQ_BUSY (5) /* request is busy */
>>>> -#define RQ_DATA (6) /* request has data */
>>>> +#define RQ_DATA (5) /* request has data */
>>> 
>>> Might this be a good opportunity to convert this to an enum ??
>>> 
>>>> unsigned long rq_flags; /* flags field */
>>>> u32 rq_thread_id; /* xarray index */
>>>> ktime_t rq_qtime; /* enqueue time */
>>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
>>>> index ea43c6059bdb..c07824a254bf 100644
>>>> --- a/include/trace/events/sunrpc.h
>>>> +++ b/include/trace/events/sunrpc.h
>>>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>>>> svc_rqst_flag(USEDEFERRAL) \
>>>> svc_rqst_flag(DROPME) \
>>>> svc_rqst_flag(SPLICE_OK) \
>>>> - svc_rqst_flag(BUSY) \
>>>> svc_rqst_flag_end(DATA)
>>>> 
>>>> #undef svc_rqst_flag
>>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
>>>> index ef350f0d8925..d0278e5190ba 100644
>>>> --- a/net/sunrpc/svc.c
>>>> +++ b/net/sunrpc/svc.c
>>>> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>>>> INIT_LIST_HEAD(&pool->sp_sockets);
>>>> spin_lock_init(&pool->sp_lock);
>>>> xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
>>>> + /* All threads initially marked "busy" */
>>>> + pool->sp_idle_map =
>>>> + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
>>>> +   svc_pool_map_get_node(i));
>>>> + if (!pool->sp_idle_map)
>>>> + return NULL;
>>>> 
>>>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
>>>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
>>>> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
>>>> percpu_counter_destroy(&pool->sp_threads_starved);
>>>> 
>>>> xa_destroy(&pool->sp_thread_xa);
>>>> + bitmap_free(pool->sp_idle_map);
>>>> + pool->sp_idle_map = NULL;
>>>> }
>>>> kfree(serv->sv_pools);
>>>> kfree(serv);
>>>> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>> 
>>>> folio_batch_init(&rqstp->rq_fbatch);
>>>> 
>>>> - __set_bit(RQ_BUSY, &rqstp->rq_flags);
>>>> rqstp->rq_server = serv;
>>>> rqstp->rq_pool = pool;
>>>> 
>>>> @@ -677,7 +684,7 @@ static struct svc_rqst *
>>>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>>>> {
>>>> static const struct xa_limit limit = {
>>>> - .max = U32_MAX,
>>>> + .max = RPCSVC_MAXPOOLTHREADS,
>>>> };
>>>> struct svc_rqst *rqstp;
>>>> int ret;
>>>> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
>>>>   struct svc_pool *pool)
>>>> {
>>>> struct svc_rqst *rqstp;
>>>> - unsigned long index;
>>>> + unsigned long bit;
>>>> 
>>>> - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
>>>> - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
>>>> + /* Check the pool's idle bitmap locklessly so that multiple
>>>> + * idle searches can proceed concurrently.
>>>> + */
>>>> + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
>>>> + if (!test_and_clear_bit(bit, pool->sp_idle_map))
>>>> continue;
>>> 
>>> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
>>> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
>>> and so get all the required memory barriers.
>>> What we are doing here is locking a particular thread for a task, so
>>> "lock" is an appropriate description of what is happening.
>>> See also svc_pool_thread_mark_* below.
>>> 
>>>> 
>>>> + rqstp = xa_load(&pool->sp_thread_xa, bit);
>>>> + if (!rqstp)
>>>> + break;
>>>> +
>>>> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>>>> wake_up_process(rqstp->rq_task);
>>>> percpu_counter_inc(&pool->sp_threads_woken);
>>>> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
>>>> }
>>>> 
>>>> found_pool:
>>>> - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
>>>> + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
>>>> + XA_PRESENT);
>>>> if (rqstp) {
>>>> __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
>>>> task = rqstp->rq_task;
>>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>>>> index 7709120b45c1..2844b32c16ea 100644
>>>> --- a/net/sunrpc/svc_xprt.c
>>>> +++ b/net/sunrpc/svc_xprt.c
>>>> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>>>> return true;
>>>> }
>>>> 
>>>> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
>>>> +      struct svc_rqst *rqstp)
>>>> +{
>>>> + smp_mb__before_atomic();
>>>> + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
>>>> + smp_mb__after_atomic();
>>>> +}
>>> 
>>> There memory barriers above and below bother me.  There is no comment
>>> telling me what they are protecting against.
>>> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
>>> were
>>> 
>>>    clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
>>> 
>>> and  that svc_pool_thread_mark_busy were
>>> 
>>>   test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
>>> 
>>> Then it would be more obvious what was happening.
>> 
>> Not obvious to me, but that's very likely because I'm not clear what
>> clear_bit_unlock() does. :-)
> 
> In general, any "lock" operation (mutex, spin, whatever) is (and must
> be) an "acquire" type operation, which imposes a memory barrier so that
> read requests *after* the lock cannot be satisfied with data from
> *before* the lock.  The read must access data after the lock.
> Conversely, any "unlock" operation is a "release" type operation, which
> imposes a memory barrier so that any write request *before* the unlock
> must not be delayed until *after* the unlock.  The write must complete
> before the unlock.
> 
> This is exactly what you would expect of locking - it creates a closed
> code region that is properly ordered w.r.t comparable closed regions.
> 
> test_and_set_bit_lock() and clear_bit_unlock() provide these expected
> semantics for bit operations.

Your explanation is more clear than what I read in Documentation/atomic*
so thanks. I feel a little more armed to make good use of it.


> New code should (almost?) never have explicit memory barriers like
> smp_mb__after_atomic().
> It should use one of the many APIs with _acquire or _release suffixes,
> or with the more explicit _lock or _unlock.

Out of curiosity, is "should never have explicit memory barriers"
documented somewhere? I've been accused of skimming when I read, so
I might have missed it.


--
Chuck Lever



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04  2:02     ` Chuck Lever
  2023-07-04  2:17       ` NeilBrown
@ 2023-07-04 16:03       ` Chuck Lever III
  2023-07-05  0:35         ` NeilBrown
  1 sibling, 1 reply; 41+ messages in thread
From: Chuck Lever III @ 2023-07-04 16:03 UTC (permalink / raw)
  To: Neil Brown
  Cc: Chuck Lever, Linux NFS Mailing List, Lorenzo Bianconi,
	Jeff Layton, Dave Chinner


> On Jul 3, 2023, at 10:02 PM, Chuck Lever <cel@kernel.org> wrote:
> 
> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
>> On Tue, 04 Jul 2023, Chuck Lever wrote:
>>> From: Chuck Lever <chuck.lever@oracle.com>
>>> 
>>> I've noticed that client-observed server request latency goes up
>>> simply when the nfsd thread count is increased.
>>> 
>>> List walking is known to be memory-inefficient. On a busy server
>>> with many threads, enqueuing a transport will walk the "all threads"
>>> list quite frequently. This also pulls in the cache lines for some
>>> hot fields in each svc_rqst (namely, rq_flags).
>> 
>> I think this text could usefully be re-written.  By this point in the
>> series we aren't list walking.
>> 
>> I'd also be curious to know what latency difference you get for just this
>> change.
> 
> Not much of a latency difference at lower thread counts.
> 
> The difference I notice is that with the spinlock version of
> pool_wake_idle_thread, there is significant lock contention as
> the thread count increases, and the throughput result of my fio
> test is lower (outside the result variance).

I mis-spoke. When I wrote this yesterday I had compared only the
"xarray with bitmap" and the "xarray with spinlock" mechanisms.
I had not tried "xarray only".

Today, while testing review-related fixes, I benchmarked "xarray
only". It behaves like the linked-list implementation it replaces:
performance degrades with anything more than a couple dozen threads
in the pool.


--
Chuck Lever



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04 15:14         ` Chuck Lever III
@ 2023-07-05  0:34           ` NeilBrown
  0 siblings, 0 replies; 41+ messages in thread
From: NeilBrown @ 2023-07-05  0:34 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, Jeff Layton, david

On Wed, 05 Jul 2023, Chuck Lever III wrote:
> 
> > On Jul 3, 2023, at 10:17 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > On Tue, 04 Jul 2023, Chuck Lever wrote:
> >> On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> >>> On Tue, 04 Jul 2023, Chuck Lever wrote:
> >>>> From: Chuck Lever <chuck.lever@oracle.com>
> >>>> 
> >>>> I've noticed that client-observed server request latency goes up
> >>>> simply when the nfsd thread count is increased.
> >>>> 
> >>>> List walking is known to be memory-inefficient. On a busy server
> >>>> with many threads, enqueuing a transport will walk the "all threads"
> >>>> list quite frequently. This also pulls in the cache lines for some
> >>>> hot fields in each svc_rqst (namely, rq_flags).
> >>> 
> >>> I think this text could usefully be re-written.  By this point in the
> >>> series we aren't list walking.
> >>> 
> >>> I'd also be curious to know what latency difference you get for just this
> >>> change.
> >> 
> >> Not much of a latency difference at lower thread counts.
> >> 
> >> The difference I notice is that with the spinlock version of
> >> pool_wake_idle_thread, there is significant lock contention as
> >> the thread count increases, and the throughput result of my fio
> >> test is lower (outside the result variance).
> >> 
> >> 
> >>>> The svc_xprt_enqueue() call that concerns me most is the one in
> >>>> svc_rdma_wc_receive(), which is single-threaded per CQ. Slowing
> >>>> down completion handling limits the total throughput per RDMA
> >>>> connection.
> >>>> 
> >>>> So, avoid walking the "all threads" list to find an idle thread to
> >>>> wake. Instead, set up an idle bitmap and use find_next_bit, which
> >>>> should work the same way as RQ_BUSY but it will touch only the
> >>>> cachelines that the bitmap is in. Stick with atomic bit operations
> >>>> to avoid taking the pool lock.
> >>>> 
> >>>> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> >>>> ---
> >>>> include/linux/sunrpc/svc.h    |    6 ++++--
> >>>> include/trace/events/sunrpc.h |    1 -
> >>>> net/sunrpc/svc.c              |   27 +++++++++++++++++++++------
> >>>> net/sunrpc/svc_xprt.c         |   30 ++++++++++++++++++++++++------
> >>>> 4 files changed, 49 insertions(+), 15 deletions(-)
> >>>> 
> >>>> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> >>>> index 6f8bfcd44250..27ffcf7371d0 100644
> >>>> --- a/include/linux/sunrpc/svc.h
> >>>> +++ b/include/linux/sunrpc/svc.h
> >>>> @@ -35,6 +35,7 @@ struct svc_pool {
> >>>> spinlock_t sp_lock; /* protects sp_sockets */
> >>>> struct list_head sp_sockets; /* pending sockets */
> >>>> unsigned int sp_nrthreads; /* # of threads in pool */
> >>>> + unsigned long *sp_idle_map; /* idle threads */
> >>>> struct xarray sp_thread_xa;
> >>>> 
> >>>> /* statistics on pool operation */
> >>>> @@ -190,6 +191,8 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >>>> #define RPCSVC_MAXPAGES ((RPCSVC_MAXPAYLOAD+PAGE_SIZE-1)/PAGE_SIZE \
> >>>> + 2 + 1)
> >>>> 
> >>>> +#define RPCSVC_MAXPOOLTHREADS (4096)
> >>>> +
> >>>> /*
> >>>>  * The context of a single thread, including the request currently being
> >>>>  * processed.
> >>>> @@ -239,8 +242,7 @@ struct svc_rqst {
> >>>> #define RQ_SPLICE_OK (4) /* turned off in gss privacy
> >>>> * to prevent encrypting page
> >>>> * cache pages */
> >>>> -#define RQ_BUSY (5) /* request is busy */
> >>>> -#define RQ_DATA (6) /* request has data */
> >>>> +#define RQ_DATA (5) /* request has data */
> >>> 
> >>> Might this be a good opportunity to convert this to an enum ??
> >>> 
> >>>> unsigned long rq_flags; /* flags field */
> >>>> u32 rq_thread_id; /* xarray index */
> >>>> ktime_t rq_qtime; /* enqueue time */
> >>>> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> >>>> index ea43c6059bdb..c07824a254bf 100644
> >>>> --- a/include/trace/events/sunrpc.h
> >>>> +++ b/include/trace/events/sunrpc.h
> >>>> @@ -1676,7 +1676,6 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> >>>> svc_rqst_flag(USEDEFERRAL) \
> >>>> svc_rqst_flag(DROPME) \
> >>>> svc_rqst_flag(SPLICE_OK) \
> >>>> - svc_rqst_flag(BUSY) \
> >>>> svc_rqst_flag_end(DATA)
> >>>> 
> >>>> #undef svc_rqst_flag
> >>>> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> >>>> index ef350f0d8925..d0278e5190ba 100644
> >>>> --- a/net/sunrpc/svc.c
> >>>> +++ b/net/sunrpc/svc.c
> >>>> @@ -509,6 +509,12 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >>>> INIT_LIST_HEAD(&pool->sp_sockets);
> >>>> spin_lock_init(&pool->sp_lock);
> >>>> xa_init_flags(&pool->sp_thread_xa, XA_FLAGS_ALLOC);
> >>>> + /* All threads initially marked "busy" */
> >>>> + pool->sp_idle_map =
> >>>> + bitmap_zalloc_node(RPCSVC_MAXPOOLTHREADS, GFP_KERNEL,
> >>>> +   svc_pool_map_get_node(i));
> >>>> + if (!pool->sp_idle_map)
> >>>> + return NULL;
> >>>> 
> >>>> percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> >>>> percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> >>>> @@ -596,6 +602,8 @@ svc_destroy(struct kref *ref)
> >>>> percpu_counter_destroy(&pool->sp_threads_starved);
> >>>> 
> >>>> xa_destroy(&pool->sp_thread_xa);
> >>>> + bitmap_free(pool->sp_idle_map);
> >>>> + pool->sp_idle_map = NULL;
> >>>> }
> >>>> kfree(serv->sv_pools);
> >>>> kfree(serv);
> >>>> @@ -647,7 +655,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >>>> 
> >>>> folio_batch_init(&rqstp->rq_fbatch);
> >>>> 
> >>>> - __set_bit(RQ_BUSY, &rqstp->rq_flags);
> >>>> rqstp->rq_server = serv;
> >>>> rqstp->rq_pool = pool;
> >>>> 
> >>>> @@ -677,7 +684,7 @@ static struct svc_rqst *
> >>>> svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >>>> {
> >>>> static const struct xa_limit limit = {
> >>>> - .max = U32_MAX,
> >>>> + .max = RPCSVC_MAXPOOLTHREADS,
> >>>> };
> >>>> struct svc_rqst *rqstp;
> >>>> int ret;
> >>>> @@ -722,12 +729,19 @@ struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> >>>>   struct svc_pool *pool)
> >>>> {
> >>>> struct svc_rqst *rqstp;
> >>>> - unsigned long index;
> >>>> + unsigned long bit;
> >>>> 
> >>>> - xa_for_each(&pool->sp_thread_xa, index, rqstp) {
> >>>> - if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> >>>> + /* Check the pool's idle bitmap locklessly so that multiple
> >>>> + * idle searches can proceed concurrently.
> >>>> + */
> >>>> + for_each_set_bit(bit, pool->sp_idle_map, pool->sp_nrthreads) {
> >>>> + if (!test_and_clear_bit(bit, pool->sp_idle_map))
> >>>> continue;
> >>> 
> >>> I would really rather the map was "sp_busy_map". (initialised with bitmap_fill())
> >>> Then you could "test_and_set_bit_lock()" and later "clear_bit_unlock()"
> >>> and so get all the required memory barriers.
> >>> What we are doing here is locking a particular thread for a task, so
> >>> "lock" is an appropriate description of what is happening.
> >>> See also svc_pool_thread_mark_* below.
> >>> 
> >>>> 
> >>>> + rqstp = xa_load(&pool->sp_thread_xa, bit);
> >>>> + if (!rqstp)
> >>>> + break;
> >>>> +
> >>>> WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >>>> wake_up_process(rqstp->rq_task);
> >>>> percpu_counter_inc(&pool->sp_threads_woken);
> >>>> @@ -767,7 +781,8 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> >>>> }
> >>>> 
> >>>> found_pool:
> >>>> - rqstp = xa_find(&pool->sp_thread_xa, &zero, U32_MAX, XA_PRESENT);
> >>>> + rqstp = xa_find(&pool->sp_thread_xa, &zero, RPCSVC_MAXPOOLTHREADS,
> >>>> + XA_PRESENT);
> >>>> if (rqstp) {
> >>>> __xa_erase(&pool->sp_thread_xa, rqstp->rq_thread_id);
> >>>> task = rqstp->rq_task;
> >>>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> >>>> index 7709120b45c1..2844b32c16ea 100644
> >>>> --- a/net/sunrpc/svc_xprt.c
> >>>> +++ b/net/sunrpc/svc_xprt.c
> >>>> @@ -735,6 +735,25 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >>>> return true;
> >>>> }
> >>>> 
> >>>> +static void svc_pool_thread_mark_idle(struct svc_pool *pool,
> >>>> +      struct svc_rqst *rqstp)
> >>>> +{
> >>>> + smp_mb__before_atomic();
> >>>> + set_bit(rqstp->rq_thread_id, pool->sp_idle_map);
> >>>> + smp_mb__after_atomic();
> >>>> +}
> >>> 
> >>> There memory barriers above and below bother me.  There is no comment
> >>> telling me what they are protecting against.
> >>> I would rather svc_pool_thread_mark_idle - which unlocks the thread -
> >>> were
> >>> 
> >>>    clear_bit_unlock(rqstp->rq_thread_id, pool->sp_busy_map);
> >>> 
> >>> and  that svc_pool_thread_mark_busy were
> >>> 
> >>>   test_and_set_bit_lock(rqstp->rq_thread_id, pool->sp_busy_map);
> >>> 
> >>> Then it would be more obvious what was happening.
> >> 
> >> Not obvious to me, but that's very likely because I'm not clear what
> >> clear_bit_unlock() does. :-)
> > 
> > In general, any "lock" operation (mutex, spin, whatever) is (and must
> > be) an "acquire" type operation, which imposes a memory barrier so that
> > read requests *after* the lock cannot be satisfied with data from
> > *before* the lock.  The read must access data after the lock.
> > Conversely, any "unlock" operation is a "release" type operation, which
> > imposes a memory barrier so that any write request *before* the unlock
> > must not be delayed until *after* the unlock.  The write must complete
> > before the unlock.
> > 
> > This is exactly what you would expect of locking - it creates a closed
> > code region that is properly ordered w.r.t comparable closed regions.
> > 
> > test_and_set_bit_lock() and clear_bit_unlock() provide these expected
> > semantics for bit operations.
> 
> Your explanation is more clear than what I read in Documentation/atomic*
> so thanks. I feel a little more armed to make good use of it.
> 
> 
> > New code should (almost?) never have explicit memory barriers like
> > smp_mb__after_atomic().
> > It should use one of the many APIs with _acquire or _release suffixes,
> > or with the more explicit _lock or _unlock.
> 
> Out of curiosity, is "should never have explicit memory barriers"
> documented somewhere? I've been accused of skimming when I read, so
> I might have missed it.

My wife says I only read every second word of emails :-)

I don't know that it is documented anywhere (maybe I should submit a
patch).  The statement was really my personal rule that seems to be the
natural sequel to the introduction of the many _acquire and _release
interfaces.

NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap
  2023-07-04 16:03       ` Chuck Lever III
@ 2023-07-05  0:35         ` NeilBrown
  0 siblings, 0 replies; 41+ messages in thread
From: NeilBrown @ 2023-07-05  0:35 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Chuck Lever, Linux NFS Mailing List, Lorenzo Bianconi,
	Jeff Layton, Dave Chinner

On Wed, 05 Jul 2023, Chuck Lever III wrote:
> > On Jul 3, 2023, at 10:02 PM, Chuck Lever <cel@kernel.org> wrote:
> > 
> > On Tue, Jul 04, 2023 at 11:26:22AM +1000, NeilBrown wrote:
> >> On Tue, 04 Jul 2023, Chuck Lever wrote:
> >>> From: Chuck Lever <chuck.lever@oracle.com>
> >>> 
> >>> I've noticed that client-observed server request latency goes up
> >>> simply when the nfsd thread count is increased.
> >>> 
> >>> List walking is known to be memory-inefficient. On a busy server
> >>> with many threads, enqueuing a transport will walk the "all threads"
> >>> list quite frequently. This also pulls in the cache lines for some
> >>> hot fields in each svc_rqst (namely, rq_flags).
> >> 
> >> I think this text could usefully be re-written.  By this point in the
> >> series we aren't list walking.
> >> 
> >> I'd also be curious to know what latency difference you get for just this
> >> change.
> > 
> > Not much of a latency difference at lower thread counts.
> > 
> > The difference I notice is that with the spinlock version of
> > pool_wake_idle_thread, there is significant lock contention as
> > the thread count increases, and the throughput result of my fio
> > test is lower (outside the result variance).
> 
> I mis-spoke. When I wrote this yesterday I had compared only the
> "xarray with bitmap" and the "xarray with spinlock" mechanisms.
> I had not tried "xarray only".
> 
> Today, while testing review-related fixes, I benchmarked "xarray
> only". It behaves like the linked-list implementation it replaces:
> performance degrades with anything more than a couple dozen threads
> in the pool.

I'm a little surprised it is that bad, but only a little.
The above is good text to include in the justification of that last
patch.

Thanks for the clarification.
NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/9] SUNRPC service thread scheduler optimizations
  2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
                   ` (8 preceding siblings ...)
  2023-07-04  0:08 ` [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap Chuck Lever
@ 2023-07-05  1:08 ` NeilBrown
  2023-07-05  2:43   ` Chuck Lever
  9 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-05  1:08 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david


I've been pondering this scheduling mechanism in sunrpc/svc some more,
and I wonder if rather than optimising the search, we should eliminate
it.

Instead we could have a linked list of idle threads using llist.h

svc_xprt_enqueue() calls llist_del_first() and if the result is not NULL,
that thread is deemed busy (because it isn't on the list) and is woken.

choose_victim() could also use llist_del_first().  If nothing is there
it could set a flag which gets cleared by the next thread to go idle.
That thread exits ..  or something.  Some interlock would be needed but
it shouldn't be too hard.

svc_exit_thread would have difficulty removing itself from the idle
list, if it wasn't busy..  Possibly we could disallow that case (I think
sending a signal to a thread can make it exit while idle).
Alternately we could use llist_del_all() to clear the list, then wake
them all up so that they go back on the list if there is nothing to do
and if they aren't trying to exit.  That is fairly heavy handed, but
isn't a case that we need to optimise.
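
In outline it would be something like this (just a sketch - the field
names are invented, and it glosses over the llist.h rules about
concurrent del_first callers, which is the part that needs care):

    /* Thread going idle: push ourselves onto the pool's idle stack. */
    static void svc_thread_set_idle(struct svc_pool *pool,
                                    struct svc_rqst *rqstp)
    {
            llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
    }

    /* Work arrived: pop one idle thread and wake it - no searching. */
    static bool svc_pool_wake_idle(struct svc_pool *pool)
    {
            struct llist_node *ln;
            struct svc_rqst *rqstp;

            ln = llist_del_first(&pool->sp_idle_threads);
            if (!ln)
                    return false;   /* nobody idle - pool is starved */
            rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
            wake_up_process(rqstp->rq_task);
            return true;
    }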

If you think that might be worth pursuing, I could have a go at writing
the patch - probably on top of all the cleanups in your series before
the xarray is added.

I also wonder if we should avoid waking too many threads up at once.
If multiple events happen in quick succession, we currently wake up
multiple threads and if there is any scheduling delay (which is expected
based on Commit 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing"))
then by the time the threads wake up, there may no longer be work to do
as another thread might have gone idle and taken the work.

Instead we could have a limit on the number of threads waking up -
possibly 1 or 3.  If the counter is maxed out, don't do a wake up.
When a thread wakes up, it decrements the counter, dequeues some work,
and if there is more to do, then it queues another task.
I imagine the same basic protocol would be used for creating new threads
when load is high - start just one at a time, though maybe a new thread
would handle a first request before possibly starting another thread.
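
Roughly (sketch only - the counter field and the limit are invented):

    #define SP_MAX_PENDING_WAKEUPS  3       /* 1 may be worth trying too */

    /* Enqueue path: only wake another thread if fewer than the limit
     * are already on their way to look for work.
     */
    static bool svc_pool_may_wake(struct svc_pool *pool)
    {
            return atomic_add_unless(&pool->sp_pending_wakeups, 1,
                                     SP_MAX_PENDING_WAKEUPS);
    }

    /* Called by a thread as soon as it wakes: release our "pending"
     * slot.  The thread then dequeues work and can itself wake one
     * more thread if there is still more queued.
     */
    static void svc_thread_awoken(struct svc_pool *pool)
    {
            atomic_dec(&pool->sp_pending_wakeups);
    }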

But this is a stretch goal - not the main focus.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/9] SUNRPC service thread scheduler optimizations
  2023-07-05  1:08 ` [PATCH v2 0/9] SUNRPC service thread scheduler optimizations NeilBrown
@ 2023-07-05  2:43   ` Chuck Lever
  0 siblings, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-05  2:43 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Wed, Jul 05, 2023 at 11:08:32AM +1000, NeilBrown wrote:
> 
> I've been pondering this scheduling mechanism in sunrpc/svc some more,
> and I wonder if rather than optimising the search, we should eliminate
> it.
> 
> Instead we could have a linked list of idle threads using llist.h
> 
> svc_xprt_enqueue() calls llist_del_first() and if the result is not NULL,
> that thread is deemed busy (because it isn't on the list) and is woken.
> 
> choose_victim() could also use llist_del_first().  If nothing is there
> it could set a flag which gets cleared by the next thread to go idle.
> That thread exits ..  or something.  Some interlock would be needed but
> it shouldn't be too hard.
> 
> svc_exit_thread would have difficulty removing itself from the idle
> list, if it wasn't busy..  Possibly we could disallow that case (I think
> sending a signal to a thread can make it exit while idle).
> Alternately we could use llist_del_all() to clear the list, then wake
> them all up so that they go back on the list if there is nothing to do
> and if they aren't trying to exit.  That is fairly heavy handed, but
> isn't a case that we need to optimise.
> 
> If you think that might be worth pursuing, I could have a go at writing
> the patch - probably on top of all the cleanups in your series before
> the xarray is added.

The thread pool is effectively a cached resource, so it is a use case
that fits llist well. svcrdma uses llist in a few spots in that very
capacity.

If you think you can meet all of the criteria in the table at the top of
llist.h so thread scheduling works entirely without a lock, that might
be an interesting point of comparison.

My only concern is that the current set of candidate mechanisms manage
to use mostly the first thread, rather than round-robining through the
thread list. Using mostly one process tends to be more cache-friendly.
An llist-based thread scheduler should try to follow that behavior,
IMO.


> I also wonder if we should avoid waking too many threads up at once.
> If multiple events happen in quick succession, we currently wake up
> multiple threads and if there is any scheduling delay (which is expected
> based on Commit 22700f3c6df5 ("SUNRPC: Improve ordering of transport processing"))
> then by the time the threads wake up, there may no longer be work to do
> as another thread might have gone idle and taken the work.

It might be valuable to add some observability of wake-ups that find
nothing to do. I'll look into that.


> Instead we could have a limit on the number of threads waking up -
> possibly 1 or 3.  If the counter is maxed out, don't do a wake up.
> When a thread wakes up, it decrements the counter, dequeues some work,
> and if there is more to do, then it queues another task.

I consider reducing the wake-up rate as the next step for improving
RPC service thread scalability. Any experimentation in that area is
worth looking into.


> I imagine the same basic protocol would be used for creating new threads
> when load is high - start just one at a time, though maybe a new thread
> would handle a first request before possibly starting another thread.

I envision dynamically tuning the pool thread count as something
that should be managed in user space, since it's a policy rather than
a mechanism.

That could be a problem, though, if we wanted to shut down a few pool
threads based on shrinker activity.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-04  1:47     ` Chuck Lever
  2023-07-04  2:30       ` NeilBrown
@ 2023-07-10  0:41       ` NeilBrown
  2023-07-10 14:41         ` Chuck Lever
  2023-07-10 16:21         ` Chuck Lever
  1 sibling, 2 replies; 41+ messages in thread
From: NeilBrown @ 2023-07-10  0:41 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david


This patch shows an alternative approach to the recent patches which
improve the latency of waking an idle thread when work is ready.

The current approach involves searching a linked list for an idle thread
to wake.  The recent patches instead use a bitmap search to find the
idle thread.  With this patch no search is needed - an idle thread is
directly available without searching.

The idle threads are kept in an "llist" - there is no longer a list of
all threads.

The llist ADT does not allow concurrent delete_first operations, so to
wake an idle thread we simply wake it and do not remove it from the
list.
When the thread is scheduled it will remove itself - which is safe - and
will wake the next thread if there is more work to do (and if there is
another idle thread).

The "remove itself" requires and addition to the llist api.
"llist_del_first_this()" removes a given item if is it the first.
Multiple callers can call this concurrently as along as they each give a
different "this", so each thread can safely try to remove itself.  It
must be prepared for failure.
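
Roughly it is a conditional pop - something like the following (the code
in lib/llist.c in the diff below is the authoritative version; this is
just the idea):

    struct llist_node *llist_del_first_this(struct llist_head *head,
                                            struct llist_node *this)
    {
            struct llist_node *entry, *next;

            entry = smp_load_acquire(&head->first);
            do {
                    if (entry != this)
                            return NULL;    /* someone else is first */
                    next = READ_ONCE(entry->next);
            } while (!try_cmpxchg(&head->first, &entry, next));

            return entry;
    }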

Reducing the thread count currently requires finding any thread, idle or
not, and calling kthread_stop().  This is no longer possible as we don't
have a list of all threads (though I guess we could keep the list if we
wanted to...).  Instead the pool is marked NEED_VICTIM and the next
thread to go idle will become the VICTIM and duly exit - signalling
this by clearing VICTIM_REMAINS.  We replace the kthread_should_stop()
call with a new svc_should_stop() which checks and sets victim flags.

nfsd threads can currently be told to exit with a signal.  It might be
time to deprecate/remove this feature.  However this patch does support
it.

If the signalled thread is not at the head of the idle list it cannot
remove itself.  In this case it sets RQ_CLEAN_ME and SP_CLEANUP and the
next thread to wake up will use llist_del_all_this() to remove all
threads from the idle list.  It then finds and removes any RQ_CLEAN_ME
threads and puts the rest back on the list.

There is quite a bit of churn here so it will need careful testing.
In fact - it doesn't handle nfsv4 callback handling threads properly as
they don't wait the same way that other threads wait...  I'll need to
think about that but I don't have time just now.

For now it is primarily an RFC.  I haven't given a lot of thought to
trace points.

To apply it you will need

 SUNRPC: Deduplicate thread wake-up code
 SUNRPC: Report when no service thread is available.
 SUNRPC: Split the svc_xprt_dequeue tracepoint
 SUNRPC: Clean up svc_set_num_threads
 SUNRPC: Replace dprintk() call site in __svc_create()

from recent post by Chuck.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/lockd/svc.c                |   4 +-
 fs/lockd/svclock.c            |   4 +-
 fs/nfs/callback.c             |   5 +-
 fs/nfsd/nfssvc.c              |   3 +-
 include/linux/llist.h         |   4 +
 include/linux/lockd/lockd.h   |   2 +-
 include/linux/sunrpc/svc.h    |  55 +++++++++-----
 include/trace/events/sunrpc.h |   7 +-
 lib/llist.c                   |  51 +++++++++++++
 net/sunrpc/svc.c              | 139 ++++++++++++++++++++++++----------
 net/sunrpc/svc_xprt.c         |  61 ++++++++-------
 11 files changed, 239 insertions(+), 96 deletions(-)

diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
index 22d3ff3818f5..df295771bd40 100644
--- a/fs/lockd/svc.c
+++ b/fs/lockd/svc.c
@@ -147,7 +147,7 @@ lockd(void *vrqstp)
 	 * The main request loop. We don't terminate until the last
 	 * NFS mount or NFS daemon has gone away.
 	 */
-	while (!kthread_should_stop()) {
+	while (!svc_should_stop(rqstp)) {
 		long timeout = MAX_SCHEDULE_TIMEOUT;
 		RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
 
@@ -160,7 +160,7 @@ lockd(void *vrqstp)
 			continue;
 		}
 
-		timeout = nlmsvc_retry_blocked();
+		timeout = nlmsvc_retry_blocked(rqstp);
 
 		/*
 		 * Find a socket with data available and call its
diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
index c43ccdf28ed9..54b679fcbcab 100644
--- a/fs/lockd/svclock.c
+++ b/fs/lockd/svclock.c
@@ -1009,13 +1009,13 @@ retry_deferred_block(struct nlm_block *block)
  * be retransmitted.
  */
 unsigned long
-nlmsvc_retry_blocked(void)
+nlmsvc_retry_blocked(struct svc_rqst *rqstp)
 {
 	unsigned long	timeout = MAX_SCHEDULE_TIMEOUT;
 	struct nlm_block *block;
 
 	spin_lock(&nlm_blocked_lock);
-	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
+	while (!list_empty(&nlm_blocked) && !svc_should_stop(rqstp)) {
 		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
 
 		if (block->b_when == NLM_NEVER)
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index 456af7d230cf..646425f1dc36 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -111,7 +111,7 @@ nfs41_callback_svc(void *vrqstp)
 
 	set_freezable();
 
-	while (!kthread_freezable_should_stop(NULL)) {
+	while (!svc_should_stop(rqstp)) {
 
 		if (signal_pending(current))
 			flush_signals(current);
@@ -130,10 +130,11 @@ nfs41_callback_svc(void *vrqstp)
 				error);
 		} else {
 			spin_unlock_bh(&serv->sv_cb_lock);
-			if (!kthread_should_stop())
+			if (!svc_should_stop(rqstp))
 				schedule();
 			finish_wait(&serv->sv_cb_waitq, &wq);
 		}
+		try_to_freeze();
 	}
 
 	svc_exit_thread(rqstp);
diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 9c7b1ef5be40..7cfa7f2e9bf7 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
  * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
  * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
  * nn->keep_active is set).  That number of nfsd threads must
- * exist and each must be listed in ->sp_all_threads in some entry of
- * ->sv_pools[].
+ * exist.
  *
  * Each active thread holds a counted reference on nn->nfsd_serv, as does
  * the nn->keep_active flag and various transient calls to svc_get().
diff --git a/include/linux/llist.h b/include/linux/llist.h
index 85bda2d02d65..5a22499844c8 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -248,6 +248,10 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 }
 
 extern struct llist_node *llist_del_first(struct llist_head *head);
+extern struct llist_node *llist_del_first_this(struct llist_head *head,
+					       struct llist_node *this);
+extern struct llist_node *llist_del_all_this(struct llist_head *head,
+					     struct llist_node *this);
 
 struct llist_node *llist_reverse_order(struct llist_node *head);
 
diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
index f42594a9efe0..c48020e7ee08 100644
--- a/include/linux/lockd/lockd.h
+++ b/include/linux/lockd/lockd.h
@@ -280,7 +280,7 @@ __be32		  nlmsvc_testlock(struct svc_rqst *, struct nlm_file *,
 			struct nlm_host *, struct nlm_lock *,
 			struct nlm_lock *, struct nlm_cookie *);
 __be32		  nlmsvc_cancel_blocked(struct net *net, struct nlm_file *, struct nlm_lock *);
-unsigned long	  nlmsvc_retry_blocked(void);
+unsigned long	  nlmsvc_retry_blocked(struct svc_rqst *rqstp);
 void		  nlmsvc_traverse_blocks(struct nlm_host *, struct nlm_file *,
 					nlm_host_match_fn_t match);
 void		  nlmsvc_grant_reply(struct nlm_cookie *, __be32);
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 366f2b6b689c..cb2497b977c1 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -31,11 +31,11 @@
  * node traffic on multi-node NUMA NFS servers.
  */
 struct svc_pool {
-	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
-	spinlock_t		sp_lock;	/* protects all fields */
+	unsigned int		sp_id;		/* pool id; also node id on NUMA */
+	spinlock_t		sp_lock;	/* protects all sp_socketsn sp_nrthreads*/
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
-	struct list_head	sp_all_threads;	/* all server threads */
+	struct llist_head	sp_idle_threads;/* idle server threads */
 
 	/* statistics on pool operation */
 	struct percpu_counter	sp_sockets_queued;
@@ -43,12 +43,17 @@ struct svc_pool {
 	struct percpu_counter	sp_threads_timedout;
 	struct percpu_counter	sp_threads_starved;
 
-#define	SP_TASK_PENDING		(0)		/* still work to do even if no
-						 * xprt is queued. */
-#define SP_CONGESTED		(1)
 	unsigned long		sp_flags;
 } ____cacheline_aligned_in_smp;
 
+enum svc_sp_flags {
+	SP_TASK_PENDING,	/* still work to do even if no xprt is queued */
+	SP_CONGESTED,
+	SP_NEED_VICTIM,		/* One thread needs to agree to exit */
+	SP_VICTIM_REMAINS,	/* One thread needs to actually exit */
+	SP_CLEANUP,		/* A thread has set RQ_CLEAN_ME */
+};
+
 /*
  * RPC service.
  *
@@ -195,7 +200,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  * processed.
  */
 struct svc_rqst {
-	struct list_head	rq_all;		/* all threads list */
+	struct llist_node	rq_idle;	/* On pool's idle list */
 	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
 	struct svc_xprt *	rq_xprt;	/* transport ptr */
 
@@ -233,16 +238,6 @@ struct svc_rqst {
 	u32			rq_proc;	/* procedure number */
 	u32			rq_prot;	/* IP protocol */
 	int			rq_cachetype;	/* catering to nfsd */
-#define	RQ_SECURE	(0)			/* secure port */
-#define	RQ_LOCAL	(1)			/* local request */
-#define	RQ_USEDEFERRAL	(2)			/* use deferral */
-#define	RQ_DROPME	(3)			/* drop current reply */
-#define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
-						 * to prevent encrypting page
-						 * cache pages */
-#define	RQ_VICTIM	(5)			/* about to be shut down */
-#define	RQ_BUSY		(6)			/* request is busy */
-#define	RQ_DATA		(7)			/* request has data */
 	unsigned long		rq_flags;	/* flags field */
 	ktime_t			rq_qtime;	/* enqueue time */
 
@@ -274,6 +269,20 @@ struct svc_rqst {
 	void **			rq_lease_breaker; /* The v4 client breaking a lease */
 };
 
+enum svc_rq_flags {
+	RQ_SECURE,			/* secure port */
+	RQ_LOCAL,			/* local request */
+	RQ_USEDEFERRAL,			/* use deferral */
+	RQ_DROPME,			/* drop current reply */
+	RQ_SPLICE_OK,			/* turned off in gss privacy
+					 * to prevent encrypting page
+					 * cache pages */
+	RQ_VICTIM,			/* agreed to shut down */
+	RQ_DATA,			/* request has data */
+	RQ_CLEAN_ME,			/* Thread needs to exit but
+					 * is on the idle list */
+};
+
 #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
 
 /*
@@ -309,6 +318,15 @@ static inline struct sockaddr *svc_daddr(const struct svc_rqst *rqst)
 	return (struct sockaddr *) &rqst->rq_daddr;
 }
 
+static inline bool svc_should_stop(struct svc_rqst *rqstp)
+{
+	if (test_and_clear_bit(SP_NEED_VICTIM, &rqstp->rq_pool->sp_flags)) {
+		set_bit(RQ_VICTIM, &rqstp->rq_flags);
+		return true;
+	}
+	return test_bit(RQ_VICTIM, &rqstp->rq_flags);
+}
+
 struct svc_deferred_req {
 	u32			prot;	/* protocol (UDP or TCP) */
 	struct svc_xprt		*xprt;
@@ -416,6 +434,7 @@ bool		   svc_rqst_replace_page(struct svc_rqst *rqstp,
 void		   svc_rqst_release_pages(struct svc_rqst *rqstp);
 void		   svc_rqst_free(struct svc_rqst *);
 void		   svc_exit_thread(struct svc_rqst *);
+bool		   svc_dequeue_rqst(struct svc_rqst *rqstp);
 struct svc_serv *  svc_create_pooled(struct svc_program *, unsigned int,
 				     int (*threadfn)(void *data));
 int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
@@ -428,7 +447,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
 
 void		   svc_wake_up(struct svc_serv *);
 void		   svc_reserve(struct svc_rqst *rqstp, int space);
-struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_serv *serv,
+bool		   svc_pool_wake_idle_thread(struct svc_serv *serv,
 					     struct svc_pool *pool);
 struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
 char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
index f6fd48961074..f63289d1491d 100644
--- a/include/trace/events/sunrpc.h
+++ b/include/trace/events/sunrpc.h
@@ -1601,7 +1601,7 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
 	svc_rqst_flag(DROPME)						\
 	svc_rqst_flag(SPLICE_OK)					\
 	svc_rqst_flag(VICTIM)						\
-	svc_rqst_flag(BUSY)						\
+	svc_rqst_flag(CLEAN_ME)						\
 	svc_rqst_flag_end(DATA)
 
 #undef svc_rqst_flag
@@ -1965,7 +1965,10 @@ TRACE_EVENT(svc_xprt_enqueue,
 #define show_svc_pool_flags(x)						\
 	__print_flags(x, "|",						\
 		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
-		{ BIT(SP_CONGESTED),		"CONGESTED" })
+		{ BIT(SP_CONGESTED),		"CONGESTED" },		\
+		{ BIT(SP_NEED_VICTIM),		"NEED_VICTIM" },	\
+		{ BIT(SP_VICTIM_REMAINS),	"VICTIM_REMAINS" },	\
+		{ BIT(SP_CLEANUP),		"CLEANUP" })
 
 DECLARE_EVENT_CLASS(svc_pool_scheduler_class,
 	TP_PROTO(
diff --git a/lib/llist.c b/lib/llist.c
index 6e668fa5a2c6..660be07795ac 100644
--- a/lib/llist.c
+++ b/lib/llist.c
@@ -65,6 +65,57 @@ struct llist_node *llist_del_first(struct llist_head *head)
 }
 EXPORT_SYMBOL_GPL(llist_del_first);
 
+/**
+ * llist_del_first_this - delete given entry of lock-less list if it is first
+ * @head:	the head for your lock-less list
+ * @this:	a list entry.
+ *
+ * If head of the list is given entry, delete and return it, else
+ * return %NULL.
+ *
+ * Providing the caller has exclusive access to @this, multiple callers can
+ * safely call this concurrently with multiple llist_add() callers.
+ */
+struct llist_node *llist_del_first_this(struct llist_head *head,
+					struct llist_node *this)
+{
+	struct llist_node *entry, *next;
+
+	entry = smp_load_acquire(&head->first);
+	do {
+		if (entry != this)
+			return NULL;
+		next = READ_ONCE(entry->next);
+	} while (!try_cmpxchg(&head->first, &entry, next));
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(llist_del_first_this);
+
+/**
+ * llist_del_all_this - delete all entries from lock-less list if first is the given element
+ * @head:	the head of lock-less list to delete all entries
+ * @this:	the expected first element.
+ *
+ * If the first element of the list is @this, delete all elements and
+ * return them, else return %NULL.  Providing the caller has exclusive access
+ * to @this, multiple concurrent callers can call this or list_del_first_this()
+ * simultaneuously with multiple callers of llist_add().
+ */
+struct llist_node *llist_del_all_this(struct llist_head *head,
+				      struct llist_node *this)
+{
+	struct llist_node *entry;
+
+	entry = smp_load_acquire(&head->first);
+	do {
+		if (entry != this)
+			return NULL;
+	} while (!try_cmpxchg(&head->first, &entry, NULL));
+
+	return entry;
+}
+
 /**
  * llist_reverse_order - reverse order of a llist chain
  * @head:	first item of the list to be reversed
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index ffb7200e8257..55339cbbbc6e 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -507,7 +507,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 
 		pool->sp_id = i;
 		INIT_LIST_HEAD(&pool->sp_sockets);
-		INIT_LIST_HEAD(&pool->sp_all_threads);
+		init_llist_head(&pool->sp_idle_threads);
 		spin_lock_init(&pool->sp_lock);
 
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -652,9 +652,9 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
 
 	pagevec_init(&rqstp->rq_pvec);
 
-	__set_bit(RQ_BUSY, &rqstp->rq_flags);
 	rqstp->rq_server = serv;
 	rqstp->rq_pool = pool;
+	rqstp->rq_idle.next = &rqstp->rq_idle;
 
 	rqstp->rq_scratch_page = alloc_pages_node(node, GFP_KERNEL, 0);
 	if (!rqstp->rq_scratch_page)
@@ -694,7 +694,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 
 	spin_lock_bh(&pool->sp_lock);
 	pool->sp_nrthreads++;
-	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
 	spin_unlock_bh(&pool->sp_lock);
 	return rqstp;
 }
@@ -704,32 +703,34 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
  * @serv: RPC service
  * @pool: service thread pool
  *
- * Returns an idle service thread (now marked BUSY), or NULL
- * if no service threads are available. Finding an idle service
- * thread and marking it BUSY is atomic with respect to other
- * calls to svc_pool_wake_idle_thread().
+ * If there are any idle threads in the pool, wake one up and return
+ * %true, else return %false.  The thread will become non-idle once
+ * the scheduler schedules it, at which point is might wake another
+ * thread if there seems to be enough work to justify that.
  */
-struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
-					   struct svc_pool *pool)
+bool svc_pool_wake_idle_thread(struct svc_serv *serv,
+			       struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
+	struct llist_node *ln;
 
 	rcu_read_lock();
-	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
-		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
-			continue;
-
-		rcu_read_unlock();
+	ln = READ_ONCE(pool->sp_idle_threads.first);
+	if (ln) {
+		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
-		wake_up_process(rqstp->rq_task);
-		percpu_counter_inc(&pool->sp_threads_woken);
-		return rqstp;
+		if (!task_is_running(rqstp->rq_task)) {
+			wake_up_process(rqstp->rq_task);
+			percpu_counter_inc(&pool->sp_threads_woken);
+		}
+		rcu_read_unlock();
+		return true;
 	}
 	rcu_read_unlock();
 
 	trace_svc_pool_starved(serv, pool);
 	percpu_counter_inc(&pool->sp_threads_starved);
-	return NULL;
+	return false;
 }
 
 static struct svc_pool *
@@ -738,19 +739,22 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 	return pool ? pool : &serv->sv_pools[(*state)++ % serv->sv_nrpools];
 }
 
-static struct task_struct *
+static struct svc_pool *
 svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 {
-	unsigned int i;
-	struct task_struct *task = NULL;
 
 	if (pool != NULL) {
 		spin_lock_bh(&pool->sp_lock);
+		if (pool->sp_nrthreads > 0)
+			goto found_pool;
+		spin_unlock_bh(&pool->sp_lock);
+		return NULL;
 	} else {
+		unsigned int i;
 		for (i = 0; i < serv->sv_nrpools; i++) {
 			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
 			spin_lock_bh(&pool->sp_lock);
-			if (!list_empty(&pool->sp_all_threads))
+			if (pool->sp_nrthreads > 0)
 				goto found_pool;
 			spin_unlock_bh(&pool->sp_lock);
 		}
@@ -758,16 +762,10 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
 	}
 
 found_pool:
-	if (!list_empty(&pool->sp_all_threads)) {
-		struct svc_rqst *rqstp;
-
-		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
-		set_bit(RQ_VICTIM, &rqstp->rq_flags);
-		list_del_rcu(&rqstp->rq_all);
-		task = rqstp->rq_task;
-	}
+	set_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
+	set_bit(SP_NEED_VICTIM, &pool->sp_flags);
 	spin_unlock_bh(&pool->sp_lock);
-	return task;
+	return pool;
 }
 
 static int
@@ -808,18 +806,16 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 static int
 svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 {
-	struct svc_rqst	*rqstp;
-	struct task_struct *task;
 	unsigned int state = serv->sv_nrthreads-1;
+	struct svc_pool *vpool;
 
 	do {
-		task = svc_pool_victim(serv, pool, &state);
-		if (task == NULL)
+		vpool = svc_pool_victim(serv, pool, &state);
+		if (vpool == NULL)
 			break;
-		rqstp = kthread_data(task);
-		/* Did we lose a race to svo_function threadfn? */
-		if (kthread_stop(task) == -EINTR)
-			svc_exit_thread(rqstp);
+		svc_pool_wake_idle_thread(serv, vpool);
+		wait_on_bit(&vpool->sp_flags, SP_VICTIM_REMAINS,
+			    TASK_UNINTERRUPTIBLE);
 		nrservs++;
 	} while (nrservs < 0);
 	return 0;
@@ -931,16 +927,75 @@ svc_rqst_free(struct svc_rqst *rqstp)
 }
 EXPORT_SYMBOL_GPL(svc_rqst_free);
 
+bool svc_dequeue_rqst(struct svc_rqst *rqstp)
+{
+	struct svc_pool *pool = rqstp->rq_pool;
+	struct llist_node *le, *last;
+
+retry:
+	if (pool->sp_idle_threads.first != &rqstp->rq_idle)
+		/* Not at head of queue, so cannot wake up */
+		return false;
+	if (!test_and_clear_bit(SP_CLEANUP, &pool->sp_flags)) {
+		le = llist_del_first_this(&pool->sp_idle_threads,
+					  &rqstp->rq_idle);
+		if (le)
+			le->next = le;
+		return !!le;
+	}
+	/* Need to deal will RQ_CLEAN_ME thread */
+	le = llist_del_all_this(&pool->sp_idle_threads,
+				&rqstp->rq_idle);
+	if (!le) {
+		/* lost a race, someone else need to clean up */
+		set_bit(SP_CLEANUP, &pool->sp_flags);
+		svc_pool_wake_idle_thread(rqstp->rq_server,
+					  pool);
+		goto retry;
+	}
+	if (!le->next)
+		return true;
+	last = le;
+	while (last->next) {
+		rqstp = list_entry(last->next, struct svc_rqst, rq_idle);
+		if (!test_bit(RQ_CLEAN_ME, &rqstp->rq_flags)) {
+			last = last->next;
+			continue;
+		}
+		last->next = last->next->next;
+		rqstp->rq_idle.next = &rqstp->rq_idle;
+		wake_up_process(rqstp->rq_task);
+	}
+	if (last != le)
+		llist_add_batch(le->next, last, &pool->sp_idle_threads);
+	le->next = le;
+	return true;
+}
+
 void
 svc_exit_thread(struct svc_rqst *rqstp)
 {
 	struct svc_serv	*serv = rqstp->rq_server;
 	struct svc_pool	*pool = rqstp->rq_pool;
 
+	while (rqstp->rq_idle.next != &rqstp->rq_idle) {
+		/* Still on the idle list. */
+		if (llist_del_first_this(&pool->sp_idle_threads,
+					 &rqstp->rq_idle)) {
+			/* Safely removed */
+			rqstp->rq_idle.next = &rqstp->rq_idle;
+		} else {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			set_bit(RQ_CLEAN_ME, &rqstp->rq_flags);
+			set_bit(SP_CLEANUP, &pool->sp_flags);
+			svc_pool_wake_idle_thread(serv, pool);
+			if (!svc_dequeue_rqst(rqstp))
+				schedule();
+			__set_current_state(TASK_RUNNING);
+		}
+	}
 	spin_lock_bh(&pool->sp_lock);
 	pool->sp_nrthreads--;
-	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
-		list_del_rcu(&rqstp->rq_all);
 	spin_unlock_bh(&pool->sp_lock);
 
 	spin_lock_bh(&serv->sv_lock);
@@ -948,6 +1003,8 @@ svc_exit_thread(struct svc_rqst *rqstp)
 	spin_unlock_bh(&serv->sv_lock);
 	svc_sock_update_bufs(serv);
 
+	if (test_bit(RQ_VICTIM, &rqstp->rq_flags))
+		clear_and_wake_up_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
 	svc_rqst_free(rqstp);
 
 	svc_put(serv);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index c4521bce1f27..d51587cd8d99 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -446,7 +446,6 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
  */
 void svc_xprt_enqueue(struct svc_xprt *xprt)
 {
-	struct svc_rqst *rqstp;
 	struct svc_pool *pool;
 
 	if (!svc_xprt_ready(xprt))
@@ -467,20 +466,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
 	spin_unlock_bh(&pool->sp_lock);
 
-	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
-	if (!rqstp) {
+	if (!svc_pool_wake_idle_thread(xprt->xpt_server, pool)) {
 		set_bit(SP_CONGESTED, &pool->sp_flags);
 		return;
 	}
 
-	trace_svc_xprt_enqueue(xprt, rqstp);
+	// trace_svc_xprt_enqueue(xprt, rqstp);
 }
 EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
 
 /*
  * Dequeue the first transport, if there is one.
  */
-static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
+static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool, bool *more)
 {
 	struct svc_xprt	*xprt = NULL;
 
@@ -493,6 +491,7 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
 					struct svc_xprt, xpt_ready);
 		list_del_init(&xprt->xpt_ready);
 		svc_xprt_get(xprt);
+		*more = !list_empty(&pool->sp_sockets);
 	}
 	spin_unlock_bh(&pool->sp_lock);
 out:
@@ -577,15 +576,13 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
 void svc_wake_up(struct svc_serv *serv)
 {
 	struct svc_pool *pool = &serv->sv_pools[0];
-	struct svc_rqst *rqstp;
 
-	rqstp = svc_pool_wake_idle_thread(serv, pool);
-	if (!rqstp) {
+	if (!svc_pool_wake_idle_thread(serv, pool)) {
 		set_bit(SP_TASK_PENDING, &pool->sp_flags);
 		return;
 	}
 
-	trace_svc_wake_up(rqstp);
+	// trace_svc_wake_up(rqstp);
 }
 EXPORT_SYMBOL_GPL(svc_wake_up);
 
@@ -676,7 +673,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
 			continue;
 
 		set_current_state(TASK_INTERRUPTIBLE);
-		if (signalled() || kthread_should_stop()) {
+		if (signalled() || svc_should_stop(rqstp)) {
 			set_current_state(TASK_RUNNING);
 			return -EINTR;
 		}
@@ -706,7 +703,10 @@ rqst_should_sleep(struct svc_rqst *rqstp)
 	struct svc_pool		*pool = rqstp->rq_pool;
 
 	/* did someone call svc_wake_up? */
-	if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags))
+	if (test_bit(SP_TASK_PENDING, &pool->sp_flags))
+		return false;
+	if (test_bit(SP_CLEANUP, &pool->sp_flags))
+		/* a signalled thread needs to be released */
 		return false;
 
 	/* was a socket queued? */
@@ -714,7 +714,7 @@ rqst_should_sleep(struct svc_rqst *rqstp)
 		return false;
 
 	/* are we shutting down? */
-	if (signalled() || kthread_should_stop())
+	if (signalled() || svc_should_stop(rqstp))
 		return false;
 
 	/* are we freezing? */
@@ -728,11 +728,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 {
 	struct svc_pool		*pool = rqstp->rq_pool;
 	long			time_left = 0;
+	bool			more = false;
 
-	/* rq_xprt should be clear on entry */
-	WARN_ON_ONCE(rqstp->rq_xprt);
-
-	rqstp->rq_xprt = svc_xprt_dequeue(pool);
+	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
 	if (rqstp->rq_xprt) {
 		trace_svc_pool_polled(pool, rqstp);
 		goto out_found;
@@ -743,11 +741,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	 * to bring down the daemons ...
 	 */
 	set_current_state(TASK_INTERRUPTIBLE);
-	smp_mb__before_atomic();
-	clear_bit(SP_CONGESTED, &pool->sp_flags);
-	clear_bit(RQ_BUSY, &rqstp->rq_flags);
-	smp_mb__after_atomic();
+	clear_bit_unlock(SP_CONGESTED, &pool->sp_flags);
 
+	llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
+again:
 	if (likely(rqst_should_sleep(rqstp)))
 		time_left = schedule_timeout(timeout);
 	else
@@ -755,9 +752,20 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 
 	try_to_freeze();
 
-	set_bit(RQ_BUSY, &rqstp->rq_flags);
-	smp_mb__after_atomic();
-	rqstp->rq_xprt = svc_xprt_dequeue(pool);
+	if (!svc_dequeue_rqst(rqstp)) {
+		if (signalled())
+			/* Can only return while on idle list if signalled */
+			return ERR_PTR(-EINTR);
+		/* Still on the idle list */
+		goto again;
+	}
+
+	clear_bit(SP_TASK_PENDING, &pool->sp_flags);
+
+	if (svc_should_stop(rqstp))
+		return ERR_PTR(-EINTR);
+
+	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
 	if (rqstp->rq_xprt) {
 		trace_svc_pool_awoken(pool, rqstp);
 		goto out_found;
@@ -766,10 +774,11 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
 	if (!time_left)
 		percpu_counter_inc(&pool->sp_threads_timedout);
 
-	if (signalled() || kthread_should_stop())
-		return ERR_PTR(-EINTR);
 	return ERR_PTR(-EAGAIN);
 out_found:
+	if (more)
+		svc_pool_wake_idle_thread(rqstp->rq_server, pool);
+
 	/* Normally we will wait up to 5 seconds for any required
 	 * cache information to be provided.
 	 */
@@ -866,7 +875,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
 	try_to_freeze();
 	cond_resched();
 	err = -EINTR;
-	if (signalled() || kthread_should_stop())
+	if (signalled() || svc_should_stop(rqstp))
 		goto out;
 
 	xprt = svc_get_next_xprt(rqstp, timeout);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10  0:41       ` [PATCH/RFC] sunrpc: constant-time code to wake idle thread NeilBrown
@ 2023-07-10 14:41         ` Chuck Lever
  2023-07-10 22:18           ` NeilBrown
  2023-07-10 16:21         ` Chuck Lever
  1 sibling, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-10 14:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Mon, Jul 10, 2023 at 10:41:38AM +1000, NeilBrown wrote:
> 
> The patch show an alternate approach to the recent patches which improve
> the latency for waking an idle thread when work is ready.
> 
> The current approach involves searching a linked list for an idle thread
> to wake.  The recent patches instead use a bitmap search to find the
> idle thread.  With this patch no search is needed - and idle thread is
> directly available without searching.
> 
> The idle threads are kept in an "llist" - there is no longer a list of
> all threads.
> 
> The llist ADT does not allow concurrent delete_first operations, so to
> wake an idle thread we simply wake it and do not remove it from the
> list.
> When the thread is scheduled it will remove itself - which is safe - and
> will take the next thread if there is more work to do (and if there is
> another thread).
> 
> The "remove itself" requires and addition to the llist api.
> "llist_del_first_this()" removes a given item if is it the first.
> Multiple callers can call this concurrently as along as they each give a
> different "this", so each thread can safely try to remove itself.  It
> must be prepared for failure.
> 
> Reducing the thread count currently requires finding any thing, idle or
> not, and calling kthread_stop().  This no longer possible as we don't
> have a list of all threads (though I guess we could keep the list if we
> wanted to...).  Instead the pool is marked NEED_VICTIM and the next
> thread to go idle will become the VICTIM and duly exit - signalling
> this be clearing VICTIM_REMAINS.  We replace kthread_should_stop() call
> with a new svc_should_stop() which checks and sets victim flags.
> 
> nfsd threads can currently be told to exit with a signal.  It might be
> time to deprecate/remove this feature.  However this patch does support
> it.
> 
> If the signalled thread is not at the head of the idle list it cannot
> remove itself.  In this case it sets RQ_CLEAN_ME and SP_CLEANUP and the
> next thread to wake up will use llist_del_all_this() to remove all
> threads from the idle list.  It then finds and removes any RQ_CLEAN_ME
> threads and puts the rest back on the list.
> 
> There is quite a bit of churn here so it will need careful testing.
> In fact - it doesn't handle nfsv4 callback handling threads properly as
> they don't wait the same way that other threads wait...  I'll need to
> think about that but I don't have time just now.
> 
> For now it is primarily an RFC.  I haven't given a lot of thought to
> trace points.
> 
> It apply it you will need
> 
>  SUNRPC: Deduplicate thread wake-up code
>  SUNRPC: Report when no service thread is available.
>  SUNRPC: Split the svc_xprt_dequeue tracepoint
>  SUNRPC: Clean up svc_set_num_threads
>  SUNRPC: Replace dprintk() call site in __svc_create()
> 
> from recent post by Chuck.

Hi, thanks for letting us see your pencil sketches. :-)

Later today, I'll push a topic branch to my kernel.org repo that we
can use as a base for continuing this work.

Some initial remarks below, recognizing that this patch is still
incomplete.


> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/lockd/svc.c                |   4 +-
>  fs/lockd/svclock.c            |   4 +-
>  fs/nfs/callback.c             |   5 +-
>  fs/nfsd/nfssvc.c              |   3 +-
>  include/linux/llist.h         |   4 +
>  include/linux/lockd/lockd.h   |   2 +-
>  include/linux/sunrpc/svc.h    |  55 +++++++++-----
>  include/trace/events/sunrpc.h |   7 +-
>  lib/llist.c                   |  51 +++++++++++++
>  net/sunrpc/svc.c              | 139 ++++++++++++++++++++++++----------
>  net/sunrpc/svc_xprt.c         |  61 ++++++++-------
>  11 files changed, 239 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index 22d3ff3818f5..df295771bd40 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -147,7 +147,7 @@ lockd(void *vrqstp)
>  	 * The main request loop. We don't terminate until the last
>  	 * NFS mount or NFS daemon has gone away.
>  	 */
> -	while (!kthread_should_stop()) {
> +	while (!svc_should_stop(rqstp)) {
>  		long timeout = MAX_SCHEDULE_TIMEOUT;
>  		RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
>  
> @@ -160,7 +160,7 @@ lockd(void *vrqstp)
>  			continue;
>  		}
>  
> -		timeout = nlmsvc_retry_blocked();
> +		timeout = nlmsvc_retry_blocked(rqstp);
>  
>  		/*
>  		 * Find a socket with data available and call its
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index c43ccdf28ed9..54b679fcbcab 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -1009,13 +1009,13 @@ retry_deferred_block(struct nlm_block *block)
>   * be retransmitted.
>   */
>  unsigned long
> -nlmsvc_retry_blocked(void)
> +nlmsvc_retry_blocked(struct svc_rqst *rqstp)
>  {
>  	unsigned long	timeout = MAX_SCHEDULE_TIMEOUT;
>  	struct nlm_block *block;
>  
>  	spin_lock(&nlm_blocked_lock);
> -	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
> +	while (!list_empty(&nlm_blocked) && !svc_should_stop(rqstp)) {
>  		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
>  
>  		if (block->b_when == NLM_NEVER)
> diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
> index 456af7d230cf..646425f1dc36 100644
> --- a/fs/nfs/callback.c
> +++ b/fs/nfs/callback.c
> @@ -111,7 +111,7 @@ nfs41_callback_svc(void *vrqstp)
>  
>  	set_freezable();
>  
> -	while (!kthread_freezable_should_stop(NULL)) {
> +	while (!svc_should_stop(rqstp)) {
>  
>  		if (signal_pending(current))
>  			flush_signals(current);
> @@ -130,10 +130,11 @@ nfs41_callback_svc(void *vrqstp)
>  				error);
>  		} else {
>  			spin_unlock_bh(&serv->sv_cb_lock);
> -			if (!kthread_should_stop())
> +			if (!svc_should_stop(rqstp))
>  				schedule();
>  			finish_wait(&serv->sv_cb_waitq, &wq);
>  		}
> +		try_to_freeze();
>  	}
>  
>  	svc_exit_thread(rqstp);
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 9c7b1ef5be40..7cfa7f2e9bf7 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
>   * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
>   * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
>   * nn->keep_active is set).  That number of nfsd threads must
> - * exist and each must be listed in ->sp_all_threads in some entry of
> - * ->sv_pools[].
> + * exist.
>   *
>   * Each active thread holds a counted reference on nn->nfsd_serv, as does
>   * the nn->keep_active flag and various transient calls to svc_get().
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 85bda2d02d65..5a22499844c8 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -248,6 +248,10 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
>  }
>  
>  extern struct llist_node *llist_del_first(struct llist_head *head);
> +extern struct llist_node *llist_del_first_this(struct llist_head *head,
> +					       struct llist_node *this);
> +extern struct llist_node *llist_del_all_this(struct llist_head *head,
> +					     struct llist_node *this);
>  
>  struct llist_node *llist_reverse_order(struct llist_node *head);
>  
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index f42594a9efe0..c48020e7ee08 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -280,7 +280,7 @@ __be32		  nlmsvc_testlock(struct svc_rqst *, struct nlm_file *,
>  			struct nlm_host *, struct nlm_lock *,
>  			struct nlm_lock *, struct nlm_cookie *);
>  __be32		  nlmsvc_cancel_blocked(struct net *net, struct nlm_file *, struct nlm_lock *);
> -unsigned long	  nlmsvc_retry_blocked(void);
> +unsigned long	  nlmsvc_retry_blocked(struct svc_rqst *rqstp);
>  void		  nlmsvc_traverse_blocks(struct nlm_host *, struct nlm_file *,
>  					nlm_host_match_fn_t match);
>  void		  nlmsvc_grant_reply(struct nlm_cookie *, __be32);
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 366f2b6b689c..cb2497b977c1 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -31,11 +31,11 @@
>   * node traffic on multi-node NUMA NFS servers.
>   */
>  struct svc_pool {
> -	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
> -	spinlock_t		sp_lock;	/* protects all fields */
> +	unsigned int		sp_id;		/* pool id; also node id on NUMA */
> +	spinlock_t		sp_lock;	/* protects all sp_socketsn sp_nrthreads*/
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> -	struct list_head	sp_all_threads;	/* all server threads */
> +	struct llist_head	sp_idle_threads;/* idle server threads */
>  
>  	/* statistics on pool operation */
>  	struct percpu_counter	sp_sockets_queued;
> @@ -43,12 +43,17 @@ struct svc_pool {
>  	struct percpu_counter	sp_threads_timedout;
>  	struct percpu_counter	sp_threads_starved;
>  
> -#define	SP_TASK_PENDING		(0)		/* still work to do even if no
> -						 * xprt is queued. */
> -#define SP_CONGESTED		(1)
>  	unsigned long		sp_flags;
>  } ____cacheline_aligned_in_smp;
>  
> +enum svc_sp_flags {

Let's make this an anonymous enum. Ditto below.
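
That is, something like this (same flags, just dropping the tag, since
nothing refers to the enum type by name):

enum {
        SP_TASK_PENDING,        /* still work to do even if no xprt is queued */
        SP_CONGESTED,
        SP_NEED_VICTIM,         /* One thread needs to agree to exit */
        SP_VICTIM_REMAINS,      /* One thread needs to actually exit */
        SP_CLEANUP,             /* A thread has set RQ_CLEAN_ME */
};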


> +	SP_TASK_PENDING,	/* still work to do even if no xprt is queued */
> +	SP_CONGESTED,
> +	SP_NEED_VICTIM,		/* One thread needs to agree to exit */
> +	SP_VICTIM_REMAINS,	/* One thread needs to actually exit */
> +	SP_CLEANUP,		/* A thread has set RQ_CLEAN_ME */
> +};
> +

Converting the bit flags to an enum seems like an unrelated clean-
up. It isn't necessary in order to implement the new scheduler.
Let's extract this into a separate patch that can be applied first.

Also, I'm not clear on the justification for this clean up. That
should be explained in the patch description of the split-out
clean-up patch(es).


>  /*
>   * RPC service.
>   *
> @@ -195,7 +200,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   * processed.
>   */
>  struct svc_rqst {
> -	struct list_head	rq_all;		/* all threads list */
> +	struct llist_node	rq_idle;	/* On pool's idle list */
>  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
>  	struct svc_xprt *	rq_xprt;	/* transport ptr */
>  
> @@ -233,16 +238,6 @@ struct svc_rqst {
>  	u32			rq_proc;	/* procedure number */
>  	u32			rq_prot;	/* IP protocol */
>  	int			rq_cachetype;	/* catering to nfsd */
> -#define	RQ_SECURE	(0)			/* secure port */
> -#define	RQ_LOCAL	(1)			/* local request */
> -#define	RQ_USEDEFERRAL	(2)			/* use deferral */
> -#define	RQ_DROPME	(3)			/* drop current reply */
> -#define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> -						 * to prevent encrypting page
> -						 * cache pages */
> -#define	RQ_VICTIM	(5)			/* about to be shut down */
> -#define	RQ_BUSY		(6)			/* request is busy */
> -#define	RQ_DATA		(7)			/* request has data */
>  	unsigned long		rq_flags;	/* flags field */
>  	ktime_t			rq_qtime;	/* enqueue time */
>  
> @@ -274,6 +269,20 @@ struct svc_rqst {
>  	void **			rq_lease_breaker; /* The v4 client breaking a lease */
>  };
>  
> +enum svc_rq_flags {
> +	RQ_SECURE,			/* secure port */
> +	RQ_LOCAL,			/* local request */
> +	RQ_USEDEFERRAL,			/* use deferral */
> +	RQ_DROPME,			/* drop current reply */
> +	RQ_SPLICE_OK,			/* turned off in gss privacy
> +					 * to prevent encrypting page
> +					 * cache pages */
> +	RQ_VICTIM,			/* agreed to shut down */
> +	RQ_DATA,			/* request has data */
> +	RQ_CLEAN_ME,			/* Thread needs to exit but
> +					 * is on the idle list */
> +};
> +

Likewise here. And let's keep the flag clean-ups in separate patches.


>  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
>  
>  /*
> @@ -309,6 +318,15 @@ static inline struct sockaddr *svc_daddr(const struct svc_rqst *rqst)
>  	return (struct sockaddr *) &rqst->rq_daddr;
>  }
>  

This needs a kdoc comment and a more conventional name. How about
svc_thread_should_stop() ?

Actually it seems like a better abstraction all around if the upper
layers don't have to care that they are running in a kthread  -- so
maybe replacing kthread_should_stop() is a good clean-up to apply
in advance.
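
For instance, roughly this shape -- the body is unchanged from the hunk
below, only the name and the kdoc are mine:

/**
 * svc_thread_should_stop - check whether this thread should stop
 * @rqstp: the thread in question
 *
 * Returns true if this thread has volunteered (now or on a previous
 * call) to be the victim that exits to reduce the thread count.
 */
static inline bool svc_thread_should_stop(struct svc_rqst *rqstp)
{
        if (test_and_clear_bit(SP_NEED_VICTIM, &rqstp->rq_pool->sp_flags)) {
                set_bit(RQ_VICTIM, &rqstp->rq_flags);
                return true;
        }
        return test_bit(RQ_VICTIM, &rqstp->rq_flags);
}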


> +static inline bool svc_should_stop(struct svc_rqst *rqstp)
> +{
> +	if (test_and_clear_bit(SP_NEED_VICTIM, &rqstp->rq_pool->sp_flags)) {
> +		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> +		return true;
> +	}
> +	return test_bit(RQ_VICTIM, &rqstp->rq_flags);
> +}
> +
>  struct svc_deferred_req {
>  	u32			prot;	/* protocol (UDP or TCP) */
>  	struct svc_xprt		*xprt;
> @@ -416,6 +434,7 @@ bool		   svc_rqst_replace_page(struct svc_rqst *rqstp,
>  void		   svc_rqst_release_pages(struct svc_rqst *rqstp);
>  void		   svc_rqst_free(struct svc_rqst *);
>  void		   svc_exit_thread(struct svc_rqst *);
> +bool		   svc_dequeue_rqst(struct svc_rqst *rqstp);
>  struct svc_serv *  svc_create_pooled(struct svc_program *, unsigned int,
>  				     int (*threadfn)(void *data));
>  int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
> @@ -428,7 +447,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
>  
>  void		   svc_wake_up(struct svc_serv *);
>  void		   svc_reserve(struct svc_rqst *rqstp, int space);
> -struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_serv *serv,
> +bool		   svc_pool_wake_idle_thread(struct svc_serv *serv,
>  					     struct svc_pool *pool);
>  struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
>  char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index f6fd48961074..f63289d1491d 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1601,7 +1601,7 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>  	svc_rqst_flag(DROPME)						\
>  	svc_rqst_flag(SPLICE_OK)					\
>  	svc_rqst_flag(VICTIM)						\
> -	svc_rqst_flag(BUSY)						\
> +	svc_rqst_flag(CLEAN_ME)						\
>  	svc_rqst_flag_end(DATA)
>  
>  #undef svc_rqst_flag
> @@ -1965,7 +1965,10 @@ TRACE_EVENT(svc_xprt_enqueue,
>  #define show_svc_pool_flags(x)						\
>  	__print_flags(x, "|",						\
>  		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
> -		{ BIT(SP_CONGESTED),		"CONGESTED" })
> +		{ BIT(SP_CONGESTED),		"CONGESTED" },		\
> +		{ BIT(SP_NEED_VICTIM),		"NEED_VICTIM" },	\
> +		{ BIT(SP_VICTIM_REMAINS),	"VICTIM_REMAINS" },	\
> +		{ BIT(SP_CLEANUP),		"CLEANUP" })
>  
>  DECLARE_EVENT_CLASS(svc_pool_scheduler_class,
>  	TP_PROTO(
> diff --git a/lib/llist.c b/lib/llist.c
> index 6e668fa5a2c6..660be07795ac 100644
> --- a/lib/llist.c
> +++ b/lib/llist.c
> @@ -65,6 +65,57 @@ struct llist_node *llist_del_first(struct llist_head *head)
>  }
>  EXPORT_SYMBOL_GPL(llist_del_first);
>  
> +/**
> + * llist_del_first_this - delete given entry of lock-less list if it is first
> + * @head:	the head for your lock-less list
> + * @this:	a list entry.
> + *
> + * If head of the list is given entry, delete and return it, else
> + * return %NULL.
> + *
> + * Providing the caller has exclusive access to @this, multiple callers can
> + * safely call this concurrently with multiple llist_add() callers.
> + */
> +struct llist_node *llist_del_first_this(struct llist_head *head,
> +					struct llist_node *this)
> +{
> +	struct llist_node *entry, *next;
> +
> +	entry = smp_load_acquire(&head->first);
> +	do {
> +		if (entry != this)
> +			return NULL;
> +		next = READ_ONCE(entry->next);
> +	} while (!try_cmpxchg(&head->first, &entry, next));
> +
> +	return entry;
> +}
> +EXPORT_SYMBOL_GPL(llist_del_first_this);
> +
> +/**
> + * llist_del_all_this - delete all entries from lock-less list if first is the given element
> + * @head:	the head of lock-less list to delete all entries
> + * @this:	the expected first element.
> + *
> + * If the first element of the list is @this, delete all elements and
> + * return them, else return %NULL.  Providing the caller has exclusive access
> + * to @this, multiple concurrent callers can call this or list_del_first_this()
> + * simultaneuously with multiple callers of llist_add().
> + */
> +struct llist_node *llist_del_all_this(struct llist_head *head,
> +				      struct llist_node *this)
> +{
> +	struct llist_node *entry;
> +
> +	entry = smp_load_acquire(&head->first);
> +	do {
> +		if (entry != this)
> +			return NULL;
> +	} while (!try_cmpxchg(&head->first, &entry, NULL));
> +
> +	return entry;
> +}
> +

I was going to say that we should copy the maintainer of
lib/llist.c on this patch set, but I'm a little surprised to see
no maintainer listed for it. I'm not sure how to get proper
review for the new API and mechanism.

Sidebar: Are there any self-tests or kunit tests for llist?
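
If there aren't, a minimal KUnit sketch exercising the proposed
llist_del_first_this() could look like this (purely illustrative, no
such test exists today as far as I know):

#include <kunit/test.h>
#include <linux/llist.h>

static void llist_del_first_this_test(struct kunit *test)
{
        struct llist_node a, b;
        LLIST_HEAD(head);

        llist_add(&a, &head);
        llist_add(&b, &head);           /* list is now b -> a */

        /* @a is not at the head, so this must fail */
        KUNIT_EXPECT_NULL(test, llist_del_first_this(&head, &a));
        /* @b is at the head, so this must succeed */
        KUNIT_EXPECT_PTR_EQ(test, llist_del_first_this(&head, &b), &b);
        KUNIT_EXPECT_PTR_EQ(test, llist_del_first(&head), &a);
        KUNIT_EXPECT_NULL(test, llist_del_first(&head));
}

static struct kunit_case llist_test_cases[] = {
        KUNIT_CASE(llist_del_first_this_test),
        {}
};

static struct kunit_suite llist_test_suite = {
        .name = "llist",
        .test_cases = llist_test_cases,
};
kunit_test_suite(llist_test_suite);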


>  /**
>   * llist_reverse_order - reverse order of a llist chain
>   * @head:	first item of the list to be reversed
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ffb7200e8257..55339cbbbc6e 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -507,7 +507,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  
>  		pool->sp_id = i;
>  		INIT_LIST_HEAD(&pool->sp_sockets);
> -		INIT_LIST_HEAD(&pool->sp_all_threads);
> +		init_llist_head(&pool->sp_idle_threads);
>  		spin_lock_init(&pool->sp_lock);
>  
>  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -652,9 +652,9 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>  
>  	pagevec_init(&rqstp->rq_pvec);
>  
> -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
>  	rqstp->rq_server = serv;
>  	rqstp->rq_pool = pool;
> +	rqstp->rq_idle.next = &rqstp->rq_idle;

Is there really no initializer helper for this?
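
If there isn't, a couple of trivial helpers along these lines could be
added to llist.h to make the intent explicit (the names here are made
up):

/* Mark a node as not being on any list: it points to itself. */
static inline void init_llist_node(struct llist_node *node)
{
        node->next = node;
}

/* Test whether a node is currently on some list. */
static inline bool llist_on_list(struct llist_node *node)
{
        return node->next != node;
}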


>  	rqstp->rq_scratch_page = alloc_pages_node(node, GFP_KERNEL, 0);
>  	if (!rqstp->rq_scratch_page)
> @@ -694,7 +694,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  
>  	spin_lock_bh(&pool->sp_lock);
>  	pool->sp_nrthreads++;
> -	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
>  	spin_unlock_bh(&pool->sp_lock);
>  	return rqstp;
>  }
> @@ -704,32 +703,34 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>   * @serv: RPC service
>   * @pool: service thread pool
>   *
> - * Returns an idle service thread (now marked BUSY), or NULL
> - * if no service threads are available. Finding an idle service
> - * thread and marking it BUSY is atomic with respect to other
> - * calls to svc_pool_wake_idle_thread().
> + * If there are any idle threads in the pool, wake one up and return
> + * %true, else return %false.  The thread will become non-idle once
> + * the scheduler schedules it, at which point is might wake another
> + * thread if there seems to be enough work to justify that.

So I'm wondering how another call to svc_pool_wake_idle_thread()
that happens concurrently will not find and wake the same thread?


>   */
> -struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> -					   struct svc_pool *pool)
> +bool svc_pool_wake_idle_thread(struct svc_serv *serv,
> +			       struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> +	struct llist_node *ln;
>  
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> -			continue;
> -
> -		rcu_read_unlock();
> +	ln = READ_ONCE(pool->sp_idle_threads.first);
> +	if (ln) {
> +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> -		wake_up_process(rqstp->rq_task);
> -		percpu_counter_inc(&pool->sp_threads_woken);
> -		return rqstp;
> +		if (!task_is_running(rqstp->rq_task)) {
> +			wake_up_process(rqstp->rq_task);
> +			percpu_counter_inc(&pool->sp_threads_woken);
> +		}
> +		rcu_read_unlock();
> +		return true;
>  	}
>  	rcu_read_unlock();
>  
>  	trace_svc_pool_starved(serv, pool);
>  	percpu_counter_inc(&pool->sp_threads_starved);
> -	return NULL;
> +	return false;
>  }
>  
>  static struct svc_pool *
> @@ -738,19 +739,22 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>  	return pool ? pool : &serv->sv_pools[(*state)++ % serv->sv_nrpools];
>  }
>  
> -static struct task_struct *
> +static struct svc_pool *
>  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>  {
> -	unsigned int i;
> -	struct task_struct *task = NULL;
>  
>  	if (pool != NULL) {
>  		spin_lock_bh(&pool->sp_lock);
> +		if (pool->sp_nrthreads > 0)
> +			goto found_pool;
> +		spin_unlock_bh(&pool->sp_lock);
> +		return NULL;
>  	} else {
> +		unsigned int i;
>  		for (i = 0; i < serv->sv_nrpools; i++) {
>  			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
>  			spin_lock_bh(&pool->sp_lock);
> -			if (!list_empty(&pool->sp_all_threads))
> +			if (pool->sp_nrthreads > 0)
>  				goto found_pool;
>  			spin_unlock_bh(&pool->sp_lock);
>  		}
> @@ -758,16 +762,10 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
>  	}
>  
>  found_pool:
> -	if (!list_empty(&pool->sp_all_threads)) {
> -		struct svc_rqst *rqstp;
> -
> -		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> -		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> -		list_del_rcu(&rqstp->rq_all);
> -		task = rqstp->rq_task;
> -	}
> +	set_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
> +	set_bit(SP_NEED_VICTIM, &pool->sp_flags);
>  	spin_unlock_bh(&pool->sp_lock);
> -	return task;
> +	return pool;
>  }
>  
>  static int
> @@ -808,18 +806,16 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>  static int
>  svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>  {
> -	struct svc_rqst	*rqstp;
> -	struct task_struct *task;
>  	unsigned int state = serv->sv_nrthreads-1;
> +	struct svc_pool *vpool;
>  
>  	do {
> -		task = svc_pool_victim(serv, pool, &state);
> -		if (task == NULL)
> +		vpool = svc_pool_victim(serv, pool, &state);
> +		if (vpool == NULL)
>  			break;
> -		rqstp = kthread_data(task);
> -		/* Did we lose a race to svo_function threadfn? */
> -		if (kthread_stop(task) == -EINTR)
> -			svc_exit_thread(rqstp);
> +		svc_pool_wake_idle_thread(serv, vpool);
> +		wait_on_bit(&vpool->sp_flags, SP_VICTIM_REMAINS,
> +			    TASK_UNINTERRUPTIBLE);
>  		nrservs++;
>  	} while (nrservs < 0);
>  	return 0;
> @@ -931,16 +927,75 @@ svc_rqst_free(struct svc_rqst *rqstp)
>  }
>  EXPORT_SYMBOL_GPL(svc_rqst_free);
>  

Can you add a kdoc comment for svc_dequeue_rqst()?

There is a lot of complexity here that I'm not grokking on first
read. Either it needs more comments or simplification (or both).

It's not winning me over, I have to say ;-)
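
To capture my reading of it, a strawman kdoc (correct me if I have the
return values wrong):

/**
 * svc_dequeue_rqst - remove this thread from its pool's idle list
 * @rqstp: the thread trying to leave the idle list
 *
 * Returns true if @rqstp is no longer on the idle list, either because
 * it was at the head and removed itself, or because it was removed
 * while handling an SP_CLEANUP pass.  Returns false if @rqstp is still
 * on the idle list (it is not at the head) and must keep waiting.
 */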


> +bool svc_dequeue_rqst(struct svc_rqst *rqstp)
> +{
> +	struct svc_pool *pool = rqstp->rq_pool;
> +	struct llist_node *le, *last;
> +
> +retry:
> +	if (pool->sp_idle_threads.first != &rqstp->rq_idle)

Would be better if there was a helper for this test.
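
For example (name invented):

static inline bool svc_rqst_at_idle_head(struct svc_rqst *rqstp)
{
        return READ_ONCE(rqstp->rq_pool->sp_idle_threads.first) ==
               &rqstp->rq_idle;
}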


> +		/* Not at head of queue, so cannot wake up */
> +		return false;
> +	if (!test_and_clear_bit(SP_CLEANUP, &pool->sp_flags)) {
> +		le = llist_del_first_this(&pool->sp_idle_threads,
> +					  &rqstp->rq_idle);
> +		if (le)
> +			le->next = le;
> +		return !!le;
> +	}
> +	/* Need to deal will RQ_CLEAN_ME thread */
> +	le = llist_del_all_this(&pool->sp_idle_threads,
> +				&rqstp->rq_idle);
> +	if (!le) {
> +		/* lost a race, someone else need to clean up */
> +		set_bit(SP_CLEANUP, &pool->sp_flags);
> +		svc_pool_wake_idle_thread(rqstp->rq_server,
> +					  pool);
> +		goto retry;
> +	}
> +	if (!le->next)
> +		return true;
> +	last = le;
> +	while (last->next) {
> +		rqstp = list_entry(last->next, struct svc_rqst, rq_idle);
> +		if (!test_bit(RQ_CLEAN_ME, &rqstp->rq_flags)) {
> +			last = last->next;
> +			continue;
> +		}
> +		last->next = last->next->next;
> +		rqstp->rq_idle.next = &rqstp->rq_idle;
> +		wake_up_process(rqstp->rq_task);
> +	}
> +	if (last != le)
> +		llist_add_batch(le->next, last, &pool->sp_idle_threads);
> +	le->next = le;
> +	return true;
> +}
> +
>  void
>  svc_exit_thread(struct svc_rqst *rqstp)
>  {
>  	struct svc_serv	*serv = rqstp->rq_server;
>  	struct svc_pool	*pool = rqstp->rq_pool;
>  
> +	while (rqstp->rq_idle.next != &rqstp->rq_idle) {

Helper, maybe?


> +		/* Still on the idle list. */
> +		if (llist_del_first_this(&pool->sp_idle_threads,
> +					 &rqstp->rq_idle)) {
> +			/* Safely removed */
> +			rqstp->rq_idle.next = &rqstp->rq_idle;
> +		} else {
> +			set_current_state(TASK_UNINTERRUPTIBLE);
> +			set_bit(RQ_CLEAN_ME, &rqstp->rq_flags);
> +			set_bit(SP_CLEANUP, &pool->sp_flags);
> +			svc_pool_wake_idle_thread(serv, pool);
> +			if (!svc_dequeue_rqst(rqstp))
> +				schedule();
> +			__set_current_state(TASK_RUNNING);
> +		}
> +	}
>  	spin_lock_bh(&pool->sp_lock);
>  	pool->sp_nrthreads--;
> -	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> -		list_del_rcu(&rqstp->rq_all);
>  	spin_unlock_bh(&pool->sp_lock);
>  
>  	spin_lock_bh(&serv->sv_lock);
> @@ -948,6 +1003,8 @@ svc_exit_thread(struct svc_rqst *rqstp)
>  	spin_unlock_bh(&serv->sv_lock);
>  	svc_sock_update_bufs(serv);
>  
> +	if (test_bit(RQ_VICTIM, &rqstp->rq_flags))
> +		clear_and_wake_up_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
>  	svc_rqst_free(rqstp);
>  
>  	svc_put(serv);
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index c4521bce1f27..d51587cd8d99 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -446,7 +446,6 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
>   */
>  void svc_xprt_enqueue(struct svc_xprt *xprt)
>  {
> -	struct svc_rqst *rqstp;
>  	struct svc_pool *pool;
>  
>  	if (!svc_xprt_ready(xprt))
> @@ -467,20 +466,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>  	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
>  	spin_unlock_bh(&pool->sp_lock);
>  
> -	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
> -	if (!rqstp) {
> +	if (!svc_pool_wake_idle_thread(xprt->xpt_server, pool)) {
>  		set_bit(SP_CONGESTED, &pool->sp_flags);
>  		return;
>  	}
>  
> -	trace_svc_xprt_enqueue(xprt, rqstp);
> +	// trace_svc_xprt_enqueue(xprt, rqstp);
>  }
>  EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
>  
>  /*
>   * Dequeue the first transport, if there is one.
>   */
> -static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
> +static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool, bool *more)
>  {
>  	struct svc_xprt	*xprt = NULL;
>  
> @@ -493,6 +491,7 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
>  					struct svc_xprt, xpt_ready);
>  		list_del_init(&xprt->xpt_ready);
>  		svc_xprt_get(xprt);
> +		*more = !list_empty(&pool->sp_sockets);
>  	}
>  	spin_unlock_bh(&pool->sp_lock);
>  out:
> @@ -577,15 +576,13 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
>  void svc_wake_up(struct svc_serv *serv)
>  {
>  	struct svc_pool *pool = &serv->sv_pools[0];
> -	struct svc_rqst *rqstp;
>  
> -	rqstp = svc_pool_wake_idle_thread(serv, pool);
> -	if (!rqstp) {
> +	if (!svc_pool_wake_idle_thread(serv, pool)) {
>  		set_bit(SP_TASK_PENDING, &pool->sp_flags);
>  		return;
>  	}
>  
> -	trace_svc_wake_up(rqstp);
> +	// trace_svc_wake_up(rqstp);
>  }
>  EXPORT_SYMBOL_GPL(svc_wake_up);
>  
> @@ -676,7 +673,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>  			continue;
>  
>  		set_current_state(TASK_INTERRUPTIBLE);
> -		if (signalled() || kthread_should_stop()) {
> +		if (signalled() || svc_should_stop(rqstp)) {
>  			set_current_state(TASK_RUNNING);
>  			return -EINTR;
>  		}
> @@ -706,7 +703,10 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  	struct svc_pool		*pool = rqstp->rq_pool;
>  
>  	/* did someone call svc_wake_up? */
> -	if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags))
> +	if (test_bit(SP_TASK_PENDING, &pool->sp_flags))
> +		return false;
> +	if (test_bit(SP_CLEANUP, &pool->sp_flags))
> +		/* a signalled thread needs to be released */
>  		return false;
>  
>  	/* was a socket queued? */
> @@ -714,7 +714,7 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  		return false;
>  
>  	/* are we shutting down? */
> -	if (signalled() || kthread_should_stop())
> +	if (signalled() || svc_should_stop(rqstp))
>  		return false;
>  
>  	/* are we freezing? */
> @@ -728,11 +728,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_pool		*pool = rqstp->rq_pool;
>  	long			time_left = 0;
> +	bool			more = false;
>  
> -	/* rq_xprt should be clear on entry */
> -	WARN_ON_ONCE(rqstp->rq_xprt);
> -
> -	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> +	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
>  	if (rqstp->rq_xprt) {
>  		trace_svc_pool_polled(pool, rqstp);
>  		goto out_found;
> @@ -743,11 +741,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	 * to bring down the daemons ...
>  	 */
>  	set_current_state(TASK_INTERRUPTIBLE);
> -	smp_mb__before_atomic();
> -	clear_bit(SP_CONGESTED, &pool->sp_flags);
> -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
> +	clear_bit_unlock(SP_CONGESTED, &pool->sp_flags);
>  
> +	llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> +again:
>  	if (likely(rqst_should_sleep(rqstp)))
>  		time_left = schedule_timeout(timeout);
>  	else
> @@ -755,9 +752,20 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  
>  	try_to_freeze();
>  
> -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
> -	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> +	if (!svc_dequeue_rqst(rqstp)) {
> +		if (signalled())
> +			/* Can only return while on idle list if signalled */
> +			return ERR_PTR(-EINTR);
> +		/* Still on the idle list */
> +		goto again;
> +	}
> +
> +	clear_bit(SP_TASK_PENDING, &pool->sp_flags);
> +
> +	if (svc_should_stop(rqstp))
> +		return ERR_PTR(-EINTR);
> +
> +	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
>  	if (rqstp->rq_xprt) {
>  		trace_svc_pool_awoken(pool, rqstp);
>  		goto out_found;
> @@ -766,10 +774,11 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	if (!time_left)
>  		percpu_counter_inc(&pool->sp_threads_timedout);
>  
> -	if (signalled() || kthread_should_stop())
> -		return ERR_PTR(-EINTR);
>  	return ERR_PTR(-EAGAIN);
>  out_found:
> +	if (more)
> +		svc_pool_wake_idle_thread(rqstp->rq_server, pool);
> +

I'm thinking that dealing with more work should be implemented as a
separate optimization (i.e., a subsequent patch).


>  	/* Normally we will wait up to 5 seconds for any required
>  	 * cache information to be provided.
>  	 */
> @@ -866,7 +875,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>  	try_to_freeze();
>  	cond_resched();
>  	err = -EINTR;
> -	if (signalled() || kthread_should_stop())
> +	if (signalled() || svc_should_stop(rqstp))
>  		goto out;
>  
>  	xprt = svc_get_next_xprt(rqstp, timeout);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10  0:41       ` [PATCH/RFC] sunrpc: constant-time code to wake idle thread NeilBrown
  2023-07-10 14:41         ` Chuck Lever
@ 2023-07-10 16:21         ` Chuck Lever
  2023-07-10 22:18           ` NeilBrown
  1 sibling, 1 reply; 41+ messages in thread
From: Chuck Lever @ 2023-07-10 16:21 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Mon, Jul 10, 2023 at 10:41:38AM +1000, NeilBrown wrote:
> 
> The patch shows an alternate approach to the recent patches which improve
> the latency for waking an idle thread when work is ready.
> 
> The current approach involves searching a linked list for an idle thread
> to wake.  The recent patches instead use a bitmap search to find the
> idle thread.  With this patch no search is needed - an idle thread is
> directly available without searching.
> 
> The idle threads are kept in an "llist" - there is no longer a list of
> all threads.
> 
> The llist ADT does not allow concurrent delete_first operations, so to
> wake an idle thread we simply wake it and do not remove it from the
> list.
> When the thread is scheduled it will remove itself - which is safe - and
> will wake the next thread if there is more work to do (and if there is
> another thread).
> 
> The "remove itself" requires and addition to the llist api.
> "llist_del_first_this()" removes a given item if is it the first.
> Multiple callers can call this concurrently as along as they each give a
> different "this", so each thread can safely try to remove itself.  It
> must be prepared for failure.
> 
> Reducing the thread count currently requires finding any thread, idle or
> not, and calling kthread_stop().  This is no longer possible as we don't
> have a list of all threads (though I guess we could keep the list if we
> wanted to...).  Instead the pool is marked NEED_VICTIM and the next
> thread to go idle will become the VICTIM and duly exit - signalling
> this by clearing VICTIM_REMAINS.  We replace the kthread_should_stop()
> call with a new svc_should_stop() which checks and sets victim flags.
> 
> nfsd threads can currently be told to exit with a signal.  It might be
> time to deprecate/remove this feature.  However this patch does support
> it.
> 
> If the signalled thread is not at the head of the idle list it cannot
> remove itself.  In this case it sets RQ_CLEAN_ME and SP_CLEANUP and the
> next thread to wake up will use llist_del_all_this() to remove all
> threads from the idle list.  It then finds and removes any RQ_CLEAN_ME
> threads and puts the rest back on the list.
> 
> There is quite a bit of churn here so it will need careful testing.
> In fact - it doesn't handle nfsv4 callback handling threads properly as
> they don't wait the same way that other threads wait...  I'll need to
> think about that but I don't have time just now.
> 
> For now it is primarily an RFC.  I haven't given a lot of thought to
> trace points.
> 
> To apply it you will need
> 
>  SUNRPC: Deduplicate thread wake-up code
>  SUNRPC: Report when no service thread is available.
>  SUNRPC: Split the svc_xprt_dequeue tracepoint
>  SUNRPC: Clean up svc_set_num_threads
>  SUNRPC: Replace dprintk() call site in __svc_create()
> 
> from recent post by Chuck.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  fs/lockd/svc.c                |   4 +-
>  fs/lockd/svclock.c            |   4 +-
>  fs/nfs/callback.c             |   5 +-
>  fs/nfsd/nfssvc.c              |   3 +-
>  include/linux/llist.h         |   4 +
>  include/linux/lockd/lockd.h   |   2 +-
>  include/linux/sunrpc/svc.h    |  55 +++++++++-----
>  include/trace/events/sunrpc.h |   7 +-
>  lib/llist.c                   |  51 +++++++++++++
>  net/sunrpc/svc.c              | 139 ++++++++++++++++++++++++----------
>  net/sunrpc/svc_xprt.c         |  61 ++++++++-------
>  11 files changed, 239 insertions(+), 96 deletions(-)
> 
> diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> index 22d3ff3818f5..df295771bd40 100644
> --- a/fs/lockd/svc.c
> +++ b/fs/lockd/svc.c
> @@ -147,7 +147,7 @@ lockd(void *vrqstp)
>  	 * The main request loop. We don't terminate until the last
>  	 * NFS mount or NFS daemon has gone away.
>  	 */
> -	while (!kthread_should_stop()) {
> +	while (!svc_should_stop(rqstp)) {
>  		long timeout = MAX_SCHEDULE_TIMEOUT;
>  		RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
>  
> @@ -160,7 +160,7 @@ lockd(void *vrqstp)
>  			continue;
>  		}
>  
> -		timeout = nlmsvc_retry_blocked();
> +		timeout = nlmsvc_retry_blocked(rqstp);
>  
>  		/*
>  		 * Find a socket with data available and call its
> diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> index c43ccdf28ed9..54b679fcbcab 100644
> --- a/fs/lockd/svclock.c
> +++ b/fs/lockd/svclock.c
> @@ -1009,13 +1009,13 @@ retry_deferred_block(struct nlm_block *block)
>   * be retransmitted.
>   */
>  unsigned long
> -nlmsvc_retry_blocked(void)
> +nlmsvc_retry_blocked(struct svc_rqst *rqstp)
>  {
>  	unsigned long	timeout = MAX_SCHEDULE_TIMEOUT;
>  	struct nlm_block *block;
>  
>  	spin_lock(&nlm_blocked_lock);
> -	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
> +	while (!list_empty(&nlm_blocked) && !svc_should_stop(rqstp)) {
>  		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
>  
>  		if (block->b_when == NLM_NEVER)
> diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
> index 456af7d230cf..646425f1dc36 100644
> --- a/fs/nfs/callback.c
> +++ b/fs/nfs/callback.c
> @@ -111,7 +111,7 @@ nfs41_callback_svc(void *vrqstp)
>  
>  	set_freezable();
>  
> -	while (!kthread_freezable_should_stop(NULL)) {
> +	while (!svc_should_stop(rqstp)) {
>  
>  		if (signal_pending(current))
>  			flush_signals(current);
> @@ -130,10 +130,11 @@ nfs41_callback_svc(void *vrqstp)
>  				error);
>  		} else {
>  			spin_unlock_bh(&serv->sv_cb_lock);
> -			if (!kthread_should_stop())
> +			if (!svc_should_stop(rqstp))
>  				schedule();
>  			finish_wait(&serv->sv_cb_waitq, &wq);
>  		}
> +		try_to_freeze();
>  	}
>  
>  	svc_exit_thread(rqstp);
> diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> index 9c7b1ef5be40..7cfa7f2e9bf7 100644
> --- a/fs/nfsd/nfssvc.c
> +++ b/fs/nfsd/nfssvc.c
> @@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
>   * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
>   * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
>   * nn->keep_active is set).  That number of nfsd threads must
> - * exist and each must be listed in ->sp_all_threads in some entry of
> - * ->sv_pools[].
> + * exist.
>   *
>   * Each active thread holds a counted reference on nn->nfsd_serv, as does
>   * the nn->keep_active flag and various transient calls to svc_get().
> diff --git a/include/linux/llist.h b/include/linux/llist.h
> index 85bda2d02d65..5a22499844c8 100644
> --- a/include/linux/llist.h
> +++ b/include/linux/llist.h
> @@ -248,6 +248,10 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
>  }
>  
>  extern struct llist_node *llist_del_first(struct llist_head *head);
> +extern struct llist_node *llist_del_first_this(struct llist_head *head,
> +					       struct llist_node *this);
> +extern struct llist_node *llist_del_all_this(struct llist_head *head,
> +					     struct llist_node *this);
>  
>  struct llist_node *llist_reverse_order(struct llist_node *head);
>  
> diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> index f42594a9efe0..c48020e7ee08 100644
> --- a/include/linux/lockd/lockd.h
> +++ b/include/linux/lockd/lockd.h
> @@ -280,7 +280,7 @@ __be32		  nlmsvc_testlock(struct svc_rqst *, struct nlm_file *,
>  			struct nlm_host *, struct nlm_lock *,
>  			struct nlm_lock *, struct nlm_cookie *);
>  __be32		  nlmsvc_cancel_blocked(struct net *net, struct nlm_file *, struct nlm_lock *);
> -unsigned long	  nlmsvc_retry_blocked(void);
> +unsigned long	  nlmsvc_retry_blocked(struct svc_rqst *rqstp);
>  void		  nlmsvc_traverse_blocks(struct nlm_host *, struct nlm_file *,
>  					nlm_host_match_fn_t match);
>  void		  nlmsvc_grant_reply(struct nlm_cookie *, __be32);
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 366f2b6b689c..cb2497b977c1 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -31,11 +31,11 @@
>   * node traffic on multi-node NUMA NFS servers.
>   */
>  struct svc_pool {
> -	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
> -	spinlock_t		sp_lock;	/* protects all fields */
> +	unsigned int		sp_id;		/* pool id; also node id on NUMA */
> +	spinlock_t		sp_lock;	/* protects sp_sockets and sp_nrthreads */
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> -	struct list_head	sp_all_threads;	/* all server threads */
> +	struct llist_head	sp_idle_threads;/* idle server threads */

Actually... Lorenzo reminded me that we need to retain a mechanism
that can iterate through all the threads in a pool. The xarray
replaces the "all_threads" list in this regard.
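
For instance, iterating the pool's threads via that xarray might look
roughly like this (a sketch only; the sp_thread_xa field name is an
assumption, not taken from the posted patches):

	struct svc_rqst *rqstp;
	unsigned long index;

	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
		/* visit every thread in the pool, e.g. when shutting down */
		wake_up_process(rqstp->rq_task);
	}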


>  	/* statistics on pool operation */
>  	struct percpu_counter	sp_sockets_queued;
> @@ -43,12 +43,17 @@ struct svc_pool {
>  	struct percpu_counter	sp_threads_timedout;
>  	struct percpu_counter	sp_threads_starved;
>  
> -#define	SP_TASK_PENDING		(0)		/* still work to do even if no
> -						 * xprt is queued. */
> -#define SP_CONGESTED		(1)
>  	unsigned long		sp_flags;
>  } ____cacheline_aligned_in_smp;
>  
> +enum svc_sp_flags {
> +	SP_TASK_PENDING,	/* still work to do even if no xprt is queued */
> +	SP_CONGESTED,
> +	SP_NEED_VICTIM,		/* One thread needs to agree to exit */
> +	SP_VICTIM_REMAINS,	/* One thread needs to actually exit */
> +	SP_CLEANUP,		/* A thread has set RQ_CLEAN_ME */
> +};
> +
>  /*
>   * RPC service.
>   *
> @@ -195,7 +200,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   * processed.
>   */
>  struct svc_rqst {
> -	struct list_head	rq_all;		/* all threads list */
> +	struct llist_node	rq_idle;	/* On pool's idle list */
>  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
>  	struct svc_xprt *	rq_xprt;	/* transport ptr */
>  
> @@ -233,16 +238,6 @@ struct svc_rqst {
>  	u32			rq_proc;	/* procedure number */
>  	u32			rq_prot;	/* IP protocol */
>  	int			rq_cachetype;	/* catering to nfsd */
> -#define	RQ_SECURE	(0)			/* secure port */
> -#define	RQ_LOCAL	(1)			/* local request */
> -#define	RQ_USEDEFERRAL	(2)			/* use deferral */
> -#define	RQ_DROPME	(3)			/* drop current reply */
> -#define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> -						 * to prevent encrypting page
> -						 * cache pages */
> -#define	RQ_VICTIM	(5)			/* about to be shut down */
> -#define	RQ_BUSY		(6)			/* request is busy */
> -#define	RQ_DATA		(7)			/* request has data */
>  	unsigned long		rq_flags;	/* flags field */
>  	ktime_t			rq_qtime;	/* enqueue time */
>  
> @@ -274,6 +269,20 @@ struct svc_rqst {
>  	void **			rq_lease_breaker; /* The v4 client breaking a lease */
>  };
>  
> +enum svc_rq_flags {
> +	RQ_SECURE,			/* secure port */
> +	RQ_LOCAL,			/* local request */
> +	RQ_USEDEFERRAL,			/* use deferral */
> +	RQ_DROPME,			/* drop current reply */
> +	RQ_SPLICE_OK,			/* turned off in gss privacy
> +					 * to prevent encrypting page
> +					 * cache pages */
> +	RQ_VICTIM,			/* agreed to shut down */
> +	RQ_DATA,			/* request has data */
> +	RQ_CLEAN_ME,			/* Thread needs to exit but
> +					 * is on the idle list */
> +};
> +
>  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
>  
>  /*
> @@ -309,6 +318,15 @@ static inline struct sockaddr *svc_daddr(const struct svc_rqst *rqst)
>  	return (struct sockaddr *) &rqst->rq_daddr;
>  }
>  
> +static inline bool svc_should_stop(struct svc_rqst *rqstp)
> +{
> +	if (test_and_clear_bit(SP_NEED_VICTIM, &rqstp->rq_pool->sp_flags)) {
> +		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> +		return true;
> +	}
> +	return test_bit(RQ_VICTIM, &rqstp->rq_flags);
> +}
> +
>  struct svc_deferred_req {
>  	u32			prot;	/* protocol (UDP or TCP) */
>  	struct svc_xprt		*xprt;
> @@ -416,6 +434,7 @@ bool		   svc_rqst_replace_page(struct svc_rqst *rqstp,
>  void		   svc_rqst_release_pages(struct svc_rqst *rqstp);
>  void		   svc_rqst_free(struct svc_rqst *);
>  void		   svc_exit_thread(struct svc_rqst *);
> +bool		   svc_dequeue_rqst(struct svc_rqst *rqstp);
>  struct svc_serv *  svc_create_pooled(struct svc_program *, unsigned int,
>  				     int (*threadfn)(void *data));
>  int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
> @@ -428,7 +447,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
>  
>  void		   svc_wake_up(struct svc_serv *);
>  void		   svc_reserve(struct svc_rqst *rqstp, int space);
> -struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_serv *serv,
> +bool		   svc_pool_wake_idle_thread(struct svc_serv *serv,
>  					     struct svc_pool *pool);
>  struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
>  char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
> diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> index f6fd48961074..f63289d1491d 100644
> --- a/include/trace/events/sunrpc.h
> +++ b/include/trace/events/sunrpc.h
> @@ -1601,7 +1601,7 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
>  	svc_rqst_flag(DROPME)						\
>  	svc_rqst_flag(SPLICE_OK)					\
>  	svc_rqst_flag(VICTIM)						\
> -	svc_rqst_flag(BUSY)						\
> +	svc_rqst_flag(CLEAN_ME)						\
>  	svc_rqst_flag_end(DATA)
>  
>  #undef svc_rqst_flag
> @@ -1965,7 +1965,10 @@ TRACE_EVENT(svc_xprt_enqueue,
>  #define show_svc_pool_flags(x)						\
>  	__print_flags(x, "|",						\
>  		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
> -		{ BIT(SP_CONGESTED),		"CONGESTED" })
> +		{ BIT(SP_CONGESTED),		"CONGESTED" },		\
> +		{ BIT(SP_NEED_VICTIM),		"NEED_VICTIM" },	\
> +		{ BIT(SP_VICTIM_REMAINS),	"VICTIM_REMAINS" },	\
> +		{ BIT(SP_CLEANUP),		"CLEANUP" })
>  
>  DECLARE_EVENT_CLASS(svc_pool_scheduler_class,
>  	TP_PROTO(
> diff --git a/lib/llist.c b/lib/llist.c
> index 6e668fa5a2c6..660be07795ac 100644
> --- a/lib/llist.c
> +++ b/lib/llist.c
> @@ -65,6 +65,57 @@ struct llist_node *llist_del_first(struct llist_head *head)
>  }
>  EXPORT_SYMBOL_GPL(llist_del_first);
>  
> +/**
> + * llist_del_first_this - delete given entry of lock-less list if it is first
> + * @head:	the head for your lock-less list
> + * @this:	a list entry.
> + *
> + * If head of the list is given entry, delete and return it, else
> + * return %NULL.
> + *
> + * Providing the caller has exclusive access to @this, multiple callers can
> + * safely call this concurrently with multiple llist_add() callers.
> + */
> +struct llist_node *llist_del_first_this(struct llist_head *head,
> +					struct llist_node *this)
> +{
> +	struct llist_node *entry, *next;
> +
> +	entry = smp_load_acquire(&head->first);
> +	do {
> +		if (entry != this)
> +			return NULL;
> +		next = READ_ONCE(entry->next);
> +	} while (!try_cmpxchg(&head->first, &entry, next));
> +
> +	return entry;
> +}
> +EXPORT_SYMBOL_GPL(llist_del_first_this);
> +
> +/**
> + * llist_del_all_this - delete all entries from lock-less list if first is the given element
> + * @head:	the head of lock-less list to delete all entries
> + * @this:	the expected first element.
> + *
> + * If the first element of the list is @this, delete all elements and
> + * return them, else return %NULL.  Providing the caller has exclusive access
> + * to @this, multiple concurrent callers can call this or llist_del_first_this()
> + * simultaneously with multiple callers of llist_add().
> + */
> +struct llist_node *llist_del_all_this(struct llist_head *head,
> +				      struct llist_node *this)
> +{
> +	struct llist_node *entry;
> +
> +	entry = smp_load_acquire(&head->first);
> +	do {
> +		if (entry != this)
> +			return NULL;
> +	} while (!try_cmpxchg(&head->first, &entry, NULL));
> +
> +	return entry;
> +}
> +
>  /**
>   * llist_reverse_order - reverse order of a llist chain
>   * @head:	first item of the list to be reversed
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index ffb7200e8257..55339cbbbc6e 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -507,7 +507,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  
>  		pool->sp_id = i;
>  		INIT_LIST_HEAD(&pool->sp_sockets);
> -		INIT_LIST_HEAD(&pool->sp_all_threads);
> +		init_llist_head(&pool->sp_idle_threads);
>  		spin_lock_init(&pool->sp_lock);
>  
>  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> @@ -652,9 +652,9 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
>  
>  	pagevec_init(&rqstp->rq_pvec);
>  
> -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
>  	rqstp->rq_server = serv;
>  	rqstp->rq_pool = pool;
> +	rqstp->rq_idle.next = &rqstp->rq_idle;
>  
>  	rqstp->rq_scratch_page = alloc_pages_node(node, GFP_KERNEL, 0);
>  	if (!rqstp->rq_scratch_page)
> @@ -694,7 +694,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  
>  	spin_lock_bh(&pool->sp_lock);
>  	pool->sp_nrthreads++;
> -	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
>  	spin_unlock_bh(&pool->sp_lock);
>  	return rqstp;
>  }
> @@ -704,32 +703,34 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>   * @serv: RPC service
>   * @pool: service thread pool
>   *
> - * Returns an idle service thread (now marked BUSY), or NULL
> - * if no service threads are available. Finding an idle service
> - * thread and marking it BUSY is atomic with respect to other
> - * calls to svc_pool_wake_idle_thread().
> + * If there are any idle threads in the pool, wake one up and return
> + * %true, else return %false.  The thread will become non-idle once
> + * the scheduler schedules it, at which point it might wake another
> + * thread if there seems to be enough work to justify that.
>   */
> -struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> -					   struct svc_pool *pool)
> +bool svc_pool_wake_idle_thread(struct svc_serv *serv,
> +			       struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> +	struct llist_node *ln;
>  
>  	rcu_read_lock();
> -	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> -			continue;
> -
> -		rcu_read_unlock();
> +	ln = READ_ONCE(pool->sp_idle_threads.first);
> +	if (ln) {
> +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> -		wake_up_process(rqstp->rq_task);
> -		percpu_counter_inc(&pool->sp_threads_woken);
> -		return rqstp;
> +		if (!task_is_running(rqstp->rq_task)) {
> +			wake_up_process(rqstp->rq_task);
> +			percpu_counter_inc(&pool->sp_threads_woken);
> +		}
> +		rcu_read_unlock();
> +		return true;
>  	}
>  	rcu_read_unlock();
>  
>  	trace_svc_pool_starved(serv, pool);
>  	percpu_counter_inc(&pool->sp_threads_starved);
> -	return NULL;
> +	return false;
>  }
>  
>  static struct svc_pool *
> @@ -738,19 +739,22 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>  	return pool ? pool : &serv->sv_pools[(*state)++ % serv->sv_nrpools];
>  }
>  
> -static struct task_struct *
> +static struct svc_pool *
>  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
>  {
> -	unsigned int i;
> -	struct task_struct *task = NULL;
>  
>  	if (pool != NULL) {
>  		spin_lock_bh(&pool->sp_lock);
> +		if (pool->sp_nrthreads > 0)
> +			goto found_pool;
> +		spin_unlock_bh(&pool->sp_lock);
> +		return NULL;
>  	} else {
> +		unsigned int i;
>  		for (i = 0; i < serv->sv_nrpools; i++) {
>  			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
>  			spin_lock_bh(&pool->sp_lock);
> -			if (!list_empty(&pool->sp_all_threads))
> +			if (pool->sp_nrthreads > 0)
>  				goto found_pool;
>  			spin_unlock_bh(&pool->sp_lock);
>  		}
> @@ -758,16 +762,10 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
>  	}
>  
>  found_pool:
> -	if (!list_empty(&pool->sp_all_threads)) {
> -		struct svc_rqst *rqstp;
> -
> -		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> -		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> -		list_del_rcu(&rqstp->rq_all);
> -		task = rqstp->rq_task;
> -	}
> +	set_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
> +	set_bit(SP_NEED_VICTIM, &pool->sp_flags);
>  	spin_unlock_bh(&pool->sp_lock);
> -	return task;
> +	return pool;
>  }
>  
>  static int
> @@ -808,18 +806,16 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>  static int
>  svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
>  {
> -	struct svc_rqst	*rqstp;
> -	struct task_struct *task;
>  	unsigned int state = serv->sv_nrthreads-1;
> +	struct svc_pool *vpool;
>  
>  	do {
> -		task = svc_pool_victim(serv, pool, &state);
> -		if (task == NULL)
> +		vpool = svc_pool_victim(serv, pool, &state);
> +		if (vpool == NULL)
>  			break;
> -		rqstp = kthread_data(task);
> -		/* Did we lose a race to svo_function threadfn? */
> -		if (kthread_stop(task) == -EINTR)
> -			svc_exit_thread(rqstp);
> +		svc_pool_wake_idle_thread(serv, vpool);
> +		wait_on_bit(&vpool->sp_flags, SP_VICTIM_REMAINS,
> +			    TASK_UNINTERRUPTIBLE);
>  		nrservs++;
>  	} while (nrservs < 0);
>  	return 0;
> @@ -931,16 +927,75 @@ svc_rqst_free(struct svc_rqst *rqstp)
>  }
>  EXPORT_SYMBOL_GPL(svc_rqst_free);
>  
> +bool svc_dequeue_rqst(struct svc_rqst *rqstp)
> +{
> +	struct svc_pool *pool = rqstp->rq_pool;
> +	struct llist_node *le, *last;
> +
> +retry:
> +	if (pool->sp_idle_threads.first != &rqstp->rq_idle)
> +		/* Not at head of queue, so cannot wake up */
> +		return false;
> +	if (!test_and_clear_bit(SP_CLEANUP, &pool->sp_flags)) {
> +		le = llist_del_first_this(&pool->sp_idle_threads,
> +					  &rqstp->rq_idle);
> +		if (le)
> +			le->next = le;
> +		return !!le;
> +	}
> +	/* Need to deal with an RQ_CLEAN_ME thread */
> +	le = llist_del_all_this(&pool->sp_idle_threads,
> +				&rqstp->rq_idle);
> +	if (!le) {
> +		/* lost a race, someone else needs to clean up */
> +		set_bit(SP_CLEANUP, &pool->sp_flags);
> +		svc_pool_wake_idle_thread(rqstp->rq_server,
> +					  pool);
> +		goto retry;
> +	}
> +	if (!le->next)
> +		return true;
> +	last = le;
> +	while (last->next) {
> +		rqstp = list_entry(last->next, struct svc_rqst, rq_idle);
> +		if (!test_bit(RQ_CLEAN_ME, &rqstp->rq_flags)) {
> +			last = last->next;
> +			continue;
> +		}
> +		last->next = last->next->next;
> +		rqstp->rq_idle.next = &rqstp->rq_idle;
> +		wake_up_process(rqstp->rq_task);
> +	}
> +	if (last != le)
> +		llist_add_batch(le->next, last, &pool->sp_idle_threads);
> +	le->next = le;
> +	return true;
> +}
> +
>  void
>  svc_exit_thread(struct svc_rqst *rqstp)
>  {
>  	struct svc_serv	*serv = rqstp->rq_server;
>  	struct svc_pool	*pool = rqstp->rq_pool;
>  
> +	while (rqstp->rq_idle.next != &rqstp->rq_idle) {
> +		/* Still on the idle list. */
> +		if (llist_del_first_this(&pool->sp_idle_threads,
> +					 &rqstp->rq_idle)) {
> +			/* Safely removed */
> +			rqstp->rq_idle.next = &rqstp->rq_idle;
> +		} else {
> +			set_current_state(TASK_UNINTERRUPTIBLE);
> +			set_bit(RQ_CLEAN_ME, &rqstp->rq_flags);
> +			set_bit(SP_CLEANUP, &pool->sp_flags);
> +			svc_pool_wake_idle_thread(serv, pool);
> +			if (!svc_dequeue_rqst(rqstp))
> +				schedule();
> +			__set_current_state(TASK_RUNNING);
> +		}
> +	}
>  	spin_lock_bh(&pool->sp_lock);
>  	pool->sp_nrthreads--;
> -	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> -		list_del_rcu(&rqstp->rq_all);
>  	spin_unlock_bh(&pool->sp_lock);
>  
>  	spin_lock_bh(&serv->sv_lock);
> @@ -948,6 +1003,8 @@ svc_exit_thread(struct svc_rqst *rqstp)
>  	spin_unlock_bh(&serv->sv_lock);
>  	svc_sock_update_bufs(serv);
>  
> +	if (test_bit(RQ_VICTIM, &rqstp->rq_flags))
> +		clear_and_wake_up_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
>  	svc_rqst_free(rqstp);
>  
>  	svc_put(serv);
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index c4521bce1f27..d51587cd8d99 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -446,7 +446,6 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
>   */
>  void svc_xprt_enqueue(struct svc_xprt *xprt)
>  {
> -	struct svc_rqst *rqstp;
>  	struct svc_pool *pool;
>  
>  	if (!svc_xprt_ready(xprt))
> @@ -467,20 +466,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
>  	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
>  	spin_unlock_bh(&pool->sp_lock);
>  
> -	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
> -	if (!rqstp) {
> +	if (!svc_pool_wake_idle_thread(xprt->xpt_server, pool)) {
>  		set_bit(SP_CONGESTED, &pool->sp_flags);
>  		return;
>  	}
>  
> -	trace_svc_xprt_enqueue(xprt, rqstp);
> +	// trace_svc_xprt_enqueue(xprt, rqstp);
>  }
>  EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
>  
>  /*
>   * Dequeue the first transport, if there is one.
>   */
> -static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
> +static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool, bool *more)
>  {
>  	struct svc_xprt	*xprt = NULL;
>  
> @@ -493,6 +491,7 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
>  					struct svc_xprt, xpt_ready);
>  		list_del_init(&xprt->xpt_ready);
>  		svc_xprt_get(xprt);
> +		*more = !list_empty(&pool->sp_sockets);
>  	}
>  	spin_unlock_bh(&pool->sp_lock);
>  out:
> @@ -577,15 +576,13 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
>  void svc_wake_up(struct svc_serv *serv)
>  {
>  	struct svc_pool *pool = &serv->sv_pools[0];
> -	struct svc_rqst *rqstp;
>  
> -	rqstp = svc_pool_wake_idle_thread(serv, pool);
> -	if (!rqstp) {
> +	if (!svc_pool_wake_idle_thread(serv, pool)) {
>  		set_bit(SP_TASK_PENDING, &pool->sp_flags);
>  		return;
>  	}
>  
> -	trace_svc_wake_up(rqstp);
> +	// trace_svc_wake_up(rqstp);
>  }
>  EXPORT_SYMBOL_GPL(svc_wake_up);
>  
> @@ -676,7 +673,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>  			continue;
>  
>  		set_current_state(TASK_INTERRUPTIBLE);
> -		if (signalled() || kthread_should_stop()) {
> +		if (signalled() || svc_should_stop(rqstp)) {
>  			set_current_state(TASK_RUNNING);
>  			return -EINTR;
>  		}
> @@ -706,7 +703,10 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  	struct svc_pool		*pool = rqstp->rq_pool;
>  
>  	/* did someone call svc_wake_up? */
> -	if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags))
> +	if (test_bit(SP_TASK_PENDING, &pool->sp_flags))
> +		return false;
> +	if (test_bit(SP_CLEANUP, &pool->sp_flags))
> +		/* a signalled thread needs to be released */
>  		return false;
>  
>  	/* was a socket queued? */
> @@ -714,7 +714,7 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  		return false;
>  
>  	/* are we shutting down? */
> -	if (signalled() || kthread_should_stop())
> +	if (signalled() || svc_should_stop(rqstp))
>  		return false;
>  
>  	/* are we freezing? */
> @@ -728,11 +728,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  {
>  	struct svc_pool		*pool = rqstp->rq_pool;
>  	long			time_left = 0;
> +	bool			more = false;
>  
> -	/* rq_xprt should be clear on entry */
> -	WARN_ON_ONCE(rqstp->rq_xprt);
> -
> -	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> +	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
>  	if (rqstp->rq_xprt) {
>  		trace_svc_pool_polled(pool, rqstp);
>  		goto out_found;
> @@ -743,11 +741,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	 * to bring down the daemons ...
>  	 */
>  	set_current_state(TASK_INTERRUPTIBLE);
> -	smp_mb__before_atomic();
> -	clear_bit(SP_CONGESTED, &pool->sp_flags);
> -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
> +	clear_bit_unlock(SP_CONGESTED, &pool->sp_flags);
>  
> +	llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> +again:
>  	if (likely(rqst_should_sleep(rqstp)))
>  		time_left = schedule_timeout(timeout);
>  	else
> @@ -755,9 +752,20 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  
>  	try_to_freeze();
>  
> -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> -	smp_mb__after_atomic();
> -	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> +	if (!svc_dequeue_rqst(rqstp)) {
> +		if (signalled())
> +			/* Can only return while on idle list if signalled */
> +			return ERR_PTR(-EINTR);
> +		/* Still on the idle list */
> +		goto again;
> +	}
> +
> +	clear_bit(SP_TASK_PENDING, &pool->sp_flags);
> +
> +	if (svc_should_stop(rqstp))
> +		return ERR_PTR(-EINTR);
> +
> +	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
>  	if (rqstp->rq_xprt) {
>  		trace_svc_pool_awoken(pool, rqstp);
>  		goto out_found;
> @@ -766,10 +774,11 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
>  	if (!time_left)
>  		percpu_counter_inc(&pool->sp_threads_timedout);
>  
> -	if (signalled() || kthread_should_stop())
> -		return ERR_PTR(-EINTR);
>  	return ERR_PTR(-EAGAIN);
>  out_found:
> +	if (more)
> +		svc_pool_wake_idle_thread(rqstp->rq_server, pool);
> +
>  	/* Normally we will wait up to 5 seconds for any required
>  	 * cache information to be provided.
>  	 */
> @@ -866,7 +875,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>  	try_to_freeze();
>  	cond_resched();
>  	err = -EINTR;
> -	if (signalled() || kthread_should_stop())
> +	if (signalled() || svc_should_stop(rqstp))
>  		goto out;
>  
>  	xprt = svc_get_next_xprt(rqstp, timeout);
> -- 
> 2.40.1
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 14:41         ` Chuck Lever
@ 2023-07-10 22:18           ` NeilBrown
  2023-07-10 22:52             ` Chuck Lever III
  0 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-10 22:18 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 11 Jul 2023, Chuck Lever wrote:
> On Mon, Jul 10, 2023 at 10:41:38AM +1000, NeilBrown wrote:
> > 
> > The patch shows an alternate approach to the recent patches which improve
> > the latency for waking an idle thread when work is ready.
> > 
> > The current approach involves searching a linked list for an idle thread
> > to wake.  The recent patches instead use a bitmap search to find the
> > idle thread.  With this patch no search is needed - an idle thread is
> > directly available without searching.
> > 
> > The idle threads are kept in an "llist" - there is no longer a list of
> > all threads.
> > 
> > The llist ADT does not allow concurrent delete_first operations, so to
> > wake an idle thread we simply wake it and do not remove it from the
> > list.
> > When the thread is scheduled it will remove itself - which is safe - and
> > will wake the next thread if there is more work to do (and if there is
> > another thread).
> > 
> > The "remove itself" requires and addition to the llist api.
> > "llist_del_first_this()" removes a given item if is it the first.
> > Multiple callers can call this concurrently as along as they each give a
> > different "this", so each thread can safely try to remove itself.  It
> > must be prepared for failure.
> > 
> > Reducing the thread count currently requires finding any thread, idle or
> > not, and calling kthread_stop().  This is no longer possible as we don't
> > have a list of all threads (though I guess we could keep the list if we
> > wanted to...).  Instead the pool is marked NEED_VICTIM and the next
> > thread to go idle will become the VICTIM and duly exit - signalling
> > this by clearing VICTIM_REMAINS.  We replace the kthread_should_stop()
> > call with a new svc_should_stop() which checks and sets victim flags.
> > 
> > nfsd threads can currently be told to exit with a signal.  It might be
> > time to deprecate/remove this feature.  However this patch does support
> > it.
> > 
> > If the signalled thread is not at the head of the idle list it cannot
> > remove itself.  In this case it sets RQ_CLEAN_ME and SP_CLEANUP and the
> > next thread to wake up will use llist_del_all_this() to remove all
> > threads from the idle list.  It then finds and removes any RQ_CLEAN_ME
> > threads and puts the rest back on the list.
> > 
> > There is quite a bit of churn here so it will need careful testing.
> > In fact - it doesn't handle nfsv4 callback handling threads properly as
> > they don't wait the same way that other threads wait...  I'll need to
> > think about that but I don't have time just now.
> > 
> > For now it is primarily an RFC.  I haven't given a lot of thought to
> > trace points.
> > 
> > To apply it you will need
> > 
> >  SUNRPC: Deduplicate thread wake-up code
> >  SUNRPC: Report when no service thread is available.
> >  SUNRPC: Split the svc_xprt_dequeue tracepoint
> >  SUNRPC: Clean up svc_set_num_threads
> >  SUNRPC: Replace dprintk() call site in __svc_create()
> > 
> > from recent post by Chuck.
> 
> Hi, thanks for letting us see your pencil sketches. :-)
> 
> Later today, I'll push a topic branch to my kernel.org repo that we
> can use as a base for continuing this work.
> 
> Some initial remarks below, recognizing that this patch is still
> incomplete.
> 
> 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  fs/lockd/svc.c                |   4 +-
> >  fs/lockd/svclock.c            |   4 +-
> >  fs/nfs/callback.c             |   5 +-
> >  fs/nfsd/nfssvc.c              |   3 +-
> >  include/linux/llist.h         |   4 +
> >  include/linux/lockd/lockd.h   |   2 +-
> >  include/linux/sunrpc/svc.h    |  55 +++++++++-----
> >  include/trace/events/sunrpc.h |   7 +-
> >  lib/llist.c                   |  51 +++++++++++++
> >  net/sunrpc/svc.c              | 139 ++++++++++++++++++++++++----------
> >  net/sunrpc/svc_xprt.c         |  61 ++++++++-------
> >  11 files changed, 239 insertions(+), 96 deletions(-)
> > 
> > diff --git a/fs/lockd/svc.c b/fs/lockd/svc.c
> > index 22d3ff3818f5..df295771bd40 100644
> > --- a/fs/lockd/svc.c
> > +++ b/fs/lockd/svc.c
> > @@ -147,7 +147,7 @@ lockd(void *vrqstp)
> >  	 * The main request loop. We don't terminate until the last
> >  	 * NFS mount or NFS daemon has gone away.
> >  	 */
> > -	while (!kthread_should_stop()) {
> > +	while (!svc_should_stop(rqstp)) {
> >  		long timeout = MAX_SCHEDULE_TIMEOUT;
> >  		RPC_IFDEBUG(char buf[RPC_MAX_ADDRBUFLEN]);
> >  
> > @@ -160,7 +160,7 @@ lockd(void *vrqstp)
> >  			continue;
> >  		}
> >  
> > -		timeout = nlmsvc_retry_blocked();
> > +		timeout = nlmsvc_retry_blocked(rqstp);
> >  
> >  		/*
> >  		 * Find a socket with data available and call its
> > diff --git a/fs/lockd/svclock.c b/fs/lockd/svclock.c
> > index c43ccdf28ed9..54b679fcbcab 100644
> > --- a/fs/lockd/svclock.c
> > +++ b/fs/lockd/svclock.c
> > @@ -1009,13 +1009,13 @@ retry_deferred_block(struct nlm_block *block)
> >   * be retransmitted.
> >   */
> >  unsigned long
> > -nlmsvc_retry_blocked(void)
> > +nlmsvc_retry_blocked(struct svc_rqst *rqstp)
> >  {
> >  	unsigned long	timeout = MAX_SCHEDULE_TIMEOUT;
> >  	struct nlm_block *block;
> >  
> >  	spin_lock(&nlm_blocked_lock);
> > -	while (!list_empty(&nlm_blocked) && !kthread_should_stop()) {
> > +	while (!list_empty(&nlm_blocked) && !svc_should_stop(rqstp)) {
> >  		block = list_entry(nlm_blocked.next, struct nlm_block, b_list);
> >  
> >  		if (block->b_when == NLM_NEVER)
> > diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
> > index 456af7d230cf..646425f1dc36 100644
> > --- a/fs/nfs/callback.c
> > +++ b/fs/nfs/callback.c
> > @@ -111,7 +111,7 @@ nfs41_callback_svc(void *vrqstp)
> >  
> >  	set_freezable();
> >  
> > -	while (!kthread_freezable_should_stop(NULL)) {
> > +	while (!svc_should_stop(rqstp)) {
> >  
> >  		if (signal_pending(current))
> >  			flush_signals(current);
> > @@ -130,10 +130,11 @@ nfs41_callback_svc(void *vrqstp)
> >  				error);
> >  		} else {
> >  			spin_unlock_bh(&serv->sv_cb_lock);
> > -			if (!kthread_should_stop())
> > +			if (!svc_should_stop(rqstp))
> >  				schedule();
> >  			finish_wait(&serv->sv_cb_waitq, &wq);
> >  		}
> > +		try_to_freeze();
> >  	}
> >  
> >  	svc_exit_thread(rqstp);
> > diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
> > index 9c7b1ef5be40..7cfa7f2e9bf7 100644
> > --- a/fs/nfsd/nfssvc.c
> > +++ b/fs/nfsd/nfssvc.c
> > @@ -62,8 +62,7 @@ static __be32			nfsd_init_request(struct svc_rqst *,
> >   * If (out side the lock) nn->nfsd_serv is non-NULL, then it must point to a
> >   * properly initialised 'struct svc_serv' with ->sv_nrthreads > 0 (unless
> >   * nn->keep_active is set).  That number of nfsd threads must
> > - * exist and each must be listed in ->sp_all_threads in some entry of
> > - * ->sv_pools[].
> > + * exist.
> >   *
> >   * Each active thread holds a counted reference on nn->nfsd_serv, as does
> >   * the nn->keep_active flag and various transient calls to svc_get().
> > diff --git a/include/linux/llist.h b/include/linux/llist.h
> > index 85bda2d02d65..5a22499844c8 100644
> > --- a/include/linux/llist.h
> > +++ b/include/linux/llist.h
> > @@ -248,6 +248,10 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
> >  }
> >  
> >  extern struct llist_node *llist_del_first(struct llist_head *head);
> > +extern struct llist_node *llist_del_first_this(struct llist_head *head,
> > +					       struct llist_node *this);
> > +extern struct llist_node *llist_del_all_this(struct llist_head *head,
> > +					     struct llist_node *this);
> >  
> >  struct llist_node *llist_reverse_order(struct llist_node *head);
> >  
> > diff --git a/include/linux/lockd/lockd.h b/include/linux/lockd/lockd.h
> > index f42594a9efe0..c48020e7ee08 100644
> > --- a/include/linux/lockd/lockd.h
> > +++ b/include/linux/lockd/lockd.h
> > @@ -280,7 +280,7 @@ __be32		  nlmsvc_testlock(struct svc_rqst *, struct nlm_file *,
> >  			struct nlm_host *, struct nlm_lock *,
> >  			struct nlm_lock *, struct nlm_cookie *);
> >  __be32		  nlmsvc_cancel_blocked(struct net *net, struct nlm_file *, struct nlm_lock *);
> > -unsigned long	  nlmsvc_retry_blocked(void);
> > +unsigned long	  nlmsvc_retry_blocked(struct svc_rqst *rqstp);
> >  void		  nlmsvc_traverse_blocks(struct nlm_host *, struct nlm_file *,
> >  					nlm_host_match_fn_t match);
> >  void		  nlmsvc_grant_reply(struct nlm_cookie *, __be32);
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 366f2b6b689c..cb2497b977c1 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -31,11 +31,11 @@
> >   * node traffic on multi-node NUMA NFS servers.
> >   */
> >  struct svc_pool {
> > -	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
> > -	spinlock_t		sp_lock;	/* protects all fields */
> > +	unsigned int		sp_id;		/* pool id; also node id on NUMA */
> > +	spinlock_t		sp_lock;	/* protects sp_sockets and sp_nrthreads */
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> > -	struct list_head	sp_all_threads;	/* all server threads */
> > +	struct llist_head	sp_idle_threads;/* idle server threads */
> >  
> >  	/* statistics on pool operation */
> >  	struct percpu_counter	sp_sockets_queued;
> > @@ -43,12 +43,17 @@ struct svc_pool {
> >  	struct percpu_counter	sp_threads_timedout;
> >  	struct percpu_counter	sp_threads_starved;
> >  
> > -#define	SP_TASK_PENDING		(0)		/* still work to do even if no
> > -						 * xprt is queued. */
> > -#define SP_CONGESTED		(1)
> >  	unsigned long		sp_flags;
> >  } ____cacheline_aligned_in_smp;
> >  
> > +enum svc_sp_flags {
> 
> Let's make this an anonymous enum. Ditto below.

The name was to compensate for moving the definitions away from the
declaration for sp_flags.  It means that if I search for "sp_flags" I'll
easily find this list of bits.  I guess a comment could achieve the same
end, but I prefer code to be self-documenting where possible.
> 
> 
> > +	SP_TASK_PENDING,	/* still work to do even if no xprt is queued */
> > +	SP_CONGESTED,
> > +	SP_NEED_VICTIM,		/* One thread needs to agree to exit */
> > +	SP_VICTIM_REMAINS,	/* One thread needs to actually exit */
> > +	SP_CLEANUP,		/* A thread has set RQ_CLEAN_ME */
> > +};
> > +
> 
> Converting the bit flags to an enum seems like an unrelated clean-
> up. It isn't necessary in order to implement the new scheduler.
> Let's extract this into a separate patch that can be applied first.
> 
> Also, I'm not clear on the justification for this clean up. That
> should be explained in the patch description of the split-out
> clean-up patch(es).
> 
> 
> >  /*
> >   * RPC service.
> >   *
> > @@ -195,7 +200,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >   * processed.
> >   */
> >  struct svc_rqst {
> > -	struct list_head	rq_all;		/* all threads list */
> > +	struct llist_node	rq_idle;	/* On pool's idle list */
> >  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
> >  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> >  
> > @@ -233,16 +238,6 @@ struct svc_rqst {
> >  	u32			rq_proc;	/* procedure number */
> >  	u32			rq_prot;	/* IP protocol */
> >  	int			rq_cachetype;	/* catering to nfsd */
> > -#define	RQ_SECURE	(0)			/* secure port */
> > -#define	RQ_LOCAL	(1)			/* local request */
> > -#define	RQ_USEDEFERRAL	(2)			/* use deferral */
> > -#define	RQ_DROPME	(3)			/* drop current reply */
> > -#define	RQ_SPLICE_OK	(4)			/* turned off in gss privacy
> > -						 * to prevent encrypting page
> > -						 * cache pages */
> > -#define	RQ_VICTIM	(5)			/* about to be shut down */
> > -#define	RQ_BUSY		(6)			/* request is busy */
> > -#define	RQ_DATA		(7)			/* request has data */
> >  	unsigned long		rq_flags;	/* flags field */
> >  	ktime_t			rq_qtime;	/* enqueue time */
> >  
> > @@ -274,6 +269,20 @@ struct svc_rqst {
> >  	void **			rq_lease_breaker; /* The v4 client breaking a lease */
> >  };
> >  
> > +enum svc_rq_flags {
> > +	RQ_SECURE,			/* secure port */
> > +	RQ_LOCAL,			/* local request */
> > +	RQ_USEDEFERRAL,			/* use deferral */
> > +	RQ_DROPME,			/* drop current reply */
> > +	RQ_SPLICE_OK,			/* turned off in gss privacy
> > +					 * to prevent encrypting page
> > +					 * cache pages */
> > +	RQ_VICTIM,			/* agreed to shut down */
> > +	RQ_DATA,			/* request has data */
> > +	RQ_CLEAN_ME,			/* Thread needs to exit but
> > +					 * is on the idle list */
> > +};
> > +
> 
> Likewise here. And let's keep the flag clean-ups in separate patches.
> 
> 
> >  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> >  
> >  /*
> > @@ -309,6 +318,15 @@ static inline struct sockaddr *svc_daddr(const struct svc_rqst *rqst)
> >  	return (struct sockaddr *) &rqst->rq_daddr;
> >  }
> >  
> 
> This needs a kdoc comment and a more conventional name. How about
> svc_thread_should_stop() ?
> 
> Actually it seems like a better abstraction all around if the upper
> layers don't have to care that they are running in a kthread  -- so
> maybe replacing kthread_should_stop() is a good clean-up to apply
> in advance.
> 
> 
> > +static inline bool svc_should_stop(struct svc_rqst *rqstp)
> > +{
> > +	if (test_and_clear_bit(SP_NEED_VICTIM, &rqstp->rq_pool->sp_flags)) {
> > +		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > +		return true;
> > +	}
> > +	return test_bit(RQ_VICTIM, &rqstp->rq_flags);
> > +}
> > +
> >  struct svc_deferred_req {
> >  	u32			prot;	/* protocol (UDP or TCP) */
> >  	struct svc_xprt		*xprt;
> > @@ -416,6 +434,7 @@ bool		   svc_rqst_replace_page(struct svc_rqst *rqstp,
> >  void		   svc_rqst_release_pages(struct svc_rqst *rqstp);
> >  void		   svc_rqst_free(struct svc_rqst *);
> >  void		   svc_exit_thread(struct svc_rqst *);
> > +bool		   svc_dequeue_rqst(struct svc_rqst *rqstp);
> >  struct svc_serv *  svc_create_pooled(struct svc_program *, unsigned int,
> >  				     int (*threadfn)(void *data));
> >  int		   svc_set_num_threads(struct svc_serv *, struct svc_pool *, int);
> > @@ -428,7 +447,7 @@ int		   svc_register(const struct svc_serv *, struct net *, const int,
> >  
> >  void		   svc_wake_up(struct svc_serv *);
> >  void		   svc_reserve(struct svc_rqst *rqstp, int space);
> > -struct svc_rqst	  *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > +bool		   svc_pool_wake_idle_thread(struct svc_serv *serv,
> >  					     struct svc_pool *pool);
> >  struct svc_pool   *svc_pool_for_cpu(struct svc_serv *serv);
> >  char *		   svc_print_addr(struct svc_rqst *, char *, size_t);
> > diff --git a/include/trace/events/sunrpc.h b/include/trace/events/sunrpc.h
> > index f6fd48961074..f63289d1491d 100644
> > --- a/include/trace/events/sunrpc.h
> > +++ b/include/trace/events/sunrpc.h
> > @@ -1601,7 +1601,7 @@ DEFINE_SVCXDRBUF_EVENT(sendto);
> >  	svc_rqst_flag(DROPME)						\
> >  	svc_rqst_flag(SPLICE_OK)					\
> >  	svc_rqst_flag(VICTIM)						\
> > -	svc_rqst_flag(BUSY)						\
> > +	svc_rqst_flag(CLEAN_ME)						\
> >  	svc_rqst_flag_end(DATA)
> >  
> >  #undef svc_rqst_flag
> > @@ -1965,7 +1965,10 @@ TRACE_EVENT(svc_xprt_enqueue,
> >  #define show_svc_pool_flags(x)						\
> >  	__print_flags(x, "|",						\
> >  		{ BIT(SP_TASK_PENDING),		"TASK_PENDING" },	\
> > -		{ BIT(SP_CONGESTED),		"CONGESTED" })
> > +		{ BIT(SP_CONGESTED),		"CONGESTED" },		\
> > +		{ BIT(SP_NEED_VICTIM),		"NEED_VICTIM" },	\
> > +		{ BIT(SP_VICTIM_REMAINS),	"VICTIM_REMAINS" },	\
> > +		{ BIT(SP_CLEANUP),		"CLEANUP" })
> >  
> >  DECLARE_EVENT_CLASS(svc_pool_scheduler_class,
> >  	TP_PROTO(
> > diff --git a/lib/llist.c b/lib/llist.c
> > index 6e668fa5a2c6..660be07795ac 100644
> > --- a/lib/llist.c
> > +++ b/lib/llist.c
> > @@ -65,6 +65,57 @@ struct llist_node *llist_del_first(struct llist_head *head)
> >  }
> >  EXPORT_SYMBOL_GPL(llist_del_first);
> >  
> > +/**
> > + * llist_del_first_this - delete given entry of lock-less list if it is first
> > + * @head:	the head for your lock-less list
> > + * @this:	a list entry.
> > + *
> > + * If head of the list is given entry, delete and return it, else
> > + * return %NULL.
> > + *
> > + * Providing the caller has exclusive access to @this, multiple callers can
> > + * safely call this concurrently with multiple llist_add() callers.
> > + */
> > +struct llist_node *llist_del_first_this(struct llist_head *head,
> > +					struct llist_node *this)
> > +{
> > +	struct llist_node *entry, *next;
> > +
> > +	entry = smp_load_acquire(&head->first);
> > +	do {
> > +		if (entry != this)
> > +			return NULL;
> > +		next = READ_ONCE(entry->next);
> > +	} while (!try_cmpxchg(&head->first, &entry, next));
> > +
> > +	return entry;
> > +}
> > +EXPORT_SYMBOL_GPL(llist_del_first_this);
> > +
> > +/**
> > + * llist_del_all_this - delete all entries from lock-less list if first is the given element
> > + * @head:	the head of lock-less list to delete all entries
> > + * @this:	the expected first element.
> > + *
> > + * If the first element of the list is @this, delete all elements and
> > + * return them, else return %NULL.  Providing the caller has exclusive access
> > + * to @this, multiple concurrent callers can call this or llist_del_first_this()
> > + * simultaneously with multiple callers of llist_add().
> > + */
> > +struct llist_node *llist_del_all_this(struct llist_head *head,
> > +				      struct llist_node *this)
> > +{
> > +	struct llist_node *entry;
> > +
> > +	entry = smp_load_acquire(&head->first);
> > +	do {
> > +		if (entry != this)
> > +			return NULL;
> > +	} while (!try_cmpxchg(&head->first, &entry, NULL));
> > +
> > +	return entry;
> > +}
> > +
> 
> I was going to say that we should copy the maintainer of
> lib/llist.c on this patch set, but I'm a little surprised to see
> no maintainer listed for it. I'm not sure how to get proper
> review for the new API and mechanism.

I would certainly copy the author if this ends up going anywhere.

> 
> Sidebar: Are there any self-tests or kunit tests for llist?
> 
> 
> >  /**
> >   * llist_reverse_order - reverse order of a llist chain
> >   * @head:	first item of the list to be reversed
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index ffb7200e8257..55339cbbbc6e 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -507,7 +507,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  
> >  		pool->sp_id = i;
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> > -		INIT_LIST_HEAD(&pool->sp_all_threads);
> > +		init_llist_head(&pool->sp_idle_threads);
> >  		spin_lock_init(&pool->sp_lock);
> >  
> >  		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
> > @@ -652,9 +652,9 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  
> >  	pagevec_init(&rqstp->rq_pvec);
> >  
> > -	__set_bit(RQ_BUSY, &rqstp->rq_flags);
> >  	rqstp->rq_server = serv;
> >  	rqstp->rq_pool = pool;
> > +	rqstp->rq_idle.next = &rqstp->rq_idle;
> 
> Is there really no initializer helper for this?

The ".next == &node" convention for "not on the list" is one that I
created, so there is no pre-existing helper.  Yes: I should add one.
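
Something along these lines, perhaps (a sketch only; the helper names
are made up here):

	/* rq_idle points at itself while the thread is NOT on the idle llist */
	static inline void svc_thread_init_idle(struct svc_rqst *rqstp)
	{
		rqstp->rq_idle.next = &rqstp->rq_idle;
	}

	static inline bool svc_thread_on_idle_list(const struct svc_rqst *rqstp)
	{
		return rqstp->rq_idle.next != &rqstp->rq_idle;
	}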

> 
> 
> >  	rqstp->rq_scratch_page = alloc_pages_node(node, GFP_KERNEL, 0);
> >  	if (!rqstp->rq_scratch_page)
> > @@ -694,7 +694,6 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  
> >  	spin_lock_bh(&pool->sp_lock);
> >  	pool->sp_nrthreads++;
> > -	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
> >  	spin_unlock_bh(&pool->sp_lock);
> >  	return rqstp;
> >  }
> > @@ -704,32 +703,34 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >   * @serv: RPC service
> >   * @pool: service thread pool
> >   *
> > - * Returns an idle service thread (now marked BUSY), or NULL
> > - * if no service threads are available. Finding an idle service
> > - * thread and marking it BUSY is atomic with respect to other
> > - * calls to svc_pool_wake_idle_thread().
> > + * If there are any idle threads in the pool, wake one up and return
> > + * %true, else return %false.  The thread will become non-idle once
> > + * the scheduler schedules it, at which point it might wake another
> > + * thread if there seems to be enough work to justify that.
> 
> So I'm wondering how another call to svc_pool_wake_idle_thread()
> that happens concurrently will not find and wake the same thread?

It will find and wake the same thread.  When that thread is scheduled it
will dequeue any work that is needed.  Then, if there is more, it will
wake another.
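
As a sketch of that flow (using names from the RFC patch; the
svc_handle_work() call is just a stand-in for the actual request
handling):

	bool more = false;
	struct svc_xprt *xprt = svc_xprt_dequeue(pool, &more);

	if (xprt) {
		/* chain the wake-up before getting busy with this work */
		if (more)
			svc_pool_wake_idle_thread(serv, pool);
		svc_handle_work(rqstp, xprt);	/* stand-in */
	}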

> 
> 
> >   */
> > -struct svc_rqst *svc_pool_wake_idle_thread(struct svc_serv *serv,
> > -					   struct svc_pool *pool)
> > +bool svc_pool_wake_idle_thread(struct svc_serv *serv,
> > +			       struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > +	struct llist_node *ln;
> >  
> >  	rcu_read_lock();
> > -	list_for_each_entry_rcu(rqstp, &pool->sp_all_threads, rq_all) {
> > -		if (test_and_set_bit(RQ_BUSY, &rqstp->rq_flags))
> > -			continue;
> > -
> > -		rcu_read_unlock();
> > +	ln = READ_ONCE(pool->sp_idle_threads.first);
> > +	if (ln) {
> > +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> > -		wake_up_process(rqstp->rq_task);
> > -		percpu_counter_inc(&pool->sp_threads_woken);
> > -		return rqstp;
> > +		if (!task_is_running(rqstp->rq_task)) {
> > +			wake_up_process(rqstp->rq_task);
> > +			percpu_counter_inc(&pool->sp_threads_woken);
> > +		}
> > +		rcu_read_unlock();
> > +		return true;
> >  	}
> >  	rcu_read_unlock();
> >  
> >  	trace_svc_pool_starved(serv, pool);
> >  	percpu_counter_inc(&pool->sp_threads_starved);
> > -	return NULL;
> > +	return false;
> >  }
> >  
> >  static struct svc_pool *
> > @@ -738,19 +739,22 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> >  	return pool ? pool : &serv->sv_pools[(*state)++ % serv->sv_nrpools];
> >  }
> >  
> > -static struct task_struct *
> > +static struct svc_pool *
> >  svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
> >  {
> > -	unsigned int i;
> > -	struct task_struct *task = NULL;
> >  
> >  	if (pool != NULL) {
> >  		spin_lock_bh(&pool->sp_lock);
> > +		if (pool->sp_nrthreads > 0)
> > +			goto found_pool;
> > +		spin_unlock_bh(&pool->sp_lock);
> > +		return NULL;
> >  	} else {
> > +		unsigned int i;
> >  		for (i = 0; i < serv->sv_nrpools; i++) {
> >  			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
> >  			spin_lock_bh(&pool->sp_lock);
> > -			if (!list_empty(&pool->sp_all_threads))
> > +			if (pool->sp_nrthreads > 0)
> >  				goto found_pool;
> >  			spin_unlock_bh(&pool->sp_lock);
> >  		}
> > @@ -758,16 +762,10 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
> >  	}
> >  
> >  found_pool:
> > -	if (!list_empty(&pool->sp_all_threads)) {
> > -		struct svc_rqst *rqstp;
> > -
> > -		rqstp = list_entry(pool->sp_all_threads.next, struct svc_rqst, rq_all);
> > -		set_bit(RQ_VICTIM, &rqstp->rq_flags);
> > -		list_del_rcu(&rqstp->rq_all);
> > -		task = rqstp->rq_task;
> > -	}
> > +	set_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
> > +	set_bit(SP_NEED_VICTIM, &pool->sp_flags);
> >  	spin_unlock_bh(&pool->sp_lock);
> > -	return task;
> > +	return pool;
> >  }
> >  
> >  static int
> > @@ -808,18 +806,16 @@ svc_start_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> >  static int
> >  svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
> >  {
> > -	struct svc_rqst	*rqstp;
> > -	struct task_struct *task;
> >  	unsigned int state = serv->sv_nrthreads-1;
> > +	struct svc_pool *vpool;
> >  
> >  	do {
> > -		task = svc_pool_victim(serv, pool, &state);
> > -		if (task == NULL)
> > +		vpool = svc_pool_victim(serv, pool, &state);
> > +		if (vpool == NULL)
> >  			break;
> > -		rqstp = kthread_data(task);
> > -		/* Did we lose a race to svo_function threadfn? */
> > -		if (kthread_stop(task) == -EINTR)
> > -			svc_exit_thread(rqstp);
> > +		svc_pool_wake_idle_thread(serv, vpool);
> > +		wait_on_bit(&vpool->sp_flags, SP_VICTIM_REMAINS,
> > +			    TASK_UNINTERRUPTIBLE);
> >  		nrservs++;
> >  	} while (nrservs < 0);
> >  	return 0;
> > @@ -931,16 +927,75 @@ svc_rqst_free(struct svc_rqst *rqstp)
> >  }
> >  EXPORT_SYMBOL_GPL(svc_rqst_free);
> >  
> 
> Can you add a kdoc comment for svc_dequeue_rqst()?
> 
> There is a lot of complexity here that I'm not grokking on first
> read. Either it needs more comments or simplification (or both).
> 
> It's not winning me over, I have to say ;-)
> 
> 
> > +bool svc_dequeue_rqst(struct svc_rqst *rqstp)
> > +{
> > +	struct svc_pool *pool = rqstp->rq_pool;
> > +	struct llist_node *le, *last;
> > +
> > +retry:
> > +	if (pool->sp_idle_threads.first != &rqstp->rq_idle)
> 
> Would be better if there was a helper for this test.
> 
> 
> > +		/* Not at head of queue, so cannot wake up */
> > +		return false;
> > +	if (!test_and_clear_bit(SP_CLEANUP, &pool->sp_flags)) {
> > +		le = llist_del_first_this(&pool->sp_idle_threads,
> > +					  &rqstp->rq_idle);
> > +		if (le)
> > +			le->next = le;
> > +		return !!le;
> > +	}
> > +	/* Need to deal with an RQ_CLEAN_ME thread */
> > +	le = llist_del_all_this(&pool->sp_idle_threads,
> > +				&rqstp->rq_idle);
> > +	if (!le) {
> > +		/* lost a race, someone else needs to clean up */
> > +		set_bit(SP_CLEANUP, &pool->sp_flags);
> > +		svc_pool_wake_idle_thread(rqstp->rq_server,
> > +					  pool);
> > +		goto retry;
> > +	}
> > +	if (!le->next)
> > +		return true;
> > +	last = le;
> > +	while (last->next) {
> > +		rqstp = list_entry(last->next, struct svc_rqst, rq_idle);
> > +		if (!test_bit(RQ_CLEAN_ME, &rqstp->rq_flags)) {
> > +			last = last->next;
> > +			continue;
> > +		}
> > +		last->next = last->next->next;
> > +		rqstp->rq_idle.next = &rqstp->rq_idle;
> > +		wake_up_process(rqstp->rq_task);
> > +	}
> > +	if (last != le)
> > +		llist_add_batch(le->next, last, &pool->sp_idle_threads);
> > +	le->next = le;
> > +	return true;
> > +}
> > +
> >  void
> >  svc_exit_thread(struct svc_rqst *rqstp)
> >  {
> >  	struct svc_serv	*serv = rqstp->rq_server;
> >  	struct svc_pool	*pool = rqstp->rq_pool;
> >  
> > +	while (rqstp->rq_idle.next != &rqstp->rq_idle) {
> 
> Helper, maybe?
> 
> 
> > +		/* Still on the idle list. */
> > +		if (llist_del_first_this(&pool->sp_idle_threads,
> > +					 &rqstp->rq_idle)) {
> > +			/* Safely removed */
> > +			rqstp->rq_idle.next = &rqstp->rq_idle;
> > +		} else {
> > +			set_current_state(TASK_UNINTERRUPTIBLE);
> > +			set_bit(RQ_CLEAN_ME, &rqstp->rq_flags);
> > +			set_bit(SP_CLEANUP, &pool->sp_flags);
> > +			svc_pool_wake_idle_thread(serv, pool);
> > +			if (!svc_dequeue_rqst(rqstp))
> > +				schedule();
> > +			__set_current_state(TASK_RUNNING);
> > +		}
> > +	}
> >  	spin_lock_bh(&pool->sp_lock);
> >  	pool->sp_nrthreads--;
> > -	if (!test_and_set_bit(RQ_VICTIM, &rqstp->rq_flags))
> > -		list_del_rcu(&rqstp->rq_all);
> >  	spin_unlock_bh(&pool->sp_lock);
> >  
> >  	spin_lock_bh(&serv->sv_lock);
> > @@ -948,6 +1003,8 @@ svc_exit_thread(struct svc_rqst *rqstp)
> >  	spin_unlock_bh(&serv->sv_lock);
> >  	svc_sock_update_bufs(serv);
> >  
> > +	if (test_bit(RQ_VICTIM, &rqstp->rq_flags))
> > +		clear_and_wake_up_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
> >  	svc_rqst_free(rqstp);
> >  
> >  	svc_put(serv);
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index c4521bce1f27..d51587cd8d99 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -446,7 +446,6 @@ static bool svc_xprt_ready(struct svc_xprt *xprt)
> >   */
> >  void svc_xprt_enqueue(struct svc_xprt *xprt)
> >  {
> > -	struct svc_rqst *rqstp;
> >  	struct svc_pool *pool;
> >  
> >  	if (!svc_xprt_ready(xprt))
> > @@ -467,20 +466,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> >  	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
> >  	spin_unlock_bh(&pool->sp_lock);
> >  
> > -	rqstp = svc_pool_wake_idle_thread(xprt->xpt_server, pool);
> > -	if (!rqstp) {
> > +	if (!svc_pool_wake_idle_thread(xprt->xpt_server, pool)) {
> >  		set_bit(SP_CONGESTED, &pool->sp_flags);
> >  		return;
> >  	}
> >  
> > -	trace_svc_xprt_enqueue(xprt, rqstp);
> > +	// trace_svc_xprt_enqueue(xprt, rqstp);
> >  }
> >  EXPORT_SYMBOL_GPL(svc_xprt_enqueue);
> >  
> >  /*
> >   * Dequeue the first transport, if there is one.
> >   */
> > -static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
> > +static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool, bool *more)
> >  {
> >  	struct svc_xprt	*xprt = NULL;
> >  
> > @@ -493,6 +491,7 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
> >  					struct svc_xprt, xpt_ready);
> >  		list_del_init(&xprt->xpt_ready);
> >  		svc_xprt_get(xprt);
> > +		*more = !list_empty(&pool->sp_sockets);
> >  	}
> >  	spin_unlock_bh(&pool->sp_lock);
> >  out:
> > @@ -577,15 +576,13 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
> >  void svc_wake_up(struct svc_serv *serv)
> >  {
> >  	struct svc_pool *pool = &serv->sv_pools[0];
> > -	struct svc_rqst *rqstp;
> >  
> > -	rqstp = svc_pool_wake_idle_thread(serv, pool);
> > -	if (!rqstp) {
> > +	if (!svc_pool_wake_idle_thread(serv, pool)) {
> >  		set_bit(SP_TASK_PENDING, &pool->sp_flags);
> >  		return;
> >  	}
> >  
> > -	trace_svc_wake_up(rqstp);
> > +	// trace_svc_wake_up(rqstp);
> >  }
> >  EXPORT_SYMBOL_GPL(svc_wake_up);
> >  
> > @@ -676,7 +673,7 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> >  			continue;
> >  
> >  		set_current_state(TASK_INTERRUPTIBLE);
> > -		if (signalled() || kthread_should_stop()) {
> > +		if (signalled() || svc_should_stop(rqstp)) {
> >  			set_current_state(TASK_RUNNING);
> >  			return -EINTR;
> >  		}
> > @@ -706,7 +703,10 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  	struct svc_pool		*pool = rqstp->rq_pool;
> >  
> >  	/* did someone call svc_wake_up? */
> > -	if (test_and_clear_bit(SP_TASK_PENDING, &pool->sp_flags))
> > +	if (test_bit(SP_TASK_PENDING, &pool->sp_flags))
> > +		return false;
> > +	if (test_bit(SP_CLEANUP, &pool->sp_flags))
> > +		/* a signalled thread needs to be released */
> >  		return false;
> >  
> >  	/* was a socket queued? */
> > @@ -714,7 +714,7 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  		return false;
> >  
> >  	/* are we shutting down? */
> > -	if (signalled() || kthread_should_stop())
> > +	if (signalled() || svc_should_stop(rqstp))
> >  		return false;
> >  
> >  	/* are we freezing? */
> > @@ -728,11 +728,9 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  {
> >  	struct svc_pool		*pool = rqstp->rq_pool;
> >  	long			time_left = 0;
> > +	bool			more = false;
> >  
> > -	/* rq_xprt should be clear on entry */
> > -	WARN_ON_ONCE(rqstp->rq_xprt);
> > -
> > -	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> > +	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
> >  	if (rqstp->rq_xprt) {
> >  		trace_svc_pool_polled(pool, rqstp);
> >  		goto out_found;
> > @@ -743,11 +741,10 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  	 * to bring down the daemons ...
> >  	 */
> >  	set_current_state(TASK_INTERRUPTIBLE);
> > -	smp_mb__before_atomic();
> > -	clear_bit(SP_CONGESTED, &pool->sp_flags);
> > -	clear_bit(RQ_BUSY, &rqstp->rq_flags);
> > -	smp_mb__after_atomic();
> > +	clear_bit_unlock(SP_CONGESTED, &pool->sp_flags);
> >  
> > +	llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > +again:
> >  	if (likely(rqst_should_sleep(rqstp)))
> >  		time_left = schedule_timeout(timeout);
> >  	else
> > @@ -755,9 +752,20 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  
> >  	try_to_freeze();
> >  
> > -	set_bit(RQ_BUSY, &rqstp->rq_flags);
> > -	smp_mb__after_atomic();
> > -	rqstp->rq_xprt = svc_xprt_dequeue(pool);
> > +	if (!svc_dequeue_rqst(rqstp)) {
> > +		if (signalled())
> > +			/* Can only return while on idle list if signalled */
> > +			return ERR_PTR(-EINTR);
> > +		/* Still on the idle list */
> > +		goto again;
> > +	}
> > +
> > +	clear_bit(SP_TASK_PENDING, &pool->sp_flags);
> > +
> > +	if (svc_should_stop(rqstp))
> > +		return ERR_PTR(-EINTR);
> > +
> > +	rqstp->rq_xprt = svc_xprt_dequeue(pool, &more);
> >  	if (rqstp->rq_xprt) {
> >  		trace_svc_pool_awoken(pool, rqstp);
> >  		goto out_found;
> > @@ -766,10 +774,11 @@ static struct svc_xprt *svc_get_next_xprt(struct svc_rqst *rqstp, long timeout)
> >  	if (!time_left)
> >  		percpu_counter_inc(&pool->sp_threads_timedout);
> >  
> > -	if (signalled() || kthread_should_stop())
> > -		return ERR_PTR(-EINTR);
> >  	return ERR_PTR(-EAGAIN);
> >  out_found:
> > +	if (more)
> > +		svc_pool_wake_idle_thread(rqstp->rq_server, pool);
> > +
> 
> I'm thinking that dealing with more work should be implemented as a
> separate optimization (ie, a subsequent patch).

Ideally: yes.  However the two changes are inter-related.
Using a lockless queue lends itself particularly well to only waking one
thread at a time - the thread at the head.
To preserve the current practice of waking a thread every time a
work-item is queued, we would need a lock to remove the thread from the
queue.  There would be a small constant amount of work to do under that
lock, so the cost would be minimal and I could do that.
But then I would just remove the lock again when changing to wake
threads sequentially.
Still - might be worth doing it that way for independent testing and
performance comparison.
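
To make the trade-off concrete, the wake-up side in this scheme only
ever has to look at the head of the llist, roughly like this (a
simplified sketch, not the code in the patch):

static bool svc_pool_wake_head(struct svc_pool *pool)
{
	struct llist_node *ln = pool->sp_idle_threads.first;
	struct svc_rqst *rqstp;

	if (!ln)
		return false;	/* no idle thread; caller marks pool congested */
	rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
	/* the woken thread dequeues itself and, if more work remains,
	 * wakes the next head before it starts processing
	 */
	wake_up_process(rqstp->rq_task);
	return true;
}

Waking a thread on every enqueue instead would mean removing an
arbitrary entry from that list, and that removal is the step which
needs the lock.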

> 
> 
> >  	/* Normally we will wait up to 5 seconds for any required
> >  	 * cache information to be provided.
> >  	 */
> > @@ -866,7 +875,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
> >  	try_to_freeze();
> >  	cond_resched();
> >  	err = -EINTR;
> > -	if (signalled() || kthread_should_stop())
> > +	if (signalled() || svc_should_stop(rqstp))
> >  		goto out;
> >  
> >  	xprt = svc_get_next_xprt(rqstp, timeout);
> > -- 
> > 2.40.1
> > 
> 


Thanks for the review.  I generally agree with anything that I didn't
reply to - and much that I did.

What do you think of removing the ability to stop an nfsd thread by
sending it a signal.  Note that this doesn't apply to lockd or to nfsv4
callback threads.  And nfs-utils never relies on this.
I'm keen.  It would make this patch a lot simpler.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 16:21         ` Chuck Lever
@ 2023-07-10 22:18           ` NeilBrown
  2023-07-10 22:30             ` Jeff Layton
  0 siblings, 1 reply; 41+ messages in thread
From: NeilBrown @ 2023-07-10 22:18 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, jlayton, david

On Tue, 11 Jul 2023, Chuck Lever wrote:
> 
> Actually... Lorenzo reminded me that we need to retain a mechanism
> that can iterate through all the threads in a pool. The xarray
> replaces the "all_threads" list in this regard.
> 

For what purpose?

NeilBrown

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 22:18           ` NeilBrown
@ 2023-07-10 22:30             ` Jeff Layton
  2023-07-10 22:43               ` Chuck Lever III
  0 siblings, 1 reply; 41+ messages in thread
From: Jeff Layton @ 2023-07-10 22:30 UTC (permalink / raw)
  To: NeilBrown, Chuck Lever; +Cc: linux-nfs, Chuck Lever, lorenzo, david

On Tue, 2023-07-11 at 08:18 +1000, NeilBrown wrote:
> On Tue, 11 Jul 2023, Chuck Lever wrote:
> > 
> > Actually... Lorenzo reminded me that we need to retain a mechanism
> > that can iterate through all the threads in a pool. The xarray
> > replaces the "all_threads" list in this regard.
> > 
> 
> For what purpose?
> 

He's working on a project to add a rpc status procfile which shows in-
flight requests:

    https://bugzilla.linux-nfs.org/show_bug.cgi?id=366
-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 22:30             ` Jeff Layton
@ 2023-07-10 22:43               ` Chuck Lever III
  2023-07-11 10:25                 ` NeilBrown
  2023-07-11 11:50                 ` Jeff Layton
  0 siblings, 2 replies; 41+ messages in thread
From: Chuck Lever III @ 2023-07-10 22:43 UTC (permalink / raw)
  To: Jeff Layton, Neil Brown
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, Dave Chinner



> On Jul 10, 2023, at 6:30 PM, Jeff Layton <jlayton@redhat.com> wrote:
> 
> On Tue, 2023-07-11 at 08:18 +1000, NeilBrown wrote:
>> On Tue, 11 Jul 2023, Chuck Lever wrote:
>>> 
>>> Actually... Lorenzo reminded me that we need to retain a mechanism
>>> that can iterate through all the threads in a pool. The xarray
>>> replaces the "all_threads" list in this regard.
>>> 
>> 
>> For what purpose?
>> 
> 
> He's working on a project to add a rpc status procfile which shows in-
> flight requests:
> 
>    https://bugzilla.linux-nfs.org/show_bug.cgi?id=366

Essentially, we want a lightweight mechanism for detecting
unresponsive nfsd threads.
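
Roughly something like this (just a sketch; the xarray field name and
the per-thread check here are illustrative, not code from the series):

	struct svc_rqst *rqstp;
	unsigned long index;

	xa_for_each(&pool->sp_thread_xa, index, rqstp) {
		/* e.g. compare a per-thread "last seen" timestamp
		 * against jiffies to flag a stuck nfsd thread
		 */
	}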


--
Chuck Lever



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 22:18           ` NeilBrown
@ 2023-07-10 22:52             ` Chuck Lever III
  2023-07-11 10:01               ` NeilBrown
  0 siblings, 1 reply; 41+ messages in thread
From: Chuck Lever III @ 2023-07-10 22:52 UTC (permalink / raw)
  To: Neil Brown
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, Jeff Layton, david


> On Jul 10, 2023, at 6:18 PM, NeilBrown <neilb@suse.de> wrote:
> 
> What do you think of removing the ability to stop an nfsd thread by
> sending it a signal.  Note that this doesn't apply to lockd or to nfsv4
> callback threads.  And nfs-utils never relies on this.
> I'm keen.  It would make this patch a lot simpler.

I agree the code base would be cleaner for it.

But I'm the new kid. I'm not really sure if this is
part of a kernel - user space API that we mustn't
alter, or whether it's something that was added but
never used, or ....

I can sniff around to get a better understanding.


--
Chuck Lever



^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 22:52             ` Chuck Lever III
@ 2023-07-11 10:01               ` NeilBrown
  2023-07-11 10:43                 ` Jeff Layton
  2023-07-12 19:29                 ` Chuck Lever
  0 siblings, 2 replies; 41+ messages in thread
From: NeilBrown @ 2023-07-11 10:01 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, Jeff Layton, david

On Tue, 11 Jul 2023, Chuck Lever III wrote:
> > On Jul 10, 2023, at 6:18 PM, NeilBrown <neilb@suse.de> wrote:
> > 
> > What do you think of removing the ability to stop an nfsd thread by
> > sending it a signal.  Note that this doesn't apply to lockd or to nfsv4
> > callback threads.  And nfs-utils never relies on this.
> > I'm keen.  It would make this patch a lot simpler.
> 
> I agree the code base would be cleaner for it.
> 
> But I'm the new kid. I'm not really sure if this is
> part of a kernel - user space API that we mustn't
> alter, or whether it's something that was added but
> never used, or ....
> 
> I can sniff around to get a better understanding.

Once upon a time it was the only way to kill the threads.
There was a syscall which let you start some threads.  You couldn't
change the number of threads, but you could kill them.
And shutdown always kills processes, so letting nfsd threads be killed
meant that they would be stopped at system shutdown.

When I added the ability to dynamically change the number of threads it
made sense that we could set the number to zero, and then to use that
functionality to shut down the nfs server.  So the /etc/init.d/nfsd
script changed from "killall -9 nfsd" or whatever it was to 
"rpc.nfsd 0".

But it didn't seem sensible to remove the "kill" functionality until
after a transition process, and I never thought to schedule a formal
deprecation.  So it just stayed...

The more I think about it, the more I am in favour of removing it.  I
don't think any other kernel threads can be killed.  nfsd doesn't need
to be special.

Maybe I'll post a patch which just does that.
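
For what it's worth, the RFC already stops a thread without a signal:
the manager wakes an idle thread and waits for it to exit, and the
exiting thread reports back through a pool flag.  Condensed from the
hunks above (sketch only):

	/* svc_set_num_threads(), once per surplus thread */
	svc_pool_wake_idle_thread(serv, vpool);
	wait_on_bit(&vpool->sp_flags, SP_VICTIM_REMAINS,
		    TASK_UNINTERRUPTIBLE);

	/* svc_exit_thread(), on the thread chosen as victim */
	if (test_bit(RQ_VICTIM, &rqstp->rq_flags))
		clear_and_wake_up_bit(SP_VICTIM_REMAINS, &pool->sp_flags);

Removing signal handling would leave this as the only way nfsd threads
go away, which is where the simplification comes from.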

NeilBrown


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 22:43               ` Chuck Lever III
@ 2023-07-11 10:25                 ` NeilBrown
  2023-07-11 11:50                 ` Jeff Layton
  1 sibling, 0 replies; 41+ messages in thread
From: NeilBrown @ 2023-07-11 10:25 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Jeff Layton, Chuck Lever, Linux NFS Mailing List, lorenzo, Dave Chinner

On Tue, 11 Jul 2023, Chuck Lever III wrote:
> 
> > On Jul 10, 2023, at 6:30 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Tue, 2023-07-11 at 08:18 +1000, NeilBrown wrote:
> >> On Tue, 11 Jul 2023, Chuck Lever wrote:
> >>> 
> >>> Actually... Lorenzo reminded me that we need to retain a mechanism
> >>> that can iterate through all the threads in a pool. The xarray
> >>> replaces the "all_threads" list in this regard.
> >>> 
> >> 
> >> For what purpose?
> >> 
> > 
> > He's working on a project to add a rpc status procfile which shows in-
> > flight requests:
> > 
> >    https://bugzilla.linux-nfs.org/show_bug.cgi?id=366
> 
> Essentially, we want a lightweight mechanism for detecting
> unresponsive nfsd threads.
> 

Sounds reasonable - thanks.
Obviously the mechanism for that purpose can be completely separate from
any mechanism for finding idle threads.  And it can easily be re-added
later if it gets removed now because all of its other uses disappear.

Thanks,
NeilBrown


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-11 10:01               ` NeilBrown
@ 2023-07-11 10:43                 ` Jeff Layton
  2023-07-12 19:29                 ` Chuck Lever
  1 sibling, 0 replies; 41+ messages in thread
From: Jeff Layton @ 2023-07-11 10:43 UTC (permalink / raw)
  To: NeilBrown, Chuck Lever III
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, david

On Tue, 2023-07-11 at 20:01 +1000, NeilBrown wrote:
> On Tue, 11 Jul 2023, Chuck Lever III wrote:
> > > On Jul 10, 2023, at 6:18 PM, NeilBrown <neilb@suse.de> wrote:
> > > 
> > > What do you think of removing the ability to stop an nfsd thread by
> > > sending it a signal.  Note that this doesn't apply to lockd or to nfsv4
> > > callback threads.  And nfs-utils never relies on this.
> > > I'm keen.  It would make this patch a lot simpler.
> > 
> > I agree the code base would be cleaner for it.
> > 
> > But I'm the new kid. I'm not really sure if this is
> > part of a kernel - user space API that we mustn't
> > alter, or whether it's something that was added but
> > never used, or ....
> > 
> > I can sniff around to get a better understanding.
> 
> Once upon a time it was the only way to kill the threads.
> There was a syscall which let you start some threads.  You couldn't
> change the number of threads, but you could kill them.
> And shutdown always kills processes, so letting nfsd threads be killed
> meant that they would be stopped at system shutdown.
> 
> When I added the ability to dynamically change the number of threads it
> made sense that we could set the number to zero, and then to use that
> functionality to shut down the nfs server.  So the /etc/init.d/nfsd
> script changed from "killall -9 nfsd" or whatever it was to 
> "rpc.nfsd 0".
> 
> But it didn't seem sensible to remove the "kill" functionality until
> after a transition process, and I never thought to schedule a formal
> deprecation.  So it just stayed...
> 
> The more I think about it, the more I am in favour of removing it.  I
> don't think any other kernel threads can be killed.  nfsd doesn't need
> to be special.
> 
> Maybe I'll post a patch which just does that.
> 

I'd be in favor of removing signal handling from the nfsd threads. It's
always been a bit hacky, particularly since we moved everything to the
kthread API around a decade ago.

-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-10 22:43               ` Chuck Lever III
  2023-07-11 10:25                 ` NeilBrown
@ 2023-07-11 11:50                 ` Jeff Layton
  1 sibling, 0 replies; 41+ messages in thread
From: Jeff Layton @ 2023-07-11 11:50 UTC (permalink / raw)
  To: Chuck Lever III, Neil Brown
  Cc: Chuck Lever, Linux NFS Mailing List, lorenzo, Dave Chinner

On Mon, 2023-07-10 at 22:43 +0000, Chuck Lever III wrote:
> 
> > On Jul 10, 2023, at 6:30 PM, Jeff Layton <jlayton@redhat.com> wrote:
> > 
> > On Tue, 2023-07-11 at 08:18 +1000, NeilBrown wrote:
> > > On Tue, 11 Jul 2023, Chuck Lever wrote:
> > > > 
> > > > Actually... Lorenzo reminded me that we need to retain a mechanism
> > > > that can iterate through all the threads in a pool. The xarray
> > > > replaces the "all_threads" list in this regard.
> > > > 
> > > 
> > > For what purpose?
> > > 
> > 
> > He's working on a project to add a rpc status procfile which shows in-
> > flight requests:
> > 
> >    https://bugzilla.linux-nfs.org/show_bug.cgi?id=366
> 
> Essentially, we want a lightweight mechanism for detecting
> unresponsive nfsd threads.
> 

...and personally, it would also be nice to have a "nfstop" program or
something that could repeatedly poll this file and display information
about active nfsd threads. I could also see this being useful for admins
who want quick visibility into nfsd's activity, without having to sniff
traffic.
-- 
Jeff Layton <jlayton@redhat.com>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH/RFC] sunrpc: constant-time code to wake idle thread
  2023-07-11 10:01               ` NeilBrown
  2023-07-11 10:43                 ` Jeff Layton
@ 2023-07-12 19:29                 ` Chuck Lever
  1 sibling, 0 replies; 41+ messages in thread
From: Chuck Lever @ 2023-07-12 19:29 UTC (permalink / raw)
  To: NeilBrown
  Cc: Chuck Lever III, Linux NFS Mailing List, lorenzo, Jeff Layton, david

On Tue, Jul 11, 2023 at 08:01:41PM +1000, NeilBrown wrote:
> On Tue, 11 Jul 2023, Chuck Lever III wrote:
> > > On Jul 10, 2023, at 6:18 PM, NeilBrown <neilb@suse.de> wrote:
> > > 
> > > What do you think of removing the ability to stop an nfsd thread by
> > > sending it a signal.  Note that this doesn't apply to lockd or to nfsv4
> > > callback threads.  And nfs-utils never relies on this.
> > > I'm keen.  It would make this patch a lot simpler.
> > 
> > I agree the code base would be cleaner for it.
> > 
> > But I'm the new kid. I'm not really sure if this is
> > part of a kernel - user space API that we mustn't
> > alter, or whether it's something that was added but
> > never used, or ....
> > 
> > I can sniff around to get a better understanding.
> 
> Once upon a time it was the only way to kill the threads.
> There was a syscall which let you start some threads.  You couldn't
> change the number of threads, but you could kill them.
> And shutdown always kills processes, so letting nfsd threads be killed
> > meant that they would be stopped at system shutdown.
> 
> When I added the ability to dynamically change the number of threads it
> made sense that we could set the number to zero, and then to use that
> functionality to shut down the nfs server.  So the /etc/init.d/nfsd
> script changed from "killall -9 nfsd" or whatever it was to 
> "rpc.nfsd 0".
> 
> But it didn't seem sensible to remove the "kill" functionality until
> > after a transition process, and I never thought to schedule a formal
> deprecation.  So it just stayed...
> 
> The more I think about it, the more I am in favour of removing it.  I
> don't think any other kernel threads can be killed.  nfsd doesn't need
> to be special.
> 
> Maybe I'll post a patch which just does that.

I won't NACK such a patch out-of-hand. It seems like a sensible
clean-up, but let's have a look at the details.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-07-12 19:29 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-04  0:07 [PATCH v2 0/9] SUNRPC service thread scheduler optimizations Chuck Lever
2023-07-04  0:07 ` [PATCH v2 1/9] SUNRPC: Deduplicate thread wake-up code Chuck Lever
2023-07-04  0:07 ` [PATCH v2 2/9] SUNRPC: Report when no service thread is available Chuck Lever
2023-07-04  0:07 ` [PATCH v2 3/9] SUNRPC: Split the svc_xprt_dequeue tracepoint Chuck Lever
2023-07-04  0:07 ` [PATCH v2 4/9] SUNRPC: Count ingress RPC messages per svc_pool Chuck Lever
2023-07-04  0:45   ` NeilBrown
2023-07-04  1:47     ` Chuck Lever
2023-07-04  2:30       ` NeilBrown
2023-07-04 15:04         ` Chuck Lever III
2023-07-10  0:41       ` [PATCH/RFC] sunrpc: constant-time code to wake idle thread NeilBrown
2023-07-10 14:41         ` Chuck Lever
2023-07-10 22:18           ` NeilBrown
2023-07-10 22:52             ` Chuck Lever III
2023-07-11 10:01               ` NeilBrown
2023-07-11 10:43                 ` Jeff Layton
2023-07-12 19:29                 ` Chuck Lever
2023-07-10 16:21         ` Chuck Lever
2023-07-10 22:18           ` NeilBrown
2023-07-10 22:30             ` Jeff Layton
2023-07-10 22:43               ` Chuck Lever III
2023-07-11 10:25                 ` NeilBrown
2023-07-11 11:50                 ` Jeff Layton
2023-07-04  0:08 ` [PATCH v2 5/9] SUNRPC: Clean up svc_set_num_threads Chuck Lever
2023-07-04  0:08 ` [PATCH v2 6/9] SUNRPC: Replace dprintk() call site in __svc_create() Chuck Lever
2023-07-04  0:08 ` [PATCH v2 7/9] SUNRPC: Don't disable BH's when taking sp_lock Chuck Lever
2023-07-04  0:56   ` NeilBrown
2023-07-04  1:48     ` Chuck Lever
2023-07-04  0:08 ` [PATCH v2 8/9] SUNRPC: Replace sp_threads_all with an xarray Chuck Lever
2023-07-04  1:11   ` NeilBrown
2023-07-04  1:52     ` Chuck Lever
2023-07-04  2:32       ` NeilBrown
2023-07-04  0:08 ` [PATCH v2 9/9] SUNRPC: Convert RQ_BUSY into a per-pool bitmap Chuck Lever
2023-07-04  1:26   ` NeilBrown
2023-07-04  2:02     ` Chuck Lever
2023-07-04  2:17       ` NeilBrown
2023-07-04 15:14         ` Chuck Lever III
2023-07-05  0:34           ` NeilBrown
2023-07-04 16:03       ` Chuck Lever III
2023-07-05  0:35         ` NeilBrown
2023-07-05  1:08 ` [PATCH v2 0/9] SUNRPC service thread scheduler optimizations NeilBrown
2023-07-05  2:43   ` Chuck Lever
