linux-nfs.vger.kernel.org archive mirror
* [PATCH 0/7] SUNRPC - queuing improvements.
@ 2023-08-18  1:45 NeilBrown
  2023-08-18  1:45 ` [PATCH 1/7] SUNRPC: change service idle list to be an llist NeilBrown
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

This series is against topic-sunrpc-thread-scheduling (7fddd9be200c).
There are two changes since the last post (other than dropping patches
that have been included in the topic branch).

1/ When I dropped the test on freezing(current), I needed to
   add TASK_FREEZABLE to the wait state.  I could keep testing
   freezing(current) before calling schedule(), but TASK_FREEZABLE
   is cleaner and seems to be preferred ... except ...

   /proc/PID/stat reports a status of "D" for tasks that are
      TASK_IDLE | TASK_FREEZABLE
   instead of the correct "I".  I've included a fix for that in
   the relevant patch.  I should probably post it separately
   to the scheduler team.

2/ There was a bug in lwq_dequeue().  It tested lwq_empty() without
   taking the spinlock.  This could give false-positive "empty" results
   if it raced with another lwq_dequeue().  So lwq_dequeue() now
   ensures the queue never looks transiently empty.
   lwq_empty() is also more careful about memory ordering.
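
   To make the window concrete, the problematic interleaving was roughly
   the following (reconstructed for illustration, not the exact code of
   the previous version):

      CPU0: __lwq_dequeue()              CPU1: lwq_dequeue()
        sees q->ready == NULL              lwq_empty():
        llist_del_all(&q->new)               q->ready == NULL     -> "empty"
          (q->new is now empty, but          llist_empty(&q->new) -> "empty"
           nothing is in q->ready yet)     returns NULL even though
        q->ready = reversed entries        entries are still queued

   Writing a non-NULL placeholder into q->ready before emptying q->new
   closes that window.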

NeilBrown

 [PATCH 1/7] SUNRPC: change service idle list to be an llist
 [PATCH 2/7] SUNRPC: only have one thread waking up at a time
 [PATCH 3/7] SUNRPC/svc: add light-weight queuing mechanism.
 [PATCH 4/7] SUNRPC: use lwq for sp_sockets - renamed to sp_xprts
 [PATCH 5/7] SUNRPC: change sp_nrthreads to atomic_t
 [PATCH 6/7] SUNRPC: discard sp_lock
 [PATCH 7/7] SUNRPC: change the back-channel queue to lwq


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/7] SUNRPC: change service idle list to be an llist
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  2023-08-18 15:53   ` Chuck Lever
  2023-08-18  1:45 ` [PATCH 2/7] SUNRPC: only have one thread waking up at a time NeilBrown
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

With an llist we don't need to take a lock to add a thread to the list,
though we still need a lock to remove it.  That will go in the next
patch.

Unlike double-linked lists, a thread cannot reliably remove itself from
the list.  Only the first thread can be removed, and that can change
asynchronously.  So some care is needed.

We already check if there is pending work to do, so we are unlikely to
add ourselves to the idle list and then want to remove ourselves again.

If we DO find something needs to be done after adding ourselves to the
list, we simply wake up the first thread on the list.  If that was us,
we successfully removed ourselves and can continue.  If it was some
other thread, they will do the work that needs to be done.  We can
safely sleep until woken.

We also remove the test on freezing() from rqst_should_sleep().  Instead
we set TASK_FREEZABLE before scheduling.  This makes it safe to
schedule() when a freeze is pending.  As we now loop waiting to be
removed from the idle queue, this is a cleaner way to handle freezing.

task_state_index() incorrectly identifies a task with
   TASK_IDLE | TASK_FREEZABLE
as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
just as it ignores extra flags on other states.
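
In sketch form (TASK_IDLE is TASK_UNINTERRUPTIBLE | TASK_NOLOAD, and
TASK_FREEZABLE is not one of the TASK_REPORT bits):

   tsk_state = TASK_IDLE | TASK_FREEZABLE

   old: tsk_state == TASK_IDLE                -> false (extra flag set)
        so the reported state falls back to TASK_UNINTERRUPTIBLE: 'D'
   new: (tsk_state & TASK_IDLE) == TASK_IDLE  -> true
        so the reported state is TASK_REPORT_IDLE: 'I'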

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sched.h      |  4 +--
 include/linux/sunrpc/svc.h | 14 ++++++-----
 net/sunrpc/svc.c           | 13 +++++-----
 net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
 4 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 609bde814cb0..a5f3badcb629 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1666,7 +1666,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
 
 	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
 
-	if (tsk_state == TASK_IDLE)
+	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
 		state = TASK_REPORT_IDLE;
 
 	/*
@@ -1674,7 +1674,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
 	 * to userspace, we can make this appear as if the task has gone through
 	 * a regular rt_mutex_lock() call.
 	 */
-	if (tsk_state == TASK_RTLOCK_WAIT)
+	if (tsk_state & TASK_RTLOCK_WAIT)
 		state = TASK_UNINTERRUPTIBLE;
 
 	return fls(state);
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 22b3018ebf62..5216f95411e3 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -37,7 +37,7 @@ struct svc_pool {
 	struct list_head	sp_sockets;	/* pending sockets */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
-	struct list_head	sp_idle_threads; /* idle server threads */
+	struct llist_head	sp_idle_threads; /* idle server threads */
 
 	/* statistics on pool operation */
 	struct percpu_counter	sp_messages_arrived;
@@ -186,7 +186,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
  */
 struct svc_rqst {
 	struct list_head	rq_all;		/* all threads list */
-	struct list_head	rq_idle;	/* On the idle list */
+	struct llist_node	rq_idle;	/* On the idle list */
 	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
 	struct svc_xprt *	rq_xprt;	/* transport ptr */
 
@@ -270,22 +270,24 @@ enum {
  * svc_thread_set_busy - mark a thread as busy
  * @rqstp: the thread which is now busy
  *
- * If rq_idle is "empty", the thread must be busy.
+ * By convention a thread is busy if rq_idle.next points to rq_idle.
+ * This ensures it is not on the idle list.
  */
 static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
 {
-	INIT_LIST_HEAD(&rqstp->rq_idle);
+	rqstp->rq_idle.next = &rqstp->rq_idle;
 }
 
 /**
  * svc_thread_busy - check if a thread as busy
  * @rqstp: the thread which might be busy
  *
- * If rq_idle is "empty", the thread must be busy.
+ * By convention a thread is busy if rq_idle.next points to rq_idle.
+ * This ensures it is not on the idle list.
  */
 static inline bool svc_thread_busy(struct svc_rqst *rqstp)
 {
-	return list_empty(&rqstp->rq_idle);
+	return rqstp->rq_idle.next == &rqstp->rq_idle;
 }
 
 #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 051f08c48418..addbf28ea50a 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -510,7 +510,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		pool->sp_id = i;
 		INIT_LIST_HEAD(&pool->sp_sockets);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
-		INIT_LIST_HEAD(&pool->sp_idle_threads);
+		init_llist_head(&pool->sp_idle_threads);
 		spin_lock_init(&pool->sp_lock);
 
 		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
@@ -701,15 +701,16 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 void svc_pool_wake_idle_thread(struct svc_pool *pool)
 {
 	struct svc_rqst	*rqstp;
+	struct llist_node *ln;
 
 	rcu_read_lock();
 	spin_lock_bh(&pool->sp_lock);
-	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
-					 struct svc_rqst, rq_idle);
-	if (rqstp)
-		list_del_init(&rqstp->rq_idle);
+	ln = llist_del_first(&pool->sp_idle_threads);
 	spin_unlock_bh(&pool->sp_lock);
-	if (rqstp) {
+	if (ln) {
+		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
+		svc_thread_set_busy(rqstp);
+
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
 		wake_up_process(rqstp->rq_task);
 		rcu_read_unlock();
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index fa0d854a5596..81327001e074 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -715,10 +715,6 @@ rqst_should_sleep(struct svc_rqst *rqstp)
 	if (svc_thread_should_stop(rqstp))
 		return false;
 
-	/* are we freezing? */
-	if (freezing(current))
-		return false;
-
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
 	if (svc_is_backchannel(rqstp)) {
 		if (!list_empty(&rqstp->rq_server->sv_cb_list))
@@ -734,30 +730,32 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
 	struct svc_pool *pool = rqstp->rq_pool;
 
 	if (rqst_should_sleep(rqstp)) {
-		set_current_state(TASK_IDLE);
-		spin_lock_bh(&pool->sp_lock);
-		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
-		spin_unlock_bh(&pool->sp_lock);
+		set_current_state(TASK_IDLE | TASK_FREEZABLE);
+		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
+
+		if (unlikely(!rqst_should_sleep(rqstp)))
+			/* maybe there were no idle threads when some work
+			 * became ready and so nothing was woken.  We've just
+			 * become idle so someone can do the work - maybe us.
+			 * But we cannot reliably remove ourselves from the
+			 * idle list - we can only remove the first task which
+			 * might be us, and might not.
+			 * So remove and wake it, then schedule().  If it was
+			 * us, we won't sleep.  If it is some other thread, they
+			 * will do the work.
+			 */
+			svc_pool_wake_idle_thread(pool);
 
-		/* Need to check should_sleep() again after
-		 * setting task state in case a wakeup happened
-		 * between testing and setting.
+		/* We mustn't continue while on the idle list, and we
+		 * cannot remove ourselves reliably.  The only "work"
+		 * we can do while on the idle list is to freeze.
+		 * So loop until someone removes us
 		 */
-		if (rqst_should_sleep(rqstp)) {
+		while (!svc_thread_busy(rqstp)) {
 			schedule();
-		} else {
-			__set_current_state(TASK_RUNNING);
-			cond_resched();
-		}
-
-		/* We *must* be removed from the list before we can continue.
-		 * If we were woken, this is already done
-		 */
-		if (!svc_thread_busy(rqstp)) {
-			spin_lock_bh(&pool->sp_lock);
-			list_del_init(&rqstp->rq_idle);
-			spin_unlock_bh(&pool->sp_lock);
+			set_current_state(TASK_IDLE | TASK_FREEZABLE);
 		}
+		__set_current_state(TASK_RUNNING);
 	} else {
 		cond_resched();
 	}
@@ -870,9 +868,10 @@ void svc_recv(struct svc_rqst *rqstp)
 		struct svc_xprt *xprt = rqstp->rq_xprt;
 
 		/* Normally we will wait up to 5 seconds for any required
-		 * cache information to be provided.
+		 * cache information to be provided.  When there are no
+		 * idle threads, we reduce the wait time.
 		 */
-		if (!list_empty(&pool->sp_idle_threads))
+		if (pool->sp_idle_threads.first)
 			rqstp->rq_chandle.thread_wait = 5 * HZ;
 		else
 			rqstp->rq_chandle.thread_wait = 1 * HZ;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/7] SUNRPC: only have one thread waking up at a time
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
  2023-08-18  1:45 ` [PATCH 1/7] SUNRPC: change service idle list to be an llist NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  2023-08-18  1:45 ` [PATCH 3/7] SUNRPC/svc: add light-weight queuing mechanism NeilBrown
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Currently if several items of work become available in quick succession,
that number of threads (if available) will be woken.  By the time some
of them wake up, another thread that was already cache-warm might have
come along and completed the work.  Anecdotal evidence suggests as many
as 15% of wakes find nothing to do once they get to the point of
looking.

This patch changes svc_pool_wake_idle_thread() to wake the first thread
on the queue but NOT remove it.  Subsequent calls will wake the same
thread.  Once that thread starts, it will dequeue itself and, after
dequeueing some work to do, will wake the next thread if there is more
work ready.  This results in a more orderly increase in the number of
busy threads.

As a bonus, this allows us to reduce locking around the idle queue.
svc_pool_wake_idle_thread() no longer needs to take a lock (beyond
rcu_read_lock()) as it doesn't manipulate the queue, it just looks at
the first item.

The thread itself can avoid locking by using the new
llist_del_first_this() interface.  This will safely remove the thread
itself if it is the head.  If it isn't the head, it will do nothing.
If multiple threads call this concurrently only one will succeed.  The
others will do nothing, so no corruption can result.
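
In sketch form, with a made-up "struct worker" purely for illustration,
the contract is:

    struct worker { struct llist_node node; };

    llist_add(&w->node, &idle_list);        /* go idle */
    ...
    if (llist_del_first_this(&idle_list, &w->node)) {
            /* w was the head: it is now off the list and can handle
             * whatever work triggered the wake-up.
             */
    } else {
            /* w is not the head: a thread that went idle more recently
             * sits in front of it and has taken responsibility for any
             * pending work, so w can sleep again.
             */
    }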

If a thread wakes up and finds that it cannot dequeue itself, that means
either:
- the wake-up was not because it was at the head of the queue; perhaps
  the freezer woke it.  In that case it can go back to sleep (after
  trying to freeze, of course).
- some other thread found there was nothing to do very recently, and
  placed itself on the head of the queue in front of this thread.
  It must check again after placing itself there, so it can be deemed to
  be responsible for any pending work, and this thread can go back to
  sleep until woken.

No code ever tests for busy threads any more.  Only each thread itself
cares if it is busy.  So functions to set and test the busy status are
no longer needed.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/llist.h      |  2 ++
 include/linux/sunrpc/svc.h | 24 --------------------
 lib/llist.c                | 27 ++++++++++++++++++++++
 net/sunrpc/svc.c           | 15 +++++--------
 net/sunrpc/svc_xprt.c      | 46 ++++++++++++++++++++++----------------
 5 files changed, 62 insertions(+), 52 deletions(-)

diff --git a/include/linux/llist.h b/include/linux/llist.h
index 85bda2d02d65..d8b1b73f3df0 100644
--- a/include/linux/llist.h
+++ b/include/linux/llist.h
@@ -248,6 +248,8 @@ static inline struct llist_node *__llist_del_all(struct llist_head *head)
 }
 
 extern struct llist_node *llist_del_first(struct llist_head *head);
+extern struct llist_node *llist_del_first_this(struct llist_head *head,
+					       struct llist_node *this);
 
 struct llist_node *llist_reverse_order(struct llist_node *head);
 
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 5216f95411e3..dafa362b4fdd 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -266,30 +266,6 @@ enum {
 	RQ_DATA,		/* request has data */
 };
 
-/**
- * svc_thread_set_busy - mark a thread as busy
- * @rqstp: the thread which is now busy
- *
- * By convention a thread is busy if rq_idle.next points to rq_idle.
- * This ensures it is not on the idle list.
- */
-static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
-{
-	rqstp->rq_idle.next = &rqstp->rq_idle;
-}
-
-/**
- * svc_thread_busy - check if a thread as busy
- * @rqstp: the thread which might be busy
- *
- * By convention a thread is busy if rq_idle.next points to rq_idle.
- * This ensures it is not on the idle list.
- */
-static inline bool svc_thread_busy(struct svc_rqst *rqstp)
-{
-	return rqstp->rq_idle.next == &rqstp->rq_idle;
-}
-
 #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
 
 /*
diff --git a/lib/llist.c b/lib/llist.c
index 6e668fa5a2c6..7e8a13a13586 100644
--- a/lib/llist.c
+++ b/lib/llist.c
@@ -65,6 +65,33 @@ struct llist_node *llist_del_first(struct llist_head *head)
 }
 EXPORT_SYMBOL_GPL(llist_del_first);
 
+/**
+ * llist_del_first_this - delete given entry of lock-less list if it is first
+ * @head:	the head for your lock-less list
+ * @this:	a list entry.
+ *
+ * If head of the list is given entry, delete and return it, else
+ * return %NULL.
+ *
+ * Providing the caller has exclusive access to @this, multiple callers can
+ * safely call this concurrently with multiple llist_add() callers.
+ */
+struct llist_node *llist_del_first_this(struct llist_head *head,
+					struct llist_node *this)
+{
+	struct llist_node *entry, *next;
+
+	entry = smp_load_acquire(&head->first);
+	do {
+		if (entry != this)
+			return NULL;
+		next = READ_ONCE(entry->next);
+	} while (!try_cmpxchg(&head->first, &entry, next));
+
+	return entry;
+}
+EXPORT_SYMBOL_GPL(llist_del_first_this);
+
 /**
  * llist_reverse_order - reverse order of a llist chain
  * @head:	first item of the list to be reversed
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index addbf28ea50a..3267d740235e 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -642,7 +642,6 @@ svc_rqst_alloc(struct svc_serv *serv, struct svc_pool *pool, int node)
 
 	folio_batch_init(&rqstp->rq_fbatch);
 
-	svc_thread_set_busy(rqstp);
 	rqstp->rq_server = serv;
 	rqstp->rq_pool = pool;
 
@@ -704,18 +703,16 @@ void svc_pool_wake_idle_thread(struct svc_pool *pool)
 	struct llist_node *ln;
 
 	rcu_read_lock();
-	spin_lock_bh(&pool->sp_lock);
-	ln = llist_del_first(&pool->sp_idle_threads);
-	spin_unlock_bh(&pool->sp_lock);
+	ln = READ_ONCE(pool->sp_idle_threads.first);
 	if (ln) {
 		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
-		svc_thread_set_busy(rqstp);
-
 		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
-		wake_up_process(rqstp->rq_task);
+		if (!task_is_running(rqstp->rq_task)) {
+			wake_up_process(rqstp->rq_task);
+			trace_svc_wake_up(rqstp->rq_task->pid);
+			percpu_counter_inc(&pool->sp_threads_woken);
+		}
 		rcu_read_unlock();
-		percpu_counter_inc(&pool->sp_threads_woken);
-		trace_svc_wake_up(rqstp->rq_task->pid);
 		return;
 	}
 	rcu_read_unlock();
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 81327001e074..397ca12299a2 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -732,26 +732,21 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
 	if (rqst_should_sleep(rqstp)) {
 		set_current_state(TASK_IDLE | TASK_FREEZABLE);
 		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
+		/* Maybe there were no idle threads when some work
+		 * became ready and so nothing was woken.  We've just
+		 * become idle so someone can do the work - maybe us.
+		 * So check again for work to do.
+		 */
+		if (likely(rqst_should_sleep(rqstp)))
+			schedule();
 
-		if (unlikely(!rqst_should_sleep(rqstp)))
-			/* maybe there were no idle threads when some work
-			 * became ready and so nothing was woken.  We've just
-			 * become idle so someone can do the work - maybe us.
-			 * But we cannot reliably remove ourselves from the
-			 * idle list - we can only remove the first task which
-			 * might be us, and might not.
-			 * So remove and wake it, then schedule().  If it was
-			 * us, we won't sleep.  If it is some other thread, they
-			 * will do the work.
+		while (llist_del_first_this(&pool->sp_idle_threads,
+					    &rqstp->rq_idle) == NULL) {
+			/* Cannot remove myself, so some other thread
+			 * must have queued themselves after finding
+			 * no work to do, so they have taken responsibility
+			 * for any outstanding work.
 			 */
-			svc_pool_wake_idle_thread(pool);
-
-		/* We mustn't continue while on the idle list, and we
-		 * cannot remove ourselves reliably.  The only "work"
-		 * we can do while on the idle list is to freeze.
-		 * So loop until someone removes us
-		 */
-		while (!svc_thread_busy(rqstp)) {
 			schedule();
 			set_current_state(TASK_IDLE | TASK_FREEZABLE);
 		}
@@ -841,6 +836,15 @@ static void svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
 	svc_xprt_release(rqstp);
 }
 
+static void wake_next(struct svc_rqst *rqstp)
+{
+	if (!rqst_should_sleep(rqstp))
+		/* More work pending after I dequeued some,
+		 * wake another worker
+		 */
+		svc_pool_wake_idle_thread(rqstp->rq_pool);
+}
+
 /**
  * svc_recv - Receive and process the next request on any transport
  * @rqstp: an idle RPC service thread
@@ -860,13 +864,16 @@ void svc_recv(struct svc_rqst *rqstp)
 
 	clear_bit(SP_TASK_PENDING, &pool->sp_flags);
 
-	if (svc_thread_should_stop(rqstp))
+	if (svc_thread_should_stop(rqstp)) {
+		wake_next(rqstp);
 		return;
+	}
 
 	rqstp->rq_xprt = svc_xprt_dequeue(pool);
 	if (rqstp->rq_xprt) {
 		struct svc_xprt *xprt = rqstp->rq_xprt;
 
+		wake_next(rqstp);
 		/* Normally we will wait up to 5 seconds for any required
 		 * cache information to be provided.  When there are no
 		 * idle threads, we reduce the wait time.
@@ -891,6 +898,7 @@ void svc_recv(struct svc_rqst *rqstp)
 		if (req) {
 			list_del(&req->rq_bc_list);
 			spin_unlock_bh(&serv->sv_cb_lock);
+			wake_next(rqstp);
 
 			svc_process_bc(req, rqstp);
 			return;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/7] SUNRPC/svc: add light-weight queuing mechanism.
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
  2023-08-18  1:45 ` [PATCH 1/7] SUNRPC: change service idle list to be an llist NeilBrown
  2023-08-18  1:45 ` [PATCH 2/7] SUNRPC: only have one thread waking up at a time NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  2023-08-18  1:45 ` [PATCH 4/7] SUNRPC: use lwq for sp_sockets - renamed to sp_xprts NeilBrown
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

lwq is a FIFO single-linked queue that only requires a spinlock
for dequeueing, which happens in process context.  Enqueueing is atomic
with no spinlock and can happen in any context.
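
Typical usage looks something like this (the foo_* names and handle()
are invented for illustration):

    struct foo_req {
            struct lwq_node link;
            int arg;
    };
    static struct lwq foo_queue;        /* lwq_init(&foo_queue) at setup */

    /* producer - any context, no locking */
    if (lwq_enqueue(&req->link, &foo_queue))
            wake_up_var(&foo_queue);    /* queue was idle: kick a consumer */

    /* consumer - process context only */
    while ((req = lwq_dequeue(&foo_queue, struct foo_req, link)) != NULL)
            handle(req);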

Include a unit test for basic functionality - it runs at boot/module-load
time.  It does not use the kunit framework.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/svc_lwq.h |  80 +++++++++++++++++++
 net/sunrpc/Kconfig             |   6 ++
 net/sunrpc/Makefile            |   2 +-
 net/sunrpc/svc_lwq.c           | 141 +++++++++++++++++++++++++++++++++
 4 files changed, 228 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/sunrpc/svc_lwq.h
 create mode 100644 net/sunrpc/svc_lwq.c

diff --git a/include/linux/sunrpc/svc_lwq.h b/include/linux/sunrpc/svc_lwq.h
new file mode 100644
index 000000000000..e5ffd25584ad
--- /dev/null
+++ b/include/linux/sunrpc/svc_lwq.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef SUNRPC_SVC_LWQ_H
+#define SUNRPC_SVC_LWQ_H
+
+/*
+ * light-weight single linked queue
+ *
+ * Entries can be enqueued from any context with no locking.
+ * Entries can be dequeued from process context with integrated locking.
+ *
+ */
+#include <linux/container_of.h>
+#include <linux/spinlock.h>
+#include <linux/llist.h>
+
+struct lwq_node {
+	struct llist_node node;
+};
+
+struct lwq {
+	spinlock_t		lock;
+	struct llist_node	*ready;		/* entries to be dequeued */
+	struct llist_head	new;		/* entries being enqueued */
+};
+
+static inline void lwq_init(struct lwq *q)
+{
+	spin_lock_init(&q->lock);
+	q->ready = NULL;
+	init_llist_head(&q->new);
+}
+
+static inline bool lwq_empty(struct lwq *q)
+{
+	return smp_load_acquire(&q->ready) == NULL && llist_empty(&q->new);
+}
+
+struct llist_node *__lwq_dequeue(struct lwq *q);
+#define lwq_dequeue(_q, _type, _member)					\
+	({ struct llist_node *_n = __lwq_dequeue(_q);			\
+	  _n ? container_of(_n, _type, _member.node) : NULL; })
+
+struct llist_node *lwq_dequeue_all(struct lwq *q);
+
+/**
+ * lwq_for_each_safe: iterate over detached queue allowing deletion
+ * @_n:		iterator variable
+ * @_t1:	temporary struct llist_node **
+ * @_t2:	temporary struct llist_node *
+ * @_l:		address of llist_node pointer from lwq_dequeue_all()
+ * @_member:	member in _n where lwq_node is found.
+ *
+ * Iterate over members in a dequeued list.  If the iterator variable
+ * is set to NULL, the iterator removes that entry from the queue.
+ */
+#define lwq_for_each_safe(_n, _t1, _t2, _l, _member)			\
+	for (_t1 = (_l);						\
+	     *(_t1) ? (_n = container_of(*(_t1), typeof(*(_n)), _member.node),\
+		       _t2 = ((*_t1)->next),				\
+		       true)						\
+	     : false;							\
+	     (_n) ? (_t1 = &(_n)->_member.node.next, 0)			\
+	     : ((*(_t1) = (_t2)),  0))
+
+static inline bool lwq_enqueue(struct lwq_node *n, struct lwq *q)
+{
+	return llist_add(&n->node, &q->new) &&
+		smp_load_acquire(&q->ready) == NULL;
+}
+
+static inline bool lwq_enqueue_batch(struct llist_node *n, struct lwq *q)
+{
+	struct llist_node *e = n;
+
+	return llist_add_batch(llist_reverse_order(n), e, &q->new) &&
+		smp_load_acquire(&q->ready) == NULL;
+}
+
+#endif /* SUNRPC_SVC_LWQ_H */
diff --git a/net/sunrpc/Kconfig b/net/sunrpc/Kconfig
index 2d8b67dac7b5..5de87d005962 100644
--- a/net/sunrpc/Kconfig
+++ b/net/sunrpc/Kconfig
@@ -115,3 +115,9 @@ config SUNRPC_XPRT_RDMA
 
 	  If unsure, or you know there is no RDMA capability on your
 	  hardware platform, say N.
+
+config SUNRPC_LWQ_TEST
+	bool "RPC: enable boot-time test for lwq queuing"
+	depends on SUNRPC
+	help
+          Enable boot-time test of lwq functionality.
diff --git a/net/sunrpc/Makefile b/net/sunrpc/Makefile
index f89c10fe7e6a..b224cba1d0da 100644
--- a/net/sunrpc/Makefile
+++ b/net/sunrpc/Makefile
@@ -10,7 +10,7 @@ obj-$(CONFIG_SUNRPC_XPRT_RDMA) += xprtrdma/
 
 sunrpc-y := clnt.o xprt.o socklib.o xprtsock.o sched.o \
 	    auth.o auth_null.o auth_tls.o auth_unix.o \
-	    svc.o svcsock.o svcauth.o svcauth_unix.o \
+	    svc.o svc_lwq.o svcsock.o svcauth.o svcauth_unix.o \
 	    addr.o rpcb_clnt.o timer.o xdr.o \
 	    sunrpc_syms.o cache.o rpc_pipe.o sysfs.o \
 	    svc_xprt.o \
diff --git a/net/sunrpc/svc_lwq.c b/net/sunrpc/svc_lwq.c
new file mode 100644
index 000000000000..9c10c68e3a21
--- /dev/null
+++ b/net/sunrpc/svc_lwq.c
@@ -0,0 +1,141 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Light weight single-linked queue.
+ *
+ * Entries are enqueued to the head of an llist, with no blocking.
+ * This can happen in any context.
+ *
+ * Entries are dequeued using a spinlock to protect against
+ * multiple access.  Entries are staged on a "ready" list in FIFO order
+ * (by reversing the llist), and that list is refreshed from the llist
+ * whenever it is exhausted.
+ */
+#include <linux/rcupdate.h>
+#include <linux/sunrpc/svc_lwq.h>
+
+struct llist_node *__lwq_dequeue(struct lwq *q)
+{
+	struct llist_node *this;
+
+	if (lwq_empty(q))
+		return NULL;
+	spin_lock(&q->lock);
+	this = q->ready;
+	if (!this && !llist_empty(&q->new)) {
+		/* ensure queue doesn't appear transiently lwq_empty */
+		smp_store_release(&q->ready, (void*)1);
+		this = llist_reverse_order(llist_del_all(&q->new));
+		if (!this)
+			q->ready = NULL;
+	}
+	if (this)
+		q->ready = llist_next(this);
+	spin_unlock(&q->lock);
+	return this;
+}
+
+struct llist_node *lwq_dequeue_all(struct lwq *q)
+{
+	struct llist_node *r, *t, **ep;
+
+	if (lwq_empty(q))
+		return NULL;
+
+	spin_lock(&q->lock);
+	r = q->ready;
+	q->ready = NULL;
+	t = llist_del_all(&q->new);
+	spin_unlock(&q->lock);
+	ep = &r;
+	while (*ep)
+		ep = &(*ep)->next;
+	*ep = llist_reverse_order(t);
+	return r;
+}
+
+#if IS_ENABLED(CONFIG_SUNRPC_LWQ_TEST)
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/wait_bit.h>
+#include <linux/kthread.h>
+#include <linux/delay.h>
+struct tnode {
+	struct lwq_node n;
+	int i;
+	int c;
+};
+
+static int lwq_exercise(void *qv)
+{
+	struct lwq *q = qv;
+	int cnt;
+	struct tnode *t;
+
+	for (cnt = 0; cnt < 10000; cnt++) {
+		wait_var_event(q, (t = lwq_dequeue(q, struct tnode, n)) != NULL);
+		t->c++;
+		if (lwq_enqueue(&t->n, q))
+			wake_up_var(q);
+	}
+	while (!kthread_should_stop())
+		schedule_timeout_idle(1);
+	return 0;
+}
+
+static int lwq_test(void)
+{
+	int i;
+	struct lwq q;
+	struct llist_node *l, **t1, *t2;
+	struct tnode *t;
+	struct task_struct *threads[8];
+
+	printk(KERN_INFO "testing lwq....\n");
+	lwq_init(&q);
+	printk(KERN_INFO " lwq: run some threads\n");
+	for (i = 0; i < ARRAY_SIZE(threads); i++)
+		threads[i] = kthread_run(lwq_exercise, &q, "lwq_test-%d", i);
+	for (i = 0; i < 100; i++) {
+		t = kmalloc(sizeof(*t), GFP_KERNEL);
+		t->i = i;
+		t->c = 0;
+		if (lwq_enqueue(&t->n, &q))
+			wake_up_var(&q);
+	}
+	/* wait for threads to exit */
+	for (i = 0; i < ARRAY_SIZE(threads); i++)
+		if (!IS_ERR_OR_NULL(threads[i]))
+			kthread_stop(threads[i]);
+	printk(KERN_INFO " lwq: dequeue first 50:");
+	for (i = 0; i < 50 ; i++) {
+		if (i && (i % 10) == 0) {
+			printk(KERN_CONT "\n");
+			printk(KERN_INFO " lwq: ... ");
+		}
+		t = lwq_dequeue(&q, struct tnode, n);
+		printk(KERN_CONT " %d(%d)", t->i, t->c);
+		kfree(t);
+	}
+	printk(KERN_CONT "\n");
+	l = lwq_dequeue_all(&q);
+	printk(KERN_INFO " lwq: delete the multiples of 3 (test lwq_for_each_safe())\n");
+	lwq_for_each_safe(t, t1, t2, &l, n) {
+		if ((t->i % 3) == 0) {
+			t->i = -1;
+			kfree(t);
+			t = NULL;
+		}
+	}
+	if (l)
+		lwq_enqueue_batch(l, &q);
+	printk(KERN_INFO " lwq: dequeue remaining:");
+	while ((t = lwq_dequeue(&q, struct tnode, n)) != NULL) {
+		printk(KERN_CONT " %d", t->i);
+		kfree(t);
+	}
+	printk(KERN_CONT "\n");
+	return 0;
+}
+
+module_init(lwq_test);
+#endif /* CONFIG_SUNRPC_LWQ_TEST*/
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/7] SUNRPC: use lwq for sp_sockets - renamed to sp_xprts
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
                   ` (2 preceding siblings ...)
  2023-08-18  1:45 ` [PATCH 3/7] SUNRPC/svc: add light-weight queuing mechanism NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  2023-08-18  1:45 ` [PATCH 5/7] SUNRPC: change sp_nrthreads to atomic_t NeilBrown
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

lwq avoids using back pointers in lists, and uses less locking.
This introduces a new spinlock, but the other one will be removed in a
future patch.

For svc_clean_up_xprts(), we now dequeue the entire queue, walk it to
remove and process the xprts that need cleaning up, then re-enqueue the
remaining queue.
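
The one subtle point is the iterator convention: setting the cursor to
NULL inside the loop is what tells lwq_for_each_safe() to unlink that
entry.  In outline (condensed from the hunk below):

    struct llist_node *q, **t1, *t2;
    struct svc_xprt *xprt;

    q = lwq_dequeue_all(&pool->sp_xprts);
    lwq_for_each_safe(xprt, t1, t2, &q, xpt_ready) {
            if (xprt->xpt_net == net) {
                    svc_delete_xprt(xprt);
                    xprt = NULL;    /* unlink this entry from q */
            }
    }
    if (q)
            lwq_enqueue_batch(q, &pool->sp_xprts);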

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/svc.h      |  3 +-
 include/linux/sunrpc/svc_xprt.h |  2 +-
 net/sunrpc/svc.c                |  2 +-
 net/sunrpc/svc_xprt.c           | 57 ++++++++++-----------------------
 4 files changed, 21 insertions(+), 43 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index dafa362b4fdd..8b2bd034a760 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -17,6 +17,7 @@
 #include <linux/sunrpc/xdr.h>
 #include <linux/sunrpc/auth.h>
 #include <linux/sunrpc/svcauth.h>
+#include <linux/sunrpc/svc_lwq.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
 #include <linux/pagevec.h>
@@ -34,7 +35,7 @@
 struct svc_pool {
 	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
 	spinlock_t		sp_lock;	/* protects all fields */
-	struct list_head	sp_sockets;	/* pending sockets */
+	struct lwq		sp_xprts;	/* pending transports */
 	unsigned int		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
 	struct llist_head	sp_idle_threads; /* idle server threads */
diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index fa55d12dc765..8e20cd60e2e7 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -54,7 +54,7 @@ struct svc_xprt {
 	const struct svc_xprt_ops *xpt_ops;
 	struct kref		xpt_ref;
 	struct list_head	xpt_list;
-	struct list_head	xpt_ready;
+	struct lwq_node		xpt_ready;
 	unsigned long		xpt_flags;
 
 	struct svc_serv		*xpt_server;	/* service for transport */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 3267d740235e..63cddb8cb08d 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -508,7 +508,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 				i, serv->sv_name);
 
 		pool->sp_id = i;
-		INIT_LIST_HEAD(&pool->sp_sockets);
+		lwq_init(&pool->sp_xprts);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
 		init_llist_head(&pool->sp_idle_threads);
 		spin_lock_init(&pool->sp_lock);
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 397ca12299a2..39843c064361 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -201,7 +201,6 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
 	kref_init(&xprt->xpt_ref);
 	xprt->xpt_server = serv;
 	INIT_LIST_HEAD(&xprt->xpt_list);
-	INIT_LIST_HEAD(&xprt->xpt_ready);
 	INIT_LIST_HEAD(&xprt->xpt_deferred);
 	INIT_LIST_HEAD(&xprt->xpt_users);
 	mutex_init(&xprt->xpt_mutex);
@@ -472,9 +471,7 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
 	pool = svc_pool_for_cpu(xprt->xpt_server);
 
 	percpu_counter_inc(&pool->sp_sockets_queued);
-	spin_lock_bh(&pool->sp_lock);
-	list_add_tail(&xprt->xpt_ready, &pool->sp_sockets);
-	spin_unlock_bh(&pool->sp_lock);
+	lwq_enqueue(&xprt->xpt_ready, &pool->sp_xprts);
 
 	svc_pool_wake_idle_thread(pool);
 }
@@ -487,18 +484,9 @@ static struct svc_xprt *svc_xprt_dequeue(struct svc_pool *pool)
 {
 	struct svc_xprt	*xprt = NULL;
 
-	if (list_empty(&pool->sp_sockets))
-		goto out;
-
-	spin_lock_bh(&pool->sp_lock);
-	if (likely(!list_empty(&pool->sp_sockets))) {
-		xprt = list_first_entry(&pool->sp_sockets,
-					struct svc_xprt, xpt_ready);
-		list_del_init(&xprt->xpt_ready);
+	xprt = lwq_dequeue(&pool->sp_xprts, struct svc_xprt, xpt_ready);
+	if (xprt)
 		svc_xprt_get(xprt);
-	}
-	spin_unlock_bh(&pool->sp_lock);
-out:
 	return xprt;
 }
 
@@ -708,7 +696,7 @@ rqst_should_sleep(struct svc_rqst *rqstp)
 		return false;
 
 	/* was a socket queued? */
-	if (!list_empty(&pool->sp_sockets))
+	if (!lwq_empty(&pool->sp_xprts))
 		return false;
 
 	/* are we shutting down? */
@@ -1052,7 +1040,6 @@ static void svc_delete_xprt(struct svc_xprt *xprt)
 
 	spin_lock_bh(&serv->sv_lock);
 	list_del_init(&xprt->xpt_list);
-	WARN_ON_ONCE(!list_empty(&xprt->xpt_ready));
 	if (test_bit(XPT_TEMP, &xprt->xpt_flags))
 		serv->sv_tmpcnt--;
 	spin_unlock_bh(&serv->sv_lock);
@@ -1103,36 +1090,26 @@ static int svc_close_list(struct svc_serv *serv, struct list_head *xprt_list, st
 	return ret;
 }
 
-static struct svc_xprt *svc_dequeue_net(struct svc_serv *serv, struct net *net)
+static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
 {
-	struct svc_pool *pool;
 	struct svc_xprt *xprt;
-	struct svc_xprt *tmp;
 	int i;
 
 	for (i = 0; i < serv->sv_nrpools; i++) {
-		pool = &serv->sv_pools[i];
-
-		spin_lock_bh(&pool->sp_lock);
-		list_for_each_entry_safe(xprt, tmp, &pool->sp_sockets, xpt_ready) {
-			if (xprt->xpt_net != net)
-				continue;
-			list_del_init(&xprt->xpt_ready);
-			spin_unlock_bh(&pool->sp_lock);
-			return xprt;
+		struct svc_pool *pool = &serv->sv_pools[i];
+		struct llist_node *q, **t1, *t2;
+
+		q = lwq_dequeue_all(&pool->sp_xprts);
+		lwq_for_each_safe(xprt, t1, t2, &q, xpt_ready) {
+			if (xprt->xpt_net == net) {
+				set_bit(XPT_CLOSE, &xprt->xpt_flags);
+				svc_delete_xprt(xprt);
+				xprt = NULL;
+			}
 		}
-		spin_unlock_bh(&pool->sp_lock);
-	}
-	return NULL;
-}
 
-static void svc_clean_up_xprts(struct svc_serv *serv, struct net *net)
-{
-	struct svc_xprt *xprt;
-
-	while ((xprt = svc_dequeue_net(serv, net))) {
-		set_bit(XPT_CLOSE, &xprt->xpt_flags);
-		svc_delete_xprt(xprt);
+		if (q)
+			lwq_enqueue_batch(q, &pool->sp_xprts);
 	}
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 5/7] SUNRPC: change sp_nrthreads to atomic_t
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
                   ` (3 preceding siblings ...)
  2023-08-18  1:45 ` [PATCH 4/7] SUNRPC: use lwq for sp_sockets - renamed to sp_xprts NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  2023-08-18  1:45 ` [PATCH 6/7] SUNRPC: discard sp_lock NeilBrown
  2023-08-18  1:45 ` [PATCH 7/7] SUNRPC: change the back-channel queue to lwq NeilBrown
  6 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

Using an atomic_t avoids the need to take a spinlock (which can soon be
removed).

Choosing a thread to kill needs care, as we cannot set the "die
now" bit atomically with the test on the count.  Instead we temporarily
increase the count.
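
In outline, the selection step becomes (condensed from the hunk below):

    if (!atomic_inc_not_zero(&pool->sp_nrthreads))
            return NULL;            /* no threads left in this pool */
    set_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
    set_bit(SP_NEED_VICTIM, &pool->sp_flags);
    if (atomic_dec_and_test(&pool->sp_nrthreads)) {
            /* the last real thread exited while we held our temporary
             * reference: undo the request and retry pool selection
             */
            clear_bit(SP_NEED_VICTIM, &pool->sp_flags);
            clear_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
            goto retry;
    }
    return pool;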

Signed-off-by: NeilBrown <neilb@suse.de>
---
 fs/nfsd/nfssvc.c           |  3 ++-
 include/linux/sunrpc/svc.h |  2 +-
 net/sunrpc/svc.c           | 37 ++++++++++++++++++++-----------------
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/nfsd/nfssvc.c b/fs/nfsd/nfssvc.c
index 062f51fe4dfb..5e455ced0711 100644
--- a/fs/nfsd/nfssvc.c
+++ b/fs/nfsd/nfssvc.c
@@ -718,7 +718,8 @@ int nfsd_get_nrthreads(int n, int *nthreads, struct net *net)
 
 	if (nn->nfsd_serv != NULL) {
 		for (i = 0; i < nn->nfsd_serv->sv_nrpools && i < n; i++)
-			nthreads[i] = nn->nfsd_serv->sv_pools[i].sp_nrthreads;
+			nthreads[i] =
+				atomic_read(&nn->nfsd_serv->sv_pools[i].sp_nrthreads);
 	}
 
 	return 0;
diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index 8b2bd034a760..c972561f5ab8 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -36,7 +36,7 @@ struct svc_pool {
 	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
 	spinlock_t		sp_lock;	/* protects all fields */
 	struct lwq		sp_xprts;	/* pending transports */
-	unsigned int		sp_nrthreads;	/* # of threads in pool */
+	atomic_t		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
 	struct llist_head	sp_idle_threads; /* idle server threads */
 
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 63cddb8cb08d..9524af33ace9 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -681,8 +681,8 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	serv->sv_nrthreads += 1;
 	spin_unlock_bh(&serv->sv_lock);
 
+	atomic_inc(&pool->sp_nrthreads);
 	spin_lock_bh(&pool->sp_lock);
-	pool->sp_nrthreads++;
 	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
 	spin_unlock_bh(&pool->sp_lock);
 	return rqstp;
@@ -727,23 +727,24 @@ svc_pool_next(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
 }
 
 static struct svc_pool *
-svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *state)
+svc_pool_victim(struct svc_serv *serv, struct svc_pool *target_pool,
+		unsigned int *state)
 {
+	struct svc_pool *pool;
 	unsigned int i;
 
+retry:
+	pool = target_pool;
+
 	if (pool != NULL) {
-		spin_lock_bh(&pool->sp_lock);
-		if (pool->sp_nrthreads)
+		if (atomic_inc_not_zero(&pool->sp_nrthreads))
 			goto found_pool;
-		spin_unlock_bh(&pool->sp_lock);
 		return NULL;
 	} else {
 		for (i = 0; i < serv->sv_nrpools; i++) {
 			pool = &serv->sv_pools[--(*state) % serv->sv_nrpools];
-			spin_lock_bh(&pool->sp_lock);
-			if (pool->sp_nrthreads)
+			if (atomic_inc_not_zero(&pool->sp_nrthreads))
 				goto found_pool;
-			spin_unlock_bh(&pool->sp_lock);
 		}
 		return NULL;
 	}
@@ -751,8 +752,12 @@ svc_pool_victim(struct svc_serv *serv, struct svc_pool *pool, unsigned int *stat
 found_pool:
 	set_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
 	set_bit(SP_NEED_VICTIM, &pool->sp_flags);
-	spin_unlock_bh(&pool->sp_lock);
-	return pool;
+	if (!atomic_dec_and_test(&pool->sp_nrthreads))
+		return pool;
+	/* Nothing left in this pool any more */
+	clear_bit(SP_NEED_VICTIM, &pool->sp_flags);
+	clear_bit(SP_VICTIM_REMAINS, &pool->sp_flags);
+	goto retry;
 }
 
 static int
@@ -828,13 +833,10 @@ svc_stop_kthreads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 int
 svc_set_num_threads(struct svc_serv *serv, struct svc_pool *pool, int nrservs)
 {
-	if (pool == NULL) {
+	if (!pool)
 		nrservs -= serv->sv_nrthreads;
-	} else {
-		spin_lock_bh(&pool->sp_lock);
-		nrservs -= pool->sp_nrthreads;
-		spin_unlock_bh(&pool->sp_lock);
-	}
+	else
+		nrservs -= atomic_read(&pool->sp_nrthreads);
 
 	if (nrservs > 0)
 		return svc_start_kthreads(serv, pool, nrservs);
@@ -921,10 +923,11 @@ svc_exit_thread(struct svc_rqst *rqstp)
 	struct svc_pool	*pool = rqstp->rq_pool;
 
 	spin_lock_bh(&pool->sp_lock);
-	pool->sp_nrthreads--;
 	list_del_rcu(&rqstp->rq_all);
 	spin_unlock_bh(&pool->sp_lock);
 
+	atomic_dec(&pool->sp_nrthreads);
+
 	spin_lock_bh(&serv->sv_lock);
 	serv->sv_nrthreads -= 1;
 	spin_unlock_bh(&serv->sv_lock);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 6/7] SUNRPC: discard sp_lock
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
                   ` (4 preceding siblings ...)
  2023-08-18  1:45 ` [PATCH 5/7] SUNRPC: change sp_nrthreads to atomic_t NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  2023-08-18  1:45 ` [PATCH 7/7] SUNRPC: change the back-channel queue to lwq NeilBrown
  6 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

sp_lock is now only used to protect sp_all_threads.  This isn't needed
as sp_all_threads is only manipulated through svc_set_num_threads(),
whose callers already hold an appropriate lock.  Read access only
requires rcu_read_lock(), so no other locking is needed.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/svc.h |  1 -
 net/sunrpc/svc.c           | 10 +++++-----
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index c972561f5ab8..b80b698b09c5 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -34,7 +34,6 @@
  */
 struct svc_pool {
 	unsigned int		sp_id;	    	/* pool id; also node id on NUMA */
-	spinlock_t		sp_lock;	/* protects all fields */
 	struct lwq		sp_xprts;	/* pending transports */
 	atomic_t		sp_nrthreads;	/* # of threads in pool */
 	struct list_head	sp_all_threads;	/* all server threads */
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 9524af33ace9..61ea8ce7975f 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -511,7 +511,6 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
 		lwq_init(&pool->sp_xprts);
 		INIT_LIST_HEAD(&pool->sp_all_threads);
 		init_llist_head(&pool->sp_idle_threads);
-		spin_lock_init(&pool->sp_lock);
 
 		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
 		percpu_counter_init(&pool->sp_sockets_queued, 0, GFP_KERNEL);
@@ -682,9 +681,12 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
 	spin_unlock_bh(&serv->sv_lock);
 
 	atomic_inc(&pool->sp_nrthreads);
-	spin_lock_bh(&pool->sp_lock);
+
+	/* Protected by whatever lock the service uses when calling
+	 * svc_set_num_threads()
+	 */
 	list_add_rcu(&rqstp->rq_all, &pool->sp_all_threads);
-	spin_unlock_bh(&pool->sp_lock);
+
 	return rqstp;
 }
 
@@ -922,9 +924,7 @@ svc_exit_thread(struct svc_rqst *rqstp)
 	struct svc_serv	*serv = rqstp->rq_server;
 	struct svc_pool	*pool = rqstp->rq_pool;
 
-	spin_lock_bh(&pool->sp_lock);
 	list_del_rcu(&rqstp->rq_all);
-	spin_unlock_bh(&pool->sp_lock);
 
 	atomic_dec(&pool->sp_nrthreads);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 7/7] SUNRPC: change the back-channel queue to lwq
  2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
                   ` (5 preceding siblings ...)
  2023-08-18  1:45 ` [PATCH 6/7] SUNRPC: discard sp_lock NeilBrown
@ 2023-08-18  1:45 ` NeilBrown
  6 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-18  1:45 UTC (permalink / raw)
  To: Chuck Lever, Jeff Layton; +Cc: linux-nfs

This removes the need to store and update back-links in the list.
It also removes the need for the _bh version of spin_lock().

Signed-off-by: NeilBrown <neilb@suse.de>
---
 include/linux/sunrpc/svc.h        |  3 +--
 include/linux/sunrpc/xprt.h       |  3 ++-
 net/sunrpc/backchannel_rqst.c     |  5 +----
 net/sunrpc/svc.c                  |  3 +--
 net/sunrpc/svc_xprt.c             | 12 +++---------
 net/sunrpc/xprtrdma/backchannel.c |  4 +---
 6 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
index b80b698b09c5..a05a2e1f77ec 100644
--- a/include/linux/sunrpc/svc.h
+++ b/include/linux/sunrpc/svc.h
@@ -90,10 +90,9 @@ struct svc_serv {
 	int			(*sv_threadfn)(void *data);
 
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
-	struct list_head	sv_cb_list;	/* queue for callback requests
+	struct lwq		sv_cb_list;	/* queue for callback requests
 						 * that arrive over the same
 						 * connection */
-	spinlock_t		sv_cb_lock;	/* protects the svc_cb_list */
 	bool			sv_bc_enabled;	/* service uses backchannel */
 #endif /* CONFIG_SUNRPC_BACKCHANNEL */
 };
diff --git a/include/linux/sunrpc/xprt.h b/include/linux/sunrpc/xprt.h
index b52411bcfe4e..0a77a6f1c32a 100644
--- a/include/linux/sunrpc/xprt.h
+++ b/include/linux/sunrpc/xprt.h
@@ -57,6 +57,7 @@ struct xprt_class;
 struct seq_file;
 struct svc_serv;
 struct net;
+#include <linux/sunrpc/svc_lwq.h>
 
 /*
  * This describes a complete RPC request
@@ -121,7 +122,7 @@ struct rpc_rqst {
 	int			rq_ntrans;
 
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
-	struct list_head	rq_bc_list;	/* Callback service list */
+	struct lwq_node		rq_bc_list;	/* Callback service list */
 	unsigned long		rq_bc_pa_state;	/* Backchannel prealloc state */
 	struct list_head	rq_bc_pa_list;	/* Backchannel prealloc list */
 #endif /* CONFIG_SUNRPC_BACKCHANEL */
diff --git a/net/sunrpc/backchannel_rqst.c b/net/sunrpc/backchannel_rqst.c
index 44b7c89a635f..caa94cf57123 100644
--- a/net/sunrpc/backchannel_rqst.c
+++ b/net/sunrpc/backchannel_rqst.c
@@ -83,7 +83,6 @@ static struct rpc_rqst *xprt_alloc_bc_req(struct rpc_xprt *xprt)
 		return NULL;
 
 	req->rq_xprt = xprt;
-	INIT_LIST_HEAD(&req->rq_bc_list);
 
 	/* Preallocate one XDR receive buffer */
 	if (xprt_alloc_xdr_buf(&req->rq_rcv_buf, gfp_flags) < 0) {
@@ -367,8 +366,6 @@ void xprt_complete_bc_request(struct rpc_rqst *req, uint32_t copied)
 
 	dprintk("RPC:       add callback request to list\n");
 	xprt_get(xprt);
-	spin_lock(&bc_serv->sv_cb_lock);
-	list_add(&req->rq_bc_list, &bc_serv->sv_cb_list);
-	spin_unlock(&bc_serv->sv_cb_lock);
+	lwq_enqueue(&req->rq_bc_list, &bc_serv->sv_cb_list);
 	svc_pool_wake_idle_thread(&bc_serv->sv_pools[0]);
 }
diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
index 61ea8ce7975f..3d3aaed8311c 100644
--- a/net/sunrpc/svc.c
+++ b/net/sunrpc/svc.c
@@ -438,8 +438,7 @@ EXPORT_SYMBOL_GPL(svc_bind);
 static void
 __svc_init_bc(struct svc_serv *serv)
 {
-	INIT_LIST_HEAD(&serv->sv_cb_list);
-	spin_lock_init(&serv->sv_cb_lock);
+	lwq_init(&serv->sv_cb_list);
 }
 #else
 static void
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 39843c064361..5143fa4235fb 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -705,7 +705,7 @@ rqst_should_sleep(struct svc_rqst *rqstp)
 
 #if defined(CONFIG_SUNRPC_BACKCHANNEL)
 	if (svc_is_backchannel(rqstp)) {
-		if (!list_empty(&rqstp->rq_server->sv_cb_list))
+		if (!lwq_empty(&rqstp->rq_server->sv_cb_list))
 			return false;
 	}
 #endif
@@ -880,18 +880,12 @@ void svc_recv(struct svc_rqst *rqstp)
 		struct svc_serv *serv = rqstp->rq_server;
 		struct rpc_rqst *req;
 
-		spin_lock_bh(&serv->sv_cb_lock);
-		req = list_first_entry_or_null(&serv->sv_cb_list,
-					       struct rpc_rqst, rq_bc_list);
+		req = lwq_dequeue(&serv->sv_cb_list,
+				  struct rpc_rqst, rq_bc_list);
 		if (req) {
-			list_del(&req->rq_bc_list);
-			spin_unlock_bh(&serv->sv_cb_lock);
 			wake_next(rqstp);
-
 			svc_process_bc(req, rqstp);
-			return;
 		}
-		spin_unlock_bh(&serv->sv_cb_lock);
 	}
 #endif
 }
diff --git a/net/sunrpc/xprtrdma/backchannel.c b/net/sunrpc/xprtrdma/backchannel.c
index bfc434ec52a7..8c817e755262 100644
--- a/net/sunrpc/xprtrdma/backchannel.c
+++ b/net/sunrpc/xprtrdma/backchannel.c
@@ -263,9 +263,7 @@ void rpcrdma_bc_receive_call(struct rpcrdma_xprt *r_xprt,
 	/* Queue rqst for ULP's callback service */
 	bc_serv = xprt->bc_serv;
 	xprt_get(xprt);
-	spin_lock(&bc_serv->sv_cb_lock);
-	list_add(&rqst->rq_bc_list, &bc_serv->sv_cb_list);
-	spin_unlock(&bc_serv->sv_cb_lock);
+	lwq_enqueue(&rqst->rq_bc_list, &bc_serv->sv_cb_list);
 
 	svc_pool_wake_idle_thread(&bc_serv->sv_pools[0]);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/7] SUNRPC: change service idle list to be an llist
  2023-08-18  1:45 ` [PATCH 1/7] SUNRPC: change service idle list to be an llist NeilBrown
@ 2023-08-18 15:53   ` Chuck Lever
  2023-08-30  1:52     ` NeilBrown
  0 siblings, 1 reply; 10+ messages in thread
From: Chuck Lever @ 2023-08-18 15:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: Jeff Layton, linux-nfs

On Fri, Aug 18, 2023 at 11:45:06AM +1000, NeilBrown wrote:
> With an llist we don't need to take a lock to add a thread to the list,
> though we still need a lock to remove it.  That will go in the next
> patch.
> 
> Unlike double-linked lists, a thread cannot reliably remove itself from
> the list.  Only the first thread can be removed, and that can change
> asynchronously.  So some care is needed.
> 
> We already check if there is pending work to do, so we are unlikely to
> add ourselves to the idle list and then want to remove ourselves again.
> 
> If we DO find something needs to be done after adding ourselves to the
> list, we simply wake up the first thread on the list.  If that was us,
> we successfully removed ourselves and can continue.  If it was some
> other thread, they will do the work that needs to be done.  We can
> safely sleep until woken.
> 
> We also remove the test on freezing() from rqst_should_sleep().  Instead
> we set TASK_FREEZABLE before scheduling.  This makes it safe to
> schedule() when a freeze is pending.  As we now loop waiting to be
> removed from the idle queue, this is a cleaner way to handle freezing.
> 
> task_state_index() incorrectly identifies a task with
>    TASK_IDLE | TASK_FREEZABLE
> as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
> just as it ignores extra flags on other states.

Let's split the task_state_index() change into a separate patch
that can be sent to the scheduler maintainers for their Review/Ack.


> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  include/linux/sched.h      |  4 +--
>  include/linux/sunrpc/svc.h | 14 ++++++-----
>  net/sunrpc/svc.c           | 13 +++++-----
>  net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
>  4 files changed, 42 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 609bde814cb0..a5f3badcb629 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1666,7 +1666,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
>  
>  	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
>  
> -	if (tsk_state == TASK_IDLE)
> +	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
>  		state = TASK_REPORT_IDLE;
>  
>  	/*
> @@ -1674,7 +1674,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
>  	 * to userspace, we can make this appear as if the task has gone through
>  	 * a regular rt_mutex_lock() call.
>  	 */
> -	if (tsk_state == TASK_RTLOCK_WAIT)
> +	if (tsk_state & TASK_RTLOCK_WAIT)
>  		state = TASK_UNINTERRUPTIBLE;
>  
>  	return fls(state);
> diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> index 22b3018ebf62..5216f95411e3 100644
> --- a/include/linux/sunrpc/svc.h
> +++ b/include/linux/sunrpc/svc.h
> @@ -37,7 +37,7 @@ struct svc_pool {
>  	struct list_head	sp_sockets;	/* pending sockets */
>  	unsigned int		sp_nrthreads;	/* # of threads in pool */
>  	struct list_head	sp_all_threads;	/* all server threads */
> -	struct list_head	sp_idle_threads; /* idle server threads */
> +	struct llist_head	sp_idle_threads; /* idle server threads */
>  
>  	/* statistics on pool operation */
>  	struct percpu_counter	sp_messages_arrived;
> @@ -186,7 +186,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
>   */
>  struct svc_rqst {
>  	struct list_head	rq_all;		/* all threads list */
> -	struct list_head	rq_idle;	/* On the idle list */
> +	struct llist_node	rq_idle;	/* On the idle list */
>  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
>  	struct svc_xprt *	rq_xprt;	/* transport ptr */
>  
> @@ -270,22 +270,24 @@ enum {
>   * svc_thread_set_busy - mark a thread as busy
>   * @rqstp: the thread which is now busy
>   *
> - * If rq_idle is "empty", the thread must be busy.
> + * By convention a thread is busy if rq_idle.next points to rq_idle.
> + * This ensures it is not on the idle list.
>   */
>  static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
>  {
> -	INIT_LIST_HEAD(&rqstp->rq_idle);
> +	rqstp->rq_idle.next = &rqstp->rq_idle;
>  }

I don't understand the comment "This ensures it is not on the idle
list." svc_thread_set_busy() is called in two places: once when
an svc_rqst is created, and once directly after an
llist_del_first() has been done. @rqstp is already not on the
idle list in either case.

What really needs an explanation here is that there's no
existing utility to check whether an llist_node is on a list or
not.


>  /**
>   * svc_thread_busy - check if a thread as busy
>   * @rqstp: the thread which might be busy
>   *
> - * If rq_idle is "empty", the thread must be busy.
> + * By convention a thread is busy if rq_idle.next points to rq_idle.
> + * This ensures it is not on the idle list.

This function doesn't modify the thread, so it can't ensure it
is not on the idle list.


>   */
>  static inline bool svc_thread_busy(struct svc_rqst *rqstp)

const struct svc_rqst *rqstp

>  {
> -	return list_empty(&rqstp->rq_idle);
> +	return rqstp->rq_idle.next == &rqstp->rq_idle;
>  }
>  
>  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> index 051f08c48418..addbf28ea50a 100644
> --- a/net/sunrpc/svc.c
> +++ b/net/sunrpc/svc.c
> @@ -510,7 +510,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
>  		pool->sp_id = i;
>  		INIT_LIST_HEAD(&pool->sp_sockets);
>  		INIT_LIST_HEAD(&pool->sp_all_threads);
> -		INIT_LIST_HEAD(&pool->sp_idle_threads);
> +		init_llist_head(&pool->sp_idle_threads);
>  		spin_lock_init(&pool->sp_lock);
>  
>  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> @@ -701,15 +701,16 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
>  void svc_pool_wake_idle_thread(struct svc_pool *pool)
>  {
>  	struct svc_rqst	*rqstp;
> +	struct llist_node *ln;
>  
>  	rcu_read_lock();
>  	spin_lock_bh(&pool->sp_lock);
> -	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
> -					 struct svc_rqst, rq_idle);
> -	if (rqstp)
> -		list_del_init(&rqstp->rq_idle);
> +	ln = llist_del_first(&pool->sp_idle_threads);
>  	spin_unlock_bh(&pool->sp_lock);
> -	if (rqstp) {
> +	if (ln) {
> +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
> +		svc_thread_set_busy(rqstp);
> +
>  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
>  		wake_up_process(rqstp->rq_task);
>  		rcu_read_unlock();
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index fa0d854a5596..81327001e074 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -715,10 +715,6 @@ rqst_should_sleep(struct svc_rqst *rqstp)
>  	if (svc_thread_should_stop(rqstp))
>  		return false;
>  
> -	/* are we freezing? */
> -	if (freezing(current))
> -		return false;
> -
>  #if defined(CONFIG_SUNRPC_BACKCHANNEL)
>  	if (svc_is_backchannel(rqstp)) {
>  		if (!list_empty(&rqstp->rq_server->sv_cb_list))
> @@ -734,30 +730,32 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
>  	struct svc_pool *pool = rqstp->rq_pool;
>  
>  	if (rqst_should_sleep(rqstp)) {
> -		set_current_state(TASK_IDLE);
> -		spin_lock_bh(&pool->sp_lock);
> -		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> -		spin_unlock_bh(&pool->sp_lock);
> +		set_current_state(TASK_IDLE | TASK_FREEZABLE);

I'm trying to learn about the semantics of IDLE|FREEZABLE, but there
isn't another instance of this flag combination in the kernel.


> +		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> +
> +		if (unlikely(!rqst_should_sleep(rqstp)))
> +			/* maybe there were no idle threads when some work
> +			 * became ready and so nothing was woken.  We've just
> +			 * become idle so someone can do the work - maybe us.
> +			 * But we cannot reliably remove ourselves from the
> +			 * idle list - we can only remove the first task which
> +			 * might be us, and might not.
> +			 * So remove and wake it, then schedule().  If it was
> +			 * us, we won't sleep.  If it is some other thread, they
> +			 * will do the work.
> +			 */

Large block comments suggest that this code doesn't document itself
very well.

At least, some of the textual redundancy can be removed, though this
comment and the next are removed or replaced in later patches. I'll
leave it up to you to find an approach that is a bit cleaner.


> +			svc_pool_wake_idle_thread(pool);
>  
> -		/* Need to check should_sleep() again after
> -		 * setting task state in case a wakeup happened
> -		 * between testing and setting.
> +		/* We mustn't continue while on the idle list, and we
> +		 * cannot remove ourselves reliably.  The only "work"
> +		 * we can do while on the idle list is to freeze.
> +		 * So loop until someone removes us
>  		 */

Here and above, the use of "we" obscures the explanation. What is
"we" in this context? I think it might be "this thread" or @rqstp,
but I can't be certain. For instance:

		/* Since a thread cannot remove itself from an llist,
		 * schedule until someone else removes @rqstp from
		 * the idle list.
		 */


> -		if (rqst_should_sleep(rqstp)) {
> +		while (!svc_thread_busy(rqstp)) {

I need to convince myself that this while() can't possibly result in
an infinite loop.


>  			schedule();
> -		} else {
> -			__set_current_state(TASK_RUNNING);
> -			cond_resched();
> -		}
> -
> -		/* We *must* be removed from the list before we can continue.
> -		 * If we were woken, this is already done
> -		 */
> -		if (!svc_thread_busy(rqstp)) {
> -			spin_lock_bh(&pool->sp_lock);
> -			list_del_init(&rqstp->rq_idle);
> -			spin_unlock_bh(&pool->sp_lock);
> +			set_current_state(TASK_IDLE | TASK_FREEZABLE);
>  		}
> +		__set_current_state(TASK_RUNNING);
> 
>  	} else {
>  		cond_resched();
>  	}

There's a try_to_freeze() call just after this hunk. Is there a
reason to invoke cond_resched(); before freezing?


> @@ -870,9 +868,10 @@ void svc_recv(struct svc_rqst *rqstp)
>  		struct svc_xprt *xprt = rqstp->rq_xprt;
>  
>  		/* Normally we will wait up to 5 seconds for any required
> -		 * cache information to be provided.
> +		 * cache information to be provided.  When there are no
> +		 * idle threads, we reduce the wait time.
>  		 */
> -		if (!list_empty(&pool->sp_idle_threads))
> +		if (pool->sp_idle_threads.first)
>  			rqstp->rq_chandle.thread_wait = 5 * HZ;
>  		else
>  			rqstp->rq_chandle.thread_wait = 1 * HZ;
> -- 
> 2.40.1
> 

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/7] SUNRPC: change service idle list to be an llist
  2023-08-18 15:53   ` Chuck Lever
@ 2023-08-30  1:52     ` NeilBrown
  0 siblings, 0 replies; 10+ messages in thread
From: NeilBrown @ 2023-08-30  1:52 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Jeff Layton, linux-nfs

On Sat, 19 Aug 2023, Chuck Lever wrote:
> On Fri, Aug 18, 2023 at 11:45:06AM +1000, NeilBrown wrote:
> > With an llist we don't need to take a lock to add a thread to the list,
> > though we still need a lock to remove it.  That will go in the next
> > patch.
> > 
> > Unlike double-linked lists, a thread cannot reliably remove itself from
> > the list.  Only the first thread can be removed, and that can change
> > asynchronously.  So some care is needed.
> > 
> > We already check if there is pending work to do, so we are unlikely to
> > add ourselves to the idle list and then want to remove ourselves again.
> > 
> > If we DO find something needs to be done after adding ourselves to the
> > list, we simply wake up the first thread on the list.  If that was us,
> > we successfully removed ourselves and can continue.  If it was some
> > other thread, they will do the work that needs to be done.  We can
> > safely sleep until woken.
> > 
> > We also remove the test on freezing() from rqst_should_sleep().  Instead
> > we set TASK_FREEZABLE before scheduling.  This makes it safe to
> > schedule() when a freeze is pending.  As we now loop waiting to be
> > removed from the idle queue, this is a cleaner way to handle freezing.
> > 
> > task_state_index() incorrectly identifies a task with
> >    TASK_IDLE | TASK_FREEZABLE
> > as 'D'.  So fix __task_state_index to ignore extra flags on TASK_IDLE
> > just as it ignores extra flags on other states.
> 
> Let's split the task_state_index() change into a separate patch
> that can be sent to the scheduler maintainers for their Review/Ack.
> 

I've sent that to the appropriate people.  It doesn't really matter if it
lands before or after we change sunrpc to use TASK_FREEZABLE.  The only
problem is that nfsd threads look like "D" in a ps listing rather than
"I".  The actual behaviour of the threads isn't changed.

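To spell out why: TASK_IDLE is TASK_UNINTERRUPTIBLE | TASK_NOLOAD, so the
exact-match test in __task_state_index() no longer fires once the extra
TASK_FREEZABLE bit is set.  A rough walk-through (illustrative only, not
a literal copy of sched.h):

	unsigned int tsk_state = TASK_IDLE | TASK_FREEZABLE;
	unsigned int state = tsk_state & TASK_REPORT;	/* TASK_UNINTERRUPTIBLE */

	if (tsk_state == TASK_IDLE)			/* false: extra bit set */
		state = TASK_REPORT_IDLE;
	/* ... so fls(state) maps to "D" */

	if ((tsk_state & TASK_IDLE) == TASK_IDLE)	/* true with the fix */
		state = TASK_REPORT_IDLE;
	/* ... and fls(state) maps to "I" */
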
> 
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> >  include/linux/sched.h      |  4 +--
> >  include/linux/sunrpc/svc.h | 14 ++++++-----
> >  net/sunrpc/svc.c           | 13 +++++-----
> >  net/sunrpc/svc_xprt.c      | 51 +++++++++++++++++++-------------------
> >  4 files changed, 42 insertions(+), 40 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index 609bde814cb0..a5f3badcb629 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1666,7 +1666,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> >  
> >  	BUILD_BUG_ON_NOT_POWER_OF_2(TASK_REPORT_MAX);
> >  
> > -	if (tsk_state == TASK_IDLE)
> > +	if ((tsk_state & TASK_IDLE) == TASK_IDLE)
> >  		state = TASK_REPORT_IDLE;
> >  
> >  	/*
> > @@ -1674,7 +1674,7 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
> >  	 * to userspace, we can make this appear as if the task has gone through
> >  	 * a regular rt_mutex_lock() call.
> >  	 */
> > -	if (tsk_state == TASK_RTLOCK_WAIT)
> > +	if (tsk_state & TASK_RTLOCK_WAIT)
> >  		state = TASK_UNINTERRUPTIBLE;
> >  
> >  	return fls(state);
> > diff --git a/include/linux/sunrpc/svc.h b/include/linux/sunrpc/svc.h
> > index 22b3018ebf62..5216f95411e3 100644
> > --- a/include/linux/sunrpc/svc.h
> > +++ b/include/linux/sunrpc/svc.h
> > @@ -37,7 +37,7 @@ struct svc_pool {
> >  	struct list_head	sp_sockets;	/* pending sockets */
> >  	unsigned int		sp_nrthreads;	/* # of threads in pool */
> >  	struct list_head	sp_all_threads;	/* all server threads */
> > -	struct list_head	sp_idle_threads; /* idle server threads */
> > +	struct llist_head	sp_idle_threads; /* idle server threads */
> >  
> >  	/* statistics on pool operation */
> >  	struct percpu_counter	sp_messages_arrived;
> > @@ -186,7 +186,7 @@ extern u32 svc_max_payload(const struct svc_rqst *rqstp);
> >   */
> >  struct svc_rqst {
> >  	struct list_head	rq_all;		/* all threads list */
> > -	struct list_head	rq_idle;	/* On the idle list */
> > +	struct llist_node	rq_idle;	/* On the idle list */
> >  	struct rcu_head		rq_rcu_head;	/* for RCU deferred kfree */
> >  	struct svc_xprt *	rq_xprt;	/* transport ptr */
> >  
> > @@ -270,22 +270,24 @@ enum {
> >   * svc_thread_set_busy - mark a thread as busy
> >   * @rqstp: the thread which is now busy
> >   *
> > - * If rq_idle is "empty", the thread must be busy.
> > + * By convention a thread is busy if rq_idle.next points to rq_idle.
> > + * This ensures it is not on the idle list.
> >   */
> >  static inline void svc_thread_set_busy(struct svc_rqst *rqstp)
> >  {
> > -	INIT_LIST_HEAD(&rqstp->rq_idle);
> > +	rqstp->rq_idle.next = &rqstp->rq_idle;
> >  }
> 
> I don't understand the comment "This ensures it is not on the idle
> list." svc_thread_set_busy() is called in two places: once when
> an svc_rqst is created, and once directly after an
> llist_del_first() has been done. @rqstp is already not on the
> idle list in either case.

"ensures" is the wrong word.  Maybe I meant "assures" or even "records".
I've changed it to
 * This will never be the case for threads on the idle list.

> 
> What really needs an explanation here is that there's no
> existing utility to check whether an llist_node is on a list or
> not.

I guess no-one found the need before...
Though I note that md/raid5.c has a bit flag, STRIPE_ON_RELEASE_LIST,
to check whether a stripe is on an llist.  So it could make use of
such a helper.

I would happily include a patch to add this to llist.h, except that by
the end of the series we don't need it any more...
Maybe I should do it anyway?
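
For concreteness, a minimal (untested) sketch of what such helpers might
look like, using the same "points to itself means off-list" convention
as svc_thread_set_busy():

	/* Hypothetical additions to include/linux/llist.h.  A node on an
	 * llist points to the next node or is NULL-terminated, so it can
	 * never point to itself.
	 */
	static inline void init_llist_node(struct llist_node *node)
	{
		node->next = node;
	}

	static inline bool llist_on_list(const struct llist_node *node)
	{
		return node->next != node;
	}

	/* Note that llist_del_first() leaves ->next pointing at the old
	 * second entry, so the caller must re-init the node after removing
	 * it - which is exactly what svc_thread_set_busy() does today.
	 */
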
> 
> 
> >  /**
> >   * svc_thread_busy - check if a thread is busy
> >   * @rqstp: the thread which might be busy
> >   *
> > - * If rq_idle is "empty", the thread must be busy.
> > + * By convention a thread is busy if rq_idle.next points to rq_idle.
> > + * This ensures it is not on the idle list.
> 
> This function doesn't modify the thread, so it can't ensure it
> is not on the idle list.
I changed that to

 * This will never be the case for threads on the idle list.

too.

> 
> 
> >   */
> >  static inline bool svc_thread_busy(struct svc_rqst *rqstp)
> 
> const struct svc_rqst *rqstp

ack.

> 
> >  {
> > -	return list_empty(&rqstp->rq_idle);
> > +	return rqstp->rq_idle.next == &rqstp->rq_idle;
> >  }
> >  
> >  #define SVC_NET(rqst) (rqst->rq_xprt ? rqst->rq_xprt->xpt_net : rqst->rq_bc_net)
> > diff --git a/net/sunrpc/svc.c b/net/sunrpc/svc.c
> > index 051f08c48418..addbf28ea50a 100644
> > --- a/net/sunrpc/svc.c
> > +++ b/net/sunrpc/svc.c
> > @@ -510,7 +510,7 @@ __svc_create(struct svc_program *prog, unsigned int bufsize, int npools,
> >  		pool->sp_id = i;
> >  		INIT_LIST_HEAD(&pool->sp_sockets);
> >  		INIT_LIST_HEAD(&pool->sp_all_threads);
> > -		INIT_LIST_HEAD(&pool->sp_idle_threads);
> > +		init_llist_head(&pool->sp_idle_threads);
> >  		spin_lock_init(&pool->sp_lock);
> >  
> >  		percpu_counter_init(&pool->sp_messages_arrived, 0, GFP_KERNEL);
> > @@ -701,15 +701,16 @@ svc_prepare_thread(struct svc_serv *serv, struct svc_pool *pool, int node)
> >  void svc_pool_wake_idle_thread(struct svc_pool *pool)
> >  {
> >  	struct svc_rqst	*rqstp;
> > +	struct llist_node *ln;
> >  
> >  	rcu_read_lock();
> >  	spin_lock_bh(&pool->sp_lock);
> > -	rqstp = list_first_entry_or_null(&pool->sp_idle_threads,
> > -					 struct svc_rqst, rq_idle);
> > -	if (rqstp)
> > -		list_del_init(&rqstp->rq_idle);
> > +	ln = llist_del_first(&pool->sp_idle_threads);
> >  	spin_unlock_bh(&pool->sp_lock);
> > -	if (rqstp) {
> > +	if (ln) {
> > +		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
> > +		svc_thread_set_busy(rqstp);
> > +
> >  		WRITE_ONCE(rqstp->rq_qtime, ktime_get());
> >  		wake_up_process(rqstp->rq_task);
> >  		rcu_read_unlock();
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index fa0d854a5596..81327001e074 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -715,10 +715,6 @@ rqst_should_sleep(struct svc_rqst *rqstp)
> >  	if (svc_thread_should_stop(rqstp))
> >  		return false;
> >  
> > -	/* are we freezing? */
> > -	if (freezing(current))
> > -		return false;
> > -
> >  #if defined(CONFIG_SUNRPC_BACKCHANNEL)
> >  	if (svc_is_backchannel(rqstp)) {
> >  		if (!list_empty(&rqstp->rq_server->sv_cb_list))
> > @@ -734,30 +730,32 @@ static void svc_rqst_wait_for_work(struct svc_rqst *rqstp)
> >  	struct svc_pool *pool = rqstp->rq_pool;
> >  
> >  	if (rqst_should_sleep(rqstp)) {
> > -		set_current_state(TASK_IDLE);
> > -		spin_lock_bh(&pool->sp_lock);
> > -		list_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > -		spin_unlock_bh(&pool->sp_lock);
> > +		set_current_state(TASK_IDLE | TASK_FREEZABLE);
> 
> I'm trying to learn about the semantics of IDLE|FREEZABLE, but there
> isn't another instance of this flag combination in the kernel.

No - which is why no one else noticed that the state reported by ps was
wrong.
Quite a few places use INTERRUPTIBLE|FREEZABLE, which is much the same
thing except that sunrpc threads don't want to be interruptible.

My interpretation of FREEZABLE is that it avoids the need to check for
freezing, or to freeze.  A higher-level interface.
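
In other words the generic pattern I have in mind is roughly this
(illustrative only - work_available() is a stand-in for whatever
condition the waiter cares about, e.g. rqst_should_sleep()):

	/* Without TASK_FREEZABLE the waiter has to poll the freezer: */
	set_current_state(TASK_IDLE);
	while (!work_available()) {
		if (freezing(current))
			break;
		schedule();
		set_current_state(TASK_IDLE);
	}
	__set_current_state(TASK_RUNNING);
	try_to_freeze();

	/* With TASK_FREEZABLE the freezer can handle the sleeping task
	 * itself, so the loop only has to care about real work:
	 */
	set_current_state(TASK_IDLE | TASK_FREEZABLE);
	while (!work_available()) {
		schedule();
		set_current_state(TASK_IDLE | TASK_FREEZABLE);
	}
	__set_current_state(TASK_RUNNING);

The loop in this patch is a little different because the thread also has
to wait until it has been removed from the idle llist, but the freezer
handling is the same.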

> 
> 
> > +		llist_add(&rqstp->rq_idle, &pool->sp_idle_threads);
> > +
> > +		if (unlikely(!rqst_should_sleep(rqstp)))
> > +			/* maybe there were no idle threads when some work
> > +			 * became ready and so nothing was woken.  We've just
> > +			 * become idle so someone can do the work - maybe us.
> > +			 * But we cannot reliably remove ourselves from the
> > +			 * idle list - we can only remove the first task which
> > +			 * might be us, and might not.
> > +			 * So remove and wake it, then schedule().  If it was
> > +			 * us, we won't sleep.  If it is some other thread, they
> > +			 * will do the work.
> > +			 */
> 
> Large block comments suggest that this code doesn't document itself
> very well.

Self-documenting code is best, but when the logic is subtle and
unfamiliar, a bit of text can help.
Of course the logic is quite familiar to me now and doesn't seem so
subtle, so I wonder if any comment is needed.
> 
> At least, some of the textual redundancy can be removed, though this
> comment and the next are removed or replaced in later patches. I'll
> leave it up to you to find an approach that is a bit cleaner.

Is now:
	/* Work just became available.  This thread cannot simply
	 * choose not to sleep as it *must* wait until removed.
	 * So wake the first waiter - whether it is this
	 * thread or some other, it will get the work done.
	 */


> 
> 
> > +			svc_pool_wake_idle_thread(pool);
> >  
> > -		/* Need to check should_sleep() again after
> > -		 * setting task state in case a wakeup happened
> > -		 * between testing and setting.
> > +		/* We mustn't continue while on the idle list, and we
> > +		 * cannot remove ourselves reliably.  The only "work"
> > +		 * we can do while on the idle list is to freeze.
> > +		 * So loop until someone removes us
> >  		 */
> 
> Here and above, the use of "we" obscures the explanation. What is
> "we" in this context? I think it might be "this thread" or @rqstp,
> but I can't be certain. For instance:
> 
> 		/* Since a thread cannot remove itself from an llist,
> 		 * schedule until someone else removes @rqstp from
> 		 * the idle list.
> 		 */

Looks good - thanks.

> 
> 
> > -		if (rqst_should_sleep(rqstp)) {
> > +		while (!svc_thread_busy(rqstp)) {
> 
> I need to convince myself that this while() can't possibly result in
> an infinite loop.
> 

It cannot be a busy-loop because the task state is set before each call
to schedule().
It could loop for a long time if there is no work to do.
But at service shutdown, every thread will be removed from the idle list
and woken, so the loop will exit then.
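
To make the invariant concrete, the two sides look roughly like this
(condensed from this patch, not a literal quote):

	/* Waker (svc_pool_wake_idle_thread): the thread is removed from
	 * the idle llist and marked busy *before* it is woken, so a
	 * wakeup implies svc_thread_busy() will now return true.
	 */
	spin_lock_bh(&pool->sp_lock);
	ln = llist_del_first(&pool->sp_idle_threads);
	spin_unlock_bh(&pool->sp_lock);
	if (ln) {
		rqstp = llist_entry(ln, struct svc_rqst, rq_idle);
		svc_thread_set_busy(rqstp);
		wake_up_process(rqstp->rq_task);
	}

	/* Sleeper: each pass calls schedule() with the task state already
	 * set, so it cannot spin, and it exits as soon as some waker
	 * (including service shutdown) has done the above for this thread.
	 */
	while (!svc_thread_busy(rqstp)) {
		schedule();
		set_current_state(TASK_IDLE | TASK_FREEZABLE);
	}
	__set_current_state(TASK_RUNNING);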

> 
> >  			schedule();
> > -		} else {
> > -			__set_current_state(TASK_RUNNING);
> > -			cond_resched();
> > -		}
> > -
> > -		/* We *must* be removed from the list before we can continue.
> > -		 * If we were woken, this is already done
> > -		 */
> > -		if (!svc_thread_busy(rqstp)) {
> > -			spin_lock_bh(&pool->sp_lock);
> > -			list_del_init(&rqstp->rq_idle);
> > -			spin_unlock_bh(&pool->sp_lock);
> > +			set_current_state(TASK_IDLE | TASK_FREEZABLE);
> >  		}
> > +		__set_current_state(TASK_RUNNING);
> > 
> >  	} else {
> >  		cond_resched();
> >  	}
> 
> There's a try_to_freeze() call just after this hunk. Is there a
> reason to invoke cond_resched(); before freezing?

try_to_freeze() can compile to {return false;}.  Even when not, it
usually does no more than test a global variable.  So the presence of
try_to_freeze() has no bearing on the use of cond_resched().
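
For reference, the !CONFIG_FREEZER stub is just (paraphrased from
include/linux/freezer.h, so treat as illustrative):

	static inline bool try_to_freeze(void)
	{
		return false;
	}

and with CONFIG_FREEZER the fast path only tests freezing(current)
before doing any real work.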

> 
> 
> > @@ -870,9 +868,10 @@ void svc_recv(struct svc_rqst *rqstp)
> >  		struct svc_xprt *xprt = rqstp->rq_xprt;
> >  
> >  		/* Normally we will wait up to 5 seconds for any required
> > -		 * cache information to be provided.
> > +		 * cache information to be provided.  When there are no
> > +		 * idle threads, we reduce the wait time.
> >  		 */
> > -		if (!list_empty(&pool->sp_idle_threads))
> > +		if (pool->sp_idle_threads.first)
> >  			rqstp->rq_chandle.thread_wait = 5 * HZ;
> >  		else
> >  			rqstp->rq_chandle.thread_wait = 1 * HZ;
> > -- 
> > 2.40.1
> > 

I hope to send a new batch today.  I've decided to move lwq into lib/
and to make all the changes to llist in separate patches.

Thanks,
NeilBrown

^ permalink raw reply	[flat|nested] 10+ messages in thread

Thread overview: 10+ messages
2023-08-18  1:45 [PATCH 0/7] SUNRPC - queuing improvements NeilBrown
2023-08-18  1:45 ` [PATCH 1/7] SUNRPC: change service idle list to be an llist NeilBrown
2023-08-18 15:53   ` Chuck Lever
2023-08-30  1:52     ` NeilBrown
2023-08-18  1:45 ` [PATCH 2/7] SUNRPC: only have one thread waking up at a time NeilBrown
2023-08-18  1:45 ` [PATCH 3/7] SUNRPC/svc: add light-weight queuing mechanism NeilBrown
2023-08-18  1:45 ` [PATCH 4/7] SUNRPC: use lwq for sp_sockets - renamed to sp_xprts NeilBrown
2023-08-18  1:45 ` [PATCH 5/7] SUNRPC: change sp_nrthreads to atomic_t NeilBrown
2023-08-18  1:45 ` [PATCH 6/7] SUNRPC: discard sp_lock NeilBrown
2023-08-18  1:45 ` [PATCH 7/7] SUNRPC: change the back-channel queue to lwq NeilBrown
