linux-kernel.vger.kernel.org archive mirror
* [PATCH v7 00/11] rcu: call_rcu() power improvements
@ 2022-10-04  2:41 Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain() Joel Fernandes (Google)
                   ` (11 more replies)
  0 siblings, 12 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

This is v7 of the RCU lazy patches, based on the rcu/next branch.

To make merging easier, I dropped the tracing and other patches and implemented
only the new changes. I will post the tracing patches later, along with rcutop,
since I need to add the new tracepoints that Frederic suggested.

Main recent changes:
1. rcu_barrier() now wakes up the nocb GP thread only for the lazy bypass list.
2. Make call_rcu() lazy by default and add the call_rcu_flush() API.
3. Take care of some callers by converting them to the call_rcu_flush() API.
4. Several refactorings suggested by Paul/Frederic.
5. New call_rcu() to call_rcu_flush() conversions by Joel/Vlad/Paul.

I am seeing good performance and power results with these patches on real
asymmetric ChromeOS x86 hardware.
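
For readers who did not follow the earlier versions: the caller-visible change
is that call_rcu() keeps its signature but now batches callbacks lazily by
default, while latency-sensitive call sites are converted to the new
call_rcu_flush(). Below is a minimal sketch of the two choices; it is not part
of any patch, and struct foo, foo_free_rcu() and foo_release() are made-up
names used only for illustration.

/* Assumes <linux/rcupdate.h> and <linux/slab.h>. */
struct foo {
	struct rcu_head rh;
	/* ... payload ... */
};

static void foo_free_rcu(struct rcu_head *rh)
{
	kfree(container_of(rh, struct foo, rh));
}

static void foo_release(struct foo *fp, bool latency_sensitive)
{
	if (latency_sensitive) {
		/* Old behavior: flush lazy CBs and get a grace period going soon. */
		call_rcu_flush(&fp->rh, foo_free_rcu);
	} else {
		/* New default: lazy batching, friendlier to idle or lightly loaded CPUs. */
		call_rcu(&fp->rh, foo_free_rcu);
	}
}

The conversions in the later patches of this series are exactly this kind of
switch for real callers that need the old timing.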

Earlier cover letter with lots of details is here:
https://lore.kernel.org/all/20220901221720.1105021-1-joel@joelfernandes.org/

List of recent changes:
    
    [ Frederic Weisbecker: Program the lazy timer only in the WAKE_NOT case,
      since the other deferral levels wake much earlier, so for those it is
      not needed. ]

    [ Frederic Weisbecker: Use flush flags to keep the bypass API code clean. ]

    [ Frederic Weisbecker: Make rcu_barrier() wake up only if the main list is
      empty. ]

    [ Frederic Weisbecker: Remove the extra 'else if' branch in
      rcu_nocb_try_bypass(). ]

    [ Joel: Fix an issue where lazy_len was not being reset after moving it to
      rdp. ]

    [ Paul/Thomas/Joel: Make call_rcu() lazy by default so users don't mess up. ]

    [ Paul/Frederic: Cosmetic changes; split out the wakeup of the nocb thread. ]

    [ Vlad/Joel: More call_rcu() -> call_rcu_flush() conversions. ]

Frederic Weisbecker (1):
rcu: Wake up nocb gp thread on rcu_barrier_entrain()

Joel Fernandes (Google) (7):
rcu: Make call_rcu() lazy to save power
rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
rcuscale: Add laziness and kfree tests
percpu-refcount: Use call_rcu_flush() for atomic switch
rcu/sync: Use call_rcu_flush() instead of call_rcu
rcu/rcuscale: Use call_rcu_flush() for async reader test
rcu/rcutorture: Use call_rcu_flush() where needed

Uladzislau Rezki (2):
scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
workqueue: Make queue_rcu_work() use call_rcu_flush()

Vineeth Pillai (1):
rcu: shrinker for lazy rcu

drivers/scsi/scsi_error.c |   2 +-
include/linux/rcupdate.h  |   7 ++
kernel/rcu/Kconfig        |   8 ++
kernel/rcu/rcu.h          |   8 ++
kernel/rcu/rcuscale.c     |  67 +++++++++++-
kernel/rcu/rcutorture.c   |  16 +--
kernel/rcu/sync.c         |   2 +-
kernel/rcu/tiny.c         |   2 +-
kernel/rcu/tree.c         | 140 +++++++++++++++++--------
kernel/rcu/tree.h         |  12 ++-
kernel/rcu/tree_exp.h     |   2 +-
kernel/rcu/tree_nocb.h    | 213 ++++++++++++++++++++++++++++++++------
kernel/workqueue.c        |   2 +-
lib/percpu-refcount.c     |   3 +-
14 files changed, 388 insertions(+), 96 deletions(-)

--
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04 22:28   ` Frederic Weisbecker
  2022-10-04  2:41 ` [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes

From: Frederic Weisbecker <frederic@kernel.org>

In preparation for the RCU lazy changes, wake up the RCU nocb GP thread if
needed after an entrain. Otherwise, the RCU barrier callback can wait in the
queue for several seconds before the lazy callbacks in front of it are
serviced.
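
(For context, with the lazy flush timeout introduced later in this series
defaulting to 10*HZ, "several seconds" here can be up to roughly 10 seconds:
without this wakeup, a barrier callback entrained behind lazy callbacks would
not be noticed until the lazy timer fires.)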

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c      | 11 +++++++++++
 kernel/rcu/tree_nocb.h |  4 ++++
 2 files changed, 15 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..04f33191e5ed 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3894,6 +3894,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 {
 	unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
 	unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
+	bool wake_nocb = false;
+	bool was_done = false;
 
 	lockdep_assert_held(&rcu_state.barrier_lock);
 	if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
@@ -3902,6 +3904,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
+	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
 	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
@@ -3909,7 +3912,15 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 		debug_rcu_head_unqueue(&rdp->barrier_head);
 		rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
 	}
+
+	/*
+	 * If bypass list was non-empty, wake up the nocb GP thread otherwise
+	 * bypass/lazy CBs may not be noticed, and can cause real long delays!
+	 */
+	wake_nocb = was_done && rcu_segcblist_pend_cbs(&rdp->cblist);
 	rcu_nocb_unlock(rdp);
+	if (wake_nocb)
+		wake_nocb_gp(rdp, false);
 	smp_store_release(&rdp->barrier_seq_snap, gseq);
 }
 
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index f77a6d7e1356..6caade0683dd 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1558,6 +1558,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
+{
+}
+
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j)
 {
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain() Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04 11:41   ` Uladzislau Rezki
                     ` (2 more replies)
  2022-10-04  2:41 ` [PATCH v7 03/11] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  11 siblings, 3 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

Implement timer-based RCU callback batching (also known as lazy callbacks).
With this, we save about 5-10% of the power consumed due to RCU requests
that happen when the system is lightly loaded or idle.

By default, all asynchronous callbacks (queued via call_rcu()) are marked
lazy. An alternative API, call_rcu_flush(), is provided for the few users
that need the old behavior, for example synchronize_rcu().

The batch is flushed whenever a certain amount of time has passed or the
batch on a particular CPU grows too big. A future patch will also flush the
batch under memory pressure.

To handle several corner cases automagically (such as rcu_barrier() and CPU
hotplug), we reuse the bypass lists, which were originally introduced to
address lock contention, to hold lazy CBs as well. The bypass list length
includes the lazy CB count. A separate lazy CB length counter is also
introduced to keep track of the number of lazy CBs.
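
As an illustration of which callers need call_rcu_flush() (this sketch is not
part of the patch, and the names below are hypothetical): anything that blocks
until its callback has run would otherwise see its wait stretched by up to the
lazy flush timeout, which this patch sets to 10*HZ (about 10 seconds) by
default. This is why synchronize_rcu(), via wait_rcu_gp(), is switched to the
flush variant.

/* Assumes <linux/rcupdate.h> and <linux/completion.h>. */
struct gp_waiter {
	struct rcu_head rh;
	struct completion done;
};

static void gp_waiter_cb(struct rcu_head *rh)
{
	complete(&container_of(rh, struct gp_waiter, rh)->done);
}

/* Block until one grace period has elapsed. */
static void wait_one_gp(void)
{
	struct gp_waiter w;

	init_completion(&w.done);
	call_rcu_flush(&w.rh, gp_waiter_cb);	/* a lazy call_rcu() could add ~10s here */
	wait_for_completion(&w.done);
}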

Suggested-by: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h |   7 ++
 kernel/rcu/Kconfig       |   8 ++
 kernel/rcu/rcu.h         |   8 ++
 kernel/rcu/tiny.c        |   2 +-
 kernel/rcu/tree.c        | 129 +++++++++++++++++++-----------
 kernel/rcu/tree.h        |  12 ++-
 kernel/rcu/tree_exp.h    |   2 +-
 kernel/rcu/tree_nocb.h   | 164 +++++++++++++++++++++++++++++++--------
 8 files changed, 246 insertions(+), 86 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d..40ae36904825 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func);
+#else
+static inline void call_rcu_flush(struct rcu_head *head,
+		rcu_callback_t func) {  call_rcu(head, func); }
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index f53ad63b2bc6..edd632e68497 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index be5979da07f5..65704cbc9df7 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -474,6 +474,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a33a8d4942c3..810479cf17ba 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
 
 void rcu_barrier(void)
 {
-	wait_rcu_gp(call_rcu);
+	wait_rcu_gp(call_rcu_flush);
 }
 EXPORT_SYMBOL(rcu_barrier);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 04f33191e5ed..c20544c4aa29 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
 	raw_spin_unlock_rcu_node(rnp);
 }
 
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed.  However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested.  In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section.  On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu().  It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section.  Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
 	static atomic_t doublefrees;
 	unsigned long flags;
@@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
 	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
 
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_flush() - Queue RCU callback for invocation after grace period, and
+ * flush all lazy callbacks (including the new one) to the main ->cblist while
+ * doing so.
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't want the callback to be
+ * invoked after very long periods of time, which can happen on systems without
+ * memory pressure and on systems which are lightly loaded or mostly idle.
+ * This function will cause callbacks to be invoked sooner than later at the
+ * expense of extra power. Other than that, this function is identical to, and
+ * reuses call_rcu()'s logic. Refer to call_rcu() for more details about memory
+ * ordering and other functionality.
+ */
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, false);
+}
+EXPORT_SYMBOL_GPL(call_rcu_flush);
+#endif
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * By default the callbacks are 'lazy' and are kept hidden from the main
+ * ->cblist to prevent starting of grace periods too soon.
+ * If you desire grace periods to start very soon, use call_rcu_flush().
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.  However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (5 * HZ)
@@ -3507,7 +3544,7 @@ void synchronize_rcu(void)
 		if (rcu_gp_is_expedited())
 			synchronize_rcu_expedited();
 		else
-			wait_rcu_gp(call_rcu);
+			wait_rcu_gp(call_rcu_flush);
 		return;
 	}
 
@@ -3905,7 +3942,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
 	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
 	} else {
@@ -4334,7 +4371,7 @@ void rcutree_migrate_callbacks(int cpu)
 	my_rdp = this_cpu_ptr(&rcu_data);
 	my_rnp = my_rdp->mynode;
 	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false));
 	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
 	/* Leverage recent GPs and set GP for new callbacks. */
 	needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..1d803d39f0d1 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -263,14 +263,16 @@ struct rcu_data {
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
 
+	long lazy_len;			/* Length of buffered lazy callbacks. */
 	int cpu;
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
 					/* For jiffies_till_first_fqs and */
@@ -440,9 +442,11 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, bool lazy);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 18e9b4cd78ef..5cac05600798 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void)
 
 	/* If expedited grace periods are prohibited, fall back to normal. */
 	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu);
+		wait_rcu_gp(call_rcu_flush);
 		return;
 	}
 
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 6caade0683dd..56616bfbe156 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
 
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * could be flushed much earlier for a number of other reasons
+ * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
+ * left unsubmitted to RCU after those many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
-	 * Bypass wakeup overrides previous deferments. In case
-	 * of callback storm, no need to wake up too early.
+	 * Bypass wakeup overrides previous deferments. In case of
+	 * callback storm, no need to wake up too early.
 	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
+	if (waketype == RCU_NOCB_WAKE_LAZY &&
+	    READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
 		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 	} else {
@@ -293,10 +322,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * proves to be initially empty, just return false because the no-CB GP
  * kthread may need to be awakened in this case.
  *
+ * Return true if there was something to be flushed and it succeeded, otherwise
+ * false.
+ *
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, bool lazy)
 {
 	struct rcu_cblist rcl;
 
@@ -310,7 +342,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/*
+	 * If the new CB requested was a lazy one, queue it onto the main
+	 * ->cblist so we can take advantage of a sooner grade period.
+	 */
+	if (lazy && rhp) {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
+		rcu_cblist_enqueue(&rcl, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	} else {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	}
+
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
 	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +371,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy)
 {
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+	return rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
 }
 
 /*
@@ -345,7 +390,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
 }
 
 /*
@@ -367,12 +412,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	unsigned long c;
 	unsigned long cur_gp_seq;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
 
 	lockdep_assert_irqs_disabled();
 
@@ -417,25 +464,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// If there hasn't yet been all that many ->cblist enqueues
 	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
 	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// Lazy CBs throttle this back and do immediate bypass queuing.
+	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 		if (*was_alldone)
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false));
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
 
 	// If ->nocb_bypass has been used too long or is too full,
 	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	    (ncbs &&  bypass_is_lazy &&
+		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy)) {
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 						    TPS("FirstQ"));
@@ -460,16 +511,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// We need to use the bypass.
 	rcu_nocb_wait_contended(rdp);
 	rcu_nocb_bypass_lock(rdp);
+
 	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
 	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+
+	if (lazy)
+		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
+
 	if (!ncbs) {
 		WRITE_ONCE(rdp->nocb_bypass_first, j);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
 	}
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
-	if (ncbs) {
+
+	// A wake up of the grace period kthread or timer adjustment needs to
+	// be done only if:
+	// 1. Bypass list was fully empty before (this is the first bypass list entry).
+	//	Or, both the below conditions are met:
+	// 1. Bypass list had only lazy CBs before.
+	// 2. The new CB is non-lazy.
+	if (ncbs && (!bypass_is_lazy || lazy)) {
 		local_irq_restore(flags);
 	} else {
 		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
@@ -497,9 +561,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 				 unsigned long flags)
 				 __releases(rdp->nocb_lock)
 {
+	long bypass_len;
 	unsigned long cur_gp_seq;
 	unsigned long j;
-	long len;
+	long len, lazy_len;
 	struct task_struct *t;
 
 	// If we are being polled or there is no kthread, just leave.
@@ -512,9 +577,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 	}
 	// Need to actually to a wakeup.
 	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = READ_ONCE(rdp->lazy_len);
 	if (was_alldone) {
 		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
 			/* ... if queue was empty ... */
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp(rdp, false);
@@ -611,6 +683,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	unsigned long flags;
 	bool gotcbs = false;
 	unsigned long j = jiffies;
+	bool lazy = false;
+	long lazy_ncbs;
 	bool needwait_gp = false; // This prevents actual uninitialized use.
 	bool needwake;
 	bool needwake_gp;
@@ -640,24 +714,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 * won't be ignored for long.
 	 */
 	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
+		bool flush_bypass = false;
+
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
 		rcu_nocb_lock_irqsave(rdp, flags);
 		lockdep_assert_held(&rdp->nocb_lock);
 		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = READ_ONCE(rdp->lazy_len);
+
+		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
 		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
 		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
 		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			continue; /* No callbacks here, try next. */
 		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = READ_ONCE(rdp->lazy_len);
+		}
+
 		if (bypass_ncbs) {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
 		}
 		rnp = rdp->mynode;
 
@@ -705,12 +796,20 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If bypass list only has lazy CBs. Add a deferred lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+					TPS("WakeBypassIsDeferred"));
+		}
 	}
+
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
 		if (gotcbs)
@@ -1036,7 +1135,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
 	 */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
 	/*
 	 * Start with invoking rcu_core() early. This way if the current thread
 	 * happens to preempt an ongoing call to rcu_core() in the middle,
@@ -1278,6 +1377,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 	raw_spin_lock_init(&rdp->nocb_gp_lock);
 	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
 	rcu_cblist_init(&rdp->nocb_bypass);
+	WRITE_ONCE(rdp->lazy_len, 0);
 	mutex_init(&rdp->nocb_gp_kthread_mutex);
 }
 
@@ -1558,18 +1658,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
-static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
-{
-}
-
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy)
 {
 	return true;
 }
 
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags, bool lazy)
 {
 	return false;
 }
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 03/11] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain() Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-06 19:12   ` Paul E. McKenney
  2022-10-04  2:41 ` [PATCH v7 04/11] rcu: shrinker for lazy rcu Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

This consolidates the code a bit and makes it cleaner. Functionally it
is the same.
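
Concretely (as seen in the hunk below), the lazy callback is now first
appended to the bypass list and the local head pointer set to NULL, so the
single rcu_cblist_flush_enqueue() call covers both the lazy and non-lazy
cases, and the lazy CB stays ordered behind the CBs already sitting in the
bypass list.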

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 56616bfbe156..dfa736da68ab 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -327,10 +327,11 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  *
  * Note that this function always returns true if rhp is NULL.
  */
-static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
+static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in,
 				     unsigned long j, bool lazy)
 {
 	struct rcu_cblist rcl;
+	struct rcu_head *rhp = rhp_in;
 
 	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
 	rcu_lockdep_assert_cblist_protected(rdp);
@@ -345,16 +346,16 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 
 	/*
 	 * If the new CB requested was a lazy one, queue it onto the main
-	 * ->cblist so we can take advantage of a sooner grade period.
+	 * ->cblist so that we can take advantage of the grace-period that will
+	 * happen regardless. But queue it onto the bypass list first so that
+	 * the lazy CB is ordered with the existing CBs in the bypass list.
 	 */
 	if (lazy && rhp) {
-		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
-		rcu_cblist_enqueue(&rcl, rhp);
-		WRITE_ONCE(rdp->lazy_len, 0);
-	} else {
-		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
-		WRITE_ONCE(rdp->lazy_len, 0);
+		rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+		rhp = NULL;
 	}
+	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+	WRITE_ONCE(rdp->lazy_len, 0);
 
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 04/11] rcu: shrinker for lazy rcu
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 03/11] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 05/11] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Vineeth Pillai,
	Joel Fernandes

From: Vineeth Pillai <vineeth@bitbyteword.org>

The shrinker is used to speed up the freeing of memory potentially held by
RCU lazy callbacks. RCU kernel module test cases show this to be effective.
The test is introduced in a later patch.
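
(The shrinker is driven by the usual reclaim path: under memory pressure,
shrink_slab() invokes the registered ->count_objects()/->scan_objects() hooks,
and the scan here zeroes the per-CPU lazy counts and wakes the nocb GP
kthreads. As an aside not stated in the patch, one way to exercise it by hand
is "echo 2 > /proc/sys/vm/drop_caches", which invokes drop_slab() and thereby
all registered shrinkers.)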

Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 52 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index dfa736da68ab..0162aa10f9e4 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1313,6 +1313,55 @@ int rcu_nocb_cpu_offload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);
 
+static unsigned long
+lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+		count +=  READ_ONCE(rdp->lazy_len);
+	}
+
+	return count ? count : SHRINK_EMPTY;
+}
+
+static unsigned long
+lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long flags;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+		int _count = READ_ONCE(rdp->lazy_len);
+
+		if (_count == 0)
+			continue;
+		rcu_nocb_lock_irqsave(rdp, flags);
+		WRITE_ONCE(rdp->lazy_len, 0);
+		rcu_nocb_unlock_irqrestore(rdp, flags);
+		wake_nocb_gp(rdp, false);
+		sc->nr_to_scan -= _count;
+		count += _count;
+		if (sc->nr_to_scan <= 0)
+			break;
+	}
+	return count ? count : SHRINK_STOP;
+}
+
+static struct shrinker lazy_rcu_shrinker = {
+	.count_objects = lazy_rcu_shrink_count,
+	.scan_objects = lazy_rcu_shrink_scan,
+	.batch = 0,
+	.seeks = DEFAULT_SEEKS,
+};
+
 void __init rcu_init_nohz(void)
 {
 	int cpu;
@@ -1343,6 +1392,9 @@ void __init rcu_init_nohz(void)
 	if (!rcu_state.nocb_is_setup)
 		return;
 
+	if (register_shrinker(&lazy_rcu_shrinker, "rcu-lazy"))
+		pr_err("Failed to register lazy_rcu shrinker!\n");
+
 	if (!cpumask_subset(rcu_nocb_mask, cpu_possible_mask)) {
 		pr_info("\tNote: kernel parameter 'rcu_nocbs=', 'nohz_full', or 'isolcpus=' contains nonexistent CPUs.\n");
 		cpumask_and(rcu_nocb_mask, cpu_possible_mask,
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 05/11] rcuscale: Add laziness and kfree tests
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 04/11] rcu: shrinker for lazy rcu Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-06 19:15   ` Paul E. McKenney
  2022-10-04  2:41 ` [PATCH v7 06/11] percpu-refcount: Use call_rcu_flush() for atomic switch Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

Add two tests to rcuscale. The first is a startup test that checks whether
we are neither too lazy nor too hard-working. The second emulates kfree_rcu()
using call_rcu() and checks the resulting memory pressure. In my testing, the
new call_rcu() does well at keeping memory pressure under control, similar
to kfree_rcu().
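
The exact invocation is not spelled out in the patch, but the new behavior can
be exercised by loading rcuscale with the kfree path and the new parameter
enabled, for example "modprobe rcuscale kfree_rcu_test=1 kfree_by_call_rcu=1",
or via the equivalent rcuscale.* boot parameters.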

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcuscale.c | 65 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 64 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 3ef02d4a8108..027b7c1e7613 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -95,6 +95,7 @@ torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
+torture_param(int, kfree_by_call_rcu, 0, "Use call_rcu() to emulate kfree_rcu()?");
 
 static char *scale_type = "rcu";
 module_param(scale_type, charp, 0444);
@@ -659,6 +660,14 @@ struct kfree_obj {
 	struct rcu_head rh;
 };
 
+/* Used if doing RCU-kfree'ing via call_rcu(). */
+static void kfree_call_rcu(struct rcu_head *rh)
+{
+	struct kfree_obj *obj = container_of(rh, struct kfree_obj, rh);
+
+	kfree(obj);
+}
+
 static int
 kfree_scale_thread(void *arg)
 {
@@ -696,6 +705,11 @@ kfree_scale_thread(void *arg)
 			if (!alloc_ptr)
 				return -ENOMEM;
 
+			if (kfree_by_call_rcu) {
+				call_rcu(&(alloc_ptr->rh), kfree_call_rcu);
+				continue;
+			}
+
 			// By default kfree_rcu_test_single and kfree_rcu_test_double are
 			// initialized to false. If both have the same value (false or true)
 			// both are randomly tested, otherwise only the one with value true
@@ -767,11 +781,58 @@ kfree_scale_shutdown(void *arg)
 	return -EINVAL;
 }
 
+// Used if doing RCU-kfree'ing via call_rcu().
+static unsigned long jiffies_at_lazy_cb;
+static struct rcu_head lazy_test1_rh;
+static int rcu_lazy_test1_cb_called;
+static void call_rcu_lazy_test1(struct rcu_head *rh)
+{
+	jiffies_at_lazy_cb = jiffies;
+	WRITE_ONCE(rcu_lazy_test1_cb_called, 1);
+}
+
 static int __init
 kfree_scale_init(void)
 {
 	long i;
 	int firsterr = 0;
+	unsigned long orig_jif, jif_start;
+
+	// Also, do a quick self-test to ensure laziness is as much as
+	// expected.
+	if (kfree_by_call_rcu && !IS_ENABLED(CONFIG_RCU_LAZY)) {
+		pr_alert("CONFIG_RCU_LAZY is disabled, falling back to kfree_rcu() "
+			 "for delayed RCU kfree'ing\n");
+		kfree_by_call_rcu = 0;
+	}
+
+	if (kfree_by_call_rcu) {
+		/* do a test to check the timeout. */
+		orig_jif = rcu_lazy_get_jiffies_till_flush();
+
+		rcu_lazy_set_jiffies_till_flush(2 * HZ);
+		rcu_barrier();
+
+		jif_start = jiffies;
+		jiffies_at_lazy_cb = 0;
+		call_rcu(&lazy_test1_rh, call_rcu_lazy_test1);
+
+		smp_cond_load_relaxed(&rcu_lazy_test1_cb_called, VAL == 1);
+
+		rcu_lazy_set_jiffies_till_flush(orig_jif);
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
+			pr_alert("ERROR: call_rcu() CBs are not being lazy as expected!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start > 3 * HZ)) {
+			pr_alert("ERROR: call_rcu() CBs are being too lazy!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+	}
 
 	kfree_nrealthreads = compute_real(kfree_nthreads);
 	/* Start up the kthreads. */
@@ -784,7 +845,9 @@ kfree_scale_init(void)
 		schedule_timeout_uninterruptible(1);
 	}
 
-	pr_alert("kfree object size=%zu\n", kfree_mult * sizeof(struct kfree_obj));
+	pr_alert("kfree object size=%zu, kfree_by_call_rcu=%d\n",
+			kfree_mult * sizeof(struct kfree_obj),
+			kfree_by_call_rcu);
 
 	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
 			       GFP_KERNEL);
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 06/11] percpu-refcount: Use call_rcu_flush() for atomic switch
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 05/11] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 07/11] rcu/sync: Use call_rcu_flush() instead of call_rcu Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

call_rcu() changes to save power will slow down the percpu refcounter's
"per-CPU to atomic switch" path. The primitive uses RCU when switching to
atomic mode. The enqueued async callback wakes up waiters on
percpu_ref_switch_waitq. Due to this, users of per-CPU refcounts, such as
blk_pre_runtime_suspend(), will slow down.

Use the call_rcu_flush() API instead, which reverts to the old behavior.
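
(Put in numbers: with the default lazy flush timeout of 10*HZ introduced
earlier in the series, the switch-to-atomic confirmation could otherwise be
deferred by up to about 10 seconds instead of roughly one grace period, which
is the slowdown this conversion avoids.)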

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 lib/percpu-refcount.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index e5c5315da274..65c58a029297 100644
--- a/lib/percpu-refcount.c
+++ b/lib/percpu-refcount.c
@@ -230,7 +230,8 @@ static void __percpu_ref_switch_to_atomic(struct percpu_ref *ref,
 		percpu_ref_noop_confirm_switch;
 
 	percpu_ref_get(ref);	/* put after confirmation */
-	call_rcu(&ref->data->rcu, percpu_ref_switch_to_atomic_rcu);
+	call_rcu_flush(&ref->data->rcu,
+		       percpu_ref_switch_to_atomic_rcu);
 }
 
 static void __percpu_ref_switch_to_percpu(struct percpu_ref *ref)
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 07/11] rcu/sync: Use call_rcu_flush() instead of call_rcu
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 06/11] percpu-refcount: Use call_rcu_flush() for atomic switch Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 08/11] rcu/rcuscale: Use call_rcu_flush() for async reader test Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

The call_rcu() changes to save power will slow down RCU sync. Use the
call_rcu_flush() API instead, which reverts to the old behavior.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/sync.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/sync.c b/kernel/rcu/sync.c
index 5cefc702158f..bdce3b5d7f71 100644
--- a/kernel/rcu/sync.c
+++ b/kernel/rcu/sync.c
@@ -44,7 +44,7 @@ static void rcu_sync_func(struct rcu_head *rhp);
 
 static void rcu_sync_call(struct rcu_sync *rsp)
 {
-	call_rcu(&rsp->cb_head, rcu_sync_func);
+	call_rcu_flush(&rsp->cb_head, rcu_sync_func);
 }
 
 /**
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 08/11] rcu/rcuscale: Use call_rcu_flush() for async reader test
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 07/11] rcu/sync: Use call_rcu_flush() instead of call_rcu Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 09/11] rcu/rcutorture: Use call_rcu_flush() where needed Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

rcuscale uses call_rcu() to queue async readers. With the recent changes to
save power, the test will have fewer async readers in flight. Use the
call_rcu_flush() API instead to revert to the old behavior.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcuscale.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 027b7c1e7613..21453a919c31 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -176,7 +176,7 @@ static struct rcu_scale_ops rcu_ops = {
 	.get_gp_seq	= rcu_get_gp_seq,
 	.gp_diff	= rcu_seq_diff,
 	.exp_completed	= rcu_exp_batches_completed,
-	.async		= call_rcu,
+	.async		= call_rcu_flush,
 	.gp_barrier	= rcu_barrier,
 	.sync		= synchronize_rcu,
 	.exp_sync	= synchronize_rcu_expedited,
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 09/11] rcu/rcutorture: Use call_rcu_flush() where needed
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 08/11] rcu/rcuscale: Use call_rcu_flush() for async reader test Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-04  2:41 ` [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu() Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes (Google)

The call_rcu() changes to save power will change the behavior of the
rcutorture tests. Use the call_rcu_flush() API instead, which reverts to the
old behavior.

Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcutorture.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 684e24f12a79..fd56202ae4f4 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -514,7 +514,7 @@ static unsigned long rcu_no_completed(void)
 
 static void rcu_torture_deferred_free(struct rcu_torture *p)
 {
-	call_rcu(&p->rtort_rcu, rcu_torture_cb);
+	call_rcu_flush(&p->rtort_rcu, rcu_torture_cb);
 }
 
 static void rcu_sync_torture_init(void)
@@ -559,7 +559,7 @@ static struct rcu_torture_ops rcu_ops = {
 	.start_gp_poll_exp_full	= start_poll_synchronize_rcu_expedited_full,
 	.poll_gp_state_exp	= poll_state_synchronize_rcu,
 	.cond_sync_exp		= cond_synchronize_rcu_expedited,
-	.call			= call_rcu,
+	.call			= call_rcu_flush,
 	.cb_barrier		= rcu_barrier,
 	.fqs			= rcu_force_quiescent_state,
 	.stats			= NULL,
@@ -863,7 +863,7 @@ static void rcu_tasks_torture_deferred_free(struct rcu_torture *p)
 
 static void synchronize_rcu_mult_test(void)
 {
-	synchronize_rcu_mult(call_rcu_tasks, call_rcu);
+	synchronize_rcu_mult(call_rcu_tasks, call_rcu_flush);
 }
 
 static struct rcu_torture_ops tasks_ops = {
@@ -3432,13 +3432,13 @@ static void rcu_test_debug_objects(void)
 	/* Try to queue the rh2 pair of callbacks for the same grace period. */
 	preempt_disable(); /* Prevent preemption from interrupting test. */
 	rcu_read_lock(); /* Make it impossible to finish a grace period. */
-	call_rcu(&rh1, rcu_torture_leak_cb); /* Start grace period. */
+	call_rcu_flush(&rh1, rcu_torture_leak_cb); /* Start grace period. */
 	local_irq_disable(); /* Make it harder to start a new grace period. */
-	call_rcu(&rh2, rcu_torture_leak_cb);
-	call_rcu(&rh2, rcu_torture_err_cb); /* Duplicate callback. */
+	call_rcu_flush(&rh2, rcu_torture_leak_cb);
+	call_rcu_flush(&rh2, rcu_torture_err_cb); /* Duplicate callback. */
 	if (rhp) {
-		call_rcu(rhp, rcu_torture_leak_cb);
-		call_rcu(rhp, rcu_torture_err_cb); /* Another duplicate callback. */
+		call_rcu_flush(rhp, rcu_torture_leak_cb);
+		call_rcu_flush(rhp, rcu_torture_err_cb); /* Another duplicate callback. */
 	}
 	local_irq_enable();
 	rcu_read_unlock();
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 09/11] rcu/rcutorture: Use call_rcu_flush() where needed Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-07  3:18   ` Joel Fernandes
  2022-10-04  2:41 ` [PATCH v7 11/11] workqueue: Make queue_rcu_work() use call_rcu_flush() Joel Fernandes (Google)
  2022-10-06 18:55 ` [PATCH v7 00/11] rcu: call_rcu() power improvements Paul E. McKenney
  11 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes

From: Uladzislau Rezki <urezki@gmail.com>

Slow boot times are seen on KVM guests running typical Linux distributions
because the SCSI layer calls call_rcu(). The recent changes to save power may
be causing this slowness. Using call_rcu_flush() fixes the issue and brings
the boot time back to what it originally was, so convert it.

Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/scsi/scsi_error.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 448748e3fba5..a56cfd612e3a 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
 	 * Ensure that all tasks observe the host state change before the
 	 * host_failed change.
 	 */
-	call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);
+	call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed);
 }
 
 /**
-- 
2.38.0.rc1.362.ged0d419d3c-goog



* [PATCH v7 11/11] workqueue: Make queue_rcu_work() use call_rcu_flush()
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu() Joel Fernandes (Google)
@ 2022-10-04  2:41 ` Joel Fernandes (Google)
  2022-10-06 18:55 ` [PATCH v7 00/11] rcu: call_rcu() power improvements Paul E. McKenney
  11 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes (Google) @ 2022-10-04  2:41 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb, Joel Fernandes

From: Uladzislau Rezki <urezki@gmail.com>

call_rcu() changes to save power will slow down RCU workqueue items queued
via queue_rcu_work(). This might not be an issue; however, we cannot assume
that workqueue users are OK with long delays. Use the call_rcu_flush() API
instead, which reverts to the old behavior.
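
To illustrate the user-visible effect (this sketch is not part of the patch
and the names are hypothetical): queue_rcu_work() internally does a call_rcu()
of a callback that later queues the work item, so with a lazy call_rcu() the
work function below might not start for up to the lazy flush timeout (about 10
seconds by default) instead of roughly one grace period. The conversion to
call_rcu_flush() keeps the old timing.

/* Assumes <linux/workqueue.h>. */
static void foo_workfn(struct work_struct *work)
{
	/* Runs in workqueue context only after a grace period has elapsed. */
}

static struct rcu_work foo_rwork;

static void foo_example(void)
{
	INIT_RCU_WORK(&foo_rwork, foo_workfn);
	/* With a lazy call_rcu(), foo_workfn() could be deferred for seconds. */
	queue_rcu_work(system_wq, &foo_rwork);
}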

Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/workqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index aeea9731ef80..fe1146d97f1a 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1771,7 +1771,7 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 
 	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
 		rwork->wq = wq;
-		call_rcu(&rwork->rcu, rcu_work_rcufn);
+		call_rcu_flush(&rwork->rcu, rcu_work_rcufn);
 		return true;
 	}
 
-- 
2.38.0.rc1.362.ged0d419d3c-goog


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04  2:41 ` [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power Joel Fernandes (Google)
@ 2022-10-04 11:41   ` Uladzislau Rezki
  2022-10-04 13:30     ` Paul E. McKenney
  2022-10-04 18:05     ` Joel Fernandes
  2022-10-05 11:44   ` Uladzislau Rezki
  2022-10-06 19:11   ` Paul E. McKenney
  2 siblings, 2 replies; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-04 11:41 UTC (permalink / raw)
  To: paulmck, Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, paulmck, rostedt, youssefesmat, surenb

>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
>  		rcu_nocb_lock_irqsave(rdp, flags);
>  		lockdep_assert_held(&rdp->nocb_lock);
>  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> -		if (bypass_ncbs &&
> +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +
> +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> +		     bypass_ncbs > 2 * qhimark)) {
Do you know why we want double "qhimark" threshold? It is not only this
place, there are several. I am asking because it is not expected by the
user.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 11:41   ` Uladzislau Rezki
@ 2022-10-04 13:30     ` Paul E. McKenney
  2022-10-04 14:53       ` Uladzislau Rezki
  2022-10-04 18:05     ` Joel Fernandes
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-04 13:30 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
> >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> >  		rcu_nocb_lock_irqsave(rdp, flags);
> >  		lockdep_assert_held(&rdp->nocb_lock);
> >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > -		if (bypass_ncbs &&
> > +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > +
> > +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > +		     bypass_ncbs > 2 * qhimark)) {
> Do you know why we want double "qhimark" threshold? It is not only this
> place, there are several. I am asking because it is not expected by the
> user.

OK, I will bite...  What does the user expect?  Or, perhaps a better
question, how is this choice causing the user problems?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 13:30     ` Paul E. McKenney
@ 2022-10-04 14:53       ` Uladzislau Rezki
  2022-10-04 15:58         ` Paul E. McKenney
  2022-10-04 16:22         ` Joel Fernandes
  0 siblings, 2 replies; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-04 14:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 06:30:04AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
> > >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > >  		rcu_nocb_lock_irqsave(rdp, flags);
> > >  		lockdep_assert_held(&rdp->nocb_lock);
> > >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > -		if (bypass_ncbs &&
> > > +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > > +
> > > +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > > +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > > +		     bypass_ncbs > 2 * qhimark)) {
> > Do you know why we want double "qhimark" threshold? It is not only this
> > place, there are several. I am asking because it is not expected by the
> > user.
> 
> OK, I will bite...  What does the user expect?  Or, perhaps a better
> question, how is this choice causing the user problems?
> 
Yesterday when i was checking the lazy-v6 on Android i noticed the following:

<snip>
...
     rcuop/4-48      [006] d..1   184.780328: rcu_batch_start: rcu_preempt CBs=15572 bl=121
     rcuop/6-62      [000] d..1   184.796939: rcu_batch_start: rcu_preempt CBs=21503 bl=167
     rcuop/6-62      [003] d..1   184.800706: rcu_batch_start: rcu_preempt CBs=24677 bl=192
     rcuop/6-62      [005] d..1   184.803773: rcu_batch_start: rcu_preempt CBs=27117 bl=211
     rcuop/6-62      [005] d..1   184.805732: rcu_batch_start: rcu_preempt CBs=22391 bl=174
     rcuop/6-62      [005] d..1   184.809083: rcu_batch_start: rcu_preempt CBs=12554 bl=98
     rcuop/6-62      [005] d..1   184.824228: rcu_batch_start: rcu_preempt CBs=16177 bl=126
     rcuop/4-48      [006] d..1   184.836193: rcu_batch_start: rcu_preempt CBs=24129 bl=188
     rcuop/4-48      [006] d..1   184.844147: rcu_batch_start: rcu_preempt CBs=25854 bl=201
     rcuop/4-48      [006] d..1   184.847257: rcu_batch_start: rcu_preempt CBs=21328 bl=166
     rcuop/4-48      [006] d..1   184.852128: rcu_batch_start: rcu_preempt CBs=21710 bl=169
...
<snip>

On my device the "qhimark" is set to:

<snip>
XQ-CT54:/sys/module/rcutree/parameters # cat qhimark
10000
XQ-CT54:/sys/module/rcutree/parameters #
<snip>

so i expect that once we pass 10 000 callbacks threshold the flush 
should occur. This parameter gives us an opportunity to control a
memory that should be reclaimed sooner or later.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 14:53       ` Uladzislau Rezki
@ 2022-10-04 15:58         ` Paul E. McKenney
  2022-10-04 16:20           ` Uladzislau Rezki
  2022-10-04 16:22         ` Joel Fernandes
  1 sibling, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-04 15:58 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 04:53:09PM +0200, Uladzislau Rezki wrote:
> On Tue, Oct 04, 2022 at 06:30:04AM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
> > > >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > > >  		rcu_nocb_lock_irqsave(rdp, flags);
> > > >  		lockdep_assert_held(&rdp->nocb_lock);
> > > >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > > -		if (bypass_ncbs &&
> > > > +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > > > +
> > > > +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > > > +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > > > +		     bypass_ncbs > 2 * qhimark)) {
> > > Do you know why we want double "qhimark" threshold? It is not only this
> > > place, there are several. I am asking because it is not expected by the
> > > user.
> > 
> > OK, I will bite...  What does the user expect?  Or, perhaps a better
> > question, how is this choice causing the user problems?
> > 
> Yesterday when i was checking the lazy-v6 on Android i noticed the following:
> 
> <snip>
> ...
>      rcuop/4-48      [006] d..1   184.780328: rcu_batch_start: rcu_preempt CBs=15572 bl=121
>      rcuop/6-62      [000] d..1   184.796939: rcu_batch_start: rcu_preempt CBs=21503 bl=167
>      rcuop/6-62      [003] d..1   184.800706: rcu_batch_start: rcu_preempt CBs=24677 bl=192
>      rcuop/6-62      [005] d..1   184.803773: rcu_batch_start: rcu_preempt CBs=27117 bl=211
>      rcuop/6-62      [005] d..1   184.805732: rcu_batch_start: rcu_preempt CBs=22391 bl=174
>      rcuop/6-62      [005] d..1   184.809083: rcu_batch_start: rcu_preempt CBs=12554 bl=98
>      rcuop/6-62      [005] d..1   184.824228: rcu_batch_start: rcu_preempt CBs=16177 bl=126
>      rcuop/4-48      [006] d..1   184.836193: rcu_batch_start: rcu_preempt CBs=24129 bl=188
>      rcuop/4-48      [006] d..1   184.844147: rcu_batch_start: rcu_preempt CBs=25854 bl=201
>      rcuop/4-48      [006] d..1   184.847257: rcu_batch_start: rcu_preempt CBs=21328 bl=166
>      rcuop/4-48      [006] d..1   184.852128: rcu_batch_start: rcu_preempt CBs=21710 bl=169
> ...
> <snip>
> 
> On my device the "qhimark" is set to:
> 
> <snip>
> XQ-CT54:/sys/module/rcutree/parameters # cat qhimark
> 10000
> XQ-CT54:/sys/module/rcutree/parameters #
> <snip>
> 
> so i expect that once we pass 10 000 callbacks threshold the flush 
> should occur. This parameter gives us an opportunity to control a
> memory that should be reclaimed sooner or later.

I did understand that you were surprised.

But what problem did this cause other than you being surprised?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 15:58         ` Paul E. McKenney
@ 2022-10-04 16:20           ` Uladzislau Rezki
  2022-10-04 18:27             ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-04 16:20 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 08:58:14AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2022 at 04:53:09PM +0200, Uladzislau Rezki wrote:
> > On Tue, Oct 04, 2022 at 06:30:04AM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
> > > > >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > > > >  		rcu_nocb_lock_irqsave(rdp, flags);
> > > > >  		lockdep_assert_held(&rdp->nocb_lock);
> > > > >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > > > -		if (bypass_ncbs &&
> > > > > +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > > > > +
> > > > > +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > > > > +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > > > > +		     bypass_ncbs > 2 * qhimark)) {
> > > > Do you know why we want double "qhimark" threshold? It is not only this
> > > > place, there are several. I am asking because it is not expected by the
> > > > user.
> > > 
> > > OK, I will bite...  What does the user expect?  Or, perhaps a better
> > > question, how is this choice causing the user problems?
> > > 
> > Yesterday when i was checking the lazy-v6 on Android i noticed the following:
> > 
> > <snip>
> > ...
> >      rcuop/4-48      [006] d..1   184.780328: rcu_batch_start: rcu_preempt CBs=15572 bl=121
> >      rcuop/6-62      [000] d..1   184.796939: rcu_batch_start: rcu_preempt CBs=21503 bl=167
> >      rcuop/6-62      [003] d..1   184.800706: rcu_batch_start: rcu_preempt CBs=24677 bl=192
> >      rcuop/6-62      [005] d..1   184.803773: rcu_batch_start: rcu_preempt CBs=27117 bl=211
> >      rcuop/6-62      [005] d..1   184.805732: rcu_batch_start: rcu_preempt CBs=22391 bl=174
> >      rcuop/6-62      [005] d..1   184.809083: rcu_batch_start: rcu_preempt CBs=12554 bl=98
> >      rcuop/6-62      [005] d..1   184.824228: rcu_batch_start: rcu_preempt CBs=16177 bl=126
> >      rcuop/4-48      [006] d..1   184.836193: rcu_batch_start: rcu_preempt CBs=24129 bl=188
> >      rcuop/4-48      [006] d..1   184.844147: rcu_batch_start: rcu_preempt CBs=25854 bl=201
> >      rcuop/4-48      [006] d..1   184.847257: rcu_batch_start: rcu_preempt CBs=21328 bl=166
> >      rcuop/4-48      [006] d..1   184.852128: rcu_batch_start: rcu_preempt CBs=21710 bl=169
> > ...
> > <snip>
> > 
> > On my device the "qhimark" is set to:
> > 
> > <snip>
> > XQ-CT54:/sys/module/rcutree/parameters # cat qhimark
> > 10000
> > XQ-CT54:/sys/module/rcutree/parameters #
> > <snip>
> > 
> > so i expect that once we pass 10 000 callbacks threshold the flush 
> > should occur. This parameter gives us an opportunity to control a
> > memory that should be reclaimed sooner or later.
> 
> I did understand that you were surprised.
> 
> But what problem did this cause other than you being surprised?
> 
It is not about surprising. It is about expectation. So if i set a
threshold to 100 i expect it that around 100 callbacks my memory will
be reclaimed. But the resolution is 2 * 100 in fact.

I am not aware about any issues with it. I just noticed such behaviour
during testing.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 14:53       ` Uladzislau Rezki
  2022-10-04 15:58         ` Paul E. McKenney
@ 2022-10-04 16:22         ` Joel Fernandes
  1 sibling, 0 replies; 45+ messages in thread
From: Joel Fernandes @ 2022-10-04 16:22 UTC (permalink / raw)
  To: Uladzislau Rezki, Paul E. McKenney
  Cc: rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb



On 10/4/2022 10:53 AM, Uladzislau Rezki wrote:
> On Tue, Oct 04, 2022 at 06:30:04AM -0700, Paul E. McKenney wrote:
>> On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
>>>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
>>>>  		rcu_nocb_lock_irqsave(rdp, flags);
>>>>  		lockdep_assert_held(&rdp->nocb_lock);
>>>>  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>>>> -		if (bypass_ncbs &&
>>>> +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
>>>> +
>>>> +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
>>>> +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
>>>> +		     bypass_ncbs > 2 * qhimark)) {
>>> Do you know why we want double "qhimark" threshold? It is not only this
>>> place, there are several. I am asking because it is not expected by the
>>> user.
>>
>> OK, I will bite...  What does the user expect?  Or, perhaps a better
>> question, how is this choice causing the user problems?
>>
> Yesterday when i was checking the lazy-v6 on Android i noticed the following:
> 
> <snip>
> ...
>      rcuop/4-48      [006] d..1   184.780328: rcu_batch_start: rcu_preempt CBs=15572 bl=121
>      rcuop/6-62      [000] d..1   184.796939: rcu_batch_start: rcu_preempt CBs=21503 bl=167
>      rcuop/6-62      [003] d..1   184.800706: rcu_batch_start: rcu_preempt CBs=24677 bl=192
>      rcuop/6-62      [005] d..1   184.803773: rcu_batch_start: rcu_preempt CBs=27117 bl=211
>      rcuop/6-62      [005] d..1   184.805732: rcu_batch_start: rcu_preempt CBs=22391 bl=174
>      rcuop/6-62      [005] d..1   184.809083: rcu_batch_start: rcu_preempt CBs=12554 bl=98
>      rcuop/6-62      [005] d..1   184.824228: rcu_batch_start: rcu_preempt CBs=16177 bl=126
>      rcuop/4-48      [006] d..1   184.836193: rcu_batch_start: rcu_preempt CBs=24129 bl=188
>      rcuop/4-48      [006] d..1   184.844147: rcu_batch_start: rcu_preempt CBs=25854 bl=201
>      rcuop/4-48      [006] d..1   184.847257: rcu_batch_start: rcu_preempt CBs=21328 bl=166
>      rcuop/4-48      [006] d..1   184.852128: rcu_batch_start: rcu_preempt CBs=21710 bl=169

This looks normal to me and working as intended: due to the first flush after
hitting the limit, future callbacks will not be considered lazy until the
existing ones are drained out of the queue. It is possible that additional
callbacks showed up before the whole queue was drained. Then add any scheduling
delays to the rcuop thread on top of that.
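
As a rough toy model of that effect (plain userspace C, nothing to do
with the real tree_nocb.h code; the tick rates and names below are made
up purely to show how the count seen at rcu_batch_start can exceed
2 * qhimark once arrivals outpace the drain):

#include <stdio.h>

#define QHIMARK            10000
#define ARRIVALS_PER_TICK   8000	/* made-up call_rcu() rate during a flood */
#define DRAIN_PER_TICK      3000	/* made-up rate at which rcuop retires CBs */

int main(void)
{
	long queued = 0;
	int tick;

	for (tick = 0; tick < 8; tick++) {
		queued += ARRIVALS_PER_TICK;	/* new callbacks keep arriving */
		if (queued > 2 * QHIMARK)	/* analogous to the forced-flush check */
			printf("tick %d: batch starts with %ld CBs (> 2*qhimark)\n",
			       tick, queued);
		queued -= DRAIN_PER_TICK;	/* the batch drains more slowly */
	}
	return 0;
}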

thanks,

 - Joel



> ...
> <snip>
> 
> On my device the "qhimark" is set to:
> 
> <snip>
> XQ-CT54:/sys/module/rcutree/parameters # cat qhimark
> 10000
> XQ-CT54:/sys/module/rcutree/parameters #
> <snip>
> 
> so i expect that once we pass 10 000 callbacks threshold the flush 
> should occur. This parameter gives us an opportunity to control a
> memory that should be reclaimed sooner or later.
> 
> --
> Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 11:41   ` Uladzislau Rezki
  2022-10-04 13:30     ` Paul E. McKenney
@ 2022-10-04 18:05     ` Joel Fernandes
  2022-10-05 11:21       ` Uladzislau Rezki
  1 sibling, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-04 18:05 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: paulmck, rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Tue, Oct 4, 2022 at 7:41 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> >               trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> >               rcu_nocb_lock_irqsave(rdp, flags);
> >               lockdep_assert_held(&rdp->nocb_lock);
> >               bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > -             if (bypass_ncbs &&
> > +             lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > +
> > +             if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > +                 (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > +                  bypass_ncbs > 2 * qhimark)) {
> Do you know why we want double "qhimark" threshold? It is not only this
> place, there are several. I am asking because it is not expected by the
> user.

I am following the qhimark conventions in existing code. However
qhimark does not mean that your callbacks cannot exceed these many or
something, it is not a hard limit on queued callbacks.

qhimark (And Paul can correct me) was introduced to reduce the number
of callbacks after which RCU will not limit execution of callbacks to
a batch of them. That has nothing to do with limiting the maximum
number of callbacks, per-se. However, its usage certainly seems to
have grown since that introduction.

Maybe you are confusing it with blimit:

#define DEFAULT_RCU_QHIMARK 10000 // If this many pending, ignore blimit.
static long qhimark = DEFAULT_RCU_QHIMARK;
#define DEFAULT_RCU_QLOMARK 100   // Once only this many pending, use blimit.
static long qlowmark = DEFAULT_RCU_QLOMARK;
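
If it helps, here is a toy model of how qhimark relates to blimit
(ordinary userspace C, not the real rcu_do_batch() logic; do_batch()
and invoke_cb() are invented names for illustration only):

#include <stdio.h>

#define QHIMARK 10000	/* if this many pending, ignore the batch limit */
#define BLIMIT     10	/* normal per-batch invocation limit */

static void invoke_cb(void)
{
	/* stand-in for invoking one queued RCU callback */
}

static long do_batch(long pending)
{
	long invoked = 0;

	while (pending > 0) {
		invoke_cb();
		pending--;
		invoked++;
		/* BLIMIT caps the batch only while we are not flooded */
		if (invoked >= BLIMIT && pending < QHIMARK)
			break;
	}
	return invoked;
}

int main(void)
{
	printf("light load: invoked %ld of 100 pending\n", do_batch(100));
	printf("flood:      invoked %ld of 20000 pending\n", do_batch(20000));
	return 0;
}

So qhimark changes how aggressively an existing backlog gets chewed
through, not how many callbacks are allowed to be queued in the first
place.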

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 16:20           ` Uladzislau Rezki
@ 2022-10-04 18:27             ` Paul E. McKenney
  2022-10-05 11:28               ` Uladzislau Rezki
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-04 18:27 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 06:20:03PM +0200, Uladzislau Rezki wrote:
> On Tue, Oct 04, 2022 at 08:58:14AM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2022 at 04:53:09PM +0200, Uladzislau Rezki wrote:
> > > On Tue, Oct 04, 2022 at 06:30:04AM -0700, Paul E. McKenney wrote:
> > > > On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
> > > > > >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > > > > >  		rcu_nocb_lock_irqsave(rdp, flags);
> > > > > >  		lockdep_assert_held(&rdp->nocb_lock);
> > > > > >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > > > > -		if (bypass_ncbs &&
> > > > > > +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > > > > > +
> > > > > > +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > > > > > +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > > > > > +		     bypass_ncbs > 2 * qhimark)) {
> > > > > Do you know why we want double "qhimark" threshold? It is not only this
> > > > > place, there are several. I am asking because it is not expected by the
> > > > > user.
> > > > 
> > > > OK, I will bite...  What does the user expect?  Or, perhaps a better
> > > > question, how is this choice causing the user problems?
> > > > 
> > > Yesterday when i was checking the lazy-v6 on Android i noticed the following:
> > > 
> > > <snip>
> > > ...
> > >      rcuop/4-48      [006] d..1   184.780328: rcu_batch_start: rcu_preempt CBs=15572 bl=121
> > >      rcuop/6-62      [000] d..1   184.796939: rcu_batch_start: rcu_preempt CBs=21503 bl=167
> > >      rcuop/6-62      [003] d..1   184.800706: rcu_batch_start: rcu_preempt CBs=24677 bl=192
> > >      rcuop/6-62      [005] d..1   184.803773: rcu_batch_start: rcu_preempt CBs=27117 bl=211
> > >      rcuop/6-62      [005] d..1   184.805732: rcu_batch_start: rcu_preempt CBs=22391 bl=174
> > >      rcuop/6-62      [005] d..1   184.809083: rcu_batch_start: rcu_preempt CBs=12554 bl=98
> > >      rcuop/6-62      [005] d..1   184.824228: rcu_batch_start: rcu_preempt CBs=16177 bl=126
> > >      rcuop/4-48      [006] d..1   184.836193: rcu_batch_start: rcu_preempt CBs=24129 bl=188
> > >      rcuop/4-48      [006] d..1   184.844147: rcu_batch_start: rcu_preempt CBs=25854 bl=201
> > >      rcuop/4-48      [006] d..1   184.847257: rcu_batch_start: rcu_preempt CBs=21328 bl=166
> > >      rcuop/4-48      [006] d..1   184.852128: rcu_batch_start: rcu_preempt CBs=21710 bl=169
> > > ...
> > > <snip>
> > > 
> > > On my device the "qhimark" is set to:
> > > 
> > > <snip>
> > > XQ-CT54:/sys/module/rcutree/parameters # cat qhimark
> > > 10000
> > > XQ-CT54:/sys/module/rcutree/parameters #
> > > <snip>
> > > 
> > > so i expect that once we pass 10 000 callbacks threshold the flush 
> > > should occur. This parameter gives us an opportunity to control a
> > > memory that should be reclaimed sooner or later.
> > 
> > I did understand that you were surprised.
> > 
> > But what problem did this cause other than you being surprised?
> > 
> It is not about surprising. It is about expectation. So if i set a
> threshold to 100 i expect it that around 100 callbacks my memory will
> be reclaimed. But the resolution is 2 * 100 in fact.
> 
> I am not aware about any issues with it. I just noticed such behaviour
> during testing.

Whew!!!

This value was arrived at when tuning this code to best deal with callback
floods.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-04  2:41 ` [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain() Joel Fernandes (Google)
@ 2022-10-04 22:28   ` Frederic Weisbecker
  2022-10-04 22:57     ` Joel Fernandes
  0 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2022-10-04 22:28 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:47AM +0000, Joel Fernandes (Google) wrote:
> From: Frederic Weisbecker <frederic@kernel.org>
> 
> In preparation of RCU lazy changes, wake up the RCU nocb gp thread if

It's more than just prep work for a new feature, it's a regression fix.

> needed after an entrain. Otherwise, the RCU barrier callback can wait in
> the queue for several seconds before the lazy callbacks in front of it
> are serviced.

It's not about lazy callbacks here (but you can mention the fact that
waking nocb_gp if necessary after flushing bypass is a beneficial side
effect for further lazy implementation).

So here is the possible bad scenario:

1) CPU 0 is nocb, it queues a callback
2) CPU 0 goes idle (or userspace with nohz_full) forever
3) The grace period related to that callback elapses
4) The callback is moved to the done list (but is not invoked yet), there are no more pending for CPU 0
5) CPU 1 calls rcu_barrier() and entrains to CPU 0 cblist
6) CPU 1 waits forever

> 
> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")

Thanks.

> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  kernel/rcu/tree.c      | 11 +++++++++++
>  kernel/rcu/tree_nocb.h |  4 ++++
>  2 files changed, 15 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5ec97e3f7468..04f33191e5ed 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3894,6 +3894,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  {
>  	unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
>  	unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
> +	bool wake_nocb = false;
> +	bool was_done = false;
>  
>  	lockdep_assert_held(&rcu_state.barrier_lock);
>  	if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
> @@ -3902,6 +3904,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	rdp->barrier_head.func = rcu_barrier_callback;
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
> +	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
>  	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>  		atomic_inc(&rcu_state.barrier_cpu_count);
> @@ -3909,7 +3912,15 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  		debug_rcu_head_unqueue(&rdp->barrier_head);
>  		rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
>  	}
> +
> +	/*
> +	 * If bypass list was non-empty, wake up the nocb GP thread otherwise
> +	 * bypass/lazy CBs may not be noticed, and can cause real long delays!
> +	 */
> +	wake_nocb = was_done && rcu_segcblist_pend_cbs(&rdp->cblist);
>  	rcu_nocb_unlock(rdp);
> +	if (wake_nocb)
> +		wake_nocb_gp(rdp, false);
>  	smp_store_release(&rdp->barrier_seq_snap, gseq);
>  }
>  
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index f77a6d7e1356..6caade0683dd 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -1558,6 +1558,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>  {
>  }
>  
> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> +{
> +}
> +
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  				  unsigned long j)
>  {
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-04 22:28   ` Frederic Weisbecker
@ 2022-10-04 22:57     ` Joel Fernandes
  2022-10-05 10:39       ` Frederic Weisbecker
  2022-10-07  2:47       ` Joel Fernandes
  0 siblings, 2 replies; 45+ messages in thread
From: Joel Fernandes @ 2022-10-04 22:57 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, youssefesmat, surenb

Hi Frederic,

On 10/4/2022 6:28 PM, Frederic Weisbecker wrote:
> On Tue, Oct 04, 2022 at 02:41:47AM +0000, Joel Fernandes (Google) wrote:
>> From: Frederic Weisbecker <frederic@kernel.org>
>>
>> In preparation of RCU lazy changes, wake up the RCU nocb gp thread if
> 
> It's more than just prep work for a new feature, it's a regression fix.

Oh ok, both our fixes are equivalent but I chose yours since it's cleaner. I was
fixing Lazy CBs since I can actually trigger this issue.

>> needed after an entrain. Otherwise, the RCU barrier callback can wait in
>> the queue for several seconds before the lazy callbacks in front of it
>> are serviced.
> 
> It's not about lazy callbacks here (but you can mention the fact that
> waking nocb_gp if necessary after flushing bypass is a beneficial side
> effect for further lazy implementation).
> 
> So here is the possible bad scenario:
> 
> 1) CPU 0 is nocb, it queues a callback
> 2) CPU 0 goes idle (or userspace with nohz_full) forever
> 3) The grace period related to that callback elapses
> 4) The callback is moved to the done list (but is not invoked yet), there are no more pending for CPU 0
> 5) CPU 1 calls rcu_barrier() and entrains to CPU 0 cblist

CPU 1 can only entrain into CPU 0 if the CPU is offline:

		if (!rcu_rdp_cpu_online(rdp)) {
			rcu_barrier_entrain(rdp);
			WARN_ON_ONCE(READ_ONCE(rdp->barrier_seq_snap) != gseq);
			raw_spin_unlock_irqrestore(&rcu_state.barrier_lock,
			...
			continue;
		}

Otherwise an IPI does the entraining. So I do not see how CPU 0 being idle
causes the cross-CPU entraining.

> 6) CPU 1 waits forever

But, I agree it can still wait forever, once the IPI handler does the
entraining, since nothing will do the GP thread wakeup.

>>
>> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")

So, do you mind writing a proper patch with a proper commit message and Fixes
tag then? It can independent of this series and add my Reported-by tag, thanks!

Thanks!

 - Joel


^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-04 22:57     ` Joel Fernandes
@ 2022-10-05 10:39       ` Frederic Weisbecker
  2022-10-07  2:47       ` Joel Fernandes
  1 sibling, 0 replies; 45+ messages in thread
From: Frederic Weisbecker @ 2022-10-05 10:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 06:57:59PM -0400, Joel Fernandes wrote:
> >> needed after an entrain. Otherwise, the RCU barrier callback can wait in
> >> the queue for several seconds before the lazy callbacks in front of it
> >> are serviced.
> > 
> > It's not about lazy callbacks here (but you can mention the fact that
> > waking nocb_gp if necessary after flushing bypass is a beneficial side
> > effect for further lazy implementation).
> > 
> > So here is the possible bad scenario:
> > 
> > 1) CPU 0 is nocb, it queues a callback
> > 2) CPU 0 goes idle (or userspace with nohz_full) forever
> > 3) The grace period related to that callback elapses
> > 4) The callback is moved to the done list (but is not invoked yet), there are no more pending for CPU 0
> > 5) CPU 1 calls rcu_barrier() and entrains to CPU 0 cblist
> 
> CPU 1 can only entrain into CPU 0 if the CPU is offline:
> 
> 		if (!rcu_rdp_cpu_online(rdp)) {
> 			rcu_barrier_entrain(rdp);
> 			WARN_ON_ONCE(READ_ONCE(rdp->barrier_seq_snap) != gseq);
> 			raw_spin_unlock_irqrestore(&rcu_state.barrier_lock,
> 			...
> 			continue;
> 		}

Ah good point. So CPU 1 sends an IPI to CPU 0, which entrains itself.
And then it looks like the result is the same.


> 
> Otherwise an IPI does the entraining. So I do not see how CPU 0 being idle
> causes the cross-CPU entraining.

It doesn't, but it shows that the CPU isn't going to enqueue any further
callbacks for a while. Though even if it did, that may not even solve the
situation, not until an RCU_NOCB_WAKE_FORCE is issued...

> 
> > 6) CPU 1 waits forever
> 
> But, I agree it can still wait forever, once the IPI handler does the
> entraining, since nothing will do the GP thread wakeup.
> 
> >>
> >> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
> 
> So, do you mind writing a proper patch with a proper commit message and Fixes
> tag then? It can independent of this series and add my Reported-by tag,
> thanks!

Ok will do.

Thanks!

> 
> Thanks!
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 18:05     ` Joel Fernandes
@ 2022-10-05 11:21       ` Uladzislau Rezki
  0 siblings, 0 replies; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-05 11:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, paulmck, rcu, linux-kernel, rushikesh.s.kadam,
	neeraj.iitr10, frederic, rostedt, youssefesmat, surenb

> On Tue, Oct 4, 2022 at 7:41 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > >               trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > >               rcu_nocb_lock_irqsave(rdp, flags);
> > >               lockdep_assert_held(&rdp->nocb_lock);
> > >               bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > -             if (bypass_ncbs &&
> > > +             lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > > +
> > > +             if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > > +                 (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > > +                  bypass_ncbs > 2 * qhimark)) {
> > Do you know why we want double "qhimark" threshold? It is not only this
> > place, there are several. I am asking because it is not expected by the
> > user.
> 
> I am following the qhimark conventions in existing code. However
> qhimark does not mean that your callbacks cannot exceed these many or
> something, it is not a hard limit on queued callbacks.
> 
> qhimark (And Paul can correct me) was introduced to reduce the number
> of callbacks after which RCU will not limit execution of callbacks to
> a batch of them. That has nothing to do with limiting the maximum
> number of callbacks, per-se. However, its usage certainly seems to
> have grown since that introduction.
> 
> Maybe you are confusing it with blimit:
> 
"blimit" controls how many/long callbacks are executed by the rcu_do_batch().
Whereas the "qhimark" controls when the bypass list is flushed to a regular one
to initiate gp and start executing.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04 18:27             ` Paul E. McKenney
@ 2022-10-05 11:28               ` Uladzislau Rezki
  0 siblings, 0 replies; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-05 11:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 11:27:37AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2022 at 06:20:03PM +0200, Uladzislau Rezki wrote:
> > On Tue, Oct 04, 2022 at 08:58:14AM -0700, Paul E. McKenney wrote:
> > > On Tue, Oct 04, 2022 at 04:53:09PM +0200, Uladzislau Rezki wrote:
> > > > On Tue, Oct 04, 2022 at 06:30:04AM -0700, Paul E. McKenney wrote:
> > > > > On Tue, Oct 04, 2022 at 01:41:38PM +0200, Uladzislau Rezki wrote:
> > > > > > >  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
> > > > > > >  		rcu_nocb_lock_irqsave(rdp, flags);
> > > > > > >  		lockdep_assert_held(&rdp->nocb_lock);
> > > > > > >  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > > > > > > -		if (bypass_ncbs &&
> > > > > > > +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> > > > > > > +
> > > > > > > +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> > > > > > > +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> > > > > > > +		     bypass_ncbs > 2 * qhimark)) {
> > > > > > Do you know why we want double "qhimark" threshold? It is not only this
> > > > > > place, there are several. I am asking because it is not expected by the
> > > > > > user.
> > > > > 
> > > > > OK, I will bite...  What does the user expect?  Or, perhaps a better
> > > > > question, how is this choice causing the user problems?
> > > > > 
> > > > Yesterday when i was checking the lazy-v6 on Android i noticed the following:
> > > > 
> > > > <snip>
> > > > ...
> > > >      rcuop/4-48      [006] d..1   184.780328: rcu_batch_start: rcu_preempt CBs=15572 bl=121
> > > >      rcuop/6-62      [000] d..1   184.796939: rcu_batch_start: rcu_preempt CBs=21503 bl=167
> > > >      rcuop/6-62      [003] d..1   184.800706: rcu_batch_start: rcu_preempt CBs=24677 bl=192
> > > >      rcuop/6-62      [005] d..1   184.803773: rcu_batch_start: rcu_preempt CBs=27117 bl=211
> > > >      rcuop/6-62      [005] d..1   184.805732: rcu_batch_start: rcu_preempt CBs=22391 bl=174
> > > >      rcuop/6-62      [005] d..1   184.809083: rcu_batch_start: rcu_preempt CBs=12554 bl=98
> > > >      rcuop/6-62      [005] d..1   184.824228: rcu_batch_start: rcu_preempt CBs=16177 bl=126
> > > >      rcuop/4-48      [006] d..1   184.836193: rcu_batch_start: rcu_preempt CBs=24129 bl=188
> > > >      rcuop/4-48      [006] d..1   184.844147: rcu_batch_start: rcu_preempt CBs=25854 bl=201
> > > >      rcuop/4-48      [006] d..1   184.847257: rcu_batch_start: rcu_preempt CBs=21328 bl=166
> > > >      rcuop/4-48      [006] d..1   184.852128: rcu_batch_start: rcu_preempt CBs=21710 bl=169
> > > > ...
> > > > <snip>
> > > > 
> > > > On my device the "qhimark" is set to:
> > > > 
> > > > <snip>
> > > > XQ-CT54:/sys/module/rcutree/parameters # cat qhimark
> > > > 10000
> > > > XQ-CT54:/sys/module/rcutree/parameters #
> > > > <snip>
> > > > 
> > > > so i expect that once we pass 10 000 callbacks threshold the flush 
> > > > should occur. This parameter gives us an opportunity to control a
> > > > memory that should be reclaimed sooner or later.
> > > 
> > > I did understand that you were surprised.
> > > 
> > > But what problem did this cause other than you being surprised?
> > > 
> > It is not about surprising. It is about expectation. So if i set a
> > threshold to 100 i expect it that around 100 callbacks my memory will
> > be reclaimed. But the resolution is 2 * 100 in fact.
> > 
> > I am not aware about any issues with it. I just noticed such behaviour
> > during testing.
> 
> Whew!!!
> 
> This value was arrived at when tuning this code to best deal with callback
> floods.
> 
Actually the "qhimark" is correctly handled by the caller and flush is
initiated exactly after what we have in the(as one of the conditions):

/sys/module/rcutree/parameters/qhimark

<snip>
if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
    ncbs >= qhimark) {
	rcu_nocb_lock(rdp);
<snip>

it is not doubled. I mixed it up with another place where you double it:

<snip>
if (bypass_ncbs &&
    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
     bypass_ncbs > 2 * qhimark)) {
<snip>

it is in the nocb_gp_wait(). Indeed, it is needed if we have a real flood
scenario.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04  2:41 ` [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power Joel Fernandes (Google)
  2022-10-04 11:41   ` Uladzislau Rezki
@ 2022-10-05 11:44   ` Uladzislau Rezki
  2022-10-06 19:11   ` Paul E. McKenney
  2 siblings, 0 replies; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-05 11:44 UTC (permalink / raw)
  To: Joel Fernandes (Google), paulmck
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, paulmck, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:48AM +0000, Joel Fernandes (Google) wrote:
> Implement timer-based RCU callback batching (also known as lazy
> callbacks). With this we save about 5-10% of power consumed due
> to RCU requests that happen when system is lightly loaded or idle.
> 
> By default, all async callbacks (queued via call_rcu) are marked
> lazy. An alternate API call_rcu_flush() is provided for the few users,
> for example synchronize_rcu(), that need the old behavior.
> 
> The batch is flushed whenever a certain amount of time has passed, or
> the batch on a particular CPU grows too big. Also memory pressure will
> flush it in a future patch.
> 
> To handle several corner cases automagically (such as rcu_barrier() and
> hotplug), we re-use bypass lists which were originally introduced to
> address lock contention, to handle lazy CBs as well. The bypass list
> length has the lazy CB length included in it. A separate lazy CB length
> counter is also introduced to keep track of the number of lazy CBs.
> 
> Suggested-by: Paul McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rcupdate.h |   7 ++
>  kernel/rcu/Kconfig       |   8 ++
>  kernel/rcu/rcu.h         |   8 ++
>  kernel/rcu/tiny.c        |   2 +-
>  kernel/rcu/tree.c        | 129 +++++++++++++++++++-----------
>  kernel/rcu/tree.h        |  12 ++-
>  kernel/rcu/tree_exp.h    |   2 +-
>  kernel/rcu/tree_nocb.h   | 164 +++++++++++++++++++++++++++++++--------
>  8 files changed, 246 insertions(+), 86 deletions(-)
> 
Tested-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

for whole v7 series.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 00/11] rcu: call_rcu() power improvements
  2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2022-10-04  2:41 ` [PATCH v7 11/11] workqueue: Make queue_rcu_work() use call_rcu_flush() Joel Fernandes (Google)
@ 2022-10-06 18:55 ` Paul E. McKenney
  2022-10-07 14:40   ` Uladzislau Rezki
  2022-10-07 15:09   ` Joel Fernandes
  11 siblings, 2 replies; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-06 18:55 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:46AM +0000, Joel Fernandes (Google) wrote:
> v7 version of RCU lazy patches based on rcu/next branch.
> 
> To facilitate easier merge, I dropped tracing and other patches and just
> implemented the new changes. I will post the tracing patches later along with
> rcutop as I need to add new tracepoints that Frederic suggested.
> 
> Main recent changes:
> 1. rcu_barrier() wake up only for lazy bypass list.
> 2. Make all call_rcu() default-lazy and add call_rcu_flush() API.
> 3. Take care of some callers using call_rcu_flush() API.
> 4. Several refactorings suggested by Paul/Frederic.
> 5. New call_rcu() to call_rcu_flush() conversions by Joel/Vlad/Paul.
> 
> I am seeing good performance and power with these patches on real ChromeOS x86
> asymmetric hardware.
> 
> Earlier cover letter with lots of details is here:
> https://lore.kernel.org/all/20220901221720.1105021-1-joel@joelfernandes.org/
> 
> List of recent changes:
>     
>     [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
>       deferral levels wake much earlier so for those it is not needed. ]
>     
>     [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
>     
>     [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]
>     
>     [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]
>     
>     [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
>     
>     [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]
>     
>     [ Paul/Frederic : Cosmetic changes, split out wakeup of nocb thread. ]
>     
>     [ Vlad/Joel : More call_rcu -> flush conversions ]

Thank you for your continued work on this!

I pulled these into an experimental branch, applied Uladzislau's
Tested-by and ran a quick round of rcutorture.

From TREE02, TREE03, and TREE09 I got this:

In file included from kernel/rcu/tree.c:68:
kernel/rcu/tree.h:449:13: error: ‘wake_nocb_gp’ used but never defined [-Werror]
  449 | static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
      |             ^~~~~~~~~~~~

One could argue that this is not a big deal, except that Linus gets a
bit tetchy when this sort of thing shows up in mainline.

Nevertheless, it looks like we might be getting quite close!

							Thanx, Paul

> Frederic Weisbecker (1):
> rcu: Wake up nocb gp thread on rcu_barrier_entrain()
> 
> Joel Fernandes (Google) (7):
> rcu: Make call_rcu() lazy to save power
> rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
> rcuscale: Add laziness and kfree tests
> percpu-refcount: Use call_rcu_flush() for atomic switch
> rcu/sync: Use call_rcu_flush() instead of call_rcu
> rcu/rcuscale: Use call_rcu_flush() for async reader test
> rcu/rcutorture: Use call_rcu_flush() where needed
> 
> Uladzislau Rezki (2):
> scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
> workqueue: Make queue_rcu_work() use call_rcu_flush()
> 
> Vineeth Pillai (1):
> rcu: shrinker for lazy rcu
> 
> drivers/scsi/scsi_error.c |   2 +-
> include/linux/rcupdate.h  |   7 ++
> kernel/rcu/Kconfig        |   8 ++
> kernel/rcu/rcu.h          |   8 ++
> kernel/rcu/rcuscale.c     |  67 +++++++++++-
> kernel/rcu/rcutorture.c   |  16 +--
> kernel/rcu/sync.c         |   2 +-
> kernel/rcu/tiny.c         |   2 +-
> kernel/rcu/tree.c         | 140 +++++++++++++++++--------
> kernel/rcu/tree.h         |  12 ++-
> kernel/rcu/tree_exp.h     |   2 +-
> kernel/rcu/tree_nocb.h    | 213 ++++++++++++++++++++++++++++++++------
> kernel/workqueue.c        |   2 +-
> lib/percpu-refcount.c     |   3 +-
> 14 files changed, 388 insertions(+), 96 deletions(-)
> 
> --
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-04  2:41 ` [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power Joel Fernandes (Google)
  2022-10-04 11:41   ` Uladzislau Rezki
  2022-10-05 11:44   ` Uladzislau Rezki
@ 2022-10-06 19:11   ` Paul E. McKenney
  2022-10-11 17:44     ` Joel Fernandes
  2 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-06 19:11 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:48AM +0000, Joel Fernandes (Google) wrote:
> Implement timer-based RCU callback batching (also known as lazy
> callbacks). With this we save about 5-10% of power consumed due
> to RCU requests that happen when system is lightly loaded or idle.
> 
> By default, all async callbacks (queued via call_rcu) are marked
> lazy. An alternate API call_rcu_flush() is provided for the few users,
> for example synchronize_rcu(), that need the old behavior.
> 
> The batch is flushed whenever a certain amount of time has passed, or
> the batch on a particular CPU grows too big. Also memory pressure will
> flush it in a future patch.
> 
> To handle several corner cases automagically (such as rcu_barrier() and
> hotplug), we re-use bypass lists which were originally introduced to
> address lock contention, to handle lazy CBs as well. The bypass list
> length has the lazy CB length included in it. A separate lazy CB length
> counter is also introduced to keep track of the number of lazy CBs.
> 
> Suggested-by: Paul McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

A few comments below.  Perhaps these seem picky, but that should be
taken as evidence that we are getting close. ;-)

							Thanx, Paul

> ---
>  include/linux/rcupdate.h |   7 ++
>  kernel/rcu/Kconfig       |   8 ++
>  kernel/rcu/rcu.h         |   8 ++
>  kernel/rcu/tiny.c        |   2 +-
>  kernel/rcu/tree.c        | 129 +++++++++++++++++++-----------
>  kernel/rcu/tree.h        |  12 ++-
>  kernel/rcu/tree_exp.h    |   2 +-
>  kernel/rcu/tree_nocb.h   | 164 +++++++++++++++++++++++++++++++--------
>  8 files changed, 246 insertions(+), 86 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 08605ce7379d..40ae36904825 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func);
> +#else
> +static inline void call_rcu_flush(struct rcu_head *head,
> +		rcu_callback_t func) {  call_rcu(head, func); }
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index f53ad63b2bc6..edd632e68497 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default n
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +	  pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index be5979da07f5..65704cbc9df7 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -474,6 +474,14 @@ enum rcutorture_type {
>  	INVALID_RCU_FLAVOR
>  };
>  
> +#if defined(CONFIG_RCU_LAZY)
> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
> +#else
> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
> +#endif
> +
>  #if defined(CONFIG_TREE_RCU)
>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>  			    unsigned long *gp_seq);
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index a33a8d4942c3..810479cf17ba 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
>  
>  void rcu_barrier(void)
>  {
> -	wait_rcu_gp(call_rcu);
> +	wait_rcu_gp(call_rcu_flush);
>  }
>  EXPORT_SYMBOL(rcu_barrier);
>  
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 04f33191e5ed..c20544c4aa29 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>  	raw_spin_unlock_rcu_node(rnp);
>  }
>  
> -/**
> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
> - * @head: structure to be used for queueing the RCU updates.
> - * @func: actual callback function to be invoked after the grace period
> - *
> - * The callback function will be invoked some time after a full grace
> - * period elapses, in other words after all pre-existing RCU read-side
> - * critical sections have completed.  However, the callback function
> - * might well execute concurrently with RCU read-side critical sections
> - * that started after call_rcu() was invoked.
> - *
> - * RCU read-side critical sections are delimited by rcu_read_lock()
> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
> - * v5.0 and later, regions of code across which interrupts, preemption,
> - * or softirqs have been disabled also serve as RCU read-side critical
> - * sections.  This includes hardware interrupt handlers, softirq handlers,
> - * and NMI handlers.
> - *
> - * Note that all CPUs must agree that the grace period extended beyond
> - * all pre-existing RCU read-side critical section.  On systems with more
> - * than one CPU, this means that when "func()" is invoked, each CPU is
> - * guaranteed to have executed a full memory barrier since the end of its
> - * last RCU read-side critical section whose beginning preceded the call
> - * to call_rcu().  It also means that each CPU executing an RCU read-side
> - * critical section that continues beyond the start of "func()" must have
> - * executed a memory barrier after the call_rcu() but before the beginning
> - * of that RCU read-side critical section.  Note that these guarantees
> - * include CPUs that are offline, idle, or executing in user mode, as
> - * well as CPUs that are executing in the kernel.
> - *
> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> - * resulting RCU callback function "func()", then both CPU A and CPU B are
> - * guaranteed to execute a full memory barrier during the time interval
> - * between the call to call_rcu() and the invocation of "func()" -- even
> - * if CPU A and CPU B are the same CPU (but again only if the system has
> - * more than one CPU).
> - *
> - * Implementation of these memory-ordering guarantees is described here:
> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> - */
> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +static void
> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>  {
>  	static atomic_t doublefrees;
>  	unsigned long flags;
> @@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  	}
>  
>  	check_cb_ovld(rdp);
> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>  	rcu_segcblist_enqueue(&rdp->cblist, head);
> @@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		local_irq_restore(flags);
>  	}
>  }
> -EXPORT_SYMBOL_GPL(call_rcu);
>  
> +#ifdef CONFIG_RCU_LAZY
> +/**
> + * call_rcu_flush() - Queue RCU callback for invocation after grace period, and
> + * flush all lazy callbacks (including the new one) to the main ->cblist while
> + * doing so.
> + *
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.
> + *
> + * Use this API instead of call_rcu() if you don't want the callback to be
> + * invoked after very long periods of time, which can happen on systems without
> + * memory pressure and on systems which are lightly loaded or mostly idle.
> + * This function will cause callbacks to be invoked sooner than later at the
> + * expense of extra power. Other than that, this function is identical to, and
> + * reuses call_rcu()'s logic. Refer to call_rcu() for more details about memory
> + * ordering and other functionality.
> + */
> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, false);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_flush);
> +#endif
> +
> +/**
> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
> + * By default the callbacks are 'lazy' and are kept hidden from the main
> + * ->cblist to prevent starting of grace periods too soon.
> + * If you desire grace periods to start very soon, use call_rcu_flush().
> + *
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.  However, the callback function
> + * might well execute concurrently with RCU read-side critical sections
> + * that started after call_rcu() was invoked.
> + *
> + * RCU read-side critical sections are delimited by rcu_read_lock()
> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
> + * v5.0 and later, regions of code across which interrupts, preemption,
> + * or softirqs have been disabled also serve as RCU read-side critical
> + * sections.  This includes hardware interrupt handlers, softirq handlers,
> + * and NMI handlers.
> + *
> + * Note that all CPUs must agree that the grace period extended beyond
> + * all pre-existing RCU read-side critical section.  On systems with more
> + * than one CPU, this means that when "func()" is invoked, each CPU is
> + * guaranteed to have executed a full memory barrier since the end of its
> + * last RCU read-side critical section whose beginning preceded the call
> + * to call_rcu().  It also means that each CPU executing an RCU read-side
> + * critical section that continues beyond the start of "func()" must have
> + * executed a memory barrier after the call_rcu() but before the beginning
> + * of that RCU read-side critical section.  Note that these guarantees
> + * include CPUs that are offline, idle, or executing in user mode, as
> + * well as CPUs that are executing in the kernel.
> + *
> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> + * resulting RCU callback function "func()", then both CPU A and CPU B are
> + * guaranteed to execute a full memory barrier during the time interval
> + * between the call to call_rcu() and the invocation of "func()" -- even
> + * if CPU A and CPU B are the same CPU (but again only if the system has
> + * more than one CPU).
> + *
> + * Implementation of these memory-ordering guarantees is described here:
> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> + */
> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, true);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
>  
>  /* Maximum number of jiffies to wait before draining a batch. */
>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
> @@ -3507,7 +3544,7 @@ void synchronize_rcu(void)
>  		if (rcu_gp_is_expedited())
>  			synchronize_rcu_expedited();
>  		else
> -			wait_rcu_gp(call_rcu);
> +			wait_rcu_gp(call_rcu_flush);
>  		return;
>  	}
>  
> @@ -3905,7 +3942,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
>  	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>  		atomic_inc(&rcu_state.barrier_cpu_count);
>  	} else {
> @@ -4334,7 +4371,7 @@ void rcutree_migrate_callbacks(int cpu)
>  	my_rdp = this_cpu_ptr(&rcu_data);
>  	my_rnp = my_rdp->mynode;
>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false));
>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>  	/* Leverage recent GPs and set GP for new callbacks. */
>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index d4a97e40ea9c..1d803d39f0d1 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -263,14 +263,16 @@ struct rcu_data {
>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>  
> +	long lazy_len;			/* Length of buffered lazy callbacks. */
>  	int cpu;
>  };
>  
>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>  #define RCU_NOCB_WAKE_NOT	0
>  #define RCU_NOCB_WAKE_BYPASS	1
> -#define RCU_NOCB_WAKE		2
> -#define RCU_NOCB_WAKE_FORCE	3
> +#define RCU_NOCB_WAKE_LAZY	2
> +#define RCU_NOCB_WAKE		3
> +#define RCU_NOCB_WAKE_FORCE	4
>  
>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>  					/* For jiffies_till_first_fqs and */
> @@ -440,9 +442,11 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>  static void rcu_init_one_nocb(struct rcu_node *rnp);
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j);
> +				  unsigned long j, bool lazy);
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags);
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy);
> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>  				 unsigned long flags);
>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 18e9b4cd78ef..5cac05600798 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void)
>  
>  	/* If expedited grace periods are prohibited, fall back to normal. */
>  	if (rcu_gp_is_normal()) {
> -		wait_rcu_gp(call_rcu);
> +		wait_rcu_gp(call_rcu_flush);
>  		return;
>  	}
>  
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 6caade0683dd..56616bfbe156 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>  }
>  
> +/*
> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> + * could be flushed much earlier for a number of other reasons
> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> + * left unsubmitted to RCU after those many jiffies.
> + */
> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> +
> +#ifdef CONFIG_RCU_LAZY
> +// To be called only from test code.
> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
> +{
> +	jiffies_till_flush = jif;
> +}
> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
> +
> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
> +{
> +	return jiffies_till_flush;
> +}
> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
> +#endif
> +
>  /*
>   * Arrange to wake the GP kthread for this NOCB group at some future
>   * time when it is safe to do so.
> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>  
>  	/*
> -	 * Bypass wakeup overrides previous deferments. In case
> -	 * of callback storm, no need to wake up too early.
> +	 * Bypass wakeup overrides previous deferments. In case of
> +	 * callback storm, no need to wake up too early.
>  	 */
> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
> +	if (waketype == RCU_NOCB_WAKE_LAZY &&
> +	    READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
> +		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> +	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
>  		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>  		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>  	} else {
> @@ -293,10 +322,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>   * proves to be initially empty, just return false because the no-CB GP
>   * kthread may need to be awakened in this case.
>   *
> + * Return true if there was something to be flushed and it succeeded, otherwise
> + * false.
> + *
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				     unsigned long j)
> +				     unsigned long j, bool lazy)
>  {
>  	struct rcu_cblist rcl;
>  
> @@ -310,7 +342,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
>  	if (rhp)
>  		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
> -	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +
> +	/*
> +	 * If the new CB requested was a lazy one, queue it onto the main
> +	 * ->cblist so we can take advantage of a sooner grade period.
> +	 */
> +	if (lazy && rhp) {
> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
> +		rcu_cblist_enqueue(&rcl, rhp);
> +		WRITE_ONCE(rdp->lazy_len, 0);
> +	} else {
> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +		WRITE_ONCE(rdp->lazy_len, 0);
> +	}

I was going to complain about this again, but I see the cleanup commit.  ;-)

> +
>  	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
>  	WRITE_ONCE(rdp->nocb_bypass_first, j);
>  	rcu_nocb_bypass_unlock(rdp);
> @@ -326,13 +371,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy)
>  {
>  	if (!rcu_rdp_is_offloaded(rdp))
>  		return true;
>  	rcu_lockdep_assert_cblist_protected(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> +	return rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
>  }
>  
>  /*
> @@ -345,7 +390,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>  	if (!rcu_rdp_is_offloaded(rdp) ||
>  	    !rcu_nocb_bypass_trylock(rdp))
>  		return;
> -	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
> +	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
>  }
>  
>  /*
> @@ -367,12 +412,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>   * there is only one CPU in operation.
>   */
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy)
>  {
>  	unsigned long c;
>  	unsigned long cur_gp_seq;
>  	unsigned long j = jiffies;
>  	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));

In the alphabet, "b" comes before "c".  Picky, but it does make these
things easier to find.  Which can be important when you are searching
for all occurrences of something -- nice to be able to find the variable
in alpha order without interrupting your search.
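
In this particular case that probably means declaring it up top and assigning
it once ncbs is available, something like this (untested):

	bool bypass_is_lazy;
	unsigned long c;
	unsigned long cur_gp_seq;
	unsigned long j = jiffies;
	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);

	bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));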

>  	lockdep_assert_irqs_disabled();
>  
> @@ -417,25 +464,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// If there hasn't yet been all that many ->cblist enqueues
>  	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
>  	// ->nocb_bypass first.
> -	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
> +	// Lazy CBs throttle this back and do immediate bypass queuing.
> +	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
>  		rcu_nocb_lock(rdp);
>  		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  		if (*was_alldone)
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstQ"));
> -		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
> +

Can't say I get the rationale for some of the added blank lines, but
I am not (yet) too worried about it.

> +		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false));
>  		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
>  		return false; // Caller must enqueue the callback.
>  	}
>  
>  	// If ->nocb_bypass has been used too long or is too full,
>  	// flush ->nocb_bypass to ->cblist.
> -	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
> +	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
> +	    (ncbs &&  bypass_is_lazy &&
> +		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||

Please line up the "(" of the above line with the "n" in "ncbs" of the
line preceding it.
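
That is, something like:

 	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
 	    (ncbs &&  bypass_is_lazy &&
 	     (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
 	    ncbs >= qhimark) {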

>  	    ncbs >= qhimark) {
>  		rcu_nocb_lock(rdp);
>  		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  
> -		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
> +		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy)) {
>  			if (*was_alldone)
>  				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  						    TPS("FirstQ"));
> @@ -460,16 +511,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// We need to use the bypass.
>  	rcu_nocb_wait_contended(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> +
>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
> +
> +	if (lazy)
> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
> +
>  	if (!ncbs) {
>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>  	}
> +
>  	rcu_nocb_bypass_unlock(rdp);
>  	smp_mb(); /* Order enqueue before wake. */
> -	if (ncbs) {
> +
> +	// A wake up of the grace period kthread or timer adjustment needs to
> +	// be done only if:
> +	// 1. Bypass list was fully empty before (this is the first bypass list entry).
> +	//	Or, both the below conditions are met:
> +	// 1. Bypass list had only lazy CBs before.
> +	// 2. The new CB is non-lazy.

How about like this?

	// A wake up of the grace period kthread or timer adjustment
	// needs to be done only if:
	// 1. Bypass list was fully empty before (this is the first
	//    bypass list entry), or:
	// 2. Both of these conditions are met:
	//    a. The bypass list previously had only lazy CBs, and:
	//    b. The new CB is non-lazy.
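
That way the comment tracks the check just below it, whose else/wakeup path
is taken exactly when

	!ncbs || (bypass_is_lazy && !lazy)

that is, in case 1 or case 2 above.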

> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>  		local_irq_restore(flags);
>  	} else {
>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
> @@ -497,9 +561,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>  				 unsigned long flags)
>  				 __releases(rdp->nocb_lock)
>  {
> +	long bypass_len;
>  	unsigned long cur_gp_seq;
>  	unsigned long j;
> -	long len;
> +	long len, lazy_len;

One on each line, please, with lazy_len first.
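
In other words:

 	long lazy_len;
 	long len;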

>  	struct task_struct *t;
>  
>  	// If we are being polled or there is no kthread, just leave.
> @@ -512,9 +577,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>  	}
>  	// Need to actually to a wakeup.
>  	len = rcu_segcblist_n_cbs(&rdp->cblist);
> +	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +	lazy_len = READ_ONCE(rdp->lazy_len);
>  	if (was_alldone) {
>  		rdp->qlen_last_fqs_check = len;
> -		if (!irqs_disabled_flags(flags)) {
> +		// Only lazy CBs in bypass list
> +		if (lazy_len && bypass_len == lazy_len) {
> +			rcu_nocb_unlock_irqrestore(rdp, flags);
> +			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
> +					   TPS("WakeLazy"));
> +		} else if (!irqs_disabled_flags(flags)) {
>  			/* ... if queue was empty ... */
>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>  			wake_nocb_gp(rdp, false);
> @@ -611,6 +683,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	unsigned long flags;
>  	bool gotcbs = false;
>  	unsigned long j = jiffies;
> +	bool lazy = false;
> +	long lazy_ncbs;
>  	bool needwait_gp = false; // This prevents actual uninitialized use.
>  	bool needwake;
>  	bool needwake_gp;
> @@ -640,24 +714,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	 * won't be ignored for long.
>  	 */
>  	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
> +		bool flush_bypass = false;
> +
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
>  		rcu_nocb_lock_irqsave(rdp, flags);
>  		lockdep_assert_held(&rdp->nocb_lock);
>  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> -		if (bypass_ncbs &&
> +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +
> +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> +		     bypass_ncbs > 2 * qhimark)) {
> +			flush_bypass = true;
> +		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
>  		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
>  		     bypass_ncbs > 2 * qhimark)) {
> -			// Bypass full or old, so flush it.
> -			(void)rcu_nocb_try_flush_bypass(rdp, j);
> -			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +			flush_bypass = true;
>  		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>  			continue; /* No callbacks here, try next. */
>  		}
> +
> +		if (flush_bypass) {
> +			// Bypass full or old, so flush it.
> +			(void)rcu_nocb_try_flush_bypass(rdp, j);
> +			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +			lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +		}
> +
>  		if (bypass_ncbs) {
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> -					    TPS("Bypass"));
> -			bypass = true;
> +				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));

Please line this up.  Given 100-column lines, one easy way to do that
is as follows:

 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
					    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));

> +			if (bypass_ncbs == lazy_ncbs)
> +				lazy = true;
> +			else
> +				bypass = true;
>  		}
>  		rnp = rdp->mynode;
>  
> @@ -705,12 +796,20 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	my_rdp->nocb_gp_gp = needwait_gp;
>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
>  
> -	if (bypass && !rcu_nocb_poll) {
> -		// At least one child with non-empty ->nocb_bypass, so set
> -		// timer in order to avoid stranding its callbacks.
> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> -				   TPS("WakeBypassIsDeferred"));
> +	// At least one child with non-empty ->nocb_bypass, so set
> +	// timer in order to avoid stranding its callbacks.
> +	if (!rcu_nocb_poll) {
> +		// If bypass list only has lazy CBs. Add a deferred lazy wake up.
> +		if (lazy && !bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> +					TPS("WakeLazyIsDeferred"));
> +		// Otherwise add a deferred bypass wake up.
> +		} else if (bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> +					TPS("WakeBypassIsDeferred"));
> +		}
>  	}
> +
>  	if (rcu_nocb_poll) {
>  		/* Polling, so trace if first poll in the series. */
>  		if (gotcbs)
> @@ -1036,7 +1135,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>  	 * return false, which means that future calls to rcu_nocb_try_bypass()
>  	 * will refuse to put anything into the bypass.
>  	 */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
>  	/*
>  	 * Start with invoking rcu_core() early. This way if the current thread
>  	 * happens to preempt an ongoing call to rcu_core() in the middle,
> @@ -1278,6 +1377,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  	raw_spin_lock_init(&rdp->nocb_gp_lock);
>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
>  	rcu_cblist_init(&rdp->nocb_bypass);
> +	WRITE_ONCE(rdp->lazy_len, 0);
>  	mutex_init(&rdp->nocb_gp_kthread_mutex);
>  }
>  
> @@ -1558,18 +1658,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>  {
>  }
>  
> -static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> -{
> -}
> -
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy)
>  {
>  	return true;
>  }
>  
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags, bool lazy)
>  {
>  	return false;
>  }
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 
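
One more general note while I am here: with call_rcu() now lazy by default,
the intended split for callers is roughly the following (illustrative callback
names only, not from this series):

	/* Typical freeing path, nothing waits on the grace period: stay lazy. */
	call_rcu(&p->rcu, free_foo_cb);

	/* Someone actually waits on (or soon needs) the grace period: flush. */
	call_rcu_flush(&p->rcu, wake_waiter_cb);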

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 03/11] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass()
  2022-10-04  2:41 ` [PATCH v7 03/11] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() Joel Fernandes (Google)
@ 2022-10-06 19:12   ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-06 19:12 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:49AM +0000, Joel Fernandes (Google) wrote:
> This consolidates the code a bit and makes it cleaner. Functionally it
> is the same.
> 
> Reported-by: Paul E. McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree_nocb.h | 17 +++++++++--------
>  1 file changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 56616bfbe156..dfa736da68ab 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -327,10 +327,11 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>   *
>   * Note that this function always returns true if rhp is NULL.
>   */
> -static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> +static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp_in,
>  				     unsigned long j, bool lazy)
>  {
>  	struct rcu_cblist rcl;
> +	struct rcu_head *rhp = rhp_in;
>  
>  	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
>  	rcu_lockdep_assert_cblist_protected(rdp);
> @@ -345,16 +346,16 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  
>  	/*
>  	 * If the new CB requested was a lazy one, queue it onto the main
> -	 * ->cblist so we can take advantage of a sooner grade period.
> +	 * ->cblist so that we can take advantage of the grace-period that will
> +	 * happen regardless. But queue it onto the bypass list first so that
> +	 * the lazy CB is ordered with the existing CBs in the bypass list.
>  	 */
>  	if (lazy && rhp) {
> -		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
> -		rcu_cblist_enqueue(&rcl, rhp);
> -		WRITE_ONCE(rdp->lazy_len, 0);
> -	} else {
> -		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> -		WRITE_ONCE(rdp->lazy_len, 0);
> +		rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
> +		rhp = NULL;
>  	}
> +	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +	WRITE_ONCE(rdp->lazy_len, 0);

Much much better, thank you!

							Thanx, Paul

>  	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
>  	WRITE_ONCE(rdp->nocb_bypass_first, j);
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 05/11] rcuscale: Add laziness and kfree tests
  2022-10-04  2:41 ` [PATCH v7 05/11] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
@ 2022-10-06 19:15   ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-06 19:15 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:51AM +0000, Joel Fernandes (Google) wrote:
> We add two tests to rcuscale. The first is a startup test that checks
> whether we are being neither too lazy nor too hard-working. The second
> emulates kfree_rcu() on top of call_rcu() and checks memory pressure. In
> my testing, the new call_rcu() does well at keeping memory pressure under
> control, similar to kfree_rcu().
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/rcuscale.c | 65 ++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 64 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
> index 3ef02d4a8108..027b7c1e7613 100644
> --- a/kernel/rcu/rcuscale.c
> +++ b/kernel/rcu/rcuscale.c
> @@ -95,6 +95,7 @@ torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
>  torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
>  torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
>  torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
> +torture_param(int, kfree_by_call_rcu, 0, "Use call_rcu() to emulate kfree_rcu()?");
>  
>  static char *scale_type = "rcu";
>  module_param(scale_type, charp, 0444);
> @@ -659,6 +660,14 @@ struct kfree_obj {
>  	struct rcu_head rh;
>  };
>  
> +/* Used if doing RCU-kfree'ing via call_rcu(). */
> +static void kfree_call_rcu(struct rcu_head *rh)
> +{
> +	struct kfree_obj *obj = container_of(rh, struct kfree_obj, rh);
> +
> +	kfree(obj);
> +}
> +
>  static int
>  kfree_scale_thread(void *arg)
>  {
> @@ -696,6 +705,11 @@ kfree_scale_thread(void *arg)
>  			if (!alloc_ptr)
>  				return -ENOMEM;
>  
> +			if (kfree_by_call_rcu) {
> +				call_rcu(&(alloc_ptr->rh), kfree_call_rcu);
> +				continue;
> +			}
> +
>  			// By default kfree_rcu_test_single and kfree_rcu_test_double are
>  			// initialized to false. If both have the same value (false or true)
>  			// both are randomly tested, otherwise only the one with value true
> @@ -767,11 +781,58 @@ kfree_scale_shutdown(void *arg)
>  	return -EINVAL;
>  }
>  
> +// Used if doing RCU-kfree'ing via call_rcu().
> +static unsigned long jiffies_at_lazy_cb;
> +static struct rcu_head lazy_test1_rh;
> +static int rcu_lazy_test1_cb_called;
> +static void call_rcu_lazy_test1(struct rcu_head *rh)
> +{
> +	jiffies_at_lazy_cb = jiffies;
> +	WRITE_ONCE(rcu_lazy_test1_cb_called, 1);
> +}
> +
>  static int __init
>  kfree_scale_init(void)
>  {
>  	long i;
>  	int firsterr = 0;
> +	unsigned long orig_jif, jif_start;

Separate lines in alphabetic order, please.
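
That is:

 	unsigned long jif_start;
 	unsigned long orig_jif;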

> +
> +	// Also, do a quick self-test to ensure laziness is as much as
> +	// expected.
> +	if (kfree_by_call_rcu && !IS_ENABLED(CONFIG_RCU_LAZY)) {
> +		pr_alert("CONFIG_RCU_LAZY is disabled, falling back to kfree_rcu() "
> +			 "for delayed RCU kfree'ing\n");
> +		kfree_by_call_rcu = 0;
> +	}
> +
> +	if (kfree_by_call_rcu) {
> +		/* do a test to check the timeout. */
> +		orig_jif = rcu_lazy_get_jiffies_till_flush();
> +
> +		rcu_lazy_set_jiffies_till_flush(2 * HZ);
> +		rcu_barrier();
> +
> +		jif_start = jiffies;
> +		jiffies_at_lazy_cb = 0;
> +		call_rcu(&lazy_test1_rh, call_rcu_lazy_test1);
> +
> +		smp_cond_load_relaxed(&rcu_lazy_test1_cb_called, VAL == 1);
> +
> +		rcu_lazy_set_jiffies_till_flush(orig_jif);
> +
> +		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
> +			pr_alert("ERROR: call_rcu() CBs are not being lazy as expected!\n");
> +			WARN_ON_ONCE(1);
> +			return -1;
> +		}
> +
> +		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start > 3 * HZ)) {
> +			pr_alert("ERROR: call_rcu() CBs are being too lazy!\n");
> +			WARN_ON_ONCE(1);

I am concerned about false positives for this, and especially about false
negatives given other activity in the system.
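
If the upper-bound check does end up triggering spuriously on busy systems,
one option might be to keep the lower bound as a hard failure but downgrade
the "too lazy" side to a complaint, something like this (untested):

	if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
		pr_alert("ERROR: call_rcu() CBs are not being lazy as expected!\n");
		return -1;
	}

	/* A heavily loaded guest can legitimately delay the flush a bit. */
	if (jiffies_at_lazy_cb - jif_start > 3 * HZ)
		pr_alert("WARNING: call_rcu() CBs are being too lazy (system load?)\n");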

But let's give it a shot and see how it does.
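
For anyone else wanting to poke at this, it should be reachable with something
like the following (assuming I have the parameters right):

	tools/testing/selftests/rcutorture/bin/kvm.sh --torture rcuscale \
		--kconfig "CONFIG_RCU_LAZY=y" \
		--bootargs "rcuscale.kfree_rcu_test=1 rcuscale.kfree_by_call_rcu=1"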

							Thanx, Paul

> +			return -1;
> +		}
> +	}
>  
>  	kfree_nrealthreads = compute_real(kfree_nthreads);
>  	/* Start up the kthreads. */
> @@ -784,7 +845,9 @@ kfree_scale_init(void)
>  		schedule_timeout_uninterruptible(1);
>  	}
>  
> -	pr_alert("kfree object size=%zu\n", kfree_mult * sizeof(struct kfree_obj));
> +	pr_alert("kfree object size=%zu, kfree_by_call_rcu=%d\n",
> +			kfree_mult * sizeof(struct kfree_obj),
> +			kfree_by_call_rcu);
>  
>  	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
>  			       GFP_KERNEL);
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-04 22:57     ` Joel Fernandes
  2022-10-05 10:39       ` Frederic Weisbecker
@ 2022-10-07  2:47       ` Joel Fernandes
  2022-10-07 11:26         ` Frederic Weisbecker
  1 sibling, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07  2:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 06:57:59PM -0400, Joel Fernandes wrote:
> Hi Frederic,
> 
> On 10/4/2022 6:28 PM, Frederic Weisbecker wrote:
> > On Tue, Oct 04, 2022 at 02:41:47AM +0000, Joel Fernandes (Google) wrote:
> >> From: Frederic Weisbecker <frederic@kernel.org>
> >>
> >> In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
> > 
> > It's more than just prep work for a new feature, it's a regression fix.
> 
> Oh ok, both our fixes are equivalent but I chose yours since it's cleaner. I was
> fixing Lazy CBs since I can actually trigger this issue.
> 
> >> needed after an entrain. Otherwise, the RCU barrier callback can wait in
> >> the queue for several seconds before the lazy callbacks in front of it
> >> are serviced.
> > 
> > It's not about lazy callbacks here (but you can mention the fact that
> > waking nocb_gp if necessary after flushing bypass is a beneficial side
> > effect for further lazy implementation).
> > 
> > So here is the possible bad scenario:
> > 
> > 1) CPU 0 is nocb, it queues a callback
> > 2) CPU 0 goes idle (or userspace with nohz_full) forever
> > 3) The grace period related to that callback elapses
> > 4) The callback is moved to the done list (but is not invoked yet), and there are no more pending callbacks for CPU 0
> > 5) CPU 1 calls rcu_barrier() and entrains to CPU 0 cblist
> 
> CPU 1 can only entrain into CPU 0 if the CPU is offline:
> 
> 		if (!rcu_rdp_cpu_online(rdp)) {
> 			rcu_barrier_entrain(rdp);
> 			WARN_ON_ONCE(READ_ONCE(rdp->barrier_seq_snap) != gseq);
> 			raw_spin_unlock_irqrestore(&rcu_state.barrier_lock,
> 			...
> 			continue;
> 		}
> 
> Otherwise an IPI does the entraining. So I do not see how CPU 0 being idle
> causes the cross-CPU entraining.
> 
> > 6) CPU 1 waits forever
> 
> But, I agree it can still wait forever, once the IPI handler does the
> entraining, since nothing will do the GP thread wakeup.
> 
> >>
> >> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
> 
> So, do you mind writing a proper patch with a proper commit message and Fixes
> tag then? It can be independent of this series and add my Reported-by tag, thanks!

And a small hunk went into the wrong patch, so I pulled it back into this one;
here is the updated version of this patch:

(Don't pull this as Frederic will post this one separately, just posting here
 for reference).
---8<-----------------------

From: Frederic Weisbecker <frederic@kernel.org>
Subject: [PATCH] rcu: Wake up nocb gp thread on rcu_barrier_entrain()

In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
needed after an entrain. Otherwise, the RCU barrier callback can wait in
the queue for several seconds before the lazy callbacks in front of it
are serviced.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c      | 11 +++++++++++
 kernel/rcu/tree.h      |  1 +
 kernel/rcu/tree_nocb.h |  4 ++++
 3 files changed, 16 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..04f33191e5ed 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3894,6 +3894,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 {
 	unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
 	unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
+	bool wake_nocb = false;
+	bool was_done = false;
 
 	lockdep_assert_held(&rcu_state.barrier_lock);
 	if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
@@ -3902,6 +3904,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
+	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
 	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
@@ -3909,7 +3912,15 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 		debug_rcu_head_unqueue(&rdp->barrier_head);
 		rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
 	}
+
+	/*
+	 * If bypass list was non-empty, wake up the nocb GP thread otherwise
+	 * bypass/lazy CBs may not be noticed, and can cause real long delays!
+	 */
+	wake_nocb = was_done && rcu_segcblist_pend_cbs(&rdp->cblist);
 	rcu_nocb_unlock(rdp);
+	if (wake_nocb)
+		wake_nocb_gp(rdp, false);
 	smp_store_release(&rdp->barrier_seq_snap, gseq);
 }
 
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..41faaf8eb872 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -443,6 +443,7 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				bool *was_alldone, unsigned long flags);
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index f77a6d7e1356..6caade0683dd 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1558,6 +1558,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
+{
+}
+
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j)
 {
-- 
2.38.0.rc1.362.ged0d419d3c-goog


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-04  2:41 ` [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu() Joel Fernandes (Google)
@ 2022-10-07  3:18   ` Joel Fernandes
  2022-10-07 17:19     ` Joel Fernandes
  0 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07  3:18 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb

On Tue, Oct 04, 2022 at 02:41:56AM +0000, Joel Fernandes (Google) wrote:
> From: Uladzislau Rezki <urezki@gmail.com>
> 
> Slow boot time is seen on KVM running typical Linux distributions due to
> the SCSI layer calling call_rcu(). Recent changes to save power may be
> causing this slowness. Using call_rcu_flush() fixes the issue and brings
> the boot time back to what it originally was. Convert it.
> 
> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

And I successfully set up Debian on KVM and verified that this fixes it, so
now I have a nice reproducible rig for my
'lazy-callback-doing-a-wakeup-detector' (I wrote half the detector thanks to
ideas from Steve, and will finish the other half tomorrow or so).

Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


> ---
>  drivers/scsi/scsi_error.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 448748e3fba5..a56cfd612e3a 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -312,7 +312,7 @@ void scsi_eh_scmd_add(struct scsi_cmnd *scmd)
>  	 * Ensure that all tasks observe the host state change before the
>  	 * host_failed change.
>  	 */
> -	call_rcu(&scmd->rcu, scsi_eh_inc_host_failed);
> +	call_rcu_flush(&scmd->rcu, scsi_eh_inc_host_failed);
>  }
>  
>  /**
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-07  2:47       ` Joel Fernandes
@ 2022-10-07 11:26         ` Frederic Weisbecker
  2022-10-07 12:46           ` Joel Fernandes
  0 siblings, 1 reply; 45+ messages in thread
From: Frederic Weisbecker @ 2022-10-07 11:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 02:47:52AM +0000, Joel Fernandes wrote:
> On Tue, Oct 04, 2022 at 06:57:59PM -0400, Joel Fernandes wrote:
> 
> From: Frederic Weisbecker <frederic@kernel.org>
> Subject: [PATCH] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
> 
> In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
> needed after an entrain. Otherwise, the RCU barrier callback can wait in
> the queue for several seconds before the lazy callbacks in front of it
> are serviced.
> 
> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
>  kernel/rcu/tree.c      | 11 +++++++++++
>  kernel/rcu/tree.h      |  1 +
>  kernel/rcu/tree_nocb.h |  4 ++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 5ec97e3f7468..04f33191e5ed 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3894,6 +3894,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  {
>  	unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
>  	unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
> +	bool wake_nocb = false;
> +	bool was_done = false;
>  
>  	lockdep_assert_held(&rcu_state.barrier_lock);
>  	if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
> @@ -3902,6 +3904,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	rdp->barrier_head.func = rcu_barrier_callback;
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
> +	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
>  	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>  		atomic_inc(&rcu_state.barrier_cpu_count);
> @@ -3909,7 +3912,15 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  		debug_rcu_head_unqueue(&rdp->barrier_head);
>  		rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
>  	}
> +
> +	/*
> +	 * If bypass list was non-empty, wake up the nocb GP thread otherwise
> +	 * bypass/lazy CBs may not be noticed, and can cause real long delays!
> +	 */
> +	wake_nocb = was_done && rcu_segcblist_pend_cbs(&rdp->cblist);
>  	rcu_nocb_unlock(rdp);
> +	if (wake_nocb)
> +		wake_nocb_gp(rdp, false);
>  	smp_store_release(&rdp->barrier_seq_snap, gseq);
>  }
>  
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index d4a97e40ea9c..41faaf8eb872 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -443,6 +443,7 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  				  unsigned long j);
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  				bool *was_alldone, unsigned long flags);
> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>  				 unsigned long flags);
>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index f77a6d7e1356..6caade0683dd 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -1558,6 +1558,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>  {
>  }
>  
> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> +{

It is missing a return value...

> +}

So I planned to post an updated version (please find below) but I triggered
a rare TREE01 stall related to rcu_barrier(). I just can't remember if it was
before or after the patch :-s

Anyway I'm holding it off until I get more clues.

---
From: Frederic Weisbecker <frederic@kernel.org>
Date: Tue, 4 Oct 2022 02:41:47 +0000
Subject: [PATCH] rcu: Fix missing nocb gp wake on rcu_barrier()

Upon entraining a callback to a NOCB CPU, no further wake up is
performed on the corresponding nocb_gp kthread. As a result, the callback
and all the subsequent ones on that CPU may be ignored, at least until
an RCU_NOCB_WAKE_FORCE timer is ever armed.

Here is a possible bad scenario:

1) CPU 0 is NOCB unlike all other CPUs.
2) CPU 0 queues a callback
3) The grace period related to that callback elapses
4) The callback is moved to the done list (but is not invoked yet),
   there are no more pending callbacks for CPU 0
5) CPU 1 calls rcu_barrier() and sends an IPI to CPU 0
6) CPU 0 entrains the callback but doesn't wake up nocb_gp
7) CPU 1 blocks forever, unless CPU 0 ever queues enough further
   callbacks to arm an RCU_NOCB_WAKE_FORCE timer.

Make sure the necessary wake up is issued in that case.

Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tree.c      | 6 ++++++
 kernel/rcu/tree.h      | 1 +
 kernel/rcu/tree_nocb.h | 5 +++++
 3 files changed, 12 insertions(+)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 96d678c9cfb6..025f59f6f97f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3914,6 +3914,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 {
 	unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
 	unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
+	bool wake_nocb = false;
+	bool was_alldone = false;
 
 	lockdep_assert_held(&rcu_state.barrier_lock);
 	if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
@@ -3922,6 +3924,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
+	was_alldone = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
 	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
@@ -3929,7 +3932,10 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 		debug_rcu_head_unqueue(&rdp->barrier_head);
 		rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
 	}
+	wake_nocb = was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist);
 	rcu_nocb_unlock(rdp);
+	if (wake_nocb)
+		wake_nocb_gp(rdp, false);
 	smp_store_release(&rdp->barrier_seq_snap, gseq);
 }
 
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..925dd98f8b23 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -439,6 +439,7 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index f77a6d7e1356..094fd454b6c3 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1558,6 +1558,11 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 {
 }
 
+static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
+{
+	return false;
+}
+
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j)
 {
-- 
2.25.1




^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
  2022-10-07 11:26         ` Frederic Weisbecker
@ 2022-10-07 12:46           ` Joel Fernandes
  0 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07 12:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, youssefesmat, surenb



> On Oct 7, 2022, at 7:26 AM, Frederic Weisbecker <frederic@kernel.org> wrote:
> 
> On Fri, Oct 07, 2022 at 02:47:52AM +0000, Joel Fernandes wrote:
>> On Tue, Oct 04, 2022 at 06:57:59PM -0400, Joel Fernandes wrote:
>> 
>> From: Frederic Weisbecker <frederic@kernel.org>
>> Subject: [PATCH] rcu: Wake up nocb gp thread on rcu_barrier_entrain()
>> 
>> In preparation for RCU lazy changes, wake up the RCU nocb gp thread if
>> needed after an entrain. Otherwise, the RCU barrier callback can wait in
>> the queue for several seconds before the lazy callbacks in front of it
>> are serviced.
>> 
>> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>> ---
>> kernel/rcu/tree.c      | 11 +++++++++++
>> kernel/rcu/tree.h      |  1 +
>> kernel/rcu/tree_nocb.h |  4 ++++
>> 3 files changed, 16 insertions(+)
>> 
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 5ec97e3f7468..04f33191e5ed 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -3894,6 +3894,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>> {
>>    unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
>>    unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
>> +    bool wake_nocb = false;
>> +    bool was_done = false;
>> 
>>    lockdep_assert_held(&rcu_state.barrier_lock);
>>    if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
>> @@ -3902,6 +3904,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>>    rdp->barrier_head.func = rcu_barrier_callback;
>>    debug_rcu_head_queue(&rdp->barrier_head);
>>    rcu_nocb_lock(rdp);
>> +    was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
>>    WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>>    if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>>        atomic_inc(&rcu_state.barrier_cpu_count);
>> @@ -3909,7 +3912,15 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>>        debug_rcu_head_unqueue(&rdp->barrier_head);
>>        rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
>>    }
>> +
>> +    /*
>> +     * If bypass list was non-empty, wake up the nocb GP thread otherwise
>> +     * bypass/lazy CBs may not be noticed, and can cause real long delays!
>> +     */
>> +    wake_nocb = was_done && rcu_segcblist_pend_cbs(&rdp->cblist);
>>    rcu_nocb_unlock(rdp);
>> +    if (wake_nocb)
>> +        wake_nocb_gp(rdp, false);
>>    smp_store_release(&rdp->barrier_seq_snap, gseq);
>> }
>> 
>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>> index d4a97e40ea9c..41faaf8eb872 100644
>> --- a/kernel/rcu/tree.h
>> +++ b/kernel/rcu/tree.h
>> @@ -443,6 +443,7 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>                  unsigned long j);
>> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>                bool *was_alldone, unsigned long flags);
>> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>> static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>>                 unsigned long flags);
>> static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index f77a6d7e1356..6caade0683dd 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -1558,6 +1558,10 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>> {
>> }
>> 
>> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>> +{
> 
> It misses a return value...

Ah, true. I had the return in my tree but not in the posting, sigh.

>> +}
> 
> So I planned to post an updated version (please find below) but I triggered
> a rare TREE01 stall related to rcu_barrier(). I just can't remember if it was
> before or after the patch :-s

Weird.

Thanks,

- Joel




> Anyway I'm holding it off until I get more clues.
> 
> ---
> From: Frederic Weisbecker <frederic@kernel.org>
> Date: Tue, 4 Oct 2022 02:41:47 +0000
> Subject: [PATCH] rcu: Fix missing nocb gp wake on rcu_barrier()
> 
> Upon entraining a callback to a NOCB CPU, no further wake up is
> performed on the corresponding nocb_gp kthread. As a result, the callback
> and all the subsequent ones on that CPU may be ignored, at least until
> an RCU_NOCB_WAKE_FORCE timer is ever armed.
> 
> Here is a possible bad scenario:
> 
> 1) CPU 0 is NOCB unlike all other CPUs.
> 2) CPU 0 queues a callback
> 3) The grace period related to that callback elapses
> 4) The callback is moved to the done list (but is not invoked yet),
>   there are no more pending callbacks for CPU 0
> 5) CPU 1 calls rcu_barrier() and sends an IPI to CPU 0
> 6) CPU 0 entrains the callback but doesn't wake up nocb_gp
> 7) CPU 1 blocks forever, unless CPU 0 ever queues enough further
>   callbacks to arm an RCU_NOCB_WAKE_FORCE timer.
> 
> Make sure the necessary wake up is issued in that case.
> 
> Reported-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Fixes: 5d6742b37727 ("rcu/nocb: Use rcu_segcblist for no-CBs CPUs")
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> ---
> kernel/rcu/tree.c      | 6 ++++++
> kernel/rcu/tree.h      | 1 +
> kernel/rcu/tree_nocb.h | 5 +++++
> 3 files changed, 12 insertions(+)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 96d678c9cfb6..025f59f6f97f 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3914,6 +3914,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
> {
>    unsigned long gseq = READ_ONCE(rcu_state.barrier_sequence);
>    unsigned long lseq = READ_ONCE(rdp->barrier_seq_snap);
> +    bool wake_nocb = false;
> +    bool was_alldone = false;
> 
>    lockdep_assert_held(&rcu_state.barrier_lock);
>    if (rcu_seq_state(lseq) || !rcu_seq_state(gseq) || rcu_seq_ctr(lseq) != rcu_seq_ctr(gseq))
> @@ -3922,6 +3924,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>    rdp->barrier_head.func = rcu_barrier_callback;
>    debug_rcu_head_queue(&rdp->barrier_head);
>    rcu_nocb_lock(rdp);
> +    was_alldone = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
>    WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>    if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>        atomic_inc(&rcu_state.barrier_cpu_count);
> @@ -3929,7 +3932,10 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>        debug_rcu_head_unqueue(&rdp->barrier_head);
>        rcu_barrier_trace(TPS("IRQNQ"), -1, rcu_state.barrier_sequence);
>    }
> +    wake_nocb = was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist);
>    rcu_nocb_unlock(rdp);
> +    if (wake_nocb)
> +        wake_nocb_gp(rdp, false);
>    smp_store_release(&rdp->barrier_seq_snap, gseq);
> }
> 
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index d4a97e40ea9c..925dd98f8b23 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -439,6 +439,7 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
> static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
> static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
> static void rcu_init_one_nocb(struct rcu_node *rnp);
> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>                  unsigned long j);
> static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index f77a6d7e1356..094fd454b6c3 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -1558,6 +1558,11 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
> {
> }
> 
> +static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> +{
> +    return false;
> +}
> +
> static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>                  unsigned long j)
> {
> -- 
> 2.25.1
> 
> 
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 00/11] rcu: call_rcu() power improvements
  2022-10-06 18:55 ` [PATCH v7 00/11] rcu: call_rcu() power improvements Paul E. McKenney
@ 2022-10-07 14:40   ` Uladzislau Rezki
  2022-10-07 14:54     ` Paul E. McKenney
  2022-10-07 15:09   ` Joel Fernandes
  1 sibling, 1 reply; 45+ messages in thread
From: Uladzislau Rezki @ 2022-10-07 14:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Thu, Oct 06, 2022 at 11:55:43AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2022 at 02:41:46AM +0000, Joel Fernandes (Google) wrote:
> > v7 version of RCU lazy patches based on rcu/next branch.
> > 
> > To facilitate easier merge, I dropped tracing and other patches and just
> > implemented the new changes. I will post the tracing patches later along with
> > rcutop as I need to add new tracepoints that Frederic suggested.
> > 
> > Main recent changes:
> > 1. rcu_barrier() wake up only for lazy bypass list.
> > 2. Make all call_rcu() default-lazy and add call_rcu_flush() API.
> > 3. Take care of some callers using call_rcu_flush() API.
> > 4. Several refactorings suggested by Paul/Frederic.
> > 5. New call_rcu() to call_rcu_flush() conversions by Joel/Vlad/Paul.
> > 
> > I am seeing good performance and power with these patches on real ChromeOS x86
> > asymmetric hardware.
> > 
> > Earlier cover letter with lots of details is here:
> > https://lore.kernel.org/all/20220901221720.1105021-1-joel@joelfernandes.org/
> > 
> > List of recent changes:
> >     
> >     [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
> >       deferral levels wake much earlier so for those it is not needed. ]
> >     
> >     [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
> >     
> >     [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]
> >     
> >     [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]
> >     
> >     [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
> >     
> >     [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]
> >     
> >     [ Paul/Frederic : Cosmetic changes, split out wakeup of nocb thread. ]
> >     
> >     [ Vlad/Joel : More call_rcu -> flush conversions ]
> 
> Thank you for your continued work on this!
> 
> I pulled these into an experimental branch, applied Uladzislau's
> Tested-by and ran a quick round of rcutorture.
> 
> From TREE02, TREE03, and TREE09 I got this:
> 
> In file included from kernel/rcu/tree.c:68:
> kernel/rcu/tree.h:449:13: error: ‘wake_nocb_gp’ used but never defined [-Werror]
>   449 | static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>       |             ^~~~~~~~~~~~
> 
> One could argue that this is not a big deal, except that Linus gets a
> bit tetchy when this sort of thing shows up in mainline.
> 
Sorry. I have not tested the TREE02, TREE03 and TREE09 scenarios. My goal
was to check the following functionalities:

- call_rcu_flush() does not introduce any delays once it is queued
- call_rcu() does not apply pressure on the RCU machinery from a wake-up point of view
- boot time is not degraded
- synchronize_rcu() and rcu_barrier() work as expected
- if the bypass list consists of lazy callbacks, the *flush() one has to initiate the offloading
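
For the first of those checks, something along these lines is enough (just a
simplified sketch, same pattern as the rcuscale self-test earlier in this
series; the names here are made up):

	static struct rcu_head flush_test_rh;
	static unsigned long flush_done_jiffies;
	static int flush_test_done;

	static void flush_test_cb(struct rcu_head *rh)
	{
		flush_done_jiffies = jiffies;
		WRITE_ONCE(flush_test_done, 1);
	}

	static void check_call_rcu_flush_latency(void)
	{
		unsigned long jif_start = jiffies;

		call_rcu_flush(&flush_test_rh, flush_test_cb);
		smp_cond_load_relaxed(&flush_test_done, VAL == 1);

		/* Expect a normal grace-period latency, nowhere near jiffies_till_flush. */
		pr_info("call_rcu_flush() CB latency: %lu jiffies\n",
			flush_done_jiffies - jif_start);
	}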

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 00/11] rcu: call_rcu() power improvements
  2022-10-07 14:40   ` Uladzislau Rezki
@ 2022-10-07 14:54     ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-07 14:54 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Joel Fernandes (Google),
	rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, frederic,
	rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 04:40:11PM +0200, Uladzislau Rezki wrote:
> On Thu, Oct 06, 2022 at 11:55:43AM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2022 at 02:41:46AM +0000, Joel Fernandes (Google) wrote:
> > > v7 version of RCU lazy patches based on rcu/next branch.
> > > 
> > > To facilitate easier merge, I dropped tracing and other patches and just
> > > implemented the new changes. I will post the tracing patches later along with
> > > rcutop as I need to add new tracepoints that Frederic suggested.
> > > 
> > > Main recent changes:
> > > 1. rcu_barrier() wake up only for lazy bypass list.
> > > 2. Make all call_rcu() default-lazy and add call_rcu_flush() API.
> > > 3. Take care of some callers using call_rcu_flush() API.
> > > 4. Several refactorings suggested by Paul/Frederic.
> > > 5. New call_rcu() to call_rcu_flush() conversions by Joel/Vlad/Paul.
> > > 
> > > I am seeing good performance and power with these patches on real ChromeOS x86
> > > asymmetric hardware.
> > > 
> > > Earlier cover letter with lots of details is here:
> > > https://lore.kernel.org/all/20220901221720.1105021-1-joel@joelfernandes.org/
> > > 
> > > List of recent changes:
> > >     
> > >     [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
> > >       deferral levels wake much earlier so for those it is not needed. ]
> > >     
> > >     [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
> > >     
> > >     [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]
> > >     
> > >     [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]
> > >     
> > >     [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
> > >     
> > >     [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]
> > >     
> > >     [ Paul/Frederic : Cosmetic changes, split out wakeup of nocb thread. ]
> > >     
> > >     [ Vlad/Joel : More call_rcu -> flush conversions ]
> > 
> > Thank you for your continued work on this!
> > 
> > I pulled these into an experimental branch, applied Uladzislau's
> > Tested-by and ran a quick round of rcutorture.
> > 
> > From TREE02, TREE03, and TREE09 I got this:
> > 
> > In file included from kernel/rcu/tree.c:68:
> > kernel/rcu/tree.h:449:13: error: ‘wake_nocb_gp’ used but never defined [-Werror]
> >   449 | static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
> >       |             ^~~~~~~~~~~~
> > 
> > One could argue that this is not a big deal, except that Linus gets a
> > bit tetchy when this sort of thing shows up in mainline.
> > 
> Sorry. I have not tested TREE02, TREE03 and TREE09 scenarios. My goal
> was to check below functionalities:
> 
> - call_rcu_flush() does not introduce any delays once it is queued
> - call_rcu() does not apply pressure on the RCU-machinery from wake-up point of view
> - boot-time is not degraded
> - synchronize_rcu() and rcu_barrier() work as expected
> - if bypass consists of lazy callbacks the *flush() one has to initiate the offloading

I am not blaming you, just making people aware of an issue needing
some resolution.  After all, you only said that you tested it, not that
you tested all possible corner cases on all possible hardware running
all possible workloads.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 00/11] rcu: call_rcu() power improvements
  2022-10-06 18:55 ` [PATCH v7 00/11] rcu: call_rcu() power improvements Paul E. McKenney
  2022-10-07 14:40   ` Uladzislau Rezki
@ 2022-10-07 15:09   ` Joel Fernandes
  2022-10-07 18:30     ` Paul E. McKenney
  1 sibling, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07 15:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Thu, Oct 06, 2022 at 11:55:43AM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2022 at 02:41:46AM +0000, Joel Fernandes (Google) wrote:
> > v7 version of RCU lazy patches based on rcu/next branch.
> > 
> > To facilitate easier merge, I dropped tracing and other patches and just
> > implemented the new changes. I will post the tracing patches later along with
> > rcutop as I need to add new tracepoints that Frederic suggested.
> > 
> > Main recent changes:
> > 1. rcu_barrier() wake up only for lazy bypass list.
> > 2. Make all call_rcu() default-lazy and add call_rcu_flush() API.
> > 3. Take care of some callers using call_rcu_flush() API.
> > 4. Several refactorings suggested by Paul/Frederic.
> > 5. New call_rcu() to call_rcu_flush() conversions by Joel/Vlad/Paul.
> > 
> > I am seeing good performance and power with these patches on real ChromeOS x86
> > asymmetric hardware.
> > 
> > Earlier cover letter with lots of details is here:
> > https://lore.kernel.org/all/20220901221720.1105021-1-joel@joelfernandes.org/
> > 
> > List of recent changes:
> >     
> >     [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
> >       deferral levels wake much earlier so for those it is not needed. ]
> >     
> >     [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
> >     
> >     [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]
> >     
> >     [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]
> >     
> >     [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
> >     
> >     [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]
> >     
> >     [ Paul/Frederic : Cosmetic changes, split out wakeup of nocb thread. ]
> >     
> >     [ Vlad/Joel : More call_rcu -> flush conversions ]
> 
> Thank you for your continued work on this!
> 
> I pulled these into an experimental branch, applied Uladzislau's
> Tested-by and ran a quick round of rcutorture.
> 
> From TREE02, TREE03, and TREE09 I got this:
> 
> In file included from kernel/rcu/tree.c:68:
> kernel/rcu/tree.h:449:13: error: ‘wake_nocb_gp’ used but never defined [-Werror]
>   449 | static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>       |             ^~~~~~~~~~~~
> 
> One could argue that this is not a big deal, except that Linus gets a
> bit tetchy when this sort of thing shows up in mainline.
> 
> Nevertheless, it looks like we might be getting quite close!

Yes, sorry, this is because a hunk got misplaced, as I mentioned in the other
thread. If you replace the second patch with the one below, it will fix it,
thanks!

I pushed all the latest patches to:
https://github.com/joelagnel/linux-kernel.git
(branch  rcu-lazy-latest-oct722)

If you want to try those.
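
For quick reference, the !CONFIG_RCU_NOCB_CPU stub of wake_nocb_gp() in
the patch below ends up as:

	static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
	{
		return false;
	}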

---8<-----------------------

From 978bf585cbd933048733d9ad9a80ed8de265c07e Mon Sep 17 00:00:00 2001
From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Date: Fri, 7 Oct 2022 02:41:00 +0000
Subject: [PATCH] rcu: Make call_rcu() lazy to save power

Implement timer-based RCU callback batching (also known as lazy
callbacks). With this we save about 5-10% of the power consumed due
to RCU requests that happen when the system is lightly loaded or idle.

By default, all async callbacks (queued via call_rcu) are marked
lazy. An alternate API, call_rcu_flush(), is provided for the few users,
for example synchronize_rcu(), that need the old behavior.

The batch is flushed whenever a certain amount of time has passed, or
the batch on a particular CPU grows too big. Memory pressure will also
flush it in a future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists which were originally introduced to
address lock contention, to handle lazy CBs as well. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.

Suggested-by: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h |   7 ++
 kernel/rcu/Kconfig       |   8 ++
 kernel/rcu/rcu.h         |   8 ++
 kernel/rcu/tiny.c        |   2 +-
 kernel/rcu/tree.c        | 129 ++++++++++++++++++++-----------
 kernel/rcu/tree.h        |  11 ++-
 kernel/rcu/tree_exp.h    |   2 +-
 kernel/rcu/tree_nocb.h   | 161 +++++++++++++++++++++++++++++++--------
 8 files changed, 246 insertions(+), 82 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d..40ae36904825 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func);
+#else
+static inline void call_rcu_flush(struct rcu_head *head,
+		rcu_callback_t func) {  call_rcu(head, func); }
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index f53ad63b2bc6..edd632e68497 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index be5979da07f5..65704cbc9df7 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -474,6 +474,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index a33a8d4942c3..810479cf17ba 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
 
 void rcu_barrier(void)
 {
-	wait_rcu_gp(call_rcu);
+	wait_rcu_gp(call_rcu_flush);
 }
 EXPORT_SYMBOL(rcu_barrier);
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 04f33191e5ed..c20544c4aa29 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
 	raw_spin_unlock_rcu_node(rnp);
 }
 
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed.  However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested.  In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section.  On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu().  It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section.  Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
 	static atomic_t doublefrees;
 	unsigned long flags;
@@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
 	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
 
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_flush() - Queue RCU callback for invocation after grace period, and
+ * flush all lazy callbacks (including the new one) to the main ->cblist while
+ * doing so.
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't want the callback to be
+ * invoked after very long periods of time, which can happen on systems without
+ * memory pressure and on systems which are lightly loaded or mostly idle.
+ * This function will cause callbacks to be invoked sooner than later at the
+ * expense of extra power. Other than that, this function is identical to, and
+ * reuses call_rcu()'s logic. Refer to call_rcu() for more details about memory
+ * ordering and other functionality.
+ */
+void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, false);
+}
+EXPORT_SYMBOL_GPL(call_rcu_flush);
+#endif
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * By default the callbacks are 'lazy' and are kept hidden from the main
+ * ->cblist to prevent starting of grace periods too soon.
+ * If you desire grace periods to start very soon, use call_rcu_flush().
+ *
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.  However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (5 * HZ)
@@ -3507,7 +3544,7 @@ void synchronize_rcu(void)
 		if (rcu_gp_is_expedited())
 			synchronize_rcu_expedited();
 		else
-			wait_rcu_gp(call_rcu);
+			wait_rcu_gp(call_rcu_flush);
 		return;
 	}
 
@@ -3905,7 +3942,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
 	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
 	} else {
@@ -4334,7 +4371,7 @@ void rcutree_migrate_callbacks(int cpu)
 	my_rdp = this_cpu_ptr(&rcu_data);
 	my_rnp = my_rdp->mynode;
 	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false));
 	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
 	/* Leverage recent GPs and set GP for new callbacks. */
 	needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 41faaf8eb872..1d803d39f0d1 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -263,14 +263,16 @@ struct rcu_data {
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
 
+	long lazy_len;			/* Length of buffered lazy callbacks. */
 	int cpu;
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
 					/* For jiffies_till_first_fqs and */
@@ -440,9 +442,10 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, bool lazy);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
 static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 18e9b4cd78ef..5cac05600798 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void)
 
 	/* If expedited grace periods are prohibited, fall back to normal. */
 	if (rcu_gp_is_normal()) {
-		wait_rcu_gp(call_rcu);
+		wait_rcu_gp(call_rcu_flush);
 		return;
 	}
 
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 6caade0683dd..ccd05ded6bb4 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
 
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * could be flushed much earlier for a number of other reasons
+ * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
+ * left unsubmitted to RCU after those many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
-	 * Bypass wakeup overrides previous deferments. In case
-	 * of callback storm, no need to wake up too early.
+	 * Bypass wakeup overrides previous deferments. In case of
+	 * callback storm, no need to wake up too early.
 	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
+	if (waketype == RCU_NOCB_WAKE_LAZY &&
+	    READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
 		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 	} else {
@@ -293,10 +322,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * proves to be initially empty, just return false because the no-CB GP
  * kthread may need to be awakened in this case.
  *
+ * Return true if there was something to be flushed and it succeeded, otherwise
+ * false.
+ *
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, bool lazy)
 {
 	struct rcu_cblist rcl;
 
@@ -310,7 +342,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/*
+	 * If the new CB requested was a lazy one, queue it onto the main
+	 * ->cblist so we can take advantage of a sooner grace period.
+	 */
+	if (lazy && rhp) {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
+		rcu_cblist_enqueue(&rcl, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	} else {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	}
+
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
 	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +371,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy)
 {
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+	return rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
 }
 
 /*
@@ -345,7 +390,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
 }
 
 /*
@@ -367,12 +412,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	unsigned long c;
 	unsigned long cur_gp_seq;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
 
 	lockdep_assert_irqs_disabled();
 
@@ -417,25 +464,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// If there hasn't yet been all that many ->cblist enqueues
 	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
 	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// Lazy CBs throttle this back and do immediate bypass queuing.
+	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 		if (*was_alldone)
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false));
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
 
 	// If ->nocb_bypass has been used too long or is too full,
 	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	    (ncbs &&  bypass_is_lazy &&
+		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy)) {
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 						    TPS("FirstQ"));
@@ -460,16 +511,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// We need to use the bypass.
 	rcu_nocb_wait_contended(rdp);
 	rcu_nocb_bypass_lock(rdp);
+
 	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
 	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+
+	if (lazy)
+		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
+
 	if (!ncbs) {
 		WRITE_ONCE(rdp->nocb_bypass_first, j);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
 	}
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
-	if (ncbs) {
+
+	// A wake up of the grace period kthread or timer adjustment needs to
+	// be done only if:
+	// 1. The bypass list was fully empty before (this is the first bypass entry), or
+	// 2. Both of the below conditions are met:
+	//    a. The bypass list had only lazy CBs before, and
+	//    b. The new CB is non-lazy.
+	if (ncbs && (!bypass_is_lazy || lazy)) {
 		local_irq_restore(flags);
 	} else {
 		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
@@ -497,9 +561,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 				 unsigned long flags)
 				 __releases(rdp->nocb_lock)
 {
+	long bypass_len;
 	unsigned long cur_gp_seq;
 	unsigned long j;
-	long len;
+	long len, lazy_len;
 	struct task_struct *t;
 
 	// If we are being polled or there is no kthread, just leave.
@@ -512,9 +577,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 	}
 	// Need to actually to a wakeup.
 	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = READ_ONCE(rdp->lazy_len);
 	if (was_alldone) {
 		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
 			/* ... if queue was empty ... */
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp(rdp, false);
@@ -611,6 +683,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	unsigned long flags;
 	bool gotcbs = false;
 	unsigned long j = jiffies;
+	bool lazy = false;
+	long lazy_ncbs;
 	bool needwait_gp = false; // This prevents actual uninitialized use.
 	bool needwake;
 	bool needwake_gp;
@@ -640,24 +714,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 * won't be ignored for long.
 	 */
 	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
+		bool flush_bypass = false;
+
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
 		rcu_nocb_lock_irqsave(rdp, flags);
 		lockdep_assert_held(&rdp->nocb_lock);
 		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = READ_ONCE(rdp->lazy_len);
+
+		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
 		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
 		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
 		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			continue; /* No callbacks here, try next. */
 		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = READ_ONCE(rdp->lazy_len);
+		}
+
 		if (bypass_ncbs) {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
 		}
 		rnp = rdp->mynode;
 
@@ -705,12 +796,20 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If the bypass list only has lazy CBs, add a deferred lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+					TPS("WakeBypassIsDeferred"));
+		}
 	}
+
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
 		if (gotcbs)
@@ -1036,7 +1135,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
 	 */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
 	/*
 	 * Start with invoking rcu_core() early. This way if the current thread
 	 * happens to preempt an ongoing call to rcu_core() in the middle,
@@ -1278,6 +1377,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 	raw_spin_lock_init(&rdp->nocb_gp_lock);
 	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
 	rcu_cblist_init(&rdp->nocb_bypass);
+	WRITE_ONCE(rdp->lazy_len, 0);
 	mutex_init(&rdp->nocb_gp_kthread_mutex);
 }
 
@@ -1560,16 +1660,17 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 
 static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 {
+	return false;
 }
 
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy)
 {
 	return true;
 }
 
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags, bool lazy)
 {
 	return false;
 }
-- 
2.38.0.rc1.362.ged0d419d3c-goog


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-07  3:18   ` Joel Fernandes
@ 2022-10-07 17:19     ` Joel Fernandes
  2022-10-07 17:31       ` Joel Fernandes
  0 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07 17:19 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 03:18:26AM +0000, Joel Fernandes wrote:
> On Tue, Oct 04, 2022 at 02:41:56AM +0000, Joel Fernandes (Google) wrote:
> > From: Uladzislau Rezki <urezki@gmail.com>
> > 
> > Slow boot time is seen on KVM running typical Linux distributions due to
> > SCSI layer calling call_rcu(). Recent changes to save power may be
> > causing this slowness. Using call_rcu_flush() fixes the issue and brings
> > the boot time back to what it originally was. Convert it.
> > 
> > Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> And I successfully setup Debian on KVM and verified that this fixes it, so
> now I have a nice reproducible rig for my
> 'lazy-callback-doing-a-wakeup-detector' (I wrote half the detector thanks to
> ideas from Steve, and will finish the other half tomorrow or so).
> 
> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Looks like I can catch Vlad's issue with the below patch. Thoughts? Does this
look reasonable for mainline? (I think so, as it is self-contained, the debug
option defaults to off, and it could be useful down the line.)

[    6.887033 ] rcu: *****************************************************
[    6.891242 ] rcu: RCU: A wake up has been detected from a lazy callback!
[    6.895377 ] rcu: The callback name is: scsi_eh_inc_host_failed
[    6.899084 ] rcu: The task it woke up is: scsi_eh_1 (61)
[    6.902405 ] rcu: This could cause performance issues! Check the stack.
[    6.906532 ] rcu: *****************************************************


[   17.127128 ] rcu: *****************************************************
[   17.131397 ] rcu: RCU: A wake up has been detected from a lazy callback!
[   17.135703 ] rcu: The callback name is: scsi_eh_inc_host_failed
[   17.139485 ] rcu: The task it woke up is: scsi_eh_1 (61)
[   17.142828 ] rcu: This could cause performance issues! Check the stack.
[   17.146962 ] rcu: *****************************************************

And thanks to Steve for the binary search code.

One thing I found is that I have to ignore kworkers, because there are times
when a work item is queued from a callback, and those callbacks don't seem to
contribute to performance issues. So I am filtering these out:

[   38.631724 ] rcu: The callback name is: thread_stack_free_rcu
[   38.635317 ] rcu: The task it woke up is: kworker/3:2 (143)

[   39.649332 ] rcu: The callback name is: delayed_put_task_struct
[   39.653037 ] rcu: The task it woke up is: kworker/0:1 (40)
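
The filtering can be as simple as a comm check at the top of
probe_waking(), for example (a sketch, not necessarily what I will keep):

	/* kworkers are routinely woken by work items queued from callbacks;
	 * such wakeups have not shown up as a performance issue. */
	if (!strncmp(p->comm, "kworker/", 8))
		return;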

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] lazy wake debug

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/Kconfig      |   7 ++
 kernel/rcu/lazy-debug.h | 149 ++++++++++++++++++++++++++++++++++++++++
 kernel/rcu/tree.c       |   9 +++
 3 files changed, 165 insertions(+)
 create mode 100644 kernel/rcu/lazy-debug.h

diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index edd632e68497..08c06f739187 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -322,4 +322,11 @@ config RCU_LAZY
 	  To save power, batch RCU callbacks and flush after delay, memory
 	  pressure or callback list growing too big.
 
+config RCU_LAZY_DEBUG
+	bool "RCU callback lazy invocation debugging"
+	depends on RCU_LAZY
+	default n
+	help
+	  Debugging to catch issues caused by delayed RCU callbacks.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/lazy-debug.h b/kernel/rcu/lazy-debug.h
new file mode 100644
index 000000000000..fc1cc1cb89f0
--- /dev/null
+++ b/kernel/rcu/lazy-debug.h
@@ -0,0 +1,149 @@
+#include <linux/string.h>
+#include <linux/spinlock.h>
+
+#ifdef CONFIG_RCU_LAZY_DEBUG
+#include <linux/preempt.h>
+#include <trace/events/sched.h>
+
+static DEFINE_PER_CPU(bool, rcu_lazy_cb_exec) = false;
+static DEFINE_PER_CPU(void *, rcu_lazy_ip) = NULL;
+
+static DEFINE_RAW_SPINLOCK(lazy_funcs_lock);
+
+#define FUNC_SIZE 1024
+static unsigned long lazy_funcs[FUNC_SIZE];
+static int nr_funcs;
+
+static void __find_func(unsigned long ip, int *B, int *E, int *N)
+{
+	unsigned long *p;
+	int b, e, n;
+
+	b = n = 0;
+	e = nr_funcs - 1;
+
+	while (b <= e) {
+		n = (b + e) / 2;
+		p = &lazy_funcs[n];
+		if (ip > *p) {
+			b = n + 1;
+		} else if (ip < *p) {
+			e = n - 1;
+		} else
+			break;
+	}
+
+	*B = b;
+	*E = e;
+	*N = n;
+
+	return;
+}
+
+static bool lazy_func_exists(void* ip_ptr)
+{
+	int b, e, n;
+	unsigned long flags;
+	unsigned long ip = (unsigned long)ip_ptr;
+
+	raw_spin_lock_irqsave(&lazy_funcs_lock, flags);
+	__find_func(ip, &b, &e, &n);
+	raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
+
+	return b <= e;
+}
+
+static int lazy_func_add(void* ip_ptr)
+{
+	int b, e, n;
+	unsigned long flags;
+	unsigned long ip = (unsigned long)ip_ptr;
+
+	raw_spin_lock_irqsave(&lazy_funcs_lock, flags);
+	if (nr_funcs >= FUNC_SIZE) {
+		raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
+		return -1;
+	}
+
+	__find_func(ip, &b, &e, &n);
+
+	if (b > e) {
+		if (n != nr_funcs)
+			memmove(&lazy_funcs[n+1], &lazy_funcs[n],
+				(sizeof(*lazy_funcs) * (nr_funcs - n)));
+
+		lazy_funcs[n] = ip;
+		nr_funcs++;
+	}
+
+	raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
+	return 0;
+}
+
+static void rcu_set_lazy_context(void *ip_ptr)
+{
+	bool *flag = this_cpu_ptr(&rcu_lazy_cb_exec);
+	*flag = lazy_func_exists(ip_ptr);
+
+	if (*flag) {
+		*this_cpu_ptr(&rcu_lazy_ip) = ip_ptr;
+	} else {
+		*this_cpu_ptr(&rcu_lazy_ip) = NULL;
+	}
+}
+
+static void rcu_reset_lazy_context(void)
+{
+	bool *flag = this_cpu_ptr(&rcu_lazy_cb_exec);
+	*flag = false;
+}
+
+static bool rcu_is_lazy_context(void)
+{
+	return *(this_cpu_ptr(&rcu_lazy_cb_exec));
+}
+
+static void
+probe_waking(void *ignore, struct task_struct *p)
+{
+	if (WARN_ON(!in_nmi() && !in_hardirq() && rcu_is_lazy_context())) {
+		pr_err("*****************************************************\n");
+		pr_err("RCU: A wake up has been detected from a lazy callback!\n");
+		pr_err("The callback name is: %ps\n", *this_cpu_ptr(&rcu_lazy_ip));
+		pr_err("The task it woke up is: %s (%d)\n", p->comm, p->pid);
+		pr_err("This could cause performance issues! Check the stack.\n");
+		pr_err("*****************************************************\n");
+	}
+}
+
+static void rcu_lazy_debug_init(void)
+{
+	int ret;
+	pr_info("RCU Lazy CB debugging is turned on, system may be slow.\n");
+
+	ret = register_trace_sched_waking(probe_waking, NULL);
+	if (ret)
+		pr_info("RCU: Lazy debug sched_waking probe could not be registered.\n");
+}
+
+#else
+
+static int lazy_func_add(void* ip_ptr)
+{
+	return -1;
+}
+
+
+static void rcu_set_lazy_context(void *ip_ptr)
+{
+}
+
+static void rcu_reset_lazy_context(void)
+{
+}
+
+static void rcu_lazy_debug_init(void)
+{
+}
+
+#endif
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c20544c4aa29..ad8d4e52ae92 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -67,6 +67,7 @@
 
 #include "tree.h"
 #include "rcu.h"
+#include "lazy-debug.h"
 
 #ifdef MODULE_PARAM_PREFIX
 #undef MODULE_PARAM_PREFIX
@@ -2245,7 +2246,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
 		f = rhp->func;
 		WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
+
+		rcu_set_lazy_context(f);
 		f(rhp);
+		rcu_reset_lazy_context();
 
 		rcu_lock_release(&rcu_callback_map);
 
@@ -2770,6 +2774,10 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 	}
 
 	check_cb_ovld(rdp);
+
+	if (lazy)
+		lazy_func_add(func);
+
 	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
@@ -4805,6 +4813,7 @@ void __init rcu_init(void)
 	rcu_early_boot_tests();
 
 	kfree_rcu_batch_init();
+	rcu_lazy_debug_init();
 	rcu_bootup_announce();
 	sanitize_kthread_prio();
 	rcu_init_geometry();
-- 
2.38.0.rc1.362.ged0d419d3c-goog


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-07 17:19     ` Joel Fernandes
@ 2022-10-07 17:31       ` Joel Fernandes
  2022-10-07 17:52         ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07 17:31 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, youssefesmat, surenb



> On Oct 7, 2022, at 1:19 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> 
> On Fri, Oct 07, 2022 at 03:18:26AM +0000, Joel Fernandes wrote:
>>> On Tue, Oct 04, 2022 at 02:41:56AM +0000, Joel Fernandes (Google) wrote:
>>> From: Uladzislau Rezki <urezki@gmail.com>
>>> 
>>> Slow boot time is seen on KVM running typical Linux distributions due to
>>> SCSI layer calling call_rcu(). Recent changes to save power may be
>>> causing this slowness. Using call_rcu_flush() fixes the issue and brings
>>> the boot time back to what it originally was. Convert it.
>>> 
>>> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> 
>> And I successfully setup Debian on KVM and verified that this fixes it, so
>> now I have a nice reproducible rig for my
>> 'lazy-callback-doing-a-wakeup-detector' (I wrote half the detector thanks to
>> ideas from Steve, and will finish the other half tomorrow or so).
>> 
>> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Looks like I can catch Vlad's issue with the below patch. Thoughts? Does this
> look reasonable for mainline? (I think so as it is self-contained and the
> debug option is default off, and could be useful down the line).
> 
> [    6.887033 ] rcu: *****************************************************
> [    6.891242 ] rcu: RCU: A wake up has been detected from a lazy callback!
> [    6.895377 ] rcu: The callback name is: scsi_eh_inc_host_failed
> [    6.899084 ] rcu: The task it woke up is: scsi_eh_1 (61)
> [    6.902405 ] rcu: This could cause performance issues! Check the stack.
> [    6.906532 ] rcu: *****************************************************
> 
> 
> [   17.127128 ] rcu: *****************************************************
> [   17.131397 ] rcu: RCU: A wake up has been detected from a lazy callback!
> [   17.135703 ] rcu: The callback name is: scsi_eh_inc_host_failed
> [   17.139485 ] rcu: The task it woke up is: scsi_eh_1 (61)
> [   17.142828 ] rcu: This could cause performance issues! Check the stack.
> [   17.146962 ] rcu: *****************************************************
> 
> And thanks to Steve for the binary search code.
> 
> One thing I found is I have to ignore kworkers because there are times when a
> work item is queued from a callback and those callbacks don't seem to
> contribute to performance issues. So I am filtering these:
> 
> [   38.631724 ] rcu: The callback name is: thread_stack_free_rcu
> [   38.635317 ] rcu: The task it woke up is: kworker/3:2 (143)
> 
> [   39.649332 ] rcu: The callback name is: delayed_put_task_struct
> [   39.653037 ] rcu: The task it woke up is: kworker/0:1 (40)
> 
> ---8<-----------------------
> 
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Subject: [PATCH] lazy wake debug
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
> kernel/rcu/Kconfig      |   7 ++
> kernel/rcu/lazy-debug.h | 149 ++++++++++++++++++++++++++++++++++++++++
> kernel/rcu/tree.c       |   9 +++
> 3 files changed, 165 insertions(+)
> create mode 100644 kernel/rcu/lazy-debug.h
> 
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index edd632e68497..08c06f739187 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -322,4 +322,11 @@ config RCU_LAZY
>      To save power, batch RCU callbacks and flush after delay, memory
>      pressure or callback list growing too big.
> 
> +config RCU_LAZY_DEBUG
> +    bool "RCU callback lazy invocation debugging"
> +    depends on RCU_LAZY
> +    default n
> +    help
> +      Debugging to catch issues caused by delayed RCU callbacks.
> +
> endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/lazy-debug.h b/kernel/rcu/lazy-debug.h
> new file mode 100644
> index 000000000000..fc1cc1cb89f0
> --- /dev/null
> +++ b/kernel/rcu/lazy-debug.h
> @@ -0,0 +1,149 @@
> +#include <linux/string.h>
> +#include <linux/spinlock.h>
> +
> +#ifdef CONFIG_RCU_LAZY_DEBUG
> +#include <linux/preempt.h>
> +#include <trace/events/sched.h>
> +
> +static DEFINE_PER_CPU(bool, rcu_lazy_cb_exec) = false;
> +static DEFINE_PER_CPU(void *, rcu_lazy_ip) = NULL;
> +
> +static DEFINE_RAW_SPINLOCK(lazy_funcs_lock);
> +
> +#define FUNC_SIZE 1024

And I know this array can overflow in the future, so I will add checks for that in the code if we go with this patch.
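
Something like promoting the existing bound check in lazy_func_add() to
a one-time warning, perhaps (sketch only):

	if (nr_funcs >= FUNC_SIZE) {
		WARN_ONCE(1, "rcu: lazy-debug function table is full, results may be incomplete\n");
		raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
		return -1;
	}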

 - Joel 


> +static unsigned long lazy_funcs[FUNC_SIZE];
> +static int nr_funcs;
> +
> +static void __find_func(unsigned long ip, int *B, int *E, int *N)
> +{
> +    unsigned long *p;
> +    int b, e, n;
> +
> +    b = n = 0;
> +    e = nr_funcs - 1;
> +
> +    while (b <= e) {
> +        n = (b + e) / 2;
> +        p = &lazy_funcs[n];
> +        if (ip > *p) {
> +            b = n + 1;
> +        } else if (ip < *p) {
> +            e = n - 1;
> +        } else
> +            break;
> +    }
> +
> +    *B = b;
> +    *E = e;
> +    *N = n;
> +
> +    return;
> +}
> +
> +static bool lazy_func_exists(void* ip_ptr)
> +{
> +    int b, e, n;
> +    unsigned long flags;
> +    unsigned long ip = (unsigned long)ip_ptr;
> +
> +    raw_spin_lock_irqsave(&lazy_funcs_lock, flags);
> +    __find_func(ip, &b, &e, &n);
> +    raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
> +
> +    return b <= e;
> +}
> +
> +static int lazy_func_add(void* ip_ptr)
> +{
> +    int b, e, n;
> +    unsigned long flags;
> +    unsigned long ip = (unsigned long)ip_ptr;
> +
> +    raw_spin_lock_irqsave(&lazy_funcs_lock, flags);
> +    if (nr_funcs >= FUNC_SIZE) {
> +        raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
> +        return -1;
> +    }
> +
> +    __find_func(ip, &b, &e, &n);
> +
> +    if (b > e) {
> +        if (n != nr_funcs)
> +            memmove(&lazy_funcs[n+1], &lazy_funcs[n],
> +                (sizeof(*lazy_funcs) * (nr_funcs - n)));
> +
> +        lazy_funcs[n] = ip;
> +        nr_funcs++;
> +    }
> +
> +    raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
> +    return 0;
> +}
> +
> +static void rcu_set_lazy_context(void *ip_ptr)
> +{
> +    bool *flag = this_cpu_ptr(&rcu_lazy_cb_exec);
> +    *flag = lazy_func_exists(ip_ptr);
> +
> +    if (*flag) {
> +        *this_cpu_ptr(&rcu_lazy_ip) = ip_ptr;
> +    } else {
> +        *this_cpu_ptr(&rcu_lazy_ip) = NULL;
> +    }
> +}
> +
> +static void rcu_reset_lazy_context(void)
> +{
> +    bool *flag = this_cpu_ptr(&rcu_lazy_cb_exec);
> +    *flag = false;
> +}
> +
> +static bool rcu_is_lazy_context(void)
> +{
> +    return *(this_cpu_ptr(&rcu_lazy_cb_exec));
> +}
> +
> +static void
> +probe_waking(void *ignore, struct task_struct *p)
> +{
> +    if (WARN_ON(!in_nmi() && !in_hardirq() && rcu_is_lazy_context())) {
> +        pr_err("*****************************************************\n");
> +        pr_err("RCU: A wake up has been detected from a lazy callback!\n");
> +        pr_err("The callback name is: %ps\n", *this_cpu_ptr(&rcu_lazy_ip));
> +        pr_err("The task it woke up is: %s (%d)\n", p->comm, p->pid);
> +        pr_err("This could cause performance issues! Check the stack.\n");
> +        pr_err("*****************************************************\n");
> +    }
> +}
> +
> +static void rcu_lazy_debug_init(void)
> +{
> +    int ret;
> +    pr_info("RCU Lazy CB debugging is turned on, system may be slow.\n");
> +
> +    ret = register_trace_sched_waking(probe_waking, NULL);
> +    if (ret)
> +        pr_info("RCU: Lazy debug ched_waking probe could not be registered.");
> +}
> +
> +#else
> +
> +static int lazy_func_add(void* ip_ptr)
> +{
> +    return -1;
> +}
> +
> +
> +static void rcu_set_lazy_context(void *ip_ptr)
> +{
> +}
> +
> +static void rcu_reset_lazy_context(void)
> +{
> +}
> +
> +static void rcu_lazy_debug_init(void)
> +{
> +}
> +
> +#endif
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c20544c4aa29..ad8d4e52ae92 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -67,6 +67,7 @@
> 
> #include "tree.h"
> #include "rcu.h"
> +#include "lazy-debug.h"
> 
> #ifdef MODULE_PARAM_PREFIX
> #undef MODULE_PARAM_PREFIX
> @@ -2245,7 +2246,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
> 
>        f = rhp->func;
>        WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
> +
> +        rcu_set_lazy_context(f);
>        f(rhp);
> +        rcu_reset_lazy_context();
> 
>        rcu_lock_release(&rcu_callback_map);
> 
> @@ -2770,6 +2774,10 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>    }
> 
>    check_cb_ovld(rdp);
> +
> +    if (lazy)
> +        lazy_func_add(func);
> +
>    if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>        return; // Enqueued onto ->nocb_bypass, so just leave.
>    // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
> @@ -4805,6 +4813,7 @@ void __init rcu_init(void)
>    rcu_early_boot_tests();
> 
>    kfree_rcu_batch_init();
> +    rcu_lazy_debug_init();
>    rcu_bootup_announce();
>    sanitize_kthread_prio();
>    rcu_init_geometry();
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-07 17:31       ` Joel Fernandes
@ 2022-10-07 17:52         ` Paul E. McKenney
  2022-10-07 19:29           ` Joel Fernandes
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-07 17:52 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 01:31:23PM -0400, Joel Fernandes wrote:
> 
> 
> > On Oct 7, 2022, at 1:19 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > 
> > On Fri, Oct 07, 2022 at 03:18:26AM +0000, Joel Fernandes wrote:
> >>> On Tue, Oct 04, 2022 at 02:41:56AM +0000, Joel Fernandes (Google) wrote:
> >>> From: Uladzislau Rezki <urezki@gmail.com>
> >>> 
> >>> Slow boot time is seen on KVM running typical Linux distributions due to
> >>> SCSI layer calling call_rcu(). Recent changes to save power may be
> >>> causing this slowness. Using call_rcu_flush() fixes the issue and brings
> >>> the boot time back to what it originally was. Convert it.
> >>> 
> >>> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> >>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> 
> >> And I successfully setup Debian on KVM and verified that this fixes it, so
> >> now I have a nice reproducible rig for my
> >> 'lazy-callback-doing-a-wakeup-detector' (I wrote half the detector thanks to
> >> ideas from Steve, and will finish the other half tomorrow or so).
> >> 
> >> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > Looks like I can catch Vlad's issue with the below patch. Thoughts? Does this
> > look reasonable for mainline? (I think so as it is self-contained and the
> > debug option is default off, and could be useful down the line).

Excellent as a debug patch!!!  Let's see how it goes -- if it proves
sufficiently useful, some form should go into mainline.  Or at least
be featured prominently somewhere public.

> > [    6.887033 ] rcu: *****************************************************
> > [    6.891242 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > [    6.895377 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > [    6.899084 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > [    6.902405 ] rcu: This could cause performance issues! Check the stack.
> > [    6.906532 ] rcu: *****************************************************
> > 
> > 
> > [   17.127128 ] rcu: *****************************************************
> > [   17.131397 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > [   17.135703 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > [   17.139485 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > [   17.142828 ] rcu: This could cause performance issues! Check the stack.
> > [   17.146962 ] rcu: *****************************************************
> > 
> > And thanks to Steve for the binary search code.
> > 
> > One thing I found is I have to ignore kworkers because there are times when a
> > work item is queued from a callback and those callbacks don't seem to
> > contribute to performance issues. So I am filtering these:
> > 
> > [   38.631724 ] rcu: The callback name is: thread_stack_free_rcu
> > [   38.635317 ] rcu: The task it woke up is: kworker/3:2 (143)
> > 
> > [   39.649332 ] rcu: The callback name is: delayed_put_task_struct
> > [   39.653037 ] rcu: The task it woke up is: kworker/0:1 (40)
> > 
> > ---8<-----------------------
> > 
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > Subject: [PATCH] lazy wake debug
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> > kernel/rcu/Kconfig      |   7 ++
> > kernel/rcu/lazy-debug.h | 149 ++++++++++++++++++++++++++++++++++++++++
> > kernel/rcu/tree.c       |   9 +++
> > 3 files changed, 165 insertions(+)
> > create mode 100644 kernel/rcu/lazy-debug.h
> > 
> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index edd632e68497..08c06f739187 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -322,4 +322,11 @@ config RCU_LAZY
> >      To save power, batch RCU callbacks and flush after delay, memory
> >      pressure or callback list growing too big.
> > 
> > +config RCU_LAZY_DEBUG
> > +    bool "RCU callback lazy invocation debugging"
> > +    depends on RCU_LAZY
> > +    default n
> > +    help
> > +      Debugging to catch issues caused by delayed RCU callbacks.
> > +
> > endmenu # "RCU Subsystem"
> > diff --git a/kernel/rcu/lazy-debug.h b/kernel/rcu/lazy-debug.h
> > new file mode 100644
> > index 000000000000..fc1cc1cb89f0
> > --- /dev/null
> > +++ b/kernel/rcu/lazy-debug.h
> > @@ -0,0 +1,149 @@
> > +#include <linux/string.h>
> > +#include <linux/spinlock.h>
> > +
> > +#ifdef CONFIG_RCU_LAZY_DEBUG
> > +#include <linux/preempt.h>
> > +#include <trace/events/sched.h>
> > +
> > +static DEFINE_PER_CPU(bool, rcu_lazy_cb_exec) = false;
> > +static DEFINE_PER_CPU(void *, rcu_lazy_ip) = NULL;
> > +
> > +static DEFINE_RAW_SPINLOCK(lazy_funcs_lock);
> > +
> > +#define FUNC_SIZE 1024
> 
> And I know this array can overflow in the future so I will add checks for that in the code if we are going with this patch.

I believe that there are fewer than 300 RCU callback functions, but yes,
there would need to at least be a check at some point.

I still strongly suggest the static search in addition to this.  Yes, this
is a cool patch, but it should be mostly used for the difficult-to-find
instances.
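
Even something as crude as "git grep -n 'call_rcu(' -- '*.c'" would be a
reasonable starting point for that static search, with each hit then
inspected by eye for wakeups (or for someone waiting on the callback),
though it will of course miss callers hidden behind wrappers.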

							Thanx, Paul

>  - Joel 
> 
> 
> > +static unsigned long lazy_funcs[FUNC_SIZE];
> > +static int nr_funcs;
> > +
> > +static void __find_func(unsigned long ip, int *B, int *E, int *N)
> > +{
> > +    unsigned long *p;
> > +    int b, e, n;
> > +
> > +    b = n = 0;
> > +    e = nr_funcs - 1;
> > +
> > +    while (b <= e) {
> > +        n = (b + e) / 2;
> > +        p = &lazy_funcs[n];
> > +        if (ip > *p) {
> > +            b = n + 1;
> > +        } else if (ip < *p) {
> > +            e = n - 1;
> > +        } else
> > +            break;
> > +    }
> > +
> > +    *B = b;
> > +    *E = e;
> > +    *N = n;
> > +
> > +    return;
> > +}
> > +
> > +static bool lazy_func_exists(void* ip_ptr)
> > +{
> > +    int b, e, n;
> > +    unsigned long flags;
> > +    unsigned long ip = (unsigned long)ip_ptr;
> > +
> > +    raw_spin_lock_irqsave(&lazy_funcs_lock, flags);
> > +    __find_func(ip, &b, &e, &n);
> > +    raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
> > +
> > +    return b <= e;
> > +}
> > +
> > +static int lazy_func_add(void* ip_ptr)
> > +{
> > +    int b, e, n;
> > +    unsigned long flags;
> > +    unsigned long ip = (unsigned long)ip_ptr;
> > +
> > +    raw_spin_lock_irqsave(&lazy_funcs_lock, flags);
> > +    if (nr_funcs >= FUNC_SIZE) {
> > +        raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
> > +        return -1;
> > +    }
> > +
> > +    __find_func(ip, &b, &e, &n);
> > +
> > +    if (b > e) {
> > +        if (n != nr_funcs)
> > +            memmove(&lazy_funcs[n+1], &lazy_funcs[n],
> > +                (sizeof(*lazy_funcs) * (nr_funcs - n)));
> > +
> > +        lazy_funcs[n] = ip;
> > +        nr_funcs++;
> > +    }
> > +
> > +    raw_spin_unlock_irqrestore(&lazy_funcs_lock, flags);
> > +    return 0;
> > +}
> > +
> > +static void rcu_set_lazy_context(void *ip_ptr)
> > +{
> > +    bool *flag = this_cpu_ptr(&rcu_lazy_cb_exec);
> > +    *flag = lazy_func_exists(ip_ptr);
> > +
> > +    if (*flag) {
> > +        *this_cpu_ptr(&rcu_lazy_ip) = ip_ptr;
> > +    } else {
> > +        *this_cpu_ptr(&rcu_lazy_ip) = NULL;
> > +    }
> > +}
> > +
> > +static void rcu_reset_lazy_context(void)
> > +{
> > +    bool *flag = this_cpu_ptr(&rcu_lazy_cb_exec);
> > +    *flag = false;
> > +}
> > +
> > +static bool rcu_is_lazy_context(void)
> > +{
> > +    return *(this_cpu_ptr(&rcu_lazy_cb_exec));
> > +}
> > +
> > +static void
> > +probe_waking(void *ignore, struct task_struct *p)
> > +{
> > +    if (WARN_ON(!in_nmi() && !in_hardirq() && rcu_is_lazy_context())) {
> > +        pr_err("*****************************************************\n");
> > +        pr_err("RCU: A wake up has been detected from a lazy callback!\n");
> > +        pr_err("The callback name is: %ps\n", *this_cpu_ptr(&rcu_lazy_ip));
> > +        pr_err("The task it woke up is: %s (%d)\n", p->comm, p->pid);
> > +        pr_err("This could cause performance issues! Check the stack.\n");
> > +        pr_err("*****************************************************\n");
> > +    }
> > +}
> > +
> > +static void rcu_lazy_debug_init(void)
> > +{
> > +    int ret;
> > +    pr_info("RCU Lazy CB debugging is turned on, system may be slow.\n");
> > +
> > +    ret = register_trace_sched_waking(probe_waking, NULL);
> > +    if (ret)
> > +        pr_info("RCU: Lazy debug sched_waking probe could not be registered.\n");
> > +}
> > +
> > +#else
> > +
> > +static int lazy_func_add(void* ip_ptr)
> > +{
> > +    return -1;
> > +}
> > +
> > +
> > +static void rcu_set_lazy_context(void *ip_ptr)
> > +{
> > +}
> > +
> > +static void rcu_reset_lazy_context(void)
> > +{
> > +}
> > +
> > +static void rcu_lazy_debug_init(void)
> > +{
> > +}
> > +
> > +#endif
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index c20544c4aa29..ad8d4e52ae92 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -67,6 +67,7 @@
> > 
> > #include "tree.h"
> > #include "rcu.h"
> > +#include "lazy-debug.h"
> > 
> > #ifdef MODULE_PARAM_PREFIX
> > #undef MODULE_PARAM_PREFIX
> > @@ -2245,7 +2246,10 @@ static void rcu_do_batch(struct rcu_data *rdp)
> > 
> >        f = rhp->func;
> >        WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
> > +
> > +        rcu_set_lazy_context(f);
> >        f(rhp);
> > +        rcu_reset_lazy_context();
> > 
> >        rcu_lock_release(&rcu_callback_map);
> > 
> > @@ -2770,6 +2774,10 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
> >    }
> > 
> >    check_cb_ovld(rdp);
> > +
> > +    if (lazy)
> > +        lazy_func_add(func);
> > +
> >    if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
> >        return; // Enqueued onto ->nocb_bypass, so just leave.
> >    // If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
> > @@ -4805,6 +4813,7 @@ void __init rcu_init(void)
> >    rcu_early_boot_tests();
> > 
> >    kfree_rcu_batch_init();
> > +    rcu_lazy_debug_init();
> >    rcu_bootup_announce();
> >    sanitize_kthread_prio();
> >    rcu_init_geometry();
> > -- 
> > 2.38.0.rc1.362.ged0d419d3c-goog
> > 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 00/11] rcu: call_rcu() power improvements
  2022-10-07 15:09   ` Joel Fernandes
@ 2022-10-07 18:30     ` Paul E. McKenney
  0 siblings, 0 replies; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-07 18:30 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 03:09:18PM +0000, Joel Fernandes wrote:
> On Thu, Oct 06, 2022 at 11:55:43AM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2022 at 02:41:46AM +0000, Joel Fernandes (Google) wrote:
> > > v7 version of RCU lazy patches based on rcu/next branch.
> > > 
> > > To facilitate easier merge, I dropped tracing and other patches and just
> > > implemented the new changes. I will post the tracing patches later along with
> > > rcutop as I need to add new tracepoints that Frederic suggested.
> > > 
> > > Main recent changes:
> > > 1. rcu_barrier() wake up only for lazy bypass list.
> > > 2. Make all call_rcu() default-lazy and add call_rcu_flush() API.
> > > 3. Take care of some callers using call_rcu_flush() API.
> > > 4. Several refactorings suggested by Paul/Frederic.
> > > 5. New call_rcu() to call_rcu_flush() conversions by Joel/Vlad/Paul.
> > > 
> > > I am seeing good performance and power with these patches on real ChromeOS x86
> > > asymmetric hardware.
> > > 
> > > Earlier cover letter with lots of details is here:
> > > https://lore.kernel.org/all/20220901221720.1105021-1-joel@joelfernandes.org/
> > > 
> > > List of recent changes:
> > >     
> > >     [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
> > >       deferral levels wake much earlier so for those it is not needed. ]
> > >     
> > >     [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
> > >     
> > >     [ Frederic Weisbec: Make rcu_barrier() wake up only if main list empty. ]
> > >     
> > >     [ Frederic Weisbec: Remove extra 'else if' branch in rcu_nocb_try_bypass(). ]
> > >     
> > >     [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
> > >     
> > >     [ Paul/Thomas/Joel: Make call_rcu() default lazy so users don't mess up. ]
> > >     
> > >     [ Paul/Frederic : Cosmetic changes, split out wakeup of nocb thread. ]
> > >     
> > >     [ Vlad/Joel : More call_rcu -> flush conversions ]
> > 
> > Thank you for your continued work on this!
> > 
> > I pulled these into an experimental branch, applied Uladzislau's
> > Tested-by and ran a quick round of rcutorture.
> > 
> > From TREE02, TREE03, and TREE09 I got this:
> > 
> > In file included from kernel/rcu/tree.c:68:
> > kernel/rcu/tree.h:449:13: error: ‘wake_nocb_gp’ used but never defined [-Werror]
> >   449 | static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
> >       |             ^~~~~~~~~~~~
> > 
> > One could argue that this is not a big deal, except that Linus gets a
> > bit tetchy when this sort of thing shows up in mainline.
> > 
> > Nevertheless, it looks like we might be getting quite close!
> 
> Yes, sorry this is because a hunk got misplaced as I mentioned in the other
> thread. If you replace the second patch with the below, it will fix it,
> thanks!
> 
> I pushed all the latest patches to:
> https://github.com/joelagnel/linux-kernel.git
> (branch  rcu-lazy-latest-oct722)
> 
> If you want to try those.

Thank you!  I have pulled these commits and rebased them on top of -rcu
dev as an experimental series at lazy.2022.10.07a.  Started testing;
will let you know how it goes.

							Thanx, Paul

> ---8<-----------------------
> 
> >From 978bf585cbd933048733d9ad9a80ed8de265c07e Mon Sep 17 00:00:00 2001
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Date: Fri, 7 Oct 2022 02:41:00 +0000
> Subject: [PATCH] rcu: Make call_rcu() lazy to save power
> 
> Implement timer-based RCU callback batching (also known as lazy
> callbacks). With this we save about 5-10% of power consumed due
> to RCU requests that happen when the system is lightly loaded or idle.
> 
> By default, all async callbacks (queued via call_rcu) are marked
> lazy. An alternate API call_rcu_flush() is provided for the few users,
> for example synchronize_rcu(), that need the old behavior.
> 
> The batch is flushed whenever a certain amount of time has passed, or
> the batch on a particular CPU grows too big. A future patch will also
> flush it under memory pressure.
> 
> To handle several corner cases automagically (such as rcu_barrier() and
> hotplug), we re-use bypass lists, which were originally introduced to
> address lock contention, to handle lazy CBs as well. The bypass list
> length has the lazy CB length included in it. A separate lazy CB length
> counter is also introduced to keep track of the number of lazy CBs.
> 
> Suggested-by: Paul McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rcupdate.h |   7 ++
>  kernel/rcu/Kconfig       |   8 ++
>  kernel/rcu/rcu.h         |   8 ++
>  kernel/rcu/tiny.c        |   2 +-
>  kernel/rcu/tree.c        | 129 ++++++++++++++++++++-----------
>  kernel/rcu/tree.h        |  11 ++-
>  kernel/rcu/tree_exp.h    |   2 +-
>  kernel/rcu/tree_nocb.h   | 161 +++++++++++++++++++++++++++++++--------
>  8 files changed, 246 insertions(+), 82 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 08605ce7379d..40ae36904825 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -108,6 +108,13 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func);
> +#else
> +static inline void call_rcu_flush(struct rcu_head *head,
> +		rcu_callback_t func) {  call_rcu(head, func); }
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index f53ad63b2bc6..edd632e68497 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -314,4 +314,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default n
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +	  pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index be5979da07f5..65704cbc9df7 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -474,6 +474,14 @@ enum rcutorture_type {
>  	INVALID_RCU_FLAVOR
>  };
>  
> +#if defined(CONFIG_RCU_LAZY)
> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
> +#else
> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
> +#endif
> +
>  #if defined(CONFIG_TREE_RCU)
>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>  			    unsigned long *gp_seq);
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index a33a8d4942c3..810479cf17ba 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -44,7 +44,7 @@ static struct rcu_ctrlblk rcu_ctrlblk = {
>  
>  void rcu_barrier(void)
>  {
> -	wait_rcu_gp(call_rcu);
> +	wait_rcu_gp(call_rcu_flush);
>  }
>  EXPORT_SYMBOL(rcu_barrier);
>  
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 04f33191e5ed..c20544c4aa29 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>  	raw_spin_unlock_rcu_node(rnp);
>  }
>  
> -/**
> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
> - * @head: structure to be used for queueing the RCU updates.
> - * @func: actual callback function to be invoked after the grace period
> - *
> - * The callback function will be invoked some time after a full grace
> - * period elapses, in other words after all pre-existing RCU read-side
> - * critical sections have completed.  However, the callback function
> - * might well execute concurrently with RCU read-side critical sections
> - * that started after call_rcu() was invoked.
> - *
> - * RCU read-side critical sections are delimited by rcu_read_lock()
> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
> - * v5.0 and later, regions of code across which interrupts, preemption,
> - * or softirqs have been disabled also serve as RCU read-side critical
> - * sections.  This includes hardware interrupt handlers, softirq handlers,
> - * and NMI handlers.
> - *
> - * Note that all CPUs must agree that the grace period extended beyond
> - * all pre-existing RCU read-side critical section.  On systems with more
> - * than one CPU, this means that when "func()" is invoked, each CPU is
> - * guaranteed to have executed a full memory barrier since the end of its
> - * last RCU read-side critical section whose beginning preceded the call
> - * to call_rcu().  It also means that each CPU executing an RCU read-side
> - * critical section that continues beyond the start of "func()" must have
> - * executed a memory barrier after the call_rcu() but before the beginning
> - * of that RCU read-side critical section.  Note that these guarantees
> - * include CPUs that are offline, idle, or executing in user mode, as
> - * well as CPUs that are executing in the kernel.
> - *
> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> - * resulting RCU callback function "func()", then both CPU A and CPU B are
> - * guaranteed to execute a full memory barrier during the time interval
> - * between the call to call_rcu() and the invocation of "func()" -- even
> - * if CPU A and CPU B are the same CPU (but again only if the system has
> - * more than one CPU).
> - *
> - * Implementation of these memory-ordering guarantees is described here:
> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> - */
> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +static void
> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>  {
>  	static atomic_t doublefrees;
>  	unsigned long flags;
> @@ -2809,7 +2770,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  	}
>  
>  	check_cb_ovld(rdp);
> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>  	rcu_segcblist_enqueue(&rdp->cblist, head);
> @@ -2831,8 +2792,84 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		local_irq_restore(flags);
>  	}
>  }
> -EXPORT_SYMBOL_GPL(call_rcu);
>  
> +#ifdef CONFIG_RCU_LAZY
> +/**
> + * call_rcu_flush() - Queue RCU callback for invocation after grace period, and
> + * flush all lazy callbacks (including the new one) to the main ->cblist while
> + * doing so.
> + *
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.
> + *
> + * Use this API instead of call_rcu() if you don't want the callback to be
> + * invoked after very long periods of time, which can happen on systems without
> + * memory pressure and on systems which are lightly loaded or mostly idle.
> + * This function will cause callbacks to be invoked sooner than later at the
> + * expense of extra power. Other than that, this function is identical to, and
> + * reuses call_rcu()'s logic. Refer to call_rcu() for more details about memory
> + * ordering and other functionality.
> + */
> +void call_rcu_flush(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, false);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_flush);
> +#endif
> +
> +/**
> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
> + * By default the callbacks are 'lazy' and are kept hidden from the main
> + * ->cblist to prevent starting of grace periods too soon.
> + * If you desire grace periods to start very soon, use call_rcu_flush().
> + *
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.  However, the callback function
> + * might well execute concurrently with RCU read-side critical sections
> + * that started after call_rcu() was invoked.
> + *
> + * RCU read-side critical sections are delimited by rcu_read_lock()
> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
> + * v5.0 and later, regions of code across which interrupts, preemption,
> + * or softirqs have been disabled also serve as RCU read-side critical
> + * sections.  This includes hardware interrupt handlers, softirq handlers,
> + * and NMI handlers.
> + *
> + * Note that all CPUs must agree that the grace period extended beyond
> + * all pre-existing RCU read-side critical section.  On systems with more
> + * than one CPU, this means that when "func()" is invoked, each CPU is
> + * guaranteed to have executed a full memory barrier since the end of its
> + * last RCU read-side critical section whose beginning preceded the call
> + * to call_rcu().  It also means that each CPU executing an RCU read-side
> + * critical section that continues beyond the start of "func()" must have
> + * executed a memory barrier after the call_rcu() but before the beginning
> + * of that RCU read-side critical section.  Note that these guarantees
> + * include CPUs that are offline, idle, or executing in user mode, as
> + * well as CPUs that are executing in the kernel.
> + *
> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> + * resulting RCU callback function "func()", then both CPU A and CPU B are
> + * guaranteed to execute a full memory barrier during the time interval
> + * between the call to call_rcu() and the invocation of "func()" -- even
> + * if CPU A and CPU B are the same CPU (but again only if the system has
> + * more than one CPU).
> + *
> + * Implementation of these memory-ordering guarantees is described here:
> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> + */
> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, true);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
>  
>  /* Maximum number of jiffies to wait before draining a batch. */
>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
> @@ -3507,7 +3544,7 @@ void synchronize_rcu(void)
>  		if (rcu_gp_is_expedited())
>  			synchronize_rcu_expedited();
>  		else
> -			wait_rcu_gp(call_rcu);
> +			wait_rcu_gp(call_rcu_flush);
>  		return;
>  	}
>  
> @@ -3905,7 +3942,7 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
>  	was_done = rcu_rdp_is_offloaded(rdp) && !rcu_segcblist_pend_cbs(&rdp->cblist);
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>  		atomic_inc(&rcu_state.barrier_cpu_count);
>  	} else {
> @@ -4334,7 +4371,7 @@ void rcutree_migrate_callbacks(int cpu)
>  	my_rdp = this_cpu_ptr(&rcu_data);
>  	my_rnp = my_rdp->mynode;
>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false));
>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>  	/* Leverage recent GPs and set GP for new callbacks. */
>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 41faaf8eb872..1d803d39f0d1 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -263,14 +263,16 @@ struct rcu_data {
>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>  
> +	long lazy_len;			/* Length of buffered lazy callbacks. */
>  	int cpu;
>  };
>  
>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>  #define RCU_NOCB_WAKE_NOT	0
>  #define RCU_NOCB_WAKE_BYPASS	1
> -#define RCU_NOCB_WAKE		2
> -#define RCU_NOCB_WAKE_FORCE	3
> +#define RCU_NOCB_WAKE_LAZY	2
> +#define RCU_NOCB_WAKE		3
> +#define RCU_NOCB_WAKE_FORCE	4
>  
>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>  					/* For jiffies_till_first_fqs and */
> @@ -440,9 +442,10 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>  static void rcu_init_one_nocb(struct rcu_node *rnp);
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j);
> +				  unsigned long j, bool lazy);
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags);
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy);
>  static bool wake_nocb_gp(struct rcu_data *rdp, bool force);
>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>  				 unsigned long flags);
> diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
> index 18e9b4cd78ef..5cac05600798 100644
> --- a/kernel/rcu/tree_exp.h
> +++ b/kernel/rcu/tree_exp.h
> @@ -937,7 +937,7 @@ void synchronize_rcu_expedited(void)
>  
>  	/* If expedited grace periods are prohibited, fall back to normal. */
>  	if (rcu_gp_is_normal()) {
> -		wait_rcu_gp(call_rcu);
> +		wait_rcu_gp(call_rcu_flush);
>  		return;
>  	}
>  
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 6caade0683dd..ccd05ded6bb4 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>  }
>  
> +/*
> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> + * could be flushed much earlier for a number of other reasons;
> + * however, LAZY_FLUSH_JIFFIES ensures that no lazy callbacks are
> + * left unsubmitted to RCU for longer than that many jiffies.
> + */
> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> +static unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> +
> +#ifdef CONFIG_RCU_LAZY
> +// To be called only from test code.
> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
> +{
> +	jiffies_till_flush = jif;
> +}
> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
> +
> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
> +{
> +	return jiffies_till_flush;
> +}
> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
> +#endif
> +
>  /*
>   * Arrange to wake the GP kthread for this NOCB group at some future
>   * time when it is safe to do so.
> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>  
>  	/*
> -	 * Bypass wakeup overrides previous deferments. In case
> -	 * of callback storm, no need to wake up too early.
> +	 * Bypass wakeup overrides previous deferments. In case of
> +	 * callback storm, no need to wake up too early.
>  	 */
> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
> +	if (waketype == RCU_NOCB_WAKE_LAZY &&
> +	    READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
> +		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> +	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
>  		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>  		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>  	} else {
> @@ -293,10 +322,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>   * proves to be initially empty, just return false because the no-CB GP
>   * kthread may need to be awakened in this case.
>   *
> + * Return true if there was something to be flushed and it succeeded, otherwise
> + * false.
> + *
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				     unsigned long j)
> +				     unsigned long j, bool lazy)
>  {
>  	struct rcu_cblist rcl;
>  
> @@ -310,7 +342,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
>  	if (rhp)
>  		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
> -	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +
> +	/*
> +	 * If the new CB requested was a lazy one, queue it onto the main
> +	 * ->cblist so we can take advantage of a sooner grace period.
> +	 */
> +	if (lazy && rhp) {
> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
> +		rcu_cblist_enqueue(&rcl, rhp);
> +		WRITE_ONCE(rdp->lazy_len, 0);
> +	} else {
> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +		WRITE_ONCE(rdp->lazy_len, 0);
> +	}
> +
>  	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
>  	WRITE_ONCE(rdp->nocb_bypass_first, j);
>  	rcu_nocb_bypass_unlock(rdp);
> @@ -326,13 +371,13 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy)
>  {
>  	if (!rcu_rdp_is_offloaded(rdp))
>  		return true;
>  	rcu_lockdep_assert_cblist_protected(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> +	return rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
>  }
>  
>  /*
> @@ -345,7 +390,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>  	if (!rcu_rdp_is_offloaded(rdp) ||
>  	    !rcu_nocb_bypass_trylock(rdp))
>  		return;
> -	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
> +	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
>  }
>  
>  /*
> @@ -367,12 +412,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>   * there is only one CPU in operation.
>   */
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy)
>  {
>  	unsigned long c;
>  	unsigned long cur_gp_seq;
>  	unsigned long j = jiffies;
>  	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
>  
>  	lockdep_assert_irqs_disabled();
>  
> @@ -417,25 +464,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// If there hasn't yet been all that many ->cblist enqueues
>  	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
>  	// ->nocb_bypass first.
> -	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
> +	// Lazy CBs throttle this back and do immediate bypass queuing.
> +	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
>  		rcu_nocb_lock(rdp);
>  		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  		if (*was_alldone)
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstQ"));
> -		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
> +
> +		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false));
>  		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
>  		return false; // Caller must enqueue the callback.
>  	}
>  
>  	// If ->nocb_bypass has been used too long or is too full,
>  	// flush ->nocb_bypass to ->cblist.
> -	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
> +	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
> +	    (ncbs &&  bypass_is_lazy &&
> +		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
>  	    ncbs >= qhimark) {
>  		rcu_nocb_lock(rdp);
>  		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  
> -		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
> +		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy)) {
>  			if (*was_alldone)
>  				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  						    TPS("FirstQ"));
> @@ -460,16 +511,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// We need to use the bypass.
>  	rcu_nocb_wait_contended(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> +
>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
> +
> +	if (lazy)
> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
> +
>  	if (!ncbs) {
>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>  	}
> +
>  	rcu_nocb_bypass_unlock(rdp);
>  	smp_mb(); /* Order enqueue before wake. */
> -	if (ncbs) {
> +
> +	// A wake up of the grace period kthread or a timer adjustment
> +	// needs to be done only if:
> +	// 1. The bypass list was fully empty before (this is the first
> +	//    bypass list entry), or
> +	// 2. The bypass list had only lazy CBs before and the new CB is
> +	//    non-lazy.
> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>  		local_irq_restore(flags);
>  	} else {
>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
> @@ -497,9 +561,10 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>  				 unsigned long flags)
>  				 __releases(rdp->nocb_lock)
>  {
> +	long bypass_len;
>  	unsigned long cur_gp_seq;
>  	unsigned long j;
> -	long len;
> +	long len, lazy_len;
>  	struct task_struct *t;
>  
>  	// If we are being polled or there is no kthread, just leave.
> @@ -512,9 +577,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>  	}
>  	// Need to actually to a wakeup.
>  	len = rcu_segcblist_n_cbs(&rdp->cblist);
> +	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +	lazy_len = READ_ONCE(rdp->lazy_len);
>  	if (was_alldone) {
>  		rdp->qlen_last_fqs_check = len;
> -		if (!irqs_disabled_flags(flags)) {
> +		// Only lazy CBs in bypass list
> +		if (lazy_len && bypass_len == lazy_len) {
> +			rcu_nocb_unlock_irqrestore(rdp, flags);
> +			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
> +					   TPS("WakeLazy"));
> +		} else if (!irqs_disabled_flags(flags)) {
>  			/* ... if queue was empty ... */
>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>  			wake_nocb_gp(rdp, false);
> @@ -611,6 +683,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	unsigned long flags;
>  	bool gotcbs = false;
>  	unsigned long j = jiffies;
> +	bool lazy = false;
> +	long lazy_ncbs;
>  	bool needwait_gp = false; // This prevents actual uninitialized use.
>  	bool needwake;
>  	bool needwake_gp;
> @@ -640,24 +714,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	 * won't be ignored for long.
>  	 */
>  	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
> +		bool flush_bypass = false;
> +
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
>  		rcu_nocb_lock_irqsave(rdp, flags);
>  		lockdep_assert_held(&rdp->nocb_lock);
>  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> -		if (bypass_ncbs &&
> +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +
> +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> +		     bypass_ncbs > 2 * qhimark)) {
> +			flush_bypass = true;
> +		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
>  		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
>  		     bypass_ncbs > 2 * qhimark)) {
> -			// Bypass full or old, so flush it.
> -			(void)rcu_nocb_try_flush_bypass(rdp, j);
> -			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +			flush_bypass = true;
>  		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>  			continue; /* No callbacks here, try next. */
>  		}
> +
> +		if (flush_bypass) {
> +			// Bypass full or old, so flush it.
> +			(void)rcu_nocb_try_flush_bypass(rdp, j);
> +			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +			lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +		}
> +
>  		if (bypass_ncbs) {
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> -					    TPS("Bypass"));
> -			bypass = true;
> +				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
> +			if (bypass_ncbs == lazy_ncbs)
> +				lazy = true;
> +			else
> +				bypass = true;
>  		}
>  		rnp = rdp->mynode;
>  
> @@ -705,12 +796,20 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	my_rdp->nocb_gp_gp = needwait_gp;
>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
>  
> -	if (bypass && !rcu_nocb_poll) {
> -		// At least one child with non-empty ->nocb_bypass, so set
> -		// timer in order to avoid stranding its callbacks.
> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> -				   TPS("WakeBypassIsDeferred"));
> +	// At least one child with non-empty ->nocb_bypass, so set
> +	// timer in order to avoid stranding its callbacks.
> +	if (!rcu_nocb_poll) {
> +		// If the bypass list has only lazy CBs, add a deferred lazy wake up.
> +		if (lazy && !bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> +					TPS("WakeLazyIsDeferred"));
> +		// Otherwise add a deferred bypass wake up.
> +		} else if (bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> +					TPS("WakeBypassIsDeferred"));
> +		}
>  	}
> +
>  	if (rcu_nocb_poll) {
>  		/* Polling, so trace if first poll in the series. */
>  		if (gotcbs)
> @@ -1036,7 +1135,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>  	 * return false, which means that future calls to rcu_nocb_try_bypass()
>  	 * will refuse to put anything into the bypass.
>  	 */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false));
>  	/*
>  	 * Start with invoking rcu_core() early. This way if the current thread
>  	 * happens to preempt an ongoing call to rcu_core() in the middle,
> @@ -1278,6 +1377,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  	raw_spin_lock_init(&rdp->nocb_gp_lock);
>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
>  	rcu_cblist_init(&rdp->nocb_bypass);
> +	WRITE_ONCE(rdp->lazy_len, 0);
>  	mutex_init(&rdp->nocb_gp_kthread_mutex);
>  }
>  
> @@ -1560,16 +1660,17 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>  
>  static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  {
> +	return false;
>  }
>  
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy)
>  {
>  	return true;
>  }
>  
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags, bool lazy)
>  {
>  	return false;
>  }
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-07 17:52         ` Paul E. McKenney
@ 2022-10-07 19:29           ` Joel Fernandes
  2022-10-07 19:56             ` Paul E. McKenney
  0 siblings, 1 reply; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07 19:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 10:52:08AM -0700, Paul E. McKenney wrote:
> On Fri, Oct 07, 2022 at 01:31:23PM -0400, Joel Fernandes wrote:
> > 
> > 
> > > On Oct 7, 2022, at 1:19 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > 
> > > On Fri, Oct 07, 2022 at 03:18:26AM +0000, Joel Fernandes wrote:
> > >>> On Tue, Oct 04, 2022 at 02:41:56AM +0000, Joel Fernandes (Google) wrote:
> > >>> From: Uladzislau Rezki <urezki@gmail.com>
> > >>> 
> > >>> Slow boot time is seen on KVM running typical Linux distributions due to
> > >>> SCSI layer calling call_rcu(). Recent changes to save power may be
> > >>> causing this slowness. Using call_rcu_flush() fixes the issue and brings
> > >>> the boot time back to what it originally was. Convert it.
> > >>> 
> > >>> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> > >>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > >> 
> > >> And I successfully setup Debian on KVM and verified that this fixes it, so
> > >> now I have a nice reproducible rig for my
> > >> 'lazy-callback-doing-a-wakeup-detector' (I wrote half the detector thanks to
> > >> ideas from Steve, and will finish the other half tomorrow or so).
> > >> 
> > >> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > Looks like I can catch Vlad's issue with the below patch. Thoughts? Does this
> > > look reasonable for mainline? (I think so as it is self-contained and the
> > > debug option is default off, and could be useful down the line).
> 
> Excellent as a debug patch!!!  Let's see how it goes -- if it proves
> sufficiently useful, some form should go into mainline.  Or at least
> be featured prominently somewhere public.

Ok that sounds good.

> > > [    6.887033 ] rcu: *****************************************************
> > > [    6.891242 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > > [    6.895377 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > > [    6.899084 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > > [    6.902405 ] rcu: This could cause performance issues! Check the stack.
> > > [    6.906532 ] rcu: *****************************************************
> > > 
> > > 
> > > [   17.127128 ] rcu: *****************************************************
> > > [   17.131397 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > > [   17.135703 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > > [   17.139485 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > > [   17.142828 ] rcu: This could cause performance issues! Check the stack.
> > > [   17.146962 ] rcu: *****************************************************
> > > 
> > > And thanks to Steve for the binary search code.
> > > 
> > > One thing I found is I have to ignore kworkers because there are times when a
> > > work item is queued from a callback and those callbacks don't seem to
> > > contribute to performance issues. So I am filtering these:
> > > 
> > > [   38.631724 ] rcu: The callback name is: thread_stack_free_rcu
> > > [   38.635317 ] rcu: The task it woke up is: kworker/3:2 (143)
> > > 
> > > [   39.649332 ] rcu: The callback name is: delayed_put_task_struct
> > > [   39.653037 ] rcu: The task it woke up is: kworker/0:1 (40)
> > > 
> > > ---8<-----------------------
> > > 
> > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > Subject: [PATCH] lazy wake debug
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > > kernel/rcu/Kconfig      |   7 ++
> > > kernel/rcu/lazy-debug.h | 149 ++++++++++++++++++++++++++++++++++++++++
> > > kernel/rcu/tree.c       |   9 +++
> > > 3 files changed, 165 insertions(+)
> > > create mode 100644 kernel/rcu/lazy-debug.h
> > > 
> > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > index edd632e68497..08c06f739187 100644
> > > --- a/kernel/rcu/Kconfig
> > > +++ b/kernel/rcu/Kconfig
> > > @@ -322,4 +322,11 @@ config RCU_LAZY
> > >      To save power, batch RCU callbacks and flush after delay, memory
> > >      pressure or callback list growing too big.
> > > 
> > > +config RCU_LAZY_DEBUG
> > > +    bool "RCU callback lazy invocation debugging"
> > > +    depends on RCU_LAZY
> > > +    default n
> > > +    help
> > > +      Debugging to catch issues caused by delayed RCU callbacks.
> > > +
> > > endmenu # "RCU Subsystem"
> > > diff --git a/kernel/rcu/lazy-debug.h b/kernel/rcu/lazy-debug.h
> > > new file mode 100644
> > > index 000000000000..fc1cc1cb89f0
> > > --- /dev/null
> > > +++ b/kernel/rcu/lazy-debug.h
> > > @@ -0,0 +1,149 @@
> > > +#include <linux/string.h>
> > > +#include <linux/spinlock.h>
> > > +
> > > +#ifdef CONFIG_RCU_LAZY_DEBUG
> > > +#include <linux/preempt.h>
> > > +#include <trace/events/sched.h>
> > > +
> > > +static DEFINE_PER_CPU(bool, rcu_lazy_cb_exec) = false;
> > > +static DEFINE_PER_CPU(void *, rcu_lazy_ip) = NULL;
> > > +
> > > +static DEFINE_RAW_SPINLOCK(lazy_funcs_lock);
> > > +
> > > +#define FUNC_SIZE 1024
> > 
> > And I know this array can overflow in the future so I will add checks for that in the code if we are going with this patch.
> 
> I believe that there are fewer than 300 RCU callback functions, but yes,
> there would need to at least be a check at some point.
> 
> I still strongly suggest the static search in addition to this.  Yes, this
> is a cool patch, but it should be mostly used for the difficult-to-find
> instances.

I wrote some scripts shared below (could be improved) which search for "wake"
in code following an RCU callback's reference. This catches SCSI too but I
did find one more:

(1)
rxrpc_destroy_connection()

	which does:
	wake_up_var(&conn->params.local->rxnet->nr_conns);


I think I'll change this to call_rcu_flush() to be safe.
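
The conversion itself should be a one-liner at whatever call_rcu() site queues
rxrpc_destroy_connection() as the callback -- sketched from memory below, so
treat the exact call site and the rcu_head field name as illustrative:

-	call_rcu(&conn->rcu, rxrpc_destroy_connection);
+	call_rcu_flush(&conn->rcu, rxrpc_destroy_connection);

That way the wake_up_var() in the callback is not held back by the lazy
batching timer.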

========

All others are harmless / false-positives which I inspected and didn't have
anything concerning.

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] debug: look for wake references after rcu callback body

First run search-call-rcu.sh which generates some files, then run
search-wakers.sh to see the references to wake.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 search-call-rcu.sh | 19 +++++++++++++++++++
 search-wakers.sh   | 15 +++++++++++++++
 2 files changed, 34 insertions(+)
 create mode 100755 search-call-rcu.sh
 create mode 100755 search-wakers.sh

diff --git a/search-call-rcu.sh b/search-call-rcu.sh
new file mode 100755
index 000000000000..21406355888c
--- /dev/null
+++ b/search-call-rcu.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+
+rm func-list
+touch func-list
+
+for f in $(find . \( -name "*.c" -o -name "*.h" \) | grep -v rcu); do
+
+	funcs=$(perl -0777 -ne 'while(m/call_rcu\([&]?.+,\s?(.+)\).*;/g){print "$1\n";}' $f)
+
+	if [ "x$funcs" != "x" ]; then
+		for func in $funcs; do
+			echo "$f $func" >> func-list
+			echo "$f $func"
+		done
+	fi
+
+done
+
+cat func-list | sort | uniq | tee func-list-sorted
diff --git a/search-wakers.sh b/search-wakers.sh
new file mode 100755
index 000000000000..a96d60a7e16b
--- /dev/null
+++ b/search-wakers.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+while read fl; do
+	file=$(echo $fl | cut -d " " -f1)
+	func=$(echo $fl | cut -d " " -f2)
+
+	grep -A 30 $func $file | grep wake > /dev/null
+
+	if [ $? -eq 0 ]; then
+		echo "keyword wake found after function reference $func in $file"
+		echo "Output:"
+		grep -A 30 $func $file 
+		echo "==========================================================="
+	fi
+done < func-list-sorted
-- 
2.38.0.rc1.362.ged0d419d3c-goog


^ permalink raw reply related	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-07 19:29           ` Joel Fernandes
@ 2022-10-07 19:56             ` Paul E. McKenney
  2022-10-07 20:24               ` Joel Fernandes
  0 siblings, 1 reply; 45+ messages in thread
From: Paul E. McKenney @ 2022-10-07 19:56 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Fri, Oct 07, 2022 at 07:29:14PM +0000, Joel Fernandes wrote:
> On Fri, Oct 07, 2022 at 10:52:08AM -0700, Paul E. McKenney wrote:
> > On Fri, Oct 07, 2022 at 01:31:23PM -0400, Joel Fernandes wrote:
> > > 
> > > 
> > > > On Oct 7, 2022, at 1:19 PM, Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > 
> > > > On Fri, Oct 07, 2022 at 03:18:26AM +0000, Joel Fernandes wrote:
> > > >>> On Tue, Oct 04, 2022 at 02:41:56AM +0000, Joel Fernandes (Google) wrote:
> > > >>> From: Uladzislau Rezki <urezki@gmail.com>
> > > >>> 
> > > >>> Slow boot time is seen on KVM running typical Linux distributions due to
> > > >>> SCSI layer calling call_rcu(). Recent changes to save power may be
> > > >>> causing this slowness. Using call_rcu_flush() fixes the issue and brings
> > > >>> the boot time back to what it originally was. Convert it.
> > > >>> 
> > > >>> Signed-off-by: Uladzislau Rezki <urezki@gmail.com>
> > > >>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > >> 
> > > >> And I successfully setup Debian on KVM and verified that this fixes it, so
> > > >> now I have a nice reproducible rig for my
> > > >> 'lazy-callback-doing-a-wakeup-detector' (I wrote half the detector thanks to
> > > >> ideas from Steve, and will finish the other half tomorrow or so).
> > > >> 
> > > >> Tested-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > 
> > > > Looks like I can catch Vlad's issue with the below patch. Thoughts? Does this
> > > > look reasonable for mainline? (I think so as it is self-contained and the
> > > > debug option is default off, and could be useful down the line).
> > 
> > Excellent as a debug patch!!!  Let's see how it goes -- if it proves
> > sufficiently useful, some form should go into mainline.  Or at least
> > be featured prominently somewhere public.
> 
> Ok that sounds good.
> 
> > > > [    6.887033 ] rcu: *****************************************************
> > > > [    6.891242 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > > > [    6.895377 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > > > [    6.899084 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > > > [    6.902405 ] rcu: This could cause performance issues! Check the stack.
> > > > [    6.906532 ] rcu: *****************************************************
> > > > 
> > > > 
> > > > [   17.127128 ] rcu: *****************************************************
> > > > [   17.131397 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > > > [   17.135703 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > > > [   17.139485 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > > > [   17.142828 ] rcu: This could cause performance issues! Check the stack.
> > > > [   17.146962 ] rcu: *****************************************************
> > > > 
> > > > And thanks to Steve for the binary search code.
> > > > 
> > > > One thing I found is I have to ignore kworkers because there are times when a
> > > > work item is queued from a callback and those callbacks don't seem to
> > > > contribute to performance issues. So I am filtering these:
> > > > 
> > > > [   38.631724 ] rcu: The callback name is: thread_stack_free_rcu
> > > > [   38.635317 ] rcu: The task it woke up is: kworker/3:2 (143)
> > > > 
> > > > [   39.649332 ] rcu: The callback name is: delayed_put_task_struct
> > > > [   39.653037 ] rcu: The task it woke up is: kworker/0:1 (40)
> > > > 
> > > > ---8<-----------------------
> > > > 
> > > > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > > > Subject: [PATCH] lazy wake debug
> > > > 
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > ---
> > > > kernel/rcu/Kconfig      |   7 ++
> > > > kernel/rcu/lazy-debug.h | 149 ++++++++++++++++++++++++++++++++++++++++
> > > > kernel/rcu/tree.c       |   9 +++
> > > > 3 files changed, 165 insertions(+)
> > > > create mode 100644 kernel/rcu/lazy-debug.h
> > > > 
> > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > index edd632e68497..08c06f739187 100644
> > > > --- a/kernel/rcu/Kconfig
> > > > +++ b/kernel/rcu/Kconfig
> > > > @@ -322,4 +322,11 @@ config RCU_LAZY
> > > >      To save power, batch RCU callbacks and flush after delay, memory
> > > >      pressure or callback list growing too big.
> > > > 
> > > > +config RCU_LAZY_DEBUG
> > > > +    bool "RCU callback lazy invocation debugging"
> > > > +    depends on RCU_LAZY
> > > > +    default n
> > > > +    help
> > > > +      Debugging to catch issues caused by delayed RCU callbacks.
> > > > +
> > > > endmenu # "RCU Subsystem"
> > > > diff --git a/kernel/rcu/lazy-debug.h b/kernel/rcu/lazy-debug.h
> > > > new file mode 100644
> > > > index 000000000000..fc1cc1cb89f0
> > > > --- /dev/null
> > > > +++ b/kernel/rcu/lazy-debug.h
> > > > @@ -0,0 +1,149 @@
> > > > +#include <linux/string.h>
> > > > +#include <linux/spinlock.h>
> > > > +
> > > > +#ifdef CONFIG_RCU_LAZY_DEBUG
> > > > +#include <linux/preempt.h>
> > > > +#include <trace/events/sched.h>
> > > > +
> > > > +static DEFINE_PER_CPU(bool, rcu_lazy_cb_exec) = false;
> > > > +static DEFINE_PER_CPU(void *, rcu_lazy_ip) = NULL;
> > > > +
> > > > +static DEFINE_RAW_SPINLOCK(lazy_funcs_lock);
> > > > +
> > > > +#define FUNC_SIZE 1024
> > > 
> > > And I know this array can overflow in the future so I will add checks for that in the code if we are going with this patch.
> > 
> > I believe that there are fewer than 300 RCU callback functions, but yes,
> > there would need to at least be a check at some point.
> > 
> > I still strongly suggest the static search in addition to this.  Yes, this
> > is a cool patch, but it should be mostly used for the difficult-to-find
> > instances.
> 
> I wrote some scripts shared below (could be improved) which search for "wake"
> in code following an RCU callback's reference. This catches SCSI too but I
> did find one more:
> 
> (1)
> rxrpc_destroy_connection()
> 
> 	which does:
> 	wake_up_var(&conn->params.local->rxnet->nr_conns);
> 
> 
> I think I'll change this to call_rcu_flush() to be safe.
> 
> ========
> 
> All others are harmless / false-positives which I inspected and didn't have
> anything concerning.

Very good, and thank you for doing this!

Maybe include the script in the cover letter of the next version?

							Thanx, Paul

> ---8<-----------------------
> 
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Subject: [PATCH] debug: look for wake references after rcu callback body
> 
> First run search-call-rcu.sh which generates some files, then run
> search-wakers.sh to see the references to wake.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  search-call-rcu.sh | 19 +++++++++++++++++++
>  search-wakers.sh   | 15 +++++++++++++++
>  2 files changed, 34 insertions(+)
>  create mode 100755 search-call-rcu.sh
>  create mode 100755 search-wakers.sh
> 
> diff --git a/search-call-rcu.sh b/search-call-rcu.sh
> new file mode 100755
> index 000000000000..21406355888c
> --- /dev/null
> +++ b/search-call-rcu.sh
> @@ -0,0 +1,19 @@
> +#!/bin/bash
> +
> +rm func-list
> +touch func-list
> +
> +for f in $(find . \( -name "*.c" -o -name "*.h" \) | grep -v rcu); do
> +
> +	funcs=$(perl -0777 -ne 'while(m/call_rcu\([&]?.+,\s?(.+)\).*;/g){print "$1\n";}' $f)
> +
> +	if [ "x$funcs" != "x" ]; then
> +		for func in $funcs; do
> +			echo "$f $func" >> func-list
> +			echo "$f $func"
> +		done
> +	fi
> +
> +done
> +
> +cat func-list | sort | uniq | tee func-list-sorted
> diff --git a/search-wakers.sh b/search-wakers.sh
> new file mode 100755
> index 000000000000..a96d60a7e16b
> --- /dev/null
> +++ b/search-wakers.sh
> @@ -0,0 +1,15 @@
> +#!/bin/bash
> +
> +while read fl; do
> +	file=$(echo $fl | cut -d " " -f1)
> +	func=$(echo $fl | cut -d " " -f2)
> +
> +	grep -A 30 $func $file | grep wake > /dev/null
> +
> +	if [ $? -eq 0 ]; then
> +		echo "keyword wake found after function reference $func in $file"
> +		echo "Output:"
> +		grep -A 30 $func $file 
> +		echo "==========================================================="
> +	fi
> +done < func-list-sorted
> -- 
> 2.38.0.rc1.362.ged0d419d3c-goog
> 

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu()
  2022-10-07 19:56             ` Paul E. McKenney
@ 2022-10-07 20:24               ` Joel Fernandes
  0 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes @ 2022-10-07 20:24 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Fri, Oct 7, 2022 at 3:56 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[..]
> > > > > [    6.887033 ] rcu: *****************************************************
> > > > > [    6.891242 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > > > > [    6.895377 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > > > > [    6.899084 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > > > > [    6.902405 ] rcu: This could cause performance issues! Check the stack.
> > > > > [    6.906532 ] rcu: *****************************************************
> > > > >
> > > > >
> > > > > [   17.127128 ] rcu: *****************************************************
> > > > > [   17.131397 ] rcu: RCU: A wake up has been detected from a lazy callback!
> > > > > [   17.135703 ] rcu: The callback name is: scsi_eh_inc_host_failed
> > > > > [   17.139485 ] rcu: The task it woke up is: scsi_eh_1 (61)
> > > > > [   17.142828 ] rcu: This could cause performance issues! Check the stack.
> > > > > [   17.146962 ] rcu: *****************************************************
> > > > >
> > > > > And thanks to Steve for the binary search code.
> > > > >
> > > > > One thing I found is I have to ignore kworkers because there are times when a
> > > > > work item is queued from a callback and those callbacks don't seem to
> > > > > contribute to performance issues. So I am filtering these:
> > > > >
> > > > > [   38.631724 ] rcu: The callback name is: thread_stack_free_rcu
> > > > > [   38.635317 ] rcu: The task it woke up is: kworker/3:2 (143)
> > > > >
[..]
> > > > > +    help
> > > > > +      Debugging to catch issues caused by delayed RCU callbacks.
> > > > > +
> > > > > endmenu # "RCU Subsystem"
> > > > > diff --git a/kernel/rcu/lazy-debug.h b/kernel/rcu/lazy-debug.h
> > > > > new file mode 100644
> > > > > index 000000000000..fc1cc1cb89f0
> > > > > --- /dev/null
> > > > > +++ b/kernel/rcu/lazy-debug.h
> > > > > @@ -0,0 +1,149 @@
> > > > > +#include <linux/string.h>
> > > > > +#include <linux/spinlock.h>
> > > > > +
> > > > > +#ifdef CONFIG_RCU_LAZY_DEBUG
> > > > > +#include <linux/preempt.h>
> > > > > +#include <trace/events/sched.h>
> > > > > +
> > > > > +static DEFINE_PER_CPU(bool, rcu_lazy_cb_exec) = false;
> > > > > +static DEFINE_PER_CPU(void *, rcu_lazy_ip) = NULL;
> > > > > +
> > > > > +static DEFINE_RAW_SPINLOCK(lazy_funcs_lock);
> > > > > +
> > > > > +#define FUNC_SIZE 1024
> > > >
> > > > And I know this array can overflow in the future so I will add checks for that in the code if we are going with this patch.
> > >
> > > I believe that there are fewer than 300 RCU callback functions, but yes,
> > > there would need to at least be a check at some point.
> > >
> > > I still strongly suggest the static search in addition to this.  Yes, this
> > > is a cool patch, but it should be mostly used for the difficult-to-find
> > > instances.
> >
> > I wrote some scripts shared below (could be improved) which search for "wake"
> > in code following an RCU callback's reference. This catches SCSI too but I
> > did find one more:
> >
> > (1)
> > rxrpc_destroy_connection()
> >
> >       which does:
> >       wake_up_var(&conn->params.local->rxnet->nr_conns);
> >
> >
> > I think I'll change this to call_rcu_flush() to be safe.
> >
> > ========
> >
> > All others are harmless / false-positives which I inspected and didn't have
> > anything concerning.
>
> Very good, and thank you for doing this!

Thanks.

> Maybe include the script in the cover letter of the next version?

Good idea, thanks, I will do so.

 - Joel


>
>                                                         Thanx, Paul
>
> > ---8<-----------------------
> >
> > From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> > Subject: [PATCH] debug: look for wake references after rcu callback body
> >
> > First run search-call-rcu.sh which generates some files, then run
> > search-wakers.sh to see the references to wake.
> >
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  search-call-rcu.sh | 19 +++++++++++++++++++
> >  search-wakers.sh   | 15 +++++++++++++++
> >  2 files changed, 34 insertions(+)
> >  create mode 100755 search-call-rcu.sh
> >  create mode 100755 search-wakers.sh
> >
> > diff --git a/search-call-rcu.sh b/search-call-rcu.sh
> > new file mode 100755
> > index 000000000000..21406355888c
> > --- /dev/null
> > +++ b/search-call-rcu.sh
> > @@ -0,0 +1,19 @@
> > +#!/bin/bash
> > +
> > +rm func-list
> > +touch func-list
> > +
> > +for f in $(find . \( -name "*.c" -o -name "*.h" \) | grep -v rcu); do
> > +
> > +     funcs=$(perl -0777 -ne 'while(m/call_rcu\([&]?.+,\s?(.+)\).*;/g){print "$1\n";}' $f)
> > +
> > +     if [ "x$funcs" != "x" ]; then
> > +             for func in $funcs; do
> > +                     echo "$f $func" >> func-list
> > +                     echo "$f $func"
> > +             done
> > +     fi
> > +
> > +done
> > +
> > +cat func-list | sort | uniq | tee func-list-sorted
> > diff --git a/search-wakers.sh b/search-wakers.sh
> > new file mode 100755
> > index 000000000000..a96d60a7e16b
> > --- /dev/null
> > +++ b/search-wakers.sh
> > @@ -0,0 +1,15 @@
> > +#!/bin/bash
> > +
> > +while read fl; do
> > +     file=$(echo $fl | cut -d " " -f1)
> > +     func=$(echo $fl | cut -d " " -f2)
> > +
> > +     grep -A 30 $func $file | grep wake > /dev/null
> > +
> > +     if [ $? -eq 0 ]; then
> > +             echo "keyword wake found after function reference $func in $file"
> > +             echo "Output:"
> > +             grep -A 30 $func $file
> > +             echo "==========================================================="
> > +     fi
> > +done < func-list-sorted
> > --
> > 2.38.0.rc1.362.ged0d419d3c-goog
> >

^ permalink raw reply	[flat|nested] 45+ messages in thread

* Re: [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power
  2022-10-06 19:11   ` Paul E. McKenney
@ 2022-10-11 17:44     ` Joel Fernandes
  0 siblings, 0 replies; 45+ messages in thread
From: Joel Fernandes @ 2022-10-11 17:44 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, youssefesmat, surenb

On Thu, Oct 6, 2022 at 3:11 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[..]
> >  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > -                             bool *was_alldone, unsigned long flags)
> > +                             bool *was_alldone, unsigned long flags,
> > +                             bool lazy)
> >  {
> >       unsigned long c;
> >       unsigned long cur_gp_seq;
> >       unsigned long j = jiffies;
> >       long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> > +     bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
>
> In the alphabet, "b" comes before "c".  Picky, but it does make these
> things easier to find.  Which can be important when you are searching
> for all occurrences of something -- nice to be able to find the variable
> in alpha order without interrupting your search.

For this comment, I cannot reorder the declarations because ncbs needs
to be defined before it is used in the bypass_is_lazy initializer. I
will leave this one alone if that is OK with you.
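
For reference, keeping strict alphabetical order would need the
initializer split out, something like the untested sketch below, which
does not seem worth it here:

	bool bypass_is_lazy;
	unsigned long c;
	unsigned long cur_gp_seq;
	unsigned long j = jiffies;
	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);

	/* Sketch only: same value as before, just assigned after ncbs. */
	bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));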

I am OK with all the other comments.

Thanks.

^ permalink raw reply	[flat|nested] 45+ messages in thread

end of thread, other threads:[~2022-10-11 17:45 UTC | newest]

Thread overview: 45+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-04  2:41 [PATCH v7 00/11] rcu: call_rcu() power improvements Joel Fernandes (Google)
2022-10-04  2:41 ` [PATCH v7 01/11] rcu: Wake up nocb gp thread on rcu_barrier_entrain() Joel Fernandes (Google)
2022-10-04 22:28   ` Frederic Weisbecker
2022-10-04 22:57     ` Joel Fernandes
2022-10-05 10:39       ` Frederic Weisbecker
2022-10-07  2:47       ` Joel Fernandes
2022-10-07 11:26         ` Frederic Weisbecker
2022-10-07 12:46           ` Joel Fernandes
2022-10-04  2:41 ` [PATCH v7 02/11] rcu: Make call_rcu() lazy to save power Joel Fernandes (Google)
2022-10-04 11:41   ` Uladzislau Rezki
2022-10-04 13:30     ` Paul E. McKenney
2022-10-04 14:53       ` Uladzislau Rezki
2022-10-04 15:58         ` Paul E. McKenney
2022-10-04 16:20           ` Uladzislau Rezki
2022-10-04 18:27             ` Paul E. McKenney
2022-10-05 11:28               ` Uladzislau Rezki
2022-10-04 16:22         ` Joel Fernandes
2022-10-04 18:05     ` Joel Fernandes
2022-10-05 11:21       ` Uladzislau Rezki
2022-10-05 11:44   ` Uladzislau Rezki
2022-10-06 19:11   ` Paul E. McKenney
2022-10-11 17:44     ` Joel Fernandes
2022-10-04  2:41 ` [PATCH v7 03/11] rcu: Refactor code a bit in rcu_nocb_do_flush_bypass() Joel Fernandes (Google)
2022-10-06 19:12   ` Paul E. McKenney
2022-10-04  2:41 ` [PATCH v7 04/11] rcu: shrinker for lazy rcu Joel Fernandes (Google)
2022-10-04  2:41 ` [PATCH v7 05/11] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
2022-10-06 19:15   ` Paul E. McKenney
2022-10-04  2:41 ` [PATCH v7 06/11] percpu-refcount: Use call_rcu_flush() for atomic switch Joel Fernandes (Google)
2022-10-04  2:41 ` [PATCH v7 07/11] rcu/sync: Use call_rcu_flush() instead of call_rcu Joel Fernandes (Google)
2022-10-04  2:41 ` [PATCH v7 08/11] rcu/rcuscale: Use call_rcu_flush() for async reader test Joel Fernandes (Google)
2022-10-04  2:41 ` [PATCH v7 09/11] rcu/rcutorture: Use call_rcu_flush() where needed Joel Fernandes (Google)
2022-10-04  2:41 ` [PATCH v7 10/11] scsi/scsi_error: Use call_rcu_flush() instead of call_rcu() Joel Fernandes (Google)
2022-10-07  3:18   ` Joel Fernandes
2022-10-07 17:19     ` Joel Fernandes
2022-10-07 17:31       ` Joel Fernandes
2022-10-07 17:52         ` Paul E. McKenney
2022-10-07 19:29           ` Joel Fernandes
2022-10-07 19:56             ` Paul E. McKenney
2022-10-07 20:24               ` Joel Fernandes
2022-10-04  2:41 ` [PATCH v7 11/11] workqueue: Make queue_rcu_work() use call_rcu_flush() Joel Fernandes (Google)
2022-10-06 18:55 ` [PATCH v7 00/11] rcu: call_rcu() power improvements Paul E. McKenney
2022-10-07 14:40   ` Uladzislau Rezki
2022-10-07 14:54     ` Paul E. McKenney
2022-10-07 15:09   ` Joel Fernandes
2022-10-07 18:30     ` Paul E. McKenney

This is a public inbox; see mirroring instructions for how to clone and
mirror all data and code used for this inbox, as well as URLs for NNTP
newsgroup(s).