* [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
@ 2022-08-19 20:48 Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 01/14] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
                   ` (14 more replies)
  0 siblings, 15 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This refresh is tested on real ChromeOS userspace and hardware; it passes
boot-time tests and rcuscale tests.

Fixes on top of v3:
- Fix boot issues caused by a race in the lazy RCU logic, which led to a missed
  wakeup of the RCU GP thread and made synchronize_rcu() stall.
- Fix the trace_rcu_callback tracepoint.

I tested power previously [1]. I am in the process of testing power again, but
I wanted to share my latest code since others who are also testing power could
use the above fixes.

[1] https://lore.kernel.org/all/20220713213237.1596225-1-joel@joelfernandes.org/

Joel Fernandes (Google) (13):
rcu: Introduce call_rcu_lazy() API implementation
rcuscale: Add laziness and kfree tests
fs: Move call_rcu() to call_rcu_lazy() in some paths
rcutorture: Add test code for call_rcu_lazy()
debug: Toggle lazy at runtime and change flush jiffies
cred: Move call_rcu() to call_rcu_lazy()
security: Move call_rcu() to call_rcu_lazy()
net/core: Move call_rcu() to call_rcu_lazy()
kernel: Move various core kernel usages to call_rcu_lazy()
lib: Move call_rcu() to call_rcu_lazy()
i915: Move call_rcu() to call_rcu_lazy()
fork: Move thread_stack_free_rcu to call_rcu_lazy
rcu/tree: Move trace_rcu_callback() before bypassing

Vineeth Pillai (1):
rcu: shrinker for lazy rcu

drivers/gpu/drm/i915/gem/i915_gem_object.c    |   2 +-
fs/dcache.c                                   |   4 +-
fs/eventpoll.c                                |   2 +-
fs/file_table.c                               |   2 +-
fs/inode.c                                    |   2 +-
include/linux/rcu_segcblist.h                 |   1 +
include/linux/rcupdate.h                      |   6 +
include/linux/sched/sysctl.h                  |   3 +
kernel/cred.c                                 |   2 +-
kernel/exit.c                                 |   2 +-
kernel/fork.c                                 |   6 +-
kernel/pid.c                                  |   2 +-
kernel/rcu/Kconfig                            |   8 +
kernel/rcu/rcu.h                              |  12 +
kernel/rcu/rcu_segcblist.c                    |  15 +-
kernel/rcu/rcu_segcblist.h                    |  20 +-
kernel/rcu/rcuscale.c                         |  74 ++++-
kernel/rcu/rcutorture.c                       |  60 +++-
kernel/rcu/tree.c                             | 139 ++++++----
kernel/rcu/tree.h                             |  10 +-
kernel/rcu/tree_nocb.h                        | 260 +++++++++++++++---
kernel/sysctl.c                               |  17 ++
kernel/time/posix-timers.c                    |   2 +-
lib/radix-tree.c                              |   2 +-
lib/xarray.c                                  |   2 +-
net/core/dst.c                                |   2 +-
security/security.c                           |   2 +-
security/selinux/avc.c                        |   4 +-
.../selftests/rcutorture/configs/rcu/CFLIST   |   1 +
.../selftests/rcutorture/configs/rcu/TREE11   |  18 ++
.../rcutorture/configs/rcu/TREE11.boot        |   8 +
31 files changed, 567 insertions(+), 123 deletions(-)
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

--
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 01/14] rcu: Introduce call_rcu_lazy() API implementation
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 02/14] rcu: shrinker for lazy rcu Joel Fernandes (Google)
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	Paul McKenney, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

Implement timer-based RCU lazy callback batching. The batch is flushed
whenever a certain amount of time has passed, or the batch on a
particular CPU grows too big. A future patch will also flush the batch
under memory pressure.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we reuse the bypass lists to hold lazy CBs. The bypass list
length includes the lazy CB count; a separate lazy CB length counter is
also introduced to track the number of lazy CBs.
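
For illustration, a hypothetical caller (not part of this patch, and assuming
the usual rcupdate.h/slab.h includes) looks just like a call_rcu() user; only
the enqueue API changes:

	struct foo {
		struct rcu_head rh;
		int data;
	};

	/* Invoked after a grace period, possibly much later than with
	 * call_rcu() if the system is idle. */
	static void foo_free_rcu(struct rcu_head *rh)
	{
		kfree(container_of(rh, struct foo, rh));
	}

	static void foo_release(struct foo *f)
	{
		/* Nothing waits on this free, so laziness is acceptable. */
		call_rcu_lazy(&f->rh, foo_free_rcu);
	}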

Suggested-by: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcu_segcblist.h |   1 +
 include/linux/rcupdate.h      |   6 +
 kernel/rcu/Kconfig            |   8 ++
 kernel/rcu/rcu.h              |  11 ++
 kernel/rcu/rcu_segcblist.c    |  15 ++-
 kernel/rcu/rcu_segcblist.h    |  20 +++-
 kernel/rcu/tree.c             | 130 ++++++++++++++--------
 kernel/rcu/tree.h             |  10 +-
 kernel/rcu/tree_nocb.h        | 199 ++++++++++++++++++++++++++--------
 9 files changed, 301 insertions(+), 99 deletions(-)

diff --git a/include/linux/rcu_segcblist.h b/include/linux/rcu_segcblist.h
index 659d13a7ddaa..9a992707917b 100644
--- a/include/linux/rcu_segcblist.h
+++ b/include/linux/rcu_segcblist.h
@@ -22,6 +22,7 @@ struct rcu_cblist {
 	struct rcu_head *head;
 	struct rcu_head **tail;
 	long len;
+	long lazy_len;
 };
 
 #define RCU_CBLIST_INITIALIZER(n) { .head = NULL, .tail = &n.head }
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1a32036c918c..9191a3d88087 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
+#else
+#define call_rcu_lazy(head, func) call_rcu(head, func)
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 27aab870ae4c..779b6e84006b 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -293,4 +293,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 4916077119f3..608f6ab76c7f 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -463,6 +463,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);
@@ -472,6 +480,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 			       unsigned long c_old,
 			       unsigned long c);
 void rcu_gp_set_torture_wait(int duration);
+void rcu_force_call_rcu_to_lazy(bool force);
+
 #else
 static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
 					  int *flags, unsigned long *gp_seq)
@@ -490,6 +500,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 	do { } while (0)
 #endif
 static inline void rcu_gp_set_torture_wait(int duration) { }
+static inline void rcu_force_call_rcu_to_lazy(bool force) { }
 #endif
 
 #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index c54ea2b6a36b..776647cd2d6c 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -20,16 +20,21 @@ void rcu_cblist_init(struct rcu_cblist *rclp)
 	rclp->head = NULL;
 	rclp->tail = &rclp->head;
 	rclp->len = 0;
+	rclp->lazy_len = 0;
 }
 
 /*
  * Enqueue an rcu_head structure onto the specified callback list.
  */
-void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
+void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp,
+			bool lazy)
 {
 	*rclp->tail = rhp;
 	rclp->tail = &rhp->next;
 	WRITE_ONCE(rclp->len, rclp->len + 1);
+
+	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
+		WRITE_ONCE(rclp->lazy_len, rclp->lazy_len + 1);
 }
 
 /*
@@ -38,11 +43,12 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
  * element of the second rcu_cblist structure, but ensuring that the second
  * rcu_cblist structure, if initially non-empty, always appears non-empty
  * throughout the process.  If rdp is NULL, the second rcu_cblist structure
- * is instead initialized to empty.
+ * is instead initialized to empty. Also account for lazy_len for lazy CBs.
  */
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
 			      struct rcu_cblist *srclp,
-			      struct rcu_head *rhp)
+			      struct rcu_head *rhp,
+			      bool lazy)
 {
 	drclp->head = srclp->head;
 	if (drclp->head)
@@ -58,6 +64,9 @@ void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
 		srclp->tail = &rhp->next;
 		WRITE_ONCE(srclp->len, 1);
 	}
+
+	if (IS_ENABLED(CONFIG_RCU_LAZY) && rhp && lazy)
+		WRITE_ONCE(srclp->lazy_len, 1);
 }
 
 /*
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 431cee212467..8e90b34adb00 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -15,14 +15,30 @@ static inline long rcu_cblist_n_cbs(struct rcu_cblist *rclp)
 	return READ_ONCE(rclp->len);
 }
 
+/* Return number of lazy callbacks in the specified callback list. */
+static inline long rcu_cblist_n_lazy_cbs(struct rcu_cblist *rclp)
+{
+	if (IS_ENABLED(CONFIG_RCU_LAZY))
+		return READ_ONCE(rclp->lazy_len);
+	return 0;
+}
+
+static inline void rcu_cblist_reset_lazy_len(struct rcu_cblist *rclp)
+{
+	if (IS_ENABLED(CONFIG_RCU_LAZY))
+		WRITE_ONCE(rclp->lazy_len, 0);
+}
+
 /* Return number of callbacks in segmented callback list by summing seglen. */
 long rcu_segcblist_n_segment_cbs(struct rcu_segcblist *rsclp);
 
 void rcu_cblist_init(struct rcu_cblist *rclp);
-void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp);
+void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp,
+			bool lazy);
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
 			      struct rcu_cblist *srclp,
-			      struct rcu_head *rhp);
+			      struct rcu_head *rhp,
+			      bool lazy);
 struct rcu_head *rcu_cblist_dequeue(struct rcu_cblist *rclp);
 
 /*
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c25ba442044a..e76fef8031be 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3058,47 +3058,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
 	raw_spin_unlock_rcu_node(rnp);
 }
 
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed.  However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested.  In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section.  On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu().  It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section.  Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
 	static atomic_t doublefrees;
 	unsigned long flags;
@@ -3139,7 +3100,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
 	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -3161,8 +3122,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
 
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't mind the callback being
+ * invoked after very long periods of time on systems without memory pressure
+ * and on systems which are lightly loaded or mostly idle.
+ *
+ * Other than the extra delay in callbacks being invoked, this function is
+ * identical to, and reuses, call_rcu()'s logic. Refer to call_rcu() for more
+ * details about memory ordering and other functionality.
+ */
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu_lazy);
+#endif
+
+static bool force_call_rcu_to_lazy;
+
+void rcu_force_call_rcu_to_lazy(bool force)
+{
+	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
+		WRITE_ONCE(force_call_rcu_to_lazy, force);
+}
+EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.  However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (HZ / 50)
@@ -4056,7 +4095,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
+		     /* wake gp thread */ true));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
 	} else {
@@ -4476,7 +4516,7 @@ void rcutree_migrate_callbacks(int cpu)
 	my_rdp = this_cpu_ptr(&rcu_data);
 	my_rnp = my_rdp->mynode;
 	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false, false));
 	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
 	/* Leverage recent GPs and set GP for new callbacks. */
 	needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2ccf5845957d..7b1ddee6a159 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -267,8 +267,9 @@ struct rcu_data {
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
 					/* For jiffies_till_first_fqs and */
@@ -436,9 +437,10 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, bool lazy, bool wakegp);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index e369efe94fda..55636da76bc2 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
 
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * may be flushed much earlier for a number of other reasons;
+ * however, LAZY_FLUSH_JIFFIES ensures that no lazy callbacks are
+ * left unsubmitted to RCU after that many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -265,6 +290,7 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 {
 	unsigned long flags;
 	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
+	unsigned long mod_jif = 0;
 
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
@@ -272,16 +298,32 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	 * Bypass wakeup overrides previous deferments. In case
 	 * of callback storm, no need to wake up too early.
 	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
-		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
-		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
-	} else {
+	switch (waketype) {
+	case RCU_NOCB_WAKE_LAZY:
+		if (rdp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
+			mod_jif = jiffies_till_flush;
+		break;
+
+	case RCU_NOCB_WAKE_BYPASS:
+		mod_jif = 2;
+		break;
+
+	case RCU_NOCB_WAKE:
+	case RCU_NOCB_WAKE_FORCE:
+		// If the type of deferred wake is "stronger"
+		// than it was before, make it wake up the soonest.
 		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
-			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
-		if (rdp_gp->nocb_defer_wakeup < waketype)
-			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+			mod_jif = 1;
+		break;
 	}
 
+	if (mod_jif)
+		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
+
+	// If new type of wake up is stronger than before, promote.
+	if (rdp_gp->nocb_defer_wakeup < waketype)
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+
 	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
@@ -296,7 +338,7 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, bool lazy)
 {
 	struct rcu_cblist rcl;
 
@@ -310,7 +352,9 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/* The lazy CBs are being flushed, but a new one might be enqueued. */
+	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp, lazy);
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
 	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +370,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy, bool wake_gp)
 {
+	bool ret;
+
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
+
+	if (wake_gp)
+		wake_nocb_gp(rdp, true);
+
+	return ret;
 }
 
 /*
@@ -345,7 +396,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
 }
 
 /*
@@ -367,12 +418,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	unsigned long c;
 	unsigned long cur_gp_seq;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	long n_lazy_cbs = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
 
 	lockdep_assert_irqs_disabled();
 
@@ -414,30 +467,47 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	}
 	WRITE_ONCE(rdp->nocb_nobypass_count, c);
 
-	// If there hasn't yet been all that many ->cblist enqueues
-	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
-	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// If caller passed a non-lazy CB and there hasn't yet been all that
+	// many ->cblist enqueues this jiffy, tell the caller to enqueue it
+// onto ->cblist.  But flush ->nocb_bypass first. Also do so if the total
+// number of CBs (lazy + non-lazy) grows too large, or if there were lazy
+	// CBs previously queued and the current one is non-lazy.
+	//
+	// Note that if the bypass list has lazy CBs, and the main list is
+	// empty, and rhp happens to be non-lazy, then we end up flushing all
+	// the lazy CBs to the main list as well. That's the right thing to do,
+	// since we are kick-starting RCU GP processing anyway for the non-lazy
+	// one, we can just reuse that GP for the already queued-up lazy ones.
+	if ((rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) ||
+	    (!lazy && n_lazy_cbs) ||
+	    (lazy  && n_lazy_cbs >= qhimark)) {
 		rcu_nocb_lock(rdp);
-		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+
+		// This variable helps decide if a wakeup of the rcuog thread
+		// is needed. It is passed to __call_rcu_nocb_wake() by the
+		// caller.  If only lazy CBs were previously queued and this one
+		// is non-lazy, make sure the caller does a wake up.
+		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) ||
+			(!lazy && n_lazy_cbs);
+
 		if (*was_alldone)
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+					    lazy ? TPS("FirstLazyQ") : TPS("FirstQ"));
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, lazy, false));
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
 
 	// If ->nocb_bypass has been used too long or is too full,
 	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
-	    ncbs >= qhimark) {
+	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) || ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
-			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
+			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist) ||
+				(!lazy && n_lazy_cbs);
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-						    TPS("FirstQ"));
+						    lazy ? TPS("FirstLazyQ") : TPS("FirstQ"));
 			WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 			return false; // Caller must enqueue the callback.
 		}
@@ -455,12 +525,18 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	rcu_nocb_wait_contended(rdp);
 	rcu_nocb_bypass_lock(rdp);
 	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	n_lazy_cbs = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
 	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp, lazy);
+
 	if (!ncbs) {
 		WRITE_ONCE(rdp->nocb_bypass_first, j);
-		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
+		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+				    lazy ? TPS("FirstLazyBQ") : TPS("FirstBQ"));
+	} else if (!n_lazy_cbs && lazy) {
+		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstLazyBQ"));
 	}
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
 	if (ncbs) {
@@ -493,7 +569,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 {
 	unsigned long cur_gp_seq;
 	unsigned long j;
-	long len;
+	long len, lazy_len, bypass_len;
 	struct task_struct *t;
 
 	// If we are being polled or there is no kthread, just leave.
@@ -506,9 +582,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 	}
 	// Need to actually to a wakeup.
 	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
 	if (was_alldone) {
 		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
 			/* ... if queue was empty ... */
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp(rdp, false);
@@ -599,8 +682,8 @@ static inline bool nocb_gp_update_state_deoffloading(struct rcu_data *rdp,
  */
 static void nocb_gp_wait(struct rcu_data *my_rdp)
 {
-	bool bypass = false;
-	long bypass_ncbs;
+	bool bypass = false, lazy = false;
+	long bypass_ncbs, lazy_ncbs;
 	int __maybe_unused cpu = my_rdp->cpu;
 	unsigned long cur_gp_seq;
 	unsigned long flags;
@@ -636,6 +719,7 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 */
 	list_for_each_entry_rcu(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp, 1) {
 		bool needwake_state = false;
+		bool flush_bypass = false;
 
 		if (!nocb_gp_enabled_cb(rdp))
 			continue;
@@ -648,22 +732,37 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 			continue;
 		}
 		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
+
+		if (lazy_ncbs &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
 		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
 		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
 		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			if (needwake_state)
 				swake_up_one(&rdp->nocb_state_wq);
 			continue; /* No callbacks here, try next. */
 		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
+		}
+
 		if (bypass_ncbs) {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
 		}
 		rnp = rdp->mynode;
 
@@ -713,12 +812,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If the bypass list only has lazy CBs, add a deferred
+		// lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+					TPS("WakeBypassIsDeferred"));
+		}
 	}
+
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
 		if (gotcbs)
@@ -999,7 +1107,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
 	 */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
 	/*
 	 * Start with invoking rcu_core() early. This way if the current thread
 	 * happens to preempt an ongoing call to rcu_core() in the middle,
@@ -1500,13 +1608,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy, bool wakegp)
 {
 	return true;
 }
 
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	return false;
 }
-- 
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 02/14] rcu: shrinker for lazy rcu
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 01/14] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 03/14] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Vineeth Pillai, Joel Fernandes, paulmck, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu

From: Vineeth Pillai <vineeth@bitbyteword.org>

The shrinker is used to speed up the freeing of memory potentially held
by RCU lazy callbacks. RCU kernel module test cases show this to be
effective; the test is introduced in a later patch.
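
For reviewers, one way to exercise the shrinker by hand (a testing
suggestion, not part of this patch) is to apply slab pressure via
drop_caches, which invokes all registered shrinkers:

	# Invokes registered shrinkers, including lazy_rcu_shrinker,
	# flushing queued lazy callbacks so their memory can be freed.
	echo 2 > /proc/sys/vm/drop_caches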

Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 52 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 55636da76bc2..edb4e59dbf38 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1259,6 +1259,55 @@ int rcu_nocb_cpu_offload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);
 
+static unsigned long
+lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+		count += rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
+	}
+
+	return count ? count : SHRINK_EMPTY;
+}
+
+static unsigned long
+lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long flags;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+		int _count = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);
+
+		if (_count == 0)
+			continue;
+		rcu_nocb_lock_irqsave(rdp, flags);
+		rcu_cblist_reset_lazy_len(&rdp->nocb_bypass);
+		rcu_nocb_unlock_irqrestore(rdp, flags);
+		wake_nocb_gp(rdp, false);
+		sc->nr_to_scan -= _count;
+		count += _count;
+		if (sc->nr_to_scan <= 0)
+			break;
+	}
+	return count ? count : SHRINK_STOP;
+}
+
+static struct shrinker lazy_rcu_shrinker = {
+	.count_objects = lazy_rcu_shrink_count,
+	.scan_objects = lazy_rcu_shrink_scan,
+	.batch = 0,
+	.seeks = DEFAULT_SEEKS,
+};
+
 void __init rcu_init_nohz(void)
 {
 	int cpu;
@@ -1296,6 +1345,9 @@ void __init rcu_init_nohz(void)
 	if (!rcu_state.nocb_is_setup)
 		return;
 
+	if (register_shrinker(&lazy_rcu_shrinker))
+		pr_err("Failed to register lazy_rcu shrinker!\n");
+
 #if defined(CONFIG_NO_HZ_FULL)
 	if (tick_nohz_full_running)
 		cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
-- 
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 03/14] rcuscale: Add laziness and kfree tests
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 01/14] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 02/14] rcu: shrinker for lazy rcu Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 04/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

We add two tests to rcuscale. The first is a startup test to check whether
we are not too lazy or too hard working. The second emulates kfree_rcu()
using call_rcu_lazy() and checks memory pressure. In my testing,
call_rcu_lazy() does well at keeping memory pressure under control, similar
to kfree_rcu().
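
As a usage sketch (the exact invocation is an assumption; the module
parameters are the ones used by this patch), the new mode can be exercised
with:

	# Run the kfree test, emulating kfree_rcu() via call_rcu_lazy():
	modprobe rcuscale kfree_rcu_test=1 kfree_rcu_by_lazy=1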

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcuscale.c | 74 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 277a5bfb37d4..ed5544227f4d 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -95,6 +95,7 @@ torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
+torture_param(int, kfree_rcu_by_lazy, 0, "Use call_rcu_lazy() to emulate kfree_rcu()?");
 
 static char *scale_type = "rcu";
 module_param(scale_type, charp, 0444);
@@ -658,6 +659,14 @@ struct kfree_obj {
 	struct rcu_head rh;
 };
 
+/* Used if doing RCU-kfree'ing via call_rcu_lazy(). */
+static void kfree_rcu_lazy(struct rcu_head *rh)
+{
+	struct kfree_obj *obj = container_of(rh, struct kfree_obj, rh);
+
+	kfree(obj);
+}
+
 static int
 kfree_scale_thread(void *arg)
 {
@@ -695,6 +704,11 @@ kfree_scale_thread(void *arg)
 			if (!alloc_ptr)
 				return -ENOMEM;
 
+			if (kfree_rcu_by_lazy) {
+				call_rcu_lazy(&(alloc_ptr->rh), kfree_rcu_lazy);
+				continue;
+			}
+
 			// By default kfree_rcu_test_single and kfree_rcu_test_double are
 			// initialized to false. If both have the same value (false or true)
 			// both are randomly tested, otherwise only the one with value true
@@ -737,6 +751,9 @@ kfree_scale_cleanup(void)
 {
 	int i;
 
+	if (kfree_rcu_by_lazy)
+		rcu_force_call_rcu_to_lazy(false);
+
 	if (torture_cleanup_begin())
 		return;
 
@@ -766,11 +783,64 @@ kfree_scale_shutdown(void *arg)
 	return -EINVAL;
 }
 
+// Used if doing RCU-kfree'ing via call_rcu_lazy().
+static unsigned long jiffies_at_lazy_cb;
+static struct rcu_head lazy_test1_rh;
+static int rcu_lazy_test1_cb_called;
+static void call_rcu_lazy_test1(struct rcu_head *rh)
+{
+	jiffies_at_lazy_cb = jiffies;
+	WRITE_ONCE(rcu_lazy_test1_cb_called, 1);
+}
+
 static int __init
 kfree_scale_init(void)
 {
 	long i;
 	int firsterr = 0;
+	unsigned long orig_jif, jif_start;
+
+	// If lazy-rcu based kfree'ing is requested, then for kernels that
+	// support it, force all call_rcu() to call_rcu_lazy() so that non-lazy
+	// CBs do not remove laziness of the lazy ones (since the test tries to
+	// stress call_rcu_lazy() for OOM).
+	//
+	// Also, do a quick self-test to ensure laziness is as much as
+	// expected.
+	if (kfree_rcu_by_lazy && !IS_ENABLED(CONFIG_RCU_LAZY)) {
+		pr_alert("CONFIG_RCU_LAZY is disabled, falling back to kfree_rcu() "
+			 "for delayed RCU kfree'ing\n");
+		kfree_rcu_by_lazy = 0;
+	}
+
+	if (kfree_rcu_by_lazy) {
+		/* do a test to check the timeout. */
+		orig_jif = rcu_lazy_get_jiffies_till_flush();
+
+		rcu_force_call_rcu_to_lazy(true);
+		rcu_lazy_set_jiffies_till_flush(2 * HZ);
+		rcu_barrier();
+
+		jif_start = jiffies;
+		jiffies_at_lazy_cb = 0;
+		call_rcu_lazy(&lazy_test1_rh, call_rcu_lazy_test1);
+
+		smp_cond_load_relaxed(&rcu_lazy_test1_cb_called, VAL == 1);
+
+		rcu_lazy_set_jiffies_till_flush(orig_jif);
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
+			pr_alert("ERROR: Lazy CBs are not being lazy as expected!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start > 3 * HZ)) {
+			pr_alert("ERROR: Lazy CBs are being too lazy!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+	}
 
 	kfree_nrealthreads = compute_real(kfree_nthreads);
 	/* Start up the kthreads. */
@@ -783,7 +853,9 @@ kfree_scale_init(void)
 		schedule_timeout_uninterruptible(1);
 	}
 
-	pr_alert("kfree object size=%zu\n", kfree_mult * sizeof(struct kfree_obj));
+	pr_alert("kfree object size=%zu, kfree_rcu_by_lazy=%d\n",
+			kfree_mult * sizeof(struct kfree_obj),
+			kfree_rcu_by_lazy);
 
 	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
 			       GFP_KERNEL);
-- 
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 04/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 03/14] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 05/14] rcutorture: Add test code for call_rcu_lazy() Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the power consumed by the system.

When testing, we found that these paths were invoked often when the
system is not doing anything (screen is ON but otherwise idle).
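
One way to observe this (a reviewer-side sketch using the existing
rcu_callback tracepoint, not part of this patch) is:

	# Count RCU callback enqueues over 10 seconds of idle time:
	echo 1 > /sys/kernel/debug/tracing/events/rcu/rcu_callback/enable
	sleep 10
	grep -c rcu_callback: /sys/kernel/debug/tracing/trace
	echo 0 > /sys/kernel/debug/tracing/events/rcu/rcu_callback/enable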

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 fs/dcache.c     | 4 ++--
 fs/eventpoll.c  | 2 +-
 fs/file_table.c | 2 +-
 fs/inode.c      | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 93f4f5ee07bf..7f51bac390c8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
 		if (likely(atomic_dec_and_test(&p->u.count))) {
-			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
 	}
@@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
 	if (dentry->d_flags & DCACHE_NORCU)
 		__d_free(&dentry->d_u.d_rcu);
 	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
+		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
 }
 
 /*
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 971f98af48ff..57b3f781760c 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -729,7 +729,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
 	 * use of the rbn field.
 	 */
-	call_rcu(&epi->rcu, epi_rcu_free);
+	call_rcu_lazy(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
 
diff --git a/fs/file_table.c b/fs/file_table.c
index 5424e3a8df5f..417f57e9cb30 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -56,7 +56,7 @@ static inline void file_free(struct file *f)
 	security_file_free(f);
 	if (!(f->f_mode & FMODE_NOACCOUNT))
 		percpu_counter_dec(&nr_files);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index bd4da9c5207e..38fe040ddbd6 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
 			return;
 	}
 	inode->free_inode = ops->free_inode;
-	call_rcu(&inode->i_rcu, i_callback);
+	call_rcu_lazy(&inode->i_rcu, i_callback);
 }
 
 /**
-- 
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 05/14] rcutorture: Add test code for call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 04/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 06/14] debug: Toggle lazy at runtime and change flush jiffies Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

We add a new RCU type to test call_rcu_lazy(). This allows us to just
override the '.call' callback. To compensate for the laziness, we force
the flush timeout down to a small number of jiffies. The idea of this test
is to stress the new code paths for stability and to ensure behavior is at
least in parity with, or similar to, that of call_rcu(). The actual check
for the amount of laziness is in another test (rcuscale).
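
For reference, the new flavor is selected like any other rcutorture type;
a typical invocation (assuming the usual rcutorture kvm.sh workflow) uses
the TREE11 scenario added below:

	tools/testing/selftests/rcutorture/bin/kvm.sh --configs TREE11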

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcu.h                              |  1 +
 kernel/rcu/rcutorture.c                       | 60 ++++++++++++++++++-
 kernel/rcu/tree.c                             |  1 +
 .../selftests/rcutorture/configs/rcu/CFLIST   |  1 +
 .../selftests/rcutorture/configs/rcu/TREE11   | 18 ++++++
 .../rcutorture/configs/rcu/TREE11.boot        |  8 +++
 6 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 608f6ab76c7f..aa3243e49506 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -460,6 +460,7 @@ enum rcutorture_type {
 	RCU_TASKS_TRACING_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
+	RCU_LAZY_FLAVOR,
 	INVALID_RCU_FLAVOR
 };
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 7120165a9342..c52cc4c064f9 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -872,6 +872,64 @@ static struct rcu_torture_ops tasks_rude_ops = {
 
 #endif // #else #ifdef CONFIG_TASKS_RUDE_RCU
 
+#ifdef CONFIG_RCU_LAZY
+
+/*
+ * Definitions for lazy RCU torture testing.
+ */
+static unsigned long orig_jiffies_till_flush;
+
+static void rcu_sync_torture_init_lazy(void)
+{
+	rcu_sync_torture_init();
+
+	orig_jiffies_till_flush = rcu_lazy_get_jiffies_till_flush();
+	rcu_lazy_set_jiffies_till_flush(50);
+}
+
+static void rcu_lazy_cleanup(void)
+{
+	rcu_lazy_set_jiffies_till_flush(orig_jiffies_till_flush);
+}
+
+static struct rcu_torture_ops rcu_lazy_ops = {
+	.ttype			= RCU_LAZY_FLAVOR,
+	.init			= rcu_sync_torture_init_lazy,
+	.cleanup		= rcu_lazy_cleanup,
+	.readlock		= rcu_torture_read_lock,
+	.read_delay		= rcu_read_delay,
+	.readunlock		= rcu_torture_read_unlock,
+	.readlock_held		= torture_readlock_not_held,
+	.get_gp_seq		= rcu_get_gp_seq,
+	.gp_diff		= rcu_seq_diff,
+	.deferred_free		= rcu_torture_deferred_free,
+	.sync			= synchronize_rcu,
+	.exp_sync		= synchronize_rcu_expedited,
+	.get_gp_state		= get_state_synchronize_rcu,
+	.start_gp_poll		= start_poll_synchronize_rcu,
+	.poll_gp_state		= poll_state_synchronize_rcu,
+	.cond_sync		= cond_synchronize_rcu,
+	.call			= call_rcu_lazy,
+	.cb_barrier		= rcu_barrier,
+	.fqs			= rcu_force_quiescent_state,
+	.stats			= NULL,
+	.gp_kthread_dbg		= show_rcu_gp_kthreads,
+	.check_boost_failed	= rcu_check_boost_fail,
+	.stall_dur		= rcu_jiffies_till_stall_check,
+	.irq_capable		= 1,
+	.can_boost		= IS_ENABLED(CONFIG_RCU_BOOST),
+	.extendables		= RCUTORTURE_MAX_EXTEND,
+	.name			= "rcu_lazy"
+};
+
+#define LAZY_OPS &rcu_lazy_ops,
+
+#else // #ifdef CONFIG_RCU_LAZY
+
+#define LAZY_OPS
+
+#endif // #else #ifdef CONFIG_RCU_LAZY
+
 
 #ifdef CONFIG_TASKS_TRACE_RCU
 
@@ -3145,7 +3203,7 @@ rcu_torture_init(void)
 	unsigned long gp_seq = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops, &busted_srcud_ops,
-		TASKS_OPS TASKS_RUDE_OPS TASKS_TRACING_OPS
+		TASKS_OPS TASKS_RUDE_OPS TASKS_TRACING_OPS LAZY_OPS
 		&trivial_ops,
 	};
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index e76fef8031be..67026382dc21 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -600,6 +600,7 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 {
 	switch (test_type) {
 	case RCU_FLAVOR:
+	case RCU_LAZY_FLAVOR:
 		*flags = READ_ONCE(rcu_state.gp_flags);
 		*gp_seq = rcu_seq_current(&rcu_state.gp_seq);
 		break;
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index 98b6175e5aa0..609c3370616f 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -5,6 +5,7 @@ TREE04
 TREE05
 TREE07
 TREE09
+TREE11
 SRCU-N
 SRCU-P
 SRCU-T
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
new file mode 100644
index 000000000000..436013f3e015
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
@@ -0,0 +1,18 @@
+CONFIG_SMP=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+#CHECK#CONFIG_PREEMPT_RCU=y
+CONFIG_HZ_PERIODIC=n
+CONFIG_NO_HZ_IDLE=y
+CONFIG_NO_HZ_FULL=n
+CONFIG_RCU_TRACE=y
+CONFIG_HOTPLUG_CPU=y
+CONFIG_MAXSMP=y
+CONFIG_CPUMASK_OFFSTACK=y
+CONFIG_RCU_NOCB_CPU=y
+CONFIG_DEBUG_LOCK_ALLOC=n
+CONFIG_RCU_BOOST=n
+CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
+CONFIG_RCU_EXPERT=y
+CONFIG_RCU_LAZY=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
new file mode 100644
index 000000000000..9b6f720d4ccd
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
@@ -0,0 +1,8 @@
+maxcpus=8 nr_cpus=43
+rcutree.gp_preinit_delay=3
+rcutree.gp_init_delay=3
+rcutree.gp_cleanup_delay=3
+rcu_nocbs=0-7
+rcutorture.torture_type=rcu_lazy
+rcutorture.nocbs_nthreads=8
+rcutorture.fwd_progress=0
-- 
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 06/14] debug: Toggle lazy at runtime and change flush jiffies
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 05/14] rcutorture: Add test code for call_rcu_lazy() Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 07/14] cred: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

Enable/disable this feature by writing 1 or 0 to /proc/sys/vm/rcu_lazy.

Change the value of /proc/sys/vm/rcu_lazy_jiffies to change the maximum
duration before a flush.
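
For example (sysctl paths as added by this patch):

	# Disable, then re-enable, lazy batching at runtime:
	echo 0 > /proc/sys/vm/rcu_lazy
	echo 1 > /proc/sys/vm/rcu_lazy
	# Flush lazy callbacks after at most 5000 jiffies:
	echo 5000 > /proc/sys/vm/rcu_lazy_jiffies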

Do not merge; this is for debugging by reviewers only.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched/sysctl.h |  3 +++
 kernel/rcu/tree_nocb.h       |  9 +++++++++
 kernel/sysctl.c              | 17 +++++++++++++++++
 3 files changed, 29 insertions(+)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 2cd928f15df6..54610f9cd962 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -14,6 +14,9 @@ extern unsigned long sysctl_hung_task_timeout_secs;
 enum { sysctl_hung_task_timeout_secs = 0 };
 #endif
 
+extern unsigned int sysctl_rcu_lazy;
+extern unsigned int sysctl_rcu_lazy_jiffies;
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index edb4e59dbf38..16621b32de46 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -266,6 +266,9 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 #define LAZY_FLUSH_JIFFIES (10 * HZ)
 unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
 
+unsigned int sysctl_rcu_lazy_jiffies = LAZY_FLUSH_JIFFIES;
+unsigned int sysctl_rcu_lazy = 1;
+
 #ifdef CONFIG_RCU_LAZY
 // To be called only from test code.
 void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
@@ -292,6 +295,9 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
 	unsigned long mod_jif = 0;
 
+	/* debug: not for merge */
+	rcu_lazy_set_jiffies_till_flush(sysctl_rcu_lazy_jiffies);
+
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
@@ -697,6 +703,9 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	unsigned long wait_gp_seq = 0; // Suppress "use uninitialized" warning.
 	bool wasempty = false;
 
+	/* debug: not for merge */
+	rcu_lazy_set_jiffies_till_flush(sysctl_rcu_lazy_jiffies);
+
 	/*
 	 * Each pass through the following loop checks for CBs and for the
 	 * nearest grace period (if any) to wait for next.  The CB kthreads
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index b00f92df0af5..bbe25d635dc0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2450,6 +2450,23 @@ static struct ctl_table vm_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif
+#ifdef CONFIG_RCU_LAZY
+	{
+		.procname	= "rcu_lazy",
+		.data		= &sysctl_rcu_lazy,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "rcu_lazy_jiffies",
+		.data		= &sysctl_rcu_lazy_jiffies,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
+
 	{ }
 };
 
-- 
2.37.2.609.g9ff673ca1a-goog



* [PATCH v4 07/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 06/14] debug: Toggle lazy at runtime and change flush jiffies Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 08/14] security: " Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/cred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index e10c15f51c1f..c7cb2e3ac73a 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
 	if (cred->non_rcu)
 		put_cred_rcu(&cred->rcu);
 	else
-		call_rcu(&cred->rcu, put_cred_rcu);
+		call_rcu_lazy(&cred->rcu, put_cred_rcu);
 }
 EXPORT_SYMBOL(__put_cred);
 
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 08/14] security: Move call_rcu() to call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 07/14] cred: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 09/14] net/core: " Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 security/security.c    | 2 +-
 security/selinux/avc.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index ea7163c20751..d76f4951b2bd 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1053,7 +1053,7 @@ void security_inode_free(struct inode *inode)
 	 * The inode will be freed after the RCU grace period too.
 	 */
 	if (inode->i_security)
-		call_rcu((struct rcu_head *)inode->i_security,
+		call_rcu_lazy((struct rcu_head *)inode->i_security,
 				inode_free_by_rcu);
 }
 
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index 9a43af0ebd7d..381f046d820f 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -442,7 +442,7 @@ static void avc_node_free(struct rcu_head *rhead)
 static void avc_node_delete(struct selinux_avc *avc, struct avc_node *node)
 {
 	hlist_del_rcu(&node->list);
-	call_rcu(&node->rhead, avc_node_free);
+	call_rcu_lazy(&node->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
@@ -458,7 +458,7 @@ static void avc_node_replace(struct selinux_avc *avc,
 			     struct avc_node *new, struct avc_node *old)
 {
 	hlist_replace_rcu(&old->list, &new->list);
-	call_rcu(&old->rhead, avc_node_free);
+	call_rcu_lazy(&old->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 09/14] net/core: Move call_rcu() to call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 08/14] security: " Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 10/14] kernel: Move various core kernel usages " Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 net/core/dst.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index d16c2c9bfebd..68c240a4a0d7 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -174,7 +174,7 @@ void dst_release(struct dst_entry *dst)
 			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
 					     __func__, dst, newrefcnt);
 		if (!newrefcnt)
-			call_rcu(&dst->rcu_head, dst_destroy_rcu);
+			call_rcu_lazy(&dst->rcu_head, dst_destroy_rcu);
 	}
 }
 EXPORT_SYMBOL(dst_release);
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 10/14] kernel: Move various core kernel usages to call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 09/14] net/core: " Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 11/14] lib: Move call_rcu() " Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/exit.c              | 2 +-
 kernel/pid.c               | 2 +-
 kernel/time/posix-timers.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 853c6a943fce..14cde19ff4c2 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -180,7 +180,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 void put_task_struct_rcu_user(struct task_struct *task)
 {
 	if (refcount_dec_and_test(&task->rcu_users))
-		call_rcu(&task->rcu, delayed_put_task_struct);
+		call_rcu_lazy(&task->rcu, delayed_put_task_struct);
 }
 
 void release_task(struct task_struct *p)
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..5a5144519d70 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -153,7 +153,7 @@ void free_pid(struct pid *pid)
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
-	call_rcu(&pid->rcu, delayed_put_pid);
+	call_rcu_lazy(&pid->rcu, delayed_put_pid);
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 06d1236b3804..63489c4070cd 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -485,7 +485,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	call_rcu(&tmr->rcu, k_itimer_rcu_free);
+	call_rcu_lazy(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static int common_timer_create(struct k_itimer *new_timer)
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 11/14] lib: Move call_rcu() to call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 10/14] kernel: Move various core kernel usages " Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 12/14] i915: " Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

Move radix-tree and xarray to call_rcu_lazy(). This is required to
prevent callbacks from triggering the RCU machinery too quickly and too
often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 lib/radix-tree.c | 2 +-
 lib/xarray.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..1526dc9e1d93 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -305,7 +305,7 @@ void radix_tree_node_rcu_free(struct rcu_head *head)
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
diff --git a/lib/xarray.c b/lib/xarray.c
index ea9ce1f0b386..230abc8045fe 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -257,7 +257,7 @@ static void xa_node_free(struct xa_node *node)
 {
 	XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
 	node->array = XA_RCU_FREE;
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 12/14] i915: Move call_rcu() to call_rcu_lazy()
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 11/14] lib: Move call_rcu() " Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 13/14] fork: Move thread_stack_free_rcu to call_rcu_lazy Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index 06b1b188ce5a..74f4b6e707c2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -343,7 +343,7 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		__i915_gem_free_object(obj);
 
 		/* But keep the pointer alive for RCU-protected lookups */
-		call_rcu(&obj->rcu, __i915_gem_free_object_rcu);
+		call_rcu_lazy(&obj->rcu, __i915_gem_free_object_rcu);
 		cond_resched();
 	}
 }
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 13/14] fork: Move thread_stack_free_rcu to call_rcu_lazy
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (11 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 12/14] i915: " Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-19 20:48 ` [PATCH v4 14/14] rcu/tree: Move trace_rcu_callback() before bypassing Joel Fernandes (Google)
  2022-08-29 13:40 ` [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Frederic Weisbecker
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/fork.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index c9a2e19d67e5..a4535cf5446f 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -227,7 +227,7 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 	struct vm_stack *vm_stack = tsk->stack;
 
 	vm_stack->stack_vm_area = tsk->stack_vm_area;
-	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+	call_rcu_lazy(&vm_stack->rcu, thread_stack_free_rcu);
 }
 
 static int free_vm_stack_cache(unsigned int cpu)
@@ -354,7 +354,7 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 {
 	struct rcu_head *rh = tsk->stack;
 
-	call_rcu(rh, thread_stack_free_rcu);
+	call_rcu_lazy(rh, thread_stack_free_rcu);
 }
 
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
@@ -389,7 +389,7 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 {
 	struct rcu_head *rh = tsk->stack;
 
-	call_rcu(rh, thread_stack_free_rcu);
+	call_rcu_lazy(rh, thread_stack_free_rcu);
 }
 
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* [PATCH v4 14/14] rcu/tree: Move trace_rcu_callback() before bypassing
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (12 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 13/14] fork: Move thread_stack_free_rcu to call_rcu_lazy Joel Fernandes (Google)
@ 2022-08-19 20:48 ` Joel Fernandes (Google)
  2022-08-29 13:40 ` [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Frederic Weisbecker
  14 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes (Google) @ 2022-08-19 20:48 UTC (permalink / raw)
  To: linux-kernel
  Cc: Joel Fernandes (Google),
	paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt, rcu,
	vineeth

If any CB is queued onto the bypass list, then trace_rcu_callback() does
not show it. This makes it unclear when a callback was actually
queued, as you only end up getting a trace_rcu_invoke_callback() trace.
Fix it by moving trace_rcu_callback() before rcu_nocb_try_bypass().

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 67026382dc21..6e14f0257669 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3101,10 +3101,7 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
-		return; // Enqueued onto ->nocb_bypass, so just leave.
-	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
-	rcu_segcblist_enqueue(&rdp->cblist, head);
+
 	if (__is_kvfree_rcu_offset((unsigned long)func))
 		trace_rcu_kvfree_callback(rcu_state.name, head,
 					 (unsigned long)func,
@@ -3113,6 +3110,11 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
+		return; // Enqueued onto ->nocb_bypass, so just leave.
+	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
+	rcu_segcblist_enqueue(&rdp->cblist, head);
+
 	trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued"));
 
 	/* Go handle any RCU core processing required. */
-- 
2.37.2.609.g9ff673ca1a-goog


^ permalink raw reply related	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (13 preceding siblings ...)
  2022-08-19 20:48 ` [PATCH v4 14/14] rcu/tree: Move trace_rcu_callback() before bypassing Joel Fernandes (Google)
@ 2022-08-29 13:40 ` Frederic Weisbecker
  2022-08-29 16:45   ` Joel Fernandes
  14 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-29 13:40 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: linux-kernel, paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
> and rcuscale tests.
> 
> Fixes on top of v3:
> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
> - Fixed trace_rcu_callback tracepoint
> 
> I tested power previously [1], I am in the process of testing power again but I
> wanted share my latest code as others who are testing power as well could use
> the above fixes.

Your patch is very likely to be _generally_ useful and therefore,
the more I look into this, the more I wonder if it is a good idea to rely on
bypass at all, let alone NOCB. Of course in the long term the goal is to have
bypass working without NOCB but why even bother implementing it for nocb
in the first place?

Several highlights:

1) NOCB is most often needed for nohz_full and the latter has terrible power
management. The CPU 0 is active all the time there.

2) NOCB without nohz_full has extremely rare usecase (RT niche:
https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
)

3) NOCB implies performance issues.

4) We are mixing up two very different things in a single list of callbacks:
   lazy callbacks and flooding callbacks, as a result we are adding lots of
   off-topic corner cases all around:
     * a separate lazy len field to struct rcu_cblist whose purpose is much more
       general than just bypass/lazy (sketched below)
     * "lazy" specialized parameters to general purpose cblist management
       functions

5) This is further complicating bypass core code, nocb timer management, core
   nocb group management, all of which being already very complicated.

6) The !NOCB implementation is going to be very different
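
To make 4) concrete, the shape of the change being debated is roughly the
following (the new field shown only for illustration):

	struct rcu_cblist {
		struct rcu_head *head;	/* First callback, if any. */
		struct rcu_head **tail;	/* Where to enqueue the next callback. */
		long len;		/* Number of queued callbacks. */
		long lazy_len;		/* New: number of lazy callbacks; only */
					/* meaningful for the NOCB bypass list. */
	};

Every user of the general-purpose rcu_cblist then carries a field that only
the bypass/lazy path cares about.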

Ok I can admit one counter argument in favour of using NO_CB:

-1) The scheduler can benefit from a wake CPU to run the callbacks on behalf of a bunch
of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
with callbacks that can actually wait...


So here is a proposal: how about forgetting NOCB for now and instead add a new
RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
Then ignore that segment until some timer expiry has been met or the CPU is
known to be busy? Probably some tiny bits need to be tweaked in segcblist
management functions but probably not that much. And also make sure that entrain()
queues to RCU_LAZY_TAIL.

Then the only difference in the case of NOCB is that we add a new timer to the
nocb group leader instead of a local timer in !NOCB.
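
A rough sketch of the proposal, in which only RCU_LAZY_TAIL and the flush
helper are hypothetical names (the other segment definitions mirror the
existing rcu_segcblist.h ones):

	#define RCU_DONE_TAIL		0	/* Also RCU_WAIT head. */
	#define RCU_WAIT_TAIL		1	/* Also RCU_NEXT_READY head. */
	#define RCU_NEXT_READY_TAIL	2	/* Also RCU_NEXT head. */
	#define RCU_NEXT_TAIL		3	/* Also RCU_LAZY head. */
	#define RCU_LAZY_TAIL		4	/* New: lazy CBs parked here. */
	#define RCU_CBLIST_NSEGS	5

	/* !NOCB case: per-CPU timer to eventually flush the lazy segment. */
	static DEFINE_PER_CPU(struct timer_list, rcu_lazy_flush_timer);

	static void rcu_lazy_flush_timer_fn(struct timer_list *t)
	{
		/*
		 * Hypothetical helper: splice RCU_LAZY_TAIL callbacks into
		 * RCU_NEXT_TAIL so that the normal grace-period machinery
		 * starts processing them.
		 */
		rcu_segcblist_promote_lazy(&this_cpu_ptr(&rcu_data)->cblist);
	}

The timer would be armed when the first lazy callback is queued, e.g.
mod_timer(this_cpu_ptr(&rcu_lazy_flush_timer), jiffies + jiffies_till_flush).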

Now of course I'm certainly overlooking obvious things as always :)

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 13:40 ` [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Frederic Weisbecker
@ 2022-08-29 16:45   ` Joel Fernandes
  2022-08-29 19:46     ` Frederic Weisbecker
  2022-08-29 19:57     ` Paul E. McKenney
  0 siblings, 2 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-08-29 16:45 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: linux-kernel, paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

Hi Frederic,

On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
>> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
>> and rcuscale tests.
>>
>> Fixes on top of v3:
>> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
>>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
>> - Fixed trace_rcu_callback tracepoint
>>
>> I tested power previously [1], I am in the process of testing power again but I
>> wanted share my latest code as others who are testing power as well could use
>> the above fixes.
> 
> Your patch is very likely to be _generally_ useful and therefore,
> the more I look into this, the more I wonder if it is a good idea to rely on
> bypass at all, let alone NOCB. Of course in the long term the goal is to have
> bypass working without NOCB but why even bother implementing it for nocb
> in the first place?

This was discussed with Paul [1]. Quoting:

----
Joel:
>> Also, does doing so not prevent usage of lazy CBs on systems without
>> NOCB? So if we want to future-proof this, I guess that might not be a
>> good decision.
>
Paul:
> True enough, but would this future actually arrive?  After all, if
> someone cared enough about energy efficiency to use call_rcu_lazy(),
> why wouldn't they also offload callbacks?

Joel: I am not sure, but I also don't mind making it depend on NOCB for now
(see below).

[1] https://www.spinics.net/lists/rcu/msg07908.html
----

While I agree with you that perhaps making it more generic is better, this did
take a significant amount of time, testing and corner case hunting to come up
with, and v5 is also in the works so I'd appreciate if we can do it the
bypass-way and optimize later. Arguably the bypass way is quite simple and
allows us to leverage its effects of rcu_barrier and such. And the API will not
change.

> Several highlights:
> 
> 1) NOCB is most often needed for nohz_full and the latter has terrible power
> management. The CPU 0 is active all the time there.

I see. We don't use nohz_full much. NOCB itself gives good power improvement.

> 2) NOCB without nohz_full has extremely rare usecase (RT niche:
> https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> )

Really? Android has been using it for a long time. It seems to be quite popular
in the battery-powered space.

> 3) NOCB implies performance issues.

Which kinds? Boot times are slightly worse, but I'm guessing that has to do
with the extra scheduling overhead of the extra threads, which is usually not a
problem except that RCU is used in the critical path of boot up (on ChromeOS).

> 4) We are mixing up two very different things in a single list of callbacks:
>    lazy callbacks and flooding callbacks, as a result we are adding lots of
>    off-topic corner cases all around:
>      * a separate lazy len field to struct rcu_cblist whose purpose is much more
>        general than just bypass/lazy
>      * "lazy" specialized parameters to general purpose cblist management
>        functions

I think just 1 or 2 functions have a new lazy param. It didn't seem too
intrusive to me.

> 5) This is further complicating bypass core code, nocb timer management, core
>    nocb group management, all of which being already very complicated.

True, I agree, a few more cases to handle for sure, but I think I got them all
now (hopefully).

> 6) The !NOCB implementation is going to be very different
> 
> Ok I can admit one counter argument in favour of using NO_CB:
> 
> -1) The scheduler can benefit from a wake CPU to run the callbacks on behalf of a bunch
> of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
> with callbacks that can actually wait...

Yeah that's huge. Significant amount of power improvement seems to come from
idle CPUs not being disturbed and their corresponding timer ticks turned off for
longer periods. That's experimentally confirmed (NO_CB giving significant power
improvement on battery-power systems as compared to !NO_CB).

> 
> So here is a proposal: how about forgetting NOCB for now and instead add a new
> RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> Then ignore that segment until some timer expiry has been met or the CPU is
> known to be busy? Probably some tiny bits need to be tweaked in segcblist
> management functions but probably not that much. And also make sure that entrain()
> queues to RCU_LAZY_TAIL.
> 
> Then the only difference in the case of NOCB is that we add a new timer to the
> nocb group leader instead of a local timer in !NOCB.

It sounds reasonable, but I'll go with Paul on the usecase argument - who would
actually care about lazy CBs outside of power, and would those guys ever use
!NO_CB if they cared about power / battery?

Thanks,

 - Joel




^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 16:45   ` Joel Fernandes
@ 2022-08-29 19:46     ` Frederic Weisbecker
  2022-08-29 20:31       ` Paul E. McKenney
  2022-08-29 20:36       ` Joel Fernandes
  2022-08-29 19:57     ` Paul E. McKenney
  1 sibling, 2 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-29 19:46 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, paulmck, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> Hi Frederic,
> 
> On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
> >> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
> >> and rcuscale tests.
> >>
> >> Fixes on top of v3:
> >> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
> >>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
> >> - Fixed trace_rcu_callback tracepoint
> >>
> >> I tested power previously [1], I am in the process of testing power again but I
> >> wanted share my latest code as others who are testing power as well could use
> >> the above fixes.
> > 
> > Your patch is very likely to be _generally_ useful and therefore,
> > the more I look into this, the more I wonder if it is a good idea to rely on
> > bypass at all, let alone NOCB. Of course in the long term the goal is to have
> > bypass working without NOCB but why even bother implementing it for nocb
> > in the first place?
> 
> This was discussed with Paul [1]. Quoting:
> 
> ----
> Joel:
> >> Also, does doing so not prevent usage of lazy CBs on systems without
> >> NOCB? So if we want to future-proof this, I guess that might not be a
> >> good decision.
> >
> Paul:
> > True enough, but would this future actually arrive?  After all, if
> > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > why wouldn't they also offload callbacks?
> 
> Joel: I am not sure, but I also don't mind making it depend on NOCB for now
> (see below).
> 
> [1] https://www.spinics.net/lists/rcu/msg07908.html
> ----
> 
> While I agree with you that perhaps making it more generic is better, this did
> take a significant amount of time, testing and corner case hunting to come up
> with, and v5 is also in the works so I'd appreciate if we can do it the
> bypass-way and optimize later. Arguably the bypass way is quite simple and
> allows us to leverage its effects of rcu_barrier and such. And the API will not
> change.

Keep in mind that if we later need to rewrite the whole thing in order to have a
generic approach, this will take even more time in the long run.

> > 2) NOCB without nohz_full has extremely rare usecase (RT niche:
> > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > )
> 
> Really? Android has been using it for a long time. It seems to be quite popular
> in the battery-powered space.

It's really sad that this is the first time I hear about that. I've been working
on this code for years now without this usecase in mind. And yet it's fundamental.

I asked around several times about usecases of rcu_nocbs= other than nohz_full=
and nobody reported that. I can hardly even google a significant link
between power saving and rcu_nocbs=

If this is really used that way for a long time then it's a cruel disconnection
between users and developers.

> > 3) NOCB implies performance issues.
> 
> Which kinds? Boot times are slightly worse, but I'm guessing that has to do
> with the extra scheduling overhead of the extra threads, which is usually not a
> problem except that RCU is used in the critical path of boot up (on ChromeOS).

I never measured it myself, but executing callbacks on other CPUs, with
context switches and locking, can only involve significant performance issues if callbacks
are frequent. So it's a tradeoff between power and performance.

> 
> > 4) We are mixing up two very different things in a single list of callbacks:
> >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> >    off-topic corner cases all around:
> >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> >        general than just bypass/lazy
> >      * "lazy" specialized parameters to general purpose cblist management
> >        functions
> 
> I think just 1 or 2 functions have a new lazy param. It didn't seem too
> intrusive to me.

What bothers me is that struct cblist has a general purpose and we are adding a field
and a parameter that is relevant to only one specialized user.


> > 5) This is further complicating bypass core code, nocb timer management, core
> >    nocb group management, all of which being already very complicated.
> 
> True, I agree, a few more cases to handle for sure, but I think I got them all
> now (hopefully).

Now I'm worried about maintainability. Hence why I'd rather see a generic code
for them all if possible.

> > 6) The !NOCB implementation is going to be very different
> > 
> > Ok I can admit one counter argument in favour of using NO_CB:
> > 
> > -1) The scheduler can benefit from a wake CPU to run the callbacks on behalf of a bunch
> > of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
> > with callbacks that can actually wait...
> 
> Yeah that's huge. Significant amount of power improvement seems to come from
> idle CPUs not being disturbed and their corresponding timer ticks turned off for
> longer periods. That's experimentally confirmed (NO_CB giving significant power
> improvement on battery-power systems as compared to !NO_CB).

It's good news to hear that nocbs is used way beyond its initial purpose.
But still very sad to hear about that several years late.

> > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > Then ignore that segment until some timer expiry has been met or the CPU is
> > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > management functions but probably not that much. And also make sure that entrain()
> > queues to RCU_LAZY_TAIL.
> > 
> > Then the only difference in the case of NOCB is that we add a new timer to the
> > nocb group leader instead of a local timer in !NOCB.
> 
> It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> actually care about lazy CBs outside of power, and would those guys ever use
> !NO_CB if they cared about power / battery?

_Everybody_ cares about power. Those who don't yet will very soon ;-)

And given the numbers you provided with your measurements, I bet this will
be significant with !NOCB as well. This is not only delaying callbacks execution,
this also reduces the frequency of grace periods, and that impact should be
quite visible.

Note I'm not strictly opposed to the current approach. But I can't say I'm
comfortable with it.

Can we do a simple test? Would it be possible to affine every rcuo%c/%d kthread
to the corresponding CPU%d? For example affine rcuop/1 to CPU 1, rcuop/2 to
CPU2, etc... And then relaunch your measurements on top of that?

The point is that having the callback kthreads affined to their corresponding
CPUs should defeat the power saving advantages of rcu_nocbs=, back to roughly
a !NOCB behaviour powerwise (except we have context switches). If you find good
numbers with this setup then you'll find good numbers with !NOCB.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 16:45   ` Joel Fernandes
  2022-08-29 19:46     ` Frederic Weisbecker
@ 2022-08-29 19:57     ` Paul E. McKenney
  2022-08-30 10:43       ` Frederic Weisbecker
  1 sibling, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-29 19:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, linux-kernel, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> Hi Frederic,
> 
> On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
> >> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
> >> and rcuscale tests.
> >>
> >> Fixes on top of v3:
> >> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
> >>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
> >> - Fixed trace_rcu_callback tracepoint
> >>
> >> I tested power previously [1], I am in the process of testing power again but I
> >> wanted share my latest code as others who are testing power as well could use
> >> the above fixes.
> > 
> > Your patch is very likely to be _generally_ useful and therefore,
> > the more I look into this, the more I wonder if it is a good idea to rely on
> > bypass at all, let alone NOCB. Of course in the long term the goal is to have
> > bypass working without NOCB but why even bother implementing it for nocb
> > in the first place?
> 
> This was discussed with Paul [1]. Quoting:
> 
> ----
> Joel:
> >> Also, does doing so not prevent usage of lazy CBs on systems without
> >> NOCB? So if we want to future-proof this, I guess that might not be a
> >> good decision.
> >
> Paul:
> > True enough, but would this future actually arrive?  After all, if
> > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > why wouldn't they also offload callbacks?
> 
> Joel: I am not sure, but I also don't mind making it depend on NOCB for now
> (see below).
> 
> [1] https://www.spinics.net/lists/rcu/msg07908.html
> ----
> 
> While I agree with you that perhaps making it more generic is better, this did
> take a significant amount of time, testing and corner case hunting to come up
> with, and v5 is also in the works so I'd appreciate if we can do it the
> bypass-way and optimize later. Arguably the bypass way is quite simple and
> allows us to leverage its effects of rcu_barrier and such. And the API will not
> change.

Just confirming this conversation, on the hopefully unlikely off-chance
that there is any doubt.  ;-)

That said, if there is some compelling use case that is not addressed
by rcu_nocbs, keeping in mind that these can now be made dynamic, then
some adjustment will of course be needed.

> > Several highlights:
> > 
> > 1) NOCB is most often needed for nohz_full and the latter has terrible power
> > management. The CPU 0 is active all the time there.
> 
> I see. We don't use nohz_full much. NOCB itself gives good power improvement.
> 
> > 2) NOCB without nohz_full has extremely rare usecase (RT niche:
> > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > )
> 
> Really? Android has been using it for a long time. It seems to be quite popular
> in the battery-powered space.
> 
> > 3) NOCB implies performance issues.
> 
> Which kinds? Boot times are slightly worse, but I'm guessing that has to do
> with the extra scheduling overhead of the extra threads, which is usually not a
> problem except that RCU is used in the critical path of boot up (on ChromeOS).

Back in 2010, Rik van Riel reported significant slowdowns for some types
of Java workloads, but for normal servers, not Android or ChromeOS.
I have no idea whether similar slowdowns exist today.  But if there is
no performance advantage to non-offloaded callbacks, we should first make
offloading the default, and if there are no complaints after a few years,
remove the non-offloaded case completely.

My guess is that at the very least, scheduler corner cases will force
us to keep non-offloaded callbacks, but you never know.  In any case,
a wakeup is considerably more expensive than a non-atomic OR of a bit
in a per-CPU variable, so there is some chance that offloading causes
some important workloads considerable performance degradation.

> > 4) We are mixing up two very different things in a single list of callbacks:
> >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> >    off-topic corner cases all around:
> >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> >        general than just bypass/lazy
> >      * "lazy" specialized parameters to general purpose cblist management
> >        functions
> 
> I think just 1 or 2 functions have a new lazy param. It didn't seem too
> intrusive to me.

It has been getting simpler!  ;-)

I bet that the lazy_len field can be a boolean and independent of
->cblist, and that doing that would simplify things at least a little bit.
But, yes, an all-lazy indicator of some sort would still need to exist.

> > 5) This is further complicating bypass core code, nocb timer management, core
> >    nocb group management, all of which being already very complicated.
> 
> True, I agree, a few more cases to handle for sure, but I think I got them all
> now (hopefully).

If we do need lazy callbacks on non-offloaded CPUs, there will need to
be changes to both the bypass logic (possibly just those changes that
Joel already has, but Murphy might disagree) and to the ->cblist logic.
At the very least, the wakeup logic would need adjustment from current
-rcu and there would still need to be some way of tracking whether or
not all the callbacks in the bypass list are lazy.

> > 6) The !NOCB implementation is going to be very different
> > 
> > Ok I can admit one counter argument in favour of using NO_CB:
> > 
> > -1) The scheduler can benefit from a wake CPU to run the callbacks on behalf of a bunch
> > of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
> > with callbacks that can actually wait...

You lost me on this one.  Having a callback invoked on a non-idle CPU
should save significant power without significant delay in callback
invocation.  What am I missing here?

> Yeah that's huge. Significant amount of power improvement seems to come from
> idle CPUs not being disturbed and their corresponding timer ticks turned off for
> longer periods. That's experimentally confirmed (NO_CB giving significant power
> improvement on battery-power systems as compared to !NO_CB).
> 
> > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > Then ignore that segment until some timer expiry has been met or the CPU is
> > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > management functions but probably not that much. And also make sure that entrain()
> > queues to RCU_LAZY_TAIL.
> > 
> > Then the only difference in the case of NOCB is that we add a new timer to the
> > nocb group leader instead of a local timer in !NOCB.

It is certainly good to look into alternatives!  Especially if this has
somehow broken (de)offloading.  (Not seeing it in my testing, but then
again, I have not yet tested this series all that much.)

How does the separate RCU_LAZY_TAIL segment help?  I would think
that you would instead want an all-lazy flag on each of the existing
RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL segments.  After all, if there is
even one non-lazy callback in either segment, we need the corresponding
grace period to run sooner rather than later.  And if we are running a
given grace period anyway, it costs little to handle the lazy callbacks
while we are at it.
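
For illustration, one way to write that down (the field and the helper here
are made up; rcu_segcblist_get_seglen() is the existing per-segment length
accessor):

	struct rcu_segcblist {
		/* ... existing fields ... */
		unsigned long all_lazy_segs;	/* BIT(seg) set iff all CBs in seg are lazy. */
	};

	/* Does this segment give us a reason to start a grace period soon? */
	static bool rcu_segcblist_seg_urgent(struct rcu_segcblist *rsclp, int seg)
	{
		return rcu_segcblist_get_seglen(rsclp, seg) &&
		       !(rsclp->all_lazy_segs & BIT(seg));
	}

Queueing a non-lazy callback into a segment would clear its bit, and the
laziness timer would clear all the bits on expiry, motivating a grace period
either way.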

Or is there a use case where it helps a lot to defer lazy callbacks that
could have been handled by a grace period that needed to happen anyway,
due to the presence of non-lazy callbacks?  I am having a hard time coming
up with one, but perhaps that is a failure of imagination on my part.

There would still need to be changes to the bypass code because NOCB is
what gets both Android and ChromeOS big power savings.

And yes, no matter what, rcu_barrier_entrain() needs to motivate any lazy
callbacks.  Currently, this falls out from the flushing of the bypass.
Presumably, offloading and deoffloading could also take advantage of
bypass flushing.

And I have no idea whether it would make sense for the NOCB and !NOCB
case to share a laziness-motivation timer.

> It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> actually care about lazy CBs outside of power, and would those guys ever use
> !NO_CB if they cared about power / battery?

And if they are not using NOCB, does call_rcu_lazy() actually help?

But again, if call_rcu_lazy() needs to handle the !NOCB case, then it
needs to handle the !NOCB case.  However, given ChromeOS and Android,
we know that call_rcu_lazy() needs to handle the NOCB case regardless.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 19:46     ` Frederic Weisbecker
@ 2022-08-29 20:31       ` Paul E. McKenney
  2022-08-29 20:54         ` Joel Fernandes
  2022-08-30 10:50         ` Frederic Weisbecker
  2022-08-29 20:36       ` Joel Fernandes
  1 sibling, 2 replies; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-29 20:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, linux-kernel, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Mon, Aug 29, 2022 at 09:46:22PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > Hi Frederic,
> > 
> > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > > On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
> > >> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
> > >> and rcuscale tests.
> > >>
> > >> Fixes on top of v3:
> > >> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
> > >>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
> > >> - Fixed trace_rcu_callback tracepoint
> > >>
> > >> I tested power previously [1], I am in the process of testing power again but I
> > >> wanted share my latest code as others who are testing power as well could use
> > >> the above fixes.
> > > 
> > > Your patch is very likely to be _generally_ useful and therefore,
> > > the more I look into this, the more I wonder if it is a good idea to rely on
> > > bypass at all, let alone NOCB. Of course in the long term the goal is to have
> > > bypass working without NOCB but why even bother implementing it for nocb
> > > in the first place?
> > 
> > This was discussed with Paul [1]. Quoting:
> > 
> > ----
> > Joel:
> > >> Also, does doing so not prevent usage of lazy CBs on systems without
> > >> NOCB? So if we want to future-proof this, I guess that might not be a
> > >> good decision.
> > >
> > Paul:
> > > True enough, but would this future actually arrive?  After all, if
> > > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > > why wouldn't they also offload callbacks?
> > 
> > Joel: I am not sure, but I also don't mind making it depend on NOCB for now
> > (see below).
> > 
> > [1] https://www.spinics.net/lists/rcu/msg07908.html
> > ----
> > 
> > While I agree with you that perhaps making it more generic is better, this did
> > take a significant amount of time, testing and corner case hunting to come up
> > with, and v5 is also in the works so I'd appreciate if we can do it the
> > bypass-way and optimize later. Arguably the bypass way is quite simple and
> > allows us to leverage its effects of rcu_barrier and such. And the API will not
> > change.
> 
> Keep in mind that if we later need to rewrite the whole in order to have a
> generic approach, this will take even more time in the long run.
> 
> > > 2) NOCB without nohz_full has extremely rare usecase (RT niche:
> > > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > > )
> > 
> > Really? Android has been using it for a long time. It seems to be quite popular
> > in the battery-powered space.
> 
> It's really sad that this is the first time I hear about that. I've been working
> on this code for years now without this usecase in mind. And yet it's fundamental.
> 
> I asked around several times about usecases of rcu_nocbs= other than nohz_full=
> and nobody reported that. I can hardly even google a significant link
> between power saving and rcu_nocbs=
> 
> If this is really used that way for a long time then it's a cruel disconnection
> between users and developers.

Knowing me, you probably asked about rcu_nocbs and I probably thought
you were asking about nohz_full.  :-/

> > > 3) NOCB implies performance issues.
> > 
> > Which kinds? Boot times are slightly worse, but I'm guessing that has to do
> > with the extra scheduling overhead of the extra threads, which is usually not a
> > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> 
> I never measured it myself, but executing callbacks on other CPUs, with
> context switches and locking, can only involve significant performance issues if callbacks
> are frequent. So it's a tradeoff between power and performance.

It has indeed been a problem for some workloads in the past.  But I don't
know of any recent measurements.  And NOCB has gotten at least somewhat
faster over the years.

> > > 4) We are mixing up two very different things in a single list of callbacks:
> > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > >    off-topic corner cases all around:
> > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > >        general than just bypass/lazy
> > >      * "lazy" specialized parameters to general purpose cblist management
> > >        functions
> > 
> > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > intrusive to me.
> 
> What bothers me is that struct cblist has a general purpose and we are adding a field
> and a parameter that is relevant to only one specialized user.

This does sound like a bad idea, now that you mention it.  Joel, if
this is still in place, can it be moved near the rcu_data structure's
bypass-related fields?

And by the way, thank you for reviewing this patch series!

> > > 5) This is further complicating bypass core code, nocb timer management, core
> > >    nocb group management, all of which being already very complicated.
> > 
> > True, I agree, a few more cases to handle for sure, but I think I got them all
> > now (hopefully).
> 
> Now I'm worried about maintainability. Hence why I'd rather see a generic code
> for them all if possible.

Let's see what Joel's v5 looks like.

> > > 6) The !NOCB implementation is going to be very different
> > > 
> > > Ok I can admit one counter argument in favour of using NO_CB:
> > > 
> > > -1) The scheduler can benefit from a wake CPU to run the callbacks on behalf of a bunch
> > > of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
> > > with callbacks that can actually wait...
> > 
> > Yeah that's huge. Significant amount of power improvement seems to come from
> > idle CPUs not being disturbed and their corresponding timer ticks turned off for
> > longer periods. That's experimentally confirmed (NO_CB giving significant power
> > improvement on battery-power systems as compared to !NO_CB).
> 
> It's good news to hear that nocbs is used way beyond its initial purpose.
> But still very sad to hear about that several years late.

It came as a bit of a surprise to me as well, but it was why I felt
comfortable removing CONFIG_RCU_FAST_NO_HZ.

If I remember correctly, the Android power-aware-scheduler folks first
tried it out.

> > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > management functions but probably not that much. And also make sure that entrain()
> > > queues to RCU_LAZY_TAIL.
> > > 
> > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > nocb group leader instead of a local timer in !NOCB.
> > 
> > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > actually care about lazy CBs outside of power, and would those guys ever use
> > !NO_CB if they cared about power / battery?
> 
> _Everybody_ cares about power. Those who don't yet will very soon ;-)

Apparently not enough to use CONFIG_RCU_FAST_NO_HZ.  Though to be fair,
that option had its own performance issues.  And it would not reduce
grace periods anywhere near as much as call_rcu_lazy().  But the problem
was that last I checked on server workloads, the callbacks were mostly
those that could not reasonably be lazy.

> And given the numbers you provided with your measurements, I bet this will
> be significant with !NOCB as well. This is not only delaying callbacks execution,
> this also reduces the frequency of grace periods, and that impact should be
> quite visible.
> 
> Note I'm not strictly opposed to the current approach. But I can't say I'm
> comfortable with it.
> 
> Can we do a simple test? Would it be possible to affine every rcuo%c/%d kthread
> to the corresponding CPU%d? For example affine rcuop/1 to CPU 1, rcuop/2 to
> CPU2, etc... And then relaunch your measurements on top of that?
> 
> The point is that having the callback kthreads affined to their corresponding
> CPUs should defeat the power saving advantages of rcu_nocbs=, back to roughly
> a !NOCB behaviour powerwise (except we have context switches). If you find good
> numbers with this setup then you'll find good numbers with !NOCB.

Another test would be to look at which callbacks are being invoked
on each grace period.  We have to have a substantial number of grace
periods having all lazy callbacks before call_rcu_lazy() has any chance
of helping.  This would need to happen on a server platform because
Android and ChromeOS data might or might not carry over.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 19:46     ` Frederic Weisbecker
  2022-08-29 20:31       ` Paul E. McKenney
@ 2022-08-29 20:36       ` Joel Fernandes
  2022-08-29 20:42         ` Paul E. McKenney
  2022-08-30 10:26         ` Frederic Weisbecker
  1 sibling, 2 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-08-29 20:36 UTC (permalink / raw)
  To: Frederic Weisbecker, Dietmar Eggemann
  Cc: LKML, Paul E. McKenney, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > Hi Frederic,
> >
> > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > > On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
> > >> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
> > >> and rcuscale tests.
> > >>
> > >> Fixes on top of v3:
> > >> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
> > >>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
> > >> - Fixed trace_rcu_callback tracepoint
> > >>
> > >> I tested power previously [1], I am in the process of testing power again but I
> > >> wanted share my latest code as others who are testing power as well could use
> > >> the above fixes.
> > >
> > > Your patch is very likely to be _generally_ useful and therefore,
> > > the more I look into this, the more I wonder if it is a good idea to rely on
> > > bypass at all, let alone NOCB. Of course in the long term the goal is to have
> > > bypass working without NOCB but why even bother implementing it for nocb
> > > in the first place?
> >
> > This was discussed with Paul [1]. Quoting:
> >
> > ----
> > Joel:
> > >> Also, does doing so not prevent usage of lazy CBs on systems without
> > >> NOCB? So if we want to future-proof this, I guess that might not be a
> > >> good decision.
> > >
> > Paul:
> > > True enough, but would this future actually arrive?  After all, if
> > > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > > why wouldn't they also offload callbacks?
> >
> > Joel: I am not sure, but I also don't mind making it depend on NOCB for now
> > (see below).
> >
> > [1] https://www.spinics.net/lists/rcu/msg07908.html
> > ----
> >
> > While I agree with you that perhaps making it more generic is better, this did
> > take a significant amount of time, testing and corner case hunting to come up
> > with, and v5 is also in the works so I'd appreciate if we can do it the
> > bypass-way and optimize later. Arguably the bypass way is quite simple and
> > allows us to leverage its effects of rcu_barrier and such. And the API will not
> > change.
>
> Keep in mind that if we later need to rewrite the whole thing in order to have a
> generic approach, this will take even more time in the long run.

Agreed on that. But the use

> > > 2) NOCB without nohz_full has extremely rare usecase (RT niche:
> > > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > > )
> >
> > Really? Android has been using it for a long time. It seems to be quite popular
> > in the battery-powered space.
>
> It's really sad that this is the first time I hear about that. I've been working
> on this code for years now without this usecase in mind. And yet it's fundamental.
>
> I asked around several times about usecases of rcu_nocbs= other than nohz_full=
> and nobody reported that. I can hardly even google a significant link
> between power saving and rcu_nocbs=
>
> If this is really used that way for a long time then it's a cruel disconnection
> between users and developers.

I was not involved with Android or RCU back then, but my guess is that
it was not enabled with the intent of saving power; it is just that
using RCU_NO_CB_CPU has become the way to go to keep dyntick-idle
CPUs undisturbed: https://paulmck.livejournal.com/66807.html . Paul,
+Dietmar Eggemann can probably provide more history.

> > > 3) NOCB implies performance issues.
> >
> > Which kinds? Boot times are slightly worse, but I'm guessing that has to do
> > with the extra scheduling overhead of the extra threads, which is usually not a
> > problem except that RCU is used in the critical path of boot up (on ChromeOS).
>
> I never measured it myself, but executing callbacks on other CPUs, with
> context switches and locking, can only involve significant performance issues if callbacks
> are frequent. So it's a tradeoff between power and performance.

In my testing of benchmarks on real systems with 8-16 CPUs, the
performance hit is down in the noise. It is possible that one could
write an unrealistic synthetic test to force the performance issues,
but I've not seen them in the real world. Maybe on networking-heavy
servers with lots of cores you'll see it, but their batteries, if any,
would be pretty big :-).

> > > 4) We are mixing up two very different things in a single list of callbacks:
> > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > >    off-topic corner cases all around:
> > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > >        general than just bypass/lazy
> > >      * "lazy" specialized parameters to general purpose cblist management
> > >        functions
> >
> > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > intrusive to me.
>
> What bothers me is that struct cblist has a general purpose and we are adding a field
> and a parameter that is relevant to only one specialized user.

To Paul's point, we can change it to a flag, I think. The 3 states are:
- no CBs
- all lazy
- all non-lazy

Or, worst case, we can move the flag to the per-CPU rcu_data
structure, I think. Does that alleviate your concern?
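
For concreteness, a minimal sketch of that tri-state flag, assuming it sits
next to the existing bypass-related fields of rcu_data; the enum, field, and
helper names below are invented for illustration and are not in the posted
series:

#include <linux/types.h>

enum rcu_lazy_state {
	RCU_CBS_NONE,		/* no callbacks queued on this CPU */
	RCU_CBS_ALL_LAZY,	/* every queued callback may be deferred */
	RCU_CBS_SOME_NONLAZY,	/* at least one callback wants a GP soon */
};

struct rcu_data_sketch {
	/* ... existing rcu_data fields, including bypass state ... */
	enum rcu_lazy_state lazy_state;
};

/* Enqueue-time update: one non-lazy callback makes the whole list urgent. */
static void rcu_lazy_state_enqueue(struct rcu_data_sketch *rdp, bool lazy)
{
	if (rdp->lazy_state == RCU_CBS_NONE)
		rdp->lazy_state = lazy ? RCU_CBS_ALL_LAZY : RCU_CBS_SOME_NONLAZY;
	else if (!lazy)
		rdp->lazy_state = RCU_CBS_SOME_NONLAZY;
}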

> > > 4) This is further complexifying bypass core code, nocb timer management, core
> > >    nocb group management, all of which are already very complicated.
> >
> > True, I agree, a few more cases to handle for sure, but I think I got them all
> > now (hopefully).
>
> Now I'm worried about maintainability. Hence why I'd rather see generic code
> for them all if possible.

Maintainability is a concern for any new code. I rely a lot on testing,
both synthetic and real-world. I have spent a lot of time on test
code for this.

> > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > management functions but probably not that much. And also make sure that entrain()
> > > queues to RCU_LAZY_TAIL.
> > >
> > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > nocb group leader instead of a local timer in !NOCB.
> >
> > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > actually care about lazy CBs outside of power, and would those guys ever use
> > !NO_CB if they cared about power / battery?
>
> _Everybody_ cares about power. Those who don't yet will very soon ;-)
>
> And given the numbers you provided with your measurements, I bet this will
> be significant with !NOCB as well. This is not only delaying callbacks execution,
> this also reduces the frequency of grace periods, and that impact should be
> quite visible.

I will defer to Paul on whether the current approach is a viable
solution. If it is not, and we are to take a different design route,
it would be good to know that so I can cease current efforts and go
back to the drawing board.
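
For reference, the RCU_LAZY_TAIL proposal quoted above would extend the
current segment layout (see include/linux/rcu_segcblist.h) roughly as in the
sketch below; RCU_LAZY_TAIL itself is hypothetical and does not exist in
mainline:

/* Existing segcblist segments, as in include/linux/rcu_segcblist.h: */
#define RCU_DONE_TAIL        0	/* CBs whose grace period has elapsed */
#define RCU_WAIT_TAIL        1	/* CBs waiting on the current GP */
#define RCU_NEXT_READY_TAIL  2	/* CBs ready to wait on the next GP */
#define RCU_NEXT_TAIL        3	/* CBs not yet associated with a GP */

/* Hypothetical new segment: CBs parked here would be ignored until a
 * deferral timer fires or the CPU is known to be busy, at which point
 * they advance into the normal segments.  entrain() would also need to
 * queue into this segment. */
#define RCU_LAZY_TAIL        4

#define RCU_CBLIST_NSEGS     5	/* currently 4 in mainline */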

>
> Note I'm not strictly opposed to the current approach. But I can't say I'm
> comfortable with it.

Maybe we can find a way in the code to see what can be changed. One
concern you rightly raised is the length variables, but I think
that's not a blocker.

> Can we do a simple test? Would it be possible to affine every rcuo%c/%d kthread
> to the corresponding CPU%d? For example affine rcuop/1 to CPU 1, rcuop/2 to
> CPU2, etc... And then relaunch your measurements on top of that?

We already did that and it does not help; it makes things worse. The
decision of where to run RCU threads is best left to the scheduler
(for both performance and power).

> The point is that having the callback kthreads affined to their corresponding
> CPUs should negate the power-saving advantages of rcu_nocbs=, back to roughly
> a !NOCB behaviour powerwise (except we have context switches). If you find good
> numbers with this setup then you'll find good numbers with !NOCB.

You'll be surprised at the results! ;-) The reaction of everyone I
know of who has tried this is a pleasant surprise that "the scheduler
might actually know what it's doing" ;-)

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:36       ` Joel Fernandes
@ 2022-08-29 20:42         ` Paul E. McKenney
  2022-08-29 20:48           ` Joel Fernandes
  2022-08-30 10:53           ` Frederic Weisbecker
  2022-08-30 10:26         ` Frederic Weisbecker
  1 sibling, 2 replies; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-29 20:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:

[ . .  . ]

> > > > 2) NOCB implies performance issues.
> > >
> > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> >
> > I never measured it myself, but executing callbacks on other CPUs, with
> > context switches and locking, can only involve significant performance issues if callbacks
> > are frequent. So it's a tradeoff between power and performance.
> 
> In my testing of benchmarks on real systems with 8-16 CPUs, the
> performance hit is down in the noise. It is possible, though, that
> one could write an unrealistic synthetic test to force the performance
> issues, but I've not seen it in the real world. Maybe on
> networking-heavy servers with lots of cores you'll see it, but their
> batteries, if any, would be pretty big :-).

To Frederic's point, if you have enough servers, even a 1% decrease in
power consumption is a very big deal.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:42         ` Paul E. McKenney
@ 2022-08-29 20:48           ` Joel Fernandes
  2022-08-30 10:57             ` Frederic Weisbecker
  2022-08-30 10:53           ` Frederic Weisbecker
  1 sibling, 1 reply; 54+ messages in thread
From: Joel Fernandes @ 2022-08-29 20:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 4:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
>
> [ . .  . ]
>
> > > > > 2) NOCB implies performance issues.
> > > >
> > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> > >
> > > I never measured it myself, but executing callbacks on other CPUs, with
> > > context switches and locking, can only involve significant performance issues if callbacks
> > > are frequent. So it's a tradeoff between power and performance.
> >
> > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > performance hit is down in the noise. It is possible, though, that
> > one could write an unrealistic synthetic test to force the performance
> > issues, but I've not seen it in the real world. Maybe on
> > networking-heavy servers with lots of cores you'll see it, but their
> > batteries, if any, would be pretty big :-).
>
> To Frederic's point, if you have enough servers, even a 1% decrease in
> power consumption is a very big deal.  ;-)

Ah I see Frederic's point now, so basically the claim is that using
lazy-RCU on servers might make sense to save power because
CONFIG_RCU_NOCB_CPU may not be an option there (networking throughput
and so forth).

That's a good point indeed...

As you said, let us see v5 and how we want to proceed from there (as
it is not too far from posting). I do appreciate Frederic's review
and valid concerns.

Thanks,

- Joel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:31       ` Paul E. McKenney
@ 2022-08-29 20:54         ` Joel Fernandes
  2022-08-30 10:50         ` Frederic Weisbecker
  1 sibling, 0 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-08-29 20:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 4:31 PM Paul E. McKenney <paulmck@kernel.org> wrote:
[...]
> > > > 3) We are mixing up two very different things in a single list of callbacks:
> > > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > > >    off-topic corner cases all around:
> > > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > > >        general than just bypass/lazy
> > > >      * "lazy" specialized parameters to general purpose cblist management
> > > >        functions
> > >
> > > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > > intrusive to me.
> >
> > What bothers me is that struct cblist has a general purpose and we are adding a field
> > and a parameter that is relevant to only one specialized user.
>
> This does sound like a bad idea, now that you mention it.  Joel, if
> this is still in place, can it be moved near the rcu_data structure's
> bypass-related fields?

Yes, I can certainly do that! Consider it gone *poof* from the
rcu_cblist structure, and moved into the rcu_data.

>
> And by the way, thank you for reviewing this patch series!

And my thanks as well! I am deeply appreciative of y'all's work and
participation.

> > > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > > management functions but probably not that much. And also make sure that entrain()
> > > > queues to RCU_LAZY_TAIL.
> > > >
> > > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > > nocb group leader instead of a local timer in !NOCB.
> > >
> > > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > > actually care about lazy CBs outside of power, and would those guys ever use
> > > !NO_CB if they cared about power / battery?
> >
> > _Everybody_ cares about power. Those who don't yet will very soon ;-)
>
> Apparently not enough to use CONFIG_RCU_FAST_NO_HZ.  Though to be fair,
> that option had its own performance issues.  And it would not reduce
> grace periods anywhere near as much as call_rcu_lazy().  But the problem
> was that last I checked on server workloads, the callbacks were mostly
> those that could not reasonably be lazy.

I see! FWIW, lazy-RCU does not have a lot of benefit on busy systems
in our testing (because other non-lazy RCU CBs keep churning grace
period cycles). So for servers that are mostly busy, the power benefit
may not be that much IMHO.

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:36       ` Joel Fernandes
  2022-08-29 20:42         ` Paul E. McKenney
@ 2022-08-30 10:26         ` Frederic Weisbecker
  2022-08-30 18:40           ` Joel Fernandes
  1 sibling, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 10:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Dietmar Eggemann, LKML, Paul E. McKenney, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> >
> > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > Hi Frederic,
> > >
> > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > > > On Fri, Aug 19, 2022 at 08:48:43PM +0000, Joel Fernandes (Google) wrote:
> > > >> Refresh tested on real ChromeOS userspace and hardware, passes boot time tests
> > > >> and rcuscale tests.
> > > >>
> > > >> Fixes on top of v3:
> > > >> - Fix boot issues due to a race in the lazy RCU logic which caused a missed
> > > >>   wakeup of the RCU GP thread, causing synchronize_rcu() to stall.
> > > >> - Fixed trace_rcu_callback tracepoint
> > > >>
> > > >> I tested power previously [1], I am in the process of testing power again but I
> > > >> wanted to share my latest code as others who are testing power as well could use
> > > >> the above fixes.
> > > >
> > > > Your patch is very likely to be _generally_ useful and therefore,
> > > > the more I look into this, the more I wonder if it is a good idea to rely on
> > > > bypass at all, let alone NOCB. Of course in the long term the goal is to have
> > > > bypass working without NOCB, but why even bother implementing it for nocb
> > > > in the first place?
> > >
> > > This was discussed with Paul [1]. Quoting:
> > >
> > > ----
> > > Joel:
> > > >> Also, does doing so not prevent usage of lazy CBs on systems without
> > > >> NOCB? So if we want to future-proof this, I guess that might not be a
> > > >> good decision.
> > > >
> > > Paul:
> > > > True enough, but would this future actually arrive?  After all, if
> > > > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > > > why wouldn't they also offload callbacks?
> > >
> > > Joel: I am not sure, but I also don't mind making it depend on NOCB for now
> > > (see below).
> > >
> > > [1] https://www.spinics.net/lists/rcu/msg07908.html
> > > ----
> > >
> > > While I agree with you that perhaps making it more generic is better, this did
> > > take a significant amount of time, testing, and corner-case hunting to come up
> > > with, and v5 is also in the works, so I'd appreciate it if we can do it the
> > > bypass way and optimize later. Arguably the bypass way is quite simple and
> > > allows us to leverage its existing handling of rcu_barrier and such. And the API will not
> > > change.
> >
> > Keep in mind that if we later need to rewrite the whole thing in order to have a
> > generic approach, this will take even more time in the long run.
> 
> Agreed on that. But the use
> 
> > > > 2) NOCB without nohz_full has an extremely rare usecase (RT niche:
> > > > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > > > )
> > >
> > > Really? Android has been using it for a long time. It seems to be quite popular
> > > in the battery-powered space.
> >
> > It's really sad that this is the first time I've heard about that. I've been working
> > on this code for years now without this usecase in mind. And yet it's fundamental.
> >
> > I asked around several times about usecases of rcu_nocbs other than nohz_full=
> > and nobody reported that. I can hardly even google a significant link
> > between power saving and rcu_nocbs=
> >
> > If this has really been used that way for a long time then it's a cruel disconnect
> > between users and developers.
> 
> I was not involved with Android or RCU back then. But my guess is that it was
> not enabled with the intent of saving power; it is just that using
> RCU_NOCB_CPU has become the way to go to keep dyntick-idle CPUs
> undisturbed: https://paulmck.livejournal.com/66807.html . Paul,
> +Dietmar Eggemann can probably provide more history.

Thanks for the pointer!

> > > > 2) NOCB implies performance issues.
> > >
> > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> >
> > I never measured it myself, but executing callbacks on other CPUs, with
> > context switches and locking, can only involve significant performance issues if callbacks
> > are frequent. So it's a tradeoff between power and performance.
> 
> In my testing of benchmarks on real systems with 8-16 CPUs, the
> performance hit is down in the noise. It is possible, though, that
> one could write an unrealistic synthetic test to force the performance
> issues, but I've not seen it in the real world. Maybe on
> networking-heavy servers with lots of cores you'll see it, but their
> batteries, if any, would be pretty big :-).

Yeah I suspect this should have an impact on servers. And even servers may
deserve idleness sometimes.

> 
> > > > 3) We are mixing up two very different things in a single list of callbacks:
> > > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > > >    off-topic corner cases all around:
> > > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > > >        general than just bypass/lazy
> > > >      * "lazy" specialized parameters to general purpose cblist management
> > > >        functions
> > >
> > > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > > intrusive to me.
> >
> > What bothers me is that struct cblist has a general purpose and we are adding a field
> > and a parameter that is relevant to only one specialized user.
> 
> To Paul's point, we can change it to a flag, I think. The 3 states are:
> - no CBs
> - all lazy
> - all non-lazy

Yeah that makes sense, should we take the generic direction!

> 
> Or, worst case, we can move the flag to the per-CPU rcu_data
> structure, I think. Does that alleviate your concern?
> 
> > > > 4) This is further complexifying bypass core code, nocb timer management, core
> > > >    nocb group management, all of which are already very complicated.
> > >
> > > True, I agree, a few more cases to handle for sure, but I think I got them all
> > > now (hopefully).
> >
> > Now I'm worried about maintainability. Hence why I'd rather see generic code
> > for them all if possible.
> 
> Maintainability is a concern for any new code. I rely a lot on testing,
> both synthetic and real-world. I have spent a lot of time on test
> code for this.

And I thank you for that. I've always been terrible at performance/power/latency
testing so I'm always relieved when others take over on that :)

> 
> > > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > > management functions but probably not that much. And also make sure that entrain()
> > > > queues to RCU_LAZY_TAIL.
> > > >
> > > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > > nocb group leader instead of a local timer in !NOCB.
> > >
> > > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > > actually care about lazy CBs outside of power, and would those guys ever use
> > > !NO_CB if they cared about power / battery?
> >
> > _Everybody_ cares about power. Those who don't yet will very soon ;-)
> >
> > And given the numbers you provided with your measurements, I bet this will
> > be significant with !NOCB as well. This is not only delaying callbacks execution,
> > this also reduces the frequency of grace periods, and that impact should be
> > quite visible.
> 
> I will defer to Paul on whether the current approach is a viable
> solution. If it is not, and we are to take a different design route,
> it would be good to know that so I can cease current efforts and go
> back to the drawing board.

Ok.

> 
> >
> > Note I'm not strictly opposed to the current approach. But I can't say I'm
> > comfortable with it.
> 
> Maybe we can find a way in the code to see what can be changed. One
> concern you rightly raised is the length variables, but I think
> that's not a blocker.

That's just a detail indeed. If we proceed with the current NOCB approach I'll
give a deeper review.

> 
> > Can we do a simple test? Would it be possible to affine every rcuo%c/%d kthread
> > to the corresponding CPU%d? For example affine rcuop/1 to CPU 1, rcuop/2 to
> > CPU2, etc... And then relaunch your measurements on top of that?
> 
> We already did that and it does not help; it makes things worse. The
> decision of where to run RCU threads is best left to the scheduler
> (for both performance and power).

That's the point, we want to artificially remove NOCB advantages to get
a behaviour close to !NOCB, so that we know if it's worth expanding the
patchset to !NOCB.

Let me clarify. The point of NOCB, powerwise, is that the RCU kthreads
track and execute the callbacks on an optimized selection of CPUs, thanks to the
scheduler. And the new lazy flow optimizes that further.

Now we would like to know if call_rcu_lazy() also works on !NOCB. But we have
no way to test that right now because we only have an implementation for NOCB.

But if we affine each rcuop/%d to its corresponding CPU%d, we get a behaviour
that is close to !NOCB because then the scheduler doesn't help anymore and
callbacks are executed locally (the only difference is that we do context
switches, locking, and we still have rcuog around, but still, that
should give a rough idea of what we get with !NOCB).

Therefore if you run your measurements again before and after your patchset,
both with each rcuop/%d affined to its CPU%d counterpart, you should get a
comparison of before vs. after your patchset on what would be an implementation
on !NOCB.
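
A rough userspace sketch of the pinning step of that experiment follows
(run as root; the rcuop/%d kthread naming follows the rcuo%c/%d convention
discussed above, and everything else here is an assumption of the sketch,
not code from the series):

/* pin_rcuop.c: affine each rcuop/N kthread to CPU N, approximating
 * !NOCB behaviour for the before/after comparison described above. */
#define _GNU_SOURCE
#include <dirent.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;

	while (proc && (de = readdir(proc))) {
		char path[300], comm[64];
		pid_t pid = (pid_t)atoi(de->d_name);
		FILE *f;
		int cpu;

		if (pid <= 0)
			continue;
		snprintf(path, sizeof(path), "/proc/%d/comm", (int)pid);
		f = fopen(path, "r");
		if (!f)
			continue;
		/* Match kthreads named "rcuop/<cpu>" and pin each to its CPU. */
		if (fgets(comm, sizeof(comm), f) &&
		    sscanf(comm, "rcuop/%d", &cpu) == 1) {
			cpu_set_t set;

			CPU_ZERO(&set);
			CPU_SET(cpu, &set);
			if (sched_setaffinity(pid, sizeof(set), &set) == 0)
				printf("pinned rcuop/%d (pid %d) to CPU %d\n",
				       cpu, (int)pid, cpu);
			else
				perror("sched_setaffinity");
		}
		fclose(f);
	}
	if (proc)
		closedir(proc);
	return 0;
}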

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 19:57     ` Paul E. McKenney
@ 2022-08-30 10:43       ` Frederic Weisbecker
  2022-08-30 12:08         ` Paul E. McKenney
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 10:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, linux-kernel, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Mon, Aug 29, 2022 at 12:57:30PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > While I agree with you that perhaps making it more generic is better, this did
> > take a significant amount of time, testing, and corner-case hunting to come up
> > with, and v5 is also in the works, so I'd appreciate it if we can do it the
> > bypass way and optimize later. Arguably the bypass way is quite simple and
> > allows us to leverage its existing handling of rcu_barrier and such. And the API will not
> > change.
> 
> Just confirming this conversation, on the hopefully unlikely off-chance
> that there is any doubt.  ;-)
> 
> That said, if there is some compelling use case that is not addressed
> by rcu_nocbs, keeping in mind that these can now be made dynamic, then
> some adjustment will of course be needed.

Right, there is that too.

> 
> > > Several highlights:
> > > 
> > > 1) NOCB is most often needed for nohz_full and the latter has terrible power
> > > management. CPU 0 is active all the time there.
> > 
> > I see. We don't use nohz_full much. NOCB itself gives good power improvement.
> > 
> > > 2) NOCB without nohz_full has an extremely rare usecase (RT niche:
> > > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > > )
> > 
> > Really? Android has been using it for a long time. It seems to be quite popular
> > in the battery-powered space.
> > 
> > > 2) NOCB implies performance issues.
> > 
> > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > with the extra scheduling overhead of the extra threads, which is usually not a
> > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> 
> Back in 2010, Rik van Riel reported significant slowdowns for some types
> of Java workloads, but for normal servers, not Android or ChromeOS.
> I have no idea whether similar slowdowns exist today.  But if there is
> no performance advantage to non-offloaded callbacks, we should first make
> offloading the default, and if there are no complaints after a few years,
> remove the non-offloaded case completely.

My gut feeling is that this is a bad idea. Yet I have no practical proof :o)

> My guess is that at the very least, scheduler corner cases will force
> us to keep non-offloaded callbacks, but you never know.  In any case,
> a wakeup is considerably more expensive than a non-atomic OR of a bit
> in a per-CPU variable, so there is some chance that offloading causes
> some important workloads considerable performance degradation.

Definitely!

> 
> > > 3) We are mixing up two very different things in a single list of callbacks:
> > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > >    off-topic corner cases all around:
> > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > >        general than just bypass/lazy
> > >      * "lazy" specialized parameters to general purpose cblist management
> > >        functions
> > 
> > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > intrusive to me.
> 
> It has been getting simpler!  ;-)
> 
> I bet that the lazy_len field can be a boolean and independent of
> ->cblist, and that doing that would simplify things at least a little bit.
> But, yes, an all-lazy indicator of some sort would still need to exist.

Yeah, I'll check the patch in detail.

> 
> > > 4) This is further complexifying bypass core code, nocb timer management, core
> > >    nocb group management, all of which are already very complicated.
> > 
> > True, I agree, a few more cases to handle for sure, but I think I got them all
> > now (hopefully).
> 
> If we do need lazy callbacks on non-offloaded CPUs, there will need to
> be changes to both the bypass logic (possibly just those changes that
> Joel already has, but Murphy might disagree) and to the ->cblist logic.
> At the very least, the wakeup logic would need adjustment from current
> -rcu and there would still need to be some way of tracking whether or
> not all the callbacks in the bypass list are lazy.

Sure, but we can arrange to push the complexity into a common place between
NOCB and !NOCB.

> 
> > > 5) The !NOCB implementation is going to be very different
> > > 
> > > Ok I can admit one counter argument in favour of using NO_CB:
> > > 
> > > -1) The scheduler can benefit from an awake CPU to run the callbacks on behalf of a bunch
> > > of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
> > > with callbacks that can actually wait...
> 
> You lost me on this one.  Having a callback invoked on a non-idle CPU
> should save significant power without significant delay in callback
> invocation.  What am I missing here?

The thing is that if the callback can wait, and actually does, then the
advantage of call_rcu_lazy() should be visible whether the callbacks are
offloaded or not.

> 
> > Yeah that's huge. A significant amount of power improvement seems to come from
> > idle CPUs not being disturbed and their corresponding timer ticks turned off for
> > longer periods. That's experimentally confirmed (NO_CB giving significant power
> > improvement on battery-powered systems as compared to !NO_CB).
> > 
> > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > management functions but probably not that much. And also make sure that entrain()
> > > queues to RCU_LAZY_TAIL.
> > > 
> > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > nocb group leader instead of a local timer in !NOCB.
> 
> It is certainly good to look into alternatives!  Especially if this has
> somehow broken (de)offloading.  (Not seeing it in my testing, but then
> again, I have not yet tested this series all that much.)
> 
> How does the separate RCU_LAZY_TAIL segment help?  I would think
> that you would instead want an all-lazy flag on each of the existing
> RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL segments.  After all, if there is
> even one non-lazy callback in either segment, we need the corresponding
> grace period to run sooner rather than later.  And if we are running a
> given grace period anyway, it costs little to handle the lazy callbacks
> while we are at it.

Good point! That sounds much better.
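
To make that concrete, here is a sketch of what such per-segment all-lazy
bits might look like; the structure, macro, and helper names are invented,
and nothing like this exists in the posted series:

#include <linux/types.h>

/* One "all lazy" bit per segment, kept alongside the existing
 * rcu_segcblist.  RCU_NEXT_READY_TAIL == 2 and RCU_NEXT_TAIL == 3 in
 * the current include/linux/rcu_segcblist.h layout. */
#define SEG_ALL_LAZY(seg)	(1U << (seg))

struct seg_lazy_bits {
	unsigned int all_lazy;	/* set while a segment holds only lazy CBs */
};

/* Enqueue into @seg: an empty segment starts out vacuously all-lazy,
 * and a single non-lazy callback clears the bit. */
static inline void seg_lazy_enqueue(struct seg_lazy_bits *slb, int seg,
				    bool seg_was_empty, bool lazy)
{
	if (seg_was_empty)
		slb->all_lazy |= SEG_ALL_LAZY(seg);
	if (!lazy)
		slb->all_lazy &= ~SEG_ALL_LAZY(seg);
}

/* A grace period is needed soon unless both trailing segments hold
 * nothing but lazy callbacks. */
static inline bool seg_lazy_gp_urgent(struct seg_lazy_bits *slb)
{
	unsigned int trailing = SEG_ALL_LAZY(2) | SEG_ALL_LAZY(3);

	return (slb->all_lazy & trailing) != trailing;
}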

> Or is there a use case where it helps a lot to defer lazy callbacks that
> could have been handled by a grace period that needed to happen anyway,
> due to the presence of non-lazy callbacks?  I am having a hard time coming
> up with one, but perhaps that is a failure of imagination on my part.

I don't see one right now.

> 
> There would still need to be changes to the bypass code because NOCB is
> what gets both Android and ChromeOS big power savings.

Actually using the flag on RCU_NEXT_TAIL and RCU_NEXT_READY_TAIL would
avoid touching the bypass code.

I see a big advantage in that we don't mix up two orthogonal things anymore:
flooding callbacks (normal bypass) and callbacks that can actually wait
(lazy callbacks), both needing a different treatment.

> And yes, no matter what, rcu_barrier_entrain() needs to motivate any lazy
> callbacks.  Currently, this falls out from the flushing of the bypass.
> Presumably, offloading and deoffloading could also take advantage of
> bypass flushing.
> 
> And I have no idea whether it would make sense for the NOCB and !NOCB
> case to share a laziness-motivation timer.

No, at least the timer will need to be different. It should integrate into
the existing one in NOCB, whereas !NOCB should have something simpler.
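
For the !NOCB side, a minimal sketch of such a simpler timer, assuming a
per-CPU deferral timer that just kicks RCU core when it fires; none of these
names are from the posted series:

#include <linux/interrupt.h>
#include <linux/jiffies.h>
#include <linux/percpu.h>
#include <linux/timer.h>

#define LAZY_FLUSH_JIFFIES	(10 * HZ)	/* example deferral period */

static DEFINE_PER_CPU(struct timer_list, rcu_lazy_timer);

static void rcu_lazy_timer_fn(struct timer_list *t)
{
	/* Fold deferred lazy CBs back into the normal GP machinery by
	 * running RCU core on this CPU. */
	raise_softirq(RCU_SOFTIRQ);
}

static void rcu_lazy_timer_init(int cpu)
{
	timer_setup(per_cpu_ptr(&rcu_lazy_timer, cpu), rcu_lazy_timer_fn,
		    TIMER_PINNED);
}

/* Called on the first lazy enqueue while no lazy CBs were pending. */
static void rcu_lazy_arm_timer(void)
{
	struct timer_list *t = this_cpu_ptr(&rcu_lazy_timer);

	if (!timer_pending(t))
		mod_timer(t, jiffies + LAZY_FLUSH_JIFFIES);
}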

> 
> > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > actually care about lazy CBs outside of power, and would those guys ever use
> > !NO_CB if they cared about power / battery?
> 
> And if they are not using NOCB, does call_rcu_lazy() actually help?

I suspect yes, due to the lower frequency of grace periods.

> 
> But again, if call_rcu_lazy() needs to handle the !NOCB case, then it
> needs to handle the !NOCB case.  However, given ChromeOS and Android,
> we know that call_rcu_lazy() needs to handle the NOCB case regardless.

Right.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:31       ` Paul E. McKenney
  2022-08-29 20:54         ` Joel Fernandes
@ 2022-08-30 10:50         ` Frederic Weisbecker
  2022-08-30 11:47           ` Paul E. McKenney
  1 sibling, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 10:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, linux-kernel, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Mon, Aug 29, 2022 at 01:31:31PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 29, 2022 at 09:46:22PM +0200, Frederic Weisbecker wrote:
> > It's really sad that this is the first time I've heard about that. I've been working
> > on this code for years now without this usecase in mind. And yet it's fundamental.
> >
> > I asked around several times about usecases of rcu_nocbs other than nohz_full=
> > and nobody reported that. I can hardly even google a significant link
> > between power saving and rcu_nocbs=
> >
> > If this has really been used that way for a long time then it's a cruel disconnect
> > between users and developers.
> 
> Knowing me, you probably asked about rcu_nocbs and I probably thought
> you were asking about nohz_full.  :-/

Can't remember, but no big deal; now we know about it and we can move forward
with that in mind.

> 
> > > > 2) NOCB implies performance issues.
> > > 
> > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> >
> > I never measured it myself, but executing callbacks on other CPUs, with
> > context switches and locking, can only involve significant performance issues if callbacks
> > are frequent. So it's a tradeoff between power and performance.
> 
> It has indeed been a problem for some workloads in the past.  But I don't
> know of any recent measurements.  And NOCB has gotten at least somewhat
> faster over the years.

I should try a comparison on a simple kernel build someday.

> 
> > > > 3) We are mixing up two very different things in a single list of callbacks:
> > > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > > >    off-topic corner cases all around:
> > > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > > >        general than just bypass/lazy
> > > >      * "lazy" specialized parameters to general purpose cblist management
> > > >        functions
> > > 
> > > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > > intrusive to me.
> > 
> > What bothers me is that struct cblist has a general purpose and we are adding a field
> > and a parameter that is relevant to only one specialized user.
> 
> This does sound like a bad idea, now that you mention it.  Joel, if
> this is still in place, can it be moved near the rcu_data structure's
> bypass-related fields?
> 
> And by the way, thank you for reviewing this patch series!

I'll go into a deeper review if we proceed.

> > > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > > management functions but probably not that much. And also make sure that entrain()
> > > > queues to RCU_LAZY_TAIL.
> > > > 
> > > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > > nocb group leader instead of a local timer in !NOCB.
> > > 
> > > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > > actually care about lazy CBs outside of power, and would those guys ever use
> > > !NO_CB if they cared about power / battery?
> > 
> > _Everybody_ cares about power. Those who don't yet will very soon ;-)
> 
> Apparently not enough to use CONFIG_RCU_FAST_NO_HZ.  Though to be fair,
> that option had its own performance issues.  And it would not reduce
> grace periods anywhere near as much as call_rcu_lazy().  But the problem
> was that last I checked on server workloads, the callbacks were mostly
> those that could not reasonably be lazy.

Right, but like I said, even servers can sometimes find a moment to think about
their good life.

> > And given the numbers you provided with your measurements, I bet this will
> > be significant with !NOCB as well. This is not only delaying callbacks execution,
> > this also reduces the frequency of grace periods, and that impact should be
> > quite visible.
> > 
> > Note I'm not strictly opposed to the current approach. But I can't say I'm
> > comfortable with it.
> > 
> > Can we do a simple test? Would it be possible to affine every rcuo%c/%d kthread
> > to the corresponding CPU%d? For example affine rcuop/1 to CPU 1, rcuop/2 to
> > CPU2, etc... And then relaunch your measurements on top of that?
> > 
> > The point is that having the callback kthreads affined to their corresponding
> > CPUs should negate the power-saving advantages of rcu_nocbs=, back to roughly
> > a !NOCB behaviour powerwise (except we have context switches). If you find good
> > numbers with this setup then you'll find good numbers with !NOCB.
> 
> Another test would be to look at which callbacks are being invoked
> on each grace period.  We have to have a substantial number of grace
> periods with only lazy callbacks before call_rcu_lazy() has any chance
> of helping.  This would need to happen on a server platform because
> Android and ChromeOS data might or might not carry over.

Also that yes.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:42         ` Paul E. McKenney
  2022-08-29 20:48           ` Joel Fernandes
@ 2022-08-30 10:53           ` Frederic Weisbecker
  2022-08-30 11:43             ` Paul E. McKenney
  1 sibling, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 10:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 01:42:02PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> 
> [ . .  . ]
> 
> > > > > 2) NOCB implies performance issues.
> > > >
> > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> > >
> > > I never measured it myself, but executing callbacks on other CPUs, with
> > > context switches and locking, can only involve significant performance issues if callbacks
> > > are frequent. So it's a tradeoff between power and performance.
> > 
> > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > performance hit is down in the noise. It is possible, though, that
> > one could write an unrealistic synthetic test to force the performance
> > issues, but I've not seen it in the real world. Maybe on
> > networking-heavy servers with lots of cores you'll see it, but their
> > batteries, if any, would be pretty big :-).
> 
> To Frederic's point, if you have enough servers, even a 1% decrease in
> power consumption is a very big deal.  ;-)

The world has enough servers, for that matter ;-)

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-29 20:48           ` Joel Fernandes
@ 2022-08-30 10:57             ` Frederic Weisbecker
  0 siblings, 0 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 10:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Paul E. McKenney, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Mon, Aug 29, 2022 at 04:48:25PM -0400, Joel Fernandes wrote:
> On Mon, Aug 29, 2022 at 4:42 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> >
> > On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> >
> > [ . .  . ]
> >
> > > > > > 2) NOCB implies performance issues.
> > > > >
> > > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> > > >
> > > > I never measured it myself, but executing callbacks on other CPUs, with
> > > > context switches and locking, can only involve significant performance issues if callbacks
> > > > are frequent. So it's a tradeoff between power and performance.
> > >
> > > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > > performance hit is down in the noise. It is possible, though, that
> > > one could write an unrealistic synthetic test to force the performance
> > > issues, but I've not seen it in the real world. Maybe on
> > > networking-heavy servers with lots of cores you'll see it, but their
> > > batteries, if any, would be pretty big :-).
> >
> > To Frederic's point, if you have enough servers, even a 1% decrease in
> > power consumption is a very big deal.  ;-)
> 
> Ah I see Frederic's point now, so basically the claim is that using
> lazy-RCU on servers might make sense to save power because
> CONFIG_RCU_NOCB_CPU may not be an option there (networking throughput
> and so forth).
> 
> That's a good point indeed...

You got it! And it's not just servers but pretty much everything that may
be idle sometimes. Distros (except Android, then) don't use rcu_nocbs= by
default. Routers enjoy night bandwidth, etc...

Thanks!

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 10:53           ` Frederic Weisbecker
@ 2022-08-30 11:43             ` Paul E. McKenney
  2022-08-30 16:03               ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-30 11:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Tue, Aug 30, 2022 at 12:53:24PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2022 at 01:42:02PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > 
> > [ . .  . ]
> > 
> > > > > > 2) NOCB implies performance issues.
> > > > >
> > > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> > > >
> > > > I never measured it myself, but executing callbacks on other CPUs, with
> > > > context switches and locking, can only involve significant performance issues if callbacks
> > > > are frequent. So it's a tradeoff between power and performance.
> > > 
> > > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > > performance hit is down in the noise. It is possible, though, that
> > > one could write an unrealistic synthetic test to force the performance
> > > issues, but I've not seen it in the real world. Maybe on
> > > networking-heavy servers with lots of cores you'll see it, but their
> > > batteries, if any, would be pretty big :-).
> > 
> > To Frederic's point, if you have enough servers, even a 1% decrease in
> > power consumption is a very big deal.  ;-)
> 
> The world has enough servers, for that matter ;-)

True enough!  Now you need only demonstrate that call_rcu_lazy() for
!rcu_nocbs servers would actually deliver that 1%.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 10:50         ` Frederic Weisbecker
@ 2022-08-30 11:47           ` Paul E. McKenney
  0 siblings, 0 replies; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-30 11:47 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, linux-kernel, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Tue, Aug 30, 2022 at 12:50:02PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2022 at 01:31:31PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Another test would be to look at which callbacks are being invoked
> > on each grace period.  We have to have a substantial number of grace
> > periods with only lazy callbacks before call_rcu_lazy() has any chance
> > of helping.  This would need to happen on a server platform because
> > Android and ChromeOS data might or might not carry over.
> 
> Also that yes.

What would be the best way to collect that data?

Please keep in mind that unless/until we have some clear empirical
indication that call_rcu_lazy() really would help the world of systems
running without rcu_nocbs, moving the call_rcu_lazy() implementation
from the bypass list to cblist does nothing but increase risk to systems
that get no benefit from call_rcu_lazy().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 10:43       ` Frederic Weisbecker
@ 2022-08-30 12:08         ` Paul E. McKenney
  0 siblings, 0 replies; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-30 12:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, linux-kernel, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, vineeth

On Tue, Aug 30, 2022 at 12:43:44PM +0200, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2022 at 12:57:30PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > While I agree with you that perhaps making it more generic is better, this did
> > > take a significant amount of time, testing, and corner-case hunting to come up
> > > with, and v5 is also in the works, so I'd appreciate it if we can do it the
> > > bypass way and optimize later. Arguably the bypass way is quite simple and
> > > allows us to leverage its existing handling of rcu_barrier and such. And the API will not
> > > change.
> > 
> > Just confirming this conversation, on the hopefully unlikely off-chance
> > that there is any doubt.  ;-)
> > 
> > That said, if there is some compelling use case that is not addressed
> > by rcu_nocbs, keeping in mind that these can now be made dynamic, then
> > some adjustment will of course be needed.
> 
> Right there is that too.

But we do need the use case before changing the current direction.
Otherwise, all we are doing is slowing down the arrival of energy
efficiency for a very large number of systems for which we know that
there is a call_rcu_lazy() benefit.

Yes, doing the work twice might seem unattractive, but for purposes of
comparison, I redid the CONFIG_RCU_FAST_NO_HZ work eight or nine times before
I got something that saved significant real power on real hardware.
And in the end, CONFIG_RCU_FAST_NO_HZ was removed due to lack of use
compared to rcu_nocbs.  ;-)

> > > > Several highlights:
> > > > 
> > > > 1) NOCB is most often needed for nohz_full and the latter has terrible power
> > > > management. CPU 0 is active all the time there.
> > > 
> > > I see. We don't use nohz_full much. NOCB itself gives good power improvement.
> > > 
> > > > 2) NOCB without nohz_full has an extremely rare usecase (RT niche:
> > > > https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
> > > > )
> > > 
> > > Really? Android has been using it for a long time. It seems to be quite popular
> > > in the battery-powered space.
> > > 
> > > > 2) NOCB implies performance issues.
> > > 
> > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > problem except that RCU is used in the critical path of boot up (on ChromeOS).
> > 
> > Back in 2010, Rik van Riel reported significant slowdowns for some types
> > of Java workloads, but for normal servers, not Android or ChromeOS.
> > I have no idea whether similar slowdowns exist today.  But if there is
> > no performance advantage to non-offloaded callbacks, we should first make
> > offloading the default, and if there are no complaints after a few years,
> > remove the non-offloaded case completely.
> 
> My gut feeling is that this is a bad idea. Yet I have no practical proof :o)

Indeed, the burden of proof is on those who believe that such slowdowns
have somehow vanished over the past 12 years.  ;-)

> > My guess is that at the very least, scheduler corner cases will force
> > us to keep non-offloaded callbacks, but you never know.  In any case,
> > a wakeup is considerably more expensive than a non-atomic OR of a bit
> > in a per-CPU variable, so there is some chance that offloading causes
> > some important workloads considerable performance degradation.
> 
> Definitely!
> 
> > 
> > > > 3) We are mixing up two very different things in a single list of callbacks:
> > > >    lazy callbacks and flooding callbacks, as a result we are adding lots of
> > > >    off-topic corner cases all around:
> > > >      * a separate lazy len field to struct rcu_cblist whose purpose is much more
> > > >        general than just bypass/lazy
> > > >      * "lazy" specialized parameters to general purpose cblist management
> > > >        functions
> > > 
> > > I think just 1 or 2 functions have a new lazy param. It didn't seem too
> > > intrusive to me.
> > 
> > It has been getting simpler!  ;-)
> > 
> > I bet that the lazy_len field can be a boolean and independent of
> > ->cblist, and that doing that would simplify things at least a little bit.
> > But, yes, an all-lazy indicator of some sort would still need to exist.
> 
> Yeah, I'll check the patch in detail.

Much appreciated!  I believe that the upcoming v5 has some significant
differences from v4.

> > > > 4) This is further complexifying bypass core code, nocb timer management, core
> > > >    nocb group management, all of which are already very complicated.
> > > 
> > > True, I agree, a few more cases to handle for sure, but I think I got them all
> > > now (hopefully).
> > 
> > If we do need lazy callbacks on non-offloaded CPUs, there will need to
> > be changes to both the bypass logic (possibly just those changes that
> > Joel already has, but Murphy might disagree) and to the ->cblist logic.
> > At the very least, the wakeup logic would need adjustment from current
> > -rcu and there would still need to be some way of tracking whether or
> > not all the callbacks in the bypass list are lazy.
> 
> Sure, but we can arrange to push the complexity into a common place between
> NOCB and !NOCB.

Easy to say, easy to say...  ;-)

> > > > 5) The !NOCB implementation is going to be very different
> > > > 
> > > > Ok I can admit one counter argument in favour of using NO_CB:
> > > > 
> > > > -1) The scheduler can benefit from an awake CPU to run the callbacks on behalf of a bunch
> > > > of idle CPUs, instead of waking up that bunch of CPUs. But still we are dealing
> > > > with callbacks that can actually wait...
> > 
> > You lost me on this one.  Having a callback invoked on a non-idle CPU
> > should save significant power without significant delay in callback
> > invocation.  What am I missing here?
> 
> The thing is that if the callback can wait, and actually does, then the
> advantage of call_rcu_lazy() should be visible whether the callbacks are
> offloaded or not.

Except that in common configurations of NOCB, the scheduler has much
more freedom to move the work to non-idle CPUs.  Plus in Joel et al.'s
testing, it was the reduced number of grace periods that provided the
power savings.  So if a grace period is happening anyway, then I know of
no benefit from deferring lazy callbacks.  (Which is why I was suggesting
marking laziness of RCU_NEXT_READY_TAIL as well as RCU_NEXT_TAIL.)

> > > Yeah that's huge. A significant amount of power improvement seems to come from
> > > idle CPUs not being disturbed and their corresponding timer ticks turned off for
> > > longer periods. That's experimentally confirmed (NO_CB giving significant power
> > > improvement on battery-powered systems as compared to !NO_CB).
> > > 
> > > > So here is a proposal: how about forgetting NOCB for now and instead add a new
> > > > RCU_LAZY_TAIL segment in the struct rcu_segcblist right after RCU_NEXT_TAIL?
> > > > Then ignore that segment until some timer expiry has been met or the CPU is
> > > > known to be busy? Probably some tiny bits need to be tweaked in segcblist
> > > > management functions but probably not that much. And also make sure that entrain()
> > > > queues to RCU_LAZY_TAIL.
> > > > 
> > > > Then the only difference in the case of NOCB is that we add a new timer to the
> > > > nocb group leader instead of a local timer in !NOCB.
> > 
> > It is certainly good to look into alternatives!  Especially if this has
> > somehow broken (de)offloading.  (Not seeing it in my testing, but then
> > again, I have not yet tested this series all that much.)
> > 
> > How does the separate RCU_LAZY_TAIL segment help?  I would think
> > that you would instead want an all-lazy flag on each of the existing
> > RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL segments.  After all, if there is
> > even one non-lazy callback in either segment, we need the corresponding
> > grace period to run sooner rather than later.  And if we are running a
> > given grace period anyway, it costs little to handle the lazy callbacks
> > while we are at it.
> 
> Good point! That sounds much better.
> 
> > Or is there a use case where it helps a lot to defer lazy callbacks that
> > could have been handled by a grace period that needed to happen anyway,
> > due to the presence of non-lazy callbacks?  I am having a hard time coming
> > up with one, but perhaps that is a failure of imagination on my part.
> 
> I don't see one right now.

Perhaps there isn't one, or perhaps Murphy is waiting to teach us both
a lesson.  ;-)

> > There would still need to be changes to the bypass code because NOCB is
> > what gets both Android and ChromeOS big power savings.
> 
> Actually using the flag on RCU_NEXT_TAIL and RCU_NEXT_READY_TAIL would
> avoid touching the bypass code.
> 
> I see a big advantage in that we don't mix up two orthogonal things anymore:
> flooding callbacks (normal bypass) and callbacks that can actually wait
> (lazy callbacks), both needing a different treatment.

At the very least, the bypass code would need to change in order to
track the laziness of the bypassing callbacks.  Plus wouldn't there
need to be some adjustment of was_alldone indication to the caller of
rcu_nocb_try_bypass() in order to avoid that caller starting a grace
period that should have been lazily deferred?  Or am I missing a trick
here?

And again, touching ->cblist increases risk to workloads that are not
yet known to benefit from call_rcu_lazy().

> > And yes, no matter what, rcu_barrier_entrain() needs to motivate any lazy
> > callbacks.  Currently, this falls out from the flushing of the bypass.
> > Presumably, offloading and deoffloading could also take advantage of
> > bypass flushing.
> > 
> > And I have no idea whether it would make sense for the NOCB and !NOCB
> > case to share a laziness-motivation timer.
> 
> No, at least the timer will need to be different. It should integrate into
> the existing one in NOCB, whereas !NOCB should have something simpler.

I would guess that we will find a devil or two in those details.

> > > It sounds reasonable, but I'll go with Paul on the usecase argument - who would
> > > actually care about lazy CBs outside of power, and would those guys ever use
> > > !NO_CB if they cared about power / battery?
> > 
> > And if they are not using NOCB, does call_rcu_lazy() actually help?
> 
> I suspect yes, due to the lower frequency of grace periods.

But your suspicion rests on the assumption that server workloads often
have nothing but lazy callbacks queued, correct?

If so, what is a good way to check that assumption?  Measure the fraction
of full seconds during which nothing but lazy callbacks are queued?
Measure the fraction of grace periods that handled only lazy callbacks?

Or is there some better approach?
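
One way to collect that second number, sketched under the assumption that
the grace-period cleanup path could be handed per-GP counts of lazy and
non-lazy callbacks (the hook and all names below are hypothetical):

/* Count how many grace periods serviced nothing but lazy callbacks. */
static unsigned long lazy_gp_total;
static unsigned long lazy_gp_all_lazy;

/* Hypothetical hook, called once per GP from the cleanup path. */
static void lazy_gp_account(unsigned long nonlazy, unsigned long lazy)
{
	lazy_gp_total++;
	if (!nonlazy && lazy)
		lazy_gp_all_lazy++;
}

/* Exported e.g. via debugfs: the fraction of grace periods that only
 * handled lazy callbacks, i.e. GPs that call_rcu_lazy() could have
 * deferred or merged on a !NOCB system. */
static unsigned long lazy_gp_percent(void)
{
	return lazy_gp_total ? (100 * lazy_gp_all_lazy) / lazy_gp_total : 0;
}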

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 11:43             ` Paul E. McKenney
@ 2022-08-30 16:03               ` Frederic Weisbecker
  2022-08-30 16:22                 ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 16:03 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Tue, Aug 30, 2022 at 04:43:43AM -0700, Paul E. McKenney wrote:
> On Tue, Aug 30, 2022 at 12:53:24PM +0200, Frederic Weisbecker wrote:
> > On Mon, Aug 29, 2022 at 01:42:02PM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > > > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > > 
> > > [ . .  . ]
> > > 
> > > > > > > 2) NOCB implies performance issues.
> > > > > >
> > > > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > > > problem except that RCU is used in the critical path of boot-up (on ChromeOS).
> > > > >
> > > > > I never measured it myself, but executing callbacks on other CPUs, with
> > > > > context switches and locking, can only involve significant performance issues
> > > > > if callbacks are frequent. So it's a tradeoff between power and performance.
> > > > 
> > > > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > > > performance hit is down in the noise. It is possible, though, that
> > > > one could write an unrealistic synthetic test to force the performance
> > > > issues, but I've not seen it in the real world. Maybe on
> > > > networking-heavy servers with lots of cores you'll see it, but their
> > > > batteries, if any, would be pretty big :-).
> > > 
> > > To Frederic's point, if you have enough servers, even a 1% decrease in
> > > power consumption is a very big deal.  ;-)
> > 
> > The world has enough servers, for that matter ;-)
> 
> True enough!  Now you need only demonstrate that call_rcu_lazy() for
> !rcu_nocbs servers would actually deliver that 1%.  ;-)

Well, !rcu_nocbs is not only used by servers but also by pretty much
everything else, except Android IIUC. I can't really measure the whole
world, but I don't see how the idleness of a server/router/desktop/embedded/rt/hpc
device differs from the idleness of an Android device.

But OK, I'll try to measure that.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:03               ` Frederic Weisbecker
@ 2022-08-30 16:22                 ` Frederic Weisbecker
  2022-08-30 16:44                   ` Uladzislau Rezki
                                     ` (2 more replies)
  0 siblings, 3 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-30 16:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Tue, Aug 30, 2022 at 06:03:16PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2022 at 04:43:43AM -0700, Paul E. McKenney wrote:
> > On Tue, Aug 30, 2022 at 12:53:24PM +0200, Frederic Weisbecker wrote:
> > > On Mon, Aug 29, 2022 at 01:42:02PM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > > > > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > > > 
> > > > [ . .  . ]
> > > > 
> > > > > > > > 2) NOCB implies performance issues.
> > > > > > >
> > > > > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > > > > problem except that RCU is used in the critical path of boot-up (on ChromeOS).
> > > > > >
> > > > > > I never measured it myself, but executing callbacks on other CPUs, with
> > > > > > context switches and locking, can only involve significant performance issues
> > > > > > if callbacks are frequent. So it's a tradeoff between power and performance.
> > > > > 
> > > > > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > > > > performance hit is down in the noise. It is possible, though, that
> > > > > one could write an unrealistic synthetic test to force the performance
> > > > > issues, but I've not seen it in the real world. Maybe on
> > > > > networking-heavy servers with lots of cores you'll see it, but their
> > > > > batteries, if any, would be pretty big :-).
> > > > 
> > > > To Frederic's point, if you have enough servers, even a 1% decrease in
> > > > power consumption is a very big deal.  ;-)
> > > 
> > > The world has enough servers, for that matter ;-)
> > 
> > True enough!  Now you need only demonstrate that call_rcu_lazy() for
> > !rcu_nocbs servers would actually deliver that 1%.  ;-)
> 
> Well, !rcu_nocbs is not only used by servers but also by pretty much
> everything else, except Android IIUC. I can't really measure the whole
> world, but I don't see how the idleness of a server/router/desktop/embedded/rt/hpc
> device differs from the idleness of an Android device.
> 
> But OK, I'll try to measure that.

Although who knows, maybe some periodic file operations while idle are specific
to Android. I'll try to trace lazy callbacks while idle and the number of grace
periods associated.



^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:22                 ` Frederic Weisbecker
@ 2022-08-30 16:44                   ` Uladzislau Rezki
  2022-08-30 18:53                     ` Joel Fernandes
  2022-09-01 11:29                     ` Frederic Weisbecker
  2022-08-30 16:46                   ` Paul E. McKenney
  2022-08-30 18:46                   ` Joel Fernandes
  2 siblings, 2 replies; 54+ messages in thread
From: Uladzislau Rezki @ 2022-08-30 16:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Joel Fernandes, Dietmar Eggemann, LKML,
	Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

Hello, Frederic.

> 
> Although who knows, maybe some periodic file operations while idle are specific
> to Android. I'll try to trace lazy callbacks while idle and the number of grace
> periods associated.
> 
> 
Everything related to lazy callbacks is about not waking "nocb"
kthreads in order to offload one, or I should say a few, callbacks,
because that is more or less useless. Currently, if the incoming callback
is the only one, it will kick a GP, and the GP will in turn kick the
nocb kthread to offload.

In lightly loaded test cases, especially where power drain is a key
concern, such a light load may lead to some "noise" produced by the
RCU core, i.e. kicking idle CPUs, thus wasting power. On our ARM devices
it is not painful, but there is a small power gain and it is visible.
For other systems, like the Intel SoC Joel measures on, it is more visible,
because of the power cost of getting in and out of idle states.

This is what I see on my setup.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:22                 ` Frederic Weisbecker
  2022-08-30 16:44                   ` Uladzislau Rezki
@ 2022-08-30 16:46                   ` Paul E. McKenney
  2022-08-31 15:26                     ` Frederic Weisbecker
  2022-08-30 18:46                   ` Joel Fernandes
  2 siblings, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2022-08-30 16:46 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Tue, Aug 30, 2022 at 06:22:44PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2022 at 06:03:16PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2022 at 04:43:43AM -0700, Paul E. McKenney wrote:
> > > On Tue, Aug 30, 2022 at 12:53:24PM +0200, Frederic Weisbecker wrote:
> > > > On Mon, Aug 29, 2022 at 01:42:02PM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
> > > > > > On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
> > > > > > > On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
> > > > > > > > On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
> > > > > 
> > > > > [ . .  . ]
> > > > > 
> > > > > > > > > 2) NOCB implies performance issues.
> > > > > > > >
> > > > > > > > Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
> > > > > > > > with the extra scheduling overhead of the extra threads, which is usually not a
> > > > > > > > problem except that RCU is used in the critical path of boot-up (on ChromeOS).
> > > > > > >
> > > > > > > I never measured it myself, but executing callbacks on other CPUs, with
> > > > > > > context switches and locking, can only involve significant performance issues
> > > > > > > if callbacks are frequent. So it's a tradeoff between power and performance.
> > > > > > 
> > > > > > In my testing of benchmarks on real systems with 8-16 CPUs, the
> > > > > > performance hit is down in the noise. It is possible, though, that
> > > > > > one could write an unrealistic synthetic test to force the performance
> > > > > > issues, but I've not seen it in the real world. Maybe on
> > > > > > networking-heavy servers with lots of cores you'll see it, but their
> > > > > > batteries, if any, would be pretty big :-).
> > > > > 
> > > > > To Frederic's point, if you have enough servers, even a 1% decrease in
> > > > > power consumption is a very big deal.  ;-)
> > > > 
> > > > The world has enough servers, for that matter ;-)
> > > 
> > > True enough!  Now you need only demonstrate that call_rcu_lazy() for
> > > !rcu_nocbs servers would actually deliver that 1%.  ;-)
> > 
> > Well, !rcu_nocbs is not only used by servers but also by pretty much
> > everything else, except Android IIUC.

And soon, ChromeOS.

> >                                       I can't really measure the whole
> > world, but I don't see how the idleness of a server/router/desktop/embedded/rt/hpc
> > device differs from the idleness of an Android device.
> > 
> > But OK, I'll try to measure that.
> 
> Although who knows, maybe some periodic file operations while idle are specific
> to Android. I'll try to trace lazy callbacks while idle and the number of grace
> periods associated.

Sounds like a good start.

And yes, we don't need to show that the whole !NOCB world needs this,
just some significant portion of it.  But we do need some decent evidence.
After all, it is all too easy to do a whole lot of work and find that
the expected benefits fail to materialize.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 10:26         ` Frederic Weisbecker
@ 2022-08-30 18:40           ` Joel Fernandes
  0 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-08-30 18:40 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Dietmar Eggemann, LKML, Paul E. McKenney, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

Hi Frederic,

On 8/30/2022 6:26 AM, Frederic Weisbecker wrote:
> On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
[..]
>>>>> 2) NOCB without nohz_full has an extremely rare usecase (RT niche:
>>>>> https://lore.kernel.org/lkml/CAFzL-7vqTX-y06Kc3HaLqRWAYE0d=ms3TzVtZLn0c6ATrKD+Qw@mail.gmail.com/
>>>>> )
>>>>
>>>> Really? Android has been using it for a long time. It seems to be quite popular
>>>> in the battery-powered space.
>>>
>>> It's really sad that this is the first time I hear about that. I've been working
>>> on this code for years now without this usecase in mind. And yet it's fundamental.
>>>
>>> I asked several times around about other usecases of rcu_nocbs than nohz_full=
>>> and nobody reported that. I can hardly even google a significant link
>>> between power saving and rcu_nocbs=
>>>
>>> If this is really used that way for a long time then it's a cruel disconnection
>>> between users and developers.
>>
>> I was not involved with Android or RCU back then. But my guess is that it
>> was not enabled with the intent of saving power; it is just
>> that using RCU_NO_CB_CPU has become the way to go to keep dyntick-idle
>> CPUs undisturbed: https://paulmck.livejournal.com/66807.html . Paul,
>> +Dietmar Eggemann can probably provide more history.
> 
> Thanks for the pointer!

No problem. Paul proves to us that any and every bit of writing effort pays off
in one way or the other :-D.

>>>>> 2) NOCB implies performance issues.
>>>>
>>>> Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
>>>> with the extra scheduling overhead of the extra threads, which is usually not a
>>>> problem except that RCU is used in the critical path of boot-up (on ChromeOS).
>>>
>>> I never measured it myself, but executing callbacks on other CPUs, with
>>> context switches and locking, can only involve significant performance issues
>>> if callbacks are frequent. So it's a tradeoff between power and performance.
>>
>> In my testing of benchmarks on real systems with 8-16 CPUs, the
>> performance hit is down in the noise. It is possible, though, that
>> one could write an unrealistic synthetic test to force the performance
>> issues, but I've not seen it in the real world. Maybe on
>> networking-heavy servers with lots of cores you'll see it, but their
>> batteries, if any, would be pretty big :-).
> 
> Yeah I suspect this should have an impact on servers. And even servers may
> deserve idleness sometimes.

Sure. I am pretty sure you will see power savings on servers, but I guess
the big question is whether you will regress server workloads with those
savings. I do remember Eric Dumazet from Google posting to LKML that callback
flooding can happen on servers.

>>>>> 3) We are mixing up two very different things in a single list of callbacks:
>>>>>    lazy callbacks and flooding callbacks, as a result we are adding lots of
>>>>>    off-topic corner cases all around:
>>>>>      * a separate lazy len field to struct rcu_cblist whose purpose is much more
>>>>>        general than just bypass/lazy
>>>>>      * "lazy" specialized parameters to general purpose cblist management
>>>>>        functions
>>>>
>>>> I think just 1 or 2 functions have a new lazy param. It didn't seem too
>>>> intrusive to me.
>>>
>>> What bothers me is that struct cblist has a general purpose and we are adding a field
>>> and a parameter that is relevant to only one specialized user.
>>
>> To Paul's point, we can change it to a flag I think. The 3 states are:
>> - no CBs.
>> - All lazy
>> - All non-lazy
> 
> Yeah that makes sense, should we take the generic direction!

Sure, thanks.

>>
>> Or, worst case, we can move the flag to the per-cpu rcu_data
>> structure, I think. Does that alleviate your concern?
>>
>>>>> 4) This is further complicating the bypass core code, nocb timer management, and core
>>>>>    nocb group management, all of which are already very complicated.
>>>>
>>>> True, I agree, a few more cases to handle for sure, but I think I got them all
>>>> now (hopefully).
>>>
>>> Now I'm worried about maintainability. That is why I'd rather see generic code
>>> for them all if possible.
>>
>> Maintainability is a concern for any new code. I rely a lot on testing,
>> both synthetic and real-world. I have spent a lot of time on test
>> code for this.
> 
> And I thank you for that. I've always been terrible at performance/power/latency
> testing so I'm always relieved when others take over on that :)

Thanks, appreciate it.

>>>
>>> Note I'm not strictly opposed to the current approach. But I can't say I'm
>>> comfortable with it.
>>
>> Maybe we can find a way in the code to see what can be changed. One
>> concern you rightly raised is the length variables, but I think
>> that's not a blocker.
> 
> That's just a detail indeed. If we proceed with the current NOCB approach I'll
> give a deeper review.
> 
>>
>>> Can we do a simple test? Would it be possible to affine every rcuo%c/%d kthread
>>> to the corresponding CPU%d? For example affine rcuop/1 to CPU 1, rcuop/2 to
>>> CPU2, etc... And then relaunch your measurements on top of that?
>>
>> We already did that and it does not help, it makes things worse. The
>> decision of where to run RCU threads is best left to the scheduler
>> (for both performance and power).
> 
> That's the point, we want to artificially remove NOCB advantages to get
> a behaviour close to !NOCB, so that we know if it's worth expanding the
> patchset to !NOCB.
> Let me clarify. The point of NOCB, powerwise, is that the RCU kthreads
> track and execute the callbacks on an optimized selection of CPUs, thanks to the
> scheduler. And the new lazy flow optimizes that further.
> 
> Now we would like to know if call_rcu_lazy() also works on !NOCB. But we have
> no way to test that right now because we only have an implementation for NOCB.

I am happy to try this, but I want to make a counter-suggestion. You can 'hack'
a behavior similar to call_rcu_lazy() by just increasing jiffies_till_first_fqs
to 24 or 48 (via boot param). The downside of doing this is that you'll slow
down synchronize_rcu(), and tracing will be slow. That can be remedied by
passing the rcupdate.rcu_expedited boot parameter to speed up synchronize_rcu().
However, it will still slow down *all* call_rcu(). Nevertheless, it's a neat
trick to see the light at the end of the tunnel, but not for production.
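
For concreteness, the boot command line for this trick would carry something
like the following (48 is just the example value from above; both parameters
are existing kernel boot parameters):

	rcutree.jiffies_till_first_fqs=48 rcupdate.rcu_expedited=1

The first parameter stretches the delay before the grace-period kthread first
scans for holdout CPUs, so more callbacks share a single grace period; the
second keeps synchronize_rcu() callers fast via expedited grace periods.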

Would it be possible for you to try this out with some workloads on a server,
or even with the server just being idle?

Just to mention - the power savings between NO_CB and !NO_CB are enormous for us
(40% !!). We attributed this directly to the tick not getting turned off and the
hardware SoC thus not entering a deeper SoC-wide low-power state (known as package
C-state). While a server does not typically have an SoC, I suspect *some* power
savings at least.

> But if we affine each rcuop/%d to their corresponding CPU%d, we get a behaviour
> that is close to !NOCB because then the scheduler doesn't help anymore and
> callbacks are executed locally (the only difference is that we do context
> switches, locking, and we still have rcuog around but still, that
> should give a rough idea of what we get with !NOCB).
> 
> Therefore if you test your measurements again before and after your patchset,
> both with each rcuop/%d affine to their CPU%d counterpart, you should get a
> comparison of before VS after your patchset on what would be an implementation
> on !NOCB.

My suspicion is that if I did this experiment on my hardware, it would still not
give you a good idea of what a server will see, because I don't have those kinds
of workloads or many-CPU hardware. So it may not be that representative of what
the !NOCB world wants. Further, my hardware is an Intel SoC which is more
integrated than a typical server board, so SoC-wide power savings might not be
representative of server hardware. However, I am happy to try out your
suggestion out of respect for you, if you still feel it's worth doing.

Thanks,

 - Joel





^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:22                 ` Frederic Weisbecker
  2022-08-30 16:44                   ` Uladzislau Rezki
  2022-08-30 16:46                   ` Paul E. McKenney
@ 2022-08-30 18:46                   ` Joel Fernandes
  2 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-08-30 18:46 UTC (permalink / raw)
  To: Frederic Weisbecker, Paul E. McKenney
  Cc: Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai



On 8/30/2022 12:22 PM, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2022 at 06:03:16PM +0200, Frederic Weisbecker wrote:
>> On Tue, Aug 30, 2022 at 04:43:43AM -0700, Paul E. McKenney wrote:
>>> On Tue, Aug 30, 2022 at 12:53:24PM +0200, Frederic Weisbecker wrote:
>>>> On Mon, Aug 29, 2022 at 01:42:02PM -0700, Paul E. McKenney wrote:
>>>>> On Mon, Aug 29, 2022 at 04:36:40PM -0400, Joel Fernandes wrote:
>>>>>> On Mon, Aug 29, 2022 at 3:46 PM Frederic Weisbecker <frederic@kernel.org> wrote:
>>>>>>> On Mon, Aug 29, 2022 at 12:45:40PM -0400, Joel Fernandes wrote:
>>>>>>>> On 8/29/2022 9:40 AM, Frederic Weisbecker wrote:
>>>>>
>>>>> [ . .  . ]
>>>>>
>>>>>>>>> 2) NOCB implies performance issues.
>>>>>>>>
>>>>>>>> Which kinds of? There are slightly worse boot times, but I'm guessing that's to do
>>>>>>>> with the extra scheduling overhead of the extra threads, which is usually not a
>>>>>>>> problem except that RCU is used in the critical path of boot-up (on ChromeOS).
>>>>>>>
>>>>>>> I never measured it myself, but executing callbacks on other CPUs, with
>>>>>>> context switches and locking, can only involve significant performance issues
>>>>>>> if callbacks are frequent. So it's a tradeoff between power and performance.
>>>>>>
>>>>>> In my testing of benchmarks on real systems with 8-16 CPUs, the
>>>>>> performance hit is down in the noise. It is possible, though, that
>>>>>> one could write an unrealistic synthetic test to force the performance
>>>>>> issues, but I've not seen it in the real world. Maybe on
>>>>>> networking-heavy servers with lots of cores you'll see it, but their
>>>>>> batteries, if any, would be pretty big :-).
>>>>>
>>>>> To Frederic's point, if you have enough servers, even a 1% decrease in
>>>>> power consumption is a very big deal.  ;-)
>>>>
>>>> The world has enough servers, for that matter ;-)
>>>
>>> True enough!  Now you need only demonstrate that call_rcu_lazy() for
>>> !rcu_nocbs servers would actually deliver that 1%.  ;-)
>>
>> Well, !rcu_nocbs is not only used by servers but also by pretty much
>> everything else, except Android IIUC. I can't really measure the whole
>> world, but I don't see how the idleness of a server/router/desktop/embedded/rt/hpc
>> device differs from the idleness of an Android device.
>>
>> But OK, I'll try to measure that.
> 
> Although who knows, maybe some periodic file operations while idle are specific
> to Android. I'll try to trace lazy callbacks while idle and the number of grace
> periods associated.

One potential usecase could be logging, if the logger is opening and closing log
files during logging updates. Uladzislau reported this on Android: a system
logger was doing file open/close and triggering RCU that way (the file table,
dentry, and inode code all queue RCU callbacks).

Thanks,

 - Joel


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:44                   ` Uladzislau Rezki
@ 2022-08-30 18:53                     ` Joel Fernandes
  2022-09-01 11:29                     ` Frederic Weisbecker
  1 sibling, 0 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-08-30 18:53 UTC (permalink / raw)
  To: Uladzislau Rezki, Frederic Weisbecker
  Cc: Paul E. McKenney, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai



On 8/30/2022 12:44 PM, Uladzislau Rezki wrote:
> Hello, Frederic.
> 
>>
>> Although who knows, may be some periodic file operation while idle are specific
>> to Android. I'll try to trace lazy callbacks while idle and the number of grace
>> periods associated.
>>
>>
> Everything related to lazy callbacks is about not waking "nocb"
> kthreads in order to offload one, or I should say a few, callbacks,
> because that is more or less useless. Currently, if the incoming callback
> is the only one, it will kick a GP, and the GP will in turn kick the
> nocb kthread to offload.
> 
> In lightly loaded test cases, especially where power drain is a key
> concern, such a light load may lead to some "noise" produced by the
> RCU core, i.e. kicking idle CPUs, thus wasting power. On our ARM devices
> it is not painful, but there is a small power gain and it is visible.
> For other systems, like the Intel SoC Joel measures on, it is more visible,
> because of the power cost of getting in and out of idle states.

Indeed! And Intel (Rushikesh) reminded me today that it has become more
important than before not to disturb idle CPUs, because more and more pieces of
hardware are moving into SoCs. So if the whole SoC cannot go into a deeper sleep
state (package C-state) because a CPU is getting disturbed, then that's less
power savings than, say, in the not-so-recent past. Rushikesh could shed more
light on that.

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:46                   ` Paul E. McKenney
@ 2022-08-31 15:26                     ` Frederic Weisbecker
  2022-09-01 14:39                       ` Paul E. McKenney
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-08-31 15:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
> > Although who knows, maybe some periodic file operations while idle are specific
> > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > periods associated.
> 
> Sounds like a good start.
> 
> And yes, we don't need to show that the whole !NOCB world needs this,
> just some significant portion of it.  But we do need some decent evidence.
> After all, it is all too easy to do a whole lot of work and find that
> the expected benefits fail to materialize.

So here is a quick test. I made a patch that replaces Joel's 1st patch
with an implementation of call_rcu_lazy() that queues lazy callbacks
through the regular call_rcu() path but counts them in a lazy_count.

Upon idle entry it reports whether the tick is retained solely by lazy
callbacks or not.
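
In spirit, the counting replacement looks like the sketch below (the counter,
helper, and tracepoint names are hypothetical; the actual code is in the
branch below and may differ):

/* Hypothetical per-CPU count of pending lazy callbacks; the real patch
 * would also need to decrement this when callbacks are invoked (elided). */
static DEFINE_PER_CPU(long, rcu_lazy_pending);

void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
	this_cpu_inc(rcu_lazy_pending);	/* only counted as lazy... */
	call_rcu(head, func);		/* ...then queued the regular way */
}

/* At idle entry, where the tick is about to be retained for pending
 * callbacks: if everything pending is lazy, say so. */
static void rcu_lazy_report(struct rcu_data *rdp)
{
	long nlazy = __this_cpu_read(rcu_lazy_pending);

	if (nlazy && nlazy == rcu_segcblist_n_cbs(&rdp->cblist))
		trace_rcu_needs_cpu(nlazy);	/* hypothetical tracepoint */
}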

I get periodic and frequent results on my idle test box, something must be
opening/closing some file periodically perhaps.

Anyway the thing can be tested with this branch:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
	rcu/lazy-trace

Excerpt:

          <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.468048: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.468268: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.472268: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   414.476269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.500577: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.504253: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.508250: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.512249: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.566881: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.568252: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.572249: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   419.576255: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.666873: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.668233: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.672230: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.676232: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.737283: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.740233: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.744230: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   424.748231: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.767922: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.768209: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.772212: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.776212: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.972931: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.976214: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.980211: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.   429.984212: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [003] d..1.   430.402139: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [003] d..1.   430.404211: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [003] d..1.   430.404290: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [003] d..1.   430.408235: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [003] d..1.   430.408304: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [003] d..1.   430.412215: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-30 16:44                   ` Uladzislau Rezki
  2022-08-30 18:53                     ` Joel Fernandes
@ 2022-09-01 11:29                     ` Frederic Weisbecker
  2022-09-01 11:59                       ` Uladzislau Rezki
  1 sibling, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-09-01 11:29 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, Joel Fernandes, Dietmar Eggemann, LKML,
	Rushikesh S Kadam, Neeraj upadhyay, Steven Rostedt, rcu,
	Vineeth Pillai

On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
> Hello, Frederic.
> 
> > 
> > Although who knows, maybe some periodic file operations while idle are specific
> > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > periods associated.
> > 
> > 
> Everything related to lazy callbacks is about not waking "nocb"
> kthreads in order to offload one, or I should say a few, callbacks,
> because that is more or less useless. Currently, if the incoming callback
> is the only one, it will kick a GP, and the GP will in turn kick the
> nocb kthread to offload.

Not sure this is only about not waking "nocb" kthreads. The grace period
kthread is also awakened in !NOCB and has quite some work to do. And there,
having a server expands the issue because you may have a lot of CPUs' extended
quiescent states to check.

Also in !NOCB, pending callbacks retain the timer tick of a CPU (see
rcu_needs_cpu()), and cpuidle relies on the tick to be stopped before
allowing the CPU into low power mode. So a lazy callback may delay a CPU from
entering into low power mode for a few milliseconds.

And I can observe those retained ticks on my idle box.
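
For reference, the tick-retaining check is roughly this shape (simplified;
the real rcu_needs_cpu() in kernel/rcu/tree.c also excludes offloaded CPUs):

/* Keep the scheduler tick on if this CPU still has callbacks queued. */
static int rcu_needs_cpu(void)
{
	return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist);
}

A lazy-aware variant could let the tick stop when everything still queued is
lazy, which is exactly the case the trace excerpt elsewhere in this thread
flags as BAD.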

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 11:29                     ` Frederic Weisbecker
@ 2022-09-01 11:59                       ` Uladzislau Rezki
  2022-09-01 14:41                         ` Paul E. McKenney
  2022-09-01 15:13                         ` Frederic Weisbecker
  0 siblings, 2 replies; 54+ messages in thread
From: Uladzislau Rezki @ 2022-09-01 11:59 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Uladzislau Rezki, Paul E. McKenney, Joel Fernandes,
	Dietmar Eggemann, LKML, Rushikesh S Kadam, Neeraj upadhyay,
	Steven Rostedt, rcu, Vineeth Pillai

On Thu, Sep 01, 2022 at 01:29:47PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
> > Hello, Frederic.
> > 
> > > 
> > > Although who knows, maybe some periodic file operations while idle are specific
> > > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > > periods associated.
> > > 
> > > 
> > Everything related to lazy callbacks is about not waking "nocb"
> > kthreads in order to offload one, or I should say a few, callbacks,
> > because that is more or less useless. Currently, if the incoming callback
> > is the only one, it will kick a GP, and the GP will in turn kick the
> > nocb kthread to offload.
> 
> Not sure this is only about not waking "nocb" kthreads. The grace period
> kthread is also awakened in !NOCB and has quite some work to do. And there,
> having a server expands the issue because you may have a lot of CPUs' extended
> quiescent states to check.
> 
I mean here the following combination: the NOCB + call_rcu_lazy() tandem.
The !NOCB case is not about power saving, IMHO, because it implies callbacks
being processed on the CPUs where they land.

In this scenario you cannot let the EAS scheduler find a more
efficient CPU for further handling.

>
> Also in !NOCB, pending callbacks retain the timer tick of a CPU (see
> rcu_needs_cpu()), and cpuidle relies on the tick to be stopped before
> allowing the CPU into low power mode. So a lazy callback may delay a CPU from
> entering into low power mode for a few milliseconds.
> 
> And I can observe those retained ticks on my idle box.
>
Maybe !NOCB is more about performance. But I have no clue about
such workloads and whether they exist nowadays.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-08-31 15:26                     ` Frederic Weisbecker
@ 2022-09-01 14:39                       ` Paul E. McKenney
  2022-09-01 14:58                         ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2022-09-01 14:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Wed, Aug 31, 2022 at 05:26:58PM +0200, Frederic Weisbecker wrote:
> On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
> > > Although who knows, maybe some periodic file operations while idle are specific
> > > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > > periods associated.
> > 
> > Sounds like a good start.
> > 
> > And yes, we don't need to show that the whole !NOCB world needs this,
> > just some significant portion of it.  But we do need some decent evidence.
> > After all, it is all too easy to do a whole lot of work and find that
> > the expected benefits fail to materialize.
> 
> So here is some quick test. I made a patch that replaces Joel's 1st patch
> with an implementation of call_rcu_lazy() that queues lazy callbacks
> through the regular call_rcu() way but it counts them in a lazy_count.
> 
> Upon idle entry it reports whether the tick is retained solely by lazy
> callbacks or not.
> 
> I get periodic and frequent results on my idle test box, something must be
> opening/closing some file periodically perhaps.
> 
> Anyway the thing can be tested with this branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> 	rcu/lazy-trace
> 
> Excerpt:
> 
>           <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle

Just to make sure that I understand, at this point, there is only the
one lazy callback (and no non-lazy callbacks) on this CPU, and that
CPU is therefore keeping the tick on only for the benefit of that one
lazy callback.  And for the above four traces, this is likely the same
lazy callback.

Did I get it right, or is there something else going on?

							Thanx, Paul

>           <idle>-0       [007] d..1.   414.468048: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   414.468268: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   414.472268: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   414.476269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.500577: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.504253: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.508250: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.512249: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.566881: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.568252: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.572249: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   419.576255: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.666873: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.668233: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.672230: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.676232: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.737283: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.740233: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.744230: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   424.748231: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.767922: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.768209: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.772212: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.776212: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.972931: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.976214: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.980211: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.   429.984212: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [003] d..1.   430.402139: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [003] d..1.   430.404211: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [003] d..1.   430.404290: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [003] d..1.   430.408235: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [003] d..1.   430.408304: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [003] d..1.   430.412215: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> 
> Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 11:59                       ` Uladzislau Rezki
@ 2022-09-01 14:41                         ` Paul E. McKenney
  2022-09-01 15:30                           ` Frederic Weisbecker
  2022-09-01 15:13                         ` Frederic Weisbecker
  1 sibling, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2022-09-01 14:41 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Frederic Weisbecker, Joel Fernandes, Dietmar Eggemann, LKML,
	Rushikesh S Kadam, Neeraj upadhyay, Steven Rostedt, rcu,
	Vineeth Pillai

On Thu, Sep 01, 2022 at 01:59:10PM +0200, Uladzislau Rezki wrote:
> On Thu, Sep 01, 2022 at 01:29:47PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
> > > Hello, Frederic.
> > > 
> > > > 
> > > > Although who knows, maybe some periodic file operations while idle are specific
> > > > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > > > periods associated.
> > > > 
> > > > 
> > > Everything related to lazy callbacks is about not waking "nocb"
> > > kthreads in order to offload one, or I should say a few, callbacks,
> > > because that is more or less useless. Currently, if the incoming callback
> > > is the only one, it will kick a GP, and the GP will in turn kick the
> > > nocb kthread to offload.
> > 
> > Not sure this is only about not waking "nocb" kthreads. The grace period
> > kthread is also awakened in !NOCB and has quite some work to do. And there,
> > having a server expands the issue because you may have a lot of CPUs' extended
> > quiescent states to check.
> > 
> I mean here the following combination: the NOCB + call_rcu_lazy() tandem.
> The !NOCB case is not about power saving, IMHO, because it implies callbacks
> being processed on the CPUs where they land.
> 
> > In this scenario you cannot let the EAS scheduler find a more
> > efficient CPU for further handling.

Just to follow up, Uladzislau and others did some detailed performance
analysis of NOCB on Android.  Of course, this analysis might or might
not carry over to servers, but it was pretty detailed.

							Thanx, Paul

> > Also in !NOCB, pending callbacks retain the timer tick of a CPU (see
> > rcu_needs_cpu()), and cpuidle relies on the tick to be stopped before
> > allowing the CPU into low power mode. So a lazy callback may delay a CPU from
> > entering into low power mode for a few milliseconds.
> > 
> > And I can observe those retained ticks on my idle box.
> >
> Maybe !NOCB is more about performance. But I have no clue about
> such workloads and whether they exist nowadays.
> 
> --
> Uladzislau Rezki

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 14:39                       ` Paul E. McKenney
@ 2022-09-01 14:58                         ` Frederic Weisbecker
  2022-09-01 16:07                           ` Joel Fernandes
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-09-01 14:58 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Thu, Sep 01, 2022 at 07:39:07AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 31, 2022 at 05:26:58PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
> > > > Although who knows, maybe some periodic file operations while idle are specific
> > > > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > > > periods associated.
> > > 
> > > Sounds like a good start.
> > > 
> > > And yes, we don't need to show that the whole !NOCB world needs this,
> > > just some significant portion of it.  But we do need some decent evidence.
> > > After all, it is all too easy to do a whole lot of work and find that
> > > the expected benefits fail to materialize.
> > 
> > So here is some quick test. I made a patch that replaces Joel's 1st patch
> > with an implementation of call_rcu_lazy() that queues lazy callbacks
> > through the regular call_rcu() way but it counts them in a lazy_count.
> > 
> > Upon idle entry it reports whether the tick is retained solely by lazy
> > callbacks or not.
> > 
> > I get periodic and frequent results on my idle test box, something must be
> > opening/closing some file periodically perhaps.
> > 
> > Anyway the thing can be tested with this branch:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > 	rcu/lazy-trace
> > 
> > Excerpt:
> > 
> >           <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >           <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >           <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >           <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> 
> Just to make sure that I understand, at this point, there is only the
> one lazy callback (and no non-lazy callbacks) on this CPU, and that
> CPU is therefore keeping the tick on only for the benefit of that one
> lazy callback.  And for the above four traces, this is likely the same
> lazy callback.
> 
> Did I get it right, or is there something else going on?

Exactly that!

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 11:59                       ` Uladzislau Rezki
  2022-09-01 14:41                         ` Paul E. McKenney
@ 2022-09-01 15:13                         ` Frederic Weisbecker
  2022-09-01 16:07                           ` Joel Fernandes
  1 sibling, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-09-01 15:13 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, Joel Fernandes, Dietmar Eggemann, LKML,
	Rushikesh S Kadam, Neeraj upadhyay, Steven Rostedt, rcu,
	Vineeth Pillai

On Thu, Sep 01, 2022 at 01:59:10PM +0200, Uladzislau Rezki wrote:
> On Thu, Sep 01, 2022 at 01:29:47PM +0200, Frederic Weisbecker wrote:
> > On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
> > > Hello, Frederic.
> > > 
> > > > 
> > > > Although who knows, maybe some periodic file operations while idle are specific
> > > > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > > > periods associated.
> > > > 
> > > > 
> > > > Everything related to lazy callbacks is about not waking "nocb"
> > > > kthreads in order to offload one, or I should say a few, callbacks,
> > > > because that is more or less useless. Currently, if the incoming callback
> > > > is the only one, it will kick a GP, and the GP will in turn kick the
> > > > nocb kthread to offload.
> > 
> > Not sure this is only about not waking "nocb" kthreads. The grace period
> > kthread is also awakened in !NOCB and has quite some work to do. And there,
> > having a server expands the issue because you may have a lot of CPUs' extended
> > quiescent states to check.
> > 
> I mean here the following combination: the NOCB + call_rcu_lazy() tandem.
> The !NOCB case is not about power saving, IMHO, because it implies callbacks
> being processed on the CPUs where they land.

I'm sorry but I still feel confused reading that !NOCB is not about power
saving. To me everything is about power saving. NOCB just appears to help
optimize it without significant tradeoff on some given workloads.

> In this scenario you cannot let the EAS scheduler find a more
> efficient CPU for further handling.

Sure but that doesn't mean there wouldn't be a power saving gain anyway.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 14:41                         ` Paul E. McKenney
@ 2022-09-01 15:30                           ` Frederic Weisbecker
  2022-09-01 16:11                             ` Joel Fernandes
  2022-09-01 16:52                             ` Paul E. McKenney
  0 siblings, 2 replies; 54+ messages in thread
From: Frederic Weisbecker @ 2022-09-01 15:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Uladzislau Rezki, Joel Fernandes, Dietmar Eggemann, LKML,
	Rushikesh S Kadam, Neeraj upadhyay, Steven Rostedt, rcu,
	Vineeth Pillai

On Thu, Sep 01, 2022 at 07:41:58AM -0700, Paul E. McKenney wrote:
> On Thu, Sep 01, 2022 at 01:59:10PM +0200, Uladzislau Rezki wrote:
> > On Thu, Sep 01, 2022 at 01:29:47PM +0200, Frederic Weisbecker wrote:
> > > On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
> > > > Hello, Frederic.
> > > > 
> > > > > 
> > > > > Although who knows, maybe some periodic file operations while idle are specific
> > > > > to Android. I'll try to trace lazy callbacks while idle and the number of grace
> > > > > periods associated.
> > > > > 
> > > > > 
> > > > Everything related to lazy callbacks is about not waking "nocb"
> > > > kthreads in order to offload one, or I should say a few, callbacks,
> > > > because that is more or less useless. Currently, if the incoming callback
> > > > is the only one, it will kick a GP, and the GP will in turn kick the
> > > > nocb kthread to offload.
> > > 
> > > Not sure this is only about not waking "nocb" kthreads. The grace period
> > > kthread is also awakened in !NOCB and has quite some work to do. And there,
> > > having a server expands the issue because you may have a lot of CPUs' extended
> > > quiescent states to check.
> > > 
> > I mean here the following combination: the NOCB + call_rcu_lazy() tandem.
> > The !NOCB case is not about power saving, IMHO, because it implies callbacks
> > being processed on the CPUs where they land.
> > 
> > In this scenario you cannot let the EAS scheduler find a more
> > efficient CPU for further handling.
> 
> Just to follow up, Uladzislau and others did some detailed performance
> analysis of NOCB on Android.  Of course, this analysis might or might
> not carry over to servers, but it was pretty detailed.

Sure, I certainly don't deny the benefit on Android and similar workloads.
What I'm worried about is that we are making this feature too specialized
when it may deserve to be made more generic.

I'm not convincing anyone though, and I don't have the means to provide
numbers; I would need to produce an actual !NOCB implementation for that.

So I'm not entirely comfortable, but I'm going to review the current patchset
anyway, and once it lands in -rcu I'll try to hack up a quick !NOCB
implementation for measurement purposes.

Thanks.

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 15:13                         ` Frederic Weisbecker
@ 2022-09-01 16:07                           ` Joel Fernandes
  0 siblings, 0 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-09-01 16:07 UTC (permalink / raw)
  To: Frederic Weisbecker, Uladzislau Rezki
  Cc: Paul E. McKenney, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai



On 9/1/2022 11:13 AM, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 01:59:10PM +0200, Uladzislau Rezki wrote:
>> On Thu, Sep 01, 2022 at 01:29:47PM +0200, Frederic Weisbecker wrote:
>>> On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
>>>> Hello, Frederic.
>>>>
>>>>>
>>>>> Although who knows, maybe some periodic file operations while idle are specific
>>>>> to Android. I'll try to trace lazy callbacks while idle and the number of grace
>>>>> periods associated.
>>>>>
>>>>>
>>>> Everything related to lazy callbacks is about not waking "nocb"
>>>> kthreads in order to offload one, or I should say a few, callbacks,
>>>> because that is more or less useless. Currently, if the incoming callback
>>>> is the only one, it will kick a GP, and the GP will in turn kick the
>>>> nocb kthread to offload.
>>>
>>> Not sure this is only about not waking "nocb" kthreads. The grace period
>>> kthread is also awakened in !NOCB and has quite some work to do. And there,
>>> having a server expands the issue because you may have a lot of CPUs' extended
>>> quiescent states to check.
>>>
>> I mean here the following combination: the NOCB + call_rcu_lazy() tandem.
>> The !NOCB case is not about power saving, IMHO, because it implies callbacks
>> being processed on the CPUs where they land.
> 
> I'm sorry but I still feel confused reading that !NOCB is not about power
> saving. To me everything is about power saving. NOCB just appears to help
> optimize it without significant tradeoff on some given workloads.

Right, I think Fred is coming from the reasonable perspective of "save power by
making !NOCB RCU do less" (i.e. delay softirq, the main GP thread, QS reports,
etc.), which is also valid (not sure how much power that saves on servers, but
it sounds reasonable to measure). When you hack your implementation to test
that, it would be reasonable to also apply the old FAST NOHZ patch and measure
with/without that.

> 
>> In this scenario you cannot let the EAS scheduler find a more
>> efficient CPU for further handling.
> 
> Sure but that doesn't mean there wouldn't be a power saving gain anyway.

Yes there might be.

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 14:58                         ` Frederic Weisbecker
@ 2022-09-01 16:07                           ` Joel Fernandes
  2022-09-01 16:49                             ` Paul E. McKenney
  0 siblings, 1 reply; 54+ messages in thread
From: Joel Fernandes @ 2022-09-01 16:07 UTC (permalink / raw)
  To: Frederic Weisbecker, Paul E. McKenney
  Cc: Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai



On 9/1/2022 10:58 AM, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 07:39:07AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 31, 2022 at 05:26:58PM +0200, Frederic Weisbecker wrote:
>>> On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
>>>>> Although who knows, maybe some periodic file operations while idle are specific
>>>>> to Android. I'll try to trace lazy callbacks while idle and the number of grace
>>>>> periods associated.
>>>>
>>>> Sounds like a good start.
>>>>
>>>> And yes, we don't need to show that the whole !NOCB world needs this,
>>>> just some significant portion of it.  But we do need some decent evidence.
>>>> After all, it is all too easy to do a whole lot of work and find that
>>>> the expected benefits fail to materialize.
>>>
>>> So here is a quick test. I made a patch that replaces Joel's 1st patch
>>> with an implementation of call_rcu_lazy() that queues lazy callbacks
>>> through the regular call_rcu() path but counts them in a lazy_count.
>>>
>>> Upon idle entry it reports whether or not the tick is retained solely
>>> by lazy callbacks.
>>>
>>> I get periodic and frequent results on my idle test box; perhaps
>>> something is opening/closing some file periodically.
>>>
>>> Anyway the thing can be tested with this branch:
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
>>> 	rcu/lazy-trace
>>>
>>> Excerpt:
>>>
>>>           <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>>>           <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>>>           <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>>>           <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>>
>> Just to make sure that I understand, at this point, there is only the
>> one lazy callback (and no non-lazy callbacks) on this CPU, and that
>> CPU is therefore keeping the tick on only for the benefit of that one
>> lazy callback.  And for the above four traces, this is likely the same
>> lazy callback.
>>
>> Did I get it right, or is there something else going on?
> 
> Exactly that!

Interesting!

 - Joel
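
The reporting logic Frederic describes can be modeled in a few lines of plain
C. This is only a sketch of the idea (the toy_needs_cpu() name and the
counters are invented; in the kernel the check would sit on the idle-entry
path, around rcu_needs_cpu(), as the traces show):

#include <stdio.h>
#include <stdbool.h>

struct toy_cpu {
	int total_cbs;   /* all callbacks queued on this CPU     */
	int lazy_count;  /* how many of those were queued lazily */
};

/*
 * Idle-entry check: does this CPU still need the scheduler tick?
 * If every queued callback is lazy, the tick is being retained for
 * work that could have waited, which is what the "BAD" trace flags.
 */
static bool toy_needs_cpu(const struct toy_cpu *cpu)
{
	if (cpu->total_cbs == 0)
		return false;  /* nothing pending: the tick may stop */
	if (cpu->total_cbs == cpu->lazy_count)
		printf("BAD: %d lazy callbacks retaining dynticks-idle\n",
		       cpu->lazy_count);
	return true;           /* tick retained */
}

int main(void)
{
	struct toy_cpu cpu = { .total_cbs = 1, .lazy_count = 1 };

	toy_needs_cpu(&cpu);   /* prints the "BAD" line, as in the trace */
	return 0;
}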


^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 15:30                           ` Frederic Weisbecker
@ 2022-09-01 16:11                             ` Joel Fernandes
  2022-09-01 16:52                             ` Paul E. McKenney
  1 sibling, 0 replies; 54+ messages in thread
From: Joel Fernandes @ 2022-09-01 16:11 UTC (permalink / raw)
  To: Frederic Weisbecker, Paul E. McKenney
  Cc: Uladzislau Rezki, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai



On 9/1/2022 11:30 AM, Frederic Weisbecker wrote:
> So I'm not entirely comfortable, but I'm going to review the current patchset
> anyway, and once it lands in -rcu I'll try to hack up a quick !NOCB
> implementation for measurement purposes.

Sounds good. Arguably the core implementation using the bypass list is not too
complex, and maybe you can generalize the bypass list for !NOCB as well and
reuse most of the code. You will need a per-CPU list anyway (bypass or some
other) to queue the CBs; a sketch of such a list follows below. In v1, we had
a separate per-CPU list.

Thanks for offering to review the patches; you guys really motivate me to work
on this (I am currently on leave but still working on it). I'll be looking
forward to getting it out soon. Yesterday saw good progress.

 - Joel
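
A minimal sketch of such a per-CPU list, in plain userspace C. All type and
function names here are invented for illustration; the kernel's actual bypass
list is built on its own cblist machinery and is considerably more involved:

#include <stdio.h>
#include <stddef.h>

struct toy_rcu_head {
	struct toy_rcu_head *next;
	void (*func)(struct toy_rcu_head *);
};

/* One of these per CPU, usable whether or not callbacks are offloaded. */
struct toy_lazy_pen {
	struct toy_rcu_head *head;
	struct toy_rcu_head **tail;  /* points at the last ->next pointer */
	int count;
};

static void pen_init(struct toy_lazy_pen *pen)
{
	pen->head = NULL;
	pen->tail = &pen->head;
	pen->count = 0;
}

/* O(1) tail enqueue: park a callback without kicking a grace period. */
static void pen_enqueue(struct toy_lazy_pen *pen, struct toy_rcu_head *rh)
{
	rh->next = NULL;
	*pen->tail = rh;
	pen->tail = &rh->next;
	pen->count++;
}

/* Flush: hand the whole batch to the grace-period machinery at once. */
static struct toy_rcu_head *pen_flush(struct toy_lazy_pen *pen)
{
	struct toy_rcu_head *batch = pen->head;

	pen_init(pen);
	return batch;
}

int main(void)
{
	struct toy_lazy_pen pen;
	struct toy_rcu_head a, b;

	pen_init(&pen);
	pen_enqueue(&pen, &a);
	pen_enqueue(&pen, &b);
	printf("parked %d callbacks, flushing\n", pen.count);
	pen_flush(&pen);
	return 0;
}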

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 16:07                           ` Joel Fernandes
@ 2022-09-01 16:49                             ` Paul E. McKenney
  2022-09-01 18:28                               ` Frederic Weisbecker
  0 siblings, 1 reply; 54+ messages in thread
From: Paul E. McKenney @ 2022-09-01 16:49 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Thu, Sep 01, 2022 at 12:07:56PM -0400, Joel Fernandes wrote:
> 
> 
> On 9/1/2022 10:58 AM, Frederic Weisbecker wrote:
> > On Thu, Sep 01, 2022 at 07:39:07AM -0700, Paul E. McKenney wrote:
> >> On Wed, Aug 31, 2022 at 05:26:58PM +0200, Frederic Weisbecker wrote:
> >>> On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
> >>>>> Although who knows, maybe some periodic file operations while idle
> >>>>> are specific to Android. I'll try to trace lazy callbacks while idle
> >>>>> and the number of grace periods associated with them.
> >>>>
> >>>> Sounds like a good start.
> >>>>
> >>>> And yes, we don't need to show that the whole !NOCB world needs this,
> >>>> just some significant portion of it.  But we do need some decent evidence.
> >>>> After all, it is all too easy to do a whole lot of work and find that
> >>>> the expected benefits fail to materialize.
> >>>
> >>> So here is a quick test. I made a patch that replaces Joel's 1st patch
> >>> with an implementation of call_rcu_lazy() that queues lazy callbacks
> >>> through the regular call_rcu() path but counts them in a lazy_count.
> >>>
> >>> Upon idle entry it reports whether or not the tick is retained solely
> >>> by lazy callbacks.
> >>>
> >>> I get periodic and frequent results on my idle test box; perhaps
> >>> something is opening/closing some file periodically.
> >>>
> >>> Anyway the thing can be tested with this branch:
> >>>
> >>> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> >>> 	rcu/lazy-trace
> >>>
> >>> Excerpt:
> >>>
> >>>           <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >>>           <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >>>           <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >>>           <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> >>
> >> Just to make sure that I understand, at this point, there is only the
> >> one lazy callback (and no non-lazy callbacks) on this CPU, and that
> >> CPU is therefore keeping the tick on only for the benefit of that one
> >> lazy callback.  And for the above four traces, this is likely the same
> >> lazy callback.
> >>
> >> Did I get it right, or is there something else going on?
> > 
> > Exactly that!

Are these callbacks confined to the RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL
segments, which are the ones that could (in theory) buffer callbacks
without having started a grace period?  Or is it all the callbacks
regardless of segment?

							Thanx, Paul
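
For context: RCU keeps each CPU's callbacks in a four-segment list, and per
Paul's remark only the last two segments can hold callbacks for which no
grace period has yet been started. A toy version of the filter he is asking
about might look like the following (a sketch under that assumption, with
invented names; the real state lives in the rcu_segcblist code):

#include <stdio.h>
#include <stdbool.h>

/* Mirrors the kernel's RCU_DONE_TAIL, RCU_WAIT_TAIL,
 * RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL segment ordering. */
enum toy_seg { TOY_DONE, TOY_WAIT, TOY_NEXT_READY, TOY_NEXT, TOY_NSEGS };

/*
 * True when every queued callback sits in a segment that has not yet
 * required a grace period, i.e. when laziness could buffer them all
 * without holding anything up.
 */
static bool lazy_cbs_only_in_tail(const int cbs[TOY_NSEGS])
{
	return cbs[TOY_DONE] == 0 && cbs[TOY_WAIT] == 0 &&
	       (cbs[TOY_NEXT_READY] + cbs[TOY_NEXT]) > 0;
}

int main(void)
{
	int gp_pending[TOY_NSEGS] = { 0, 1, 0, 1 };  /* GP already in flight */
	int tail_only[TOY_NSEGS]  = { 0, 0, 0, 1 };  /* nothing in flight    */

	printf("gp_pending: %d, tail_only: %d\n",
	       lazy_cbs_only_in_tail(gp_pending),
	       lazy_cbs_only_in_tail(tail_only));
	return 0;
}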

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 15:30                           ` Frederic Weisbecker
  2022-09-01 16:11                             ` Joel Fernandes
@ 2022-09-01 16:52                             ` Paul E. McKenney
  1 sibling, 0 replies; 54+ messages in thread
From: Paul E. McKenney @ 2022-09-01 16:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Uladzislau Rezki, Joel Fernandes, Dietmar Eggemann, LKML,
	Rushikesh S Kadam, Neeraj upadhyay, Steven Rostedt, rcu,
	Vineeth Pillai

On Thu, Sep 01, 2022 at 05:30:34PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 07:41:58AM -0700, Paul E. McKenney wrote:
> > On Thu, Sep 01, 2022 at 01:59:10PM +0200, Uladzislau Rezki wrote:
> > > On Thu, Sep 01, 2022 at 01:29:47PM +0200, Frederic Weisbecker wrote:
> > > > On Tue, Aug 30, 2022 at 06:44:51PM +0200, Uladzislau Rezki wrote:
> > > > > Hello, Frederic.
> > > > > 
> > > > > > 
> > > > > > Although who knows, maybe some periodic file operations while idle
> > > > > > are specific to Android. I'll try to trace lazy callbacks while idle
> > > > > > and the number of grace periods associated with them.
> > > > > > 
> > > > > > 
> > > > > Everything related to lazy callbacks is about not waking "nocb"
> > > > > kthreads in order to offload one, or I should say a few, callbacks,
> > > > > because doing so is more or less useless. Currently, if an incoming
> > > > > callback is the only one, it will kick a GP, and that GP will in turn
> > > > > kick the nocb kthread to offload it.
> > > > 
> > > > Not sure this is only about not waking "nocb" kthreads. The grace period
> > > > kthread is also awakened in !NOCB and has quite some work to do. And there,
> > > > having a server expands the issue because you may have a lot of CPUs'
> > > > extended quiescent states to check.
> > > > 
> > > I mean here the following combination: the NOCB + call_rcu_lazy() tandem.
> > > !NOCB is not about power saving, IMHO, because it implies callbacks
> > > are processed on the CPUs where they land.
> > > 
> > > In this scenario you cannot let the EAS scheduler find a more
> > > efficient CPU for further handling.
> > 
> > Just to follow up, Uladzislau and others did some detailed performance
> > analysis of NOCB on Android.  Of course, this analysis might or might
> > not carry over to servers, but it was pretty detailed.
> 
> Sure, I certainly don't deny the benefit on Android and similar workloads.
> What I'm worried about is that we are making this feature too specialized
> when it may deserve to be made more generic.
> 
> I'm not convincing anyone though, and I don't have the means to provide
> numbers; I would need to produce an actual !NOCB implementation for that.

I have not yet given up on thinking about what measurements I could take
that would be convincing within Meta.  Maybe some idea will present itself
on the plane.  If nothing else, exploratory measurements with rcutop.

> So I'm not entirely comfortable, but I'm going to review the current patchset
> anyway, and once it lands in -rcu I'll try to hack up a quick !NOCB
> implementation for measurement purposes.

That sounds like a very good approach!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 16:49                             ` Paul E. McKenney
@ 2022-09-01 18:28                               ` Frederic Weisbecker
  2022-09-01 20:36                                 ` Paul E. McKenney
  0 siblings, 1 reply; 54+ messages in thread
From: Frederic Weisbecker @ 2022-09-01 18:28 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Thu, Sep 01, 2022 at 09:49:28AM -0700, Paul E. McKenney wrote:
> > On 9/1/2022 10:58 AM, Frederic Weisbecker wrote:
> > > On Thu, Sep 01, 2022 at 07:39:07AM -0700, Paul E. McKenney wrote:
> > >> On Wed, Aug 31, 2022 at 05:26:58PM +0200, Frederic Weisbecker wrote:
> > >>> On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
> > >>>>> Although who knows, maybe some periodic file operations while idle
> > >>>>> are specific to Android. I'll try to trace lazy callbacks while idle
> > >>>>> and the number of grace periods associated with them.
> > >>>>
> > >>>> Sounds like a good start.
> > >>>>
> > >>>> And yes, we don't need to show that the whole !NOCB world needs this,
> > >>>> just some significant portion of it.  But we do need some decent evidence.
> > >>>> After all, it is all too easy to do a whole lot of work and find that
> > >>>> the expected benefits fail to materialize.
> > >>>
> > >>> So here is a quick test. I made a patch that replaces Joel's 1st patch
> > >>> with an implementation of call_rcu_lazy() that queues lazy callbacks
> > >>> through the regular call_rcu() path but counts them in a lazy_count.
> > >>>
> > >>> Upon idle entry it reports whether or not the tick is retained solely
> > >>> by lazy callbacks.
> > >>>
> > >>> I get periodic and frequent results on my idle test box; perhaps
> > >>> something is opening/closing some file periodically.
> > >>>
> > >>> Anyway the thing can be tested with this branch:
> > >>>
> > >>> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > >>> 	rcu/lazy-trace
> > >>>
> > >>> Excerpt:
> > >>>
> > >>>           <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > >>>           <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > >>>           <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > >>>           <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > >>
> > >> Just to make sure that I understand, at this point, there is only the
> > >> one lazy callback (and no non-lazy callbacks) on this CPU, and that
> > >> CPU is therefore keeping the tick on only for the benefit of that one
> > >> lazy callback.  And for the above four traces, this is likely the same
> > >> lazy callback.
> > >>
> > >> Did I get it right, or is there something else going on?
> > > 
> > > Exactly that!
> 
> Are these callbacks confined to the RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL
> segments, which are the ones that could (in theory) buffer callbacks
> without having started a grace period?  Or is it all the callbacks
> regardless of segment?

Ah good point!

So I just excluded the cases where those segments have callbacks, and I now
only get about two tick retentions every two seconds:

          <idle>-0       [007] d..1.  1111.893649: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1111.967575: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1113.895470: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1115.669446: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1115.898144: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1117.202833: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1117.900521: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1119.903327: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1120.766864: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1121.909182: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1122.441927: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1123.908911: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1125.868505: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1125.910898: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1127.682837: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1127.913719: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1129.916740: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1130.967052: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1131.919256: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1132.957163: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [000] d..1.  1133.630082: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1133.923053: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1135.927054: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1136.067679: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1137.652294: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1137.932546: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1138.200768: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1139.932573: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1141.167489: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1141.935232: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1143.440538: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
          <idle>-0       [007] d..1.  1143.938560: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle

^ permalink raw reply	[flat|nested] 54+ messages in thread

* Re: [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 18:28                               ` Frederic Weisbecker
@ 2022-09-01 20:36                                 ` Paul E. McKenney
  0 siblings, 0 replies; 54+ messages in thread
From: Paul E. McKenney @ 2022-09-01 20:36 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Rushikesh S Kadam,
	Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt, rcu, Vineeth Pillai

On Thu, Sep 01, 2022 at 08:28:04PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 09:49:28AM -0700, Paul E. McKenney wrote:
> > > On 9/1/2022 10:58 AM, Frederic Weisbecker wrote:
> > > > On Thu, Sep 01, 2022 at 07:39:07AM -0700, Paul E. McKenney wrote:
> > > >> On Wed, Aug 31, 2022 at 05:26:58PM +0200, Frederic Weisbecker wrote:
> > > >>> On Tue, Aug 30, 2022 at 09:46:34AM -0700, Paul E. McKenney wrote:
> > > >>>>> Although who knows, maybe some periodic file operations while idle
> > > >>>>> are specific to Android. I'll try to trace lazy callbacks while idle
> > > >>>>> and the number of grace periods associated with them.
> > > >>>>
> > > >>>> Sounds like a good start.
> > > >>>>
> > > >>>> And yes, we don't need to show that the whole !NOCB world needs this,
> > > >>>> just some significant portion of it.  But we do need some decent evidence.
> > > >>>> After all, it is all too easy to do a whole lot of work and find that
> > > >>>> the expected benefits fail to materialize.
> > > >>>
> > > >>> So here is a quick test. I made a patch that replaces Joel's 1st patch
> > > >>> with an implementation of call_rcu_lazy() that queues lazy callbacks
> > > >>> through the regular call_rcu() path but counts them in a lazy_count.
> > > >>>
> > > >>> Upon idle entry it reports whether or not the tick is retained solely
> > > >>> by lazy callbacks.
> > > >>>
> > > >>> I get periodic and frequent results on my idle test box; perhaps
> > > >>> something is opening/closing some file periodically.
> > > >>>
> > > >>> Anyway the thing can be tested with this branch:
> > > >>>
> > > >>> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > > >>> 	rcu/lazy-trace
> > > >>>
> > > >>> Excerpt:
> > > >>>
> > > >>>           <idle>-0       [007] d..1.   414.226966: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > > >>>           <idle>-0       [007] d..1.   414.228271: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > > >>>           <idle>-0       [007] d..1.   414.232269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > > >>>           <idle>-0       [007] d..1.   414.236269: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
> > > >>
> > > >> Just to make sure that I understand, at this point, there is only the
> > > >> one lazy callback (and no non-lazy callbacks) on this CPU, and that
> > > >> CPU is therefore keeping the tick on only for the benefit of that one
> > > >> lazy callback.  And for the above four traces, this is likely the same
> > > >> lazy callback.
> > > >>
> > > >> Did I get it right, or is there something else going on?
> > > > 
> > > > Exactly that!
> > 
> > Are these callbacks confined to the RCU_NEXT_READY_TAIL and RCU_NEXT_TAIL
> > segments, which are the ones that could (in theory) buffer callbacks
> > without having started a grace period?  Or is it all the callbacks
> > regardless of segment?
> 
> Ah good point!
> 
> So I just excluded the cases where those segments have callbacks, and I now
> only get about two tick retentions every two seconds:
> 
>           <idle>-0       [007] d..1.  1111.893649: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1111.967575: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle

But reducing ticks is not the only way energy is saved.  The other way
is a reduction in the number of grace periods.  One way to estimate this
is to take the per-second grace period rate and subtract one grace period
per two seconds.  If the system is idle, this effect might be significant.

							Thanx, Paul

>           <idle>-0       [007] d..1.  1113.895470: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1115.669446: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1115.898144: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1117.202833: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1117.900521: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1119.903327: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1120.766864: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1121.909182: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1122.441927: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1123.908911: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1125.868505: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1125.910898: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1127.682837: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1127.913719: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1129.916740: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1130.967052: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1131.919256: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1132.957163: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [000] d..1.  1133.630082: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1133.923053: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1135.927054: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1136.067679: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1137.652294: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1137.932546: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1138.200768: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1139.932573: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1141.167489: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1141.935232: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1143.440538: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
>           <idle>-0       [007] d..1.  1143.938560: rcu_needs_cpu: BAD: 1 lazy callbacks retaining dynticks-idle
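
To make Paul's estimate concrete, here is the arithmetic as a trivial
program. Both input numbers are invented placeholders, not measurements; only
the formula (observed GP rate minus one GP per flush interval) is the point:

#include <stdio.h>

int main(void)
{
	/* Both inputs are made-up illustrative values. */
	double gp_per_sec = 5.0;         /* grace-period rate observed while idle */
	double flush_interval_sec = 2.0; /* at most one lazy flush per this long  */

	/* Lazy batching still costs one GP per flush interval, so the
	 * grace periods avoidable per second are roughly: */
	double avoidable = gp_per_sec - 1.0 / flush_interval_sec;

	printf("~%.1f grace periods per second avoidable\n", avoidable);
	return 0;
}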

^ permalink raw reply	[flat|nested] 54+ messages in thread

end of thread, other threads:[~2022-09-01 20:37 UTC | newest]

Thread overview: 54+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-19 20:48 [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 01/14] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 02/14] rcu: shrinker for lazy rcu Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 03/14] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 04/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 05/14] rcutorture: Add test code for call_rcu_lazy() Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 06/14] debug: Toggle lazy at runtime and change flush jiffies Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 07/14] cred: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 08/14] security: " Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 09/14] net/core: " Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 10/14] kernel: Move various core kernel usages " Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 11/14] lib: Move call_rcu() " Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 12/14] i915: " Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 13/14] fork: Move thread_stack_free_rcu to call_rcu_lazy Joel Fernandes (Google)
2022-08-19 20:48 ` [PATCH v4 14/14] rcu/tree: Move trace_rcu_callback() before bypassing Joel Fernandes (Google)
2022-08-29 13:40 ` [PATCH v4 00/14] Implement call_rcu_lazy() and miscellaneous fixes Frederic Weisbecker
2022-08-29 16:45   ` Joel Fernandes
2022-08-29 19:46     ` Frederic Weisbecker
2022-08-29 20:31       ` Paul E. McKenney
2022-08-29 20:54         ` Joel Fernandes
2022-08-30 10:50         ` Frederic Weisbecker
2022-08-30 11:47           ` Paul E. McKenney
2022-08-29 20:36       ` Joel Fernandes
2022-08-29 20:42         ` Paul E. McKenney
2022-08-29 20:48           ` Joel Fernandes
2022-08-30 10:57             ` Frederic Weisbecker
2022-08-30 10:53           ` Frederic Weisbecker
2022-08-30 11:43             ` Paul E. McKenney
2022-08-30 16:03               ` Frederic Weisbecker
2022-08-30 16:22                 ` Frederic Weisbecker
2022-08-30 16:44                   ` Uladzislau Rezki
2022-08-30 18:53                     ` Joel Fernandes
2022-09-01 11:29                     ` Frederic Weisbecker
2022-09-01 11:59                       ` Uladzislau Rezki
2022-09-01 14:41                         ` Paul E. McKenney
2022-09-01 15:30                           ` Frederic Weisbecker
2022-09-01 16:11                             ` Joel Fernandes
2022-09-01 16:52                             ` Paul E. McKenney
2022-09-01 15:13                         ` Frederic Weisbecker
2022-09-01 16:07                           ` Joel Fernandes
2022-08-30 16:46                   ` Paul E. McKenney
2022-08-31 15:26                     ` Frederic Weisbecker
2022-09-01 14:39                       ` Paul E. McKenney
2022-09-01 14:58                         ` Frederic Weisbecker
2022-09-01 16:07                           ` Joel Fernandes
2022-09-01 16:49                             ` Paul E. McKenney
2022-09-01 18:28                               ` Frederic Weisbecker
2022-09-01 20:36                                 ` Paul E. McKenney
2022-08-30 18:46                   ` Joel Fernandes
2022-08-30 10:26         ` Frederic Weisbecker
2022-08-30 18:40           ` Joel Fernandes
2022-08-29 19:57     ` Paul E. McKenney
2022-08-30 10:43       ` Frederic Weisbecker
2022-08-30 12:08         ` Paul E. McKenney
