* [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
@ 2022-05-12  3:04 Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
                   ` (15 more replies)
  0 siblings, 16 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Hello!
Please find attached a proof-of-concept version of call_rcu_lazy(). It gives
a lot of power savings when the CPUs are relatively idle. Huge thanks to
Rushikesh Kadam from Intel for investigating it with me.

Some numbers below:

The following are power savings we see on top of RCU_NOCB_CPU on an Intel
platform. The observation is that, due to a 'trickle down' effect of RCU
callbacks, the system is very lightly loaded yet constantly runs a few RCU
callbacks very often. This leads the power-management hardware to believe the
system is active, when it is in fact idle.

For example, when the ChromeOS screen is off and the user is not doing
anything on the system, we see big power savings.
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58
CorWatt = 0.04

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
CorWatt = 0.03

Further, when the ChromeOS screen is ON but the system is idle or lightly
loaded, we can see that the display pipeline is constantly queuing RCU
callbacks due to the open/close of file descriptors associated with graphics
buffers. This is attributed to the file_free_rcu() path, which this patch
series also touches.

This patch series adds a simple, effective, and lockless implementation of
RCU callback batching. On memory pressure, a timeout, or a per-CPU queue
growing too big, we initiate a flush of one or more per-CPU lists.
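
To give a rough idea before the diffs, here is a heavily condensed sketch of
the core of patch 01; the real code in kernel/rcu/lazy.c additionally handles
debug-objects double-free checks and registers a shrinker for the
memory-pressure case. lazy_rcu_flush_cpu() simply does an llist_del_all() and
hands each batched callback to the regular call_rcu():

  // Condensed sketch of patch 01: per-CPU, lockless batching of RCU callbacks.
  struct rcu_lazy_pcp {
  	struct llist_head head;      // lockless list of deferred callbacks
  	struct delayed_work work;    // flushes the list after MAX_LAZY_JIFFIES
  	atomic_t count;              // number of callbacks currently batched
  };
  DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);

  void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
  {
  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
  	struct rcu_lazy_pcp *rlp;

  	preempt_disable();
  	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
  	preempt_enable();

  	// Queue on the per-CPU lockless list.
  	head->func = func;
  	llist_add(&head->llist_node, &rlp->head);

  	// Flush right away if the batch grew too big, else arm the timer.
  	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
  		lazy_rcu_flush_cpu(rlp);
  	else if (!delayed_work_pending(&rlp->work))
  		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
  }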

Similar results can be achieved by increasing jiffies_till_first_fqs; however,
that also has the effect of slowing down RCU. In particular, I saw a huge
slowdown of the function graph tracer when increasing it.

One drawback of this series is that if another frequent RCU callback that is
not lazy creeps up in the future, it will again hurt power. However, I believe
identifying and fixing those callbacks is a more reasonable approach than
slowing down RCU for the whole system.

NOTE: A debug patch is added at the end of the series; toggle
/proc/sys/kernel/rcu_lazy at runtime to turn the feature on or off globally.
It defaults to on. Further, please use the sysctls added in lazy.c for tuning
the parameters that affect flushing.

Disclaimer 1: Don't boot your personal system on it yet anticipating power
savings, as TREE07 still causes RCU stalls and I am looking more into that, but
I believe this series should be good for general testing.

Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
of review and agreements.

Joel Fernandes (Google) (14):
  rcu: Add a lock-less lazy RCU implementation
  workqueue: Add a lazy version of queue_rcu_work()
  block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  cred: Move call_rcu() to call_rcu_lazy()
  fs: Move call_rcu() to call_rcu_lazy() in some paths
  kernel: Move various core kernel usages to call_rcu_lazy()
  security: Move call_rcu() to call_rcu_lazy()
  net/core: Move call_rcu() to call_rcu_lazy()
  lib: Move call_rcu() to call_rcu_lazy()
  kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  i915: Move call_rcu() to call_rcu_lazy()
  rcu/kfree: remove useless monitor_todo flag
  rcu/kfree: Fix kfree_rcu_shrink_count() return value
  DEBUG: Toggle rcu_lazy and tune at runtime

 block/blk-ioc.c                            |   2 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
 fs/dcache.c                                |   4 +-
 fs/eventpoll.c                             |   2 +-
 fs/file_table.c                            |   3 +-
 fs/inode.c                                 |   2 +-
 include/linux/rcupdate.h                   |   6 +
 include/linux/sched/sysctl.h               |   4 +
 include/linux/workqueue.h                  |   1 +
 kernel/cred.c                              |   2 +-
 kernel/exit.c                              |   2 +-
 kernel/pid.c                               |   2 +-
 kernel/rcu/Kconfig                         |   8 ++
 kernel/rcu/Makefile                        |   1 +
 kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
 kernel/rcu/rcu.h                           |   5 +
 kernel/rcu/tree.c                          |  28 ++--
 kernel/sysctl.c                            |  23 ++++
 kernel/time/posix-timers.c                 |   2 +-
 kernel/workqueue.c                         |  25 ++++
 lib/radix-tree.c                           |   2 +-
 lib/xarray.c                               |   2 +-
 net/core/dst.c                             |   2 +-
 security/security.c                        |   2 +-
 security/selinux/avc.c                     |   4 +-
 25 files changed, 255 insertions(+), 34 deletions(-)
 create mode 100644 kernel/rcu/lazy.c

-- 
2.36.0.550.gb090851708-goog



* [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12 23:56   ` Paul E. McKenney
  2022-05-17  9:07   ` Uladzislau Rezki
  2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Implement timer-based RCU callback batching. The batch is flushed whenever a
certain amount of time has passed, or the batch on a particular CPU grows too
big. Memory pressure can also flush it.

Locking is avoided to reduce lock contention when queuing and dequeuing happen
on different CPUs of a per-CPU list, such as when the shrinker context runs on
a different CPU. Not having to use locks also keeps the per-CPU structure size
small.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h |   6 ++
 kernel/rcu/Kconfig       |   8 +++
 kernel/rcu/Makefile      |   1 +
 kernel/rcu/lazy.c        | 145 +++++++++++++++++++++++++++++++++++++++
 kernel/rcu/rcu.h         |   5 ++
 kernel/rcu/tree.c        |   2 +
 6 files changed, 167 insertions(+)
 create mode 100644 kernel/rcu/lazy.c

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 88b42eb46406..d0a6c4f5172c 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
+#else
+#define call_rcu_lazy(head, func) call_rcu(head, func)
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active __read_mostly;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index bf8e341e75b4..c09715079829 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default y
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
index 0cfb009a99b9..8968b330d6e0 100644
--- a/kernel/rcu/Makefile
+++ b/kernel/rcu/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
 obj-$(CONFIG_TREE_RCU) += tree.o
 obj-$(CONFIG_TINY_RCU) += tiny.o
 obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
+obj-$(CONFIG_RCU_LAZY) += lazy.o
diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
new file mode 100644
index 000000000000..55e406cfc528
--- /dev/null
+++ b/kernel/rcu/lazy.c
@@ -0,0 +1,145 @@
+/*
+ * Lockless lazy-RCU implementation.
+ */
+#include <linux/rcupdate.h>
+#include <linux/shrinker.h>
+#include <linux/workqueue.h>
+#include "rcu.h"
+
+// How much to batch before flushing?
+#define MAX_LAZY_BATCH		2048
+
+// How much to wait before flushing?
+#define MAX_LAZY_JIFFIES	10000
+
+// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
+// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
+// later to ensure that rcu_head and lazy_rcu_head are of the same size.
+struct lazy_rcu_head {
+	struct llist_node llist_node;
+	void (*func)(struct callback_head *head);
+} __attribute__((aligned(sizeof(void *))));
+
+struct rcu_lazy_pcp {
+	struct llist_head head;
+	struct delayed_work work;
+	atomic_t count;
+};
+DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
+
+// Lockless flush of CPU, can be called concurrently.
+static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
+{
+	struct llist_node *node = llist_del_all(&rlp->head);
+	struct lazy_rcu_head *cursor, *temp;
+
+	if (!node)
+		return;
+
+	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
+		struct rcu_head *rh = (struct rcu_head *)cursor;
+		debug_rcu_head_unqueue(rh);
+		call_rcu(rh, rh->func);
+		atomic_dec(&rlp->count);
+	}
+}
+
+void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
+{
+	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
+	struct rcu_lazy_pcp *rlp;
+
+	preempt_disable();
+	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
+	preempt_enable();
+
+	if (debug_rcu_head_queue((void *)head)) {
+		// Probable double call_rcu(), just leak.
+		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
+				__func__, head);
+
+		// Mark as success and leave.
+		return;
+	}
+
+	// Queue to per-cpu llist
+	head->func = func;
+	llist_add(&head->llist_node, &rlp->head);
+
+	// Flush queue if too big
+	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
+		lazy_rcu_flush_cpu(rlp);
+	} else {
+		if (!delayed_work_pending(&rlp->work)) {
+			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
+		}
+	}
+}
+
+static unsigned long
+lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	unsigned long count = 0;
+	int cpu;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
+
+		count += atomic_read(&rlp->count);
+	}
+
+	return count;
+}
+
+static unsigned long
+lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu, freed = 0;
+
+	for_each_possible_cpu(cpu) {
+		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
+		unsigned long count;
+
+		count = atomic_read(&rlp->count);
+		lazy_rcu_flush_cpu(rlp);
+		sc->nr_to_scan -= count;
+		freed += count;
+		if (sc->nr_to_scan <= 0)
+			break;
+	}
+
+	return freed == 0 ? SHRINK_STOP : freed;
+}
+
+/*
+ * This function is invoked after MAX_LAZY_JIFFIES timeout.
+ */
+static void lazy_work(struct work_struct *work)
+{
+	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
+
+	lazy_rcu_flush_cpu(rlp);
+}
+
+static struct shrinker lazy_rcu_shrinker = {
+	.count_objects = lazy_rcu_shrink_count,
+	.scan_objects = lazy_rcu_shrink_scan,
+	.batch = 0,
+	.seeks = DEFAULT_SEEKS,
+};
+
+void __init rcu_lazy_init(void)
+{
+	int cpu;
+
+	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
+
+	for_each_possible_cpu(cpu) {
+		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
+		INIT_DELAYED_WORK(&rlp->work, lazy_work);
+	}
+
+	if (register_shrinker(&lazy_rcu_shrinker))
+		pr_err("Failed to register lazy_rcu shrinker!\n");
+}
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 24b5f2c2de87..a5f4b44f395f 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
 static inline void show_rcu_tasks_trace_gp_kthread(void) {}
 #endif
 
+#ifdef CONFIG_RCU_LAZY
+void rcu_lazy_init(void);
+#else
+static inline void rcu_lazy_init(void) {}
+#endif
 #endif /* __LINUX_RCU_H */
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a4c25a6283b0..ebdf6f7c9023 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -4775,6 +4775,8 @@ void __init rcu_init(void)
 		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
 	else
 		qovld_calc = qovld;
+
+	rcu_lazy_init();
 }
 
 #include "tree_stall.h"
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12 23:58   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This will be used in kfree_rcu() later to make it do call_rcu() lazily.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/workqueue.h |  1 +
 kernel/workqueue.c        | 25 +++++++++++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 7fee9b6cfede..2678a6b5b3f3 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -444,6 +444,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
 extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
 			struct delayed_work *dwork, unsigned long delay);
 extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
+extern bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork);
 
 extern void flush_workqueue(struct workqueue_struct *wq);
 extern void drain_workqueue(struct workqueue_struct *wq);
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 33f1106b4f99..9444949cc148 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1796,6 +1796,31 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
 }
 EXPORT_SYMBOL(queue_rcu_work);
 
+/**
+ * queue_rcu_work_lazy - queue work after a RCU grace period
+ * @wq: workqueue to use
+ * @rwork: work to queue
+ *
+ * Return: %false if @rwork was already pending, %true otherwise.  Note
+ * that a full RCU grace period is guaranteed only after a %true return.
+ * While @rwork is guaranteed to be executed after a %false return, the
+ * execution may happen before a full RCU grace period has passed.
+ */
+bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork)
+{
+	struct work_struct *work = &rwork->work;
+
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
+		rwork->wq = wq;
+		call_rcu_lazy(&rwork->rcu, rcu_work_rcufn);
+		return true;
+	}
+
+	return false;
+}
+EXPORT_SYMBOL(queue_rcu_work_lazy);
+
+
 /**
  * worker_enter_idle - enter idle state
  * @worker: worker which is entering idle state
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:00   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 block/blk-ioc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 11f49f78db32..96d5de5df0b6 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -98,7 +98,7 @@ static void ioc_destroy_icq(struct io_cq *icq)
 	 */
 	icq->__rcu_icq_cache = et->icq_cache;
 	icq->flags |= ICQ_DESTROYED;
-	call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
+	call_rcu_lazy(&icq->__rcu_head, icq_free_icq_rcu);
 }
 
 /*
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 04/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:02   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/cred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index 933155c96922..f4d69d7f2763 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
 	if (cred->non_rcu)
 		put_cred_rcu(&cred->rcu);
 	else
-		call_rcu(&cred->rcu, put_cred_rcu);
+		call_rcu_lazy(&cred->rcu, put_cred_rcu);
 }
 EXPORT_SYMBOL(__put_cred);
 
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:07   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy() Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

When testing, we found that these paths were invoked often when the
system is not doing anything (screen is ON but otherwise idle).

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 fs/dcache.c     | 4 ++--
 fs/eventpoll.c  | 2 +-
 fs/file_table.c | 3 ++-
 fs/inode.c      | 2 +-
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c84269c6e8bf..517e02cde103 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
 		if (likely(atomic_dec_and_test(&p->u.count))) {
-			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
 	}
@@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
 	if (dentry->d_flags & DCACHE_NORCU)
 		__d_free(&dentry->d_u.d_rcu);
 	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
+		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
 }
 
 /*
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index e2daa940ebce..10a24cca2cff 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
 	 * use of the rbn field.
 	 */
-	call_rcu(&epi->rcu, epi_rcu_free);
+	call_rcu_lazy(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
 
diff --git a/fs/file_table.c b/fs/file_table.c
index 7d2e692b66a9..415815d3ef80 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -56,7 +56,8 @@ static inline void file_free(struct file *f)
 	security_file_free(f);
 	if (!(f->f_mode & FMODE_NOACCOUNT))
 		percpu_counter_dec(&nr_files);
-	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
+
+	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 63324df6fa27..b288a5bef4c7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
 			return;
 	}
 	inode->free_inode = ops->free_inode;
-	call_rcu(&inode->i_rcu, i_callback);
+	call_rcu_lazy(&inode->i_rcu, i_callback);
 }
 
 /**
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 07/14] security: Move call_rcu() " Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/exit.c              | 2 +-
 kernel/pid.c               | 2 +-
 kernel/time/posix-timers.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index b00a25bb4ab9..6d84ec8e1fd0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -177,7 +177,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 void put_task_struct_rcu_user(struct task_struct *task)
 {
 	if (refcount_dec_and_test(&task->rcu_users))
-		call_rcu(&task->rcu, delayed_put_task_struct);
+		call_rcu_lazy(&task->rcu, delayed_put_task_struct);
 }
 
 void release_task(struct task_struct *p)
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..5a5144519d70 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -153,7 +153,7 @@ void free_pid(struct pid *pid)
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
-	call_rcu(&pid->rcu, delayed_put_pid);
+	call_rcu_lazy(&pid->rcu, delayed_put_pid);
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 1cd10b102c51..04e191dfa91e 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -485,7 +485,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	call_rcu(&tmr->rcu, k_itimer_rcu_free);
+	call_rcu_lazy(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static int common_timer_create(struct k_itimer *new_timer)
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 07/14] security: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 08/14] net/core: " Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 security/security.c    | 2 +-
 security/selinux/avc.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 22261d79f333..42c165631e8f 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1039,7 +1039,7 @@ void security_inode_free(struct inode *inode)
 	 * The inode will be freed after the RCU grace period too.
 	 */
 	if (inode->i_security)
-		call_rcu((struct rcu_head *)inode->i_security,
+		call_rcu_lazy((struct rcu_head *)inode->i_security,
 				inode_free_by_rcu);
 }
 
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index abcd9740d10f..8c639c6df955 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -442,7 +442,7 @@ static void avc_node_free(struct rcu_head *rhead)
 static void avc_node_delete(struct selinux_avc *avc, struct avc_node *node)
 {
 	hlist_del_rcu(&node->list);
-	call_rcu(&node->rhead, avc_node_free);
+	call_rcu_lazy(&node->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
@@ -458,7 +458,7 @@ static void avc_node_replace(struct selinux_avc *avc,
 			     struct avc_node *new, struct avc_node *old)
 {
 	hlist_replace_rcu(&old->list, &new->list);
-	call_rcu(&old->rhead, avc_node_free);
+	call_rcu_lazy(&old->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 08/14] net/core: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 07/14] security: Move call_rcu() " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 09/14] lib: " Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 net/core/dst.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index d16c2c9bfebd..68c240a4a0d7 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -174,7 +174,7 @@ void dst_release(struct dst_entry *dst)
 			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
 					     __func__, dst, newrefcnt);
 		if (!newrefcnt)
-			call_rcu(&dst->rcu_head, dst_destroy_rcu);
+			call_rcu_lazy(&dst->rcu_head, dst_destroy_rcu);
 	}
 }
 EXPORT_SYMBOL(dst_release);
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 09/14] lib: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 08/14] net/core: " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Move radix-tree and xarray to call_rcu_lazy(). This is required to prevent
callbacks from triggering the RCU machinery too quickly and too often, which
increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 lib/radix-tree.c | 2 +-
 lib/xarray.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index b3afafe46fff..1526dc9e1d93 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -305,7 +305,7 @@ void radix_tree_node_rcu_free(struct rcu_head *head)
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
diff --git a/lib/xarray.c b/lib/xarray.c
index 6f47f6375808..29d6abf6b1ff 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -255,7 +255,7 @@ static void xa_node_free(struct xa_node *node)
 {
 	XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
 	node->array = XA_RCU_FREE;
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 09/14] lib: " Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:12   ` Paul E. McKenney
  2022-05-12  3:04 ` [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index ebdf6f7c9023..3baf29014f86 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3407,7 +3407,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
 			// be that the work is in the pending state when
 			// channels have been detached following by each
 			// other.
-			queue_rcu_work(system_wq, &krwp->rcu_work);
+			queue_rcu_work_lazy(system_wq, &krwp->rcu_work);
 		}
 	}
 
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery too
quickly and too often, which increases the system's power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index d87b508b59b1..b54d1be3ee68 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -343,7 +343,7 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		__i915_gem_free_object(obj);
 
 		/* But keep the pointer alive for RCU-protected lookups */
-		call_rcu(&obj->rcu, __i915_gem_free_object_rcu);
+		call_rcu_lazy(&obj->rcu, __i915_gem_free_object_rcu);
 		cond_resched();
 	}
 }
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13 14:53   ` Uladzislau Rezki
  2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

monitor_todo is not needed, as the work struct already tracks whether work is
pending. Just use the delayed_work_pending() helper to know whether work is
pending.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3baf29014f86..3828ac3bf1c4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
  * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
  * @lock: Synchronize access to this structure
  * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
- * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
  * @initialized: The @rcu_work fields have been initialized
  * @count: Number of objects for which GP not started
  * @bkvcache:
@@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
 	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
 	raw_spinlock_t lock;
 	struct delayed_work monitor_work;
-	bool monitor_todo;
 	bool initialized;
 	int count;
 
@@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	// of the channels that is still busy we should rearm the
 	// work to repeat an attempt. Because previous batches are
 	// still in progress.
-	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
-		krcp->monitor_todo = false;
-	else
+	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
 		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
 
 	raw_spin_unlock_irqrestore(&krcp->lock, flags);
@@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 
 	// Set timer to drain after KFREE_DRAIN_JIFFIES.
 	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
-	    !krcp->monitor_todo) {
-		krcp->monitor_todo = true;
+	    !delayed_work_pending(&krcp->monitor_work))
 		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
-	}
 
 unlock_return:
 	krc_this_cpu_unlock(krcp, flags);
@@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
 		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
 
 		raw_spin_lock_irqsave(&krcp->lock, flags);
-		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
-				krcp->monitor_todo) {
-			raw_spin_unlock_irqrestore(&krcp->lock, flags);
-			continue;
+		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
+			if (delayed_work_pending(&krcp->monitor_work)) {
+				schedule_delayed_work_on(cpu, &krcp->monitor_work,
+						KFREE_DRAIN_JIFFIES);
+			}
 		}
-		krcp->monitor_todo = true;
-		schedule_delayed_work_on(cpu, &krcp->monitor_work,
-					 KFREE_DRAIN_JIFFIES);
 		raw_spin_unlock_irqrestore(&krcp->lock, flags);
 	}
 }
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (11 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13 14:54   ` Uladzislau Rezki
  2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

As per the comments in include/linux/shrinker.h, the .count_objects callback
should return the number of freeable items; if there are no objects to free,
SHRINK_EMPTY should be returned. 0 should be returned only when we are unable
to determine the number of objects, or when the cache should be skipped for
another reason.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 3828ac3bf1c4..f191542cdf5e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3637,7 +3637,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
 		atomic_set(&krcp->backoff_page_cache_fill, 1);
 	}
 
-	return count;
+	return count == 0 ? SHRINK_EMPTY : count;
 }
 
 static unsigned long
-- 
2.36.0.550.gb090851708-goog



* [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (12 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
@ 2022-05-12  3:04 ` Joel Fernandes (Google)
  2022-05-13  0:16   ` Paul E. McKenney
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
  2022-06-13 18:53 ` Joel Fernandes
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-05-12  3:04 UTC (permalink / raw)
  To: rcu
  Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt, Joel Fernandes (Google)

Add sysctl knobs, just for easier debugging/testing, to tune the maximum
batch size and the maximum time to wait before flushing, and to turn the
feature off entirely.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/sched/sysctl.h |  4 ++++
 kernel/rcu/lazy.c            | 12 ++++++++++--
 kernel/sysctl.c              | 23 +++++++++++++++++++++++
 3 files changed, 37 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c19dd5a2c05c..55ffc61beed1 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
 
 extern unsigned int sysctl_sched_child_runs_first;
 
+extern unsigned int sysctl_rcu_lazy;
+extern unsigned int sysctl_rcu_lazy_batch;
+extern unsigned int sysctl_rcu_lazy_jiffies;
+
 enum sched_tunable_scaling {
 	SCHED_TUNABLESCALING_NONE,
 	SCHED_TUNABLESCALING_LOG,
diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
index 55e406cfc528..0af9fb67c92b 100644
--- a/kernel/rcu/lazy.c
+++ b/kernel/rcu/lazy.c
@@ -12,6 +12,10 @@
 // How much to wait before flushing?
 #define MAX_LAZY_JIFFIES	10000
 
+unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
+unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
+unsigned int sysctl_rcu_lazy = 1;
+
 // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
 // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
 // later to ensure that rcu_head and lazy_rcu_head are of the same size.
@@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
 	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
 	struct rcu_lazy_pcp *rlp;
 
+	if (!sysctl_rcu_lazy) {
+		return call_rcu(head_rcu, func);
+	}
+
 	preempt_disable();
 	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
 	preempt_enable();
@@ -67,11 +75,11 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
 	llist_add(&head->llist_node, &rlp->head);
 
 	// Flush queue if too big
-	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
+	if (atomic_inc_return(&rlp->count) >= sysctl_rcu_lazy_batch) {
 		lazy_rcu_flush_cpu(rlp);
 	} else {
 		if (!delayed_work_pending(&rlp->work)) {
-			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
+			schedule_delayed_work(&rlp->work, sysctl_rcu_lazy_jiffies);
 		}
 	}
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5ae443b2882e..2ba830ca71ec 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1659,6 +1659,29 @@ static struct ctl_table kern_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 	},
+#ifdef CONFIG_RCU_LAZY
+	{
+		.procname	= "rcu_lazy",
+		.data		= &sysctl_rcu_lazy,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "rcu_lazy_batch",
+		.data		= &sysctl_rcu_lazy_batch,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{
+		.procname	= "rcu_lazy_jiffies",
+		.data		= &sysctl_rcu_lazy_jiffies,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+#endif
 #ifdef CONFIG_SCHEDSTATS
 	{
 		.procname	= "sched_schedstats",
-- 
2.36.0.550.gb090851708-goog



* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (13 preceding siblings ...)
  2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
@ 2022-05-12  3:17 ` Joel Fernandes
  2022-05-12 13:09   ` Uladzislau Rezki
  2022-05-13  0:23   ` Paul E. McKenney
  2022-06-13 18:53 ` Joel Fernandes
  15 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-12  3:17 UTC (permalink / raw)
  To: rcu
  Cc: Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Paul E. McKenney,
	Steven Rostedt

On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> [...]
>
> Disclaimer 1: Don't boot your personal system on it yet anticipating power
> savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> I believe this series should be good for general testing.
>
> Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> of review and agreements.

I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
support for that definitely needs work.
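
The underlying issue is that rcu_barrier() only waits for callbacks that have
already been handed to call_rcu(); callbacks still parked on the per-CPU lazy
lists are invisible to it. As a rough sketch of the kind of support needed
(rcu_lazy_flush_all() below is a hypothetical helper, not part of this series;
it only reuses lazy_rcu_flush_cpu() from patch 01), the lazy lists would have
to be flushed before the barrier is allowed to proceed:

  // Hypothetical sketch: push every CPU's lazy callbacks into call_rcu() so
  // that a following rcu_barrier() actually waits for them.
  static void rcu_lazy_flush_all(void)
  {
  	int cpu;

  	for_each_possible_cpu(cpu)
  		lazy_rcu_flush_cpu(per_cpu_ptr(&rcu_lazy_pcp_ins, cpu));
  }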

thanks,

 - Joel


* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
@ 2022-05-12 13:09   ` Uladzislau Rezki
  2022-05-12 13:56     ` Uladzislau Rezki
  2022-05-13  0:23   ` Paul E. McKenney
  1 sibling, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 13:09 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

Hello, Joel!

Which kernel version have you used for this series?

--
Uladzislau Rezki

On Thu, May 12, 2022 at 5:18 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> [...]



-- 
Uladzislau Rezki


* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 13:09   ` Uladzislau Rezki
@ 2022-05-12 13:56     ` Uladzislau Rezki
  2022-05-12 14:03       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 13:56 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

Never mind. I port it into 5.10

On Thu, May 12, 2022 at 3:09 PM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> Hello, Joel!
>
> Which kernel version have you used for this series?
>
> [...]



-- 
Uladzislau Rezki


* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 13:56     ` Uladzislau Rezki
@ 2022-05-12 14:03       ` Joel Fernandes
  2022-05-12 14:37         ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-12 14:03 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> Never mind. I port it into 5.10

Oh, this is on mainline. Sorry about that. If you want, I have a tree here for
5.10, although that does not have the kfree changes; everything else is
ditto.
https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4

thanks,

 - Joel




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 14:03       ` Joel Fernandes
@ 2022-05-12 14:37         ` Uladzislau Rezki
  2022-05-12 16:09           ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 14:37 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Paul E. McKenney, Steven Rostedt

> On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > Never mind. I port it into 5.10
> 
> Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> 5.10 , although that does not have the kfree changes, everything else is
> ditto.
> https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> 
No problem. The two kfree_rcu patches are not so important in this series,
so I have backported them into my 5.10 kernel, because the latest kernel
is not so easy to bring up and run on my device :)

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 14:37         ` Uladzislau Rezki
@ 2022-05-12 16:09           ` Joel Fernandes
  2022-05-12 16:32             ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-12 16:09 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, Rushikesh S Kadam, Neeraj upadhyay, Frederic Weisbecker,
	Paul E. McKenney, Steven Rostedt

On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > Never mind. I port it into 5.10
> >
> > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > 5.10 , although that does not have the kfree changes, everything else is
> > ditto.
> > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> >
> No problem. kfree_rcu two patches are not so important in this series.
> So i have backported them into my 5.10 kernel because the latest kernel
> is not so easy to up and run on my device :)

Actually, I was going to write here that apparently some tests are showing
kfree_rcu()->call_rcu_lazy() causing a possible regression. So it is
good to drop those for your initial testing!

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12 16:09           ` Joel Fernandes
@ 2022-05-12 16:32             ` Uladzislau Rezki
       [not found]               ` <Yn5e7w8NWzThUARb@pc638.lan>
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-12 16:32 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Paul E. McKenney, Steven Rostedt

> On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > Never mind. I port it into 5.10
> > >
> > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > 5.10 , although that does not have the kfree changes, everything else is
> > > ditto.
> > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > >
> > No problem. kfree_rcu two patches are not so important in this series.
> > So i have backported them into my 5.10 kernel because the latest kernel
> > is not so easy to up and run on my device :)
> 
> Actually I was going to write here, apparently some tests are showing
> kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> good to drop those for your initial testing!
> 
Yep, I dropped both. The one that makes use of call_rcu_lazy() does not seem
so important for kfree_rcu(), because we do batch requests there anyway.
One thing that I would like to improve in kfree_rcu() is better utilization
of page slots.

I will share my results either tomorrow or on Monday. I hope that is fine.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
@ 2022-05-12 23:56   ` Paul E. McKenney
  2022-05-14 15:08     ` Joel Fernandes
  2022-06-01 14:24     ` Frederic Weisbecker
  2022-05-17  9:07   ` Uladzislau Rezki
  1 sibling, 2 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-12 23:56 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> Implement timer-based RCU callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure can flush it.
> 
> Locking is avoided to reduce lock contention when queuing and dequeuing
> happens on different CPUs of a per-cpu list, such as when shrinker
> context is running on different CPU. Also not having to use locks keeps
> the per-CPU structure size small.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

It is very good to see this!  Inevitable comments and questions below.

							Thanx, Paul

> ---
>  include/linux/rcupdate.h |   6 ++
>  kernel/rcu/Kconfig       |   8 +++
>  kernel/rcu/Makefile      |   1 +
>  kernel/rcu/lazy.c        | 145 +++++++++++++++++++++++++++++++++++++++
>  kernel/rcu/rcu.h         |   5 ++
>  kernel/rcu/tree.c        |   2 +
>  6 files changed, 167 insertions(+)
>  create mode 100644 kernel/rcu/lazy.c
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 88b42eb46406..d0a6c4f5172c 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active __read_mostly;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index bf8e341e75b4..c09715079829 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default y

This "default y" is OK for experimentation, but for mainline we should
not be forcing this on unsuspecting people.  ;-)

> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +          pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 0cfb009a99b9..8968b330d6e0 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
>  obj-$(CONFIG_TREE_RCU) += tree.o
>  obj-$(CONFIG_TINY_RCU) += tiny.o
>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> +obj-$(CONFIG_RCU_LAZY) += lazy.o
> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> new file mode 100644
> index 000000000000..55e406cfc528
> --- /dev/null
> +++ b/kernel/rcu/lazy.c
> @@ -0,0 +1,145 @@
> +/*
> + * Lockless lazy-RCU implementation.
> + */
> +#include <linux/rcupdate.h>
> +#include <linux/shrinker.h>
> +#include <linux/workqueue.h>
> +#include "rcu.h"
> +
> +// How much to batch before flushing?
> +#define MAX_LAZY_BATCH		2048
> +
> +// How much to wait before flushing?
> +#define MAX_LAZY_JIFFIES	10000

That is more than a minute on a HZ=100 system.  Are you sure that you
did not mean "(10 * HZ)" or some such?
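
To spell out the arithmetic: 10000 raw jiffies is 10 seconds only at
HZ=1000, but 40 seconds at HZ=250 and 100 seconds at HZ=100.  An
HZ-relative definition keeps the intended delay fixed, for example
(illustrative only, not a claim about the right value):

	/* Flush after 10 seconds regardless of the HZ configuration. */
	#define MAX_LAZY_JIFFIES	(10 * HZ)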

> +
> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> +struct lazy_rcu_head {
> +	struct llist_node llist_node;
> +	void (*func)(struct callback_head *head);
> +} __attribute__((aligned(sizeof(void *))));

This needs a build-time check that rcu_head and lazy_rcu_head are of
the same size.  Maybe something like this in some appropriate context:

	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct lazy_rcu_head));

Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
you also mention this in the above comments.  ;-)

> +
> +struct rcu_lazy_pcp {
> +	struct llist_head head;
> +	struct delayed_work work;
> +	atomic_t count;
> +};
> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> +
> +// Lockless flush of CPU, can be called concurrently.
> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> +{
> +	struct llist_node *node = llist_del_all(&rlp->head);
> +	struct lazy_rcu_head *cursor, *temp;
> +
> +	if (!node)
> +		return;

At this point, the list is empty but the count is non-zero.  Can
that cause a problem?  (For the existing callback lists, this would
be OK.)

> +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> +		struct rcu_head *rh = (struct rcu_head *)cursor;
> +		debug_rcu_head_unqueue(rh);

Good to see this check!

> +		call_rcu(rh, rh->func);
> +		atomic_dec(&rlp->count);
> +	}
> +}
> +
> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> +{
> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> +	struct rcu_lazy_pcp *rlp;
> +
> +	preempt_disable();
> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);

Whitespace issue, please fix.

> +	preempt_enable();
> +
> +	if (debug_rcu_head_queue((void *)head)) {
> +		// Probable double call_rcu(), just leak.
> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> +				__func__, head);
> +
> +		// Mark as success and leave.
> +		return;
> +	}
> +
> +	// Queue to per-cpu llist
> +	head->func = func;
> +	llist_add(&head->llist_node, &rlp->head);

Suppose that there are a bunch of preemptions between the preempt_enable()
above and this point, so that the current CPU's list has lots of
callbacks, but zero ->count.  Can that cause a problem?

In the past, this sort of thing has been an issue for rcu_barrier()
and friends.
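
One way to narrow that window, just as a sketch and not something the
patch currently does, would be to keep the enqueue and the counter
update in a single non-preemptible section:

	preempt_disable();
	rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
	head->func = func;
	llist_add(&head->llist_node, &rlp->head);
	atomic_inc(&rlp->count);
	preempt_enable();

That way each callback is counted before the task can be preempted away,
so ->count cannot lag arbitrarily far behind the list contents.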

> +	// Flush queue if too big

You will also need to check for early boot use.

I -think- it suffices to simply skip the following "if" statement when
rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
kernels won't expire them until the softirq kthreads have been spawned.

Which is OK, as it just means that call_rcu_lazy() is a bit more
lazy than expected that early.

Except that call_rcu() can be invoked even before rcu_init() has been
invoked, which is therefore also before rcu_init_lazy() has been invoked.

I therefore suggest something like this at the very start of this function:

	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		call_rcu(head_rcu, func);
		return;
	}

The goal is that people can replace call_rcu() with call_rcu_lazy()
without having to worry about invocation during early boot.

Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
though "rhp" is more consistent with the RCU pointer initials approach.

> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> +		lazy_rcu_flush_cpu(rlp);
> +	} else {
> +		if (!delayed_work_pending(&rlp->work)) {

This check is racy because the work might run to completion right at
this point.  Wouldn't it be better to rely on the internal check of
WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
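
In other words (sketch only), the enqueue path could simply be:

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);

relying on queue_delayed_work_on(), underneath schedule_delayed_work(),
to do its own atomic test_and_set_bit() of WORK_STRUCT_PENDING_BIT and
silently do nothing if the work is already queued.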

> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> +		}
> +	}
> +}
> +
> +static unsigned long
> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	/* Snapshot count of all CPUs */
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +
> +		count += atomic_read(&rlp->count);
> +	}
> +
> +	return count;
> +}
> +
> +static unsigned long
> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	int cpu, freed = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		unsigned long count;
> +
> +		count = atomic_read(&rlp->count);
> +		lazy_rcu_flush_cpu(rlp);
> +		sc->nr_to_scan -= count;
> +		freed += count;
> +		if (sc->nr_to_scan <= 0)
> +			break;
> +	}
> +
> +	return freed == 0 ? SHRINK_STOP : freed;

This is a bit surprising given the stated aim of SHRINK_STOP to indicate
potential deadlocks.  But this pattern is common, including on the
kvfree_rcu() path, so OK!  ;-)

Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
that as well.

> +}
> +
> +/*
> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> + */
> +static void lazy_work(struct work_struct *work)
> +{
> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> +
> +	lazy_rcu_flush_cpu(rlp);
> +}
> +
> +static struct shrinker lazy_rcu_shrinker = {
> +	.count_objects = lazy_rcu_shrink_count,
> +	.scan_objects = lazy_rcu_shrink_scan,
> +	.batch = 0,
> +	.seeks = DEFAULT_SEEKS,
> +};
> +
> +void __init rcu_lazy_init(void)
> +{
> +	int cpu;
> +
> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> +	}
> +
> +	if (register_shrinker(&lazy_rcu_shrinker))
> +		pr_err("Failed to register lazy_rcu shrinker!\n");
> +}
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 24b5f2c2de87..a5f4b44f395f 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
>  #endif
>  
> +#ifdef CONFIG_RCU_LAZY
> +void rcu_lazy_init(void);
> +#else
> +static inline void rcu_lazy_init(void) {}
> +#endif
>  #endif /* __LINUX_RCU_H */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a4c25a6283b0..ebdf6f7c9023 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
>  	else
>  		qovld_calc = qovld;
> +
> +	rcu_lazy_init();
>  }
>  
>  #include "tree_stall.h"
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work()
  2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
@ 2022-05-12 23:58   ` Paul E. McKenney
  2022-05-14 14:44     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-12 23:58 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:30AM +0000, Joel Fernandes (Google) wrote:
> This will be used in kfree_rcu() later to make it do call_rcu() lazily.

Given that kfree_rcu() does its own laziness in filling in the page
of pointers, do we really need to add additional laziness?

Or are you measuring a significant benefit from doing this?

							Thanx, Paul

> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/workqueue.h |  1 +
>  kernel/workqueue.c        | 25 +++++++++++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> index 7fee9b6cfede..2678a6b5b3f3 100644
> --- a/include/linux/workqueue.h
> +++ b/include/linux/workqueue.h
> @@ -444,6 +444,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
>  			struct delayed_work *dwork, unsigned long delay);
>  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> +extern bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork);
>  
>  extern void flush_workqueue(struct workqueue_struct *wq);
>  extern void drain_workqueue(struct workqueue_struct *wq);
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 33f1106b4f99..9444949cc148 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1796,6 +1796,31 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
>  }
>  EXPORT_SYMBOL(queue_rcu_work);
>  
> +/**
> + * queue_rcu_work_lazy - queue work after a RCU grace period
> + * @wq: workqueue to use
> + * @rwork: work to queue
> + *
> + * Return: %false if @rwork was already pending, %true otherwise.  Note
> + * that a full RCU grace period is guaranteed only after a %true return.
> + * While @rwork is guaranteed to be executed after a %false return, the
> + * execution may happen before a full RCU grace period has passed.
> + */
> +bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork)
> +{
> +	struct work_struct *work = &rwork->work;
> +
> +	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> +		rwork->wq = wq;
> +		call_rcu_lazy(&rwork->rcu, rcu_work_rcufn);
> +		return true;
> +	}
> +
> +	return false;
> +}
> +EXPORT_SYMBOL(queue_rcu_work_lazy);
> +
> +
>  /**
>   * worker_enter_idle - enter idle state
>   * @worker: worker which is entering idle state
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-05-13  0:00   ` Paul E. McKenney
  0 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:00 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:31AM +0000, Joel Fernandes (Google) wrote:
> This is required to prevent callbacks triggering RCU machinery too
> quickly and too often, which adds more power to the system.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Hard to argue with this one given that icq_free_icq_rcu() just does
a kmem_cache_free().  ;-)

							Thanx, Paul

> ---
>  block/blk-ioc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-ioc.c b/block/blk-ioc.c
> index 11f49f78db32..96d5de5df0b6 100644
> --- a/block/blk-ioc.c
> +++ b/block/blk-ioc.c
> @@ -98,7 +98,7 @@ static void ioc_destroy_icq(struct io_cq *icq)
>  	 */
>  	icq->__rcu_icq_cache = et->icq_cache;
>  	icq->flags |= ICQ_DESTROYED;
> -	call_rcu(&icq->__rcu_head, icq_free_icq_rcu);
> +	call_rcu_lazy(&icq->__rcu_head, icq_free_icq_rcu);
>  }
>  
>  /*
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 04/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
@ 2022-05-13  0:02   ` Paul E. McKenney
  2022-05-14 14:41     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:02 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:32AM +0000, Joel Fernandes (Google) wrote:
> This is required to prevent callbacks triggering RCU machinery too
> quickly and too often, which adds more power to the system.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

The put_cred_rcu() does some debugging that would be less effective
with a 10-second delay.  Which is probably OK assuming that significant
testing happens on CONFIG_RCU_LAZY=n kernels.

							Thanx, Paul

> ---
>  kernel/cred.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 933155c96922..f4d69d7f2763 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
>  	if (cred->non_rcu)
>  		put_cred_rcu(&cred->rcu);
>  	else
> -		call_rcu(&cred->rcu, put_cred_rcu);
> +		call_rcu_lazy(&cred->rcu, put_cred_rcu);
>  }
>  EXPORT_SYMBOL(__put_cred);
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
@ 2022-05-13  0:07   ` Paul E. McKenney
  2022-05-14 14:40     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:33AM +0000, Joel Fernandes (Google) wrote:
> This is required to prevent callbacks triggering RCU machinery too
> quickly and too often, which adds more power to the system.
> 
> When testing, we found that these paths were invoked often when the
> system is not doing anything (screen is ON but otherwise idle).
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Some of these callbacks do additional work.  I don't immediately see a
problem, but careful and thorough testing is required.

							Thanx, Paul

> ---
>  fs/dcache.c     | 4 ++--
>  fs/eventpoll.c  | 2 +-
>  fs/file_table.c | 3 ++-
>  fs/inode.c      | 2 +-
>  4 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/dcache.c b/fs/dcache.c
> index c84269c6e8bf..517e02cde103 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
>  	if (unlikely(dname_external(dentry))) {
>  		struct external_name *p = external_name(dentry);
>  		if (likely(atomic_dec_and_test(&p->u.count))) {
> -			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
> +			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
>  			return;
>  		}
>  	}
> @@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
>  	if (dentry->d_flags & DCACHE_NORCU)
>  		__d_free(&dentry->d_u.d_rcu);
>  	else
> -		call_rcu(&dentry->d_u.d_rcu, __d_free);
> +		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
>  }
>  
>  /*
> diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> index e2daa940ebce..10a24cca2cff 100644
> --- a/fs/eventpoll.c
> +++ b/fs/eventpoll.c
> @@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
>  	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
>  	 * use of the rbn field.
>  	 */
> -	call_rcu(&epi->rcu, epi_rcu_free);
> +	call_rcu_lazy(&epi->rcu, epi_rcu_free);
>  
>  	percpu_counter_dec(&ep->user->epoll_watches);
>  
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 7d2e692b66a9..415815d3ef80 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -56,7 +56,8 @@ static inline void file_free(struct file *f)
>  	security_file_free(f);
>  	if (!(f->f_mode & FMODE_NOACCOUNT))
>  		percpu_counter_dec(&nr_files);
> -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> +
> +	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
>  }
>  
>  /*
> diff --git a/fs/inode.c b/fs/inode.c
> index 63324df6fa27..b288a5bef4c7 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
>  			return;
>  	}
>  	inode->free_inode = ops->free_inode;
> -	call_rcu(&inode->i_rcu, i_callback);
> +	call_rcu_lazy(&inode->i_rcu, i_callback);
>  }
>  
>  /**
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
@ 2022-05-13  0:12   ` Paul E. McKenney
  2022-05-13 14:55     ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:12 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Again, given that kfree_rcu() is doing its own laziness, is this really
helping?  If so, would it instead make sense to adjust the kfree_rcu()
timeouts?
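
For example (purely illustrative, not a recommendation of a particular
value), stretching the existing drain interval in kernel/rcu/tree.c would
lengthen the batching window without stacking a second layer of laziness
on top:

	/* Hypothetical: drain kfree_rcu() batches only after 10 seconds. */
	#define KFREE_DRAIN_JIFFIES	(10 * HZ)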

							Thanx, Paul

> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index ebdf6f7c9023..3baf29014f86 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3407,7 +3407,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  			// be that the work is in the pending state when
>  			// channels have been detached following by each
>  			// other.
> -			queue_rcu_work(system_wq, &krwp->rcu_work);
> +			queue_rcu_work_lazy(system_wq, &krwp->rcu_work);
>  		}
>  	}
>  
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
@ 2022-05-13  0:16   ` Paul E. McKenney
  2022-05-14 14:38     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:16 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 03:04:42AM +0000, Joel Fernandes (Google) wrote:
> Add sysctl knobs just for easier debugging/testing, to tune the maximum
> batch size, maximum time to wait before flush, and turning off the
> feature entirely.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

This is good, and might also be needed longer term.

One thought below.

							Thanx, Paul

> ---
>  include/linux/sched/sysctl.h |  4 ++++
>  kernel/rcu/lazy.c            | 12 ++++++++++--
>  kernel/sysctl.c              | 23 +++++++++++++++++++++++
>  3 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> index c19dd5a2c05c..55ffc61beed1 100644
> --- a/include/linux/sched/sysctl.h
> +++ b/include/linux/sched/sysctl.h
> @@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
>  
>  extern unsigned int sysctl_sched_child_runs_first;
>  
> +extern unsigned int sysctl_rcu_lazy;
> +extern unsigned int sysctl_rcu_lazy_batch;
> +extern unsigned int sysctl_rcu_lazy_jiffies;
> +
>  enum sched_tunable_scaling {
>  	SCHED_TUNABLESCALING_NONE,
>  	SCHED_TUNABLESCALING_LOG,
> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> index 55e406cfc528..0af9fb67c92b 100644
> --- a/kernel/rcu/lazy.c
> +++ b/kernel/rcu/lazy.c
> @@ -12,6 +12,10 @@
>  // How much to wait before flushing?
>  #define MAX_LAZY_JIFFIES	10000
>  
> +unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
> +unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
> +unsigned int sysctl_rcu_lazy = 1;
> +
>  // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
>  // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
>  // later to ensure that rcu_head and lazy_rcu_head are of the same size.
> @@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
>  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
>  	struct rcu_lazy_pcp *rlp;
>  
> +	if (!sysctl_rcu_lazy) {

This is the place to check for early boot use.  Or, alternatively,
initialize sysctl_rcu_lazy to zero and set it to one once boot is far
enough along to allow all the pieces to work reasonably.
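
A sketch of the first option, reusing the fallback path already in the
patch:

	if (!sysctl_rcu_lazy ||
	    rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		return call_rcu(head_rcu, func);
	}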

> +		return call_rcu(head_rcu, func);
> +	}
> +
>  	preempt_disable();
>          rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
>  	preempt_enable();
> @@ -67,11 +75,11 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
>  	llist_add(&head->llist_node, &rlp->head);
>  
>  	// Flush queue if too big
> -	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> +	if (atomic_inc_return(&rlp->count) >= sysctl_rcu_lazy_batch) {
>  		lazy_rcu_flush_cpu(rlp);
>  	} else {
>  		if (!delayed_work_pending(&rlp->work)) {
> -			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> +			schedule_delayed_work(&rlp->work, sysctl_rcu_lazy_jiffies);
>  		}
>  	}
>  }
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 5ae443b2882e..2ba830ca71ec 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1659,6 +1659,29 @@ static struct ctl_table kern_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec,
>  	},
> +#ifdef CONFIG_RCU_LAZY
> +	{
> +		.procname	= "rcu_lazy",
> +		.data		= &sysctl_rcu_lazy,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "rcu_lazy_batch",
> +		.data		= &sysctl_rcu_lazy_batch,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +	{
> +		.procname	= "rcu_lazy_jiffies",
> +		.data		= &sysctl_rcu_lazy_jiffies,
> +		.maxlen		= sizeof(unsigned int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec,
> +	},
> +#endif
>  #ifdef CONFIG_SCHEDSTATS
>  	{
>  		.procname	= "sched_schedstats",
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
  2022-05-12 13:09   ` Uladzislau Rezki
@ 2022-05-13  0:23   ` Paul E. McKenney
  2022-05-13 14:45     ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-13  0:23 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Hello!
> > Please find the proof of concept version of call_rcu_lazy() attached. This
> > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > Rushikesh Kadam from Intel for investigating it with me.
> >
> > Some numbers below:
> >
> > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > system is very lightly loaded but constantly running few RCU callbacks very
> > often. This confuses the power management hardware that the system is active,
> > when it is in fact idle.
> >
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> >
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> >
> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> >
> > This patch series adds a simple but effective, and lockless implementation of
> > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > initiate a flush of one or more per-CPU lists.
> >
> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > of function graph tracer when increasing that.
> >
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> >
> > NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> > at runtime to turn it on or off globally. It is default to on. Further, please
> > use the sysctls in lazy.c for further tuning of parameters that effect the
> > flushing.
> >
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.

Sometimes OOM conditions result in stalls.

> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.

We will of course need them to look at the call_rcu_lazy() conversions
at some point, but in the meantime, experimentation is fine.  I looked
at a few, but quickly decided to defer to the people with a better
understanding of the code.

> I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> support for that definitely needs work.

Good to know.  ;-)

With this in place, can the system survive a userspace close(open())
loop, or does that result in OOM?  (I am not worried about battery
lifetime while close(open()) is running, just OOM resistance.)
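
For concreteness, the kind of loop meant here is nothing more elaborate
than, say, the following userspace program, which hammers the
file_free_rcu() path as fast as possible (the file name is arbitrary):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		for (;;)
			close(open("/dev/null", O_RDONLY));
	}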

Does waiting for the shrinker to kick in suffice, or should the
system pressure be taken into account?  As in the "total" numbers
from /proc/pressure/memory.

Again, it is very good to see this series!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-13  0:23   ` Paul E. McKenney
@ 2022-05-13 14:45     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-13 14:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Thu, May 12, 2022 at 8:23 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Wed, May 11, 2022 at 11:17:59PM -0400, Joel Fernandes wrote:
> > On Wed, May 11, 2022 at 11:04 PM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > Hello!
> > > Please find the proof of concept version of call_rcu_lazy() attached. This
> > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > > Rushikesh Kadam from Intel for investigating it with me.
> > >
> > > Some numbers below:
> > >
> > > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > > system is very lightly loaded but constantly running few RCU callbacks very
> > > often. This confuses the power management hardware that the system is active,
> > > when it is in fact idle.
> > >
> > > For example, when ChromeOS screen is off and user is not doing anything on the
> > > system, we can see big power savings.
> > > Before:
> > > Pk%pc10 = 72.13
> > > PkgWatt = 0.58
> > > CorWatt = 0.04
> > >
> > > After:
> > > Pk%pc10 = 81.28
> > > PkgWatt = 0.41
> > > CorWatt = 0.03
> > >
> > > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > > can see that the display pipeline is constantly doing RCU callback queuing due
> > > to open/close of file descriptors associated with graphics buffers. This is
> > > attributed to the file_free_rcu() path which this patch series also touches.
> > >
> > > This patch series adds a simple but effective, and lockless implementation of
> > > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > > initiate a flush of one or more per-CPU lists.
> > >
> > > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > > of function graph tracer when increasing that.
> > >
> > > One drawback of this series is, if another frequent RCU callback creeps up in
> > > the future, that's not lazy, then that will again hurt the power. However, I
> > > believe identifying and fixing those is a more reasonable approach than slowing
> > > RCU down for the whole system.
> > >
> > > NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> > > at runtime to turn it on or off globally. It is default to on. Further, please
> > > use the sysctls in lazy.c for further tuning of parameters that effect the
> > > flushing.
> > >
> > > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > > I believe this series should be good for general testing.
>
> Sometimes OOM conditions result in stalls.

I see.

> > I did forget to add Disclaimer 3, that this breaks rcu_barrier() and
> > support for that definitely needs work.
>
> Good to know.  ;-)
>
> With this in place, can the system survive a userspace close(open())
> loop, or does that result in OOM?  (I am not worried about battery
> lifetime while close(open()) is running, just OOM resistance.)

Yes, in my testing it survived. I even dropped memory to 512MB and did
the open/close loop test. I believe it survives also because we don't
let the list grow too big (other than shrinker flushing).

>
> Does waiting for the shrinker to kick in suffice, or should the
> system pressure be taken into account?  As in the "total" numbers
> from /proc/pressure/memory.

I did not find that taking system memory pressure into account is necessary.

> Again, it is very good to see this series!

Thanks, I appreciate that. I am excited about battery life savings in
millions of battery-powered devices ;-) Even on my grandmom's Android
phone ;-)

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
       [not found]               ` <Yn5e7w8NWzThUARb@pc638.lan>
@ 2022-05-13 14:51                 ` Joel Fernandes
  2022-05-13 15:43                   ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-13 14:51 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Steven Rostedt

On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
>
> > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > Never mind. I port it into 5.10
> > > > >
> > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > ditto.
> > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > >
> > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > is not so easy to up and run on my device :)
> > >
> > > Actually I was going to write here, apparently some tests are showing
> > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > good to drop those for your initial testing!
> > >
> > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > so important for kfree_rcu() because we do batch requests there anyway.
> > One thing that i would like to improve in kfree_rcu() is a better utilization
> > of page slots.
> >
> > I will share my results either tomorrow or on Monday. I hope that is fine.
> >
>
> Here we go with some data on our Android handset that runs 5.10 kernel. The test
> case i have checked was a "static image" use case. Condition is: screen ON with
> disabled all connectivity.
>
> 1.
> First data i took is how many wakeups cause an RCU subsystem during this test case
> when everything is pretty idling. Duration is 360 seconds:
>
> <snip>
> serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu

Nice! Do you mind sharing this script? I was just telling Rushikesh
that we want something like this during testing. Appreciate it. Also,
if we could dump timer wakeup reasons/callbacks, that would also be awesome.

FWIW, I wrote a BPF tool that periodically dumps callbacks and can
share that with you on request as well. It is probably not in
shape for mainline though (Makefile missing and such).

> name:                         rcub/0 pid:         16 woken-up     2     interval: min 86772734  max 86772734    avg 43386367
> name:                        rcuop/7 pid:         69 woken-up     4     interval: min  4189     max  8050       avg  5049
> name:                        rcuop/6 pid:         62 woken-up    55     interval: min  6910     max 42592159    avg 3818752
[..]

> There is a big improvement in lazy case. Number of wake-ups got reduced quite a lot
> and it is really good!

Cool!

> 2.
> Please find in attachment two power plots. The same test case. One is related to a
> regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> what is also important, thus it proofs the concept. Please note it might be more power
> efficient for other arches and platforms. Because of different HW design that is related
> to C-states of CPU and energy that is needed to in/out of those deep power states.

Nice! I wonder if you still have other frequent callbacks on your
system that are getting queued during the tests. Could you dump the
rcu_callback trace event and see if you have any CBs frequently
called that the series did not address?

Also, one more thing I was curious about is - do you see savings when
you pin the rcu threads to the LITTLE CPUs of the system? The theory
being, not disturbing the BIG CPUs which are more power hungry may let
them go into a deeper idle state and save power (due to leakage
current and so forth).

> So a front-lazy-batching is something worth to have, IMHO :)

Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
it, we can also add your data about this investigation to the LPC
slides (with attribution to you).

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
@ 2022-05-13 14:53   ` Uladzislau Rezki
  2022-05-14 14:35     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 14:53 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt

> monitor_todo is not needed as the work struct already tracks if work is
> pending. Just use that to know if work is pending using
> delayed_work_pending() helper.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree.c | 22 +++++++---------------
>  1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3baf29014f86..3828ac3bf1c4 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
>   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
>   * @lock: Synchronize access to this structure
>   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> - * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
>   * @initialized: The @rcu_work fields have been initialized
>   * @count: Number of objects for which GP not started
>   * @bkvcache:
> @@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
>  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
>  	raw_spinlock_t lock;
>  	struct delayed_work monitor_work;
> -	bool monitor_todo;
>  	bool initialized;
>  	int count;
>  
> @@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
>  	// of the channels that is still busy we should rearm the
>  	// work to repeat an attempt. Because previous batches are
>  	// still in progress.
> -	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
> -		krcp->monitor_todo = false;
> -	else
> +	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
>  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
>  
>  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> @@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
>  
>  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
>  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> -	    !krcp->monitor_todo) {
> -		krcp->monitor_todo = true;
> +	    !delayed_work_pending(&krcp->monitor_work))
>  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> -	}
>  
>  unlock_return:
>  	krc_this_cpu_unlock(krcp, flags);
> @@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
>  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
>  
>  		raw_spin_lock_irqsave(&krcp->lock, flags);
> -		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
> -				krcp->monitor_todo) {
> -			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> -			continue;
> +		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
> +			if (delayed_work_pending(&krcp->monitor_work)) {
> +				schedule_delayed_work_on(cpu, &krcp->monitor_work,
> +						KFREE_DRAIN_JIFFIES);
> +			}
>  		}
> -		krcp->monitor_todo = true;
> -		schedule_delayed_work_on(cpu, &krcp->monitor_work,
> -					 KFREE_DRAIN_JIFFIES);
>  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
>  	}
>  }
> -- 
>
Looks good to me at first glance, but let me have a look
at it more closely.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value
  2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
@ 2022-05-13 14:54   ` Uladzislau Rezki
  2022-05-14 14:34     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 14:54 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt

On Thu, May 12, 2022 at 03:04:41AM +0000, Joel Fernandes (Google) wrote:
> As per the comments in include/linux/shrinker.h, .count_objects callback
> should return the number of freeable items, but if there are no objects
> to free, SHRINK_EMPTY should be returned. The only time 0 is returned
> should be when we are unable to determine the number of objects, or the
> cache should be skipped for another reason.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 3828ac3bf1c4..f191542cdf5e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3637,7 +3637,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>  		atomic_set(&krcp->backoff_page_cache_fill, 1);
>  	}
>  
> -	return count;
> +	return count == 0 ? SHRINK_EMPTY : count;
>  }
>  
>  static unsigned long
> -- 
> 2.36.0.550.gb090851708-goog
> 
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-13  0:12   ` Paul E. McKenney
@ 2022-05-13 14:55     ` Uladzislau Rezki
  2022-05-14 14:33       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 14:55 UTC (permalink / raw)
  To: Paul E. McKenney, Joel Fernandes (Google)
  Cc: Joel Fernandes (Google),
	rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

> On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Again, given that kfree_rcu() is doing its own laziness, is this really
> helping?  If so, would it instead make sense to adjust the kfree_rcu()
> timeouts?
> 
IMHO, this patch does not help much. As Paul has mentioned, we use
batching anyway.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-13 14:51                 ` Joel Fernandes
@ 2022-05-13 15:43                   ` Uladzislau Rezki
  2022-05-14 14:25                     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-13 15:43 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Paul E. McKenney, rcu, Rushikesh S Kadam,
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

[-- Attachment #1: Type: text/plain, Size: 6754 bytes --]

> On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> >
> > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > >
> > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > Never mind. I port it into 5.10
> > > > > >
> > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > ditto.
> > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > >
> > > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > > is not so easy to up and run on my device :)
> > > >
> > > > Actually I was going to write here, apparently some tests are showing
> > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > good to drop those for your initial testing!
> > > >
> > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > > so important for kfree_rcu() because we do batch requests there anyway.
> > > One thing that i would like to improve in kfree_rcu() is a better utilization
> > > of page slots.
> > >
> > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > >
> >
> > Here we go with some data on our Android handset that runs 5.10 kernel. The test
> > case i have checked was a "static image" use case. Condition is: screen ON with
> > disabled all connectivity.
> >
> > 1.
> > First data i took is how many wakeups cause an RCU subsystem during this test case
> > when everything is pretty idling. Duration is 360 seconds:
> >
> > <snip>
> > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> 
> Nice! Do you mind sharing this script? I was just talking to Rushikesh
> that we want something like this during testing. Appreciate it. Also,
> if we dump timer wakeup reasons/callbacks that would also be awesome.
> 
Please find it in the attachment. I wrote it once upon a time and use it
to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
so just compile it.

How to use it:
1. run perf: './perf sched record -a -- sleep <seconds of data to collect>'
2. ./perf script -i ./perf.data > foo.script
3. ./perf_script_parser ./foo.script

>
> FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> share that with you on request as well. That is probably not in a
> shape for mainline though (Makefile missing and such).
> 
Yep, please share!

> > name:                         rcub/0 pid:         16 woken-up     2     interval: min 86772734  max 86772734    avg 43386367
> > name:                        rcuop/7 pid:         69 woken-up     4     interval: min  4189     max  8050       avg  5049
> > name:                        rcuop/6 pid:         62 woken-up    55     interval: min  6910     max 42592159    avg 3818752
> [..]
> 
> > There is a big improvement in lazy case. Number of wake-ups got reduced quite a lot
> > and it is really good!
> 
> Cool!
> 
> > 2.
> > Please find in attachment two power plots. The same test case. One is related to a
> > regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> > in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> > what is also important, thus it proofs the concept. Please note it might be more power
> > efficient for other arches and platforms. Because of different HW design that is related
> > to C-states of CPU and energy that is needed to in/out of those deep power states.
> 
> Nice! I wonder if you still have other frequent callbacks on your
> system that are getting queued during the tests. Could you dump the
> rcu_callbacks trace event and see if you have any CBs frequently
> called that the series did not address?
>
It looks pretty much like this:
<snip>
    rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
    rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
    rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
    rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
    rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
    rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
    rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
<snip>

so almost everything is batched.

>
> Also, one more thing I was curious about is - do you see savings when
> you pin the rcu threads to the LITTLE CPUs of the system? The theory
> being, not disturbing the BIG CPUs which are more power hungry may let
> them go into a deeper idle state and save power (due to leakage
> current and so forth).
> 
I did some experimenting with pinning the nocb threads to the little cluster.
For idle use cases I did not see any power gain. For a heavy one I see that
the "big" CPUs are also invoking callbacks and are busy with them quite often.
Probably I should think of some use case where I can detect the power
difference. If you have something, please let me know.

> > So front-lazy-batching is something worth having, IMHO :)
> 
> Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> it, we can also add your investigation data to the LPC slides
> (with attribution to you).
> 
No problem; since we will give a talk at LPC, the more data we have, the
more convincing we are :)

--
Uladzislau Rezki

[-- Attachment #2: perf_script_parser.c --]
[-- Type: text/x-csrc, Size: 19453 bytes --]

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <ctype.h>
#include <string.h>

/*
 * Splay-tree implementation to store data: key,value
 * See https://en.wikipedia.org/wiki/Splay_tree
 */
#define offsetof(TYPE, MEMBER)	((unsigned long)&((TYPE *)0)->MEMBER)
#define container_of(ptr, type, member)			\
({												\
	void *__mptr = (void *)(ptr);					\
	((type *)(__mptr - offsetof(type, member)));	\
})

#define SP_INIT_NODE(node)									\
	((node)->left = (node)->right = (node)->parent = NULL)

struct splay_node {
	struct splay_node *left;
	struct splay_node *right;
	struct splay_node *parent;
	unsigned long val;
};

struct splay_root {
	struct splay_node *sp_root;
};

static inline void
set_parent(struct splay_node *n, struct splay_node *p)
{
	if (n)
		n->parent = p;
}

static inline void
change_child(struct splay_node *p,
	struct splay_node *old, struct splay_node *new)
{
	if (p) {
		if (p->left == old)
			p->left = new;
		else
			p->right = new;
	}
}

/*
 * left rotation of node (r), (rc) is (r)'s right child
 */
static inline struct splay_node *
left_pivot(struct splay_node *r)
{
	struct splay_node *rc;

	/*
	 * set (rc) to be the new root
	 */
	rc = r->right;

	/*
	 * point parent to new left/right child
	 */
	rc->parent = r->parent;

	/*
	 * change child of the p->parent.
	 */
	change_child(r->parent, r, rc);

	/*
	 * set (r)'s right child to be (rc)'s left child
	 */
	r->right = rc->left;

	/*
	 * change parent of rc's left child
	 */
	set_parent(rc->left, r);

	/*
	 * set new parent of rotated node
	 */
	r->parent = rc;

	/*
	 * set (rc)'s left child to be (r)
	 */
	rc->left = r;

	/*
	 * return the new root
	 */
	return rc;
}

/*
 * right rotation of node (r), (lc) is (r)'s left child
 */
static inline struct splay_node *
right_pivot(struct splay_node *r)
{
	struct splay_node *lc;

	/*
	 * set (lc) to be the new root
	 */
	lc = r->left;

	/*
	 * point parent to new left/right child
	 */
	lc->parent = r->parent;

	/*
	 * change child of the p->parent.
	 */
	change_child(r->parent, r, lc);

	/*
	 * set (r)'s left child to be (lc)'s right child
	 */
	r->left = lc->right;

	/*
	 * change parent of lc's right child
	 */
	set_parent(lc->right, r);

	/*
	 * set new parent of rotated node
	 */
	r->parent = lc;

	/*
	 * set (lc)'s right child to be (r)
	 */
	lc->right = r;

	/*
	 * return the new root
	 */
	return lc;
}

static struct splay_node *
top_down_splay(unsigned long vstart,
	struct splay_node *root, struct splay_root *sp_root)
{
	/*
	 * During the splitting process two temporary trees are formed.
	 * "l" contains all keys less than the search key/vstart and "r"
	 * contains all keys greater than the search key/vstart.
	 */
	struct splay_node head, *ltree_max, *rtree_max;
	struct splay_node *ltree_prev, *rtree_prev;

	if (root == NULL)
		return NULL;

	SP_INIT_NODE(&head);
	ltree_max = rtree_max = &head;
	ltree_prev = rtree_prev = NULL;

	while (1) {
		if (vstart < root->val && root->left) {
			if (vstart < root->left->val) {
				root = right_pivot(root);

				if (root->left == NULL)
					break;
			}

			/*
			 * Build right subtree.
			 */
			rtree_max->left = root;
			rtree_max->left->parent = rtree_prev;
			rtree_max = rtree_max->left;
			rtree_prev = root;
			root = root->left;
		} else if (vstart > root->val && root->right) {
			if (vstart > root->right->val) {
				root = left_pivot(root);

				if (root->right == NULL)
					break;
			}

			/*
			 * Build left subtree.
			 */
			ltree_max->right = root;
			ltree_max->right->parent = ltree_prev;
			ltree_max = ltree_max->right;
			ltree_prev = root;
			root = root->right;
		} else {
			break;
		}
	}

	/*
	 * Assemble the tree.
	 */
	ltree_max->right = root->left;
	rtree_max->left = root->right;
	root->left = head.right;
	root->right = head.left;

	set_parent(ltree_max->right, ltree_max);
	set_parent(rtree_max->left, rtree_max);
	set_parent(root->left, root);
	set_parent(root->right, root);
	root->parent = NULL;

	/*
	 * Set new root. Please note it might be the same.
	 */
	sp_root->sp_root = root;
	return sp_root->sp_root;
}

struct splay_node *
splay_search(unsigned long key, struct splay_root *root)
{
	struct splay_node *n;

	n = top_down_splay(key, root->sp_root, root);
	if (n && n->val == key)
		return n;

	return NULL;
}

static bool
splay_insert(struct splay_node *n, struct splay_root *sp_root)
{
	struct splay_node *r;

	SP_INIT_NODE(n);

	r = top_down_splay(n->val, sp_root->sp_root, sp_root);
	if (r == NULL) {
		/* First element in the tree */
		sp_root->sp_root = n;
		return false;
	}

	if (n->val < r->val) {
		n->left = r->left;
		n->right = r;

		set_parent(r->left, n);
		r->parent = n;
		r->left = NULL;
	} else if (n->val > r->val) {
		n->right = r->right;
		n->left = r;

		set_parent(r->right, n);
		r->parent = n;
		r->right = NULL;
	} else {
		/*
		 * Same, indicate as not success insertion.
		 */
		return false;
	}

	sp_root->sp_root = n;
	return true;
}

static bool
splay_delete_init(struct splay_node *n, struct splay_root *sp_root)
{
	struct splay_node *subtree[2];
	unsigned long val = n->val;

	/* 1. Splay the node to the root. */
	n = top_down_splay(n->val, sp_root->sp_root, sp_root);
	if (n == NULL || n->val != val)
		return false;

	/* 2. Save left/right sub-trees. */
	subtree[0] = n->left;
	subtree[1] = n->right;

	/* 3. Now remove the node. */
	SP_INIT_NODE(n);

	if (subtree[0]) {
		/* 4. Splay the largest node in left sub-tree to the root. */
		top_down_splay(val, subtree[0], sp_root);

		/* 5. Attach the right sub-tree as the right child of the left sub-tree. */
		sp_root->sp_root->right = subtree[1];

		/* 6. Update the parent of right sub-tree */
		set_parent(subtree[1], sp_root->sp_root);
	} else {
		/* 7. Left sub-tree is NULL, just point to right one. */
		sp_root->sp_root = subtree[1];
	}

	/* 8. Set parent of root node to NULL. */
	if (sp_root->sp_root)
		sp_root->sp_root->parent = NULL;

	return true;
}

static FILE *
open_perf_script_file(const char *path)
{
	FILE *f = NULL;

	if (path == NULL)
		goto out;

	f = fopen(path, "r");
	if (!f)
		goto out;

out:
	return f;
}

static int
get_one_line(FILE *file, char *buf, size_t len)
{
	int i = 0;

	memset(buf, '\0', len);

	for (i = 0; i < len - 1; i++) {
		int c = fgetc(file);

		if (c == EOF)
			return EOF;

		if (c == '\n')
			break;

		if (c != '\r')
			buf[i] = c;
	}

	return i;
}

static int
read_string_till_string(char *buf, char *out, size_t out_len, char *in, size_t in_len)
{
	int i, j;

	memset(out, '\0', out_len);

	for (i = 0; i < out_len; i++) {
		if (buf[i] != in[0]) {
			out[i] = buf[i];
			continue;
		}

		for (j = 0; j < in_len; j++) {
			if (buf[i + j] != in[j])
				break;
		}

		/* Found. */
		if (j == in_len)
			return 1;
	}

	return 0;
}

/*
 * find pattern is  "something [003] 8640.034785: something"
 */
static inline void
get_cpu_sec_usec_in_string(const char *s, int *cpu, int *sec, int *usec)
{
	char usec_buf[32] = {'\0'};
	char sec_buf[32] = {'\0'};
	char cpu_buf[32] = {'\0'};
	bool found_sec = false;
	bool found_usec = false;
	bool found_cpu = false;
	int i, j, dot;

	*cpu = *sec = *usec = -1;

	for (i = 0, j = 0; s[i] != '\0'; i++) {
		if (s[i] == '.') {
			dot = i++;

			/* take microseconds */
			for (j = 0; j < sizeof(usec_buf); j++) {
				if (isdigit(s[i])) {
					usec_buf[j] = s[i];
				} else {
					if (s[i] == ':' && j > 0)
						found_usec = true;
					else
						found_usec = false;

					/* Terminate here. */
					break;
				}

				i++;
			}

			if (found_usec) {
				/* roll back */
				while (s[i] != ' ' && i > 0)
					i--;

				/* take seconds */
				for (j = 0; j < sizeof(sec_buf); j++) {
					if (isdigit(s[++i])) {
						sec_buf[j] = s[i];
					} else {
						if (s[i] == '.' && j > 0)
							found_sec = true;
						else
							found_sec = false;
						
						/* Terminate here. */
						break;
					}
				}
			}

			if (found_sec && found_usec) {
				/* roll back */
				while (s[i] != '[' && i > 0)
					i--;

				/* take seconds */
				for (j = 0; j < sizeof(cpu_buf); j++) {
					if (isdigit(s[++i])) {
						cpu_buf[j] = s[i];
					} else {
						if (s[i] == ']' && j > 0)
							found_cpu = true;
						else
							found_cpu = false;
						
						/* Terminate here. */
						break;
					}
				}

				if (found_cpu && found_sec && found_usec) {
					*sec = atoi(sec_buf);
					*usec = atoi(usec_buf);
					*cpu = atoi(cpu_buf);
					return;
				}
			}

			/*
			 * Check next dot pattern.
			 */
			found_sec = false;
			found_usec = false;
			found_cpu = false;
			i = dot;
		}
	}
}

/*
 * find pattern is  "something comm=foo android thr1 pid=123 something"
 */
static inline int
get_comm_pid_in_string(const char *buf, char *comm, ssize_t len, int *pid)
{
	char *sc, *sp = NULL;	/* keep sp initialized in case "comm=" is absent */
	int rv, i;

	memset(comm, '\0', len);

	sc = strstr(buf, "comm=");
	if (sc)
		sp = strstr(sc, " pid=");

	if (!sc || !sp)
		return -1;

	for (i = 0, sc += 5; sc != sp; i++) {
		if (i < len) {
			if (*sc == ' ')
				comm[i] = '-';
			else
				comm[i] = *sc;

			sc++;
		}
	}

	/* Read pid. */
	rv = sscanf(sp, " pid=%d", pid);
	if (rv != 1)
		return -1;

	return 1;
}

static void
perf_script_softirq_delay(FILE *file, int delay_usec)
{
	char buf[4096] = { '\0' };
	char buf_1[4096] = { '\0' };
	long offset;
	char *s;
	int rv;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		offset = ftell(file);

		if (rv != EOF) {
			s = strstr(buf, "irq:softirq_raise:");
			if (s) {
				char extra[512] = {'\0'};
				int sec_0, usec_0;
				int sec_1, usec_1;
				int handle_vector;
				int rise_vector;
				int cpu_0;
				int cpu_1;

				/*
				 * swapper     0    [000]  6010.619854:  irq:softirq_raise: vec=7 [action=SCHED]
				 * android.bg  3052 [001]  6000.076212:  irq:softirq_entry: vec=9 [action=RCU]
				 */
				(void) sscanf(s, "%s vec=%d", extra, &rise_vector);
				get_cpu_sec_usec_in_string(buf, &cpu_0, &sec_0, &usec_0);

				while (1) {
					rv = get_one_line(file, buf_1, sizeof(buf_1));
					if (rv == EOF)
						break;

					s = strstr(buf_1, "irq:softirq_entry:");
					if (s) {
						(void) sscanf(s, "%s vec=%d", extra, &handle_vector);
						get_cpu_sec_usec_in_string(buf_1, &cpu_1, &sec_1, &usec_1);

						if (cpu_0 == cpu_1 && rise_vector == handle_vector) {
							int delta_time_usec = (sec_1 - sec_0) * 1000000 + (usec_1 - usec_0);

							if (delta_time_usec > delay_usec)
								fprintf(stdout, "{\n%s\n%s\n} diff %d usec\n", buf, buf_1, delta_time_usec);
							break;
						}
					}
				}
			}

			rv = fseek(file, offset, SEEK_SET);
			if (rv)
				fprintf(stdout, "fseek error !!!\n");
		} else {
			break;
		}
	}
}

static void
perf_script_softirq_duration(FILE *file, int duration_usec)
{
	char buf[4096] = { '\0' };
	char buf_1[4096] = { '\0' };
	long offset;
	char *s;
	int rv;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		offset = ftell(file);

		if (rv != EOF) {
			s = strstr(buf, "irq:softirq_entry:");
			if (s) {
				char extra[512] = {'\0'};
				int sec_0, usec_0;
				int sec_1, usec_1;
				int handle_vector;
				int rise_vector;
				int cpu_0;
				int cpu_1;

				/*
				 * swapper     0    [000]  6010.619854:  irq:softirq_entry: vec=7 [action=SCHED]
				 * android.bg  3052 [001]  6000.076212:  irq:softirq_exit: vec=9 [action=RCU]
				 */
				(void) sscanf(s, "%s vec=%d", extra, &rise_vector);
				get_cpu_sec_usec_in_string(buf, &cpu_0, &sec_0, &usec_0);

				while (1) {
					rv = get_one_line(file, buf_1, sizeof(buf_1));
					if (rv == EOF)
						break;

					s = strstr(buf_1, "irq:softirq_exit:");
					if (s) {
						(void) sscanf(s, "%s vec=%d", extra, &handle_vector);
						get_cpu_sec_usec_in_string(buf_1, &cpu_1, &sec_1, &usec_1);

						if (cpu_0 == cpu_1 && rise_vector == handle_vector) {
							int delta_time_usec = (sec_1 - sec_0) * 1000000 + (usec_1 - usec_0);

							if (delta_time_usec > duration_usec)
								fprintf(stdout, "{\n%s\n%s\n} diff %d usec\n", buf, buf_1, delta_time_usec);
							break;
						}
					}
				}
			}

			rv = fseek(file, offset, SEEK_SET);
			if (rv)
				fprintf(stdout, "fseek error !!!\n");
		} else {
			break;
		}
	}
}

static void
perf_script_hardirq_duration(FILE *file, int duration_msec)
{
	char buf[4096] = { '\0' };
	char buf_1[4096] = { '\0' };
	long offset;
	char *s;
	int rv;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		offset = ftell(file);

		if (rv != EOF) {
			s = strstr(buf, "irq:irq_handler_entry:");
			if (s) {
				char extra[512] = {'\0'};
				int sec_0, usec_0;
				int sec_1, usec_1;
				int handle_vector;
				int rise_vector;
				int cpu_0;
				int cpu_1;

				/*
				 * swapper     0 [002]  6205.804133: irq:irq_handler_entry: irq=11 name=arch_timer
				 * swapper     0 [002]  6205.804228:  irq:irq_handler_exit: irq=11 ret=handled
				 */
				(void) sscanf(s, "%s irq=%d", extra, &rise_vector);
				get_cpu_sec_usec_in_string(buf, &cpu_0, &sec_0, &usec_0);

				while (1) {
					rv = get_one_line(file, buf_1, sizeof(buf_1));
					if (rv == EOF)
						break;

					s = strstr(buf_1, "irq:irq_handler_exit:");
					if (s) {
						(void) sscanf(s, "%s irq=%d", extra, &handle_vector);
						get_cpu_sec_usec_in_string(buf_1, &cpu_1, &sec_1, &usec_1);

						if (cpu_0 == cpu_1 && rise_vector == handle_vector) {
							int delta_time_usec = (sec_1 - sec_0) * 1000000 + (usec_1 - usec_0);

							if (delta_time_usec > duration_msec)
								fprintf(stdout, "{\n%s\n%s\n} diff %d usec\n", buf, buf_1, delta_time_usec);
							break;
						}
					}
				}
			}

			rv = fseek(file, offset, SEEK_SET);
			if (rv)
				fprintf(stdout, "fseek error !!!\n");
		} else {
			break;
		}
	}
}

struct irq_stat {
	int irq;
	int count;
	char irq_name[512];

	int min_interval;
	int max_interval;
	int avg_interval;

	unsigned int time_stamp_usec;
	struct splay_node node;
};

static struct irq_stat *
new_irq_node_init(int irq, char *irq_name)
{
	struct irq_stat *n = calloc(1, sizeof(*n));

	if (n) {
		n->irq = irq;
		(void) strncpy(n->irq_name, irq_name, sizeof(n->irq_name));
		n->node.val = irq;
	}

	return n;
}

static void
perf_script_hardirq_stat(FILE *file)
{
	struct splay_root sproot = { NULL };
	struct irq_stat *node;
	char buf[4096] = { '\0' };
	char extra[256] = {'\0'};
	char irq_name[256] = {'\0'};
	unsigned int time_stamp_usec;
	int cpu, sec, usec;
	int rv, irq;
	char *s;

	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		if (rv == EOF)
			break;

		 s = strstr(buf, "irq:irq_handler_entry:");
		 if (s == NULL)
			 continue;

		 /*
		  * format is as follow one:
		  * sleep  1418 [003]  8780.957112:             irq:irq_handler_entry: irq=11 name=arch_timer
		  */
		 rv = sscanf(s, "%s irq=%d name=%s", extra, &irq, irq_name);
	 	 if (rv != 3)
	 		 continue;

		 get_cpu_sec_usec_in_string(buf, &cpu, &sec, &usec);
		 time_stamp_usec = (sec * 1000000) + usec;

		 if (sproot.sp_root == NULL) {
			 node = new_irq_node_init(irq, irq_name);
			 if (node)
				 splay_insert(&node->node, &sproot);
		 }

		 top_down_splay(irq, sproot.sp_root, &sproot);
		 node = container_of(sproot.sp_root, struct irq_stat, node);

		 /* Found the entry in the tree. */
		 if (node->irq == irq) {
			 if (node->time_stamp_usec) {
				 unsigned int delta = time_stamp_usec - node->time_stamp_usec;

				 if (delta < node->min_interval || !node->min_interval)
					 node->min_interval = delta;

				 if (delta > node->max_interval)
					 node->max_interval = delta;

				 node->avg_interval += delta;
			 }

			 /* Save the last time for this IRQ entry. */
			 node->time_stamp_usec = time_stamp_usec;
		 } else {
			 /* Allocate a new record and place it to the tree. */
			 node = new_irq_node_init(irq, irq_name);
			 if (node)
				 splay_insert(&node->node, &sproot);

		 }

		 /* Update the timestamp for this entry. */
		 node->time_stamp_usec = time_stamp_usec;
		 node->count++;
	}

	/* Dump the tree. */
	while (sproot.sp_root) {
		node = container_of(sproot.sp_root, struct irq_stat, node);

		fprintf(stdout, "irq: %5d name: %30s count: %7d, min: %10d, max: %10d, avg: %10d\n",
				node->irq, node->irq_name, node->count,
				node->min_interval, node->max_interval, node->avg_interval / node->count);

		splay_delete_init(&node->node, &sproot);
		free(node);
	}

	fprintf(stdout, "\tRun './a.out ./perf.script | sort -nk 6' to sort by column 6.\n");
}

struct sched_waking {
	unsigned int wakeup_nr;
	char comm[4096];
	int pid;

	int min_interval;
	int max_interval;
	int avg_interval;

	unsigned int time_stamp_usec;
	struct splay_node node;
};

static struct sched_waking *
new_sched_waking_node_init(int pid, char *comm)
{
	struct sched_waking *n = calloc(1, sizeof(*n));

	if (n) {
		n->pid = pid;
		(void) strncpy(n->comm, comm, sizeof(n->comm));
		n->node.val = pid;
	}

	return n;
}

static void
perf_script_sched_waking_stat(FILE *file, const char *script)
{
	struct splay_root sroot = { NULL };
	struct sched_waking *n;
	char buf[4096] = { '\0' };
	char comm[256] = {'\0'};
	unsigned int time_stamp_usec;
	unsigned int total_waking = 0;
	int cpu, sec, usec;
	int rv, pid;
	char *s;
	
	while (1) {
		rv = get_one_line(file, buf, sizeof(buf));
		if (rv == EOF)
			break;
		/*
		 * format is as follow one:
		 * foo[1] 7521 [002] 10.431216: sched:sched_waking: comm=tr pid=2 prio=120 target_cpu=006
		 */
		s = strstr(buf, "sched:sched_waking:");
		if (s == NULL)
			continue;

		rv = get_comm_pid_in_string(s, comm, sizeof(comm), &pid);
		if (rv < 0) {
			printf("ERROR: skip entry...\n");
			continue;
		}

		get_cpu_sec_usec_in_string(buf, &cpu, &sec, &usec);
		time_stamp_usec = (sec * 1000000) + usec;

		if (sroot.sp_root == NULL) {
			n = new_sched_waking_node_init(pid, comm);
			if (n)
				splay_insert(&n->node, &sroot);
		}

		top_down_splay(pid, sroot.sp_root, &sroot);
		n = container_of(sroot.sp_root, struct sched_waking, node);

		/* Found the entry in the tree. */
		if (n->pid == pid) {
			if (n->time_stamp_usec) {
				unsigned int delta = time_stamp_usec - n->time_stamp_usec;

				if (delta < n->min_interval || !n->min_interval)
					n->min_interval = delta;

				if (delta > n->max_interval)
					n->max_interval = delta;

				n->avg_interval += delta;
			}

			/* Save the last time for this wake-up entry. */
			n->time_stamp_usec = time_stamp_usec;
		} else {
			/* Allocate a new record and place it to the tree. */
			n = new_sched_waking_node_init(pid, comm);
			if (n)
				splay_insert(&n->node, &sroot);
		}

		/* Update the timestamp for this entry. */
		n->time_stamp_usec = time_stamp_usec;
		n->wakeup_nr++;
	}

	/* Dump the Splay-tree. */
	while (sroot.sp_root) {
		n = container_of(sroot.sp_root, struct sched_waking, node);
		fprintf(stdout, "name: %30s pid: %10d woken-up %5d\tinterval: min %5d\tmax %5d\tavg %5d\n",
			n->comm, n->pid, n->wakeup_nr,
			n->min_interval, n->max_interval, n->avg_interval / n->wakeup_nr);

		total_waking += n->wakeup_nr;
		splay_delete_init(&n->node, &sroot);
		free(n);
	}

	fprintf(stdout, "=== Total: %u ===\n", total_waking);
	fprintf(stdout, "\tRun './a.out ./%s | sort -nk 6' to sort by column 6.\n", script);
}

int main(int argc, char **argv)
{
	FILE *file;

	file = open_perf_script_file(argv[1]);
	if (file == NULL) {
		fprintf(stdout, "%s:%d failed: specify a perf script file\n", __func__, __LINE__);
		exit(-1);
	}

	/* perf_script_softirq_delay(file, 1000); */
	/* perf_script_softirq_duration(file, 500); */
	/* perf_script_hardirq_duration(file, 500); */
	/* perf_script_hardirq_stat(file); */
	perf_script_sched_waking_stat(file, argv[1]);

	return 0;
}

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-13 15:43                   ` Uladzislau Rezki
@ 2022-05-14 14:25                     ` Joel Fernandes
  2022-05-14 19:01                       ` Uladzislau Rezki
  2022-08-09  2:25                       ` Joel Fernandes
  0 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Steven Rostedt

On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > >
> > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > >
> > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > Never mind. I ported it to 5.10
> > > > > > >
> > > > > > > Oh, this is on mainline. Sorry about that. If you want, I have a tree here for
> > > > > > > 5.10, although that does not have the kfree changes; everything else is
> > > > > > > ditto.
> > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > >
> > > > > > No problem. The two kfree_rcu patches are not so important in this series.
> > > > > > So I have backported them into my 5.10 kernel, because the latest kernel
> > > > > > is not so easy to bring up and run on my device :)
> > > > >
> > > > > Actually, I was going to write here that apparently some tests are showing
> > > > > kfree_rcu()->call_rcu_lazy() causing a possible regression. So it is
> > > > > good to drop those for your initial testing!
> > > > >
> > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
> > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > One thing that I would like to improve in kfree_rcu() is better utilization
> > > > of the page slots.
> > > >
> > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > >
> > >
> > > Here we go with some data on our Android handset that runs a 5.10 kernel. The test
> > > case I have checked was a "static image" use case. The condition is: screen ON with
> > > all connectivity disabled.
> > >
> > > 1.
> > > The first data I took is how many wakeups the RCU subsystem causes during this test
> > > case when everything is pretty much idle. The duration is 360 seconds:
> > >
> > > <snip>
> > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > 
> > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > that we want something like this during testing. Appreciate it. Also,
> > if we could dump timer wakeup reasons/callbacks, that would also be awesome.
> > 
> Please find it in the attachment. I wrote it once upon a time and make use
> of it to parse "perf script" output, i.e. raw data. The file name is
> perf_script_parser.c, so just compile it.
> 
> How to use it:
> 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> 2. ./perf script -i ./perf.data > foo.script
> 3. ./perf_script_parser ./foo.script

Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
use "perf sched record" and "perf sched report --sort latency" to get wakeup
latencies.

> > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > share that with you on request as well. That is probably not in a
> > shape for mainline though (Makefile missing and such).
> > 
> Yep, please share!

Sure, check out my bcc repo from here:
https://github.com/joelagnel/bcc

Build this project, then cd into libbpf-tools and run make. This should produce a
static binary 'rcutop' which you can push to Android. You have to build for
ARM, which bcc should have instructions for. I have also included the rcutop
diff at the end of this email for reference.

> > > 2.
> > > Please find attached two power plots for the same test case. One is for regular
> > > use of call_rcu() and the second one is for "lazy" usage. There is a slight
> > > difference in power, ~2mA. Even though it is rather small, it is detectable and
> > > solid, which is also important, so it proves the concept. Please note it might be
> > > more power efficient on other arches and platforms, because of differences in HW
> > > design related to CPU C-states and the energy needed to enter/exit those deep
> > > power states.
> > 
> > Nice! I wonder if you still have other frequent callbacks on your
> > system that are getting queued during the tests. Could you dump the
> > rcu_callbacks trace event and see if you have any CBs frequently
> > called that the series did not address?
> >
> I see pretty much the following:
> <snip>
>     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
>     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
>     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
>     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
>     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
>     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
>     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> <snip>
> 
> so almost everything is batched.

Nice, glad to know this is happening even without the kfree_rcu() changes.

> > Also, one more thing I was curious about is - do you see savings when
> > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > being, not disturbing the BIG CPUs which are more power hungry may let
> > them go into a deeper idle state and save power (due to leakage
> > current and so forth).
> > 
> I did some experimenting with pinning the nocb threads to the little cluster.
> For idle use cases I did not see any power gain. For a heavy one I see that
> the "big" CPUs are also invoking callbacks and are busy with them quite often.
> Probably I should think of some use case where I can detect the power
> difference. If you have something, please let me know.

Yeah, probably screen off + audio playback might be a good one, because it
lightly loads the CPUs.

> > > So front-lazy-batching is something worth having, IMHO :)
> > 
> > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > it, we can also add your investigation data to the LPC slides
> > (with attribution to you).
> > 
> No problem; since we will give a talk at LPC, the more data we have, the
> more convincing we are :)

I forget, you did mention you are OK with presenting with us, right? It would
be great if you present your data when we come to the Android portion, if you
are OK with it. I'll start a common slide deck soon and share it so you,
Rushikesh, and I can add slides to it and present together.

thanks,

 - Joel

---8<-----------------------

From: Joel Fernandes <joelaf@google.com>
Subject: [PATCH] rcutop

Signed-off-by: Joel Fernandes <joelaf@google.com>
---
 libbpf-tools/Makefile     |   1 +
 libbpf-tools/rcutop.bpf.c |  56 ++++++++
 libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
 libbpf-tools/rcutop.h     |   8 ++
 4 files changed, 353 insertions(+)
 create mode 100644 libbpf-tools/rcutop.bpf.c
 create mode 100644 libbpf-tools/rcutop.c
 create mode 100644 libbpf-tools/rcutop.h

diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
index e60ec409..0d4cdff2 100644
--- a/libbpf-tools/Makefile
+++ b/libbpf-tools/Makefile
@@ -42,6 +42,7 @@ APPS = \
 	klockstat \
 	ksnoop \
 	llcstat \
+	rcutop \
 	mountsnoop \
 	numamove \
 	offcputime \
diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
new file mode 100644
index 00000000..8287bbe2
--- /dev/null
+++ b/libbpf-tools/rcutop.bpf.c
@@ -0,0 +1,56 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+/* Copyright (c) 2021 Hengqi Chen */
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_tracing.h>
+#include "rcutop.h"
+#include "maps.bpf.h"
+
+#define MAX_ENTRIES	10240
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, void *);
+	__type(value, int);
+} cbs_queued SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__uint(max_entries, MAX_ENTRIES);
+	__type(key, void *);
+	__type(value, int);
+} cbs_executed SEC(".maps");
+
+SEC("tracepoint/rcu/rcu_callback")
+int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
+{
+	void *key = ctx->func;
+	int *val = NULL;
+	static const int zero;
+
+	val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
+	if (val) {
+		__sync_fetch_and_add(val, 1);
+	}
+
+	return 0;
+}
+
+SEC("tracepoint/rcu/rcu_invoke_callback")
+int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
+{
+	void *key = ctx->func;
+	int *val;
+	int zero = 0;
+
+	val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
+	if (val) {
+		__sync_fetch_and_add(val, 1);
+	}
+
+	return 0;
+}
+
+char LICENSE[] SEC("license") = "Dual BSD/GPL";
diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
new file mode 100644
index 00000000..35795875
--- /dev/null
+++ b/libbpf-tools/rcutop.c
@@ -0,0 +1,288 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+
+/*
+ * rcutop
+ * Copyright (c) 2022 Joel Fernandes
+ *
+ * 05-May-2022   Joel Fernandes   Created this.
+ */
+#include <argp.h>
+#include <errno.h>
+#include <signal.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <bpf/libbpf.h>
+#include <bpf/bpf.h>
+#include "rcutop.h"
+#include "rcutop.skel.h"
+#include "btf_helpers.h"
+#include "trace_helpers.h"
+
+#define warn(...) fprintf(stderr, __VA_ARGS__)
+#define OUTPUT_ROWS_LIMIT 10240
+
+static volatile sig_atomic_t exiting = 0;
+
+static bool clear_screen = true;
+static int output_rows = 20;
+static int interval = 1;
+static int count = 99999999;
+static bool verbose = false;
+
+const char *argp_program_version = "rcutop 0.1";
+const char *argp_program_bug_address =
+"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
+const char argp_program_doc[] =
+"Show RCU callback queuing and execution stats.\n"
+"\n"
+"USAGE: rcutop [-h] [interval] [count]\n"
+"\n"
+"EXAMPLES:\n"
+"    rcutop            # rcu activity top, refresh every 1s\n"
+"    rcutop 5 10       # 5s summaries, 10 times\n";
+
+static const struct argp_option opts[] = {
+	{ "noclear", 'C', NULL, 0, "Don't clear the screen" },
+	{ "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
+	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
+	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
+	{},
+};
+
+static error_t parse_arg(int key, char *arg, struct argp_state *state)
+{
+	long rows;
+	static int pos_args;
+
+	switch (key) {
+		case 'C':
+			clear_screen = false;
+			break;
+		case 'v':
+			verbose = true;
+			break;
+		case 'h':
+			argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
+			break;
+		case 'r':
+			errno = 0;
+			rows = strtol(arg, NULL, 10);
+			if (errno || rows <= 0) {
+				warn("invalid rows: %s\n", arg);
+				argp_usage(state);
+			}
+			output_rows = rows;
+			if (output_rows > OUTPUT_ROWS_LIMIT)
+				output_rows = OUTPUT_ROWS_LIMIT;
+			break;
+		case ARGP_KEY_ARG:
+			errno = 0;
+			if (pos_args == 0) {
+				interval = strtol(arg, NULL, 10);
+				if (errno || interval <= 0) {
+					warn("invalid interval\n");
+					argp_usage(state);
+				}
+			} else if (pos_args == 1) {
+				count = strtol(arg, NULL, 10);
+				if (errno || count <= 0) {
+					warn("invalid count\n");
+					argp_usage(state);
+				}
+			} else {
+				warn("unrecognized positional argument: %s\n", arg);
+				argp_usage(state);
+			}
+			pos_args++;
+			break;
+		default:
+			return ARGP_ERR_UNKNOWN;
+	}
+	return 0;
+}
+
+static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
+{
+	if (level == LIBBPF_DEBUG && !verbose)
+		return 0;
+	return vfprintf(stderr, format, args);
+}
+
+static void sig_int(int signo)
+{
+	exiting = 1;
+}
+
+static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
+		struct rcutop_bpf *obj)
+{
+	void *key, **prev_key = NULL;
+	int n, err = 0;
+	int qfd = bpf_map__fd(obj->maps.cbs_queued);
+	int efd = bpf_map__fd(obj->maps.cbs_executed);
+	const struct ksym *ksym;
+	FILE *f;
+	time_t t;
+	struct tm *tm;
+	char ts[16], buf[256];
+
+	f = fopen("/proc/loadavg", "r");
+	if (f) {
+		time(&t);
+		tm = localtime(&t);
+		strftime(ts, sizeof(ts), "%H:%M:%S", tm);
+		memset(buf, 0, sizeof(buf));
+		n = fread(buf, 1, sizeof(buf), f);
+		if (n)
+			printf("%8s loadavg: %s\n", ts, buf);
+		fclose(f);
+	}
+
+	printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
+
+	while (1) {
+		int qcount = 0, ecount = 0;
+
+		err = bpf_map_get_next_key(qfd, prev_key, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
+			return err;
+		}
+
+		err = bpf_map_lookup_elem(qfd, &key, &qcount);
+		if (err) {
+			warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
+			return err;
+		}
+		prev_key = &key;
+
+		bpf_map_lookup_elem(efd, &key, &ecount);
+
+		ksym = ksyms__map_addr(ksyms, (unsigned long)key);
+		printf("%-32s %-6d %-6d\n",
+				ksym ? ksym->name : "Unknown",
+				qcount, ecount);
+	}
+	printf("\n");
+	prev_key = NULL;
+	while (1) {
+		err = bpf_map_get_next_key(qfd, prev_key, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				break;
+			}
+			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
+			return err;
+		}
+		err = bpf_map_delete_elem(qfd, &key);
+		if (err) {
+			if (errno == ENOENT) {
+				err = 0;
+				continue;
+			}
+			warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
+			return err;
+		}
+
+		bpf_map_delete_elem(efd, &key);
+		prev_key = &key;
+	}
+
+	return err;
+}
+
+int main(int argc, char **argv)
+{
+	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
+	static const struct argp argp = {
+		.options = opts,
+		.parser = parse_arg,
+		.doc = argp_program_doc,
+	};
+	struct rcutop_bpf *obj;
+	int err;
+	struct syms_cache *syms_cache = NULL;
+	struct ksyms *ksyms = NULL;
+
+	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
+	if (err)
+		return err;
+
+	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+	libbpf_set_print(libbpf_print_fn);
+
+	err = ensure_core_btf(&open_opts);
+	if (err) {
+		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
+		return 1;
+	}
+
+	obj = rcutop_bpf__open_opts(&open_opts);
+	if (!obj) {
+		warn("failed to open BPF object\n");
+		return 1;
+	}
+
+	err = rcutop_bpf__load(obj);
+	if (err) {
+		warn("failed to load BPF object: %d\n", err);
+		goto cleanup;
+	}
+
+	err = rcutop_bpf__attach(obj);
+	if (err) {
+		warn("failed to attach BPF programs: %d\n", err);
+		goto cleanup;
+	}
+
+	ksyms = ksyms__load();
+	if (!ksyms) {
+		fprintf(stderr, "failed to load kallsyms\n");
+		goto cleanup;
+	}
+
+	syms_cache = syms_cache__new(0);
+	if (!syms_cache) {
+		fprintf(stderr, "failed to create syms_cache\n");
+		goto cleanup;
+	}
+
+	if (signal(SIGINT, sig_int) == SIG_ERR) {
+		warn("can't set signal handler: %s\n", strerror(errno));
+		err = 1;
+		goto cleanup;
+	}
+
+	while (1) {
+		sleep(interval);
+
+		if (clear_screen) {
+			err = system("clear");
+			if (err)
+				goto cleanup;
+		}
+
+		err = print_stat(ksyms, syms_cache, obj);
+		if (err)
+			goto cleanup;
+
+		count--;
+		if (exiting || !count)
+			goto cleanup;
+	}
+
+cleanup:
+	rcutop_bpf__destroy(obj);
+	cleanup_core_btf(&open_opts);
+
+	return err != 0;
+}
diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
new file mode 100644
index 00000000..cb2a3557
--- /dev/null
+++ b/libbpf-tools/rcutop.h
@@ -0,0 +1,8 @@
+/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
+#ifndef __RCUTOP_H
+#define __RCUTOP_H
+
+#define PATH_MAX	4096
+#define TASK_COMM_LEN	16
+
+#endif /* __RCUTOP_H */
-- 
2.36.0.550.gb090851708-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-13 14:55     ` Uladzislau Rezki
@ 2022-05-14 14:33       ` Joel Fernandes
  2022-05-14 19:10         ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:33 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, rushikesh.s.kadam, neeraj.iitr10,
	frederic, rostedt

On Fri, May 13, 2022 at 04:55:34PM +0200, Uladzislau Rezki wrote:
> > On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > Again, given that kfree_rcu() is doing its own laziness, is this really
> > helping?  If so, would it instead make sense to adjust the kfree_rcu()
> > timeouts?
> > 
> IMHO, this patch does not help much. Like Paul has mentioned we use
> batching anyway.

I think that depends on the value of KFREE_DRAIN_JIFFIES. It is set to 20ms
in the code. The batching with call_rcu_lazy() is set to 10k jiffies, which is
longer: at least 10 seconds on a 1000HZ system. Before I added this patch, I
was seeing more frequent queue_rcu_work() calls which were starting grace
periods. I am not sure, though, how much power was saved by eliminating
queue_rcu_work(); I just wanted to make it go away.

Maybe, instead of this patch, can we make KFREE_DRAIN_JIFFIES a tunable or
boot parameter so systems can set it appropriately? Or we can increase the
default kfree_rcu() drain time considering that we do have a shrinker in case
reclaim needs to happen.
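
Something like the following is what I have in mind for the tunable (untested
sketch; the parameter name, default expression, and permissions are just
placeholders):

/*
 * Untested sketch: expose the kfree_rcu() drain delay as a boot/module
 * parameter instead of the compile-time KFREE_DRAIN_JIFFIES constant.
 */
static int rcu_kfree_drain_jiffies = HZ / 50;	/* current 20 ms default */
module_param(rcu_kfree_drain_jiffies, int, 0444);

/*
 * Call sites such as kfree_rcu_monitor() and kvfree_call_rcu() would then
 * pass rcu_kfree_drain_jiffies to schedule_delayed_work() instead of
 * KFREE_DRAIN_JIFFIES.
 */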

Thoughts?

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value
  2022-05-13 14:54   ` Uladzislau Rezki
@ 2022-05-14 14:34     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:34 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, rushikesh.s.kadam, neeraj.iitr10, frederic, paulmck, rostedt

On Fri, May 13, 2022 at 04:54:02PM +0200, Uladzislau Rezki wrote:
> On Thu, May 12, 2022 at 03:04:41AM +0000, Joel Fernandes (Google) wrote:
> > As per the comments in include/linux/shrinker.h, .count_objects callback
> > should return the number of freeable items, but if there are no objects
> > to free, SHRINK_EMPTY should be returned. The only time 0 is returned
> > should be when we are unable to determine the number of objects, or the
> > cache should be skipped for another reason.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/rcu/tree.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 3828ac3bf1c4..f191542cdf5e 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3637,7 +3637,7 @@ kfree_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> >  		atomic_set(&krcp->backoff_page_cache_fill, 1);
> >  	}
> >  
> > -	return count;
> > +	return count == 0 ? SHRINK_EMPTY : count;
> >  }
> >  
> >  static unsigned long
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 
> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>

thanks!

 - Joel

> 
> --
> Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-13 14:53   ` Uladzislau Rezki
@ 2022-05-14 14:35     ` Joel Fernandes
  2022-05-14 19:48       ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:35 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, rushikesh.s.kadam, neeraj.iitr10, frederic, paulmck, rostedt

On Fri, May 13, 2022 at 04:53:05PM +0200, Uladzislau Rezki wrote:
> > monitor_todo is not needed as the work struct already tracks if work is
> > pending. Just use that to know if work is pending using
> > delayed_work_pending() helper.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  kernel/rcu/tree.c | 22 +++++++---------------
> >  1 file changed, 7 insertions(+), 15 deletions(-)
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 3baf29014f86..3828ac3bf1c4 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
> >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
> >   * @lock: Synchronize access to this structure
> >   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> > - * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
> >   * @initialized: The @rcu_work fields have been initialized
> >   * @count: Number of objects for which GP not started
> >   * @bkvcache:
> > @@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
> >  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> >  	raw_spinlock_t lock;
> >  	struct delayed_work monitor_work;
> > -	bool monitor_todo;
> >  	bool initialized;
> >  	int count;
> >  
> > @@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
> >  	// of the channels that is still busy we should rearm the
> >  	// work to repeat an attempt. Because previous batches are
> >  	// still in progress.
> > -	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
> > -		krcp->monitor_todo = false;
> > -	else
> > +	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
> >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> >  
> >  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > @@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> >  
> >  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
> >  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> > -	    !krcp->monitor_todo) {
> > -		krcp->monitor_todo = true;
> > +	    !delayed_work_pending(&krcp->monitor_work))
> >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> > -	}
> >  
> >  unlock_return:
> >  	krc_this_cpu_unlock(krcp, flags);
> > @@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
> >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> >  
> >  		raw_spin_lock_irqsave(&krcp->lock, flags);
> > -		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
> > -				krcp->monitor_todo) {
> > -			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > -			continue;
> > +		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
> > +			if (delayed_work_pending(&krcp->monitor_work)) {
> > +				schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > +						KFREE_DRAIN_JIFFIES);
> > +			}
> >  		}
> > -		krcp->monitor_todo = true;
> > -		schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > -					 KFREE_DRAIN_JIFFIES);
> >  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
> >  	}
> >  }
> > -- 
> >
> Looks good to me at first glance, but let me have a closer look
> at it.

Thanks, I appreciate it.

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-13  0:16   ` Paul E. McKenney
@ 2022-05-14 14:38     ` Joel Fernandes
  2022-05-14 16:21       ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 05:16:22PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:42AM +0000, Joel Fernandes (Google) wrote:
> > Add sysctl knobs just for easier debugging/testing, to tune the maximum
> > batch size, maximum time to wait before flush, and turning off the
> > feature entirely.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> This is good, and might also be needed longer term.
> 
> One thought below.
> 
> 							Thanx, Paul
> 
> > ---
> >  include/linux/sched/sysctl.h |  4 ++++
> >  kernel/rcu/lazy.c            | 12 ++++++++++--
> >  kernel/sysctl.c              | 23 +++++++++++++++++++++++
> >  3 files changed, 37 insertions(+), 2 deletions(-)
> > 
> > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > index c19dd5a2c05c..55ffc61beed1 100644
> > --- a/include/linux/sched/sysctl.h
> > +++ b/include/linux/sched/sysctl.h
> > @@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
> >  
> >  extern unsigned int sysctl_sched_child_runs_first;
> >  
> > +extern unsigned int sysctl_rcu_lazy;
> > +extern unsigned int sysctl_rcu_lazy_batch;
> > +extern unsigned int sysctl_rcu_lazy_jiffies;
> > +
> >  enum sched_tunable_scaling {
> >  	SCHED_TUNABLESCALING_NONE,
> >  	SCHED_TUNABLESCALING_LOG,
> > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > index 55e406cfc528..0af9fb67c92b 100644
> > --- a/kernel/rcu/lazy.c
> > +++ b/kernel/rcu/lazy.c
> > @@ -12,6 +12,10 @@
> >  // How much to wait before flushing?
> >  #define MAX_LAZY_JIFFIES	10000
> >  
> > +unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
> > +unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
> > +unsigned int sysctl_rcu_lazy = 1;
> > +
> >  // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> >  // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> >  // later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > @@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> >  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> >  	struct rcu_lazy_pcp *rlp;
> >  
> > +	if (!sysctl_rcu_lazy) {
> 
> This is the place to check for early boot use.  Or, alternatively,
> initialize sysctl_rcu_lazy to zero and set it to one once boot is far
> enough along to allow all the pieces to work reasonably.

Sure, I was also thinking of perhaps making this a static branch. That way we
don't have to pay the cost of the branch after it is set up at boot.
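
Roughly something like this is what I was picturing (untested sketch; the key
and enable-hook names are made up):

#include <linux/jump_label.h>

static DEFINE_STATIC_KEY_FALSE(rcu_lazy_key);

void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
{
	/* Early boot or laziness disabled: fall back to plain call_rcu(). */
	if (!static_branch_likely(&rcu_lazy_key)) {
		call_rcu(head_rcu, func);
		return;
	}

	/* ... existing lockless enqueue onto the per-CPU lazy list ... */
}

/* Flip the key once boot is far enough along for laziness to be safe. */
void rcu_lazy_enable(void)
{
	static_branch_enable(&rcu_lazy_key);
}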

But I think, as you mentioned on IRC, if we were to promote this to a
non-DEBUG patch, we would need to handle conditions similar to Frederic's
work on toggling offloading.

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-05-13  0:07   ` Paul E. McKenney
@ 2022-05-14 14:40     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 05:07:17PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:33AM +0000, Joel Fernandes (Google) wrote:
> > This is required to prevent callbacks triggering RCU machinery too
> > quickly and too often, which adds more power to the system.
> > 
> > When testing, we found that these paths were invoked often when the
> > system is not doing anything (screen is ON but otherwise idle).
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Some of these callbacks do additional work.  I don't immediately see a
> problem, but careful and thorough testing is required.

Fully agreed. FWIW, I did choose several of them by looking at what the
callback did, but I agree additional review and testing is needed.

thanks,

 - Joel


> 
> 							Thanx, Paul
> 
> > ---
> >  fs/dcache.c     | 4 ++--
> >  fs/eventpoll.c  | 2 +-
> >  fs/file_table.c | 3 ++-
> >  fs/inode.c      | 2 +-
> >  4 files changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index c84269c6e8bf..517e02cde103 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
> >  	if (unlikely(dname_external(dentry))) {
> >  		struct external_name *p = external_name(dentry);
> >  		if (likely(atomic_dec_and_test(&p->u.count))) {
> > -			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
> > +			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
> >  			return;
> >  		}
> >  	}
> > @@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
> >  	if (dentry->d_flags & DCACHE_NORCU)
> >  		__d_free(&dentry->d_u.d_rcu);
> >  	else
> > -		call_rcu(&dentry->d_u.d_rcu, __d_free);
> > +		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
> >  }
> >  
> >  /*
> > diff --git a/fs/eventpoll.c b/fs/eventpoll.c
> > index e2daa940ebce..10a24cca2cff 100644
> > --- a/fs/eventpoll.c
> > +++ b/fs/eventpoll.c
> > @@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
> >  	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
> >  	 * use of the rbn field.
> >  	 */
> > -	call_rcu(&epi->rcu, epi_rcu_free);
> > +	call_rcu_lazy(&epi->rcu, epi_rcu_free);
> >  
> >  	percpu_counter_dec(&ep->user->epoll_watches);
> >  
> > diff --git a/fs/file_table.c b/fs/file_table.c
> > index 7d2e692b66a9..415815d3ef80 100644
> > --- a/fs/file_table.c
> > +++ b/fs/file_table.c
> > @@ -56,7 +56,8 @@ static inline void file_free(struct file *f)
> >  	security_file_free(f);
> >  	if (!(f->f_mode & FMODE_NOACCOUNT))
> >  		percpu_counter_dec(&nr_files);
> > -	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
> > +
> > +	call_rcu_lazy(&f->f_u.fu_rcuhead, file_free_rcu);
> >  }
> >  
> >  /*
> > diff --git a/fs/inode.c b/fs/inode.c
> > index 63324df6fa27..b288a5bef4c7 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
> >  			return;
> >  	}
> >  	inode->free_inode = ops->free_inode;
> > -	call_rcu(&inode->i_rcu, i_callback);
> > +	call_rcu_lazy(&inode->i_rcu, i_callback);
> >  }
> >  
> >  /**
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 04/14] cred: Move call_rcu() to call_rcu_lazy()
  2022-05-13  0:02   ` Paul E. McKenney
@ 2022-05-14 14:41     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 05:02:53PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:32AM +0000, Joel Fernandes (Google) wrote:
> > This is required to prevent callbacks triggering RCU machinery too
> > quickly and too often, which adds more power to the system.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> The put_cred_rcu() does some debugging that would be less effective
> with a 10-second delay.  Which is probably OK assuming that significant
> testing happens on CONFIG_RCU_LAZY=n kernels.

Good point, I'll add that to the commit message, thanks!

thanks,

 - Joel

> 							Thanx, Paul
> 
> > ---
> >  kernel/cred.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/cred.c b/kernel/cred.c
> > index 933155c96922..f4d69d7f2763 100644
> > --- a/kernel/cred.c
> > +++ b/kernel/cred.c
> > @@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
> >  	if (cred->non_rcu)
> >  		put_cred_rcu(&cred->rcu);
> >  	else
> > -		call_rcu(&cred->rcu, put_cred_rcu);
> > +		call_rcu_lazy(&cred->rcu, put_cred_rcu);
> >  }
> >  EXPORT_SYMBOL(__put_cred);
> >  
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work()
  2022-05-12 23:58   ` Paul E. McKenney
@ 2022-05-14 14:44     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 14:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Thu, May 12, 2022 at 04:58:18PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:30AM +0000, Joel Fernandes (Google) wrote:
> > This will be used in kfree_rcu() later to make it do call_rcu() lazily.
> 
> Given that kfree_rcu() does its own laziness in filling in the page
> of pointers, do we really need to add additional laziness?
> 
> Or are you measuring a significant benefit from doing this?

The benefit I was measuring was that, prior to this, I was seeing frequent
calls to queue_rcu_work(). Based on the latest test results, though, I am not
seeing a significant benefit, and Vlad also confirms it. So I might drop this
patch completely. For easier testing though, it would be nice to make
KFREE_DRAIN_JIFFIES tunable and mainline that change as well, since I believe
we can achieve a similar effect by just increasing that timeout.

thanks!

 - Joel

> 
> 							Thanx, Paul
> 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---
> >  include/linux/workqueue.h |  1 +
> >  kernel/workqueue.c        | 25 +++++++++++++++++++++++++
> >  2 files changed, 26 insertions(+)
> > 
> > diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
> > index 7fee9b6cfede..2678a6b5b3f3 100644
> > --- a/include/linux/workqueue.h
> > +++ b/include/linux/workqueue.h
> > @@ -444,6 +444,7 @@ extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
> >  extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
> >  			struct delayed_work *dwork, unsigned long delay);
> >  extern bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork);
> > +extern bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork);
> >  
> >  extern void flush_workqueue(struct workqueue_struct *wq);
> >  extern void drain_workqueue(struct workqueue_struct *wq);
> > diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> > index 33f1106b4f99..9444949cc148 100644
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -1796,6 +1796,31 @@ bool queue_rcu_work(struct workqueue_struct *wq, struct rcu_work *rwork)
> >  }
> >  EXPORT_SYMBOL(queue_rcu_work);
> >  
> > +/**
> > + * queue_rcu_work_lazy - queue work after a RCU grace period
> > + * @wq: workqueue to use
> > + * @rwork: work to queue
> > + *
> > + * Return: %false if @rwork was already pending, %true otherwise.  Note
> > + * that a full RCU grace period is guaranteed only after a %true return.
> > + * While @rwork is guaranteed to be executed after a %false return, the
> > + * execution may happen before a full RCU grace period has passed.
> > + */
> > +bool queue_rcu_work_lazy(struct workqueue_struct *wq, struct rcu_work *rwork)
> > +{
> > +	struct work_struct *work = &rwork->work;
> > +
> > +	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
> > +		rwork->wq = wq;
> > +		call_rcu_lazy(&rwork->rcu, rcu_work_rcufn);
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +EXPORT_SYMBOL(queue_rcu_work_lazy);
> > +
> > +
> >  /**
> >   * worker_enter_idle - enter idle state
> >   * @worker: worker which is entering idle state
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12 23:56   ` Paul E. McKenney
@ 2022-05-14 15:08     ` Joel Fernandes
  2022-05-14 16:34       ` Paul E. McKenney
  2022-06-01 14:24     ` Frederic Weisbecker
  1 sibling, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-14 15:08 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

Hi Paul,

Thanks for bearing with my slightly late reply, I used the "time blocking"
technique to work on RCU today! ;-)

On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > Implement timer-based RCU callback batching. The batch is flushed
> > whenever a certain amount of time has passed, or the batch on a
> > particular CPU grows too big. Also memory pressure can flush it.
> > 
> > Locking is avoided to reduce lock contention when queuing and dequeuing
> > happens on different CPUs of a per-cpu list, such as when shrinker
> > context is running on different CPU. Also not having to use locks keeps
> > the per-CPU structure size small.
> > 
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> It is very good to see this!  Inevitable comments and questions below.

Thanks for taking a look!

> > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > index bf8e341e75b4..c09715079829 100644
> > --- a/kernel/rcu/Kconfig
> > +++ b/kernel/rcu/Kconfig
> > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> >  	  Say N here if you hate read-side memory barriers.
> >  	  Take the default if you are unsure.
> >  
> > +config RCU_LAZY
> > +	bool "RCU callback lazy invocation functionality"
> > +	depends on RCU_NOCB_CPU
> > +	default y
> 
> This "default y" is OK for experimentation, but for mainline we should
> not be forcing this on unsuspecting people.  ;-)

Agreed. I'll default 'n' :)

> > +	help
> > +	  To save power, batch RCU callbacks and flush after delay, memory
> > +          pressure or callback list growing too big.
> > +
> >  endmenu # "RCU Subsystem"
> > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > index 0cfb009a99b9..8968b330d6e0 100644
> > --- a/kernel/rcu/Makefile
> > +++ b/kernel/rcu/Makefile
> > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> >  obj-$(CONFIG_TREE_RCU) += tree.o
> >  obj-$(CONFIG_TINY_RCU) += tiny.o
> >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > new file mode 100644
> > index 000000000000..55e406cfc528
> > --- /dev/null
> > +++ b/kernel/rcu/lazy.c
> > @@ -0,0 +1,145 @@
> > +/*
> > + * Lockless lazy-RCU implementation.
> > + */
> > +#include <linux/rcupdate.h>
> > +#include <linux/shrinker.h>
> > +#include <linux/workqueue.h>
> > +#include "rcu.h"
> > +
> > +// How much to batch before flushing?
> > +#define MAX_LAZY_BATCH		2048
> > +
> > +// How much to wait before flushing?
> > +#define MAX_LAZY_JIFFIES	10000
> 
> That is more than a minute on a HZ=10 system.  Are you sure that you
> did not mean "(10 * HZ)" or some such?

Yes, you are right. I need to change that to be constant regardless of HZ. I
will make the change as you suggest.
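
Something like the following is what I have in mind (illustration only):

	// 10 seconds regardless of HZ, rather than 10000 raw jiffies.
	#define MAX_LAZY_JIFFIES	(10 * HZ)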

> > +
> > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > +struct lazy_rcu_head {
> > +	struct llist_node llist_node;
> > +	void (*func)(struct callback_head *head);
> > +} __attribute__((aligned(sizeof(void *))));
> 
> This needs a build-time check that rcu_head and lazy_rcu_head are of
> the same size.  Maybe something like this in some appropriate context:
> 
> 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> 
> Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> you also mention this in the above comments.  ;-)

Cool, great minds think alike! ;-)

> > +
> > +struct rcu_lazy_pcp {
> > +	struct llist_head head;
> > +	struct delayed_work work;
> > +	atomic_t count;
> > +};
> > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > +
> > +// Lockless flush of CPU, can be called concurrently.
> > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > +{
> > +	struct llist_node *node = llist_del_all(&rlp->head);
> > +	struct lazy_rcu_head *cursor, *temp;
> > +
> > +	if (!node)
> > +		return;
> 
> At this point, the list is empty but the count is non-zero.  Can
> that cause a problem?  (For the existing callback lists, this would
> be OK.)
> 
> > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > +		debug_rcu_head_unqueue(rh);
> 
> Good to see this check!
> 
> > +		call_rcu(rh, rh->func);
> > +		atomic_dec(&rlp->count);
> > +	}
> > +}
> > +
> > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > +{
> > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > +	struct rcu_lazy_pcp *rlp;
> > +
> > +	preempt_disable();
> > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> 
> Whitespace issue, please fix.

Fixed, thanks.

> > +	preempt_enable();
> > +
> > +	if (debug_rcu_head_queue((void *)head)) {
> > +		// Probable double call_rcu(), just leak.
> > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > +				__func__, head);
> > +
> > +		// Mark as success and leave.
> > +		return;
> > +	}
> > +
> > +	// Queue to per-cpu llist
> > +	head->func = func;
> > +	llist_add(&head->llist_node, &rlp->head);
> 
> Suppose that there are a bunch of preemptions between the preempt_enable()
> above and this point, so that the current CPU's list has lots of
> callbacks, but zero ->count.  Can that cause a problem?
> 
> In the past, this sort of thing has been an issue for rcu_barrier()
> and friends.

Thanks, I think I dropped the ball on this. You have given me something to
think about. My first thought is that setting the count in advance of
populating the list should do the trick. I will look into it more.

> > +	// Flush queue if too big
> 
> You will also need to check for early boot use.
> 
> I -think- it suffices to simply skip the following "if" statement when
> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> kernels won't expire them until the softirq kthreads have been spawned.
> 
> Which is OK, as it just means that call_rcu_lazy() is a bit more
> lazy than expected that early.
> 
> Except that call_rcu() can be invoked even before rcu_init() has been
> invoked, which is therefore also before rcu_init_lazy() has been invoked.

In other words, you are concerned that too many lazy callbacks might be
pending before rcu_init() is called?

I am going through the kfree_rcu() threads/patches involving
RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
the scheduler is running causes a crash or warnings?

> I therefore suggest something like this at the very start of this function:
> 
> 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> 		call_rcu(head_rcu, func);
> 
> The goal is that people can replace call_rcu() with call_rcu_lazy()
> without having to worry about invocation during early boot.

Yes, this seems safer. I don't expect much power savings during system boot
process anyway ;-). I believe perhaps a static branch would work better to
take a branch out from what is likely a fast path.
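
For illustration, the early-boot guard I have in mind would be roughly this
(sketch only; note the added return so the callback is not also queued on the
lazy list):

	// At the very top of call_rcu_lazy():
	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE) {
		// Too early for the per-CPU lazy machinery and its timer;
		// hand the callback straight to call_rcu() and be done.
		call_rcu(head_rcu, func);
		return;
	}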

> Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> though "rhp" is more consistent with the RCU pointer initials approach.

Fixed, thanks.

> > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > +		lazy_rcu_flush_cpu(rlp);
> > +	} else {
> > +		if (!delayed_work_pending(&rlp->work)) {
> 
> This check is racy because the work might run to completion right at
> this point.  Wouldn't it be better to rely on the internal check of
> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?

Oops, agreed. Will make it as you suggest.
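
That is, just the following, and let the WORK_STRUCT_PENDING_BIT test inside
queue_delayed_work_on() deal with the race (sketch of what I plan to do):

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		// No-op if the work is already pending; the internal
		// test_and_set_bit() of WORK_STRUCT_PENDING_BIT closes the
		// window that the explicit check left open.
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);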

thanks,

 - Joel

> > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > +		}
> > +	}
> > +}
> > +
> > +static unsigned long
> > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > +{
> > +	unsigned long count = 0;
> > +	int cpu;
> > +
> > +	/* Snapshot count of all CPUs */
> > +	for_each_possible_cpu(cpu) {
> > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > +
> > +		count += atomic_read(&rlp->count);
> > +	}
> > +
> > +	return count;
> > +}
> > +
> > +static unsigned long
> > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > +{
> > +	int cpu, freed = 0;
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > +		unsigned long count;
> > +
> > +		count = atomic_read(&rlp->count);
> > +		lazy_rcu_flush_cpu(rlp);
> > +		sc->nr_to_scan -= count;
> > +		freed += count;
> > +		if (sc->nr_to_scan <= 0)
> > +			break;
> > +	}
> > +
> > +	return freed == 0 ? SHRINK_STOP : freed;
> 
> This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> potential deadlocks.  But this pattern is common, including on the
> kvfree_rcu() path, so OK!  ;-)
> 
> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> that as well.
> 
> > +}
> > +
> > +/*
> > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > + */
> > +static void lazy_work(struct work_struct *work)
> > +{
> > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > +
> > +	lazy_rcu_flush_cpu(rlp);
> > +}
> > +
> > +static struct shrinker lazy_rcu_shrinker = {
> > +	.count_objects = lazy_rcu_shrink_count,
> > +	.scan_objects = lazy_rcu_shrink_scan,
> > +	.batch = 0,
> > +	.seeks = DEFAULT_SEEKS,
> > +};
> > +
> > +void __init rcu_lazy_init(void)
> > +{
> > +	int cpu;
> > +
> > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > +
> > +	for_each_possible_cpu(cpu) {
> > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > +	}
> > +
> > +	if (register_shrinker(&lazy_rcu_shrinker))
> > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > +}
> > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > index 24b5f2c2de87..a5f4b44f395f 100644
> > --- a/kernel/rcu/rcu.h
> > +++ b/kernel/rcu/rcu.h
> > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> >  #endif
> >  
> > +#ifdef CONFIG_RCU_LAZY
> > +void rcu_lazy_init(void);
> > +#else
> > +static inline void rcu_lazy_init(void) {}
> > +#endif
> >  #endif /* __LINUX_RCU_H */
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index a4c25a6283b0..ebdf6f7c9023 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> >  	else
> >  		qovld_calc = qovld;
> > +
> > +	rcu_lazy_init();
> >  }
> >  
> >  #include "tree_stall.h"
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime
  2022-05-14 14:38     ` Joel Fernandes
@ 2022-05-14 16:21       ` Paul E. McKenney
  0 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-14 16:21 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Sat, May 14, 2022 at 02:38:53PM +0000, Joel Fernandes wrote:
> On Thu, May 12, 2022 at 05:16:22PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:42AM +0000, Joel Fernandes (Google) wrote:
> > > Add sysctl knobs just for easier debugging/testing, to tune the maximum
> > > batch size, maximum time to wait before flush, and turning off the
> > > feature entirely.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > This is good, and might also be needed longer term.
> > 
> > One thought below.
> > 
> > 							Thanx, Paul
> > 
> > > ---
> > >  include/linux/sched/sysctl.h |  4 ++++
> > >  kernel/rcu/lazy.c            | 12 ++++++++++--
> > >  kernel/sysctl.c              | 23 +++++++++++++++++++++++
> > >  3 files changed, 37 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
> > > index c19dd5a2c05c..55ffc61beed1 100644
> > > --- a/include/linux/sched/sysctl.h
> > > +++ b/include/linux/sched/sysctl.h
> > > @@ -16,6 +16,10 @@ enum { sysctl_hung_task_timeout_secs = 0 };
> > >  
> > >  extern unsigned int sysctl_sched_child_runs_first;
> > >  
> > > +extern unsigned int sysctl_rcu_lazy;
> > > +extern unsigned int sysctl_rcu_lazy_batch;
> > > +extern unsigned int sysctl_rcu_lazy_jiffies;
> > > +
> > >  enum sched_tunable_scaling {
> > >  	SCHED_TUNABLESCALING_NONE,
> > >  	SCHED_TUNABLESCALING_LOG,
> > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > index 55e406cfc528..0af9fb67c92b 100644
> > > --- a/kernel/rcu/lazy.c
> > > +++ b/kernel/rcu/lazy.c
> > > @@ -12,6 +12,10 @@
> > >  // How much to wait before flushing?
> > >  #define MAX_LAZY_JIFFIES	10000
> > >  
> > > +unsigned int sysctl_rcu_lazy_batch = MAX_LAZY_BATCH;
> > > +unsigned int sysctl_rcu_lazy_jiffies = MAX_LAZY_JIFFIES;
> > > +unsigned int sysctl_rcu_lazy = 1;
> > > +
> > >  // We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > >  // allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > >  // later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > @@ -49,6 +53,10 @@ void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > >  	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > >  	struct rcu_lazy_pcp *rlp;
> > >  
> > > +	if (!sysctl_rcu_lazy) {
> > 
> > This is the place to check for early boot use.  Or, alternatively,
> > initialize sysctl_rcu_lazy to zero and set it to one once boot is far
> > enough along to allow all the pieces to work reasonably.
> 
> Sure, I was also thinking perhaps to set this to a static branch. That way we
> don't have to pay the cost of the branch after it is setup on boot.

A static branch would be fine, though it would be good to demonstrate
whether or not it actually provided visible benefits.  Also, please see
below.
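
For instance, a minimal sketch (completely untested, and the key's name is
invented here) of what a static branch might look like in this role:

	// Default off; flipped from the sysctl handler (or once boot is far
	// enough along) instead of testing a plain integer on every call.
	DEFINE_STATIC_KEY_FALSE(rcu_lazy_enabled);

	void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
	{
		if (!static_branch_likely(&rcu_lazy_enabled)) {
			call_rcu(head_rcu, func);
			return;
		}
		/* ... lazy path as in the patch ... */
	}

	// From the sysctl write path:
	//	static_branch_enable(&rcu_lazy_enabled);
	//	static_branch_disable(&rcu_lazy_enabled);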

> But I think as you mentioned on IRC, if we were to promote this to a
> non-DEBUG patch, we would need to handle conditions similar to Frederick's
> work on toggling offloading.

There are several ways to approach this:

o	Make call_rcu_lazy() also work for non-offloaded CPUs.
	It would clearly need to be default-disabled for this case,
	and careful thought would be required on the distro situation:
	some distros build with CONFIG_NO_HZ_FULL=y, but their users
	almost all refrain from providing a nohz_full boot parameter.

	In short, adding significant risk to the common case for the
	benefit of an uncommon case is not usually a particularly
	good idea.  ;-)

	And yes, proper segmentation of the cases is important.  As in
	the proper question here is "What happens to users not on ChromeOS
	or Android?"  So the fact that the common case probably is
	Android is not relevant to this line of thought.

o	Make call_rcu_lazy() work on a per-CPU basis, so that a lazy
	CPU cannot be de-offloaded.

o	Make call_rcu_lazy() work on a per-CPU basis, but make flushing
	the lazy queue part of the de-offloading process.
	Careful with the static branch in this case, since different
	CPUs might have different reactions to call_rcu_lazy().

o	Others' ideas here!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-14 15:08     ` Joel Fernandes
@ 2022-05-14 16:34       ` Paul E. McKenney
  2022-05-27 23:12         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-14 16:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote:
> Hi Paul,
> 
> Thanks for bearing with my slightly late reply, I used the "time blocking"
> technique to work on RCU today! ;-)
> 
> On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > Implement timer-based RCU callback batching. The batch is flushed
> > > whenever a certain amount of time has passed, or the batch on a
> > > particular CPU grows too big. Also memory pressure can flush it.
> > > 
> > > Locking is avoided to reduce lock contention when queuing and dequeuing
> > > happens on different CPUs of a per-cpu list, such as when shrinker
> > > context is running on different CPU. Also not having to use locks keeps
> > > the per-CPU structure size small.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > 
> > It is very good to see this!  Inevitable comments and questions below.
> 
> Thanks for taking a look!
> 
> > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > index bf8e341e75b4..c09715079829 100644
> > > --- a/kernel/rcu/Kconfig
> > > +++ b/kernel/rcu/Kconfig
> > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> > >  	  Say N here if you hate read-side memory barriers.
> > >  	  Take the default if you are unsure.
> > >  
> > > +config RCU_LAZY
> > > +	bool "RCU callback lazy invocation functionality"
> > > +	depends on RCU_NOCB_CPU
> > > +	default y
> > 
> > This "default y" is OK for experimentation, but for mainline we should
> > not be forcing this on unsuspecting people.  ;-)
> 
> Agreed. I'll default 'n' :)
> 
> > > +	help
> > > +	  To save power, batch RCU callbacks and flush after delay, memory
> > > +          pressure or callback list growing too big.
> > > +
> > >  endmenu # "RCU Subsystem"
> > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > > index 0cfb009a99b9..8968b330d6e0 100644
> > > --- a/kernel/rcu/Makefile
> > > +++ b/kernel/rcu/Makefile
> > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> > >  obj-$(CONFIG_TREE_RCU) += tree.o
> > >  obj-$(CONFIG_TINY_RCU) += tiny.o
> > >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > new file mode 100644
> > > index 000000000000..55e406cfc528
> > > --- /dev/null
> > > +++ b/kernel/rcu/lazy.c
> > > @@ -0,0 +1,145 @@
> > > +/*
> > > + * Lockless lazy-RCU implementation.
> > > + */
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/shrinker.h>
> > > +#include <linux/workqueue.h>
> > > +#include "rcu.h"
> > > +
> > > +// How much to batch before flushing?
> > > +#define MAX_LAZY_BATCH		2048
> > > +
> > > +// How much to wait before flushing?
> > > +#define MAX_LAZY_JIFFIES	10000
> > 
> > That is more than a minute on a HZ=10 system.  Are you sure that you
> > did not mean "(10 * HZ)" or some such?
> 
> Yes, you are right. I need to change that to be constant regardless of HZ. I
> will make the change as you suggest.
> 
> > > +
> > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > +struct lazy_rcu_head {
> > > +	struct llist_node llist_node;
> > > +	void (*func)(struct callback_head *head);
> > > +} __attribute__((aligned(sizeof(void *))));
> > 
> > This needs a build-time check that rcu_head and lazy_rcu_head are of
> > the same size.  Maybe something like this in some appropriate context:
> > 
> > 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> > 
> > Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> > you also mention this in the above comments.  ;-)
> 
> Cool, great minds think alike! ;-)
> 
> > > +
> > > +struct rcu_lazy_pcp {
> > > +	struct llist_head head;
> > > +	struct delayed_work work;
> > > +	atomic_t count;
> > > +};
> > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > > +
> > > +// Lockless flush of CPU, can be called concurrently.
> > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > > +{
> > > +	struct llist_node *node = llist_del_all(&rlp->head);
> > > +	struct lazy_rcu_head *cursor, *temp;
> > > +
> > > +	if (!node)
> > > +		return;
> > 
> > At this point, the list is empty but the count is non-zero.  Can
> > that cause a problem?  (For the existing callback lists, this would
> > be OK.)
> > 
> > > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > > +		debug_rcu_head_unqueue(rh);
> > 
> > Good to see this check!
> > 
> > > +		call_rcu(rh, rh->func);
> > > +		atomic_dec(&rlp->count);
> > > +	}
> > > +}
> > > +
> > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > > +{
> > > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > > +	struct rcu_lazy_pcp *rlp;
> > > +
> > > +	preempt_disable();
> > > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > 
> > Whitespace issue, please fix.
> 
> Fixed, thanks.
> 
> > > +	preempt_enable();
> > > +
> > > +	if (debug_rcu_head_queue((void *)head)) {
> > > +		// Probable double call_rcu(), just leak.
> > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > +				__func__, head);
> > > +
> > > +		// Mark as success and leave.
> > > +		return;
> > > +	}
> > > +
> > > +	// Queue to per-cpu llist
> > > +	head->func = func;
> > > +	llist_add(&head->llist_node, &rlp->head);
> > 
> > Suppose that there are a bunch of preemptions between the preempt_enable()
> > above and this point, so that the current CPU's list has lots of
> > callbacks, but zero ->count.  Can that cause a problem?
> > 
> > In the past, this sort of thing has been an issue for rcu_barrier()
> > and friends.
> 
> Thanks, I think I dropped the ball on this. You have given me something to
> think about. My first thought is that setting the count in advance of
> populating the list should do the trick. I will look into it more.

That can work, but don't forget the need for proper memory ordering.

Another approach would be to do what the current bypass list does, namely
count the callback in the ->cblist structure.
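
For the first approach, perhaps something like this (hand-waving sketch,
untested, and the ordering argument would need to be written down and checked
in the real patch):

	int count;

	// Bump the count before publishing the node, the aim being that
	// ->count never under-estimates what is on the list.
	// atomic_inc_return() is fully ordered, and so is the llist
	// cmpxchg, but please spell that out in a comment.
	count = atomic_inc_return(&rlp->count);
	llist_add(&head->llist_node, &rlp->head);

	if (count >= MAX_LAZY_BATCH)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);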

> > > +	// Flush queue if too big
> > 
> > You will also need to check for early boot use.
> > 
> > I -think- it suffices to simply skip the following "if" statement when
> > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
> > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> > kernels won't expire them until the softirq kthreads have been spawned.
> > 
> > Which is OK, as it just means that call_rcu_lazy() is a bit more
> > lazy than expected that early.
> > 
> > Except that call_rcu() can be invoked even before rcu_init() has been
> > invoked, which is therefore also before rcu_init_lazy() has been invoked.
> 
> In other words, you are concerned that too many lazy callbacks might be
> pending before rcu_init() is called?

In other words, I am concerned that bad things might happen if fields
in a rcu_lazy_pcp structure are used before they are initialized.

I am not worried about too many lazy callbacks before rcu_init() because
the non-lazy callbacks (which these currently are) are allowed to pile
up until RCU's grace-period kthreads have been spawned.  There might
come a time when RCU callbacks need to be invoked earlier, but that will
be a separate problem to solve when and if, but with the benefit of the
additional information derived from seeing the problem actually happen.

> I am going through the kfree_rcu() threads/patches involving
> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> the scheduler is running causes a crash or warnings?

There are a lot of different issues that arise in different phases
of boot.  In this particular case, my primary concern is the
use-before-initialized bug.

> > I therefore suggest something like this at the very start of this function:
> > 
> > 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > 		call_rcu(head_rcu, func);
> > 
> > The goal is that people can replace call_rcu() with call_rcu_lazy()
> > without having to worry about invocation during early boot.
> 
> Yes, this seems safer. I don't expect much power savings during system boot
> process anyway ;-). I believe perhaps a static branch would work better to
> take a branch out from what is likely a fast path.

A static branch would be fine.  Maybe encapsulate it in a static inline
function for all such comparisons, but most such comparisons are far from
anything resembling a fastpath, so the main potential benefit would be
added readability.  Which could be a compelling reason in and of itself.
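
Something like this is the sort of helper I have in mind (name invented here):

	// One place to answer "is it too early in boot for laziness?",
	// whether that ends up being a plain comparison or a static branch.
	static inline bool rcu_lazy_too_early(void)
	{
		return rcu_scheduler_active == RCU_SCHEDULER_INACTIVE;
	}

Each caller then reads as a question rather than as a comparison against an
enum value, and the implementation can be changed in one place later.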

							Thanx, Paul

> > Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> > though "rhp" is more consistent with the RCU pointer initials approach.
> 
> Fixed, thanks.
> 
> > > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > > +		lazy_rcu_flush_cpu(rlp);
> > > +	} else {
> > > +		if (!delayed_work_pending(&rlp->work)) {
> > 
> > This check is racy because the work might run to completion right at
> > this point.  Wouldn't it be better to rely on the internal check of
> > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> 
> Oops, agreed. Will make it as you suggest.
> 
> thanks,
> 
>  - Joel
> 
> > > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > > +		}
> > > +	}
> > > +}
> > > +
> > > +static unsigned long
> > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > +{
> > > +	unsigned long count = 0;
> > > +	int cpu;
> > > +
> > > +	/* Snapshot count of all CPUs */
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > +
> > > +		count += atomic_read(&rlp->count);
> > > +	}
> > > +
> > > +	return count;
> > > +}
> > > +
> > > +static unsigned long
> > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > +{
> > > +	int cpu, freed = 0;
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > +		unsigned long count;
> > > +
> > > +		count = atomic_read(&rlp->count);
> > > +		lazy_rcu_flush_cpu(rlp);
> > > +		sc->nr_to_scan -= count;
> > > +		freed += count;
> > > +		if (sc->nr_to_scan <= 0)
> > > +			break;
> > > +	}
> > > +
> > > +	return freed == 0 ? SHRINK_STOP : freed;
> > 
> > This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> > potential deadlocks.  But this pattern is common, including on the
> > kvfree_rcu() path, so OK!  ;-)
> > 
> > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> > that as well.
> > 
> > > +}
> > > +
> > > +/*
> > > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > > + */
> > > +static void lazy_work(struct work_struct *work)
> > > +{
> > > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > > +
> > > +	lazy_rcu_flush_cpu(rlp);
> > > +}
> > > +
> > > +static struct shrinker lazy_rcu_shrinker = {
> > > +	.count_objects = lazy_rcu_shrink_count,
> > > +	.scan_objects = lazy_rcu_shrink_scan,
> > > +	.batch = 0,
> > > +	.seeks = DEFAULT_SEEKS,
> > > +};
> > > +
> > > +void __init rcu_lazy_init(void)
> > > +{
> > > +	int cpu;
> > > +
> > > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > > +
> > > +	for_each_possible_cpu(cpu) {
> > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > > +	}
> > > +
> > > +	if (register_shrinker(&lazy_rcu_shrinker))
> > > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > > +}
> > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > index 24b5f2c2de87..a5f4b44f395f 100644
> > > --- a/kernel/rcu/rcu.h
> > > +++ b/kernel/rcu/rcu.h
> > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> > >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> > >  #endif
> > >  
> > > +#ifdef CONFIG_RCU_LAZY
> > > +void rcu_lazy_init(void);
> > > +#else
> > > +static inline void rcu_lazy_init(void) {}
> > > +#endif
> > >  #endif /* __LINUX_RCU_H */
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index a4c25a6283b0..ebdf6f7c9023 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> > >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > >  	else
> > >  		qovld_calc = qovld;
> > > +
> > > +	rcu_lazy_init();
> > >  }
> > >  
> > >  #include "tree_stall.h"
> > > -- 
> > > 2.36.0.550.gb090851708-goog
> > > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-14 14:25                     ` Joel Fernandes
@ 2022-05-14 19:01                       ` Uladzislau Rezki
  2022-08-09  2:25                       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-14 19:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Paul E. McKenney, rcu, Rushikesh S Kadam,
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I port it into 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > > > ditto.
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. kfree_rcu two patches are not so important in this series.
> > > > > > > So i have backported them into my 5.10 kernel because the latest kernel
> > > > > > > is not so easy to up and run on my device :)
> > > > > >
> > > > > > Actually I was going to write here, apparently some tests are showing
> > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > > > good to drop those for your initial testing!
> > > > > >
> > > > > Yep, i dropped both. The one that make use of call_rcu_lazy() seems not
> > > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > > One thing that i would like to improve in kfree_rcu() is a better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test
> > > > case i have checked was a "static image" use case. Condition is: screen ON with
> > > > disabled all connectivity.
> > > >
> > > > 1.
> > > > First data i took is how many wakeups cause an RCU subsystem during this test case
> > > > when everything is pretty idling. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > > 
> > > Nice! Do you mind sharing this script? I was just talking to Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we dump timer wakeup reasons/callbacks that would also be awesome.
> > > 
> > Please find it in the attachment. I wrote it once upon a time and make use of it
> > to parse "perf script" output, i.e. raw data. The file name is: perf_script_parser.c
> > so just compile it.
> > 
> > How to use it:
> > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
> 
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
> 
Good. "perf sched report --sort latency" it must be related to wakeup delays?

> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in a
> > > shape for mainline though (Makefile missing and such).
> > > 
> > Yep, please share!
> 
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc
> 
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM which bcc should have instructions for. I have also included the rcutop
> diff at the end of this file for reference.
> 
Cool. Will try it later :)

> > > > 2.
> > > > Please find in attachment two power plots. The same test case. One is related to a
> > > > regular use of call_rcu() and second one is "lazy" usage. There is light a difference
> > > > in power, it is ~2mA. Event though it is rather small but it is detectable and solid
> > > > what is also important, thus it proofs the concept. Please note it might be more power
> > > > efficient for other arches and platforms. Because of different HW design that is related
> > > > to C-states of CPU and energy that is needed to in/out of those deep power states.
> > > 
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much like this:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> > 
> > so almost everything is batched.
> 
> Nice, glad to know this is happening even without the kfree_rcu() changes.
> 
kvfree_rcu() does the job quite well, even though, as I mentioned, I would like
to send a patch about better page slot utilization. But, yes, since there is a
batching mechanism already, call_rcu_lazy() does not give any additional effect.


> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > > 
> > I did some experimenting to pin nocbs to a little cluster. For idle use
> > cases i did not see any power gain. For heavy one i see that "big" CPUs
> > are also invoking and busy with it quite often. Probably i should think
> > of some use case where i can detect the power difference. If you have
> > something please let me know.
> 
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
> 
Hm.. I will see and might check it, just in case! I have implemented a simple
call_rcu() static workload generator in order to examine this in a more
controlled environment. So I would like to see how it behaves under that static
workload generator.
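
The generator is basically a tiny module along these lines (a stripped-down
sketch of the idea, not the exact code I run; the knobs and names here are
made up):

	#include <linux/module.h>
	#include <linux/slab.h>
	#include <linux/rcupdate.h>
	#include <linux/workqueue.h>
	#include <linux/jiffies.h>

	struct test_obj {
		struct rcu_head rh;
	};

	static struct delayed_work gen_work;
	static int objs_per_tick = 100;	/* queueing-rate knobs */
	static int tick_ms = 100;

	static void test_obj_free(struct rcu_head *rh)
	{
		kfree(container_of(rh, struct test_obj, rh));
	}

	static void gen_work_fn(struct work_struct *work)
	{
		int i;

		/* Queue a fixed number of callbacks every tick. */
		for (i = 0; i < objs_per_tick; i++) {
			struct test_obj *obj = kmalloc(sizeof(*obj), GFP_KERNEL);

			if (!obj)
				break;
			call_rcu(&obj->rh, test_obj_free);
		}
		schedule_delayed_work(&gen_work, msecs_to_jiffies(tick_ms));
	}

	static int __init rcu_gen_init(void)
	{
		INIT_DELAYED_WORK(&gen_work, gen_work_fn);
		schedule_delayed_work(&gen_work, msecs_to_jiffies(tick_ms));
		return 0;
	}

	static void __exit rcu_gen_exit(void)
	{
		cancel_delayed_work_sync(&gen_work);
		rcu_barrier();
	}

	module_init(rcu_gen_init);
	module_exit(rcu_gen_exit);
	MODULE_LICENSE("GPL");

Swapping call_rcu() for call_rcu_lazy() in gen_work_fn() then gives a direct
A/B comparison at a fixed queueing rate.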

> > > > So a front-lazy-batching is something worth to have, IMHO :)
> > > 
> > > Exciting! Being lazy pays off some times ;-) ;-). If you are Ok with
> > > it, we can add your data to the LPC slides as well about your
> > > investigation (with attribution to you).
> > > 
> > No problem; since we will give a talk at LPC, the more data we have, the more
> > convincing we are :)
> 
> I forget, you did mention you are Ok with presenting with us right? It would
> be great if you present your data when we come to Android, if you are OK with
> it. I'll start a common slide deck soon and share so you, Rushikesh and me
> can add slides to it and present together.
> 
I am OK with presenting together with you. I can prepare some data related to the
examination on our Android big.LITTLE platform that we use for our customers.

> thanks,
> 
>  - Joel
> 
> ---8<-----------------------
> 
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] rcutop
> 
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  libbpf-tools/Makefile     |   1 +
>  libbpf-tools/rcutop.bpf.c |  56 ++++++++
>  libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
>  libbpf-tools/rcutop.h     |   8 ++
>  4 files changed, 353 insertions(+)
>  create mode 100644 libbpf-tools/rcutop.bpf.c
>  create mode 100644 libbpf-tools/rcutop.c
>  create mode 100644 libbpf-tools/rcutop.h
> 
> diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
> index e60ec409..0d4cdff2 100644
> --- a/libbpf-tools/Makefile
> +++ b/libbpf-tools/Makefile
> @@ -42,6 +42,7 @@ APPS = \
>  	klockstat \
>  	ksnoop \
>  	llcstat \
> +	rcutop \
>  	mountsnoop \
>  	numamove \
>  	offcputime \
> diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
> new file mode 100644
> index 00000000..8287bbe2
> --- /dev/null
> +++ b/libbpf-tools/rcutop.bpf.c
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +/* Copyright (c) 2021 Hengqi Chen */
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_tracing.h>
> +#include "rcutop.h"
> +#include "maps.bpf.h"
> +
> +#define MAX_ENTRIES	10240
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, MAX_ENTRIES);
> +	__type(key, void *);
> +	__type(value, int);
> +} cbs_queued SEC(".maps");
> +
> +struct {
> +	__uint(type, BPF_MAP_TYPE_HASH);
> +	__uint(max_entries, MAX_ENTRIES);
> +	__type(key, void *);
> +	__type(value, int);
> +} cbs_executed SEC(".maps");
> +
> +SEC("tracepoint/rcu/rcu_callback")
> +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
> +{
> +	void *key = ctx->func;
> +	int *val = NULL;
> +	static const int zero;
> +
> +	val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
> +	if (val) {
> +		__sync_fetch_and_add(val, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +SEC("tracepoint/rcu/rcu_invoke_callback")
> +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
> +{
> +	void *key = ctx->func;
> +	int *val;
> +	int zero = 0;
> +
> +	val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
> +	if (val) {
> +		__sync_fetch_and_add(val, 1);
> +	}
> +
> +	return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
> new file mode 100644
> index 00000000..35795875
> --- /dev/null
> +++ b/libbpf-tools/rcutop.c
> @@ -0,0 +1,288 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +
> +/*
> + * rcutop
> + * Copyright (c) 2022 Joel Fernandes
> + *
> + * 05-May-2022   Joel Fernandes   Created this.
> + */
> +#include <argp.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include "rcutop.h"
> +#include "rcutop.skel.h"
> +#include "btf_helpers.h"
> +#include "trace_helpers.h"
> +
> +#define warn(...) fprintf(stderr, __VA_ARGS__)
> +#define OUTPUT_ROWS_LIMIT 10240
> +
> +static volatile sig_atomic_t exiting = 0;
> +
> +static bool clear_screen = true;
> +static int output_rows = 20;
> +static int interval = 1;
> +static int count = 99999999;
> +static bool verbose = false;
> +
> +const char *argp_program_version = "rcutop 0.1";
> +const char *argp_program_bug_address =
> +"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
> +const char argp_program_doc[] =
> +"Show RCU callback queuing and execution stats.\n"
> +"\n"
> +"USAGE: rcutop [-h] [interval] [count]\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    rcutop            # rcu activity top, refresh every 1s\n"
> +"    rcutop 5 10       # 5s summaries, 10 times\n";
> +
> +static const struct argp_option opts[] = {
> +	{ "noclear", 'C', NULL, 0, "Don't clear the screen" },
> +	{ "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
> +	{ "verbose", 'v', NULL, 0, "Verbose debug output" },
> +	{ NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
> +	{},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +	long rows;
> +	static int pos_args;
> +
> +	switch (key) {
> +		case 'C':
> +			clear_screen = false;
> +			break;
> +		case 'v':
> +			verbose = true;
> +			break;
> +		case 'h':
> +			argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
> +			break;
> +		case 'r':
> +			errno = 0;
> +			rows = strtol(arg, NULL, 10);
> +			if (errno || rows <= 0) {
> +				warn("invalid rows: %s\n", arg);
> +				argp_usage(state);
> +			}
> +			output_rows = rows;
> +			if (output_rows > OUTPUT_ROWS_LIMIT)
> +				output_rows = OUTPUT_ROWS_LIMIT;
> +			break;
> +		case ARGP_KEY_ARG:
> +			errno = 0;
> +			if (pos_args == 0) {
> +				interval = strtol(arg, NULL, 10);
> +				if (errno || interval <= 0) {
> +					warn("invalid interval\n");
> +					argp_usage(state);
> +				}
> +			} else if (pos_args == 1) {
> +				count = strtol(arg, NULL, 10);
> +				if (errno || count <= 0) {
> +					warn("invalid count\n");
> +					argp_usage(state);
> +				}
> +			} else {
> +				warn("unrecognized positional argument: %s\n", arg);
> +				argp_usage(state);
> +			}
> +			pos_args++;
> +			break;
> +		default:
> +			return ARGP_ERR_UNKNOWN;
> +	}
> +	return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +	if (level == LIBBPF_DEBUG && !verbose)
> +		return 0;
> +	return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +	exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +		struct rcutop_bpf *obj)
> +{
> +	void *key, **prev_key = NULL;
> +	int n, err = 0;
> +	int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +	int efd = bpf_map__fd(obj->maps.cbs_executed);
> +	const struct ksym *ksym;
> +	FILE *f;
> +	time_t t;
> +	struct tm *tm;
> +	char ts[16], buf[256];
> +
> +	f = fopen("/proc/loadavg", "r");
> +	if (f) {
> +		time(&t);
> +		tm = localtime(&t);
> +		strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +		memset(buf, 0, sizeof(buf));
> +		n = fread(buf, 1, sizeof(buf), f);
> +		if (n)
> +			printf("%8s loadavg: %s\n", ts, buf);
> +		fclose(f);
> +	}
> +
> +	printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +	while (1) {
> +		int qcount = 0, ecount = 0;
> +
> +		err = bpf_map_get_next_key(qfd, prev_key, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				break;
> +			}
> +			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +
> +		err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +		if (err) {
> +			warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +		prev_key = &key;
> +
> +		bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +		ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +		printf("%-32s %-6d %-6d\n",
> +				ksym ? ksym->name : "Unknown",
> +				qcount, ecount);
> +	}
> +	printf("\n");
> +	prev_key = NULL;
> +	while (1) {
> +		err = bpf_map_get_next_key(qfd, prev_key, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				break;
> +			}
> +			warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +		err = bpf_map_delete_elem(qfd, &key);
> +		if (err) {
> +			if (errno == ENOENT) {
> +				err = 0;
> +				continue;
> +			}
> +			warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +			return err;
> +		}
> +
> +		bpf_map_delete_elem(efd, &key);
> +		prev_key = &key;
> +	}
> +
> +	return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +	static const struct argp argp = {
> +		.options = opts,
> +		.parser = parse_arg,
> +		.doc = argp_program_doc,
> +	};
> +	struct rcutop_bpf *obj;
> +	int err;
> +	struct syms_cache *syms_cache = NULL;
> +	struct ksyms *ksyms = NULL;
> +
> +	err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +	if (err)
> +		return err;
> +
> +	libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +	libbpf_set_print(libbpf_print_fn);
> +
> +	err = ensure_core_btf(&open_opts);
> +	if (err) {
> +		fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +		return 1;
> +	}
> +
> +	obj = rcutop_bpf__open_opts(&open_opts);
> +	if (!obj) {
> +		warn("failed to open BPF object\n");
> +		return 1;
> +	}
> +
> +	err = rcutop_bpf__load(obj);
> +	if (err) {
> +		warn("failed to load BPF object: %d\n", err);
> +		goto cleanup;
> +	}
> +
> +	err = rcutop_bpf__attach(obj);
> +	if (err) {
> +		warn("failed to attach BPF programs: %d\n", err);
> +		goto cleanup;
> +	}
> +
> +	ksyms = ksyms__load();
> +	if (!ksyms) {
> +		fprintf(stderr, "failed to load kallsyms\n");
> +		goto cleanup;
> +	}
> +
> +	syms_cache = syms_cache__new(0);
> +	if (!syms_cache) {
> +		fprintf(stderr, "failed to create syms_cache\n");
> +		goto cleanup;
> +	}
> +
> +	if (signal(SIGINT, sig_int) == SIG_ERR) {
> +		warn("can't set signal handler: %s\n", strerror(errno));
> +		err = 1;
> +		goto cleanup;
> +	}
> +
> +	while (1) {
> +		sleep(interval);
> +
> +		if (clear_screen) {
> +			err = system("clear");
> +			if (err)
> +				goto cleanup;
> +		}
> +
> +		err = print_stat(ksyms, syms_cache, obj);
> +		if (err)
> +			goto cleanup;
> +
> +		count--;
> +		if (exiting || !count)
> +			goto cleanup;
> +	}
> +
> +cleanup:
> +	rcutop_bpf__destroy(obj);
> +	cleanup_core_btf(&open_opts);
> +
> +	return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX	4096
> +#define TASK_COMM_LEN	16
> +
> +#endif /* __RCUTOP_H */
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
  2022-05-14 14:33       ` Joel Fernandes
@ 2022-05-14 19:10         ` Uladzislau Rezki
  0 siblings, 0 replies; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-14 19:10 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, Paul E. McKenney, rcu, rushikesh.s.kadam,
	neeraj.iitr10, frederic, rostedt

> On Fri, May 13, 2022 at 04:55:34PM +0200, Uladzislau Rezki wrote:
> > > On Thu, May 12, 2022 at 03:04:38AM +0000, Joel Fernandes (Google) wrote:
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > Again, given that kfree_rcu() is doing its own laziness, is this really
> > > helping?  If so, would it instead make sense to adjust the kfree_rcu()
> > > timeouts?
> > > 
> > IMHO, this patch does not help much. Like Paul has mentioned we use
> > batching anyway.
> 
> I think that depends on the value of KFREE_DRAIN_JIFFIES. It is set to 20ms
> in the code. The batching with call_rcu_lazy() is set to 10k jiffies, which is
> longer: at least 10 seconds on a 1000HZ system. Before I added this
> patch, I was seeing more frequent queue_rcu_work() calls which were starting
> grace periods. I am not sure though how much was the power saving by
> eliminating queue_rcu_work() , I just wanted to make it go away.
> 
> Maybe, instead of this patch, can we make KFREE_DRAIN_JIFFIES a tunable or
> boot parameter so systems can set it appropriately? Or we can increase the
> default kfree_rcu() drain time considering that we do have a shrinker in case
> reclaim needs to happen.
> 
> Thoughts?
> 
Agreed. We need to change the behaviour of the simple KFREE_DRAIN_JIFFIES.
One thing is that we can rely on the shrinker. So making the default drain
interval, say, 1 sec sounds reasonable to me, and switching to a shorter one
if the page is full:

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 222d59299a2a..89b356cee643 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3249,6 +3249,7 @@ EXPORT_SYMBOL_GPL(call_rcu);
 
 
 /* Maximum number of jiffies to wait before draining a batch. */
+#define KFREE_DRAIN_JIFFIES_SEC (HZ)
 #define KFREE_DRAIN_JIFFIES (HZ / 50)
 #define KFREE_N_BATCHES 2
 #define FREE_N_CHANNELS 2
@@ -3685,6 +3686,20 @@ add_ptr_to_bulk_krc_lock(struct kfree_rcu_cpu **krcp,
 	return true;
 }
 
+static bool
+is_krc_page_full(struct kfree_rcu_cpu *krcp)
+{
+	int i;
+
+	// Check if a page is full either for first or second channels.
+	for (i = 0; i < FREE_N_CHANNELS && krcp->bkvhead[i]; i++) {
+		if (krcp->bkvhead[i]->nr_records == KVFREE_BULK_MAX_ENTR)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Queue a request for lazy invocation of the appropriate free routine
  * after a grace period.  Please note that three paths are maintained,
@@ -3701,6 +3716,7 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 {
 	unsigned long flags;
 	struct kfree_rcu_cpu *krcp;
+	unsigned long delay;
 	bool success;
 	void *ptr;
 
@@ -3749,7 +3765,11 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
 	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
 	    !krcp->monitor_todo) {
 		krcp->monitor_todo = true;
-		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
+
+		delay = is_krc_page_full(krcp) ?
+			KFREE_DRAIN_JIFFIES:KFREE_DRAIN_JIFFIES_SEC;
+
+		schedule_delayed_work(&krcp->monitor_work, delay);
 	}
 
 unlock_return:

Please note it is just for illustration because the patch is not complete.
As for the parameters, I think we can expose both via /sys/module/...
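
For example (again just an illustration; the parameter name is made up here):

	static int kfree_drain_ms = 1000;
	module_param(kfree_drain_ms, int, 0644);
	MODULE_PARM_DESC(kfree_drain_ms,
		"kvfree_rcu() drain interval when no page is full");

	// and in kvfree_call_rcu():
	delay = is_krc_page_full(krcp) ?
		KFREE_DRAIN_JIFFIES : msecs_to_jiffies(kfree_drain_ms);

module_param() works for built-in code too, so the knob would show up under
/sys/module/ next to the existing rcutree parameters.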

--
Uladzislau Rezki

^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag
  2022-05-14 14:35     ` Joel Fernandes
@ 2022-05-14 19:48       ` Uladzislau Rezki
  0 siblings, 0 replies; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-14 19:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, rushikesh.s.kadam, neeraj.iitr10,
	frederic, paulmck, rostedt

> On Fri, May 13, 2022 at 04:53:05PM +0200, Uladzislau Rezki wrote:
> > > monitor_todo is not needed as the work struct already tracks if work is
> > > pending. Just use that to know if work is pending using
> > > delayed_work_pending() helper.
> > > 
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> > >  kernel/rcu/tree.c | 22 +++++++---------------
> > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 3baf29014f86..3828ac3bf1c4 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3155,7 +3155,6 @@ struct kfree_rcu_cpu_work {
> > >   * @krw_arr: Array of batches of kfree_rcu() objects waiting for a grace period
> > >   * @lock: Synchronize access to this structure
> > >   * @monitor_work: Promote @head to @head_free after KFREE_DRAIN_JIFFIES
> > > - * @monitor_todo: Tracks whether a @monitor_work delayed work is pending
> > >   * @initialized: The @rcu_work fields have been initialized
> > >   * @count: Number of objects for which GP not started
> > >   * @bkvcache:
> > > @@ -3180,7 +3179,6 @@ struct kfree_rcu_cpu {
> > >  	struct kfree_rcu_cpu_work krw_arr[KFREE_N_BATCHES];
> > >  	raw_spinlock_t lock;
> > >  	struct delayed_work monitor_work;
> > > -	bool monitor_todo;
> > >  	bool initialized;
> > >  	int count;
> > >  
> > > @@ -3416,9 +3414,7 @@ static void kfree_rcu_monitor(struct work_struct *work)
> > >  	// of the channels that is still busy we should rearm the
> > >  	// work to repeat an attempt. Because previous batches are
> > >  	// still in progress.
> > > -	if (!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head)
> > > -		krcp->monitor_todo = false;
> > > -	else
> > > +	if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head)
>
Can we place those three checks into a separate inline function, say
krc_needs_offload(), since the same check is used in two places?
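
i.e. something like:

	static inline bool krc_needs_offload(struct kfree_rcu_cpu *krcp)
	{
		return krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head;
	}

so that both kfree_rcu_monitor() and kfree_rcu_scheduler_running() can use it.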

> > >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> > >  
> > >  	raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > > @@ -3607,10 +3603,8 @@ void kvfree_call_rcu(struct rcu_head *head, rcu_callback_t func)
> > >  
> > >  	// Set timer to drain after KFREE_DRAIN_JIFFIES.
> > >  	if (rcu_scheduler_active == RCU_SCHEDULER_RUNNING &&
> > > -	    !krcp->monitor_todo) {
> > > -		krcp->monitor_todo = true;
> > > +	    !delayed_work_pending(&krcp->monitor_work))
>
I think checking whether it is pending or not does not make much sense here.
schedule_delayed_work() already checks internally whether the work can be queued.

> > >  		schedule_delayed_work(&krcp->monitor_work, KFREE_DRAIN_JIFFIES);
> > > -	}
> > >  
> > >  unlock_return:
> > >  	krc_this_cpu_unlock(krcp, flags);
> > > @@ -3685,14 +3679,12 @@ void __init kfree_rcu_scheduler_running(void)
> > >  		struct kfree_rcu_cpu *krcp = per_cpu_ptr(&krc, cpu);
> > >  
> > >  		raw_spin_lock_irqsave(&krcp->lock, flags);
> > > -		if ((!krcp->bkvhead[0] && !krcp->bkvhead[1] && !krcp->head) ||
> > > -				krcp->monitor_todo) {
> > > -			raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > > -			continue;
> > > +		if (krcp->bkvhead[0] || krcp->bkvhead[1] || krcp->head) {
Same here. Moving this to the separate function makes sense, IMHO.

> > > +			if (delayed_work_pending(&krcp->monitor_work)) {
Same here. Should we check it here?

> > > +				schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > > +						KFREE_DRAIN_JIFFIES);
> > > +			}
> > >  		}
> > > -		krcp->monitor_todo = true;
> > > -		schedule_delayed_work_on(cpu, &krcp->monitor_work,
> > > -					 KFREE_DRAIN_JIFFIES);
> > >  		raw_spin_unlock_irqrestore(&krcp->lock, flags);
> > >  	}
> > >  }
> > > -- 
> > >
> > Looks good to me from the first glance, but let me know to have a look
> > at it more closely.
> 
> Thanks, I appreciate it.
> 
One change in design after this patch is that the drain work can be queued even
though there is already nothing to drain. I do not see that as a big issue
because it will just bail out, so I tend toward the simplification.

The monitor_todo flag guarantees that a kvfree_rcu() caller will not schedule
any work until the "monitor work" completes its job; if there is still something
to do, it rearms itself.

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
  2022-05-12 23:56   ` Paul E. McKenney
@ 2022-05-17  9:07   ` Uladzislau Rezki
  2022-05-30 14:54     ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-05-17  9:07 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck,
	rostedt

Some extra comments, not overlapping with those already mentioned in
this email thread:

> Implement timer-based RCU callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure can flush it.
> 
> Locking is avoided to reduce lock contention when queuing and dequeuing
> happens on different CPUs of a per-cpu list, such as when shrinker
> context is running on different CPU. Also not having to use locks keeps
> the per-CPU structure size small.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rcupdate.h |   6 ++
>  kernel/rcu/Kconfig       |   8 +++
>  kernel/rcu/Makefile      |   1 +
>  kernel/rcu/lazy.c        | 145 +++++++++++++++++++++++++++++++++++++++
>  kernel/rcu/rcu.h         |   5 ++
>  kernel/rcu/tree.c        |   2 +
>  6 files changed, 167 insertions(+)
>  create mode 100644 kernel/rcu/lazy.c
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 88b42eb46406..d0a6c4f5172c 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -82,6 +82,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active __read_mostly;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index bf8e341e75b4..c09715079829 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default y
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +          pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 0cfb009a99b9..8968b330d6e0 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
>  obj-$(CONFIG_TREE_RCU) += tree.o
>  obj-$(CONFIG_TINY_RCU) += tiny.o
>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> +obj-$(CONFIG_RCU_LAZY) += lazy.o
> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> new file mode 100644
> index 000000000000..55e406cfc528
> --- /dev/null
> +++ b/kernel/rcu/lazy.c
> @@ -0,0 +1,145 @@
> +/*
> + * Lockless lazy-RCU implementation.
> + */
> +#include <linux/rcupdate.h>
> +#include <linux/shrinker.h>
> +#include <linux/workqueue.h>
> +#include "rcu.h"
> +
> +// How much to batch before flushing?
> +#define MAX_LAZY_BATCH		2048
> +
> +// How much to wait before flushing?
> +#define MAX_LAZY_JIFFIES	10000
> +
> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> +struct lazy_rcu_head {
> +	struct llist_node llist_node;
> +	void (*func)(struct callback_head *head);
> +} __attribute__((aligned(sizeof(void *))));
> +
> +struct rcu_lazy_pcp {
> +	struct llist_head head;
> +	struct delayed_work work;
> +	atomic_t count;
> +};
> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> +
> +// Lockless flush of CPU, can be called concurrently.
> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> +{
> +	struct llist_node *node = llist_del_all(&rlp->head);
> +	struct lazy_rcu_head *cursor, *temp;
> +
> +	if (!node)
> +		return;
> +
> +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> +		struct rcu_head *rh = (struct rcu_head *)cursor;
> +		debug_rcu_head_unqueue(rh);
> +		call_rcu(rh, rh->func);
> +		atomic_dec(&rlp->count);
> +	}
> +}
> +
> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> +{
> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> +	struct rcu_lazy_pcp *rlp;
> +
> +	preempt_disable();
> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> +	preempt_enable();
>
Can we get rid of such explicit disabling/enabling of preemption?
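For example (sketch only): raw_cpu_ptr() would do here, since after the
preempt_enable() the task can migrate anyway and the lockless llist is
fine with queueing onto another CPU's list:

	rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);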

> +
> +	if (debug_rcu_head_queue((void *)head)) {
> +		// Probable double call_rcu(), just leak.
> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> +				__func__, head);
> +
> +		// Mark as success and leave.
> +		return;
> +	}
> +
> +	// Queue to per-cpu llist
> +	head->func = func;
> +	llist_add(&head->llist_node, &rlp->head);
> +
> +	// Flush queue if too big
> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> +		lazy_rcu_flush_cpu(rlp);
>
Can we just schedule the work instead of draining from the caller
context? For example, the caller can be in hard-irq context.
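A sketch of that idea (an assumption about how it could look, not a
respin of the patch): when the batch gets too big, pull the timer in
with mod_delayed_work() so the flush runs from workqueue context rather
than from the caller:

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH)
		// Flush soon, but from process context.
		mod_delayed_work(system_wq, &rlp->work, 0);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);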

> +	} else {
> +		if (!delayed_work_pending(&rlp->work)) {
> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> +		}
> +	}
> +}
EXPORT_SYMBOL_GPL()? To make call_rcu_lazy() usable from kernel modules.
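That is, right after the function body:

EXPORT_SYMBOL_GPL(call_rcu_lazy);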

> +
> +static unsigned long
> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	unsigned long count = 0;
> +	int cpu;
> +
> +	/* Snapshot count of all CPUs */
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +
> +		count += atomic_read(&rlp->count);
> +	}
> +
> +	return count;
> +}
> +
> +static unsigned long
> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> +{
> +	int cpu, freed = 0;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		unsigned long count;
> +
> +		count = atomic_read(&rlp->count);
> +		lazy_rcu_flush_cpu(rlp);
> +		sc->nr_to_scan -= count;
> +		freed += count;
> +		if (sc->nr_to_scan <= 0)
> +			break;
> +	}
> +
> +	return freed == 0 ? SHRINK_STOP : freed;
> +}
> +
> +/*
> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> + */
> +static void lazy_work(struct work_struct *work)
> +{
> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> +
> +	lazy_rcu_flush_cpu(rlp);
> +}
> +
> +static struct shrinker lazy_rcu_shrinker = {
> +	.count_objects = lazy_rcu_shrink_count,
> +	.scan_objects = lazy_rcu_shrink_scan,
> +	.batch = 0,
> +	.seeks = DEFAULT_SEEKS,
> +};
> +
> +void __init rcu_lazy_init(void)
> +{
> +	int cpu;
> +
> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> +
> +	for_each_possible_cpu(cpu) {
> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> +	}
> +
> +	if (register_shrinker(&lazy_rcu_shrinker))
> +		pr_err("Failed to register lazy_rcu shrinker!\n");
> +}
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index 24b5f2c2de87..a5f4b44f395f 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
>  #endif
>  
> +#ifdef CONFIG_RCU_LAZY
> +void rcu_lazy_init(void);
> +#else
> +static inline void rcu_lazy_init(void) {}
> +#endif
>  #endif /* __LINUX_RCU_H */
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a4c25a6283b0..ebdf6f7c9023 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
>  	else
>  		qovld_calc = qovld;
> +
> +	rcu_lazy_init();
>  }
>  
>  #include "tree_stall.h"
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-14 16:34       ` Paul E. McKenney
@ 2022-05-27 23:12         ` Joel Fernandes
  2022-05-28 17:57           ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-27 23:12 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Sat, May 14, 2022 at 09:34:21AM -0700, Paul E. McKenney wrote:
> On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote:
> > Hi Paul,
> > 
> > Thanks for bearing with my slightly late reply, I did "time blocking"
> > technique to work on RCU today! ;-)
> > 
> > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > > Implement timer-based RCU callback batching. The batch is flushed
> > > > whenever a certain amount of time has passed, or the batch on a
> > > > particular CPU grows too big. Also memory pressure can flush it.
> > > > 
> > > > Locking is avoided to reduce lock contention when queuing and dequeuing
> > > > happens on different CPUs of a per-cpu list, such as when shrinker
> > > > context is running on different CPU. Also not having to use locks keeps
> > > > the per-CPU structure size small.
> > > > 
> > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > 
> > > It is very good to see this!  Inevitable comments and questions below.
> > 
> > Thanks for taking a look!
> > 
> > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > index bf8e341e75b4..c09715079829 100644
> > > > --- a/kernel/rcu/Kconfig
> > > > +++ b/kernel/rcu/Kconfig
> > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> > > >  	  Say N here if you hate read-side memory barriers.
> > > >  	  Take the default if you are unsure.
> > > >  
> > > > +config RCU_LAZY
> > > > +	bool "RCU callback lazy invocation functionality"
> > > > +	depends on RCU_NOCB_CPU
> > > > +	default y
> > > 
> > > This "default y" is OK for experimentation, but for mainline we should
> > > not be forcing this on unsuspecting people.  ;-)
> > 
> > Agreed. I'll default 'n' :)
> > 
> > > > +	help
> > > > +	  To save power, batch RCU callbacks and flush after delay, memory
> > > > +          pressure or callback list growing too big.
> > > > +
> > > >  endmenu # "RCU Subsystem"
> > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > > > index 0cfb009a99b9..8968b330d6e0 100644
> > > > --- a/kernel/rcu/Makefile
> > > > +++ b/kernel/rcu/Makefile
> > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> > > >  obj-$(CONFIG_TREE_RCU) += tree.o
> > > >  obj-$(CONFIG_TINY_RCU) += tiny.o
> > > >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > > new file mode 100644
> > > > index 000000000000..55e406cfc528
> > > > --- /dev/null
> > > > +++ b/kernel/rcu/lazy.c
> > > > @@ -0,0 +1,145 @@
> > > > +/*
> > > > + * Lockless lazy-RCU implementation.
> > > > + */
> > > > +#include <linux/rcupdate.h>
> > > > +#include <linux/shrinker.h>
> > > > +#include <linux/workqueue.h>
> > > > +#include "rcu.h"
> > > > +
> > > > +// How much to batch before flushing?
> > > > +#define MAX_LAZY_BATCH		2048
> > > > +
> > > > +// How much to wait before flushing?
> > > > +#define MAX_LAZY_JIFFIES	10000
> > > 
> > > That is more than a minute on a HZ=10 system.  Are you sure that you
> > > did not mean "(10 * HZ)" or some such?
> > 
> > Yes, you are right. I need to change that to be constant regardless of HZ. I
> > will make the change as you suggest.
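For concreteness, the HZ-independent form being suggested would look
something like this, with the exact value still to be decided:

#define MAX_LAZY_JIFFIES	(10 * HZ)	/* ~10 seconds regardless of HZ */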
> > 
> > > > +
> > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > > +struct lazy_rcu_head {
> > > > +	struct llist_node llist_node;
> > > > +	void (*func)(struct callback_head *head);
> > > > +} __attribute__((aligned(sizeof(void *))));
> > > 
> > > This needs a build-time check that rcu_head and lazy_rcu_head are of
> > > the same size.  Maybe something like this in some appropriate context:
> > > 
> > > 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> > > 
> > > Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> > > you also mention this in the above comments.  ;-)
> > 
> > Cool, great minds think alike! ;-)
> > 
> > > > +
> > > > +struct rcu_lazy_pcp {
> > > > +	struct llist_head head;
> > > > +	struct delayed_work work;
> > > > +	atomic_t count;
> > > > +};
> > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > > > +
> > > > +// Lockless flush of CPU, can be called concurrently.
> > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > > > +{
> > > > +	struct llist_node *node = llist_del_all(&rlp->head);
> > > > +	struct lazy_rcu_head *cursor, *temp;
> > > > +
> > > > +	if (!node)
> > > > +		return;
> > > 
> > > At this point, the list is empty but the count is non-zero.  Can
> > > that cause a problem?  (For the existing callback lists, this would
> > > be OK.)
> > > 
> > > > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > > > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > > > +		debug_rcu_head_unqueue(rh);
> > > 
> > > Good to see this check!
> > > 
> > > > +		call_rcu(rh, rh->func);
> > > > +		atomic_dec(&rlp->count);
> > > > +	}
> > > > +}
> > > > +
> > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > > > +{
> > > > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > > > +	struct rcu_lazy_pcp *rlp;
> > > > +
> > > > +	preempt_disable();
> > > > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > > 
> > > Whitespace issue, please fix.
> > 
> > Fixed, thanks.
> > 
> > > > +	preempt_enable();
> > > > +
> > > > +	if (debug_rcu_head_queue((void *)head)) {
> > > > +		// Probable double call_rcu(), just leak.
> > > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > > +				__func__, head);
> > > > +
> > > > +		// Mark as success and leave.
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	// Queue to per-cpu llist
> > > > +	head->func = func;
> > > > +	llist_add(&head->llist_node, &rlp->head);
> > > 
> > > Suppose that there are a bunch of preemptions between the preempt_enable()
> > > above and this point, so that the current CPU's list has lots of
> > > callbacks, but zero ->count.  Can that cause a problem?
> > > 
> > > In the past, this sort of thing has been an issue for rcu_barrier()
> > > and friends.
> > 
> > Thanks, I think I dropped the ball on this. You have given me something to
> > think about. I am thinking on first thought that setting the count in advance
> > of populating the list should do the trick. I will look more on it.
> 
> That can work, but don't forget the need for proper memory ordering.
> 
> Another approach would be to what the current bypass list does, namely
> count the callback in the ->cblist structure.

I have been thinking about this, and I also arrived at the same conclusion
that it is easier to make it a segcblist structure which has the functions do
the memory ordering. Then we can make rcu_barrier() locklessly sample the
->len of the new per-cpu structure, I *think* it might work.

However, I was actually considering another situation outside of rcu_barrier,
where the preemptions you mentioned would cause the list to appear to be
empty if one were to rely on count, but actually have a lot of stuff queued.
This causes shrinkers to not know that there is a lot of memory available to
free. I don't think its that big an issue, if we can assume the caller's
process context to have sufficient priority. Obviously using a raw spinlock
makes the process context to be highest priority briefly but without the
lock, we don't get that. But yeah that can be sorted by just updating the
count proactively, and then doing the queuing operations.
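That is, on the queueing side, something like this sketch (keeping the
current llist/atomic layout; matching ordering is of course also needed
on the read side):

	// Publish the count before the node; atomic_inc_return() is fully
	// ordered, so the count can only over-estimate, never under-estimate,
	// what is on the list.
	bool flush = atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH;

	llist_add(&head->llist_node, &rlp->head);
	if (flush)
		lazy_rcu_flush_cpu(rlp);
	else
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);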

Another advantage of using the segcblist structure, is, we can also record
the grace period numbers in it, and avoid triggering RCU from the timer if gp
already elapsed (similar to the rcu-poll family of functions).

WDYT?

thanks,

 - Joel

> > > > +	// Flush queue if too big
> > > 
> > > You will also need to check for early boot use.
> > > 
> > > I -think- it suffice to simply skip the following "if" statement when
> > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices.  The reason being
> > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> > > kernels won't expire them until the softirq kthreads have been spawned.
> > > 
> > > Which is OK, as it just means that call_rcu_lazy() is a bit more
> > > lazy than expected that early.
> > > 
> > > Except that call_rcu() can be invoked even before rcu_init() has been
> > > invoked, which is therefore also before rcu_init_lazy() has been invoked.
> > 
> > In other words, you are concerned that too many lazy callbacks might be
> > pending before rcu_init() is called?
> 
> In other words, I am concerned that bad things might happen if fields
> in a rcu_lazy_pcp structure are used before they are initialized.
> 
> I am not worried about too many lazy callbacks before rcu_init() because
> the non-lazy callbacks (which these currently are) are allowed to pile
> up until RCU's grace-period kthreads have been spawned.  There might
> come a time when RCU callbacks need to be invoked earlier, but that will
> be a separate problem to solve when and if, but with the benefit of the
> additional information derived from seeing the problem actually happen.
> 
> > I am going through the kfree_rcu() threads/patches involving
> > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> > the scheduler is running causes a crash or warnings?
> 
> There are a lot of different issues that arise in different phases
> of boot.  In this particular case, my primary concern is the
> use-before-initialized bug.
> 
> > > I thefore suggest something like this at the very start of this function:
> > > 
> > > 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > > 		call_rcu(head_rcu, func);
> > > 
> > > The goal is that people can replace call_rcu() with call_rcu_lazy()
> > > without having to worry about invocation during early boot.
> > 
> > Yes, this seems safer. I don't expect much power savings during system boot
> > process anyway ;-). I believe perhaps a static branch would work better to
> > take a branch out from what is likely a fast path.
> 
> A static branch would be fine.  Maybe encapsulate it in a static inline
> function for all such comparisons, but most such comparisons are far from
> anything resembling a fastpath, so the main potential benefit would be
> added readability.  Which could be a compelling reason in and of itself.
> 
> 							Thanx, Paul
> 
> > > Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> > > though "rhp" is more consistent with the RCU pointer initials approach.
> > 
> > Fixed, thanks.
> > 
> > > > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > > > +		lazy_rcu_flush_cpu(rlp);
> > > > +	} else {
> > > > +		if (!delayed_work_pending(&rlp->work)) {
> > > 
> > > This check is racy because the work might run to completion right at
> > > this point.  Wouldn't it be better to rely on the internal check of
> > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> > 
> > Oops, agreed. Will make it as you suggest.
> > 
> > thanks,
> > 
> >  - Joel
> > 
> > > > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > > > +		}
> > > > +	}
> > > > +}
> > > > +
> > > > +static unsigned long
> > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > > +{
> > > > +	unsigned long count = 0;
> > > > +	int cpu;
> > > > +
> > > > +	/* Snapshot count of all CPUs */
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > +
> > > > +		count += atomic_read(&rlp->count);
> > > > +	}
> > > > +
> > > > +	return count;
> > > > +}
> > > > +
> > > > +static unsigned long
> > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > +{
> > > > +	int cpu, freed = 0;
> > > > +
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > +		unsigned long count;
> > > > +
> > > > +		count = atomic_read(&rlp->count);
> > > > +		lazy_rcu_flush_cpu(rlp);
> > > > +		sc->nr_to_scan -= count;
> > > > +		freed += count;
> > > > +		if (sc->nr_to_scan <= 0)
> > > > +			break;
> > > > +	}
> > > > +
> > > > +	return freed == 0 ? SHRINK_STOP : freed;
> > > 
> > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> > > potential deadlocks.  But this pattern is common, including on the
> > > kvfree_rcu() path, so OK!  ;-)
> > > 
> > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> > > that as well.
> > > 
> > > > +}
> > > > +
> > > > +/*
> > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > > > + */
> > > > +static void lazy_work(struct work_struct *work)
> > > > +{
> > > > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > > > +
> > > > +	lazy_rcu_flush_cpu(rlp);
> > > > +}
> > > > +
> > > > +static struct shrinker lazy_rcu_shrinker = {
> > > > +	.count_objects = lazy_rcu_shrink_count,
> > > > +	.scan_objects = lazy_rcu_shrink_scan,
> > > > +	.batch = 0,
> > > > +	.seeks = DEFAULT_SEEKS,
> > > > +};
> > > > +
> > > > +void __init rcu_lazy_init(void)
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > > > +
> > > > +	for_each_possible_cpu(cpu) {
> > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > > > +	}
> > > > +
> > > > +	if (register_shrinker(&lazy_rcu_shrinker))
> > > > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > > > +}
> > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > index 24b5f2c2de87..a5f4b44f395f 100644
> > > > --- a/kernel/rcu/rcu.h
> > > > +++ b/kernel/rcu/rcu.h
> > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> > > >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> > > >  #endif
> > > >  
> > > > +#ifdef CONFIG_RCU_LAZY
> > > > +void rcu_lazy_init(void);
> > > > +#else
> > > > +static inline void rcu_lazy_init(void) {}
> > > > +#endif
> > > >  #endif /* __LINUX_RCU_H */
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index a4c25a6283b0..ebdf6f7c9023 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> > > >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > > >  	else
> > > >  		qovld_calc = qovld;
> > > > +
> > > > +	rcu_lazy_init();
> > > >  }
> > > >  
> > > >  #include "tree_stall.h"
> > > > -- 
> > > > 2.36.0.550.gb090851708-goog
> > > > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-27 23:12         ` Joel Fernandes
@ 2022-05-28 17:57           ` Paul E. McKenney
  2022-05-30 14:48             ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-28 17:57 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Fri, May 27, 2022 at 11:12:03PM +0000, Joel Fernandes wrote:
> On Sat, May 14, 2022 at 09:34:21AM -0700, Paul E. McKenney wrote:
> > On Sat, May 14, 2022 at 03:08:33PM +0000, Joel Fernandes wrote:
> > > Hi Paul,
> > > 
> > > Thanks for bearing with my slightly late reply, I did "time blocking"
> > > technique to work on RCU today! ;-)
> > > 
> > > On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > > > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > > > Implement timer-based RCU callback batching. The batch is flushed
> > > > > whenever a certain amount of time has passed, or the batch on a
> > > > > particular CPU grows too big. Also memory pressure can flush it.
> > > > > 
> > > > > Locking is avoided to reduce lock contention when queuing and dequeuing
> > > > > happens on different CPUs of a per-cpu list, such as when shrinker
> > > > > context is running on different CPU. Also not having to use locks keeps
> > > > > the per-CPU structure size small.
> > > > > 
> > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > 
> > > > It is very good to see this!  Inevitable comments and questions below.
> > > 
> > > Thanks for taking a look!
> > > 
> > > > > diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> > > > > index bf8e341e75b4..c09715079829 100644
> > > > > --- a/kernel/rcu/Kconfig
> > > > > +++ b/kernel/rcu/Kconfig
> > > > > @@ -241,4 +241,12 @@ config TASKS_TRACE_RCU_READ_MB
> > > > >  	  Say N here if you hate read-side memory barriers.
> > > > >  	  Take the default if you are unsure.
> > > > >  
> > > > > +config RCU_LAZY
> > > > > +	bool "RCU callback lazy invocation functionality"
> > > > > +	depends on RCU_NOCB_CPU
> > > > > +	default y
> > > > 
> > > > This "default y" is OK for experimentation, but for mainline we should
> > > > not be forcing this on unsuspecting people.  ;-)
> > > 
> > > Agreed. I'll default 'n' :)
> > > 
> > > > > +	help
> > > > > +	  To save power, batch RCU callbacks and flush after delay, memory
> > > > > +          pressure or callback list growing too big.
> > > > > +
> > > > >  endmenu # "RCU Subsystem"
> > > > > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > > > > index 0cfb009a99b9..8968b330d6e0 100644
> > > > > --- a/kernel/rcu/Makefile
> > > > > +++ b/kernel/rcu/Makefile
> > > > > @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
> > > > >  obj-$(CONFIG_TREE_RCU) += tree.o
> > > > >  obj-$(CONFIG_TINY_RCU) += tiny.o
> > > > >  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
> > > > > +obj-$(CONFIG_RCU_LAZY) += lazy.o
> > > > > diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
> > > > > new file mode 100644
> > > > > index 000000000000..55e406cfc528
> > > > > --- /dev/null
> > > > > +++ b/kernel/rcu/lazy.c
> > > > > @@ -0,0 +1,145 @@
> > > > > +/*
> > > > > + * Lockless lazy-RCU implementation.
> > > > > + */
> > > > > +#include <linux/rcupdate.h>
> > > > > +#include <linux/shrinker.h>
> > > > > +#include <linux/workqueue.h>
> > > > > +#include "rcu.h"
> > > > > +
> > > > > +// How much to batch before flushing?
> > > > > +#define MAX_LAZY_BATCH		2048
> > > > > +
> > > > > +// How much to wait before flushing?
> > > > > +#define MAX_LAZY_JIFFIES	10000
> > > > 
> > > > That is more than a minute on a HZ=10 system.  Are you sure that you
> > > > did not mean "(10 * HZ)" or some such?
> > > 
> > > Yes, you are right. I need to change that to be constant regardless of HZ. I
> > > will make the change as you suggest.
> > > 
> > > > > +
> > > > > +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
> > > > > +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
> > > > > +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
> > > > > +struct lazy_rcu_head {
> > > > > +	struct llist_node llist_node;
> > > > > +	void (*func)(struct callback_head *head);
> > > > > +} __attribute__((aligned(sizeof(void *))));
> > > > 
> > > > This needs a build-time check that rcu_head and lazy_rcu_head are of
> > > > the same size.  Maybe something like this in some appropriate context:
> > > > 
> > > > 	BUILD_BUG_ON(sizeof(struct rcu_head) != sizeof(struct_rcu_head_lazy));
> > > > 
> > > > Never mind!  I see you have this in rcu_init_lazy().  Plus I now see that
> > > > you also mention this in the above comments.  ;-)
> > > 
> > > Cool, great minds think alike! ;-)
> > > 
> > > > > +
> > > > > +struct rcu_lazy_pcp {
> > > > > +	struct llist_head head;
> > > > > +	struct delayed_work work;
> > > > > +	atomic_t count;
> > > > > +};
> > > > > +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
> > > > > +
> > > > > +// Lockless flush of CPU, can be called concurrently.
> > > > > +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
> > > > > +{
> > > > > +	struct llist_node *node = llist_del_all(&rlp->head);
> > > > > +	struct lazy_rcu_head *cursor, *temp;
> > > > > +
> > > > > +	if (!node)
> > > > > +		return;
> > > > 
> > > > At this point, the list is empty but the count is non-zero.  Can
> > > > that cause a problem?  (For the existing callback lists, this would
> > > > be OK.)
> > > > 
> > > > > +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
> > > > > +		struct rcu_head *rh = (struct rcu_head *)cursor;
> > > > > +		debug_rcu_head_unqueue(rh);
> > > > 
> > > > Good to see this check!
> > > > 
> > > > > +		call_rcu(rh, rh->func);
> > > > > +		atomic_dec(&rlp->count);
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > > > > +{
> > > > > +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > > > > +	struct rcu_lazy_pcp *rlp;
> > > > > +
> > > > > +	preempt_disable();
> > > > > +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > > > 
> > > > Whitespace issue, please fix.
> > > 
> > > Fixed, thanks.
> > > 
> > > > > +	preempt_enable();
> > > > > +
> > > > > +	if (debug_rcu_head_queue((void *)head)) {
> > > > > +		// Probable double call_rcu(), just leak.
> > > > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > > > +				__func__, head);
> > > > > +
> > > > > +		// Mark as success and leave.
> > > > > +		return;
> > > > > +	}
> > > > > +
> > > > > +	// Queue to per-cpu llist
> > > > > +	head->func = func;
> > > > > +	llist_add(&head->llist_node, &rlp->head);
> > > > 
> > > > Suppose that there are a bunch of preemptions between the preempt_enable()
> > > > above and this point, so that the current CPU's list has lots of
> > > > callbacks, but zero ->count.  Can that cause a problem?
> > > > 
> > > > In the past, this sort of thing has been an issue for rcu_barrier()
> > > > and friends.
> > > 
> > > Thanks, I think I dropped the ball on this. You have given me something to
> > > think about. I am thinking on first thought that setting the count in advance
> > > of populating the list should do the trick. I will look more on it.
> > 
> > That can work, but don't forget the need for proper memory ordering.
> > 
> > Another approach would be to what the current bypass list does, namely
> > count the callback in the ->cblist structure.
> 
> I have been thinking about this, and I also arrived at the same conclusion
> that it is easier to make it a segcblist structure which has the functions do
> the memory ordering. Then we can make rcu_barrier() locklessly sample the
> ->len of the new per-cpu structure, I *think* it might work.
> 
> However, I was actually considering another situation outside of rcu_barrier,
> where the preemptions you mentioned would cause the list to appear to be
> empty if one were to rely on count, but actually have a lot of stuff queued.
> This causes shrinkers to not know that there is a lot of memory available to
> free. I don't think its that big an issue, if we can assume the caller's
> process context to have sufficient priority. Obviously using a raw spinlock
> makes the process context to be highest priority briefly but without the
> lock, we don't get that. But yeah that can be sorted by just updating the
> count proactively, and then doing the queuing operations.
> 
> Another advantage of using the segcblist structure, is, we can also record
> the grace period numbers in it, and avoid triggering RCU from the timer if gp
> already elapsed (similar to the rcu-poll family of functions).
> 
> WDYT?

What you are seeing is that the design space for call_rcu_lazy() is huge,
and the tradeoffs between energy efficiency and complexity are not well
understood.  It is therefore necessary to quickly evaluate alternatives.
Yes, you could simply implement all of them and test them, but that
might take you longer than you would like.  Plus you likely need to
put in place more testing than rcutorture currently provides.

Given what we have seen thus far, I think that the first step has to
be to evaluate what can be obtained with this approach compared with
what can be obtained with other approaches.  What we have right now,
more or less, is this:

o	Offloading buys about 15%.

o	Slowing grace periods buys about another 7%, but requires
	userspace changes to prevent degrading user-visible performance.

o	The initial call_rcu_lazy() prototype buys about 1%.

	True, this is not apples to apples because the first two were
	measured on ChromeOS and this one on Android.  Which means that
	apples to apples evaluation is also required.  But this is the
	information I currently have at hand, and it is probably no more
	than a factor of two off of what would be seen on ChromeOS.

	Or is there some ChromeOS data that tells a different story?
	After all, for all I know, Android might still be expediting
	all normal grace periods.

At which point, the question becomes "how to make up that 7%?" After all,
it is not likely that anyone is going to leave that much battery lifetime
on the table.  Here are the possibilities that I currently see:

o	Slow down grace periods, and also bite the bullet and make
	userspace changes to speed up the RCU grace periods during
	critical operations.

o	Slow down grace periods, but leverage in-kernel changes for
	some of those operations.  For example, RCU uses pm_notifier()
	to cause grace periods to be expedited during suspend and
	hibernation.  The requests for expediting are atomically counted,
	so there can be other similar setups.

o	Choose a more moderate slowing down of the grace period so as to
	minimize (or maybe even eliminate) the need for the aforementioned
	userspace changes.  This also leaves battery lifetime on the
	table, but considerably less of it.

o	Implement more selective slowdowns, such as call_rcu_lazy().

	Other selective slowdowns include in-RCU heuristics such as
	forcing the current grace period to end once the entire
	system has gone idle.  (There is prior work detecting
	full-system idleness for NO_HZ_FULL, but smaller systems
	could use more aggressive techniques.)

I am sure that there are additional possibilities.  Note that it is
possible to do more than one of these, if that proves helpful.

But given this, one question is "What is the most you can possibly
obtain from call_rcu_lazy()?"  If you have a platform with enough
memory, one way to obtain an upper bound is to make call_rcu_lazy()
simply leak the memory.  If that amount of savings does not meet the
need, then other higher-level approaches will be needed, whether
as alternatives or as additions to the call_rcu_lazy() approach.
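Just to make that experiment concrete, such a measurement-only stub
could be as simple as (never for real use, of course):

void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
	/* Deliberately leak: neither queue nor invoke the callback. */
	(void)head;
	(void)func;
}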

Should call_rcu_lazy() prove to be part of the mix, here are a (very)
few of the tradeoffs:

o	Put lazy callbacks on their own list or not?

	o	If not, use the bypass list?  If so, is it the
		case that call_rcu_lazy() is just call_rcu() on
		non-rcu_nocbs CPUs?

		Or add the complexity required to use the bypass
		list on rcu_nocbs CPUs but to use ->cblist otherwise?

	o	If so:

		o	Use segmented list with marked grace periods?
			Keep in mind that such a list can track only
			two grace periods.

		o	Use a plain list and have grace-period start
			simply drain the list?

o	Does call_rcu_lazy() do anything to ensure that additional grace
	periods that exist only for the benefit of lazy callbacks are
	maximally shared among all CPUs' lazy callbacks?  If so, what?
	(Keeping in mind that failing to share such grace periods
	burns battery lifetime.)

o	When there is ample memory, how long are lazy callbacks allowed
	to sleep?  Forever?  If not forever, what feeds into computing
	the timeout?

o	What is used to determine when memory is low, so that laziness
	has become a net negative?

o	What other conditions, if any, should prompt motivating lazy
	callbacks?  (See above for the start of a grace period motivating
	lazy callbacks.)

In short, you need to cast your net pretty wide on this one.  It has
not yet been very carefully explored, so there are likely to be surprises,
maybe even good surprises.  ;-)

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> > > > > +	// Flush queue if too big
> > > > 
> > > > You will also need to check for early boot use.
> > > > 
> > > > I -think- it suffice to simply skip the following "if" statement when
> > > > rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices.  The reason being
> > > > that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> > > > kernels won't expire them until the softirq kthreads have been spawned.
> > > > 
> > > > Which is OK, as it just means that call_rcu_lazy() is a bit more
> > > > lazy than expected that early.
> > > > 
> > > > Except that call_rcu() can be invoked even before rcu_init() has been
> > > > invoked, which is therefore also before rcu_init_lazy() has been invoked.
> > > 
> > > In other words, you are concerned that too many lazy callbacks might be
> > > pending before rcu_init() is called?
> > 
> > In other words, I am concerned that bad things might happen if fields
> > in a rcu_lazy_pcp structure are used before they are initialized.
> > 
> > I am not worried about too many lazy callbacks before rcu_init() because
> > the non-lazy callbacks (which these currently are) are allowed to pile
> > up until RCU's grace-period kthreads have been spawned.  There might
> > come a time when RCU callbacks need to be invoked earlier, but that will
> > be a separate problem to solve when and if, but with the benefit of the
> > additional information derived from seeing the problem actually happen.
> > 
> > > I am going through the kfree_rcu() threads/patches involving
> > > RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> > > the scheduler is running causes a crash or warnings?
> > 
> > There are a lot of different issues that arise in different phases
> > of boot.  In this particular case, my primary concern is the
> > use-before-initialized bug.
> > 
> > > > I thefore suggest something like this at the very start of this function:
> > > > 
> > > > 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> > > > 		call_rcu(head_rcu, func);
> > > > 
> > > > The goal is that people can replace call_rcu() with call_rcu_lazy()
> > > > without having to worry about invocation during early boot.
> > > 
> > > Yes, this seems safer. I don't expect much power savings during system boot
> > > process anyway ;-). I believe perhaps a static branch would work better to
> > > take a branch out from what is likely a fast path.
> > 
> > A static branch would be fine.  Maybe encapsulate it in a static inline
> > function for all such comparisons, but most such comparisons are far from
> > anything resembling a fastpath, so the main potential benefit would be
> > added readability.  Which could be a compelling reason in and of itself.
> > 
> > 							Thanx, Paul
> > 
> > > > Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> > > > though "rhp" is more consistent with the RCU pointer initials approach.
> > > 
> > > Fixed, thanks.
> > > 
> > > > > +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> > > > > +		lazy_rcu_flush_cpu(rlp);
> > > > > +	} else {
> > > > > +		if (!delayed_work_pending(&rlp->work)) {
> > > > 
> > > > This check is racy because the work might run to completion right at
> > > > this point.  Wouldn't it be better to rely on the internal check of
> > > > WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> > > 
> > > Oops, agreed. Will make it as you suggest.
> > > 
> > > thanks,
> > > 
> > >  - Joel
> > > 
> > > > > +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> > > > > +		}
> > > > > +	}
> > > > > +}
> > > > > +
> > > > > +static unsigned long
> > > > > +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> > > > > +{
> > > > > +	unsigned long count = 0;
> > > > > +	int cpu;
> > > > > +
> > > > > +	/* Snapshot count of all CPUs */
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > > +
> > > > > +		count += atomic_read(&rlp->count);
> > > > > +	}
> > > > > +
> > > > > +	return count;
> > > > > +}
> > > > > +
> > > > > +static unsigned long
> > > > > +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> > > > > +{
> > > > > +	int cpu, freed = 0;
> > > > > +
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > > +		unsigned long count;
> > > > > +
> > > > > +		count = atomic_read(&rlp->count);
> > > > > +		lazy_rcu_flush_cpu(rlp);
> > > > > +		sc->nr_to_scan -= count;
> > > > > +		freed += count;
> > > > > +		if (sc->nr_to_scan <= 0)
> > > > > +			break;
> > > > > +	}
> > > > > +
> > > > > +	return freed == 0 ? SHRINK_STOP : freed;
> > > > 
> > > > This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> > > > potential deadlocks.  But this pattern is common, including on the
> > > > kvfree_rcu() path, so OK!  ;-)
> > > > 
> > > > Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> > > > that as well.
> > > > 
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> > > > > + */
> > > > > +static void lazy_work(struct work_struct *work)
> > > > > +{
> > > > > +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> > > > > +
> > > > > +	lazy_rcu_flush_cpu(rlp);
> > > > > +}
> > > > > +
> > > > > +static struct shrinker lazy_rcu_shrinker = {
> > > > > +	.count_objects = lazy_rcu_shrink_count,
> > > > > +	.scan_objects = lazy_rcu_shrink_scan,
> > > > > +	.batch = 0,
> > > > > +	.seeks = DEFAULT_SEEKS,
> > > > > +};
> > > > > +
> > > > > +void __init rcu_lazy_init(void)
> > > > > +{
> > > > > +	int cpu;
> > > > > +
> > > > > +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> > > > > +
> > > > > +	for_each_possible_cpu(cpu) {
> > > > > +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> > > > > +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> > > > > +	}
> > > > > +
> > > > > +	if (register_shrinker(&lazy_rcu_shrinker))
> > > > > +		pr_err("Failed to register lazy_rcu shrinker!\n");
> > > > > +}
> > > > > diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> > > > > index 24b5f2c2de87..a5f4b44f395f 100644
> > > > > --- a/kernel/rcu/rcu.h
> > > > > +++ b/kernel/rcu/rcu.h
> > > > > @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> > > > >  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> > > > >  #endif
> > > > >  
> > > > > +#ifdef CONFIG_RCU_LAZY
> > > > > +void rcu_lazy_init(void);
> > > > > +#else
> > > > > +static inline void rcu_lazy_init(void) {}
> > > > > +#endif
> > > > >  #endif /* __LINUX_RCU_H */
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index a4c25a6283b0..ebdf6f7c9023 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> > > > >  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > > > >  	else
> > > > >  		qovld_calc = qovld;
> > > > > +
> > > > > +	rcu_lazy_init();
> > > > >  }
> > > > >  
> > > > >  #include "tree_stall.h"
> > > > > -- 
> > > > > 2.36.0.550.gb090851708-goog
> > > > > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-28 17:57           ` Paul E. McKenney
@ 2022-05-30 14:48             ` Joel Fernandes
  2022-05-30 16:42               ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-30 14:48 UTC (permalink / raw)
  To: paulmck; +Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

Hi Paul,

I just set up Thunderbird on an old machine just to reply to LKML.
Apologies if the formatting is weird. Replies below:

On 5/28/22 13:57, Paul E. McKenney wrote:
[..]
>>>>>> +	preempt_enable();
>>>>>> +
>>>>>> +	if (debug_rcu_head_queue((void *)head)) {
>>>>>> +		// Probable double call_rcu(), just leak.
>>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>>>>>> +				__func__, head);
>>>>>> +
>>>>>> +		// Mark as success and leave.
>>>>>> +		return;
>>>>>> +	}
>>>>>> +
>>>>>> +	// Queue to per-cpu llist
>>>>>> +	head->func = func;
>>>>>> +	llist_add(&head->llist_node, &rlp->head);
>>>>>
>>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
>>>>> above and this point, so that the current CPU's list has lots of
>>>>> callbacks, but zero ->count.  Can that cause a problem?
>>>>>
>>>>> In the past, this sort of thing has been an issue for rcu_barrier()
>>>>> and friends.
>>>>
>>>> Thanks, I think I dropped the ball on this. You have given me something to
>>>> think about. I am thinking on first thought that setting the count in advance
>>>> of populating the list should do the trick. I will look more on it.
>>>
>>> That can work, but don't forget the need for proper memory ordering.
>>>
>>> Another approach would be to what the current bypass list does, namely
>>> count the callback in the ->cblist structure.
>>
>> I have been thinking about this, and I also arrived at the same conclusion
>> that it is easier to make it a segcblist structure which has the functions do
>> the memory ordering. Then we can make rcu_barrier() locklessly sample the
>> ->len of the new per-cpu structure, I *think* it might work.
>>
>> However, I was actually considering another situation outside of rcu_barrier,
>> where the preemptions you mentioned would cause the list to appear to be
>> empty if one were to rely on count, but actually have a lot of stuff queued.
>> This causes shrinkers to not know that there is a lot of memory available to
>> free. I don't think its that big an issue, if we can assume the caller's
>> process context to have sufficient priority. Obviously using a raw spinlock
>> makes the process context to be highest priority briefly but without the
>> lock, we don't get that. But yeah that can be sorted by just updating the
>> count proactively, and then doing the queuing operations.
>>
>> Another advantage of using the segcblist structure, is, we can also record
>> the grace period numbers in it, and avoid triggering RCU from the timer if gp
>> already elapsed (similar to the rcu-poll family of functions).
>>
>> WDYT?
> 
> What you are seeing is that the design space for call_rcu_lazy() is huge,
> and the tradeoffs between energy efficiency and complexity are not well
> understood.  It is therefore necessary to quickly evaluate alternatives.
> Yes, you could simply implement all of them and test them, but that
> might take you longer than you would like.  Plus you likely need to
> put in place more testing than rcutorture currently provides.
> 
> Given what we have seen thus far, I think that the first step has to
> be to evaluate what can be obtained with this approach compared with
> what can be obtained with other approaches.  What we have right now,
> more or less, is this:
> 
> o	Offloading buys about 15%.
> 
> o	Slowing grace periods buys about another 7%, but requires
> 	userspace changes to prevent degrading user-visible performance.

Agreed.

> 
> o	The initial call_rcu_lazy() prototype buys about 1%.


Well, that depends on the use case. A busy system saves you less
because it is busy anyway, so RCU being quiet does not help much.

From my cover letter, you can see the idle power savings are a whopping
29% with this patch series. Check the PC10 state residencies and the
package wattage.

When the ChromeOS screen is off and the user is not doing anything on
the system:
Before:
Pk%pc10 = 72.13
PkgWatt = 0.58

After:
Pk%pc10 = 81.28
PkgWatt = 0.41
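(That is, 0.58 W down to 0.41 W, i.e. (0.58 - 0.41) / 0.58 ~= 29% less
package power in the screen-off idle case.)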

> 
> 	True, this is not apples to apples because the first two were
> 	measured on ChromeOS and this one on Android.

Yes, agreed. I think Vlad is not seeing any jaw-dropping savings because
he's testing on ARM. But on Intel we see a lot. Though, I asked Vlad to
also check whether RCU callbacks are indeed not being queued (i.e. are
quiet) after applying the patch; if they are still being queued, there
may still be some call_rcu()s that need conversion to lazy on his setup.
For example, he might have a driver on his ARM platform that's queuing
RCU CBs constantly.

I was thinking that perhaps a debug patch can help quickly nail down
such RCU noise, in the way of a warning. Of course, debug CONFIG should
never be enabled in production, so it would be something along the lines
of how lockdep is used for debugging.
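Very roughly, something along these lines (all of the names here are
made up purely for illustration):

/* Hypothetical debug hook, called from call_rcu() with
 * __builtin_return_address(0): warn (ratelimited) about remaining
 * non-lazy callers while the lazy path is enabled. */
static void rcu_lazy_debug_noisy_caller(void *caller)
{
	if (IS_ENABLED(CONFIG_RCU_LAZY_DEBUG) && READ_ONCE(rcu_lazy_enabled))
		pr_warn_ratelimited("non-lazy call_rcu() from %pS\n", caller);
}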

>Which means that
> 	apples to apples evaluation is also required.  But this is the
> 	information I currently have at hand, and it is probably no more
> 	than a factor of two off of what would be seen on ChromeOS.
> 
> 	Or is there some ChromeOS data that tells a different story?
> 	After all, for all I know, Android might still be expediting
> 	all normal grace periods.
> 
> At which point, the question becomes "how to make up that 7%?" After all,
> it is not likely that anyone is going to leave that much battery lifetime
> on the table.  Here are the possibilities that I currently see:
> 
> o	Slow down grace periods, and also bite the bullet and make
> 	userspace changes to speed up the RCU grace periods during
> 	critical operations.

We tried this, and tracing suffers quite a lot. The system also felt
"sluggish", which I suspect is because of synchronize_rcu() slowdowns in
other paths.

> 
> o	Slow down grace periods, but leverage in-kernel changes for
> 	some of those operations.  For example, RCU uses pm_notifier()
> 	to cause grace periods to be expedited during suspend and

Nice! Did not know this.

> 	hibernation.  The requests for expediting are atomically counted,
> 	so there can be other similar setups.

I do like slowing down grace periods globally, because it's easy, but I
think it can have quite a few ripple effects if in the future a user of
call_rcu() does not expect lazy behavior :( but still gets it. Much like
how synchronize_rcu_mult() suffered back in its day.

> o	Choose a more moderate slowing down of the grace period so as to
> 	minimize (or maybe even eliminate) the need for the aforementioned
> 	userspace changes.  This also leaves battery lifetime on the
> 	table, but considerably less of it.
> 
> o	Implement more selective slowdowns, such as call_rcu_lazy().
> 
> 	Other selective slowdowns include in-RCU heuristics such as
> 	forcing the current grace period to end once the entire
> 	system has gone idle.  (There is prior work detecting
> 	full-system idleness for NO_HZ_FULL, but smaller systems
> 	could use more aggressive techniques.)
> 
> I am sure that there are additional possibilities.  Note that it is
> possible to do more than one of these, if that proves helpful.

Sure!
> But given this, one question is "What is the most you can possibly
> obtain from call_rcu_lazy()?"  If you have a platform with enough
> memory, one way to obtain an upper bound is to make call_rcu_lazy()
> simply leak the memory.  If that amount of savings does not meet the
> need, then other higher-level approaches will be needed, whether
> as alternatives or as additions to the call_rcu_lazy() approach.
> 
> Should call_rcu_lazy() prove to be part of the mix, here are a (very)
> few of the tradeoffs:
> 
> o	Put lazy callbacks on their own list or not?
> 
> 	o	If not, use the bypass list?  If so, is it the
> 		case that call_rcu_lazy() is just call_rcu() on
> 		non-rcu_nocbs CPUs?
> 
> 		Or add the complexity required to use the bypass
> 		list on rcu_nocbs CPUs but to use ->cblist otherwise?

For the bypass list, I am kind of reluctant because the "flush time" of
the bypass list may not match the laziness we seek. For example, I want
to allow the user to configure flushing the CBs only after 15 seconds or
so, if the CB list does not grow too big. That might hurt folks who use
the bypass list but require a quicker response.

Also, does doing so not prevent usage of lazy CBs on systems without
NOCB? So if we want to future-proof this, I guess that might not be a
good decision.

> 
> 	o	If so:
> 
> 		o	Use segmented list with marked grace periods?
> 			Keep in mind that such a list can track only
> 			two grace periods.
> 
> 		o	Use a plain list and have grace-period start
> 			simply drain the list?

I want to use the segmented list, regardless of whether we use the
bypass list or not, because we get those memory barriers and
rcu_barrier() lockless sampling of ->len, for free :).
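Roughly what I have in mind for the per-CPU structure (a sketch,
assuming the existing rcu_segcblist helpers can be reused as-is):

struct rcu_lazy_pcp {
	struct rcu_segcblist cblist;	/* ->len maintained with ordering */
	struct delayed_work work;	/* flush timer */
};

so that rcu_barrier() could locklessly sample rcu_segcblist_n_cbs()
instead of a separate atomic_t count.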

> 
> o	Does call_rcu_lazy() do anything to ensure that additional grace
> 	periods that exist only for the benefit of lazy callbacks are
> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> 	(Keeping in mind that failing to share such grace periods
> 	burns battery lifetime.)

I could be missing your point; can you give an example of how you want
the behavior to be?

I am not thinking of creating separate GP cycles just for lazy CBs. The
GP cycle will be the same one that non-lazy CBs use. A lazy CB will just
record the current or last GP and then wait. Once the timer expires, it
will check the current or last GP again. If a new GP was not started, it
starts one. If a new GP was started but not completed, it can simply
call_rcu(). If a new GP started and completed, it does not start a new
GP and just executes the CB, something like that. Before the flush
happens, if multiple lazy CBs were queued in the meanwhile and the GP
sequence counters moved forward, then all those lazy CBs will share the
same GP (either not starting a new one, or calling call_rcu() to share
the ongoing one).
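In pseudo-C, the flush-time decision for each lazy CB might look like
this (using the existing polled-GP API; everything around it is an
assumption, not settled code):

	/* At call_rcu_lazy() time, stored alongside the CB: */
	gp_snap = get_state_synchronize_rcu();

	/* At flush (timer or shrinker) time, for each queued rhp: */
	if (poll_state_synchronize_rcu(gp_snap))
		rhp->func(rhp);			/* a full GP already elapsed */
	else
		call_rcu(rhp, rhp->func);	/* share the ongoing/next GP */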

> o	When there is ample memory, how long are lazy callbacks allowed
> 	to sleep?  Forever?  If not forever, what feeds into computing
> 	the timeout?

Yeah, the timeout is fixed currently. I agree this is a good thing to
optimize.

> o	What is used to determine when memory is low, so that laziness
> 	has become a net negative?

The assumption I think we make is that the shrinkers will constantly
flush out lazy CBs if there is memory pressure.

> o	What other conditions, if any, should prompt motivating lazy
> 	callbacks?  (See above for the start of a grace period motivating
> 	lazy callbacks.)
> 
> In short, you need to cast your net pretty wide on this one.  It has
> not yet been very carefully explored, so there are likely to be surprises,
> maybe even good surprises.  ;-)

Cool, that sounds like a good opportunity to work on something cool ;-)

Happy memorial day! Off for lunch with some friends and then back to
tinkering.

Thanks,

- Joel

> 
> 							Thanx, Paul
> 
>> thanks,
>>
>>  - Joel
>>
>>>>>> +	// Flush queue if too big
>>>>>
>>>>> You will also need to check for early boot use.
>>>>>
>>>>> I -think- it suffice to simply skip the following "if" statement when
>>>>> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE suffices.  The reason being
>>>>> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
>>>>> kernels won't expire them until the softirq kthreads have been spawned.
>>>>>
>>>>> Which is OK, as it just means that call_rcu_lazy() is a bit more
>>>>> lazy than expected that early.
>>>>>
>>>>> Except that call_rcu() can be invoked even before rcu_init() has been
>>>>> invoked, which is therefore also before rcu_init_lazy() has been invoked.
>>>>
>>>> In other words, you are concerned that too many lazy callbacks might be
>>>> pending before rcu_init() is called?
>>>
>>> In other words, I am concerned that bad things might happen if fields
>>> in a rcu_lazy_pcp structure are used before they are initialized.
>>>
>>> I am not worried about too many lazy callbacks before rcu_init() because
>>> the non-lazy callbacks (which these currently are) are allowed to pile
>>> up until RCU's grace-period kthreads have been spawned.  There might
>>> come a time when RCU callbacks need to be invoked earlier, but that will
>>> be a separate problem to solve when and if, but with the benefit of the
>>> additional information derived from seeing the problem actually happen.
>>>
>>>> I am going through the kfree_rcu() threads/patches involving
>>>> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
>>>> the scheduler is running causes a crash or warnings?
>>>
>>> There are a lot of different issues that arise in different phases
>>> of boot.  In this particular case, my primary concern is the
>>> use-before-initialized bug.
>>>
>>>>> I thefore suggest something like this at the very start of this function:
>>>>>
>>>>> 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
>>>>> 		call_rcu(head_rcu, func);
>>>>>
>>>>> The goal is that people can replace call_rcu() with call_rcu_lazy()
>>>>> without having to worry about invocation during early boot.
>>>>
>>>> Yes, this seems safer. I don't expect much power savings during system boot
>>>> process anyway ;-). I believe perhaps a static branch would work better to
>>>> take a branch out from what is likely a fast path.
>>>
>>> A static branch would be fine.  Maybe encapsulate it in a static inline
>>> function for all such comparisons, but most such comparisons are far from
>>> anything resembling a fastpath, so the main potential benefit would be
>>> added readability.  Which could be a compelling reason in and of itself.
>>>
>>> 							Thanx, Paul
>>>
>>>>> Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
>>>>> though "rhp" is more consistent with the RCU pointer initials approach.
>>>>
>>>> Fixed, thanks.
>>>>
>>>>>> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
>>>>>> +		lazy_rcu_flush_cpu(rlp);
>>>>>> +	} else {
>>>>>> +		if (!delayed_work_pending(&rlp->work)) {
>>>>>
>>>>> This check is racy because the work might run to completion right at
>>>>> this point.  Wouldn't it be better to rely on the internal check of
>>>>> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
>>>>
>>>> Oops, agreed. Will make it as you suggest.
>>>>
>>>> thanks,
>>>>
>>>>  - Joel
>>>>
>>>>>> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
>>>>>> +		}
>>>>>> +	}
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned long
>>>>>> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
>>>>>> +{
>>>>>> +	unsigned long count = 0;
>>>>>> +	int cpu;
>>>>>> +
>>>>>> +	/* Snapshot count of all CPUs */
>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
>>>>>> +
>>>>>> +		count += atomic_read(&rlp->count);
>>>>>> +	}
>>>>>> +
>>>>>> +	return count;
>>>>>> +}
>>>>>> +
>>>>>> +static unsigned long
>>>>>> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
>>>>>> +{
>>>>>> +	int cpu, freed = 0;
>>>>>> +
>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
>>>>>> +		unsigned long count;
>>>>>> +
>>>>>> +		count = atomic_read(&rlp->count);
>>>>>> +		lazy_rcu_flush_cpu(rlp);
>>>>>> +		sc->nr_to_scan -= count;
>>>>>> +		freed += count;
>>>>>> +		if (sc->nr_to_scan <= 0)
>>>>>> +			break;
>>>>>> +	}
>>>>>> +
>>>>>> +	return freed == 0 ? SHRINK_STOP : freed;
>>>>>
>>>>> This is a bit surprising given the stated aim of SHRINK_STOP to indicate
>>>>> potential deadlocks.  But this pattern is common, including on the
>>>>> kvfree_rcu() path, so OK!  ;-)
>>>>>
>>>>> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
>>>>> that as well.
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
>>>>>> + */
>>>>>> +static void lazy_work(struct work_struct *work)
>>>>>> +{
>>>>>> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
>>>>>> +
>>>>>> +	lazy_rcu_flush_cpu(rlp);
>>>>>> +}
>>>>>> +
>>>>>> +static struct shrinker lazy_rcu_shrinker = {
>>>>>> +	.count_objects = lazy_rcu_shrink_count,
>>>>>> +	.scan_objects = lazy_rcu_shrink_scan,
>>>>>> +	.batch = 0,
>>>>>> +	.seeks = DEFAULT_SEEKS,
>>>>>> +};
>>>>>> +
>>>>>> +void __init rcu_lazy_init(void)
>>>>>> +{
>>>>>> +	int cpu;
>>>>>> +
>>>>>> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
>>>>>> +
>>>>>> +	for_each_possible_cpu(cpu) {
>>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
>>>>>> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
>>>>>> +	}
>>>>>> +
>>>>>> +	if (register_shrinker(&lazy_rcu_shrinker))
>>>>>> +		pr_err("Failed to register lazy_rcu shrinker!\n");
>>>>>> +}
>>>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>>>>>> index 24b5f2c2de87..a5f4b44f395f 100644
>>>>>> --- a/kernel/rcu/rcu.h
>>>>>> +++ b/kernel/rcu/rcu.h
>>>>>> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
>>>>>>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
>>>>>>  #endif
>>>>>>  
>>>>>> +#ifdef CONFIG_RCU_LAZY
>>>>>> +void rcu_lazy_init(void);
>>>>>> +#else
>>>>>> +static inline void rcu_lazy_init(void) {}
>>>>>> +#endif
>>>>>>  #endif /* __LINUX_RCU_H */
>>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>>> index a4c25a6283b0..ebdf6f7c9023 100644
>>>>>> --- a/kernel/rcu/tree.c
>>>>>> +++ b/kernel/rcu/tree.c
>>>>>> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
>>>>>>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
>>>>>>  	else
>>>>>>  		qovld_calc = qovld;
>>>>>> +
>>>>>> +	rcu_lazy_init();
>>>>>>  }
>>>>>>  
>>>>>>  #include "tree_stall.h"
>>>>>> -- 
>>>>>> 2.36.0.550.gb090851708-goog
>>>>>>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-17  9:07   ` Uladzislau Rezki
@ 2022-05-30 14:54     ` Joel Fernandes
  2022-06-01 14:12       ` Frederic Weisbecker
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-30 14:54 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: rcu, rushikesh.s.kadam, neeraj.iitr10, frederic, paulmck, rostedt

Hi Vlad,

Thanks for the comments. I replied inline:

On 5/17/22 05:07, Uladzislau Rezki wrote:
>> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
>> index 0cfb009a99b9..8968b330d6e0 100644
>> --- a/kernel/rcu/Makefile
>> +++ b/kernel/rcu/Makefile
>> @@ -16,3 +16,4 @@ obj-$(CONFIG_RCU_REF_SCALE_TEST) += refscale.o
>>  obj-$(CONFIG_TREE_RCU) += tree.o
>>  obj-$(CONFIG_TINY_RCU) += tiny.o
>>  obj-$(CONFIG_RCU_NEED_SEGCBLIST) += rcu_segcblist.o
>> +obj-$(CONFIG_RCU_LAZY) += lazy.o
>> diff --git a/kernel/rcu/lazy.c b/kernel/rcu/lazy.c
>> new file mode 100644
>> index 000000000000..55e406cfc528
>> --- /dev/null
>> +++ b/kernel/rcu/lazy.c
>> @@ -0,0 +1,145 @@
>> +/*
>> + * Lockless lazy-RCU implementation.
>> + */
>> +#include <linux/rcupdate.h>
>> +#include <linux/shrinker.h>
>> +#include <linux/workqueue.h>
>> +#include "rcu.h"
>> +
>> +// How much to batch before flushing?
>> +#define MAX_LAZY_BATCH		2048
>> +
>> +// How much to wait before flushing?
>> +#define MAX_LAZY_JIFFIES	10000
>> +
>> +// We cast lazy_rcu_head to rcu_head and back. This keeps the API simple while
>> +// allowing us to use lockless list node in the head. Also, we use BUILD_BUG_ON
>> +// later to ensure that rcu_head and lazy_rcu_head are of the same size.
>> +struct lazy_rcu_head {
>> +	struct llist_node llist_node;
>> +	void (*func)(struct callback_head *head);
>> +} __attribute__((aligned(sizeof(void *))));
>> +
>> +struct rcu_lazy_pcp {
>> +	struct llist_head head;
>> +	struct delayed_work work;
>> +	atomic_t count;
>> +};
>> +DEFINE_PER_CPU(struct rcu_lazy_pcp, rcu_lazy_pcp_ins);
>> +
>> +// Lockless flush of CPU, can be called concurrently.
>> +static void lazy_rcu_flush_cpu(struct rcu_lazy_pcp *rlp)
>> +{
>> +	struct llist_node *node = llist_del_all(&rlp->head);
>> +	struct lazy_rcu_head *cursor, *temp;
>> +
>> +	if (!node)
>> +		return;
>> +
>> +	llist_for_each_entry_safe(cursor, temp, node, llist_node) {
>> +		struct rcu_head *rh = (struct rcu_head *)cursor;
>> +		debug_rcu_head_unqueue(rh);
>> +		call_rcu(rh, rh->func);
>> +		atomic_dec(&rlp->count);
>> +	}
>> +}
>> +
>> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
>> +{
>> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
>> +	struct rcu_lazy_pcp *rlp;
>> +
>> +	preempt_disable();
>> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
>> +	preempt_enable();
>>
> Can we get rid of such explicit disabling/enabling preemption?

Ok I'll try. Last I checked, something needs to disable preemption to
prevent warnings with sampling the current processor ID.
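One option might be get_cpu_ptr()/put_cpu_ptr(), which bundle the
preemption disabling with the per-CPU dereference and also keep the
enqueue on the sampled CPU (just a sketch, with the debug_rcu_head_queue()
check elided):

	/* Sketch: preemption stays disabled across both the per-CPU
	 * dereference and the llist_add(), instead of a bare
	 * preempt_disable()/preempt_enable() pair around this_cpu_ptr(). */
	rlp = get_cpu_ptr(&rcu_lazy_pcp_ins);
	head->func = func;
	llist_add(&head->llist_node, &rlp->head);
	put_cpu_ptr(&rcu_lazy_pcp_ins);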

>> +
>> +	if (debug_rcu_head_queue((void *)head)) {
>> +		// Probable double call_rcu(), just leak.
>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>> +				__func__, head);
>> +
>> +		// Mark as success and leave.
>> +		return;
>> +	}
>> +
>> +	// Queue to per-cpu llist
>> +	head->func = func;
>> +	llist_add(&head->llist_node, &rlp->head);
>> +
>> +	// Flush queue if too big
>> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
>> +		lazy_rcu_flush_cpu(rlp);
>>
> Can we just schedule the work instead of drawn from the caller context?
> For example it can be a hard-irq context.

You raise a good point. Ok I'll do that. Though if the CB list is small,
I would prefer to do it inline. I will look more into it.
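Perhaps something along these lines (a sketch of just the too-big branch;
mod_delayed_work() pulls an already-queued timer forward so the flush runs
from the workqueue rather than from a possibly hard-irq caller, and the
else branch relies on the internal WORK_STRUCT_PENDING_BIT check):

	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
		/* Too big: flush soon, but from workqueue context. */
		mod_delayed_work(system_wq, &rlp->work, 0);
	} else {
		/* No-op if the delayed work is already pending. */
		schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
	}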

> 
>> +	} else {
>> +		if (!delayed_work_pending(&rlp->work)) {
>> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
>> +		}
>> +	}
>> +}
> EXPORT_SYMBOL_GPL()? to be able to use in kernel modules.
> 

Sure, will fix.
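(For reference, presumably just the usual export next to the definition in
lazy.c:)

	EXPORT_SYMBOL_GPL(call_rcu_lazy);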

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-30 14:48             ` Joel Fernandes
@ 2022-05-30 16:42               ` Paul E. McKenney
  2022-05-31  2:12                 ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-30 16:42 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote:
> Hi Paul,
> 
> I just setup thunderbird on an old machine just to reply to LKML.
> Apologies if the formatting is weird. Replies below:

Looks fine at first glance.  ;-)

> On 5/28/22 13:57, Paul E. McKenney wrote:
> [..]
> >>>>>> +	preempt_enable();
> >>>>>> +
> >>>>>> +	if (debug_rcu_head_queue((void *)head)) {
> >>>>>> +		// Probable double call_rcu(), just leak.
> >>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> >>>>>> +				__func__, head);
> >>>>>> +
> >>>>>> +		// Mark as success and leave.
> >>>>>> +		return;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	// Queue to per-cpu llist
> >>>>>> +	head->func = func;
> >>>>>> +	llist_add(&head->llist_node, &rlp->head);
> >>>>>
> >>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
> >>>>> above and this point, so that the current CPU's list has lots of
> >>>>> callbacks, but zero ->count.  Can that cause a problem?
> >>>>>
> >>>>> In the past, this sort of thing has been an issue for rcu_barrier()
> >>>>> and friends.
> >>>>
> >>>> Thanks, I think I dropped the ball on this. You have given me something to
> >>>> think about. I am thinking on first thought that setting the count in advance
> >>>> of populating the list should do the trick. I will look more on it.
> >>>
> >>> That can work, but don't forget the need for proper memory ordering.
> >>>
> >>> Another approach would be to what the current bypass list does, namely
> >>> count the callback in the ->cblist structure.
> >>
> >> I have been thinking about this, and I also arrived at the same conclusion
> >> that it is easier to make it a segcblist structure which has the functions do
> >> the memory ordering. Then we can make rcu_barrier() locklessly sample the
> >> ->len of the new per-cpu structure, I *think* it might work.
> >>
> >> However, I was actually considering another situation outside of rcu_barrier,
> >> where the preemptions you mentioned would cause the list to appear to be
> >> empty if one were to rely on count, but actually have a lot of stuff queued.
> >> This causes shrinkers to not know that there is a lot of memory available to
> >> free. I don't think its that big an issue, if we can assume the caller's
> >> process context to have sufficient priority. Obviously using a raw spinlock
> >> makes the process context to be highest priority briefly but without the
> >> lock, we don't get that. But yeah that can be sorted by just updating the
> >> count proactively, and then doing the queuing operations.
> >>
> >> Another advantage of using the segcblist structure, is, we can also record
> >> the grace period numbers in it, and avoid triggering RCU from the timer if gp
> >> already elapsed (similar to the rcu-poll family of functions).
> >>
> >> WDYT?
> > 
> > What you are seeing is that the design space for call_rcu_lazy() is huge,
> > and the tradeoffs between energy efficiency and complexity are not well
> > understood.  It is therefore necessary to quickly evaluate alternatives.
> > Yes, you could simply implement all of them and test them, but that
> > might take you longer than you would like.  Plus you likely need to
> > put in place more testing than rcutorture currently provides.
> > 
> > Given what we have seen thus far, I think that the first step has to
> > be to evaluate what can be obtained with this approach compared with
> > what can be obtained with other approaches.  What we have right now,
> > more or less, is this:
> > 
> > o	Offloading buys about 15%.
> > 
> > o	Slowing grace periods buys about another 7%, but requires
> > 	userspace changes to prevent degrading user-visible performance.
> 
> Agreed.
> 
> > 
> > o	The initial call_rcu_lazy() prototype buys about 1%.
> 
> Well, that depends on the usecase. A busy system saves you less because
> it is busy anyway, so RCU being quiet does not help much.
> 
> From my cover letter, you can see the idle power savings is a whopping
> 29% with this patch series. Check the PC10 state residencies and the
> package wattage.
> 
> When ChromeOS screen is off and user is not doing anything on the system:
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41

These numbers do make a better case for this.  On the other hand, that
patch series was missing things like rcu_barrier() support and might also
be converting more call_rcu() invocations to call_rcu_lazy() invocations.
It is therefore necessary to take these numbers with a grain of salt,
much though I hope that they turn out to be realistic.

> > 
> > 	True, this is not apples to apples because the first two were
> > 	measured on ChromeOS and this one on Android.
> 
> Yes agreed. I think Vlad is not seeing any jaw dropping savings, because
> he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to
> also check if rcu callbacks are indeed not being queued / quiet after
> applying the patch, and if not then there may still be some call_rcu()s
> that need conversion to lazy, on his setup. For example, he might have a
> driver on his ARM platform that's queueing RCU CBs constantly.
> 
> I was thinking that perhaps a debug patch can help quickly nail down
> such RCU noise, in the way of a warning. Of course, debug CONFIG should
> never be enabled in production, so it would be something along the lines
> of how lockdep is used for debugging.

The trick would be figuring out which of the callbacks are noise.
I suspect that userspace help would be required.  Or maybe just leverage
the existing event traces and let userspace sort it out.

> >Which means that
> > 	apples to apples evaluation is also required.  But this is the
> > 	information I currently have at hand, and it is probably no more
> > 	than a factor of two off of what would be seen on ChromeOS.
> > 
> > 	Or is there some ChromeOS data that tells a different story?
> > 	After all, for all I know, Android might still be expediting
> > 	all normal grace periods.
> > 
> > At which point, the question becomes "how to make up that 7%?" After all,
> > it is not likely that anyone is going to leave that much battery lifetime
> > on the table.  Here are the possibilities that I currently see:
> > 
> > o	Slow down grace periods, and also bite the bullet and make
> > 	userspace changes to speed up the RCU grace periods during
> > 	critical operations.
> 
> We tried this, and tracing suffers quite a lot. The system also felt
> "sluggish" which I suspect is because of synchronize_rcu() slow downs in
> other paths.

In what way does tracing suffer?  Removing tracepoints?

And yes, this approach absolutely requires finding code paths with
user-visible grace periods.  Which sounds like you tried the "Slow down
grace periods" part but not the "bit the bullet" part.  ;-)

> > o	Slow down grace periods, but leverage in-kernel changes for
> > 	some of those operations.  For example, RCU uses pm_notifier()
> > 	to cause grace periods to be expedited during suspend and
> 
> Nice! Did not know this.
> 
> > 	hibernation.  The requests for expediting are atomically counted,
> > 	so there can be other similar setups.
> 
> I do like slowing down grace periods globally, because its easy, but I
> think it can have quite a few ripple effects if in the future a user of
> call_rcu() does not expect lazy behavior :( but still gets it. Same like
> how synchronize_rcu_mult() suffered back in its day.

No matter what the approach, there are going to be ripple effects.

> > o	Choose a more moderate slowing down of the grace period so as to
> > 	minimize (or maybe even eliminate) the need for the aforementioned
> > 	userspace changes.  This also leaves battery lifetime on the
> > 	table, but considerably less of it.
> > 
> > o	Implement more selective slowdowns, such as call_rcu_lazy().
> > 
> > 	Other selective slowdowns include in-RCU heuristics such as
> > 	forcing the current grace period to end once the entire
> > 	system has gone idle.  (There is prior work detecting
> > 	full-system idleness for NO_HZ_FULL, but smaller systems
> > 	could use more aggressive techniques.)
> > 
> > I am sure that there are additional possibilities.  Note that it is
> > possible to do more than one of these, if that proves helpful.
> 
> Sure!
> > But given this, one question is "What is the most you can possibly
> > obtain from call_rcu_lazy()?"  If you have a platform with enough
> > memory, one way to obtain an upper bound is to make call_rcu_lazy()
> > simply leak the memory.  If that amount of savings does not meet the
> > need, then other higher-level approaches will be needed, whether
> > as alternatives or as additions to the call_rcu_lazy() approach.
> > 
> > Should call_rcu_lazy() prove to be part of the mix, here are a (very)
> > few of the tradeoffs:
> > 
> > o	Put lazy callbacks on their own list or not?
> > 
> > 	o	If not, use the bypass list?  If so, is it the
> > 		case that call_rcu_lazy() is just call_rcu() on
> > 		non-rcu_nocbs CPUs?
> > 
> > 		Or add the complexity required to use the bypass
> > 		list on rcu_nocbs CPUs but to use ->cblist otherwise?
> 
> For bypass list, I am kind of reluctant because the "flush time" of the
> bypass list may not match the laziness we seek? For example, I want to
> allow user to configure to flush the CBs only after 15 seconds or so, if
> the CB list does not grow too big. That might hurt folks using the
> bypass list, but requiring a quicker response.
> 
> Also, does doing so not prevent usage of lazy CBs on systems without
> NOCB? So if we want to future-proof this, I guess that might not be a
> good decision.

True enough, but would this future actually arrive?  After all, if
someone cared enough about energy efficiency to use call_rcu_lazy(),
why wouldn't they also offload callbacks?

On the flush-time mismatch, if there are any non-lazy callbacks in the
list, it costs you nothing to let the lazy callbacks tag along through
the grace period.  So one approach would be to use the current flush
time if there are non-lazy callbacks, but use the longer flush time if
all of the callbacks in the list are lazy callbacks.
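In sketch form, assuming something tracks how many of the bypass callbacks
are lazy (rcu_cblist_n_lazy_cbs() and LAZY_FLUSH_JIFFIES are made-up names
here):

	/* Sketch: pick the bypass flush deadline based on what is queued. */
	static unsigned long nocb_bypass_deadline(struct rcu_data *rdp)
	{
		long n = rcu_cblist_n_cbs(&rdp->nocb_bypass);
		long n_lazy = rcu_cblist_n_lazy_cbs(&rdp->nocb_bypass);	/* made up */

		if (n != n_lazy)
			return jiffies + 1;	/* non-lazy present: flush promptly */
		return jiffies + LAZY_FLUSH_JIFFIES;	/* all lazy: let them sleep */
	}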

> > 	o	If so:
> > 
> > 		o	Use segmented list with marked grace periods?
> > 			Keep in mind that such a list can track only
> > 			two grace periods.
> > 
> > 		o	Use a plain list and have grace-period start
> > 			simply drain the list?
> 
> I want to use the segmented list, regardless of whether we use the
> bypass list or not, because we get those memory barriers and
> rcu_barrier() lockless sampling of ->len, for free :).

The bypass list also gets you the needed memory barriers and lockless
sampling of ->len.  As does any other type of list as long as the
->cblist.len field accounts for the lazy callbacks.

So the main difference is the tracking of grace periods, and even then,
only those grace periods for which this CPU has no non-lazy callbacks.
Or, in the previously discussed case where a single rcuoc kthread serves
multiple CPUs, only those grace periods for which this group of CPUs
has no non-lazy callbacks.

And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
for the system, at least in the common case where there is no callback
flooding?

> > o	Does call_rcu_lazy() do anything to ensure that additional grace
> > 	periods that exist only for the benefit of lazy callbacks are
> > 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> > 	(Keeping in mind that failing to share such grace periods
> > 	burns battery lifetime.)
> 
> I could be missing your point, can you give example of how you want the
> behavior to be?

CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
Given that the full price is being paid anyway, why not also invoke
those 55 lazy callbacks?

If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
again the full price of grace-period computation and rcuoc wakeups is
being paid, so why not also invoke those 26 lazy callbacks?

On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
rcuoc kthread, and with the same numbers of callbacks as in the previous
paragraph, you get to make a choice.  Do you do an extra wakeup for
CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
computation some time in the future?

Suppose further that the CPU that would be awakened for CPU 0's rcuoc
kthread is already non-idle.  At that point, this wakeup is quite cheap.
Except that the wakeup-or-don't choice is made at the beginning of the
grace period, and that CPU's idleness at that point might not be all
that well correlated with its idleness at the time of the wakeup at the
end of that grace period.  Nevertheless, idleness statistics could
help trade off needless from-idle wakeups for needless grace periods.

Which sound like a good reason to consolidate rcuoc processing on
non-busy systems.

But more to the point...

Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
RCU is otherwise idle.  We really want one (not three!) grace periods
to eventually handle those lazy callbacks, right?

> I am not thinking of creating separate GP cycles just for lazy CBs. The
> GP cycle will be the same that non-lazy uses. A lazy CB will just record
> what is the current or last GP, and then wait. Once the timer expires,
> it will check what is the current or last GP. If a new GP was not
> started, then it starts a new GP.

From my perspective, this new GP is in fact a separate GP just for lazy
callbacks.

What am I missing here?

Also, if your timer triggers the check, isn't that an additional potential
wakeup from idle?  Would it make sense to also minimize those wakeups?
(This is one motivation for grace periods to pick up lazy callbacks.)

>                                   If a new GP was started but not
> completed, it can simply call_rcu(). If a new GP started and completed,
> it does not start a new GP and just executes the CB, something like that.
> Before the flush happens, if multiple lazy CBs were queued in the
> meanwhile and the GP seq counters moved forward, then all those lazy CBs
> will share the same GP (either not starting a new one, or calling
> call_rcu() to share the ongoing one, something like that).

You could move the ready-to-invoke callbacks to the end of the DONE
segment of ->cblist, assuming that you also adjust rcu_barrier()
appropriately.  If rcu_barrier() is rare, you could simply flush whatever
lazy list before entraining the rcu_barrier() callback.
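In sketch form, against your current separate per-CPU list
(rcu_barrier_entrain_lazy() is a made-up name, to be called from
rcu_barrier()'s per-CPU loop before the barrier callback is entrained):

	/* Sketch: hand any lazy callbacks to call_rcu() first so the
	 * barrier callback queued afterwards cannot pass them. */
	static void rcu_barrier_entrain_lazy(int cpu)
	{
		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);

		if (atomic_read(&rlp->count))
			lazy_rcu_flush_cpu(rlp);
	}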

> > o	When there is ample memory, how long are lazy callbacks allowed
> > 	to sleep?  Forever?  If not forever, what feeds into computing
> > 	the timeout?
> 
> Yeah, the timeout is fixed currently. I agree this is a good thing to
> optimize.

First see what the behavior is.  If the timeout is long enough that
there is almost never a lazy-only grace period, then maybe a fixed
timeout is good enough.

> > o	What is used to determine when memory is low, so that laziness
> > 	has become a net negative?
> 
> The assumption I think we make is that the shrinkers will constantly
> flush out lazy CBs if there is memory pressure.

It might be worth looking at SPI.  That could help avoid a grace-period
wait when clearing out lazy callbacks.

> > o	What other conditions, if any, should prompt motivating lazy
> > 	callbacks?  (See above for the start of a grace period motivating
> > 	lazy callbacks.)
> > 
> > In short, you need to cast your net pretty wide on this one.  It has
> > not yet been very carefully explored, so there are likely to be surprises,
> > maybe even good surprises.  ;-)
> 
> Cool that sounds like some good opportunity to work on something cool ;-)

Which comes back around to your point of needing evaluation of extra
RCU work.  Lots of tradeoffs between different wakeup sources and grace
periods.  Fine-grained views into the black box will be helpful.  ;-)

> Happy memorial day! Off for lunch with some friends and then back to
> tinkering.

Have a good holiday!

							Thanx, Paul

> Thanks,
> 
> - Joel
> 
> > 
> > 							Thanx, Paul
> > 
> >> thanks,
> >>
> >>  - Joel
> >>
> >>>>>> +	// Flush queue if too big
> >>>>>
> >>>>> You will also need to check for early boot use.
> >>>>>
> >>>>> I -think- it suffices to simply skip the following "if" statement when
> >>>>> rcu_scheduler_active == RCU_SCHEDULER_INACTIVE.  The reason being
> >>>>> that you can queue timeouts at that point, even though CONFIG_PREEMPT_RT
> >>>>> kernels won't expire them until the softirq kthreads have been spawned.
> >>>>>
> >>>>> Which is OK, as it just means that call_rcu_lazy() is a bit more
> >>>>> lazy than expected that early.
> >>>>>
> >>>>> Except that call_rcu() can be invoked even before rcu_init() has been
> >>>>> invoked, which is therefore also before rcu_init_lazy() has been invoked.
> >>>>
> >>>> In other words, you are concerned that too many lazy callbacks might be
> >>>> pending before rcu_init() is called?
> >>>
> >>> In other words, I am concerned that bad things might happen if fields
> >>> in a rcu_lazy_pcp structure are used before they are initialized.
> >>>
> >>> I am not worried about too many lazy callbacks before rcu_init() because
> >>> the non-lazy callbacks (which these currently are) are allowed to pile
> >>> up until RCU's grace-period kthreads have been spawned.  There might
> >>> come a time when RCU callbacks need to be invoked earlier, but that will
> >>> be a separate problem to solve when and if, but with the benefit of the
> >>> additional information derived from seeing the problem actually happen.
> >>>
> >>>> I am going through the kfree_rcu() threads/patches involving
> >>>> RCU_SCHEDULER_RUNNING check, but was the concern that queuing timers before
> >>>> the scheduler is running causes a crash or warnings?
> >>>
> >>> There are a lot of different issues that arise in different phases
> >>> of boot.  In this particular case, my primary concern is the
> >>> use-before-initialized bug.
> >>>
> >>>>> I therefore suggest something like this at the very start of this function:
> >>>>>
> >>>>> 	if (rcu_scheduler_active == RCU_SCHEDULER_INACTIVE)
> >>>>> 		call_rcu(head_rcu, func);
> >>>>>
> >>>>> The goal is that people can replace call_rcu() with call_rcu_lazy()
> >>>>> without having to worry about invocation during early boot.
> >>>>
> >>>> Yes, this seems safer. I don't expect much power savings during system boot
> >>>> process anyway ;-). I believe perhaps a static branch would work better to
> >>>> take a branch out from what is likely a fast path.
> >>>
> >>> A static branch would be fine.  Maybe encapsulate it in a static inline
> >>> function for all such comparisons, but most such comparisons are far from
> >>> anything resembling a fastpath, so the main potential benefit would be
> >>> added readability.  Which could be a compelling reason in and of itself.
> >>>
> >>> 							Thanx, Paul
> >>>
> >>>>> Huh.  "head_rcu"?  Why not "rhp"?  Or follow call_rcu() with "head",
> >>>>> though "rhp" is more consistent with the RCU pointer initials approach.
> >>>>
> >>>> Fixed, thanks.
> >>>>
> >>>>>> +	if (atomic_inc_return(&rlp->count) >= MAX_LAZY_BATCH) {
> >>>>>> +		lazy_rcu_flush_cpu(rlp);
> >>>>>> +	} else {
> >>>>>> +		if (!delayed_work_pending(&rlp->work)) {
> >>>>>
> >>>>> This check is racy because the work might run to completion right at
> >>>>> this point.  Wouldn't it be better to rely on the internal check of
> >>>>> WORK_STRUCT_PENDING_BIT within queue_delayed_work_on()?
> >>>>
> >>>> Oops, agreed. Will make it as you suggest.
> >>>>
> >>>> thanks,
> >>>>
> >>>>  - Joel
> >>>>
> >>>>>> +			schedule_delayed_work(&rlp->work, MAX_LAZY_JIFFIES);
> >>>>>> +		}
> >>>>>> +	}
> >>>>>> +}
> >>>>>> +
> >>>>>> +static unsigned long
> >>>>>> +lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
> >>>>>> +{
> >>>>>> +	unsigned long count = 0;
> >>>>>> +	int cpu;
> >>>>>> +
> >>>>>> +	/* Snapshot count of all CPUs */
> >>>>>> +	for_each_possible_cpu(cpu) {
> >>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> >>>>>> +
> >>>>>> +		count += atomic_read(&rlp->count);
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	return count;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static unsigned long
> >>>>>> +lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
> >>>>>> +{
> >>>>>> +	int cpu, freed = 0;
> >>>>>> +
> >>>>>> +	for_each_possible_cpu(cpu) {
> >>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> >>>>>> +		unsigned long count;
> >>>>>> +
> >>>>>> +		count = atomic_read(&rlp->count);
> >>>>>> +		lazy_rcu_flush_cpu(rlp);
> >>>>>> +		sc->nr_to_scan -= count;
> >>>>>> +		freed += count;
> >>>>>> +		if (sc->nr_to_scan <= 0)
> >>>>>> +			break;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	return freed == 0 ? SHRINK_STOP : freed;
> >>>>>
> >>>>> This is a bit surprising given the stated aim of SHRINK_STOP to indicate
> >>>>> potential deadlocks.  But this pattern is common, including on the
> >>>>> kvfree_rcu() path, so OK!  ;-)
> >>>>>
> >>>>> Plus do_shrink_slab() loops if you don't return SHRINK_STOP, so there is
> >>>>> that as well.
> >>>>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * This function is invoked after MAX_LAZY_JIFFIES timeout.
> >>>>>> + */
> >>>>>> +static void lazy_work(struct work_struct *work)
> >>>>>> +{
> >>>>>> +	struct rcu_lazy_pcp *rlp = container_of(work, struct rcu_lazy_pcp, work.work);
> >>>>>> +
> >>>>>> +	lazy_rcu_flush_cpu(rlp);
> >>>>>> +}
> >>>>>> +
> >>>>>> +static struct shrinker lazy_rcu_shrinker = {
> >>>>>> +	.count_objects = lazy_rcu_shrink_count,
> >>>>>> +	.scan_objects = lazy_rcu_shrink_scan,
> >>>>>> +	.batch = 0,
> >>>>>> +	.seeks = DEFAULT_SEEKS,
> >>>>>> +};
> >>>>>> +
> >>>>>> +void __init rcu_lazy_init(void)
> >>>>>> +{
> >>>>>> +	int cpu;
> >>>>>> +
> >>>>>> +	BUILD_BUG_ON(sizeof(struct lazy_rcu_head) != sizeof(struct rcu_head));
> >>>>>> +
> >>>>>> +	for_each_possible_cpu(cpu) {
> >>>>>> +		struct rcu_lazy_pcp *rlp = per_cpu_ptr(&rcu_lazy_pcp_ins, cpu);
> >>>>>> +		INIT_DELAYED_WORK(&rlp->work, lazy_work);
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	if (register_shrinker(&lazy_rcu_shrinker))
> >>>>>> +		pr_err("Failed to register lazy_rcu shrinker!\n");
> >>>>>> +}
> >>>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> >>>>>> index 24b5f2c2de87..a5f4b44f395f 100644
> >>>>>> --- a/kernel/rcu/rcu.h
> >>>>>> +++ b/kernel/rcu/rcu.h
> >>>>>> @@ -561,4 +561,9 @@ void show_rcu_tasks_trace_gp_kthread(void);
> >>>>>>  static inline void show_rcu_tasks_trace_gp_kthread(void) {}
> >>>>>>  #endif
> >>>>>>  
> >>>>>> +#ifdef CONFIG_RCU_LAZY
> >>>>>> +void rcu_lazy_init(void);
> >>>>>> +#else
> >>>>>> +static inline void rcu_lazy_init(void) {}
> >>>>>> +#endif
> >>>>>>  #endif /* __LINUX_RCU_H */
> >>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >>>>>> index a4c25a6283b0..ebdf6f7c9023 100644
> >>>>>> --- a/kernel/rcu/tree.c
> >>>>>> +++ b/kernel/rcu/tree.c
> >>>>>> @@ -4775,6 +4775,8 @@ void __init rcu_init(void)
> >>>>>>  		qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> >>>>>>  	else
> >>>>>>  		qovld_calc = qovld;
> >>>>>> +
> >>>>>> +	rcu_lazy_init();
> >>>>>>  }
> >>>>>>  
> >>>>>>  #include "tree_stall.h"
> >>>>>> -- 
> >>>>>> 2.36.0.550.gb090851708-goog
> >>>>>>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-30 16:42               ` Paul E. McKenney
@ 2022-05-31  2:12                 ` Joel Fernandes
  2022-05-31  4:26                   ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31  2:12 UTC (permalink / raw)
  To: paulmck; +Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

Hi Paul,

I am sending in one more reply before I sleep ;-)

On 5/30/22 12:42, Paul E. McKenney wrote:
> On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote:
>> Hi Paul,
>>
>> I just setup thunderbird on an old machine just to reply to LKML.
>> Apologies if the formatting is weird. Replies below:
> 
> Looks fine at first glance.  ;-)
> 
>> On 5/28/22 13:57, Paul E. McKenney wrote:
>> [..]
>>>>>>>> +	preempt_enable();
>>>>>>>> +
>>>>>>>> +	if (debug_rcu_head_queue((void *)head)) {
>>>>>>>> +		// Probable double call_rcu(), just leak.
>>>>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
>>>>>>>> +				__func__, head);
>>>>>>>> +
>>>>>>>> +		// Mark as success and leave.
>>>>>>>> +		return;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	// Queue to per-cpu llist
>>>>>>>> +	head->func = func;
>>>>>>>> +	llist_add(&head->llist_node, &rlp->head);
>>>>>>>
>>>>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
>>>>>>> above and this point, so that the current CPU's list has lots of
>>>>>>> callbacks, but zero ->count.  Can that cause a problem?
>>>>>>>
>>>>>>> In the past, this sort of thing has been an issue for rcu_barrier()
>>>>>>> and friends.
>>>>>>
>>>>>> Thanks, I think I dropped the ball on this. You have given me something to
>>>>>> think about. I am thinking on first thought that setting the count in advance
>>>>>> of populating the list should do the trick. I will look more on it.
>>>>>
>>>>> That can work, but don't forget the need for proper memory ordering.
>>>>>
>>>>> Another approach would be to what the current bypass list does, namely
>>>>> count the callback in the ->cblist structure.
>>>>
>>>> I have been thinking about this, and I also arrived at the same conclusion
>>>> that it is easier to make it a segcblist structure which has the functions do
>>>> the memory ordering. Then we can make rcu_barrier() locklessly sample the
>>>> ->len of the new per-cpu structure, I *think* it might work.
>>>>
>>>> However, I was actually considering another situation outside of rcu_barrier,
>>>> where the preemptions you mentioned would cause the list to appear to be
>>>> empty if one were to rely on count, but actually have a lot of stuff queued.
>>>> This causes shrinkers to not know that there is a lot of memory available to
>>>> free. I don't think its that big an issue, if we can assume the caller's
>>>> process context to have sufficient priority. Obviously using a raw spinlock
>>>> makes the process context to be highest priority briefly but without the
>>>> lock, we don't get that. But yeah that can be sorted by just updating the
>>>> count proactively, and then doing the queuing operations.
>>>>
>>>> Another advantage of using the segcblist structure, is, we can also record
>>>> the grace period numbers in it, and avoid triggering RCU from the timer if gp
>>>> already elapsed (similar to the rcu-poll family of functions).
>>>>
>>>> WDYT?
>>>
>>> What you are seeing is that the design space for call_rcu_lazy() is huge,
>>> and the tradeoffs between energy efficiency and complexity are not well
>>> understood.  It is therefore necessary to quickly evaluate alternatives.
>>> Yes, you could simply implement all of them and test them, but that
>>> might take you longer than you would like.  Plus you likely need to
>>> put in place more testing than rcutorture currently provides.
>>>
>>> Given what we have seen thus far, I think that the first step has to
>>> be to evaluate what can be obtained with this approach compared with
>>> what can be obtained with other approaches.  What we have right now,
>>> more or less, is this:
>>>
>>> o	Offloading buys about 15%.
>>>
>>> o	Slowing grace periods buys about another 7%, but requires
>>> 	userspace changes to prevent degrading user-visible performance.
>>
>> Agreed.
>>
>>>
>>> o	The initial call_rcu_lazy() prototype buys about 1%.
>>
>> Well, that depends on the usecase. A busy system saves you less because
>> it is busy anyway, so RCU being quiet does not help much.
>>
>> From my cover letter, you can see the idle power savings is a whopping
>> 29% with this patch series. Check the PC10 state residencies and the
>> package wattage.
>>
>> When ChromeOS screen is off and user is not doing anything on the system:
>> Before:
>> Pk%pc10 = 72.13
>> PkgWatt = 0.58
>>
>> After:
>> Pk%pc10 = 81.28
>> PkgWatt = 0.41
> 
> These numbers do make a better case for this.  On the other hand, that
> patch series was missing things like rcu_barrier() support and might also
> be converting more call_rcu() invocations to call_rcu_lazy() invocations.
> It is therefore necessary to take these numbers with a grain of salt,
> much though I hope that they turn out to be realistic.

Makes sense, thank you. Actually the promising savings showed up when
also doing jiffies_till_first_fqs increases, so that's why I did the
prototype of call_rcu_lazy to prove that we can achieve the same thing
instead of slowing GP for everything. I believe the savings comes from
just not kicking the RCU GP machinery at all.

Agreed on the grain of salt part. Only a test of a final design can tell
us more accurately what the savings are.

>>>
>>> 	True, this is not apples to apples because the first two were
>>> 	measured on ChromeOS and this one on Android.
>>
>> Yes agreed. I think Vlad is not seeing any jaw dropping savings, because
>> he's testing ARM. But on Intel we see a lot. Though, I asked Vlad to
>> also check if rcu callbacks are indeed not being queued / quiet after
>> applying the patch, and if not then there may still be some call_rcu()s
>> that need conversion to lazy, on his setup. For example, he might have a
>> driver on his ARM platform that's queueing RCU CBs constantly.
>>
>> I was thinking that perhaps a debug patch can help quickly nail down
>> such RCU noise, in the way of a warning. Of course, debug CONFIG should
>> never be enabled in production, so it would be something along the lines
>> of how lockdep is used for debugging.
> 
> The trick would be figuring out which of the callbacks are noise.
> I suspect that userspace help would be required.  Or maybe just leverage
> the existing event traces and let userspace sort it out.

Ok.

>>> Which means that
>>> 	apples to apples evaluation is also required.  But this is the
>>> 	information I currently have at hand, and it is probably no more
>>> 	than a factor of two off of what would be seen on ChromeOS.
>>>
>>> 	Or is there some ChromeOS data that tells a different story?
>>> 	After all, for all I know, Android might still be expediting
>>> 	all normal grace periods.
>>>
>>> At which point, the question becomes "how to make up that 7%?" After all,
>>> it is not likely that anyone is going to leave that much battery lifetime
>>> on the table.  Here are the possibilities that I currently see:
>>>
>>> o	Slow down grace periods, and also bite the bullet and make
>>> 	userspace changes to speed up the RCU grace periods during
>>> 	critical operations.
>>
>> We tried this, and tracing suffers quite a lot. The system also felt
>> "sluggish" which I suspect is because of synchronize_rcu() slow downs in
>> other paths.
> 
> In what way does tracing suffer?  Removing tracepoints?

Yes. Starting and stopping the function tracer goes from 5 seconds to
something like 30 seconds.

> And yes, this approach absolutely requires finding code paths with
> user-visible grace periods.  Which sounds like you tried the "Slow down
> grace periods" part but not the "bit the bullet" part.  ;-)

True, I got so scared looking at the performance that I just decided to
play it safe and do selective call_rcu() lazifying, than making the
everything lazy.

>>> But given this, one question is "What is the most you can possibly
>>> obtain from call_rcu_lazy()?"  If you have a platform with enough
>>> memory, one way to obtain an upper bound is to make call_rcu_lazy()
>>> simply leak the memory.  If that amount of savings does not meet the
>>> need, then other higher-level approaches will be needed, whether
>>> as alternatives or as additions to the call_rcu_lazy() approach.
>>>
>>> Should call_rcu_lazy() prove to be part of the mix, here are a (very)
>>> few of the tradeoffs:
>>>
>>> o	Put lazy callbacks on their own list or not?
>>>
>>> 	o	If not, use the bypass list?  If so, is it the
>>> 		case that call_rcu_lazy() is just call_rcu() on
>>> 		non-rcu_nocbs CPUs?
>>>
>>> 		Or add the complexity required to use the bypass
>>> 		list on rcu_nocbs CPUs but to use ->cblist otherwise?
>>
>> For bypass list, I am kind of reluctant because the "flush time" of the
>> bypass list may not match the laziness we seek? For example, I want to
>> allow user to configure to flush the CBs only after 15 seconds or so, if
>> the CB list does not grow too big. That might hurt folks using the
>> bypass list, but requiring a quicker response.
>>
>> Also, does doing so not prevent usage of lazy CBs on systems without
>> NOCB? So if we want to future-proof this, I guess that might not be a
>> good decision.
> 
> True enough, but would this future actually arrive?  After all, if
> someone cared enough about energy efficiency to use call_rcu_lazy(),
> why wouldn't they also offload callbacks?

I am not sure, but I also don't mind making it depend on NOCB for now
(see below).
> On the flush-time mismatch, if there are any non-lazy callbacks in the
> list, it costs you nothing to let the lazy callbacks tag along through
> the grace period.  So one approach would be to use the current flush
> time if there are non-lazy callbacks, but use the longer flush time if
> all of the callbacks in the list are lazy callbacks.

Cool!

>>> 	o	If so:
>>>
>>> 		o	Use segmented list with marked grace periods?
>>> 			Keep in mind that such a list can track only
>>> 			two grace periods.
>>>
>>> 		o	Use a plain list and have grace-period start
>>> 			simply drain the list?
>>
>> I want to use the segmented list, regardless of whether we use the
>> bypass list or not, because we get those memory barriers and
>> rcu_barrier() lockless sampling of ->len, for free :).
> 
> The bypass list also gets you the needed memory barriers and lockless
> sampling of ->len.  As does any other type of list as long as the
> ->cblist.len field accounts for the lazy callbacks.
> 
> So the main difference is the tracking of grace periods, and even then,
> only those grace periods for which this CPU has no non-lazy callbacks.
> Or, in the previously discussed case where a single rcuoc kthread serves
> multiple CPUs, only those grace periods for which this group of CPUs
> has no non-lazy callbacks.

Unless we assume that everything in the bypass list is lazy, can we
really use it to track both lazy and non-lazy CBs, and grace periods?

Currently the struct looks like this:

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};

So now, it would need to be like this?

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
        struct rcu_head *lazy_head;
        struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
        unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
        atomic_long_t lazy_len;
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};
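Or, going by your point above that any list works as long as ->cblist.len
accounts for the lazy callbacks, maybe the minimal addition is just a lazy
count (a sketch; lazy_len is a made-up field):

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
        atomic_long_t lazy_len;	/* made-up: lazy CBs already counted in len */
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};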


> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
> for the system, at least in the common case where there is no callback
> flooding?

When Rushikesh tried to reduce the number of callback threads, he did
not see much improvement in power. I don't think the additional wakeups
of extra rcuoc threads make too much difference - the big power
improvement comes from not even kicking rcu_preempt / rcuog threads...

>>> o	Does call_rcu_lazy() do anything to ensure that additional grace
>>> 	periods that exist only for the benefit of lazy callbacks are
>>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
>>> 	(Keeping in mind that failing to share such grace periods
>>> 	burns battery lifetime.)
>>
>> I could be missing your point, can you give example of how you want the
>> behavior to be?
> 
> CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
> Given that the full price is being paid anyway, why not also invoke
> those 55 lazy callbacks?
> 
> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
> again the full price of grace-period computation and rcuoc wakeups is
> being paid, so why not also invoke those 26 lazy callbacks?
> 
> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
> rcuoc kthread, and with the same numbers of callbacks as in the previous
> paragraph, you get to make a choice.  Do you do an extra wakeup for
> CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
> computation some time in the future?
> 
> Suppose further that the CPU that would be awakened for CPU 0's rcuoc
> kthread is already non-idle.  At that point, this wakeup is quite cheap.
> Except that the wakeup-or-don't choice is made at the beginning of the
> grace period, and that CPU's idleness at that point might not be all
> that well correlated with its idleness at the time of the wakeup at the
> end of that grace period.  Nevertheless, idleness statistics could
> help trade off needless from-idle wakeups for needless grace periods.
> 
> Which sound like a good reason to consolidate rcuoc processing on
> non-busy systems.
> 
> But more to the point...
> 
> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
> RCU is otherwise idle.  We really want one (not three!) grace periods
> to eventually handle those lazy callbacks, right?

Yes. I think I understand you now, yes we do want to reduce the number
of grace periods. But I think that is an optimization which is not
strictly necessary to get the power savings this patch series
demonstrates. To get the power savings shown here, we need all the RCU
threads to be quiet as much as possible, and not start the rcu_preempt
thread's state machine until necessary.

>> I am not thinking of creating separate GP cycles just for lazy CBs. The
>> GP cycle will be the same that non-lazy uses. A lazy CB will just record
>> what is the current or last GP, and then wait. Once the timer expires,
>> it will check what is the current or last GP. If a new GP was not
>> started, then it starts a new GP.
> 
> From my perspective, this new GP is in fact a separate GP just for lazy
> callbacks.
> 
> What am I missing here?

I think my thought for power savings is slightly different from yours. I
see lots of power savings when not even going to RCU when the system is idle
- that appears to be a lot like what the bypass list does - it avoids
call_rcu() completely and does an early exit:

        if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
                return; // Enqueued onto ->nocb_bypass, so just leave.

I think amortizing the cost of grace-period computation across lazy CBs may
give power savings, but that is secondary to the goal of not even poking
RCU (which an early exit like the above does). So maybe we can just talk
about that goal first?

> 
> Also, if your timer triggers the check, isn't that an additional potential
> wakeup from idle?  Would it make sense to also minimize those wakeups?
> (This is one motivation for grace periods to pick up lazy callbacks.)
> 
>>                                   If a new GP was started but not
>> completed, it can simply call_rcu(). If a new GP started and completed,
>> it does not start a new GP and just executes the CB, something like that.
>> Before the flush happens, if multiple lazy CBs were queued in the
>> meanwhile and the GP seq counters moved forward, then all those lazy CBs
>> will share the same GP (either not starting a new one, or calling
>> call_rcu() to share the ongoing one, something like that).
> 
> You could move the ready-to-invoke callbacks to the end of the DONE
> segment of ->cblist, assuming that you also adjust rcu_barrier()
> appropriately.  If rcu_barrier() is rare, you could simply flush whatever
> lazy list before entraining the rcu_barrier() callback.

Got it.

>>> o	When there is ample memory, how long are lazy callbacks allowed
>>> 	to sleep?  Forever?  If not forever, what feeds into computing
>>> 	the timeout?
>>
>> Yeah, the timeout is fixed currently. I agree this is a good thing to
>> optimize.
> 
> First see what the behavior is.  If the timeout is long enough that
> there is almost never a lazy-only grace period, then maybe a fixed
> timeout is good enough.

Ok.

>>> o	What is used to determine when memory is low, so that laziness
>>> 	has become a net negative?
>>
>> The assumption I think we make is that the shrinkers will constantly
>> flush out lazy CBs if there is memory pressure.
> 
> It might be worth looking at SPI.  That could help avoid a grace-period
> wait when clearing out lazy callbacks.

You mean PSI?

> 
>>> o	What other conditions, if any, should prompt motivating lazy
>>> 	callbacks?  (See above for the start of a grace period motivating
>>> 	lazy callbacks.)
>>>
>>> In short, you need to cast your net pretty wide on this one.  It has
>>> not yet been very carefully explored, so there are likely to be surprises,
>>> maybe even good surprises.  ;-)
>>
>> Cool that sounds like some good opportunity to work on something cool ;-)
> 
> Which comes back around to your point of needing evaluation of extra
> RCU work.  Lots of tradeoffs between different wakeup sources and grace
> periods.  Fine-grained views into the black box will be helpful.  ;-)

Thanks for the conversations. I very much like the idea of using the
bypass list, but I am not sure the current segcblist structure will
support both lazy and non-lazy CBs. Basically, if it contains both
call_rcu() and call_rcu_lazy() CBs, that means either all of them are
lazy or all of them are not, right? Unless we are also talking about
making additional changes to the rcu_segcblist struct to accommodate
both types of CBs.

I don't mind admitting to my illiteracy about callback offloading and
bypass lists, but using bypass lists for this problem might improve my
literacy nonetheless so that does motivate me to use them if you can
help me out with the above questions :D

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31  2:12                 ` Joel Fernandes
@ 2022-05-31  4:26                   ` Paul E. McKenney
  2022-05-31 16:11                     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-31  4:26 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Mon, May 30, 2022 at 10:12:05PM -0400, Joel Fernandes wrote:
> Hi Paul,
> 
> I am sending in one more reply before I sleep ;-)

Likewise.  ;-)

> On 5/30/22 12:42, Paul E. McKenney wrote:
> > On Mon, May 30, 2022 at 10:48:26AM -0400, Joel Fernandes wrote:
> >> Hi Paul,
> >>
> >> I just setup thunderbird on an old machine just to reply to LKML.
> >> Apologies if the formatting is weird. Replies below:
> > 
> > Looks fine at first glance.  ;-)
> > 
> >> On 5/28/22 13:57, Paul E. McKenney wrote:
> >> [..]
> >>>>>>>> +	preempt_enable();
> >>>>>>>> +
> >>>>>>>> +	if (debug_rcu_head_queue((void *)head)) {
> >>>>>>>> +		// Probable double call_rcu(), just leak.
> >>>>>>>> +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> >>>>>>>> +				__func__, head);
> >>>>>>>> +
> >>>>>>>> +		// Mark as success and leave.
> >>>>>>>> +		return;
> >>>>>>>> +	}
> >>>>>>>> +
> >>>>>>>> +	// Queue to per-cpu llist
> >>>>>>>> +	head->func = func;
> >>>>>>>> +	llist_add(&head->llist_node, &rlp->head);
> >>>>>>>
> >>>>>>> Suppose that there are a bunch of preemptions between the preempt_enable()
> >>>>>>> above and this point, so that the current CPU's list has lots of
> >>>>>>> callbacks, but zero ->count.  Can that cause a problem?
> >>>>>>>
> >>>>>>> In the past, this sort of thing has been an issue for rcu_barrier()
> >>>>>>> and friends.
> >>>>>>
> >>>>>> Thanks, I think I dropped the ball on this. You have given me something to
> >>>>>> think about. I am thinking on first thought that setting the count in advance
> >>>>>> of populating the list should do the trick. I will look more on it.
> >>>>>
> >>>>> That can work, but don't forget the need for proper memory ordering.
> >>>>>
> >>>>> Another approach would be to what the current bypass list does, namely
> >>>>> count the callback in the ->cblist structure.
> >>>>
> >>>> I have been thinking about this, and I also arrived at the same conclusion
> >>>> that it is easier to make it a segcblist structure which has the functions do
> >>>> the memory ordering. Then we can make rcu_barrier() locklessly sample the
> >>>> ->len of the new per-cpu structure, I *think* it might work.
> >>>>
> >>>> However, I was actually considering another situation outside of rcu_barrier,
> >>>> where the preemptions you mentioned would cause the list to appear to be
> >>>> empty if one were to rely on count, but actually have a lot of stuff queued.
> >>>> This causes shrinkers to not know that there is a lot of memory available to
> >>>> free. I don't think its that big an issue, if we can assume the caller's
> >>>> process context to have sufficient priority. Obviously using a raw spinlock
> >>>> makes the process context to be highest priority briefly but without the
> >>>> lock, we don't get that. But yeah that can be sorted by just updating the
> >>>> count proactively, and then doing the queuing operations.
> >>>>
> >>>> Another advantage of using the segcblist structure, is, we can also record
> >>>> the grace period numbers in it, and avoid triggering RCU from the timer if gp
> >>>> already elapsed (similar to the rcu-poll family of functions).
> >>>>
> >>>> WDYT?
> >>>
> >>> What you are seeing is that the design space for call_rcu_lazy() is huge,
> >>> and the tradeoffs between energy efficiency and complexity are not well
> >>> understood.  It is therefore necessary to quickly evaluate alternatives.
> >>> Yes, you could simply implement all of them and test them, but that
> >>> might take you longer than you would like.  Plus you likely need to
> >>> put in place more testing than rcutorture currently provides.
> >>>
> >>> Given what we have seen thus far, I think that the first step has to
> >>> be to evaluate what can be obtained with this approach compared with
> >>> what can be obtained with other approaches.  What we have right now,
> >>> more or less, is this:
> >>>
> >>> o	Offloading buys about 15%.
> >>>
> >>> o	Slowing grace periods buys about another 7%, but requires
> >>> 	userspace changes to prevent degrading user-visible performance.
> >>
> >> Agreed.
> >>
> >>>
> >>> o	The initial call_rcu_lazy() prototype buys about 1%.
> >>
> >> Well, that depends on the usecase. A busy system saves you less because
> >> it is busy anyway, so RCU being quiet does not help much.
> >>
> >> From my cover letter, you can see the idle power savings is a whopping
> >> 29% with this patch series. Check the PC10 state residencies and the
> >> package wattage.
> >>
> >> When ChromeOS screen is off and user is not doing anything on the system:
> >> Before:
> >> Pk%pc10 = 72.13
> >> PkgWatt = 0.58
> >>
> >> After:
> >> Pk%pc10 = 81.28
> >> PkgWatt = 0.41
> > 
> > These numbers do make a better case for this.  On the other hand, that
> > patch series was missing things like rcu_barrier() support and might also
> > be converting more call_rcu() invocations to call_rcu_lazy() invocations.
> > It is therefore necessary to take these numbers with a grain of salt,
> > much though I hope that they turn out to be realistic.
> 
> Makes sense, thank you. Actually the promising savings showed up when
> also doing jiffies_till_first_fqs increases, so that's why I did the
> prototype of call_rcu_lazy to prove that we can achieve the same thing
> instead of slowing GPs for everything. I believe the savings come from
> just not kicking the RCU GP machinery at all.

Agreed on reducing the number of grace periods.  After all, that is the
entire purpose of the earlier boot-parameter changes that slowed down
the grace periods.

But the other side of this same coin is that if the grace-period machinery
is being kicked anyway by some non-lazy callback, then making the lazy
callbacks take advantage of that forced grace period might well save
an entire grace period later on.  Take advantage of the one that is
happening anyway to possibly avoid having to do a full grace period
solely for the benefit of lazy callbacks later on.

> Agreed on the grain of salt part. Only a test of a final design can tell
> us more accurately what the savings are.

With careful experiment design, we can get bounds earlier.

> >>>
> >>> 	True, this is not apples to apples because the first two were
> >>> 	measured on ChromeOS and this one on Android.
> >>
> >> Yes, agreed. I think Vlad is not seeing any jaw-dropping savings because
> >> he's testing on ARM. But on Intel we see a lot. Still, I asked Vlad to
> >> also check whether RCU callbacks are indeed quiet (not being queued) after
> >> applying the patch; if not, then there may still be some call_rcu()s on
> >> his setup that need conversion to lazy. For example, he might have a
> >> driver on his ARM platform that's queuing RCU CBs constantly.
> >>
> >> I was thinking that perhaps a debug patch can help quickly nail down
> >> such RCU noise, in the way of a warning. Of course, debug CONFIG should
> >> never be enabled in production, so it would be something along the lines
> >> of how lockdep is used for debugging.
> > 
> > The trick would be figuring out which of the callbacks are noise.
> > I suspect that userspace help would be required.  Or maybe just leverage
> > the existing event traces and let userspace sort it out.
> 
> Ok.
> 
> >>> Which means that
> >>> 	apples to apples evaluation is also required.  But this is the
> >>> 	information I currently have at hand, and it is probably no more
> >>> 	than a factor of two off of what would be seen on ChromeOS.
> >>>
> >>> 	Or is there some ChromeOS data that tells a different story?
> >>> 	After all, for all I know, Android might still be expediting
> >>> 	all normal grace periods.
> >>>
> >>> At which point, the question becomes "how to make up that 7%?" After all,
> >>> it is not likely that anyone is going to leave that much battery lifetime
> >>> on the table.  Here are the possibilities that I currently see:
> >>>
> >>> o	Slow down grace periods, and also bite the bullet and make
> >>> 	userspace changes to speed up the RCU grace periods during
> >>> 	critical operations.
> >>
> >> We tried this, and tracing suffers quite a lot. The system also felt
> >> "sluggish", which I suspect is because of synchronize_rcu() slowdowns in
> >> other paths.
> > 
> > In what way does tracing suffer?  Removing tracepoints?
> 
> Yes. Starting and stopping the function tracer goes from 5 seconds to
> something like 30 seconds.
> 
> > And yes, this approach absolutely requires finding code paths with
> > user-visible grace periods.  Which sounds like you tried the "Slow down
> > grace periods" part but not the "bit the bullet" part.  ;-)
> 
> True, I got so scared looking at the performance that I just decided to
> play it safe and do selective call_rcu() lazifying, rather than making
> everything lazy.

For what definition of "safe", exactly?  ;-)

> >>> But given this, one question is "What is the most you can possibly
> >>> obtain from call_rcu_lazy()?"  If you have a platform with enough
> >>> memory, one way to obtain an upper bound is to make call_rcu_lazy()
> >>> simply leak the memory.  If that amount of savings does not meet the
> >>> need, then other higher-level approaches will be needed, whether
> >>> as alternatives or as additions to the call_rcu_lazy() approach.
> >>>
> >>> Should call_rcu_lazy() prove to be part of the mix, here are a (very)
> >>> few of the tradeoffs:
> >>>
> >>> o	Put lazy callbacks on their own list or not?
> >>>
> >>> 	o	If not, use the bypass list?  If so, is it the
> >>> 		case that call_rcu_lazy() is just call_rcu() on
> >>> 		non-rcu_nocbs CPUs?
> >>>
> >>> 		Or add the complexity required to use the bypass
> >>> 		list on rcu_nocbs CPUs but to use ->cblist otherwise?
> >>
> >> For bypass list, I am kind of reluctant because the "flush time" of the
> >> bypass list may not match the laziness we seek? For example, I want to
> >> allow user to configure to flush the CBs only after 15 seconds or so, if
> >> the CB list does not grow too big. That might hurt folks who use the
> >> bypass list but require a quicker response.
> >>
> >> Also, does doing so not prevent usage of lazy CBs on systems without
> >> NOCB? So if we want to future-proof this, I guess that might not be a
> >> good decision.
> > 
> > True enough, but would this future actually arrive?  After all, if
> > someone cared enough about energy efficiency to use call_rcu_lazy(),
> > why wouldn't they also offload callbacks?
> 
> I am not sure, but I also don't mind making it depend on NOCB for now
> (see below).

Very good.

> > On the flush-time mismatch, if there are any non-lazy callbacks in the
> > list, it costs you nothing to let the lazy callbacks tag along through
> > the grace period.  So one approach would be to use the current flush
> > time if there are non-lazy callbacks, but use the longer flush time if
> > all of the callbacks in the list are lazy callbacks.
> 
> Cool!
> 
> >>> 	o	If so:
> >>>
> >>> 		o	Use segmented list with marked grace periods?
> >>> 			Keep in mind that such a list can track only
> >>> 			two grace periods.
> >>>
> >>> 		o	Use a plain list and have grace-period start
> >>> 			simply drain the list?
> >>
> >> I want to use the segmented list, regardless of whether we use the
> >> bypass list or not, because we get those memory barriers and
> >> rcu_barrier() lockless sampling of ->len, for free :).
> > 
> > The bypass list also gets you the needed memory barriers and lockless
> > sampling of ->len.  As does any other type of list as long as the
> > ->cblist.len field accounts for the lazy callbacks.
> > 
> > So the main difference is the tracking of grace periods, and even then,
> > only those grace periods for which this CPU has no non-lazy callbacks.
> > Or, in the previously discussed case where a single rcuoc kthread serves
> > multiple CPUs, only those grace periods for which this group of CPUs
> > has no non-lazy callbacks.
> 
> Unless we assume that everything in the bypass list is lazy, can we
> really use it to track both lazy and non lazy CBs, and grace periods?

Why not just count the number of lazy callbacks in the bypass list?  There
is already a count of the total number of callbacks in ->nocb_bypass.len.
This list is protected by a lock, so protect the count of lazy callbacks
with that same lock.  Then just compare the number of lazy callbacks to
rcu_cblist_n_cbs(&rdp->nocb_bypass).  If they are equal, all callbacks
are lazy.
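
As a rough sketch (the nocb_bypass_all_lazy() name and the ->lazy_len field
are assumptions rather than existing kernel symbols; rcu_cblist_n_cbs() and
->nocb_bypass_lock do exist):

static bool nocb_bypass_all_lazy(struct rcu_data *rdp)
{
        lockdep_assert_held(&rdp->nocb_bypass_lock);

        /* All callbacks are lazy iff the lazy count equals the total. */
        return rdp->lazy_len &&
               rdp->lazy_len == rcu_cblist_n_cbs(&rdp->nocb_bypass);
}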

What am I missing here?

> Currently the struct looks like this:
> 
> struct rcu_segcblist {
>         struct rcu_head *head;
>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> #ifdef CONFIG_RCU_NOCB_CPU
>         atomic_long_t len;
> #else
>         long len;
> #endif
>         long seglen[RCU_CBLIST_NSEGS];
>         u8 flags;
> };
> 
> So now, it would need to be like this?
> 
> struct rcu_segcblist {
>         struct rcu_head *head;
>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> #ifdef CONFIG_RCU_NOCB_CPU
>         struct rcu_head *lazy_head;
>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
>         atomic_long_t lazy_len;
> #else
>         long len;
> #endif
>         long seglen[RCU_CBLIST_NSEGS];
>         u8 flags;
> };

I freely confess that I am not loving this arrangement.  Large increase
in state space, but little benefit that I can see.  Again, what am I
missing here?

> > And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
> > for the system, at least in the common case where there is no callback
> > flooding?
> 
> When Rushikesh tried to reduce the number of callback threads, he did
> not see much improvement in power. I don't think the additional wake ups
> of extra rcuoc threads makes too much difference - the big power
> improvement comes from not even kicking rcu_preempt / rcuog threads...

I suspect that this might come into play later on.  It is often the case
that a given technique's effectiveness depends on the starting point.

> >>> o	Does call_rcu_lazy() do anything to ensure that additional grace
> >>> 	periods that exist only for the benefit of lazy callbacks are
> >>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> >>> 	(Keeping in mind that failing to share such grace periods
> >>> 	burns battery lifetime.)
> >>
> >> I could be missing your point, can you give example of how you want the
> >> behavior to be?
> > 
> > CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
> > does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
> > Given that the full price is being paid anyway, why not also invoke
> > those 55 lazy callbacks?
> > 
> > If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
> > lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
> > again the full price of grace-period computation and rcuoc wakeups is
> > being paid, so why not also invoke those 26 lazy callbacks?
> > 
> > On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
> > rcuoc kthread, and with the same numbers of callbacks as in the previous
> > paragraph, you get to make a choice.  Do you do an extra wakeup for
> > CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
> > computation some time in the future?
> > 
> > Suppose further that the CPU that would be awakened for CPU 0's rcuoc
> > kthread is already non-idle.  At that point, this wakeup is quite cheap.
> > Except that the wakeup-or-don't choice is made at the beginning of the
> > grace period, and that CPU's idleness at that point might not be all
> > that well correlated with its idleness at the time of the wakeup at the
> > end of that grace period.  Nevertheless, idleness statistics could
> > help trade off needless from-idle wakeups for needless grace periods.
> > 
> > Which sounds like a good reason to consolidate rcuoc processing on
> > non-busy systems.
> > 
> > But more to the point...
> > 
> > Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
> > RCU is otherwise idle.  We really want one (not three!) grace periods
> > to eventually handle those lazy callbacks, right?
> 
> Yes. I think I understand you now, yes we do want to reduce the number
> of grace periods. But I think that is an optimization which is not
> strictly necessary to get the power savings this patch series
> demonstrates. To get the power savings shown here, we need all the RCU
> threads to be quiet as much as possible, and not start the rcu_preempt
> thread's state machine until necessary.

I believe that it will be both an optimization and a simplification.
Again, if you have to start the grace-period machinery anyway, you might
as well take care of the lazy callbacks while you are at it.

> >> I am not thinking of creating separate GP cycles just for lazy CBs. The
> >> GP cycle will be the same that non-lazy uses. A lazy CB will just record
> >> what is the current or last GP, and then wait. Once the timer expires,
> >> it will check what is the current or last GP. If a new GP was not
> >> started, then it starts a new GP.
> > 
> > From my perspective, this new GP is in fact a separate GP just for lazy
> > callbacks.
> > 
> > What am I missing here?
> 
> I think my thought for power savings is slightly different from yours. I
> see lots of power savings when not even going to RCU when system is idle
> - that appears to be a lot like what the bypass list does - it avoids
> call_rcu() completely and does an early exit:
> 
>         if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>                 return; // Enqueued onto ->nocb_bypass, so just leave.
> 
> I think amortizing cost of grace period computation across lazy CBs may
> give power savings, but that is secondary to the goal of not even poking
> RCU (which an early exit like the above does). So maybe we can just talk
> about that goal first?

You completely lost me on this one.  My point is not to amortize the cost
of grace-period computation across the lazy CBs.  My point is instead
to completely avoid doing any extra grace-period computation for the
lazy CBs in cases where grace-period computations are happening anyway.
Of course, if there are no non-lazy callbacks, then there won't be
any happening-anyway grace periods, and then, yes, there will eventually
need to be a grace period specially for the benefit of the lazy
callbacks.

And taking this piecewise is unlikely to give good results, especially
given that workloads evolve over time, much though I understand your
desire for doing this one piece at a time.

> > Also, if your timer triggers the check, isn't that an additional potential
> > wakeup from idle?  Would it make sense to also minimize those wakeups?
> > (This is one motivation for grace periods to pick up lazy callbacks.)
> > 
> >>                                   If a new GP was started but not
> >> completed, it can simply call_rcu(). If a new GP started and completed,
> >> it does not start new GP and just executes CB, something like that.
> >> Before the flush happens, if multiple lazy CBs were queued in the
> >> meanwhile and the GP seq counters moved forward, then all those lazy CBs
> >> will share the same GP (either not starting a new one, or calling
> >> call_rcu() to share the ongoing one, something like that).
> > 
> > You could move the ready-to-invoke callbacks to the end of the DONE
> > segment of ->cblist, assuming that you also adjust rcu_barrier()
> > appropriately.  If rcu_barrier() is rare, you could simply flush whatever
> > lazy list before entraining the rcu_barrier() callback.
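
As a sketch of that simpler option (both helper names below are assumptions
standing in for whatever the real flush and entrain paths end up being):

static void rcu_barrier_one_cpu(struct rcu_data *rdp)
{
        /* Flush this CPU's lazy callbacks into ->cblist first... */
        flush_lazy_cbs_to_cblist(rdp);
        /* ...so the barrier callback queued here waits for them too. */
        entrain_barrier_cb(rdp);
}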
> 
> Got it.
> 
> >>> o	When there is ample memory, how long are lazy callbacks allowed
> >>> 	to sleep?  Forever?  If not forever, what feeds into computing
> >>> 	the timeout?
> >>
> >> Yeah, the timeout is fixed currently. I agree this is a good thing to
> >> optimize.
> > 
> > First see what the behavior is.  If the timeout is long enough that
> > there is almost never a lazy-only grace period, then maybe a fixed
> > timeout is good enough.
> 
> Ok.
> 
> >>> o	What is used to determine when memory is low, so that laziness
> >>> 	has become a net negative?
> >>
> >> The assumption I think we make is that the shrinkers will constantly
> >> flush out lazy CBs if there is memory pressure.
> > 
> > It might be worth looking at SPI.  That could help avoid a grace-period
> > wait when clearing out lazy callbacks.
> 
> You mean PSI?

Yes, pressure stall information.  Apologies for my confusion!

> >>> o	What other conditions, if any, should prompt motivating lazy
> >>> 	callbacks?  (See above for the start of a grace period motivating
> >>> 	lazy callbacks.)
> >>>
> >>> In short, you need to cast your net pretty wide on this one.  It has
> >>> not yet been very carefully explored, so there are likely to be surprises,
> >>> maybe even good surprises.  ;-)
> >>
> >> Cool that sounds like some good opportunity to work on something cool ;-)
> > 
> > Which comes back around to your point of needing evaluation of extra
> > RCU work.  Lots of tradeoffs between different wakeup sources and grace
> > periods.  Fine-grained views into the black box will be helpful.  ;-)
> 
> Thanks for the conversations. I very much like the idea of using bypass
> list, but I am not sure if the current segcblist structure will support
> both lazy and non-lazy CBs. Basically, if it contains both call_rcu()
> and call_rcu_lazy() CBs, that means either all of them are lazy or all
> of them are not right? Unless we are also talking about making
> additional changes to rcu_segcblist struct to accommodate both types of CBs.

Do you need to do anything different to lazy callbacks once they have
been associated with a grace period that has been started?  Your current
patch treats them the same.  If that approach continues to work, then
why do you need to track laziness downstream?

> I don't mind admitting to my illiteracy about callback offloading and
> bypass lists, but using bypass lists for this problems might improve my
> literacy nonetheless so that does motivate me to use them if you can
> help me out with the above questions :D

More detail on the grace-period machinery will also be helpful to you.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31  4:26                   ` Paul E. McKenney
@ 2022-05-31 16:11                     ` Joel Fernandes
  2022-05-31 16:45                       ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 16:11 UTC (permalink / raw)
  To: paulmck; +Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On 5/31/22 00:26, Paul E. McKenney wrote:
[...]
>>>>> Which means that
>>>>> 	apples to apples evaluation is also required.  But this is the
>>>>> 	information I currently have at hand, and it is probably no more
>>>>> 	than a factor of two off of what would be seen on ChromeOS.
>>>>>
>>>>> 	Or is there some ChromeOS data that tells a different story?
>>>>> 	After all, for all I know, Android might still be expediting
>>>>> 	all normal grace periods.
>>>>>
>>>>> At which point, the question becomes "how to make up that 7%?" After all,
>>>>> it is not likely that anyone is going to leave that much battery lifetime
>>>>> on the table.  Here are the possibilities that I currently see:
>>>>>
>>>>> o	Slow down grace periods, and also bite the bullet and make
>>>>> 	userspace changes to speed up the RCU grace periods during
>>>>> 	critical operations.
>>>>
>>>> We tried this, and tracing suffers quite a lot. The system also felt
>>>> "sluggish", which I suspect is because of synchronize_rcu() slowdowns in
>>>> other paths.
>>>
>>> In what way does tracing suffer?  Removing tracepoints?
>>
>> Yes. Starting and stopping the function tracer goes from 5 seconds to
>> something like 30 seconds.
>>
>>> And yes, this approach absolutely requires finding code paths with
>>> user-visible grace periods.  Which sounds like you tried the "Slow down
>>> grace periods" part but not the "bit the bullet" part.  ;-)
>>
>> True, I got so scared looking at the performance that I just decided to
>> play it safe and do selective call_rcu() lazifying, rather than making
>> everything lazy.
> 
> For what definition of "safe", exactly?  ;-)

I mean the usual, like somebody complaining performance issues crept up
in their workload :)
> 
>>> On the flush-time mismatch, if there are any non-lazy callbacks in the
>>> list, it costs you nothing to let the lazy callbacks tag along through
>>> the grace period.  So one approach would be to use the current flush
>>> time if there are non-lazy callbacks, but use the longer flush time if
>>> all of the callbacks in the list are lazy callbacks.
>>
>> Cool!
>>
>>>>> 	o	If so:
>>>>>
>>>>> 		o	Use segmented list with marked grace periods?
>>>>> 			Keep in mind that such a list can track only
>>>>> 			two grace periods.
>>>>>
>>>>> 		o	Use a plain list and have grace-period start
>>>>> 			simply drain the list?
>>>>
>>>> I want to use the segmented list, regardless of whether we use the
>>>> bypass list or not, because we get those memory barriers and
>>>> rcu_barrier() lockless sampling of ->len, for free :).
>>>
>>> The bypass list also gets you the needed memory barriers and lockless
>>> sampling of ->len.  As does any other type of list as long as the
>>> ->cblist.len field accounts for the lazy callbacks.
>>>
>>> So the main difference is the tracking of grace periods, and even then,
>>> only those grace periods for which this CPU has no non-lazy callbacks.
>>> Or, in the previously discussed case where a single rcuoc kthread serves
>>> multiple CPUs, only those grace periods for which this group of CPUs
>>> has no non-lazy callbacks.
>>
>> Unless we assume that everything in the bypass list is lazy, can we
>> really use it to track both lazy and non lazy CBs, and grace periods?
> 
> Why not just count the number of lazy callbacks in the bypass list?
> There
> is already a count of the total number of callbacks in ->nocb_bypass.len.
> This list is protected by a lock, so protect the count of lazy callbacks
> with that same lock.  

That's a good idea, I can try that.

Then just compare the number of lazy callbacks to
> rcu_cblist_n_cbs(&rdp->nocb_bypass).  If they are equal, all callbacks
> are lazy.
> 
> What am I missing here?

There could be more issues that are incompatible with bypass lists, but
nothing that I feel cannot be worked around.

Example:
1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is
empty).
2. Now say 10000 non-lazy CBs are queued. As per the comments, these
have to go to the bypass list to keep rcu_barrier() from breaking.
3. Because this causes the bypass list to overflow, all the lazy +
non-lazy CBs have to be flushed to the main ->cblist.

If only the non-lazy CBs are flushed, rcu_barrier() might break. If all
are flushed, then the lazy ones lose their laziness property as RCU will
be immediately kicked off to process GPs on their behalf.

This can be fixed by making rcu_barrier() queue both a lazy and a
non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a
bypass-list overflow, I think.

Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it
simple. I think that should be OK since if there are a lot of CBs queued
in a short time, I don't think there is much opportunity for power
savings anyway IMHO.
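
Something like this sketch, perhaps (the overflow test mirrors the existing
qhimark check; flush_bypass_to_cblist() and ->lazy_len are assumptions
standing in for the existing bypass-flush path and a new lazy counter):

        /* On bypass-list overflow, flush lazy and non-lazy CBs together
         * into ->cblist and zero the lazy count; the grace period being
         * started for the non-lazy flood then covers the lazy CBs too.
         */
        if (ncbs >= qhimark) {
                flush_bypass_to_cblist(rdp);
                WRITE_ONCE(rdp->lazy_len, 0);
        }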

>> Currently the struct looks like this:
>>
>> struct rcu_segcblist {
>>         struct rcu_head *head;
>>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
>> #ifdef CONFIG_RCU_NOCB_CPU
>>         atomic_long_t len;
>> #else
>>         long len;
>> #endif
>>         long seglen[RCU_CBLIST_NSEGS];
>>         u8 flags;
>> };
>>
>> So now, it would need to be like this?
>>
>> struct rcu_segcblist {
>>         struct rcu_head *head;
>>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
>>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
>> #ifdef CONFIG_RCU_NOCB_CPU
>>         struct rcu_head *lazy_head;
>>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
>>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
>>         atomic_long_t lazy_len;
>> #else
>>         long len;
>> #endif
>>         long seglen[RCU_CBLIST_NSEGS];
>>         u8 flags;
>> };
> 
> I freely confess that I am not loving this arrangement.  Large increase
> in state space, but little benefit that I can see.  Again, what am I
> missing here?

I somehow thought tracking GPs separately for the lazy CBs requires
duplication of the rcu_head pointers/double-pointers in this struct. As
you pointed out, just tracking the lazy length may be sufficient.
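
In that case the structure change quoted above collapses to something like
this sketch (the lazy_len field and its placement are an assumption; the
count could equally live in rcu_data next to ->nocb_bypass):

struct rcu_segcblist {
        struct rcu_head *head;
        struct rcu_head **tails[RCU_CBLIST_NSEGS];
        unsigned long gp_seq[RCU_CBLIST_NSEGS];
#ifdef CONFIG_RCU_NOCB_CPU
        atomic_long_t len;
        atomic_long_t lazy_len;         /* Count of lazy CBs only. */
#else
        long len;
#endif
        long seglen[RCU_CBLIST_NSEGS];
        u8 flags;
};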

> 
>>> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
>>> for the system, at least in the common case where there is no callback
>>> flooding?
>>
>> When Rushikesh tried to reduce the number of callback threads, he did
>> not see much improvement in power. I don't think the additional wake ups
>> of extra rcuoc threads makes too much difference - the big power
>> improvement comes from not even kicking rcu_preempt / rcuog threads...
> 
> I suspect that this might come into play later on.  It is often the case
> that a given technique's effectiveness depends on the starting point.
> 
>>>>> o	Does call_rcu_lazy() do anything to ensure that additional grace
>>>>> 	periods that exist only for the benefit of lazy callbacks are
>>>>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
>>>>> 	(Keeping in mind that failing to share such grace periods
>>>>> 	burns battery lifetime.)
>>>>
>>>> I could be missing your point, can you give example of how you want the
>>>> behavior to be?
>>>
>>> CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
>>> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
>>> Given that the full price is being paid anyway, why not also invoke
>>> those 55 lazy callbacks?
>>>
>>> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
>>> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
>>> again the full price of grace-period computation and rcuoc wakeups is
>>> being paid, so why not also invoke those 26 lazy callbacks?
>>>
>>> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
>>> rcuoc kthread, and with the same numbers of callbacks as in the previous
>>> paragraph, you get to make a choice.  Do you do an extra wakeup for
>>> CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
>>> computation some time in the future?
>>>
>>> Suppose further that the CPU that would be awakened for CPU 0's rcuoc
>>> kthread is already non-idle.  At that point, this wakeup is quite cheap.
>>> Except that the wakeup-or-don't choice is made at the beginning of the
>>> grace period, and that CPU's idleness at that point might not be all
>>> that well correlated with its idleness at the time of the wakeup at the
>>> end of that grace period.  Nevertheless, idleness statistics could
>>> help trade off needless from-idle wakeups for needless grace periods.
>>>
>>> Which sounds like a good reason to consolidate rcuoc processing on
>>> non-busy systems.
>>>
>>> But more to the point...
>>>
>>> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
>>> RCU is otherwise idle.  We really want one (not three!) grace periods
>>> to eventually handle those lazy callbacks, right?
>>
>> Yes. I think I understand you now, yes we do want to reduce the number
>> of grace periods. But I think that is an optimization which is not
>> strictly necessary to get the power savings this patch series
>> demonstrates. To get the power savings shown here, we need all the RCU
>> threads to be quiet as much as possible, and not start the rcu_preempt
>> thread's state machine until necessary.
> 
> I believe that it will be both an optimization and a simplification.
> Again, if you have to start the grace-period machinery anyway, you might
> as well take care of the lazy callbacks while you are at it.

Sure, sounds good. I agree with sharing grace periods instead of
starting a new one.

>>>> I am not thinking of creating separate GP cycles just for lazy CBs. The
>>>> GP cycle will be the same that non-lazy uses. A lazy CB will just record
>>>> what is the current or last GP, and then wait. Once the timer expires,
>>>> it will check what is the current or last GP. If a new GP was not
>>>> started, then it starts a new GP.
>>>
>>> From my perspective, this new GP is in fact a separate GP just for lazy
>>> callbacks.
>>>
>>> What am I missing here?
>>
>> I think my thought for power savings is slightly different from yours. I
>> see lots of power savings when not even going to RCU when system is idle
>> - that appears to be a lot like what the bypass list does - it avoids
>> call_rcu() completely and does an early exit:
>>
>>         if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>>                 return; // Enqueued onto ->nocb_bypass, so just leave.
>>
>> I think amortizing cost of grace period computation across lazy CBs may
>> give power savings, but that is secondary to the goal of not even poking
>> RCU (which an early exit like the above does). So maybe we can just talk
>> about that goal first?
> 
> You completely lost me on this one.  My point is not to amortize the cost
> of grace-period computation across the lazy CBs.  My point is instead
> to completely avoid doing any extra grace-period computation for the
> lazy CBs in cases where grace-period computations are happening anyway.
> Of course, if there are no non-lazy callbacks, then there won't be
> any happening-anyway grace periods, and then, yes, there will eventually
> need to be a grace period specially for the benefit of the lazy
> callbacks.
> 
> And taking this piecewise is unlikely to give good results, especially
> given that workloads evolve over time, much though I understand your
> desire for doing this one piece at a time.

Sure ok, I agree with you on that.

>>> Also, if your timer triggers the check, isn't that an additional potential
>>> wakeup from idle?  Would it make sense to also minimize those wakeups?
>>> (This is one motivation for grace periods to pick up lazy callbacks.)
>>>
>>>>                                   If a new GP was started but not
>>>> completed, it can simply call_rcu(). If a new GP started and completed,
>>>> it does not start new GP and just executes CB, something like that.
>>>> Before the flush happens, if multiple lazy CBs were queued in the
>>>> meanwhile and the GP seq counters moved forward, then all those lazy CBs
>>>> will share the same GP (either not starting a new one, or calling
>>>> call_rcu() to share the ongoing one, something like that).
>>>
>>> You could move the ready-to-invoke callbacks to the end of the DONE
>>> segment of ->cblist, assuming that you also adjust rcu_barrier()
>>> appropriately.  If rcu_barrier() is rare, you could simply flush whatever
>>> lazy list before entraining the rcu_barrier() callback.
>>
>> Got it.
>>
>>>>> o	When there is ample memory, how long are lazy callbacks allowed
>>>>> 	to sleep?  Forever?  If not forever, what feeds into computing
>>>>> 	the timeout?
>>>>
>>>> Yeah, the timeout is fixed currently. I agree this is a good thing to
>>>> optimize.
>>>
>>> First see what the behavior is.  If the timeout is long enough that
>>> there is almost never a lazy-only grace period, then maybe a fixed
>>> timeout is good enough.
>>
>> Ok.
>>
>>>>> o	What is used to determine when memory is low, so that laziness
>>>>> 	has become a net negative?
>>>>
>>>> The assumption I think we make is that the shrinkers will constantly
>>>> flush out lazy CBs if there is memory pressure.
>>>
>>> It might be worth looking at SPI.  That could help avoid a grace-period
>>> wait when clearing out lazy callbacks.
>>
>> You mean PSI?
> 
> Yes, pressure stall information.  Apologies for my confusion!

Ok, I will look into it, thanks.

>>>>> o	What other conditions, if any, should prompt motivating lazy
>>>>> 	callbacks?  (See above for the start of a grace period motivating
>>>>> 	lazy callbacks.)
>>>>>
>>>>> In short, you need to cast your net pretty wide on this one.  It has
>>>>> not yet been very carefully explored, so there are likely to be surprises,
>>>>> maybe even good surprises.  ;-)
>>>>
>>>> Cool that sounds like some good opportunity to work on something cool ;-)
>>>
>>> Which comes back around to your point of needing evaluation of extra
>>> RCU work.  Lots of tradeoffs between different wakeup sources and grace
>>> periods.  Fine-grained views into the black box will be helpful.  ;-)
>>
>> Thanks for the conversations. I very much like the idea of using bypass
>> list, but I am not sure if the current segcblist structure will support
>> both lazy and non-lazy CBs. Basically, if it contains both call_rcu()
>> and call_rcu_lazy() CBs, that means either all of them are lazy or all
>> of them are not right? Unless we are also talking about making
>> additional changes to rcu_segcblist struct to accommodate both types of CBs.
> 
> Do you need to do anything different to lazy callbacks once they have
> been associated with a grace period that has been started?  Your current
> patch treats them the same.  If that approach continues to work, then
> why do you need to track laziness downstream?

Yes, I am quite confused on this part. You are saying that just tracking
the "lazy length" is sufficient to avoid kicking off the RCU machinery,
and that knowing exactly which CBs are lazy and which are not is not
needed, right? If I misunderstood that, could you clarify what you mean
by "track laziness downstream"?

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 16:11                     ` Joel Fernandes
@ 2022-05-31 16:45                       ` Paul E. McKenney
  2022-05-31 18:51                         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-31 16:45 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 12:11:00PM -0400, Joel Fernandes wrote:
> On 5/31/22 00:26, Paul E. McKenney wrote:
> [...]
> >>>>> Which means that
> >>>>> 	apples to apples evaluation is also required.  But this is the
> >>>>> 	information I currently have at hand, and it is probably no more
> >>>>> 	than a factor of two off of what would be seen on ChromeOS.
> >>>>>
> >>>>> 	Or is there some ChromeOS data that tells a different story?
> >>>>> 	After all, for all I know, Android might still be expediting
> >>>>> 	all normal grace periods.
> >>>>>
> >>>>> At which point, the question becomes "how to make up that 7%?" After all,
> >>>>> it is not likely that anyone is going to leave that much battery lifetime
> >>>>> on the table.  Here are the possibilities that I currently see:
> >>>>>
> >>>>> o	Slow down grace periods, and also bite the bullet and make
> >>>>> 	userspace changes to speed up the RCU grace periods during
> >>>>> 	critical operations.
> >>>>
> >>>> We tried this, and tracing suffers quite a lot. The system also felt
> >>>> "sluggish", which I suspect is because of synchronize_rcu() slowdowns in
> >>>> other paths.
> >>>
> >>> In what way does tracing suffer?  Removing tracepoints?
> >>
> >> Yes. Starting and stopping the function tracer goes from 5 seconds to
> >> something like 30 seconds.
> >>
> >>> And yes, this approach absolutely requires finding code paths with
> >>> user-visible grace periods.  Which sounds like you tried the "Slow down
> >>> grace periods" part but not the "bit the bullet" part.  ;-)
> >>
> >> True, I got so scared looking at the performance that I just decided to
> >> play it safe and do selective call_rcu() lazifying, rather than making
> >> everything lazy.
> > 
> > For what definition of "safe", exactly?  ;-)
> 
> I mean the usual, like somebody complaining performance issues crept up
> in their workload :)

It is not that difficult.  Just do something like this:

	# run a statistically significant set of tracer tests
	echo 1 >  /sys/kernel/rcu_expedited
	# run a statistically significant set of tracer tests

It is quite possible that you will decide that you want grace periods
to be expedited during tracing operations even on non-battery-powered
systems.

> >>> On the flush-time mismatch, if there are any non-lazy callbacks in the
> >>> list, it costs you nothing to let the lazy callbacks tag along through
> >>> the grace period.  So one approach would be to use the current flush
> >>> time if there are non-lazy callbacks, but use the longer flush time if
> >>> all of the callbacks in the list are lazy callbacks.
> >>
> >> Cool!
> >>
> >>>>> 	o	If so:
> >>>>>
> >>>>> 		o	Use segmented list with marked grace periods?
> >>>>> 			Keep in mind that such a list can track only
> >>>>> 			two grace periods.
> >>>>>
> >>>>> 		o	Use a plain list and have grace-period start
> >>>>> 			simply drain the list?
> >>>>
> >>>> I want to use the segmented list, regardless of whether we use the
> >>>> bypass list or not, because we get those memory barriers and
> >>>> rcu_barrier() lockless sampling of ->len, for free :).
> >>>
> >>> The bypass list also gets you the needed memory barriers and lockless
> >>> sampling of ->len.  As does any other type of list as long as the
> >>> ->cblist.len field accounts for the lazy callbacks.
> >>>
> >>> So the main difference is the tracking of grace periods, and even then,
> >>> only those grace periods for which this CPU has no non-lazy callbacks.
> >>> Or, in the previously discussed case where a single rcuoc kthread serves
> >>> multiple CPUs, only those grace periods for which this group of CPUs
> >>> has no non-lazy callbacks.
> >>
> >> Unless we assume that everything in the bypass list is lazy, can we
> >> really use it to track both lazy and non lazy CBs, and grace periods?
> > 
> > Why not just count the number of lazy callbacks in the bypass list?
> > There
> > is already a count of the total number of callbacks in ->nocb_bypass.len.
> > This list is protected by a lock, so protect the count of lazy callbacks
> > with that same lock.  
> 
> That's a good idea, I can try that.
> 
> Then just compare the number of lazy callbacks to
> > rcu_cblist_n_cbs(&rdp->nocb_bypass).  If they are equal, all callbacks
> > are lazy.
> > 
> > What am I missing here?
> 
> There could be more issues that are incompatible with bypass lists, but
> nothing that I feel cannot be worked around.

Good!

> Example:
> 1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is
> empty).
> 2. Now say 10000 non-lazy CBs are queued. As per the comments, these
> have to go to the bypass list to keep rcu_barrier() from breaking.
> 3. Because this causes the bypass list to overflow, all the lazy +
> non-lazy CBs have to be flushed to the main ->cblist.
> 
> If only the non-lazy CBs are flushed, rcu_barrier() might break. If all
> are flushed, then the lazy ones lose their laziness property as RCU will
> be immediately kicked off to process GPs on their behalf.

Exactly why is this loss of laziness a problem?  You are doing that
grace period for the 10,000 non-lazy callbacks anyway, so what difference
could the five lazy callbacks possibly make?

> This can be fixed by making rcu_barrier() queue both a lazy and a
> non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a
> bypass-list overflow, I think.

I don't see anything that needs fixing.  If you are doing a grace period
anyway, just process the lazy callbacks along with the non-lazy callbacks.
After all, you are paying for that grace period anyway.  And handling
the lazy callbacks with that grace period means that you don't need a
later grace period for those five lazy callbacks.  So running the lazy
callbacks into the grace period required by the non-lazy callbacks is
a pure win, right?

If it is not a pure win, please explain exactly what is being lost.

> Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it
> simple. I think that should be OK since if there are a lot of CBs queued
> in a short time, I don't think there is much opportunity for power
> savings anyway IMHO.

I believe that it will be simpler, faster, and more energy efficient to
do it this way, flushing everything from the bypass list to ->cblist.
Again, leaving the lazy callbacks lying around means that there must be a
later battery-draining grace period that might not be required otherwise.

> >> Currently the struct looks like this:
> >>
> >> struct rcu_segcblist {
> >>         struct rcu_head *head;
> >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> >> #ifdef CONFIG_RCU_NOCB_CPU
> >>         atomic_long_t len;
> >> #else
> >>         long len;
> >> #endif
> >>         long seglen[RCU_CBLIST_NSEGS];
> >>         u8 flags;
> >> };
> >>
> >> So now, it would need to be like this?
> >>
> >> struct rcu_segcblist {
> >>         struct rcu_head *head;
> >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> >> #ifdef CONFIG_RCU_NOCB_CPU
> >>         struct rcu_head *lazy_head;
> >>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
> >>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
> >>         atomic_long_t lazy_len;
> >> #else
> >>         long len;
> >> #endif
> >>         long seglen[RCU_CBLIST_NSEGS];
> >>         u8 flags;
> >> };
> > 
> > I freely confess that I am not loving this arrangement.  Large increase
> > in state space, but little benefit that I can see.  Again, what am I
> > missing here?
> 
> I somehow thought tracking GPs separately for the lazy CBs requires
> duplication of the rcu_head pointers/double-pointers in this struct. As
> you pointed out, just tracking the lazy length may be sufficient.

Here is hoping!

After all, if you thought that taking care of applications that need
expediting of grace periods is scary, well, now...

> >>> And on devices with few CPUs, wouldn't you make do with one rcuoc kthread
> >>> for the system, at least in the common case where there is no callback
> >>> flooding?
> >>
> >> When Rushikesh tried to reduce the number of callback threads, he did
> >> not see much improvement in power. I don't think the additional wake ups
> >> of extra rcuoc threads makes too much difference - the big power
> >> improvement comes from not even kicking rcu_preempt / rcuog threads...
> > 
> > I suspect that this might come into play later on.  It is often the case
> > that a given technique's effectiveness depends on the starting point.
> > 
> >>>>> o	Does call_rcu_lazy() do anything to ensure that additional grace
> >>>>> 	periods that exist only for the benefit of lazy callbacks are
> >>>>> 	maximally shared among all CPUs' lazy callbacks?  If so, what?
> >>>>> 	(Keeping in mind that failing to share such grace periods
> >>>>> 	burns battery lifetime.)
> >>>>
> >>>> I could be missing your point, can you give example of how you want the
> >>>> behavior to be?
> >>>
> >>> CPU 0 has 55 lazy callbacks and one non-lazy callback.  So the system
> >>> does a grace-period computation and CPU 0 wakes up its rcuoc kthread.
> >>> Given that the full price is being paid anyway, why not also invoke
> >>> those 55 lazy callbacks?
> >>>
> >>> If the rcuoc kthread is shared between CPU 0 and CPU 1, and CPU 0 has 23
> >>> lazy callbacks and CPU 1 has 3 lazy callbacks and 2 non-lazy callbacks,
> >>> again the full price of grace-period computation and rcuoc wakeups is
> >>> being paid, so why not also invoke those 26 lazy callbacks?
> >>>
> >>> On the other hand, if CPU 0 uses one rcuoc kthread and CPU 1 some other
> >>> rcuoc kthread, and with the same numbers of callbacks as in the previous
> >>> paragraph, you get to make a choice.  Do you do an extra wakeup for
> >>> CPU 0's rcuoc kthread?  Or do you potentially do an extra grace-period
> >>> computation some time in the future?
> >>>
> >>> Suppose further that the CPU that would be awakened for CPU 0's rcuoc
> >>> kthread is already non-idle.  At that point, this wakeup is quite cheap.
> >>> Except that the wakeup-or-don't choice is made at the beginning of the
> >>> grace period, and that CPU's idleness at that point might not be all
> >>> that well correlated with its idleness at the time of the wakeup at the
> >>> end of that grace period.  Nevertheless, idleness statistics could
> >>> help trade off needless from-idle wakeups for needless grace periods.
> >>>
> >>> Which sounds like a good reason to consolidate rcuoc processing on
> >>> non-busy systems.
> >>>
> >>> But more to the point...
> >>>
> >>> Suppose that CPUs 0, 1, and 2 all have only lazy callbacks, and that
> >>> RCU is otherwise idle.  We really want one (not three!) grace periods
> >>> to eventually handle those lazy callbacks, right?
> >>
> >> Yes. I think I understand you now, yes we do want to reduce the number
> >> of grace periods. But I think that is an optimization which is not
> >> strictly necessary to get the power savings this patch series
> >> demonstrates. To get the power savings shown here, we need all the RCU
> >> threads to be quiet as much as possible, and not start the rcu_preempt
> >> thread's state machine until necessary.
> > 
> > I believe that it will be both an optimization and a simplification.
> > Again, if you have to start the grace-period machinery anyway, you might
> > as well take care of the lazy callbacks while you are at it.
> 
> Sure, sounds good. I agree with sharing grace periods instead of
> starting a new one.

As you say, sounds good.  ;-)

> >>>> I am not thinking of creating separate GP cycles just for lazy CBs. The
> >>>> GP cycle will be the same that non-lazy uses. A lazy CB will just record
> >>>> what is the current or last GP, and then wait. Once the timer expires,
> >>>> it will check what is the current or last GP. If a new GP was not
> >>>> started, then it starts a new GP.
> >>>
> >>> From my perspective, this new GP is in fact a separate GP just for lazy
> >>> callbacks.
> >>>
> >>> What am I missing here?
> >>
> >> I think my thought for power savings is slightly different from yours. I
> >> see lots of power savings when not even going to RCU when system is idle
> >> - that appears to be a lot like what the bypass list does - it avoids
> >> call_rcu() completely and does an early exit:
> >>
> >>         if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> >>                 return; // Enqueued onto ->nocb_bypass, so just leave.
> >>
> >> I think amortizing cost of grace period computation across lazy CBs may
> >> give power savings, but that is secondary to the goal of not even poking
> >> RCU (which an early exit like the above does). So maybe we can just talk
> >> about that goal first?
> > 
> > You completely lost me on this one.  My point is not to amortize the cost
> > of grace-period computation across the lazy CBs.  My point is instead
> > to completely avoid doing any extra grace-period computation for the
> > lazy CBs in cases where grace-period computations are happening anyway.
> > Of course, if there are no non-lazy callbacks, then there won't be
> > any happening-anyway grace periods, and then, yes, there will eventually
> > need to be a grace period specially for the benefit of the lazy
> > callbacks.
> > 
> > And taking this piecewise is unlikely to give good results, especially
> > given that workloads evolve over time, much though I understand your
> > desire for doing this one piece at a time.
> 
> Sure ok, I agree with you on that.

Very good!

> >>> Also, if your timer triggers the check, isn't that an additional potential
> >>> wakeup from idle?  Would it make sense to also minimize those wakeups?
> >>> (This is one motivation for grace periods to pick up lazy callbacks.)
> >>>
> >>>>                                   If a new GP was started but not
> >>>> completed, it can simply call_rcu(). If a new GP started and completed,
> >>>> it does not start new GP and just executes CB, something like that.
> >>>> Before the flush happens, if multiple lazy CBs were queued in the
> >>>> meanwhile and the GP seq counters moved forward, then all those lazy CBs
> >>>> will share the same GP (either not starting a new one, or calling
> >>>> call_rcu() to share the ongoing one, something like that).
> >>>
> >>> You could move the ready-to-invoke callbacks to the end of the DONE
> >>> segment of ->cblist, assuming that you also adjust rcu_barrier()
> >>> appropriately.  If rcu_barrier() is rare, you could simply flush whatever
> >>> lazy list before entraining the rcu_barrier() callback.
> >>
> >> Got it.
> >>
> >>>>> o	When there is ample memory, how long are lazy callbacks allowed
> >>>>> 	to sleep?  Forever?  If not forever, what feeds into computing
> >>>>> 	the timeout?
> >>>>
> >>>> Yeah, the timeout is fixed currently. I agree this is a good thing to
> >>>> optimize.
> >>>
> >>> First see what the behavior is.  If the timeout is long enough that
> >>> there is almost never a lazy-only grace period, then maybe a fixed
> >>> timeout is good enough.
> >>
> >> Ok.
> >>
> >>>>> o	What is used to determine when memory is low, so that laziness
> >>>>> 	has become a net negative?
> >>>>
> >>>> The assumption I think we make is that the shrinkers will constantly
> >>>> flush out lazy CBs if there is memory pressure.
> >>>
> >>> It might be worth looking at SPI.  That could help avoid a grace-period
> >>> wait when clearing out lazy callbacks.
> >>
> >> You mean PSI?
> > 
> > Yes, pressure stall information.  Apologies for my confusion!
> 
> Ok, I will look into it, thanks.

And I bet that RCU should be using it elsewhere as well.

> >>>>> o	What other conditions, if any, should prompt motivating lazy
> >>>>> 	callbacks?  (See above for the start of a grace period motivating
> >>>>> 	lazy callbacks.)
> >>>>>
> >>>>> In short, you need to cast your net pretty wide on this one.  It has
> >>>>> not yet been very carefully explored, so there are likely to be surprises,
> >>>>> maybe even good surprises.  ;-)
> >>>>
> >>>> Cool that sounds like some good opportunity to work on something cool ;-)
> >>>
> >>> Which comes back around to your point of needing evaluation of extra
> >>> RCU work.  Lots of tradeoffs between different wakeup sources and grace
> >>> periods.  Fine-grained views into the black box will be helpful.  ;-)
> >>
> >> Thanks for the conversations. I very much like the idea of using bypass
> >> list, but I am not sure if the current segcblist structure will support
> >> both lazy and non-lazy CBs. Basically, if it contains both call_rcu()
> >> and call_rcu_lazy() CBs, that means either all of them are lazy or all
> >> of them are not right? Unless we are also talking about making
> >> additional changes to rcu_segcblist struct to accommodate both types of CBs.
> > 
> > Do you need to do anything different to lazy callbacks once they have
> > been associated with a grace period that has been started?  Your current
> > patch treats them the same.  If that approach continues to work, then
> > why do you need to track laziness downstream?
> 
> Yes, I am quite confused on this part. You are saying that just tracking
> the "lazy length" is sufficient to avoid kicking off the RCU machinery,
> and that knowing exactly which CBs are lazy and which are not is not
> needed, right? If I misunderstood that, could you clarify what you mean
> by "track laziness downstream"?

Pretty much.  What I am saying is that the concept of laziness is helpful
primarily while the callbacks are waiting for a grace period to start.
Lazy callbacks are much less aggressive about starting new grace periods
than are non-lazy callbacks.

But once the grace period has started, why treat them differently?
From what you said earlier, your testing thus far shows that the biggest
energy savings come from delaying/merging grace periods rather than the details
of callback processing within and after the grace period, right?
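
Put as a sketch, the only place the distinction needs to show up is the
enqueue-time decision (every helper name below is an assumption, purely
for illustration):

static void enqueue_cb(struct rcu_data *rdp, struct rcu_head *head, bool lazy)
{
        add_to_bypass(rdp, head);               /* Same list either way. */
        if (lazy)
                arm_lazy_flush_timer(rdp);      /* Long timeout, no GP kick. */
        else
                kick_gp_machinery(rdp);         /* Start a grace period soon. */
}

Downstream of that decision, the grace-period and callback-invocation paths
need not care which callbacks were lazy.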

And even if differences in callback processing are warranted, there are
lots of ways to make that happen, many of which do not require propagation
of per-callback laziness into and through the grace-period machinery.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 16:45                       ` Paul E. McKenney
@ 2022-05-31 18:51                         ` Joel Fernandes
  2022-05-31 19:25                           ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 18:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:
[..] 
> > Example:
> > 1. Say 5 lazy CBs queued onto bypass list (while the regular cblist is
> > empty).
> > 2. Now say 10000 non-lazy CBs are queued. As per the comments, these
> > have to go to the bypass list to keep rcu_barrier() from breaking.
> > 3. Because this causes the bypass list to overflow, all the lazy +
> > non-lazy CBs have to be flushed to the main ->cblist.
> > 
> > If only the non-lazy CBs are flushed, rcu_barrier() might break. If all
> > are flushed, then the lazy ones lose their laziness property as RCU will
> > be immediately kicked off to process GPs on their behalf.
> 
> Exactly why is this loss of laziness a problem?  You are doing that
> grace period for the 10,000 non-lazy callbacks anyway, so what difference
> could the five lazy callbacks possibly make?

It does not make any difference, I kind of answered my own question. I was
thinking out loud in this thread (Sorry).

> > This can be fixed by making rcu_barrier() queue both a lazy and a
> > non-lazy CB, and only flushing the non-lazy CBs to the ->cblist on a
> > bypass-list overflow, I think.
> 
> I don't see anything that needs fixing.  If you are doing a grace period
> anyway, just process the lazy callbacks along with the non-lazy callbacks.
> After all, you are paying for that grace period anyway.  And handling
> the lazy callbacks with that grace period means that you don't need a
> later grace period for those five lazy callbacks.  So running the lazy
> callbacks into the grace period required by the non-lazy callbacks is
> a pure win, right?
> 
> If it is not a pure win, please explain exactly what is being lost.

Agreed. As discussed on IRC, we only need to care about incrementing the lazy
length, and the flush will drop it to 1 or 0. No need to design for partial
flushing for now, as there is no use case.

> > Or, we flush both lazy and non-lazy CBs to the ->cblist just to keep it
> > simple. I think that should be OK since if there are a lot of CBs queued
> > in a short time, I don't think there is much opportunity for power
> > savings anyway IMHO.
> 
> I believe that it will be simpler, faster, and more energy efficient to
> do it this way, flushing everything from the bypass list to ->cblist.
> Again, leaving the lazy callbacks lying around means that there must be a
> later battery-draining grace period that might not be required otherwise.

Perfect.

> > >> Currently the struct looks like this:
> > >>
> > >> struct rcu_segcblist {
> > >>         struct rcu_head *head;
> > >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> > >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> > >> #ifdef CONFIG_RCU_NOCB_CPU
> > >>         atomic_long_t len;
> > >> #else
> > >>         long len;
> > >> #endif
> > >>         long seglen[RCU_CBLIST_NSEGS];
> > >>         u8 flags;
> > >> };
> > >>
> > >> So now, it would need to be like this?
> > >>
> > >> struct rcu_segcblist {
> > >>         struct rcu_head *head;
> > >>         struct rcu_head **tails[RCU_CBLIST_NSEGS];
> > >>         unsigned long gp_seq[RCU_CBLIST_NSEGS];
> > >> #ifdef CONFIG_RCU_NOCB_CPU
> > >>         struct rcu_head *lazy_head;
> > >>         struct rcu_head **lazy_tails[RCU_CBLIST_NSEGS];
> > >>         unsigned long lazy_gp_seq[RCU_CBLIST_NSEGS];
> > >>         atomic_long_t lazy_len;
> > >> #else
> > >>         long len;
> > >> #endif
> > >>         long seglen[RCU_CBLIST_NSEGS];
> > >>         u8 flags;
> > >> };
> > > 
> > > I freely confess that I am not loving this arrangement.  Large increase
> > > in state space, but little benefit that I can see.  Again, what am I
> > > missing here?
> > 
> > I somehow thought tracking GPs separately for the lazy CBs requires
> > duplication of the rcu_head pointers/double-pointers in this struct. As
> > you pointed out, just tracking the lazy length may be sufficient.
> 
> Here is hoping!
> 
> After all, if you thought that taking care of applications that need
> expediting of grace periods is scary, well, now...

Haha... my fear is that I don't know all the applications requiring expedited
GPs, and I keep getting surprised by new RCU usages that pop up in the system,
or by new systems.

For one, a number of tools and processes use ftrace directly in the system,
and it may not be practical to chase down every tool. Some of them start
tracing randomly in the system. Handling it in the kernel itself would be
best if possible.

Productive email discussion indeed! On to writing the code :P
 
Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 18:51                         ` Joel Fernandes
@ 2022-05-31 19:25                           ` Paul E. McKenney
  2022-05-31 21:29                             ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-05-31 19:25 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote:
> On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Here is hoping!
> > 
> > After all, if you thought that taking care of applications that need
> > expediting of grace periods is scary, well, now...
> 
> Haha... my fear is I don't know all the applications requiring expedited GP
> and I keep getting surprised by new RCU usages that pop up in the system, or
> new systems.
> 
> For one, a number of tools and processes use ftrace directly in the system,
> and it may not be practical to chase down every tool. Some of them start
> tracing randomly in the system. Handling it in-kernel itself would be best if
> possible.

Shouldn't it be possible to hook into the kernel code that runs when
the sysfs tracing files are updated?  As in, why not both?  ;-)

> Productive email discussion indeed! On to writing the code :P

Should be fun!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 19:25                           ` Paul E. McKenney
@ 2022-05-31 21:29                             ` Joel Fernandes
  2022-05-31 22:44                               ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 21:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Tue, May 31, 2022 at 12:25:32PM -0700, Paul E. McKenney wrote:
> On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote:
> > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > Here is hoping!
> > > 
> > > After all, if you thought that taking care of applications that need
> > > expediting of grace periods is scary, well, now...
> > 
> > Haha... my fear is I don't know all the applications requiring expedited GP
> > and I keep getting surprised by new RCU usages that pop up in the system, or
> > new systems.
> > 
> > For one, a number of tools and processes use ftrace directly in the system,
> > and it may not be practical to chase down every tool. Some of them start
> > tracing randomly in the system. Handling it in-kernel itself would be best if
> > possible.
> 
> Shouldn't it be possible to hook into the kernel code that runs when
> the sysfs tracing files are updated?  As in, why not both?  ;-)

Yes, point taken on this; let us do both solutions.
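
For the in-kernel side, something roughly like this is what I am imagining
(purely a sketch; rcu_lazy_disable()/rcu_lazy_enable() and the hook point are
hypothetical, not an existing interface):

/* Hypothetical sketch, not an existing kernel interface. */
static atomic_t rcu_lazy_disabled;

/* Called from wherever tracing is switched on/off in the kernel. */
void rcu_lazy_disable(void)
{
        atomic_inc(&rcu_lazy_disabled);
}

void rcu_lazy_enable(void)
{
        atomic_dec(&rcu_lazy_disabled);
}

void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
{
        if (atomic_read(&rcu_lazy_disabled)) {
                /* Behave exactly like call_rcu() while tracing is active. */
                call_rcu(head, func);
                return;
        }
        /* ... otherwise queue to the per-CPU lazy list as before ... */
}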

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-31 21:29                             ` Joel Fernandes
@ 2022-05-31 22:44                               ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-05-31 22:44 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Tue, May 31, 2022 at 5:29 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Tue, May 31, 2022 at 12:25:32PM -0700, Paul E. McKenney wrote:
> > On Tue, May 31, 2022 at 06:51:48PM +0000, Joel Fernandes wrote:
> > > On Tue, May 31, 2022 at 09:45:34AM -0700, Paul E. McKenney wrote:
> >
> > [ . . . ]
> >
> > > > Here is hoping!
> > > >
> > > > After all, if you thought that taking care of applications that need
> > > > expediting of grace periods is scary, well, now...
> > >
> > > Haha... my fear is I don't know all the applications requiring expedited GP
> > > and I keep getting surprised by new RCU usages that pop up in the system, or
> > > new systems.
> > >
> > > For one, a number of tools and processes use ftrace directly in the system,
> > > and it may not be practical to chase down every tool. Some of them start
> > > tracing randomly in the system. Handling it in-kernel itself would be best if
> > > possible.
> >
> > Shouldn't it be possible to hook into the kernel code that runs when
> > the sysfs tracing files are updated?  As in, why not both?  ;-)
>
> Yes, point taken on this; let us do both solutions.

It appears Paul has already handled both the hotplug and the rcu_barrier
ordering issues for bypass lists, which makes the bypass list an obvious
choice for solving those problems... :-)

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-30 14:54     ` Joel Fernandes
@ 2022-06-01 14:12       ` Frederic Weisbecker
  2022-06-01 19:10         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-06-01 14:12 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Uladzislau Rezki, rcu, rushikesh.s.kadam, neeraj.iitr10, paulmck,
	rostedt

On Mon, May 30, 2022 at 10:54:26AM -0400, Joel Fernandes wrote:
> >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> >> +{
> >> +	struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> >> +	struct rcu_lazy_pcp *rlp;
> >> +
> >> +	preempt_disable();
> >> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> >> +	preempt_enable();
> >>
> > Can we get rid of such explicit disabling/enabling preemption?
> 
> Ok I'll try. Last I checked, something needs to disable preemption to
> prevent warnings with sampling the current processor ID.

raw_cpu_ptr()
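
i.e. roughly the following (just a sketch of the direction, with the
double-free check and the length accounting from your patch left out):

void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
{
        struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
        struct rcu_lazy_pcp *rlp;

        /*
         * raw_cpu_ptr() skips the smp_processor_id() debug check, so no
         * preempt_disable()/preempt_enable() pair is needed.  Getting
         * migrated afterwards is harmless: llist_add() is atomic and the
         * callback only has to land on some CPU's lazy list.
         */
        rlp = raw_cpu_ptr(&rcu_lazy_pcp_ins);

        head->func = func;
        llist_add(&head->llist_node, &rlp->head);
}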

Thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-05-12 23:56   ` Paul E. McKenney
  2022-05-14 15:08     ` Joel Fernandes
@ 2022-06-01 14:24     ` Frederic Weisbecker
  2022-06-01 16:17       ` Paul E. McKenney
  2022-06-01 19:09       ` Joel Fernandes
  1 sibling, 2 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-06-01 14:24 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joel Fernandes (Google),
	rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, rostedt

On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > +	preempt_enable();
> > +
> > +	if (debug_rcu_head_queue((void *)head)) {
> > +		// Probable double call_rcu(), just leak.
> > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > +				__func__, head);
> > +
> > +		// Mark as success and leave.
> > +		return;
> > +	}
> > +
> > +	// Queue to per-cpu llist
> > +	head->func = func;
> > +	llist_add(&head->llist_node, &rlp->head);
> 
> Suppose that there are a bunch of preemptions between the preempt_enable()
> above and this point, so that the current CPU's list has lots of
> callbacks, but zero ->count.  Can that cause a problem?
> 
> In the past, this sort of thing has been an issue for rcu_barrier()
> and friends.

Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I
might have missed that somewhere in the patchset though.

Thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-06-01 14:24     ` Frederic Weisbecker
@ 2022-06-01 16:17       ` Paul E. McKenney
  2022-06-01 19:09       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-06-01 16:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joel Fernandes (Google),
	rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, rostedt

On Wed, Jun 01, 2022 at 04:24:54PM +0200, Frederic Weisbecker wrote:
> On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > +	preempt_enable();
> > > +
> > > +	if (debug_rcu_head_queue((void *)head)) {
> > > +		// Probable double call_rcu(), just leak.
> > > +		WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > +				__func__, head);
> > > +
> > > +		// Mark as success and leave.
> > > +		return;
> > > +	}
> > > +
> > > +	// Queue to per-cpu llist
> > > +	head->func = func;
> > > +	llist_add(&head->llist_node, &rlp->head);
> > 
> > Suppose that there are a bunch of preemptions between the preempt_enable()
> > above and this point, so that the current CPU's list has lots of
> > callbacks, but zero ->count.  Can that cause a problem?
> > 
> > In the past, this sort of thing has been an issue for rcu_barrier()
> > and friends.
> 
> Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I
> might have missed that somewhere in the patchset though.

It should, but Joel deferred rcu_barrier() handling for this initial
prototype.  So be careful when playing with it.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-06-01 14:24     ` Frederic Weisbecker
  2022-06-01 16:17       ` Paul E. McKenney
@ 2022-06-01 19:09       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-06-01 19:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Steven Rostedt

Hi Fred,

On Wed, Jun 1, 2022 at 10:24 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Thu, May 12, 2022 at 04:56:03PM -0700, Paul E. McKenney wrote:
> > On Thu, May 12, 2022 at 03:04:29AM +0000, Joel Fernandes (Google) wrote:
> > > +   preempt_enable();
> > > +
> > > +   if (debug_rcu_head_queue((void *)head)) {
> > > +           // Probable double call_rcu(), just leak.
> > > +           WARN_ONCE(1, "%s(): Double-freed call. rcu_head %p\n",
> > > +                           __func__, head);
> > > +
> > > +           // Mark as success and leave.
> > > +           return;
> > > +   }
> > > +
> > > +   // Queue to per-cpu llist
> > > +   head->func = func;
> > > +   llist_add(&head->llist_node, &rlp->head);
> >
> > Suppose that there are a bunch of preemptions between the preempt_enable()
> > above and this point, so that the current CPU's list has lots of
> > callbacks, but zero ->count.  Can that cause a problem?
> >
> > In the past, this sort of thing has been an issue for rcu_barrier()
> > and friends.
>
> Speaking of, shouldn't rcu_barrier() flush all the lazy queues? I
> might have missed that somewhere in the patchset though.

Yes, it should. This initial prototype did not handle it; that was mentioned
in one of the replies to the cover letter. However, in v2 I am planning to use
Paul's idea of sticking these callbacks on the bypass list. That simplifies a
lot of things that the bypass machinery already handles (hotplug, rcu_barrier,
de-offload, etc.).
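
Conceptually, the barrier side then only needs the usual entrain step to force
the bypass list out first, roughly like this sketch (the helper names below are
approximations of the existing bypass machinery, not the exact upstream
symbols):

static void rcu_barrier_entrain_cpu(struct rcu_data *rdp)
{
        /* Fold the bypass list, lazy callbacks included, into ->cblist... */
        flush_bypass_to_cblist(rdp);

        /*
         * ...then queue the barrier callback behind them, so the grace
         * period it waits on also covers every formerly-lazy callback.
         */
        enqueue_barrier_callback(rdp);
}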

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation
  2022-06-01 14:12       ` Frederic Weisbecker
@ 2022-06-01 19:10         ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-06-01 19:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Uladzislau Rezki, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Paul E. McKenney, Steven Rostedt

On Wed, Jun 1, 2022 at 10:12 AM Frederic Weisbecker <frederic@kernel.org> wrote:
>
> On Mon, May 30, 2022 at 10:54:26AM -0400, Joel Fernandes wrote:
> > >> +void call_rcu_lazy(struct rcu_head *head_rcu, rcu_callback_t func)
> > >> +{
> > >> +  struct lazy_rcu_head *head = (struct lazy_rcu_head *)head_rcu;
> > >> +  struct rcu_lazy_pcp *rlp;
> > >> +
> > >> +  preempt_disable();
> > >> +        rlp = this_cpu_ptr(&rcu_lazy_pcp_ins);
> > >> +  preempt_enable();
> > >>
> > > Can we get rid of such explicit disabling/enabling preemption?
> >
> > Ok I'll try. Last I checked, something needs to disable preemption to
> > prevent warnings with sampling the current processor ID.
>
> raw_cpu_ptr()

Indeed, thanks!

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (14 preceding siblings ...)
  2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
@ 2022-06-13 18:53 ` Joel Fernandes
  2022-06-13 22:48   ` Paul E. McKenney
  15 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-06-13 18:53 UTC (permalink / raw)
  To: rcu; +Cc: rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, paulmck, rostedt

On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> Hello!
> Please find the proof of concept version of call_rcu_lazy() attached. This
> gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> Rushikesh Kadam from Intel for investigating it with me.

Just a status update: we're reworking this code to use the bypass lists, which
takes care of a lot of corner cases. I've had to change the rcuog thread code
to handle wakeups a bit differently and such. Initial testing is looking good.
I am currently working on early flushing of the lazy CBs based on memory
pressure.
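
For the memory-pressure part, the rough shape I am trying is a shrinker that
drains the lazy lists when reclaim kicks in (a sketch only; lazy_rcu_flush_all()
and lazy_rcu_pending_count() are placeholders, not the real code):

#include <linux/shrinker.h>

static unsigned long lazy_rcu_shrink_scan(struct shrinker *shrink,
                                          struct shrink_control *sc)
{
        /* Flush every CPU's lazy list and report how many callbacks moved. */
        return lazy_rcu_flush_all();
}

static unsigned long lazy_rcu_shrink_count(struct shrinker *shrink,
                                           struct shrink_control *sc)
{
        /* Report how many lazy callbacks are currently pending. */
        return lazy_rcu_pending_count();
}

/* Registered from the lazy init path. */
static struct shrinker lazy_rcu_shrinker = {
        .count_objects  = lazy_rcu_shrink_count,
        .scan_objects   = lazy_rcu_shrink_scan,
        .seeks          = DEFAULT_SEEKS,
};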

Should be hopefully sending v2 soon!

thanks,

 - Joel



> 
> Some numbers below:
> 
> Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> The observation is that due to a 'trickle down' effect of RCU callbacks, the
> system is very lightly loaded but constantly running few RCU callbacks very
> often. This confuses the power management hardware that the system is active,
> when it is in fact idle.
> 
> For example, when ChromeOS screen is off and user is not doing anything on the
> system, we can see big power savings.
> Before:
> Pk%pc10 = 72.13
> PkgWatt = 0.58
> CorWatt = 0.04
> 
> After:
> Pk%pc10 = 81.28
> PkgWatt = 0.41
> CorWatt = 0.03
> 
> Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> can see that the display pipeline is constantly doing RCU callback queuing due
> to open/close of file descriptors associated with graphics buffers. This is
> attributed to the file_free_rcu() path which this patch series also touches.
> 
> This patch series adds a simple but effective, and lockless implementation of
> RCU callback batching. On memory pressure, timeout or queue growing too big, we
> initiate a flush of one or more per-CPU lists.
> 
> Similar results can be achieved by increasing jiffies_till_first_fqs, however
> that also has the effect of slowing down RCU. Especially I saw huge slow down
> of function graph tracer when increasing that.
> 
> One drawback of this series is, if another frequent RCU callback creeps up in
> the future, that's not lazy, then that will again hurt the power. However, I
> believe identifying and fixing those is a more reasonable approach than slowing
> RCU down for the whole system.
> 
> NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> at runtime to turn it on or off globally. It is default to on. Further, please
> use the sysctls in lazy.c for further tuning of parameters that effect the
> flushing.
> 
> Disclaimer 1: Don't boot your personal system on it yet anticipating power
> savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> I believe this series should be good for general testing.
> 
> Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> of review and agreements.
> 
> Joel Fernandes (Google) (14):
>   rcu: Add a lock-less lazy RCU implementation
>   workqueue: Add a lazy version of queue_rcu_work()
>   block/blk-ioc: Move call_rcu() to call_rcu_lazy()
>   cred: Move call_rcu() to call_rcu_lazy()
>   fs: Move call_rcu() to call_rcu_lazy() in some paths
>   kernel: Move various core kernel usages to call_rcu_lazy()
>   security: Move call_rcu() to call_rcu_lazy()
>   net/core: Move call_rcu() to call_rcu_lazy()
>   lib: Move call_rcu() to call_rcu_lazy()
>   kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
>   i915: Move call_rcu() to call_rcu_lazy()
>   rcu/kfree: remove useless monitor_todo flag
>   rcu/kfree: Fix kfree_rcu_shrink_count() return value
>   DEBUG: Toggle rcu_lazy and tune at runtime
> 
>  block/blk-ioc.c                            |   2 +-
>  drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
>  fs/dcache.c                                |   4 +-
>  fs/eventpoll.c                             |   2 +-
>  fs/file_table.c                            |   3 +-
>  fs/inode.c                                 |   2 +-
>  include/linux/rcupdate.h                   |   6 +
>  include/linux/sched/sysctl.h               |   4 +
>  include/linux/workqueue.h                  |   1 +
>  kernel/cred.c                              |   2 +-
>  kernel/exit.c                              |   2 +-
>  kernel/pid.c                               |   2 +-
>  kernel/rcu/Kconfig                         |   8 ++
>  kernel/rcu/Makefile                        |   1 +
>  kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
>  kernel/rcu/rcu.h                           |   5 +
>  kernel/rcu/tree.c                          |  28 ++--
>  kernel/sysctl.c                            |  23 ++++
>  kernel/time/posix-timers.c                 |   2 +-
>  kernel/workqueue.c                         |  25 ++++
>  lib/radix-tree.c                           |   2 +-
>  lib/xarray.c                               |   2 +-
>  net/core/dst.c                             |   2 +-
>  security/security.c                        |   2 +-
>  security/selinux/avc.c                     |   4 +-
>  25 files changed, 255 insertions(+), 34 deletions(-)
>  create mode 100644 kernel/rcu/lazy.c
> 
> -- 
> 2.36.0.550.gb090851708-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-06-13 18:53 ` Joel Fernandes
@ 2022-06-13 22:48   ` Paul E. McKenney
  2022-06-16 16:26     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-06-13 22:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic, rostedt

On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote:
> On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> > Hello!
> > Please find the proof of concept version of call_rcu_lazy() attached. This
> > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > Rushikesh Kadam from Intel for investigating it with me.
> 
> Just a status update: we're reworking this code to use the bypass lists, which
> takes care of a lot of corner cases. I've had to change the rcuog thread code
> to handle wakeups a bit differently and such. Initial testing is looking good.
> I am currently working on early flushing of the lazy CBs based on memory
> pressure.
> 
> Should be hopefully sending v2 soon!

Looking forward to seeing it!

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> 
> > 
> > Some numbers below:
> > 
> > Following are power savings we see on top of RCU_NOCB_CPU on an Intel platform.
> > The observation is that due to a 'trickle down' effect of RCU callbacks, the
> > system is very lightly loaded but constantly running few RCU callbacks very
> > often. This confuses the power management hardware that the system is active,
> > when it is in fact idle.
> > 
> > For example, when ChromeOS screen is off and user is not doing anything on the
> > system, we can see big power savings.
> > Before:
> > Pk%pc10 = 72.13
> > PkgWatt = 0.58
> > CorWatt = 0.04
> > 
> > After:
> > Pk%pc10 = 81.28
> > PkgWatt = 0.41
> > CorWatt = 0.03
> > 
> > Further, when ChromeOS screen is ON but system is idle or lightly loaded, we
> > can see that the display pipeline is constantly doing RCU callback queuing due
> > to open/close of file descriptors associated with graphics buffers. This is
> > attributed to the file_free_rcu() path which this patch series also touches.
> > 
> > This patch series adds a simple but effective, and lockless implementation of
> > RCU callback batching. On memory pressure, timeout or queue growing too big, we
> > initiate a flush of one or more per-CPU lists.
> > 
> > Similar results can be achieved by increasing jiffies_till_first_fqs, however
> > that also has the effect of slowing down RCU. Especially I saw huge slow down
> > of function graph tracer when increasing that.
> > 
> > One drawback of this series is, if another frequent RCU callback creeps up in
> > the future, that's not lazy, then that will again hurt the power. However, I
> > believe identifying and fixing those is a more reasonable approach than slowing
> > RCU down for the whole system.
> > 
> > NOTE: Add debug patch is added in the series toggle /proc/sys/kernel/rcu_lazy
> > at runtime to turn it on or off globally. It is default to on. Further, please
> > use the sysctls in lazy.c for further tuning of parameters that effect the
> > flushing.
> > 
> > Disclaimer 1: Don't boot your personal system on it yet anticipating power
> > savings, as TREE07 still causes RCU stalls and I am looking more into that, but
> > I believe this series should be good for general testing.
> > 
> > Disclaimer 2: I have intentionally not CC'd other subsystem maintainers (like
> > net, fs) to keep noise low and will CC them in the future after 1 or 2 rounds
> > of review and agreements.
> > 
> > Joel Fernandes (Google) (14):
> >   rcu: Add a lock-less lazy RCU implementation
> >   workqueue: Add a lazy version of queue_rcu_work()
> >   block/blk-ioc: Move call_rcu() to call_rcu_lazy()
> >   cred: Move call_rcu() to call_rcu_lazy()
> >   fs: Move call_rcu() to call_rcu_lazy() in some paths
> >   kernel: Move various core kernel usages to call_rcu_lazy()
> >   security: Move call_rcu() to call_rcu_lazy()
> >   net/core: Move call_rcu() to call_rcu_lazy()
> >   lib: Move call_rcu() to call_rcu_lazy()
> >   kfree/rcu: Queue RCU work via queue_rcu_work_lazy()
> >   i915: Move call_rcu() to call_rcu_lazy()
> >   rcu/kfree: remove useless monitor_todo flag
> >   rcu/kfree: Fix kfree_rcu_shrink_count() return value
> >   DEBUG: Toggle rcu_lazy and tune at runtime
> > 
> >  block/blk-ioc.c                            |   2 +-
> >  drivers/gpu/drm/i915/gem/i915_gem_object.c |   2 +-
> >  fs/dcache.c                                |   4 +-
> >  fs/eventpoll.c                             |   2 +-
> >  fs/file_table.c                            |   3 +-
> >  fs/inode.c                                 |   2 +-
> >  include/linux/rcupdate.h                   |   6 +
> >  include/linux/sched/sysctl.h               |   4 +
> >  include/linux/workqueue.h                  |   1 +
> >  kernel/cred.c                              |   2 +-
> >  kernel/exit.c                              |   2 +-
> >  kernel/pid.c                               |   2 +-
> >  kernel/rcu/Kconfig                         |   8 ++
> >  kernel/rcu/Makefile                        |   1 +
> >  kernel/rcu/lazy.c                          | 153 +++++++++++++++++++++
> >  kernel/rcu/rcu.h                           |   5 +
> >  kernel/rcu/tree.c                          |  28 ++--
> >  kernel/sysctl.c                            |  23 ++++
> >  kernel/time/posix-timers.c                 |   2 +-
> >  kernel/workqueue.c                         |  25 ++++
> >  lib/radix-tree.c                           |   2 +-
> >  lib/xarray.c                               |   2 +-
> >  net/core/dst.c                             |   2 +-
> >  security/security.c                        |   2 +-
> >  security/selinux/avc.c                     |   4 +-
> >  25 files changed, 255 insertions(+), 34 deletions(-)
> >  create mode 100644 kernel/rcu/lazy.c
> > 
> > -- 
> > 2.36.0.550.gb090851708-goog
> > 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-06-13 22:48   ` Paul E. McKenney
@ 2022-06-16 16:26     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-06-16 16:26 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Rushikesh S Kadam, Uladzislau Rezki (Sony),
	Neeraj upadhyay, Frederic Weisbecker, Steven Rostedt

On Mon, Jun 13, 2022 at 6:48 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>
> On Mon, Jun 13, 2022 at 06:53:27PM +0000, Joel Fernandes wrote:
> > On Thu, May 12, 2022 at 03:04:28AM +0000, Joel Fernandes (Google) wrote:
> > > Hello!
> > > Please find the proof of concept version of call_rcu_lazy() attached. This
> > > gives a lot of savings when the CPUs are relatively idle. Huge thanks to
> > > Rushikesh Kadam from Intel for investigating it with me.
> >
> > Just a status update: we're reworking this code to use the bypass lists, which
> > takes care of a lot of corner cases. I've had to change the rcuog thread code
> > to handle wakeups a bit differently and such. Initial testing is looking good.
> > I am currently working on early flushing of the lazy CBs based on memory
> > pressure.
> >
> > Should be hopefully sending v2 soon!
>
> Looking forward to seeing it!

Looks like rebasing on Paul's dev branch is needed for me to even apply and
test this on our downstream kernels. Why, you ask? Because our downstream
folks are indeed using upstream 5.19-rc3. =) The rebase finally succeeded and
now I can test lazy bypass downstream before sending out the next RFC...

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes
  2022-05-14 14:25                     ` Joel Fernandes
  2022-05-14 19:01                       ` Uladzislau Rezki
@ 2022-08-09  2:25                       ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-08-09  2:25 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Paul E. McKenney, rcu, Rushikesh S Kadam, Neeraj upadhyay,
	Frederic Weisbecker, Steven Rostedt

On Sat, May 14, 2022 at 10:25 AM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Fri, May 13, 2022 at 05:43:51PM +0200, Uladzislau Rezki wrote:
> > > On Fri, May 13, 2022 at 9:36 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > >
> > > > > > On Thu, May 12, 2022 at 10:37 AM Uladzislau Rezki <urezki@gmail.com> wrote:
> > > > > > >
> > > > > > > > On Thu, May 12, 2022 at 03:56:37PM +0200, Uladzislau Rezki wrote:
> > > > > > > > > Never mind. I port it into 5.10
> > > > > > > >
> > > > > > > > Oh, this is on mainline. Sorry about that. If you want I have a tree here for
> > > > > > > > 5.10 , although that does not have the kfree changes, everything else is
> > > > > > > > ditto.
> > > > > > > > https://github.com/joelagnel/linux-kernel/tree/rcu-nocb-4
> > > > > > > >
> > > > > > > No problem. The two kfree_rcu patches are not so important in this series.
> > > > > > > So I have backported them into my 5.10 kernel because the latest kernel
> > > > > > > is not so easy to bring up and run on my device :)
> > > > > >
> > > > > > Actually I was going to write here, apparently some tests are showing
> > > > > > kfree_rcu()->call_rcu_lazy() causing possible regression. So it is
> > > > > > good to drop those for your initial testing!
> > > > > >
> > > > > Yep, I dropped both. The one that makes use of call_rcu_lazy() seems not
> > > > > so important for kfree_rcu() because we do batch requests there anyway.
> > > > > One thing that I would like to improve in kfree_rcu() is a better utilization
> > > > > of page slots.
> > > > >
> > > > > I will share my results either tomorrow or on Monday. I hope that is fine.
> > > > >
> > > >
> > > > Here we go with some data on our Android handset that runs 5.10 kernel. The test
> > > > case i have checked was a "static image" use case. Condition is: screen ON with
> > > > disabled all connectivity.
> > > >
> > > > 1.
> > > > First data i took is how many wakeups cause an RCU subsystem during this test case
> > > > when everything is pretty idling. Duration is 360 seconds:
> > > >
> > > > <snip>
> > > > serezkiul@seldlx26095:~/data/call_rcu_lazy$ ./psp ./perf_360_sec_rcu_lazy_off.script | sort -nk 6 | grep rcu
> > >
> > > Nice! Do you mind sharing this script? I was just telling Rushikesh
> > > that we want something like this during testing. Appreciate it. Also,
> > > if we could dump timer wakeup reasons/callbacks, that would be awesome.
> > >
> > Please find it in the attachment. I wrote it once upon a time and make use of it
> > to parse "perf script" output, i.e. raw data. The file name is perf_script_parser.c,
> > so just compile it.
> >
> > How to use it:
> > 1. run perf: './perf sched record -a -- sleep "how much in sec you want to collect data"'
> > 2. ./perf script -i ./perf.data > foo.script
> > 3. ./perf_script_parser ./foo.script
>
> Thanks a lot for sharing this. I think it will be quite useful. FWIW, I also
> use "perf sched record" and "perf sched report --sort latency" to get wakeup
> latencies.
>
> > > FWIW, I wrote a BPF tool that periodically dumps callbacks and can
> > > share that with you on request as well. That is probably not in a
> > > shape for mainline though (Makefile missing and such).
> > >
> > Yep, please share!
>
> Sure, check out my bcc repo from here:
> https://github.com/joelagnel/bcc

Kind of random, but this is why I love sharing my tools with others. I
actually forgot that I checked in the code for my rcutop here, and
this email is how I found it, lol :)

Did you happen to try it out too btw? :)

Thanks,

Joel


>
> Build this project, then cd libbpf-tools and run make. This should produce a
> static binary 'rcutop' which you can push to Android. You have to build for
> ARM, which bcc should have instructions for. I have also included the rcutop
> diff at the end of this file for reference.
>
> > > > 2.
> > > > Please find attached two power plots for the same test case. One is for regular
> > > > use of call_rcu() and the other one is the "lazy" usage. There is a slight difference
> > > > in power, about 2mA. Even though it is rather small, it is detectable and consistent,
> > > > which is also important, so it proves the concept. Please note it might be more power
> > > > efficient on other arches and platforms, because of differences in HW design related
> > > > to the C-states of the CPU and the energy needed to enter/exit those deep power states.
> > >
> > > Nice! I wonder if you still have other frequent callbacks on your
> > > system that are getting queued during the tests. Could you dump the
> > > rcu_callbacks trace event and see if you have any CBs frequently
> > > called that the series did not address?
> > >
> > I have pretty much like this:
> > <snip>
> >     rcuop/2-33      [002] d..1  6172.420541: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [001] d..1  6173.131965: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.696540: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [003] d..1  6173.703695: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [001] d..1  6173.711607: rcu_batch_start: rcu_preempt CBs=1667 bl=13
> >     rcuop/1-26      [000] d..1  6175.619722: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6176.135844: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/3-40      [002] d..1  6176.303723: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [002] d..1  6176.519894: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6176.527895: rcu_batch_start: rcu_preempt CBs=273 bl=10
> >     rcuop/1-26      [003] d..1  6178.543729: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6178.551707: rcu_batch_start: rcu_preempt CBs=1317 bl=10
> >     rcuop/0-15      [003] d..1  6178.819698: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/0-15      [003] d..1  6178.827734: rcu_batch_start: rcu_preempt CBs=949 bl=10
> >     rcuop/3-40      [001] d..1  6179.203645: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [001] d..1  6179.455747: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/2-33      [002] d..1  6179.471725: rcu_batch_start: rcu_preempt CBs=1983 bl=15
> >     rcuop/1-26      [003] d..1  6181.287646: rcu_batch_start: rcu_preempt CBs=2048 bl=16
> >     rcuop/1-26      [003] d..1  6181.295607: rcu_batch_start: rcu_preempt CBs=55 bl=10
> > <snip>
> >
> > so almost everything is batched.
>
> Nice, glad to know this is happening even without the kfree_rcu() changes.
>
> > > Also, one more thing I was curious about is - do you see savings when
> > > you pin the rcu threads to the LITTLE CPUs of the system? The theory
> > > being, not disturbing the BIG CPUs which are more power hungry may let
> > > them go into a deeper idle state and save power (due to leakage
> > > current and so forth).
> > >
> > I did some experimenting with pinning the nocb threads to the little cluster. For
> > idle use cases I did not see any power gain. For heavy ones I see that the "big"
> > CPUs are also invoking callbacks and busy with them quite often. Probably I should
> > think of some use case where I can detect the power difference. If you have
> > something, please let me know.
>
> Yeah, probably screen off + audio playback might be a good one, because it
> lightly loads the CPUs.
>
> > > > So a front-lazy-batching is something worth to have, IMHO :)
> > >
> > > Exciting! Being lazy pays off sometimes ;-) ;-). If you are OK with
> > > it, we can add your data about your investigation to the LPC slides
> > > as well (with attribution to you).
> > >
> > No problem; since we will give a talk at LPC, the more data we have, the
> > more convincing we are :)
>
> I forget, but you did mention you are OK with presenting with us, right? It
> would be great if you present your data when we come to Android, if you are OK
> with it. I'll start a common slide deck soon and share it so that you,
> Rushikesh, and I can add slides to it and present together.
>
> thanks,
>
>  - Joel
>
> ---8<-----------------------
>
> From: Joel Fernandes <joelaf@google.com>
> Subject: [PATCH] rcutop
>
> Signed-off-by: Joel Fernandes <joelaf@google.com>
> ---
>  libbpf-tools/Makefile     |   1 +
>  libbpf-tools/rcutop.bpf.c |  56 ++++++++
>  libbpf-tools/rcutop.c     | 288 ++++++++++++++++++++++++++++++++++++++
>  libbpf-tools/rcutop.h     |   8 ++
>  4 files changed, 353 insertions(+)
>  create mode 100644 libbpf-tools/rcutop.bpf.c
>  create mode 100644 libbpf-tools/rcutop.c
>  create mode 100644 libbpf-tools/rcutop.h
>
> diff --git a/libbpf-tools/Makefile b/libbpf-tools/Makefile
> index e60ec409..0d4cdff2 100644
> --- a/libbpf-tools/Makefile
> +++ b/libbpf-tools/Makefile
> @@ -42,6 +42,7 @@ APPS = \
>         klockstat \
>         ksnoop \
>         llcstat \
> +       rcutop \
>         mountsnoop \
>         numamove \
>         offcputime \
> diff --git a/libbpf-tools/rcutop.bpf.c b/libbpf-tools/rcutop.bpf.c
> new file mode 100644
> index 00000000..8287bbe2
> --- /dev/null
> +++ b/libbpf-tools/rcutop.bpf.c
> @@ -0,0 +1,56 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +/* Copyright (c) 2021 Hengqi Chen */
> +#include <vmlinux.h>
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_core_read.h>
> +#include <bpf/bpf_tracing.h>
> +#include "rcutop.h"
> +#include "maps.bpf.h"
> +
> +#define MAX_ENTRIES    10240
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(max_entries, MAX_ENTRIES);
> +       __type(key, void *);
> +       __type(value, int);
> +} cbs_queued SEC(".maps");
> +
> +struct {
> +       __uint(type, BPF_MAP_TYPE_HASH);
> +       __uint(max_entries, MAX_ENTRIES);
> +       __type(key, void *);
> +       __type(value, int);
> +} cbs_executed SEC(".maps");
> +
> +SEC("tracepoint/rcu/rcu_callback")
> +int tracepoint_rcu_callback(struct trace_event_raw_rcu_callback* ctx)
> +{
> +       void *key = ctx->func;
> +       int *val = NULL;
> +       static const int zero;
> +
> +       val = bpf_map_lookup_or_try_init(&cbs_queued, &key, &zero);
> +       if (val) {
> +               __sync_fetch_and_add(val, 1);
> +       }
> +
> +       return 0;
> +}
> +
> +SEC("tracepoint/rcu/rcu_invoke_callback")
> +int tracepoint_rcu_invoke_callback(struct trace_event_raw_rcu_invoke_callback* ctx)
> +{
> +       void *key = ctx->func;
> +       int *val;
> +       int zero = 0;
> +
> +       val = bpf_map_lookup_or_try_init(&cbs_executed, (void *)&key, (void *)&zero);
> +       if (val) {
> +               __sync_fetch_and_add(val, 1);
> +       }
> +
> +       return 0;
> +}
> +
> +char LICENSE[] SEC("license") = "Dual BSD/GPL";
> diff --git a/libbpf-tools/rcutop.c b/libbpf-tools/rcutop.c
> new file mode 100644
> index 00000000..35795875
> --- /dev/null
> +++ b/libbpf-tools/rcutop.c
> @@ -0,0 +1,288 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +
> +/*
> + * rcutop
> + * Copyright (c) 2022 Joel Fernandes
> + *
> + * 05-May-2022   Joel Fernandes   Created this.
> + */
> +#include <argp.h>
> +#include <errno.h>
> +#include <signal.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <time.h>
> +#include <unistd.h>
> +
> +#include <bpf/libbpf.h>
> +#include <bpf/bpf.h>
> +#include "rcutop.h"
> +#include "rcutop.skel.h"
> +#include "btf_helpers.h"
> +#include "trace_helpers.h"
> +
> +#define warn(...) fprintf(stderr, __VA_ARGS__)
> +#define OUTPUT_ROWS_LIMIT 10240
> +
> +static volatile sig_atomic_t exiting = 0;
> +
> +static bool clear_screen = true;
> +static int output_rows = 20;
> +static int interval = 1;
> +static int count = 99999999;
> +static bool verbose = false;
> +
> +const char *argp_program_version = "rcutop 0.1";
> +const char *argp_program_bug_address =
> +"https://github.com/iovisor/bcc/tree/master/libbpf-tools";
> +const char argp_program_doc[] =
> +"Show RCU callback queuing and execution stats.\n"
> +"\n"
> +"USAGE: rcutop [-h] [interval] [count]\n"
> +"\n"
> +"EXAMPLES:\n"
> +"    rcutop            # rcu activity top, refresh every 1s\n"
> +"    rcutop 5 10       # 5s summaries, 10 times\n";
> +
> +static const struct argp_option opts[] = {
> +       { "noclear", 'C', NULL, 0, "Don't clear the screen" },
> +       { "rows", 'r', "ROWS", 0, "Maximum rows to print, default 20" },
> +       { "verbose", 'v', NULL, 0, "Verbose debug output" },
> +       { NULL, 'h', NULL, OPTION_HIDDEN, "Show the full help" },
> +       {},
> +};
> +
> +static error_t parse_arg(int key, char *arg, struct argp_state *state)
> +{
> +       long rows;
> +       static int pos_args;
> +
> +       switch (key) {
> +               case 'C':
> +                       clear_screen = false;
> +                       break;
> +               case 'v':
> +                       verbose = true;
> +                       break;
> +               case 'h':
> +                       argp_state_help(state, stderr, ARGP_HELP_STD_HELP);
> +                       break;
> +               case 'r':
> +                       errno = 0;
> +                       rows = strtol(arg, NULL, 10);
> +                       if (errno || rows <= 0) {
> +                               warn("invalid rows: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       output_rows = rows;
> +                       if (output_rows > OUTPUT_ROWS_LIMIT)
> +                               output_rows = OUTPUT_ROWS_LIMIT;
> +                       break;
> +               case ARGP_KEY_ARG:
> +                       errno = 0;
> +                       if (pos_args == 0) {
> +                               interval = strtol(arg, NULL, 10);
> +                               if (errno || interval <= 0) {
> +                                       warn("invalid interval\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else if (pos_args == 1) {
> +                               count = strtol(arg, NULL, 10);
> +                               if (errno || count <= 0) {
> +                                       warn("invalid count\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else {
> +                               warn("unrecognized positional argument: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       pos_args++;
> +                       break;
> +               default:
> +                       return ARGP_ERR_UNKNOWN;
> +       }
> +       return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +       if (level == LIBBPF_DEBUG && !verbose)
> +               return 0;
> +       return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +       exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +               struct rcutop_bpf *obj)
> +{
> +       void *key, **prev_key = NULL;
> +       int n, err = 0;
> +       int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +       int efd = bpf_map__fd(obj->maps.cbs_executed);
> +       const struct ksym *ksym;
> +       FILE *f;
> +       time_t t;
> +       struct tm *tm;
> +       char ts[16], buf[256];
> +
> +       f = fopen("/proc/loadavg", "r");
> +       if (f) {
> +               time(&t);
> +               tm = localtime(&t);
> +               strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +               memset(buf, 0, sizeof(buf));
> +               n = fread(buf, 1, sizeof(buf), f);
> +               if (n)
> +                       printf("%8s loadavg: %s\n", ts, buf);
> +               fclose(f);
> +       }
> +
> +       printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +       while (1) {
> +               int qcount = 0, ecount = 0;
> +
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +               if (err) {
> +                       warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               prev_key = &key;
> +
> +               bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +               ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +               printf("%-32s %-6d %-6d\n",
> +                               ksym ? ksym->name : "Unknown",
> +                               qcount, ecount);
> +       }
> +       printf("\n");
> +       prev_key = NULL;
> +       while (1) {
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               err = bpf_map_delete_elem(qfd, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               continue;
> +                       }
> +                       warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               bpf_map_delete_elem(efd, &key);
> +               prev_key = &key;
> +       }
> +
> +       return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +       static const struct argp argp = {
> +               .options = opts,
> +               .parser = parse_arg,
> +               .doc = argp_program_doc,
> +       };
> +       struct rcutop_bpf *obj;
> +       int err;
> +       struct syms_cache *syms_cache = NULL;
> +       struct ksyms *ksyms = NULL;
> +
> +       err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +       if (err)
> +               return err;
> +
> +       libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +       libbpf_set_print(libbpf_print_fn);
> +
> +       err = ensure_core_btf(&open_opts);
> +       if (err) {
> +               fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +               return 1;
> +       }
> +
> +       obj = rcutop_bpf__open_opts(&open_opts);
> +       if (!obj) {
> +               warn("failed to open BPF object\n");
> +               return 1;
> +       }
> +
> +       err = rcutop_bpf__load(obj);
> +       if (err) {
> +               warn("failed to load BPF object: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       err = rcutop_bpf__attach(obj);
> +       if (err) {
> +               warn("failed to attach BPF programs: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       ksyms = ksyms__load();
> +       if (!ksyms) {
> +               fprintf(stderr, "failed to load kallsyms\n");
> +               goto cleanup;
> +       }
> +
> +       syms_cache = syms_cache__new(0);
> +       if (!syms_cache) {
> +               fprintf(stderr, "failed to create syms_cache\n");
> +               goto cleanup;
> +       }
> +
> +       if (signal(SIGINT, sig_int) == SIG_ERR) {
> +               warn("can't set signal handler: %s\n", strerror(errno));
> +               err = 1;
> +               goto cleanup;
> +       }
> +
> +       while (1) {
> +               sleep(interval);
> +
> +               if (clear_screen) {
> +                       err = system("clear");
> +                       if (err)
> +                               goto cleanup;
> +               }
> +
> +               err = print_stat(ksyms, syms_cache, obj);
> +               if (err)
> +                       goto cleanup;
> +
> +               count--;
> +               if (exiting || !count)
> +                       goto cleanup;
> +       }
> +
> +cleanup:
> +       rcutop_bpf__destroy(obj);
> +       cleanup_core_btf(&open_opts);
> +
> +       return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX       4096
> +#define TASK_COMM_LEN  16
> +
> +#endif /* __RCUTOP_H */
> --
> 2.36.0.550.gb090851708-goog
>

> +                       if (errno || rows <= 0) {
> +                               warn("invalid rows: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       output_rows = rows;
> +                       if (output_rows > OUTPUT_ROWS_LIMIT)
> +                               output_rows = OUTPUT_ROWS_LIMIT;
> +                       break;
> +               case ARGP_KEY_ARG:
> +                       errno = 0;
> +                       if (pos_args == 0) {
> +                               interval = strtol(arg, NULL, 10);
> +                               if (errno || interval <= 0) {
> +                                       warn("invalid interval\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else if (pos_args == 1) {
> +                               count = strtol(arg, NULL, 10);
> +                               if (errno || count <= 0) {
> +                                       warn("invalid count\n");
> +                                       argp_usage(state);
> +                               }
> +                       } else {
> +                               warn("unrecognized positional argument: %s\n", arg);
> +                               argp_usage(state);
> +                       }
> +                       pos_args++;
> +                       break;
> +               default:
> +                       return ARGP_ERR_UNKNOWN;
> +       }
> +       return 0;
> +}
> +
> +static int libbpf_print_fn(enum libbpf_print_level level, const char *format, va_list args)
> +{
> +       if (level == LIBBPF_DEBUG && !verbose)
> +               return 0;
> +       return vfprintf(stderr, format, args);
> +}
> +
> +static void sig_int(int signo)
> +{
> +       exiting = 1;
> +}
> +
> +static int print_stat(struct ksyms *ksyms, struct syms_cache *syms_cache,
> +               struct rcutop_bpf *obj)
> +{
> +       void *key, **prev_key = NULL;
> +       int n, err = 0;
> +       int qfd = bpf_map__fd(obj->maps.cbs_queued);
> +       int efd = bpf_map__fd(obj->maps.cbs_executed);
> +       const struct ksym *ksym;
> +       FILE *f;
> +       time_t t;
> +       struct tm *tm;
> +       char ts[16], buf[256];
> +
> +       f = fopen("/proc/loadavg", "r");
> +       if (f) {
> +               time(&t);
> +               tm = localtime(&t);
> +               strftime(ts, sizeof(ts), "%H:%M:%S", tm);
> +               memset(buf, 0, sizeof(buf));
> +               n = fread(buf, 1, sizeof(buf), f);
> +               if (n)
> +                       printf("%8s loadavg: %s\n", ts, buf);
> +               fclose(f);
> +       }
> +
> +       printf("%-32s %-6s %-6s\n", "Callback", "Queued", "Executed");
> +
> +       while (1) {
> +               int qcount = 0, ecount = 0;
> +
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               err = bpf_map_lookup_elem(qfd, &key, &qcount);
> +               if (err) {
> +                       warn("bpf_map_lookup_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               prev_key = &key;
> +
> +               bpf_map_lookup_elem(efd, &key, &ecount);
> +
> +               ksym = ksyms__map_addr(ksyms, (unsigned long)key);
> +               printf("%-32s %-6d %-6d\n",
> +                               ksym ? ksym->name : "Unknown",
> +                               qcount, ecount);
> +       }
> +       printf("\n");
> +       prev_key = NULL;
> +       while (1) {
> +               err = bpf_map_get_next_key(qfd, prev_key, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               break;
> +                       }
> +                       warn("bpf_map_get_next_key failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +               err = bpf_map_delete_elem(qfd, &key);
> +               if (err) {
> +                       if (errno == ENOENT) {
> +                               err = 0;
> +                               continue;
> +                       }
> +                       warn("bpf_map_delete_elem failed: %s\n", strerror(errno));
> +                       return err;
> +               }
> +
> +               bpf_map_delete_elem(efd, &key);
> +               prev_key = &key;
> +       }
> +
> +       return err;
> +}
> +
> +int main(int argc, char **argv)
> +{
> +       LIBBPF_OPTS(bpf_object_open_opts, open_opts);
> +       static const struct argp argp = {
> +               .options = opts,
> +               .parser = parse_arg,
> +               .doc = argp_program_doc,
> +       };
> +       struct rcutop_bpf *obj;
> +       int err;
> +       struct syms_cache *syms_cache = NULL;
> +       struct ksyms *ksyms = NULL;
> +
> +       err = argp_parse(&argp, argc, argv, 0, NULL, NULL);
> +       if (err)
> +               return err;
> +
> +       libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
> +       libbpf_set_print(libbpf_print_fn);
> +
> +       err = ensure_core_btf(&open_opts);
> +       if (err) {
> +               fprintf(stderr, "failed to fetch necessary BTF for CO-RE: %s\n", strerror(-err));
> +               return 1;
> +       }
> +
> +       obj = rcutop_bpf__open_opts(&open_opts);
> +       if (!obj) {
> +               warn("failed to open BPF object\n");
> +               return 1;
> +       }
> +
> +       err = rcutop_bpf__load(obj);
> +       if (err) {
> +               warn("failed to load BPF object: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       err = rcutop_bpf__attach(obj);
> +       if (err) {
> +               warn("failed to attach BPF programs: %d\n", err);
> +               goto cleanup;
> +       }
> +
> +       ksyms = ksyms__load();
> +       if (!ksyms) {
> +               fprintf(stderr, "failed to load kallsyms\n");
> +               goto cleanup;
> +       }
> +
> +       syms_cache = syms_cache__new(0);
> +       if (!syms_cache) {
> +               fprintf(stderr, "failed to create syms_cache\n");
> +               goto cleanup;
> +       }
> +
> +       if (signal(SIGINT, sig_int) == SIG_ERR) {
> +               warn("can't set signal handler: %s\n", strerror(errno));
> +               err = 1;
> +               goto cleanup;
> +       }
> +
> +       while (1) {
> +               sleep(interval);
> +
> +               if (clear_screen) {
> +                       err = system("clear");
> +                       if (err)
> +                               goto cleanup;
> +               }
> +
> +               err = print_stat(ksyms, syms_cache, obj);
> +               if (err)
> +                       goto cleanup;
> +
> +               count--;
> +               if (exiting || !count)
> +                       goto cleanup;
> +       }
> +
> +cleanup:
> +       rcutop_bpf__destroy(obj);
> +       cleanup_core_btf(&open_opts);
> +
> +       return err != 0;
> +}
> diff --git a/libbpf-tools/rcutop.h b/libbpf-tools/rcutop.h
> new file mode 100644
> index 00000000..cb2a3557
> --- /dev/null
> +++ b/libbpf-tools/rcutop.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause) */
> +#ifndef __RCUTOP_H
> +#define __RCUTOP_H
> +
> +#define PATH_MAX       4096
> +#define TASK_COMM_LEN  16
> +
> +#endif /* __RCUTOP_H */
> --
> 2.36.0.550.gb090851708-goog
>
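
For anyone who wants to try the tool: it builds like the other libbpf-tools
(a sketch, assuming a bcc checkout with this patch applied and the usual
libbpf-tools build dependencies such as clang and bpftool):

  cd libbpf-tools
  make rcutop
  sudo ./rcutop 5 10      # 5s summaries, 10 times, per the usage text above

It prints per-callback-function counts of how many callbacks were queued
(rcu_callback tracepoint) versus invoked (rcu_invoke_callback tracepoint)
during each interval, then clears the maps for the next interval.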



Thread overview: 73+ messages
2022-05-12  3:04 [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 01/14] rcu: Add a lock-less lazy RCU implementation Joel Fernandes (Google)
2022-05-12 23:56   ` Paul E. McKenney
2022-05-14 15:08     ` Joel Fernandes
2022-05-14 16:34       ` Paul E. McKenney
2022-05-27 23:12         ` Joel Fernandes
2022-05-28 17:57           ` Paul E. McKenney
2022-05-30 14:48             ` Joel Fernandes
2022-05-30 16:42               ` Paul E. McKenney
2022-05-31  2:12                 ` Joel Fernandes
2022-05-31  4:26                   ` Paul E. McKenney
2022-05-31 16:11                     ` Joel Fernandes
2022-05-31 16:45                       ` Paul E. McKenney
2022-05-31 18:51                         ` Joel Fernandes
2022-05-31 19:25                           ` Paul E. McKenney
2022-05-31 21:29                             ` Joel Fernandes
2022-05-31 22:44                               ` Joel Fernandes
2022-06-01 14:24     ` Frederic Weisbecker
2022-06-01 16:17       ` Paul E. McKenney
2022-06-01 19:09       ` Joel Fernandes
2022-05-17  9:07   ` Uladzislau Rezki
2022-05-30 14:54     ` Joel Fernandes
2022-06-01 14:12       ` Frederic Weisbecker
2022-06-01 19:10         ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 02/14] workqueue: Add a lazy version of queue_rcu_work() Joel Fernandes (Google)
2022-05-12 23:58   ` Paul E. McKenney
2022-05-14 14:44     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 03/14] block/blk-ioc: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
2022-05-13  0:00   ` Paul E. McKenney
2022-05-12  3:04 ` [RFC v1 04/14] cred: " Joel Fernandes (Google)
2022-05-13  0:02   ` Paul E. McKenney
2022-05-14 14:41     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 05/14] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
2022-05-13  0:07   ` Paul E. McKenney
2022-05-14 14:40     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 06/14] kernel: Move various core kernel usages to call_rcu_lazy() Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 07/14] security: Move call_rcu() " Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 08/14] net/core: " Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 09/14] lib: " Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 10/14] kfree/rcu: Queue RCU work via queue_rcu_work_lazy() Joel Fernandes (Google)
2022-05-13  0:12   ` Paul E. McKenney
2022-05-13 14:55     ` Uladzislau Rezki
2022-05-14 14:33       ` Joel Fernandes
2022-05-14 19:10         ` Uladzislau Rezki
2022-05-12  3:04 ` [RFC v1 11/14] i915: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
2022-05-12  3:04 ` [RFC v1 12/14] rcu/kfree: remove useless monitor_todo flag Joel Fernandes (Google)
2022-05-13 14:53   ` Uladzislau Rezki
2022-05-14 14:35     ` Joel Fernandes
2022-05-14 19:48       ` Uladzislau Rezki
2022-05-12  3:04 ` [RFC v1 13/14] rcu/kfree: Fix kfree_rcu_shrink_count() return value Joel Fernandes (Google)
2022-05-13 14:54   ` Uladzislau Rezki
2022-05-14 14:34     ` Joel Fernandes
2022-05-12  3:04 ` [RFC v1 14/14] DEBUG: Toggle rcu_lazy and tune at runtime Joel Fernandes (Google)
2022-05-13  0:16   ` Paul E. McKenney
2022-05-14 14:38     ` Joel Fernandes
2022-05-14 16:21       ` Paul E. McKenney
2022-05-12  3:17 ` [RFC v1 00/14] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes
2022-05-12 13:09   ` Uladzislau Rezki
2022-05-12 13:56     ` Uladzislau Rezki
2022-05-12 14:03       ` Joel Fernandes
2022-05-12 14:37         ` Uladzislau Rezki
2022-05-12 16:09           ` Joel Fernandes
2022-05-12 16:32             ` Uladzislau Rezki
     [not found]               ` <Yn5e7w8NWzThUARb@pc638.lan>
2022-05-13 14:51                 ` Joel Fernandes
2022-05-13 15:43                   ` Uladzislau Rezki
2022-05-14 14:25                     ` Joel Fernandes
2022-05-14 19:01                       ` Uladzislau Rezki
2022-08-09  2:25                       ` Joel Fernandes
2022-05-13  0:23   ` Paul E. McKenney
2022-05-13 14:45     ` Joel Fernandes
2022-06-13 18:53 ` Joel Fernandes
2022-06-13 22:48   ` Paul E. McKenney
2022-06-16 16:26     ` Joel Fernandes
