linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes
@ 2022-09-01 22:17 Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 01/18] mm/slub: perform free consistency checks before call_rcu Joel Fernandes (Google)
                   ` (18 more replies)
  0 siblings, 19 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

Here is v5 of call_rcu_lazy() based on the latest RCU -dev branch. Main changes are:
- moved length field into rcu_data (Frederic suggestion)
- added new traces to aid debugging and testing.
	- the new trace patch (along with the rcuscale and rcutorture tests)
	  gives confidence that the patches work well. It is also tested on
	  real ChromeOS hardware, and the boot time looks good even though
	  lazy callbacks are being queued (i.e. the lazy ones do not affect the
	  synchronous non-lazy ones that affect boot time).
- rewrote some parts of the core patch.
- for rcutop, please apply the diff in the following link to the BCC repo:
  https://lore.kernel.org/r/Yn+79v4q+1c9lXdc@google.com
  Then, cd libbpf-tools/ and run make to build the rcutop static binary.
  (If you need an x86 binary, ping me and I'll email you).
  In the future, I will attempt to make rcutop buildable within the kernel repo.
  This is already done for another tool (see tools/bpf/runqslower), so it is doable.

The 2 mm patches are what Vlastimil pulled into slab-next. I included them in
this series so that the tracing patch builds.
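
For anyone curious what a conversion looks like on the caller side, here is a
minimal sketch (struct foo, free_foo_rcu() and release_foo() are made-up names
for illustration; the call_rcu_lazy() prototype is the one added by the core
patch of this series, and it falls back to call_rcu() when CONFIG_RCU_LAZY=n):

#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	/* ... payload ... */
	struct rcu_head rcu;
};

static void free_foo_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void release_foo(struct foo *fp)
{
	/* Before: call_rcu(&fp->rcu, free_foo_rcu); */
	call_rcu_lazy(&fp->rcu, free_foo_rcu);
}

Since such reclaim is not latency sensitive, the callback can be batched and
invoked much later, which is what saves power.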

Previous series was posted here:
 https://lore.kernel.org/all/20220713213237.1596225-1-joel@joelfernandes.org/

Linked below [1] is some power data I collected with Turbostat on an x86
ChromeOS ADL machine. The numbers are not based on -next, but rather the 5.19
kernel, as that's what boots on my ChromeOS machine.

These columns are output by Turbostat when running:
turbostat -S -s PkgWatt,CorWatt --interval 5
PkgWatt - summary of package power in Watts over a 5 second interval.
CorWatt - summary of core power in Watts over a 5 second interval.

[1] https://lore.kernel.org/r/b4145008-6840-fc69-a6d6-e38bc218009d@joelfernandes.org

Joel Fernandes (Google) (15):
rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
rcu: Fix late wakeup when flush of bypass cblist happens
rcu: Move trace_rcu_callback() before bypassing
rcu: Introduce call_rcu_lazy() API implementation
rcu: Add per-CB tracing for queuing, flush and invocation.
rcuscale: Add laziness and kfree tests
rcutorture: Add test code for call_rcu_lazy()
fs: Move call_rcu() to call_rcu_lazy() in some paths
cred: Move call_rcu() to call_rcu_lazy()
security: Move call_rcu() to call_rcu_lazy()
net/core: Move call_rcu() to call_rcu_lazy()
kernel: Move various core kernel usages to call_rcu_lazy()
lib: Move call_rcu() to call_rcu_lazy()
i915: Move call_rcu() to call_rcu_lazy()
fork: Move thread_stack_free_rcu() to call_rcu_lazy()

Vineeth Pillai (1):
rcu: shrinker for lazy rcu

Vlastimil Babka (2):
mm/slub: perform free consistency checks before call_rcu
mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head

drivers/gpu/drm/i915/gem/i915_gem_object.c    |   2 +-
fs/dcache.c                                   |   4 +-
fs/eventpoll.c                                |   2 +-
fs/file_table.c                               |   2 +-
fs/inode.c                                    |   2 +-
include/linux/rcupdate.h                      |   6 +
include/linux/types.h                         |  44 +++
include/trace/events/rcu.h                    |  69 ++++-
kernel/cred.c                                 |   2 +-
kernel/exit.c                                 |   2 +-
kernel/fork.c                                 |   6 +-
kernel/pid.c                                  |   2 +-
kernel/rcu/Kconfig                            |  19 ++
kernel/rcu/rcu.h                              |  12 +
kernel/rcu/rcu_segcblist.c                    |  23 +-
kernel/rcu/rcu_segcblist.h                    |   8 +
kernel/rcu/rcuscale.c                         |  74 ++++-
kernel/rcu/rcutorture.c                       |  60 +++-
kernel/rcu/tree.c                             | 187 ++++++++----
kernel/rcu/tree.h                             |  13 +-
kernel/rcu/tree_nocb.h                        | 282 +++++++++++++++---
kernel/time/posix-timers.c                    |   2 +-
lib/radix-tree.c                              |   2 +-
lib/xarray.c                                  |   2 +-
mm/slab.h                                     |  54 ++--
mm/slub.c                                     |  20 +-
net/core/dst.c                                |   2 +-
security/security.c                           |   2 +-
security/selinux/avc.c                        |   4 +-
.../selftests/rcutorture/configs/rcu/CFLIST   |   1 +
.../selftests/rcutorture/configs/rcu/TREE11   |  18 ++
.../rcutorture/configs/rcu/TREE11.boot        |   8 +
32 files changed, 783 insertions(+), 153 deletions(-)
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

--
2.37.2.789.g6183377224-goog



* [PATCH v5 01/18] mm/slub: perform free consistency checks before call_rcu
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head Joel Fernandes (Google)
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Vlastimil Babka

From: Vlastimil Babka <vbabka@suse.cz>

For SLAB_TYPESAFE_BY_RCU caches we use call_rcu to perform empty slab
freeing. The RCU callback rcu_free_slab() calls __free_slab(), which
currently includes checking the slab's consistency for caches with the
SLAB_CONSISTENCY_CHECKS flag. This check needs the slab->objects field
to be intact.

Because in the next patch we want to allow rcu_head in struct slab to
become larger in debug configurations and thus potentially overwrite
more fields through a union than slab_list, we want to limit the fields
used in rcu_free_slab().  Thus move the consistency checks to
free_slab() before call_rcu(). This can be done safely even for
SLAB_TYPESAFE_BY_RCU caches where accesses to the objects can still
occur after freeing them.

As a result, only the slab->slab_cache field has to be physically
separate from rcu_head for the freeing callback to work. We also save
some cycles in the rcu callback for caches with consistency checks
enabled.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slub.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 862dbd9af4f5..d86be1b0d09f 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2036,14 +2036,6 @@ static void __free_slab(struct kmem_cache *s, struct slab *slab)
 	int order = folio_order(folio);
 	int pages = 1 << order;
 
-	if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
-		void *p;
-
-		slab_pad_check(s, slab);
-		for_each_object(p, s, slab_address(slab), slab->objects)
-			check_object(s, slab, p, SLUB_RED_INACTIVE);
-	}
-
 	__slab_clear_pfmemalloc(slab);
 	__folio_clear_slab(folio);
 	folio->mapping = NULL;
@@ -2062,9 +2054,17 @@ static void rcu_free_slab(struct rcu_head *h)
 
 static void free_slab(struct kmem_cache *s, struct slab *slab)
 {
-	if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU)) {
+	if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
+		void *p;
+
+		slab_pad_check(s, slab);
+		for_each_object(p, s, slab_address(slab), slab->objects)
+			check_object(s, slab, p, SLUB_RED_INACTIVE);
+	}
+
+	if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
 		call_rcu(&slab->rcu_head, rcu_free_slab);
-	} else
+	else
 		__free_slab(s, slab);
 }
 
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 01/18] mm/slub: perform free consistency checks before call_rcu Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-02  9:26   ` Vlastimil Babka
  2022-09-01 22:17 ` [PATCH v5 03/18] rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask Joel Fernandes (Google)
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Vlastimil Babka,
	Joel Fernandes

From: Vlastimil Babka <vbabka@suse.cz>

Joel reports [1] that increasing the rcu_head size for debugging
purposes used to work before struct slab was split from struct page, but
now runs into the various SLAB_MATCH() sanity checks of the layout.

This is because the rcu_head in struct page is in union with large
sub-structures and has space to grow without exceeding their size, while
in struct slab (for SLAB and SLUB) it's in union only with a list_head.

On closer inspection (and after the previous patch) we can put all
fields except slab_cache into a union with rcu_head, as slab_cache is
sufficient for the RCU freeing callbacks to work and the rest can be
overwritten by rcu_head without causing issues.

This is only somewhat complicated by the need to keep SLUB's
freelist+counters aligned for cmpxchg_double. As a result the fields
need to be reordered so that slab_cache is first (after page flags) and
the union with rcu_head follows. For consistency, do that for SLAB as
well, although it's not necessary there.

As a result, the rcu_head field in struct page and struct slab is no
longer at the same offset, but that doesn't matter as there is no
casting that would rely on that in the slab freeing callbacks, so we can
just drop the respective SLAB_MATCH() check.

Also we need to update the SLAB_MATCH() for compound_head to reflect the
new ordering.

While at it, also add a static_assert to check the alignment needed for
cmpxchg_double so mistakes are found sooner than a runtime GPF.

[1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/

Reported-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/slab.h | 54 ++++++++++++++++++++++++++++++++----------------------
 1 file changed, 32 insertions(+), 22 deletions(-)

diff --git a/mm/slab.h b/mm/slab.h
index 4ec82bec15ec..2c248864ea91 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -11,37 +11,43 @@ struct slab {
 
 #if defined(CONFIG_SLAB)
 
+	struct kmem_cache *slab_cache;
 	union {
-		struct list_head slab_list;
+		struct {
+			struct list_head slab_list;
+			void *freelist;	/* array of free object indexes */
+			void *s_mem;	/* first object */
+		};
 		struct rcu_head rcu_head;
 	};
-	struct kmem_cache *slab_cache;
-	void *freelist;	/* array of free object indexes */
-	void *s_mem;	/* first object */
 	unsigned int active;
 
 #elif defined(CONFIG_SLUB)
 
-	union {
-		struct list_head slab_list;
-		struct rcu_head rcu_head;
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-		struct {
-			struct slab *next;
-			int slabs;	/* Nr of slabs left */
-		};
-#endif
-	};
 	struct kmem_cache *slab_cache;
-	/* Double-word boundary */
-	void *freelist;		/* first free object */
 	union {
-		unsigned long counters;
 		struct {
-			unsigned inuse:16;
-			unsigned objects:15;
-			unsigned frozen:1;
+			union {
+				struct list_head slab_list;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+				struct {
+					struct slab *next;
+					int slabs;	/* Nr of slabs left */
+				};
+#endif
+			};
+			/* Double-word boundary */
+			void *freelist;		/* first free object */
+			union {
+				unsigned long counters;
+				struct {
+					unsigned inuse:16;
+					unsigned objects:15;
+					unsigned frozen:1;
+				};
+			};
 		};
+		struct rcu_head rcu_head;
 	};
 	unsigned int __unused;
 
@@ -66,9 +72,10 @@ struct slab {
 #define SLAB_MATCH(pg, sl)						\
 	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
 SLAB_MATCH(flags, __page_flags);
-SLAB_MATCH(compound_head, slab_list);	/* Ensure bit 0 is clear */
 #ifndef CONFIG_SLOB
-SLAB_MATCH(rcu_head, rcu_head);
+SLAB_MATCH(compound_head, slab_cache);	/* Ensure bit 0 is clear */
+#else
+SLAB_MATCH(compound_head, slab_list);	/* Ensure bit 0 is clear */
 #endif
 SLAB_MATCH(_refcount, __page_refcount);
 #ifdef CONFIG_MEMCG
@@ -76,6 +83,9 @@ SLAB_MATCH(memcg_data, memcg_data);
 #endif
 #undef SLAB_MATCH
 static_assert(sizeof(struct slab) <= sizeof(struct page));
+#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && defined(CONFIG_SLUB)
+static_assert(IS_ALIGNED(offsetof(struct slab, freelist), 16));
+#endif
 
 /**
  * folio_slab - Converts from folio to slab.
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 03/18] rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 01/18] mm/slub: perform free consistency checks before call_rcu Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-06 22:26   ` Boqun Feng
  2022-09-01 22:17 ` [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens Joel Fernandes (Google)
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

The rnp->qsmask field is accessed locklessly from rcutree_dying_cpu(), so
mark that access with READ_ONCE(). This may help avoid load/store tearing
due to concurrent access, quiet KCSAN, and preserve the sanity of anyone
reading the mask in traces.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 0ca21ac0f064..5ec97e3f7468 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2106,7 +2106,7 @@ int rcutree_dying_cpu(unsigned int cpu)
 	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
 		return 0;
 
-	blkd = !!(rnp->qsmask & rdp->grpmask);
+	blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
 	trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
 			       blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
 	return 0;
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (2 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 03/18] rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-02 11:35   ` Frederic Weisbecker
                     ` (2 more replies)
  2022-09-01 22:17 ` [PATCH v5 05/18] rcu: Move trace_rcu_callback() before bypassing Joel Fernandes (Google)
                   ` (14 subsequent siblings)
  18 siblings, 3 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

When the bypass cblist gets too big or its timeout has occurred, it is
flushed into the main cblist. However, the bypass timer is still running
and the behavior is that it would eventually expire and wake the GP
thread.

Since we are going to use the bypass cblist for lazy CBs, do the wakeup
as soon as the flush happens. Otherwise, the lazy timer will go off much
later and the now-non-lazy cblist CBs can get stranded for the duration
of the timer.

This is a good thing to do anyway (regardless of this series), since it
makes the behavior consistent with that of other code paths, where
queueing something onto the ->cblist quickly puts the GP kthread into a
non-sleeping state.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 0a5f0ef41484..31068dd31315 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 			rcu_advance_cbs_nowake(rdp->mynode, rdp);
 			rdp->nocb_gp_adv_time = j;
 		}
-		rcu_nocb_unlock_irqrestore(rdp, flags);
+
+		// The flush succeeded and we moved CBs into the ->cblist.
+		// However, the bypass timer might still be running. Wake up
+		// the GP thread by calling a helper with was_all_done set so
+		// that the wakeup happens (needed if the main CB list was empty before).
+		__call_rcu_nocb_wake(rdp, true, flags);
+
 		return true; // Callback already enqueued.
 	}
 
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 05/18] rcu: Move trace_rcu_callback() before bypassing
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (3 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

If any CB is queued into the bypass list, then trace_rcu_callback() does
not show it. This makes it unclear when a callback was actually queued,
as you only end up getting a trace_rcu_invoke_callback() trace. Fix it
by moving trace_rcu_callback() before the call to rcu_nocb_try_bypass().

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5ec97e3f7468..9fe581be8696 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2809,10 +2809,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	}
 
 	check_cb_ovld(rdp);
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
-		return; // Enqueued onto ->nocb_bypass, so just leave.
-	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
-	rcu_segcblist_enqueue(&rdp->cblist, head);
+
 	if (__is_kvfree_rcu_offset((unsigned long)func))
 		trace_rcu_kvfree_callback(rcu_state.name, head,
 					 (unsigned long)func,
@@ -2821,6 +2818,11 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+		return; // Enqueued onto ->nocb_bypass, so just leave.
+	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
+	rcu_segcblist_enqueue(&rdp->cblist, head);
+
 	trace_rcu_segcb_stats(&rdp->cblist, TPS("SegCBQueued"));
 
 	/* Go handle any RCU core processing required. */
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (4 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 05/18] rcu: Move trace_rcu_callback() before bypassing Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-02 15:21   ` Frederic Weisbecker
  2022-09-03 14:03   ` Paul E. McKenney
  2022-09-01 22:17 ` [PATCH v5 07/18] rcu: shrinker for lazy rcu Joel Fernandes (Google)
                   ` (12 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

Implement timer-based RCU lazy callback batching. The batch is flushed
whenever a certain amount of time has passed, or the batch on a
particular CPU grows too big. A future patch will also flush it under
memory pressure.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use the bypass lists to handle lazy CBs. The bypass
list length includes the lazy CB length. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.
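
For illustration, the flush condition this patch adds to rcu_nocb_try_bypass()
boils down to roughly the following (a simplified sketch using the identifiers
from the diff below, not a function that exists in the patch; the real code
also handles locking, the per-jiffy no-bypass throttle and the wakeups):

/*
 * Sketch: should the ->nocb_bypass list be flushed into the main ->cblist?
 * jiffies_till_flush defaults to LAZY_FLUSH_JIFFIES, i.e. 10 * HZ.
 */
static bool bypass_needs_flush(struct rcu_data *rdp, long ncbs,
			       bool bypass_is_lazy, unsigned long j)
{
	if (!ncbs)
		return false;
	if (ncbs >= qhimark)		/* Too many buffered callbacks. */
		return true;
	if (bypass_is_lazy)		/* All-lazy bypass: wait up to ~10s. */
		return time_after(j, READ_ONCE(rdp->nocb_bypass_first) +
				     jiffies_till_flush);
	/* Mixed or non-lazy bypass: flush once it is about a jiffy old. */
	return j != READ_ONCE(rdp->nocb_bypass_first);
}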

Suggested-by: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h   |   6 ++
 kernel/rcu/Kconfig         |   8 ++
 kernel/rcu/rcu.h           |  11 +++
 kernel/rcu/rcu_segcblist.c |   2 +-
 kernel/rcu/tree.c          | 130 +++++++++++++++---------
 kernel/rcu/tree.h          |  13 ++-
 kernel/rcu/tree_nocb.h     | 198 ++++++++++++++++++++++++++++++-------
 7 files changed, 280 insertions(+), 88 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d..82e8a07e0856 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
+#else
+#define call_rcu_lazy(head, func) call_rcu(head, func)
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index d471d22a5e21..3128d01427cb 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush them after a delay,
+	  under memory pressure, or when the callback list grows too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index be5979da07f5..94675f14efe8 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -474,6 +474,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);
@@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 			       unsigned long c_old,
 			       unsigned long c);
 void rcu_gp_set_torture_wait(int duration);
+void rcu_force_call_rcu_to_lazy(bool force);
+
 #else
 static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
 					  int *flags, unsigned long *gp_seq)
@@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 	do { } while (0)
 #endif
 static inline void rcu_gp_set_torture_wait(int duration) { }
+static inline void rcu_force_call_rcu_to_lazy(bool force) { }
 #endif
 
 #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index c54ea2b6a36b..55b50e592986 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -38,7 +38,7 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
  * element of the second rcu_cblist structure, but ensuring that the second
  * rcu_cblist structure, if initially non-empty, always appears non-empty
  * throughout the process.  If rdp is NULL, the second rcu_cblist structure
- * is instead initialized to empty.
+ * is instead initialized to empty. Also account for lazy_len for lazy CBs.
  */
 void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
 			      struct rcu_cblist *srclp,
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9fe581be8696..aaced29a0a71 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
 	raw_spin_unlock_rcu_node(rnp);
 }
 
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed.  However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested.  In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section.  On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu().  It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section.  Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
 	static atomic_t doublefrees;
 	unsigned long flags;
@@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
 	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
 
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't mind the callback being
+ * invoked after very long periods of time on systems without memory pressure
+ * and on systems which are lightly loaded or mostly idle.
+ *
+ * Other than the extra delay before callbacks are invoked, this function is
+ * identical to call_rcu() and reuses its logic. Refer to call_rcu() for more
+ * details about memory ordering and other functionality.
+ */
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu_lazy);
+#endif
+
+static bool force_call_rcu_to_lazy;
+
+void rcu_force_call_rcu_to_lazy(bool force)
+{
+	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
+		WRITE_ONCE(force_call_rcu_to_lazy, force);
+}
+EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.  However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (5 * HZ)
@@ -3904,7 +3943,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
+		     /* wake gp thread */ true));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
 	} else {
@@ -4325,7 +4365,7 @@ void rcutree_migrate_callbacks(int cpu)
 	my_rdp = this_cpu_ptr(&rcu_data);
 	my_rnp = my_rdp->mynode;
 	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false, false));
 	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
 	/* Leverage recent GPs and set GP for new callbacks. */
 	needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..946d819b23fc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -263,14 +263,18 @@ struct rcu_data {
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
 
+#ifdef CONFIG_RCU_LAZY
+	long lazy_len;			/* Length of buffered lazy callbacks. */
+#endif
 	int cpu;
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
 					/* For jiffies_till_first_fqs and */
@@ -440,9 +444,10 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, bool lazy, bool wakegp);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 31068dd31315..7e97a7b6e046 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
 
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * could be flushed much earlier for a number of other reasons;
+ * however, LAZY_FLUSH_JIFFIES ensures that no lazy callbacks are
+ * left unsubmitted to RCU after that many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -265,23 +290,39 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 {
 	unsigned long flags;
 	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
+	unsigned long mod_jif = 0;
 
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
-	 * Bypass wakeup overrides previous deferments. In case
-	 * of callback storm, no need to wake up too early.
+	 * Bypass and lazy wakeup overrides previous deferments. In case of
+	 * callback storm, no need to wake up too early.
 	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
-		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
-		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
-	} else {
+	switch (waketype) {
+	case RCU_NOCB_WAKE_LAZY:
+		if (rdp_gp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
+			mod_jif = jiffies_till_flush;
+		break;
+
+	case RCU_NOCB_WAKE_BYPASS:
+		mod_jif = 2;
+		break;
+
+	case RCU_NOCB_WAKE:
+	case RCU_NOCB_WAKE_FORCE:
+		// For these, make it wake up the soonest if we
+		// were in a bypass or lazy sleep before.
 		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
-			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
-		if (rdp_gp->nocb_defer_wakeup < waketype)
-			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+			mod_jif = 1;
+		break;
 	}
 
+	if (mod_jif)
+		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
+
+	if (rdp_gp->nocb_defer_wakeup < waketype)
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+
 	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
@@ -293,10 +334,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * proves to be initially empty, just return false because the no-CB GP
  * kthread may need to be awakened in this case.
  *
+ * Return true if there was something to be flushed and it succeeded, otherwise
+ * false.
+ *
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, bool lazy)
 {
 	struct rcu_cblist rcl;
 
@@ -310,7 +354,18 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/*
+	 * If the new CB requested was a lazy one, queue it onto the main
+	 * ->cblist so we can take advantage of a sooner grace period.
+	 */
+	if (lazy && rhp) {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
+		rcu_cblist_enqueue(&rcl, rhp);
+	} else {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+	}
+
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
 	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +381,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy, bool wake_gp)
 {
+	bool ret;
+
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
+
+	if (wake_gp)
+		wake_nocb_gp(rdp, true);
+
+	return ret;
 }
 
 /*
@@ -345,7 +407,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
 }
 
 /*
@@ -367,12 +429,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	unsigned long c;
 	unsigned long cur_gp_seq;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
 
 	lockdep_assert_irqs_disabled();
 
@@ -417,23 +481,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// If there hasn't yet been all that many ->cblist enqueues
 	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
 	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// Lazy CBs throttle this back and do immediate bypass queuing.
+	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 		if (*was_alldone)
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false, false));
+
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
 
 	// If ->nocb_bypass has been used too long or is too full,
 	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	    (ncbs &&  bypass_is_lazy &&
+		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
 			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
@@ -460,16 +530,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// We need to use the bypass.
 	rcu_nocb_wait_contended(rdp);
 	rcu_nocb_bypass_lock(rdp);
+
 	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
 	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+
+	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
+		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
+
 	if (!ncbs) {
 		WRITE_ONCE(rdp->nocb_bypass_first, j);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
 	}
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
-	if (ncbs) {
+
+	// We had CBs in the bypass list before. There is nothing else to do if:
+	// (1) There were only non-lazy CBs before; in this case the bypass timer
+	// or GP thread will handle the CBs, including any new lazy ones.
+	// (2) The new CB is lazy and the old bypass CBs were also lazy; in this
+	// case the old lazy timer was already set up, and when it expires
+	// the new lazy one will be handled.
+	if (ncbs && (!bypass_is_lazy || lazy)) {
 		local_irq_restore(flags);
 	} else {
 		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
@@ -478,6 +561,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstBQwake"));
 			__call_rcu_nocb_wake(rdp, true, flags);
+		} else if (bypass_is_lazy && !lazy) {
+			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+					    TPS("FirstBQwakeLazy2Non"));
+			__call_rcu_nocb_wake(rdp, true, flags);
 		} else {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstBQnoWake"));
@@ -499,7 +586,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 {
 	unsigned long cur_gp_seq;
 	unsigned long j;
-	long len;
+	long len, lazy_len, bypass_len;
 	struct task_struct *t;
 
 	// If we are being polled or there is no kthread, just leave.
@@ -512,9 +599,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 	}
 	// Need to actually to a wakeup.
 	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = READ_ONCE(rdp->lazy_len);
 	if (was_alldone) {
 		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
 			/* ... if queue was empty ... */
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp(rdp, false);
@@ -604,8 +698,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu)
  */
 static void nocb_gp_wait(struct rcu_data *my_rdp)
 {
-	bool bypass = false;
-	long bypass_ncbs;
+	bool bypass = false, lazy = false;
+	long bypass_ncbs, lazy_ncbs;
 	int __maybe_unused cpu = my_rdp->cpu;
 	unsigned long cur_gp_seq;
 	unsigned long flags;
@@ -640,24 +734,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 * won't be ignored for long.
 	 */
 	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
+		bool flush_bypass = false;
+
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
 		rcu_nocb_lock_irqsave(rdp, flags);
 		lockdep_assert_held(&rdp->nocb_lock);
 		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = READ_ONCE(rdp->lazy_len);
+
+		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
 		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
 		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
 		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			continue; /* No callbacks here, try next. */
 		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = READ_ONCE(rdp->lazy_len);
+		}
+
 		if (bypass_ncbs) {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
 		}
 		rnp = rdp->mynode;
 
@@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If the bypass list only has lazy CBs, add a deferred
+		// lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+					TPS("WakeBypassIsDeferred"));
+		}
 	}
+
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
 		if (gotcbs)
@@ -1036,7 +1156,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
 	 */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
 	/*
 	 * Start with invoking rcu_core() early. This way if the current thread
 	 * happens to preempt an ongoing call to rcu_core() in the middle,
@@ -1290,6 +1410,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 	raw_spin_lock_init(&rdp->nocb_gp_lock);
 	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
 	rcu_cblist_init(&rdp->nocb_bypass);
+	WRITE_ONCE(rdp->lazy_len, 0);
 	mutex_init(&rdp->nocb_gp_kthread_mutex);
 }
 
@@ -1571,13 +1692,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, bool lazy, bool wakegp)
 {
 	return true;
 }
 
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	return false;
 }
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 07/18] rcu: shrinker for lazy rcu
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (5 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation Joel Fernandes (Google)
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes

From: Vineeth Pillai <vineeth@bitbyteword.org>

The shrinker is used to speed up the freeing of memory potentially held
by RCU lazy callbacks. RCU kernel module test cases show this to be
effective. The test is introduced in a later patch.
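
For readers less familiar with the shrinker API, the hooks below follow the
usual contract (this is just an annotated restatement of the diff, using its
names): ->count_objects() returns an estimate of the number of freeable
objects, or SHRINK_EMPTY if there are none, and ->scan_objects() "frees" up to
sc->nr_to_scan of them and returns how many were reclaimed, or SHRINK_STOP.
Here, freeing means zeroing the per-CPU ->lazy_len counts and waking the GP
kthread so that the buffered lazy callbacks are invoked sooner:

static struct shrinker lazy_rcu_shrinker = {
	.count_objects	= lazy_rcu_shrink_count,	/* sums per-CPU ->lazy_len */
	.scan_objects	= lazy_rcu_shrink_scan,		/* resets ->lazy_len, wakes GP kthread */
	.batch		= 0,
	.seeks		= DEFAULT_SEEKS,
};

The shrinker is registered from rcu_init_nohz() via
register_shrinker(&lazy_rcu_shrinker, "rcu-lazy").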

Signed-off-by: Vineeth Pillai <vineeth@bitbyteword.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 52 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 52 insertions(+)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 7e97a7b6e046..560ba87911c5 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -1333,6 +1333,55 @@ int rcu_nocb_cpu_offload(int cpu)
 }
 EXPORT_SYMBOL_GPL(rcu_nocb_cpu_offload);
 
+static unsigned long
+lazy_rcu_shrink_count(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+
+		count +=  READ_ONCE(rdp->lazy_len);
+	}
+
+	return count ? count : SHRINK_EMPTY;
+}
+
+static unsigned long
+lazy_rcu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
+{
+	int cpu;
+	unsigned long flags;
+	unsigned long count = 0;
+
+	/* Snapshot count of all CPUs */
+	for_each_possible_cpu(cpu) {
+		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);
+		int _count = READ_ONCE(rdp->lazy_len);
+
+		if (_count == 0)
+			continue;
+		rcu_nocb_lock_irqsave(rdp, flags);
+		WRITE_ONCE(rdp->lazy_len, 0);
+		rcu_nocb_unlock_irqrestore(rdp, flags);
+		wake_nocb_gp(rdp, false);
+		sc->nr_to_scan -= _count;
+		count += _count;
+		if (sc->nr_to_scan <= 0)
+			break;
+	}
+	return count ? count : SHRINK_STOP;
+}
+
+static struct shrinker lazy_rcu_shrinker = {
+	.count_objects = lazy_rcu_shrink_count,
+	.scan_objects = lazy_rcu_shrink_scan,
+	.batch = 0,
+	.seeks = DEFAULT_SEEKS,
+};
+
 void __init rcu_init_nohz(void)
 {
 	int cpu;
@@ -1367,6 +1416,9 @@ void __init rcu_init_nohz(void)
 	if (!rcu_state.nocb_is_setup)
 		return;
 
+	if (register_shrinker(&lazy_rcu_shrinker, "rcu-lazy"))
+		pr_err("Failed to register lazy_rcu shrinker!\n");
+
 #if defined(CONFIG_NO_HZ_FULL)
 	if (tick_nohz_full_running)
 		cpumask_or(rcu_nocb_mask, rcu_nocb_mask, tick_nohz_full_mask);
-- 
2.37.2.789.g6183377224-goog



* [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (6 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 07/18] rcu: shrinker for lazy rcu Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-02 16:48   ` kernel test robot
  2022-09-02 19:01   ` kernel test robot
  2022-09-01 22:17 ` [PATCH v5 09/18] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is very useful for debugging whether call_rcu_lazy() is working
correctly. In the future, BPF tools could also be modified to read the
new information.

The additional tracing is enabled by a new CONFIG_RCU_TRACE_CB option
and is kept disabled by default.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/types.h      | 44 ++++++++++++++++++++++++
 include/trace/events/rcu.h | 69 +++++++++++++++++++++++++++++++++++---
 kernel/rcu/Kconfig         | 11 ++++++
 kernel/rcu/rcu_segcblist.c | 21 ++++++++++++
 kernel/rcu/rcu_segcblist.h |  8 +++++
 kernel/rcu/tree.c          | 46 +++++++++++++++++++++++--
 kernel/rcu/tree_nocb.h     | 26 +++++++++++++-
 7 files changed, 217 insertions(+), 8 deletions(-)

diff --git a/include/linux/types.h b/include/linux/types.h
index ea8cf60a8a79..47501fdfd3c1 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -198,10 +198,51 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+#ifdef CONFIG_RCU_TRACE_CB
+/*
+ * Debug information that a caller can store within a callback_head.
+ * It's expected to provide at least 12 bytes before BUILD_BUG starts
+ * complaining.
+ */
+enum cb_debug_flags {
+	CB_DEBUG_KFREE,
+	CB_DEBUG_LAZY,
+	CB_DEBUG_BYPASS,
+
+	// A new non-lazy CB showed up and we decided to not use
+	// the bypass list for it. So we flushed the old ones.
+	CB_DEBUG_NON_LAZY_FLUSHED,
+
+	// We decided to use the bypass list but had to flush
+	// the old bypass CBs because they got too old or too big.
+	CB_DEBUG_BYPASS_FLUSHED,
+	CB_DEBUG_BYPASS_LAZY_FLUSHED,
+
+	// The GP thread flushed the bypass CBs if they got old or big.
+	CB_DEBUG_GPTHREAD_FLUSHED,
+
+	// De-offload from NOCB mode.
+	CB_DEBUG_DEOFFLOAD_FLUSHED,
+
+	// rcu_barrier() flushes lazy/bypass CBs for CB exec ordering.
+	CB_DEBUG_BARRIER_FLUSHED
+};
+
+struct cb_debug_info {
+	// 16-bit jiffy deltas can provide about 60 seconds of resolution
+	// at HZ=1000 before wrapping. That's enough for debug.
+	u16 cb_queue_jiff;
+	u16 first_bp_jiff;
+	u16 cb_flush_jiff;
+	enum cb_debug_flags flags:16;
+};
+#endif
+
 /**
  * struct callback_head - callback structure for use with RCU and task_work
  * @next: next update requests in a list
  * @func: actual update function to call after the grace period.
+ * @di: debug information that can be stored.
  *
  * The struct is aligned to size of pointer. On most architectures it happens
  * naturally due ABI requirements, but some architectures (like CRIS) have
@@ -220,6 +261,9 @@ struct ustat {
 struct callback_head {
 	struct callback_head *next;
 	void (*func)(struct callback_head *head);
+#ifdef CONFIG_RCU_TRACE_CB
+	struct cb_debug_info di;
+#endif
 } __attribute__((aligned(sizeof(void *))));
 #define rcu_head callback_head
 
diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 90b2fb0292cb..dd6baec6745c 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -630,24 +630,85 @@ TRACE_EVENT_RCU(rcu_batch_start,
  */
 TRACE_EVENT_RCU(rcu_invoke_callback,
 
-	TP_PROTO(const char *rcuname, struct rcu_head *rhp),
+	TP_PROTO(const char *rcuname, struct rcu_head *rhp
+#ifdef CONFIG_RCU_TRACE_CB
+		, unsigned long jiffies_first
+#endif
+	),
 
-	TP_ARGS(rcuname, rhp),
+	TP_ARGS(rcuname, rhp
+#ifdef CONFIG_RCU_TRACE_CB
+		, jiffies_first
+#endif
+		),
 
 	TP_STRUCT__entry(
 		__field(const char *, rcuname)
 		__field(void *, rhp)
 		__field(void *, func)
+#ifdef CONFIG_RCU_TRACE_CB
+		__field(u16, cb_debug_flags)
+		__field(int, cb_queue_exec_latency)
+		__array(char, bypass_flush_reason, 32)
+		__field(int, cb_bypass_slack)
+		__field(int, cb_queue_flush_latency)
+#endif
 	),
 
 	TP_fast_assign(
 		__entry->rcuname = rcuname;
 		__entry->rhp = rhp;
 		__entry->func = rhp->func;
+#ifdef CONFIG_RCU_TRACE_CB
+		__entry->cb_debug_flags = rhp->di.flags;
+		__entry->cb_queue_exec_latency =
+			(jiffies - jiffies_first) - rhp->di.cb_queue_jiff;
+
+		/* The following 3 fields are valid only for bypass/lazy CBs. */
+		__entry->cb_bypass_slack =
+			(rhp->di.flags & BIT(CB_DEBUG_BYPASS)) ?
+				rhp->di.cb_queue_jiff - rhp->di.first_bp_jiff : 0;
+		__entry->cb_queue_flush_latency =
+			(rhp->di.flags & BIT(CB_DEBUG_BYPASS)) ?
+				rhp->di.cb_flush_jiff - rhp->di.cb_queue_jiff : 0;
+
+
+		if (__entry->cb_debug_flags & BIT(CB_DEBUG_NON_LAZY_FLUSHED))
+			strcpy(__entry->bypass_flush_reason, "non-lazy");
+		else if (__entry->cb_debug_flags & BIT(CB_DEBUG_BYPASS_FLUSHED))
+			strcpy(__entry->bypass_flush_reason, "bypass");
+		else if (__entry->cb_debug_flags & BIT(CB_DEBUG_BYPASS_LAZY_FLUSHED))
+			strcpy(__entry->bypass_flush_reason, "bypass-lazy");
+		else if (__entry->cb_debug_flags & BIT(CB_DEBUG_GPTHREAD_FLUSHED))
+			strcpy(__entry->bypass_flush_reason, "gpthread");
+		else if (__entry->cb_debug_flags & BIT(CB_DEBUG_BARRIER_FLUSHED))
+			strcpy(__entry->bypass_flush_reason, "rcu_barrier");
+		else if (__entry->cb_debug_flags & BIT(CB_DEBUG_DEOFFLOAD_FLUSHED))
+			strcpy(__entry->bypass_flush_reason, "deoffload");
+#endif
 	),
 
-	TP_printk("%s rhp=%p func=%ps",
-		  __entry->rcuname, __entry->rhp, __entry->func)
+	TP_printk("%s rhp=%p func=%ps"
+#ifdef CONFIG_RCU_TRACE_CB
+			" lazy=%d "
+			"bypass=%d "
+			"flush_reason=%s "
+			"queue_exec_latency=%d "
+			"bypass_slack=%d "
+			"queue_flush_latency=%d"
+#endif
+			,
+		  __entry->rcuname, __entry->rhp, __entry->func
+#ifdef CONFIG_RCU_TRACE_CB
+		,
+		!!(__entry->cb_debug_flags & BIT(CB_DEBUG_LAZY)),
+		!!(__entry->cb_debug_flags & BIT(CB_DEBUG_BYPASS)),
+		__entry->bypass_flush_reason,
+		__entry->cb_queue_exec_latency,
+		__entry->cb_bypass_slack,
+		__entry->cb_queue_flush_latency
+#endif
+	 )
 );
 
 /*
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index 3128d01427cb..cd6ff63517f4 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -319,4 +319,15 @@ config RCU_LAZY
 	  To save power, batch RCU callbacks and flush after delay, memory
 	  pressure or callback list growing too big.
 
+config RCU_TRACE_CB
+	bool "Enable additional callback tracing"
+	depends on RCU_LAZY
+	default n
+	help
+	  Enable additional callback tracing for RCU. Currently, only
+	  tracing for lazy/bypass callbacks is done. In the future, it could be
+	  extended to non-lazy callbacks as well. The tracepoint contains
+	  detailed information about the reason for callback flushing and the
+	  queuing, flush and execution latencies. It can be read by BPF tools.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
index 55b50e592986..8f0b024f7bc5 100644
--- a/kernel/rcu/rcu_segcblist.c
+++ b/kernel/rcu/rcu_segcblist.c
@@ -32,6 +32,27 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
 	WRITE_ONCE(rclp->len, rclp->len + 1);
 }
 
+/*
+ * Set a debug flag and record the flush time on all callbacks in the
+ * unsegmented cblist @rcl. This is called just before the bypass list
+ * is flushed, whichever path triggers the flush.
+ */
+#ifdef CONFIG_RCU_TRACE_CB
+void rcu_cblist_set_flush(struct rcu_cblist *rcl,
+			 enum cb_debug_flags flags,
+			 unsigned long flush_jiff)
+{
+	if (!rcl || !rcl->head)
+		return;
+
+	for (struct rcu_head *head = rcl->head;
+			head && head != *(rcl->tail); head = head->next) {
+		head->di.flags |= flags;
+		head->di.cb_flush_jiff = flush_jiff;
+	}
+}
+#endif
+
 /*
  * Flush the second rcu_cblist structure onto the first one, obliterating
  * any contents of the first.  If rhp is non-NULL, enqueue it as the sole
diff --git a/kernel/rcu/rcu_segcblist.h b/kernel/rcu/rcu_segcblist.h
index 431cee212467..ac1f34024b84 100644
--- a/kernel/rcu/rcu_segcblist.h
+++ b/kernel/rcu/rcu_segcblist.h
@@ -53,6 +53,14 @@ static inline long rcu_segcblist_n_cbs(struct rcu_segcblist *rsclp)
 #endif
 }
 
+#ifdef CONFIG_RCU_TRACE_CB
+void rcu_cblist_set_flush(struct rcu_cblist *rcl,
+			 enum cb_debug_flags flags,
+			 unsigned long flush_jiff);
+#else
+#define rcu_cblist_set_flush(...)
+#endif
+
 static inline void rcu_segcblist_set_flags(struct rcu_segcblist *rsclp,
 					   int flags)
 {
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index aaced29a0a71..8111d9f37621 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2179,6 +2179,12 @@ int rcutree_dead_cpu(unsigned int cpu)
 	return 0;
 }
 
+static unsigned long cb_debug_jiffies_first;
+
+#if defined(CONFIG_RCU_TRACE_CB) && defined(CONFIG_RCU_LAZY)
+extern unsigned long jiffies_till_flush;
+#endif
+
 /*
  * Invoke any RCU callbacks that have made it to the end of their grace
  * period.  Throttle as specified by rdp->blimit.
@@ -2241,7 +2247,12 @@ static void rcu_do_batch(struct rcu_data *rdp)
 		debug_rcu_head_unqueue(rhp);
 
 		rcu_lock_acquire(&rcu_callback_map);
-		trace_rcu_invoke_callback(rcu_state.name, rhp);
+
+		trace_rcu_invoke_callback(rcu_state.name, rhp
+#ifdef CONFIG_RCU_TRACE_CB
+			, READ_ONCE(cb_debug_jiffies_first)
+#endif
+		);
 
 		f = rhp->func;
 		WRITE_ONCE(rhp->func, (rcu_callback_t)0L);
@@ -2736,6 +2747,17 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 	struct rcu_data *rdp;
 	bool was_alldone;
 
+	/*
+	 * For CB debugging, record the starting reference to jiffies while
+	 * making sure it does not wrap around. The starting reference is
+	 * used to calculate 16-bit jiffie deltas. To be safe, reset the
+	 * reference once the delta exceeds 15-bits worth.
+	 */
+	if (IS_ENABLED(CONFIG_RCU_TRACE_CB) &&
+		(!READ_ONCE(cb_debug_jiffies_first) ||
+		 (jiffies - READ_ONCE(cb_debug_jiffies_first) > (1 << 15))))
+		WRITE_ONCE(cb_debug_jiffies_first, jiffies);
+
 	/* Misaligned rcu_head! */
 	WARN_ON_ONCE((unsigned long)head & (sizeof(void *) - 1));
 
@@ -2771,14 +2793,29 @@ __call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 
 	check_cb_ovld(rdp);
 
-	if (__is_kvfree_rcu_offset((unsigned long)func))
+#ifdef CONFIG_RCU_TRACE_CB
+	head->di.flags = 0;
+	WARN_ON_ONCE(jiffies - READ_ONCE(cb_debug_jiffies_first) > 65535);
+	head->di.cb_queue_jiff = (u16)(jiffies - READ_ONCE(cb_debug_jiffies_first));
+#endif
+
+	if (__is_kvfree_rcu_offset((unsigned long)func)) {
 		trace_rcu_kvfree_callback(rcu_state.name, head,
 					 (unsigned long)func,
 					 rcu_segcblist_n_cbs(&rdp->cblist));
-	else
+#ifdef CONFIG_RCU_TRACE_CB
+		head->di.flags |= BIT(CB_DEBUG_KFREE);
+#endif
+	} else {
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
+#ifdef CONFIG_RCU_TRACE_CB
+		if (lazy)
+			head->di.flags |= BIT(CB_DEBUG_LAZY);
+#endif
+	}
+
 	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
@@ -3943,6 +3980,9 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
+
+	rcu_cblist_set_flush(&rdp->nocb_bypass, BIT(CB_DEBUG_BARRIER_FLUSHED),
+			(jiffies - READ_ONCE(cb_debug_jiffies_first)));
 	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
 		     /* wake gp thread */ true));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 560ba87911c5..ee5924ba2f3b 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -355,6 +355,11 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
 
+	/* A new CB enqueued as part of the flush must match the @lazy argument. */
+#ifdef CONFIG_RCU_TRACE_CB
+	WARN_ON_ONCE(rhp && lazy != !!(rhp->di.flags & BIT(CB_DEBUG_LAZY)));
+#endif
+
 	/*
 	 * If the new CB requested was a lazy one, queue it onto the main
 	 * ->cblist so we can take advantage of a sooner grace period.
@@ -407,6 +412,10 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
+
+	rcu_cblist_set_flush(&rdp->nocb_bypass, BIT(CB_DEBUG_GPTHREAD_FLUSHED),
+			     (j - READ_ONCE(cb_debug_jiffies_first)));
+
 	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
 }
 
@@ -489,8 +498,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstQ"));
 
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false, false));
+		rcu_cblist_set_flush(&rdp->nocb_bypass, BIT(CB_DEBUG_NON_LAZY_FLUSHED),
+				     (j - READ_ONCE(cb_debug_jiffies_first)));
 
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false, false));
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
@@ -503,6 +514,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
 
+		rcu_cblist_set_flush(&rdp->nocb_bypass,
+				lazy ? BIT(CB_DEBUG_BYPASS_LAZY_FLUSHED) : BIT(CB_DEBUG_BYPASS_FLUSHED),
+				(j - READ_ONCE(cb_debug_jiffies_first)));
+
 		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
 			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 			if (*was_alldone)
@@ -543,6 +558,11 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
 	}
 
+#ifdef CONFIG_RCU_TRACE_CB
+	rhp->di.flags |= BIT(CB_DEBUG_BYPASS);
+	rhp->di.first_bp_jiff = READ_ONCE(rdp->nocb_bypass_first) - READ_ONCE(cb_debug_jiffies_first);
+#endif
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
 
@@ -1156,6 +1176,10 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
 	 */
+
+	rcu_cblist_set_flush(&rdp->nocb_bypass, BIT(CB_DEBUG_DEOFFLOAD_FLUSHED),
+			(jiffies - READ_ONCE(cb_debug_jiffies_first)));
+
 	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
 	/*
 	 * Start with invoking rcu_core() early. This way if the current thread
-- 
2.37.2.789.g6183377224-goog
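
For readers decoding these new tracepoint fields, below is a minimal
standalone sketch (plain userspace C, not kernel code) of how the 16-bit
jiffies deltas are meant to be read back. The variable and field names
mirror the patch; everything else is an assumption made purely for
illustration.

	#include <stdio.h>
	#include <stdint.h>

	static unsigned long jiffies;			/* stand-in for the kernel's jiffies counter */
	static unsigned long cb_debug_jiffies_first;	/* reference, re-armed before 16 bits can wrap */

	static uint16_t cb_delta(void)
	{
		if (!cb_debug_jiffies_first ||
		    jiffies - cb_debug_jiffies_first > (1UL << 15))
			cb_debug_jiffies_first = jiffies;	/* reset reference, as __call_rcu_common() does */
		return (uint16_t)(jiffies - cb_debug_jiffies_first);
	}

	int main(void)
	{
		uint16_t cb_queue_jiff, cb_flush_jiff;

		jiffies = 1000; cb_queue_jiff = cb_delta();	/* callback queued */
		jiffies = 1250; cb_flush_jiff = cb_delta();	/* bypass list flushed */
		jiffies = 1300;					/* callback invoked */

		/* Matches cb_queue_flush_latency in the rcu_invoke_callback tracepoint. */
		printf("queue->flush latency: %u jiffies\n",
		       (unsigned int)(cb_flush_jiff - cb_queue_jiff));
		/* Matches cb_queue_exec_latency in the rcu_invoke_callback tracepoint. */
		printf("queue->exec latency: %lu jiffies\n",
		       (jiffies - cb_debug_jiffies_first) - cb_queue_jiff);
		return 0;
	}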


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 09/18] rcuscale: Add laziness and kfree tests
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (7 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 10/18] rcutorture: Add test code for call_rcu_lazy() Joel Fernandes (Google)
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

We add 2 tests to rcuscale. The first is a start-up test that checks
whether we are neither too lazy nor too eager. The second emulates
kfree_rcu() on top of call_rcu_lazy() and checks memory pressure. In my
testing, call_rcu_lazy() does well at keeping memory pressure under
control, similar to kfree_rcu().

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcuscale.c | 74 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 73 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcuscale.c b/kernel/rcu/rcuscale.c
index 3ef02d4a8108..6808f2e14385 100644
--- a/kernel/rcu/rcuscale.c
+++ b/kernel/rcu/rcuscale.c
@@ -95,6 +95,7 @@ torture_param(int, verbose, 1, "Enable verbose debugging printk()s");
 torture_param(int, writer_holdoff, 0, "Holdoff (us) between GPs, zero to disable");
 torture_param(int, kfree_rcu_test, 0, "Do we run a kfree_rcu() scale test?");
 torture_param(int, kfree_mult, 1, "Multiple of kfree_obj size to allocate.");
+torture_param(int, kfree_rcu_by_lazy, 0, "Use call_rcu_lazy() to emulate kfree_rcu()?");
 
 static char *scale_type = "rcu";
 module_param(scale_type, charp, 0444);
@@ -659,6 +660,14 @@ struct kfree_obj {
 	struct rcu_head rh;
 };
 
+/* Used if doing RCU-kfree'ing via call_rcu_lazy(). */
+static void kfree_rcu_lazy(struct rcu_head *rh)
+{
+	struct kfree_obj *obj = container_of(rh, struct kfree_obj, rh);
+
+	kfree(obj);
+}
+
 static int
 kfree_scale_thread(void *arg)
 {
@@ -696,6 +705,11 @@ kfree_scale_thread(void *arg)
 			if (!alloc_ptr)
 				return -ENOMEM;
 
+			if (kfree_rcu_by_lazy) {
+				call_rcu_lazy(&(alloc_ptr->rh), kfree_rcu_lazy);
+				continue;
+			}
+
 			// By default kfree_rcu_test_single and kfree_rcu_test_double are
 			// initialized to false. If both have the same value (false or true)
 			// both are randomly tested, otherwise only the one with value true
@@ -738,6 +752,9 @@ kfree_scale_cleanup(void)
 {
 	int i;
 
+	if (kfree_rcu_by_lazy)
+		rcu_force_call_rcu_to_lazy(false);
+
 	if (torture_cleanup_begin())
 		return;
 
@@ -767,11 +784,64 @@ kfree_scale_shutdown(void *arg)
 	return -EINVAL;
 }
 
+// Used if doing RCU-kfree'ing via call_rcu_lazy().
+static unsigned long jiffies_at_lazy_cb;
+static struct rcu_head lazy_test1_rh;
+static int rcu_lazy_test1_cb_called;
+static void call_rcu_lazy_test1(struct rcu_head *rh)
+{
+	jiffies_at_lazy_cb = jiffies;
+	WRITE_ONCE(rcu_lazy_test1_cb_called, 1);
+}
+
 static int __init
 kfree_scale_init(void)
 {
 	long i;
 	int firsterr = 0;
+	unsigned long orig_jif, jif_start;
+
+	// If lazy-rcu based kfree'ing is requested, then for kernels that
+	// support it, force all call_rcu() to call_rcu_lazy() so that non-lazy
+	// CBs do not remove laziness of the lazy ones (since the test tries to
+	// stress call_rcu_lazy() for OOM).
+	//
+	// Also, do a quick self-test to ensure laziness is as much as
+	// expected.
+	if (kfree_rcu_by_lazy && !IS_ENABLED(CONFIG_RCU_LAZY)) {
+		pr_alert("CONFIG_RCU_LAZY is disabled, falling back to kfree_rcu() "
+			 "for delayed RCU kfree'ing\n");
+		kfree_rcu_by_lazy = 0;
+	}
+
+	if (kfree_rcu_by_lazy) {
+		/* do a test to check the timeout. */
+		orig_jif = rcu_lazy_get_jiffies_till_flush();
+
+		rcu_force_call_rcu_to_lazy(true);
+		rcu_lazy_set_jiffies_till_flush(2 * HZ);
+		rcu_barrier();
+
+		jif_start = jiffies;
+		jiffies_at_lazy_cb = 0;
+		call_rcu_lazy(&lazy_test1_rh, call_rcu_lazy_test1);
+
+		smp_cond_load_relaxed(&rcu_lazy_test1_cb_called, VAL == 1);
+
+		rcu_lazy_set_jiffies_till_flush(orig_jif);
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start < 2 * HZ)) {
+			pr_alert("ERROR: Lazy CBs are not being lazy as expected!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+
+		if (WARN_ON_ONCE(jiffies_at_lazy_cb - jif_start > 3 * HZ)) {
+			pr_alert("ERROR: Lazy CBs are being too lazy!\n");
+			WARN_ON_ONCE(1);
+			return -1;
+		}
+	}
 
 	kfree_nrealthreads = compute_real(kfree_nthreads);
 	/* Start up the kthreads. */
@@ -784,7 +854,9 @@ kfree_scale_init(void)
 		schedule_timeout_uninterruptible(1);
 	}
 
-	pr_alert("kfree object size=%zu\n", kfree_mult * sizeof(struct kfree_obj));
+	pr_alert("kfree object size=%zu, kfree_rcu_by_lazy=%d\n",
+			kfree_mult * sizeof(struct kfree_obj),
+			kfree_rcu_by_lazy);
 
 	kfree_reader_tasks = kcalloc(kfree_nrealthreads, sizeof(kfree_reader_tasks[0]),
 			       GFP_KERNEL);
-- 
2.37.2.789.g6183377224-goog
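
For reference, assuming rcuscale is built into the kernel, the lazy kfree
emulation above could be exercised with boot parameters along these lines
(the exact values are illustrative, not part of the patch):

	rcuscale.kfree_rcu_test=1 rcuscale.kfree_mult=2 rcuscale.kfree_rcu_by_lazy=1

With kfree_rcu_by_lazy=1, the init-time self-test also verifies that the
lazy callback fires no earlier than 2*HZ and no later than 3*HZ after the
flush timeout is set to 2*HZ.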


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 10/18] rcutorture: Add test code for call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (8 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 09/18] rcuscale: Add laziness and kfree tests Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 11/18] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

We add a new RCU torture type to test call_rcu_lazy(), which allows us to
simply override the '.call' callback. To compensate for the laziness, the
flush timeout is forced down to a small number of jiffies. The idea of
this test is to stress the new code paths for stability and to ensure the
behavior is on par with, or at least similar to, call_rcu(). The actual
check for the amount of laziness is in another test (rcuscale).

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/rcu.h                              |  1 +
 kernel/rcu/rcutorture.c                       | 60 ++++++++++++++++++-
 kernel/rcu/tree.c                             |  1 +
 .../selftests/rcutorture/configs/rcu/CFLIST   |  1 +
 .../selftests/rcutorture/configs/rcu/TREE11   | 18 ++++++
 .../rcutorture/configs/rcu/TREE11.boot        |  8 +++
 6 files changed, 88 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
 create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 94675f14efe8..a6efb831673f 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -471,6 +471,7 @@ enum rcutorture_type {
 	RCU_TASKS_TRACING_FLAVOR,
 	RCU_TRIVIAL_FLAVOR,
 	SRCU_FLAVOR,
+	RCU_LAZY_FLAVOR,
 	INVALID_RCU_FLAVOR
 };
 
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 9ad5301385a4..b4114f29ab61 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -927,6 +927,64 @@ static struct rcu_torture_ops tasks_rude_ops = {
 
 #endif // #else #ifdef CONFIG_TASKS_RUDE_RCU
 
+#ifdef CONFIG_RCU_LAZY
+
+/*
+ * Definitions for lazy RCU torture testing.
+ */
+static unsigned long orig_jiffies_till_flush;
+
+static void rcu_sync_torture_init_lazy(void)
+{
+	rcu_sync_torture_init();
+
+	orig_jiffies_till_flush = rcu_lazy_get_jiffies_till_flush();
+	rcu_lazy_set_jiffies_till_flush(50);
+}
+
+static void rcu_lazy_cleanup(void)
+{
+	rcu_lazy_set_jiffies_till_flush(orig_jiffies_till_flush);
+}
+
+static struct rcu_torture_ops rcu_lazy_ops = {
+	.ttype			= RCU_LAZY_FLAVOR,
+	.init			= rcu_sync_torture_init_lazy,
+	.cleanup		= rcu_lazy_cleanup,
+	.readlock		= rcu_torture_read_lock,
+	.read_delay		= rcu_read_delay,
+	.readunlock		= rcu_torture_read_unlock,
+	.readlock_held		= torture_readlock_not_held,
+	.get_gp_seq		= rcu_get_gp_seq,
+	.gp_diff		= rcu_seq_diff,
+	.deferred_free		= rcu_torture_deferred_free,
+	.sync			= synchronize_rcu,
+	.exp_sync		= synchronize_rcu_expedited,
+	.get_gp_state		= get_state_synchronize_rcu,
+	.start_gp_poll		= start_poll_synchronize_rcu,
+	.poll_gp_state		= poll_state_synchronize_rcu,
+	.cond_sync		= cond_synchronize_rcu,
+	.call			= call_rcu_lazy,
+	.cb_barrier		= rcu_barrier,
+	.fqs			= rcu_force_quiescent_state,
+	.stats			= NULL,
+	.gp_kthread_dbg		= show_rcu_gp_kthreads,
+	.check_boost_failed	= rcu_check_boost_fail,
+	.stall_dur		= rcu_jiffies_till_stall_check,
+	.irq_capable		= 1,
+	.can_boost		= IS_ENABLED(CONFIG_RCU_BOOST),
+	.extendables		= RCUTORTURE_MAX_EXTEND,
+	.name			= "rcu_lazy"
+};
+
+#define LAZY_OPS &rcu_lazy_ops,
+
+#else // #ifdef CONFIG_RCU_LAZY
+
+#define LAZY_OPS
+
+#endif // #else #ifdef CONFIG_RCU_LAZY
+
 
 #ifdef CONFIG_TASKS_TRACE_RCU
 
@@ -3466,7 +3524,7 @@ rcu_torture_init(void)
 	unsigned long gp_seq = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
 		&rcu_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops, &busted_srcud_ops,
-		TASKS_OPS TASKS_RUDE_OPS TASKS_TRACING_OPS
+		TASKS_OPS TASKS_RUDE_OPS TASKS_TRACING_OPS LAZY_OPS
 		&trivial_ops,
 	};
 
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8111d9f37621..08bafe13c3ad 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -537,6 +537,7 @@ void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 {
 	switch (test_type) {
 	case RCU_FLAVOR:
+	case RCU_LAZY_FLAVOR:
 		*flags = READ_ONCE(rcu_state.gp_flags);
 		*gp_seq = rcu_seq_current(&rcu_state.gp_seq);
 		break;
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
index 98b6175e5aa0..609c3370616f 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
@@ -5,6 +5,7 @@ TREE04
 TREE05
 TREE07
 TREE09
+TREE11
 SRCU-N
 SRCU-P
 SRCU-T
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11 b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
new file mode 100644
index 000000000000..436013f3e015
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11
@@ -0,0 +1,18 @@
+CONFIG_SMP=y
+CONFIG_PREEMPT_NONE=n
+CONFIG_PREEMPT_VOLUNTARY=n
+CONFIG_PREEMPT=y
+#CHECK#CONFIG_PREEMPT_RCU=y
+CONFIG_HZ_PERIODIC=n
+CONFIG_NO_HZ_IDLE=y
+CONFIG_NO_HZ_FULL=n
+CONFIG_RCU_TRACE=y
+CONFIG_HOTPLUG_CPU=y
+CONFIG_MAXSMP=y
+CONFIG_CPUMASK_OFFSTACK=y
+CONFIG_RCU_NOCB_CPU=y
+CONFIG_DEBUG_LOCK_ALLOC=n
+CONFIG_RCU_BOOST=n
+CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
+CONFIG_RCU_EXPERT=y
+CONFIG_RCU_LAZY=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
new file mode 100644
index 000000000000..9b6f720d4ccd
--- /dev/null
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
@@ -0,0 +1,8 @@
+maxcpus=8 nr_cpus=43
+rcutree.gp_preinit_delay=3
+rcutree.gp_init_delay=3
+rcutree.gp_cleanup_delay=3
+rcu_nocbs=0-7
+rcutorture.torture_type=rcu_lazy
+rcutorture.nocbs_nthreads=8
+rcutorture.fwd_progress=0
-- 
2.37.2.789.g6183377224-goog
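
As a usage note (illustrative, not prescribed by the patch), the new
scenario can be run through the usual rcutorture scripting, e.g.:

	tools/testing/selftests/rcutorture/bin/kvm.sh --configs TREE11 --duration 10

which picks up the TREE11 Kconfig fragment and TREE11.boot parameters
added above.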


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 11/18] fs: Move call_rcu() to call_rcu_lazy() in some paths
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (9 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 10/18] rcutorture: Add test code for call_rcu_lazy() Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 12/18] cred: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery
too quickly and too often, which increases system power consumption.

When testing, we found that these paths were invoked often when the
system is not doing anything (screen is ON but otherwise idle).

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 fs/dcache.c     | 4 ++--
 fs/eventpoll.c  | 2 +-
 fs/file_table.c | 2 +-
 fs/inode.c      | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c5dc32a59c76..28a159a7e460 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -366,7 +366,7 @@ static void dentry_free(struct dentry *dentry)
 	if (unlikely(dname_external(dentry))) {
 		struct external_name *p = external_name(dentry);
 		if (likely(atomic_dec_and_test(&p->u.count))) {
-			call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+			call_rcu_lazy(&dentry->d_u.d_rcu, __d_free_external);
 			return;
 		}
 	}
@@ -374,7 +374,7 @@ static void dentry_free(struct dentry *dentry)
 	if (dentry->d_flags & DCACHE_NORCU)
 		__d_free(&dentry->d_u.d_rcu);
 	else
-		call_rcu(&dentry->d_u.d_rcu, __d_free);
+		call_rcu_lazy(&dentry->d_u.d_rcu, __d_free);
 }
 
 /*
diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 8b56b94e2f56..530e62538d3d 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -728,7 +728,7 @@ static int ep_remove(struct eventpoll *ep, struct epitem *epi)
 	 * ep->mtx. The rcu read side, reverse_path_check_proc(), does not make
 	 * use of the rbn field.
 	 */
-	call_rcu(&epi->rcu, epi_rcu_free);
+	call_rcu_lazy(&epi->rcu, epi_rcu_free);
 
 	percpu_counter_dec(&ep->user->epoll_watches);
 
diff --git a/fs/file_table.c b/fs/file_table.c
index 99c6796c9f28..36b3de669771 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -56,7 +56,7 @@ static inline void file_free(struct file *f)
 	security_file_free(f);
 	if (!(f->f_mode & FMODE_NOACCOUNT))
 		percpu_counter_dec(&nr_files);
-	call_rcu(&f->f_rcuhead, file_free_rcu);
+	call_rcu_lazy(&f->f_rcuhead, file_free_rcu);
 }
 
 /*
diff --git a/fs/inode.c b/fs/inode.c
index 6462276dfdf0..7d1573cf3a97 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -312,7 +312,7 @@ static void destroy_inode(struct inode *inode)
 			return;
 	}
 	inode->free_inode = ops->free_inode;
-	call_rcu(&inode->i_rcu, i_callback);
+	call_rcu_lazy(&inode->i_rcu, i_callback);
 }
 
 /**
-- 
2.37.2.789.g6183377224-goog
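
The conversions in this and the following patches all follow the same
pattern: the callbacks involved only free memory, so delaying them with
the lazy API is expected to be harmless. A hypothetical example of the
pattern (struct foo and foo_free_rcu are made up for illustration):

	static void foo_free_rcu(struct rcu_head *rhp)
	{
		struct foo *fp = container_of(rhp, struct foo, rcu);

		kfree(fp);	/* pure memory reclaim, safe to defer */
	}

	static void foo_release(struct foo *fp)
	{
		/* Was: call_rcu(&fp->rcu, foo_free_rcu); */
		call_rcu_lazy(&fp->rcu, foo_free_rcu);
	}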


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 12/18] cred: Move call_rcu() to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (10 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 11/18] fs: Move call_rcu() to call_rcu_lazy() in some paths Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 13/18] security: " Joel Fernandes (Google)
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery
too quickly and too often, which increases system power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/cred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/cred.c b/kernel/cred.c
index e10c15f51c1f..c7cb2e3ac73a 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -150,7 +150,7 @@ void __put_cred(struct cred *cred)
 	if (cred->non_rcu)
 		put_cred_rcu(&cred->rcu);
 	else
-		call_rcu(&cred->rcu, put_cred_rcu);
+		call_rcu_lazy(&cred->rcu, put_cred_rcu);
 }
 EXPORT_SYMBOL(__put_cred);
 
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 13/18] security: Move call_rcu() to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (11 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 12/18] cred: Move call_rcu() to call_rcu_lazy() Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 14/18] net/core: " Joel Fernandes (Google)
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery
too quickly and too often, which increases system power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 security/security.c    | 2 +-
 security/selinux/avc.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/security/security.c b/security/security.c
index 14d30fec8a00..b51b4bdb567d 100644
--- a/security/security.c
+++ b/security/security.c
@@ -1053,7 +1053,7 @@ void security_inode_free(struct inode *inode)
 	 * The inode will be freed after the RCU grace period too.
 	 */
 	if (inode->i_security)
-		call_rcu((struct rcu_head *)inode->i_security,
+		call_rcu_lazy((struct rcu_head *)inode->i_security,
 				inode_free_by_rcu);
 }
 
diff --git a/security/selinux/avc.c b/security/selinux/avc.c
index 9a43af0ebd7d..381f046d820f 100644
--- a/security/selinux/avc.c
+++ b/security/selinux/avc.c
@@ -442,7 +442,7 @@ static void avc_node_free(struct rcu_head *rhead)
 static void avc_node_delete(struct selinux_avc *avc, struct avc_node *node)
 {
 	hlist_del_rcu(&node->list);
-	call_rcu(&node->rhead, avc_node_free);
+	call_rcu_lazy(&node->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
@@ -458,7 +458,7 @@ static void avc_node_replace(struct selinux_avc *avc,
 			     struct avc_node *new, struct avc_node *old)
 {
 	hlist_replace_rcu(&old->list, &new->list);
-	call_rcu(&old->rhead, avc_node_free);
+	call_rcu_lazy(&old->rhead, avc_node_free);
 	atomic_dec(&avc->avc_cache.active_nodes);
 }
 
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 14/18] net/core: Move call_rcu() to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (12 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 13/18] security: " Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 15/18] kernel: Move various core kernel usages " Joel Fernandes (Google)
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery
too quickly and too often, which increases system power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 net/core/dst.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dst.c b/net/core/dst.c
index bc9c9be4e080..babf49e413e1 100644
--- a/net/core/dst.c
+++ b/net/core/dst.c
@@ -174,7 +174,7 @@ void dst_release(struct dst_entry *dst)
 			net_warn_ratelimited("%s: dst:%p refcnt:%d\n",
 					     __func__, dst, newrefcnt);
 		if (!newrefcnt)
-			call_rcu(&dst->rcu_head, dst_destroy_rcu);
+			call_rcu_lazy(&dst->rcu_head, dst_destroy_rcu);
 	}
 }
 EXPORT_SYMBOL(dst_release);
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 15/18] kernel: Move various core kernel usages to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (13 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 14/18] net/core: " Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 16/18] lib: Move call_rcu() " Joel Fernandes (Google)
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/exit.c              | 2 +-
 kernel/pid.c               | 2 +-
 kernel/time/posix-timers.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 84021b24f79e..b2a96356980a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -180,7 +180,7 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 void put_task_struct_rcu_user(struct task_struct *task)
 {
 	if (refcount_dec_and_test(&task->rcu_users))
-		call_rcu(&task->rcu, delayed_put_task_struct);
+		call_rcu_lazy(&task->rcu, delayed_put_task_struct);
 }
 
 void release_task(struct task_struct *p)
diff --git a/kernel/pid.c b/kernel/pid.c
index 2fc0a16ec77b..5a5144519d70 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -153,7 +153,7 @@ void free_pid(struct pid *pid)
 	}
 	spin_unlock_irqrestore(&pidmap_lock, flags);
 
-	call_rcu(&pid->rcu, delayed_put_pid);
+	call_rcu_lazy(&pid->rcu, delayed_put_pid);
 }
 
 struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
diff --git a/kernel/time/posix-timers.c b/kernel/time/posix-timers.c
index 5dead89308b7..660daa427b41 100644
--- a/kernel/time/posix-timers.c
+++ b/kernel/time/posix-timers.c
@@ -485,7 +485,7 @@ static void release_posix_timer(struct k_itimer *tmr, int it_id_set)
 	}
 	put_pid(tmr->it_pid);
 	sigqueue_free(tmr->sigq);
-	call_rcu(&tmr->rcu, k_itimer_rcu_free);
+	call_rcu_lazy(&tmr->rcu, k_itimer_rcu_free);
 }
 
 static int common_timer_create(struct k_itimer *new_timer)
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 16/18] lib: Move call_rcu() to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (14 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 15/18] kernel: Move various core kernel usages " Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 17/18] i915: " Joel Fernandes (Google)
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

Move radix-tree and xarray to call_rcu_lazy(). This is required to
prevent callbacks from triggering the RCU machinery too quickly and too
often, which increases system power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 lib/radix-tree.c | 2 +-
 lib/xarray.c     | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/radix-tree.c b/lib/radix-tree.c
index 3c78e1e8b2ad..49c821b16acc 100644
--- a/lib/radix-tree.c
+++ b/lib/radix-tree.c
@@ -305,7 +305,7 @@ void radix_tree_node_rcu_free(struct rcu_head *head)
 static inline void
 radix_tree_node_free(struct radix_tree_node *node)
 {
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
diff --git a/lib/xarray.c b/lib/xarray.c
index ea9ce1f0b386..230abc8045fe 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -257,7 +257,7 @@ static void xa_node_free(struct xa_node *node)
 {
 	XA_NODE_BUG_ON(node, !list_empty(&node->private_list));
 	node->array = XA_RCU_FREE;
-	call_rcu(&node->rcu_head, radix_tree_node_rcu_free);
+	call_rcu_lazy(&node->rcu_head, radix_tree_node_rcu_free);
 }
 
 /*
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 17/18] i915: Move call_rcu() to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (15 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 16/18] lib: Move call_rcu() " Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-01 22:17 ` [PATCH v5 18/18] fork: Move thread_stack_free_rcu() " Joel Fernandes (Google)
  2022-09-03 15:22 ` [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Paul E. McKenney
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery
too quickly and too often, which increases system power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 drivers/gpu/drm/i915/gem/i915_gem_object.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_object.c b/drivers/gpu/drm/i915/gem/i915_gem_object.c
index ccec4055fde3..21d6f66fb394 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_object.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_object.c
@@ -343,7 +343,7 @@ static void __i915_gem_free_objects(struct drm_i915_private *i915,
 		__i915_gem_free_object(obj);
 
 		/* But keep the pointer alive for RCU-protected lookups */
-		call_rcu(&obj->rcu, __i915_gem_free_object_rcu);
+		call_rcu_lazy(&obj->rcu, __i915_gem_free_object_rcu);
 		cond_resched();
 	}
 }
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH v5 18/18] fork: Move thread_stack_free_rcu() to call_rcu_lazy()
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (16 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 17/18] i915: " Joel Fernandes (Google)
@ 2022-09-01 22:17 ` Joel Fernandes (Google)
  2022-09-03 15:22 ` [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Paul E. McKenney
  18 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes (Google) @ 2022-09-01 22:17 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Joel Fernandes (Google)

This is required to prevent callbacks from triggering the RCU machinery
too quickly and too often, which increases system power consumption.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/fork.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 90c85b17bf69..08a2298b1f94 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -226,7 +226,7 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 	struct vm_stack *vm_stack = tsk->stack;
 
 	vm_stack->stack_vm_area = tsk->stack_vm_area;
-	call_rcu(&vm_stack->rcu, thread_stack_free_rcu);
+	call_rcu_lazy(&vm_stack->rcu, thread_stack_free_rcu);
 }
 
 static int free_vm_stack_cache(unsigned int cpu)
@@ -353,7 +353,7 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 {
 	struct rcu_head *rh = tsk->stack;
 
-	call_rcu(rh, thread_stack_free_rcu);
+	call_rcu_lazy(rh, thread_stack_free_rcu);
 }
 
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
@@ -388,7 +388,7 @@ static void thread_stack_delayed_free(struct task_struct *tsk)
 {
 	struct rcu_head *rh = tsk->stack;
 
-	call_rcu(rh, thread_stack_free_rcu);
+	call_rcu_lazy(rh, thread_stack_free_rcu);
 }
 
 static int alloc_thread_stack_node(struct task_struct *tsk, int node)
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
  2022-09-01 22:17 ` [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head Joel Fernandes (Google)
@ 2022-09-02  9:26   ` Vlastimil Babka
  2022-09-02  9:30     ` Vlastimil Babka
  0 siblings, 1 reply; 73+ messages in thread
From: Vlastimil Babka @ 2022-09-02  9:26 UTC (permalink / raw)
  To: Joel Fernandes (Google), rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng

On 9/2/22 00:17, Joel Fernandes (Google) wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
> 
> Joel reports [1] that increasing the rcu_head size for debugging
> purposes used to work before struct slab was split from struct page, but
> now runs into the various SLAB_MATCH() sanity checks of the layout.
> 
> This is because the rcu_head in struct page is in union with large
> sub-structures and has space to grow without exceeding their size, while
> in struct slab (for SLAB and SLUB) it's in union only with a list_head.
> 
> On closer inspection (and after the previous patch) we can put all
> fields except slab_cache to a union with rcu_head, as slab_cache is
> sufficient for the rcu freeing callbacks to work and the rest can be
> overwritten by rcu_head without causing issues.
> 
> This is only somewhat complicated by the need to keep SLUB's
> freelist+counters aligned for cmpxchg_double. As a result the fields
> need to be reordered so that slab_cache is first (after page flags) and
> the union with rcu_head follows. For consistency, do that for SLAB as
> well, although not necessary there.
> 
> As a result, the rcu_head field in struct page and struct slab is no
> longer at the same offset, but that doesn't matter as there is no
> casting that would rely on that in the slab freeing callbacks, so we can
> just drop the respective SLAB_MATCH() check.
> 
> Also we need to update the SLAB_MATCH() for compound_head to reflect the
> new ordering.
> 
> While at it, also add a static_assert to check the alignment needed for
> cmpxchg_double so mistakes are found sooner than a runtime GPF.
> 
> [1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/
> 
> Reported-by: Joel Fernandes <joel@joelfernandes.org>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I've added patches 01 and 02 to slab tree for -next exposure before Joel's
full series posting, but it should be also ok if rcu tree carries them with
the whole patchset. I can then drop them from slab tree (there are no
dependencies with other stuff there) so we don't introduce duplicite commits
needlessly, just give me a heads up.

> ---
>  mm/slab.h | 54 ++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 32 insertions(+), 22 deletions(-)
> 
> diff --git a/mm/slab.h b/mm/slab.h
> index 4ec82bec15ec..2c248864ea91 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -11,37 +11,43 @@ struct slab {
>  
>  #if defined(CONFIG_SLAB)
>  
> +	struct kmem_cache *slab_cache;
>  	union {
> -		struct list_head slab_list;
> +		struct {
> +			struct list_head slab_list;
> +			void *freelist;	/* array of free object indexes */
> +			void *s_mem;	/* first object */
> +		};
>  		struct rcu_head rcu_head;
>  	};
> -	struct kmem_cache *slab_cache;
> -	void *freelist;	/* array of free object indexes */
> -	void *s_mem;	/* first object */
>  	unsigned int active;
>  
>  #elif defined(CONFIG_SLUB)
>  
> -	union {
> -		struct list_head slab_list;
> -		struct rcu_head rcu_head;
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
> -		struct {
> -			struct slab *next;
> -			int slabs;	/* Nr of slabs left */
> -		};
> -#endif
> -	};
>  	struct kmem_cache *slab_cache;
> -	/* Double-word boundary */
> -	void *freelist;		/* first free object */
>  	union {
> -		unsigned long counters;
>  		struct {
> -			unsigned inuse:16;
> -			unsigned objects:15;
> -			unsigned frozen:1;
> +			union {
> +				struct list_head slab_list;
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
> +				struct {
> +					struct slab *next;
> +					int slabs;	/* Nr of slabs left */
> +				};
> +#endif
> +			};
> +			/* Double-word boundary */
> +			void *freelist;		/* first free object */
> +			union {
> +				unsigned long counters;
> +				struct {
> +					unsigned inuse:16;
> +					unsigned objects:15;
> +					unsigned frozen:1;
> +				};
> +			};
>  		};
> +		struct rcu_head rcu_head;
>  	};
>  	unsigned int __unused;
>  
> @@ -66,9 +72,10 @@ struct slab {
>  #define SLAB_MATCH(pg, sl)						\
>  	static_assert(offsetof(struct page, pg) == offsetof(struct slab, sl))
>  SLAB_MATCH(flags, __page_flags);
> -SLAB_MATCH(compound_head, slab_list);	/* Ensure bit 0 is clear */
>  #ifndef CONFIG_SLOB
> -SLAB_MATCH(rcu_head, rcu_head);
> +SLAB_MATCH(compound_head, slab_cache);	/* Ensure bit 0 is clear */
> +#else
> +SLAB_MATCH(compound_head, slab_list);	/* Ensure bit 0 is clear */
>  #endif
>  SLAB_MATCH(_refcount, __page_refcount);
>  #ifdef CONFIG_MEMCG
> @@ -76,6 +83,9 @@ SLAB_MATCH(memcg_data, memcg_data);
>  #endif
>  #undef SLAB_MATCH
>  static_assert(sizeof(struct slab) <= sizeof(struct page));
> +#if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && defined(CONFIG_SLUB)
> +static_assert(IS_ALIGNED(offsetof(struct slab, freelist), 16));
> +#endif
>  
>  /**
>   * folio_slab - Converts from folio to slab.


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
  2022-09-02  9:26   ` Vlastimil Babka
@ 2022-09-02  9:30     ` Vlastimil Babka
  2022-09-02 15:09       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Vlastimil Babka @ 2022-09-02  9:30 UTC (permalink / raw)
  To: Joel Fernandes (Google), rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Hyeonggon Yoo

On 9/2/22 11:26, Vlastimil Babka wrote:
> On 9/2/22 00:17, Joel Fernandes (Google) wrote:
>> From: Vlastimil Babka <vbabka@suse.cz>
>> 
>> Joel reports [1] that increasing the rcu_head size for debugging
>> purposes used to work before struct slab was split from struct page, but
>> now runs into the various SLAB_MATCH() sanity checks of the layout.
>> 
>> This is because the rcu_head in struct page is in union with large
>> sub-structures and has space to grow without exceeding their size, while
>> in struct slab (for SLAB and SLUB) it's in union only with a list_head.
>> 
>> On closer inspection (and after the previous patch) we can put all
>> fields except slab_cache to a union with rcu_head, as slab_cache is
>> sufficient for the rcu freeing callbacks to work and the rest can be
>> overwritten by rcu_head without causing issues.
>> 
>> This is only somewhat complicated by the need to keep SLUB's
>> freelist+counters aligned for cmpxchg_double. As a result the fields
>> need to be reordered so that slab_cache is first (after page flags) and
>> the union with rcu_head follows. For consistency, do that for SLAB as
>> well, although not necessary there.
>> 
>> As a result, the rcu_head field in struct page and struct slab is no
>> longer at the same offset, but that doesn't matter as there is no
>> casting that would rely on that in the slab freeing callbacks, so we can
>> just drop the respective SLAB_MATCH() check.
>> 
>> Also we need to update the SLAB_MATCH() for compound_head to reflect the
>> new ordering.
>> 
>> While at it, also add a static_assert to check the alignment needed for
>> cmpxchg_double so mistakes are found sooner than a runtime GPF.
>> 
>> [1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/
>> 
>> Reported-by: Joel Fernandes <joel@joelfernandes.org>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> I've added patches 01 and 02 to slab tree for -next exposure before Joel's
> full series posting, but it should be also ok if rcu tree carries them with
> the whole patchset. I can then drop them from slab tree (there are no
> dependencies with other stuff there) so we don't introduce duplicite commits
> needlessly, just give me a heads up.

Ah but in that case please apply the reviews from my posting [1]

patch 1:
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

patch 2
Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>

[1] https://lore.kernel.org/all/20220826090912.11292-1-vbabka@suse.cz/

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-01 22:17 ` [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens Joel Fernandes (Google)
@ 2022-09-02 11:35   ` Frederic Weisbecker
  2022-09-02 23:58     ` Joel Fernandes
  2022-09-03 14:04   ` Paul E. McKenney
  2022-09-06  3:07   ` Joel Fernandes
  2 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-02 11:35 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
> When the bypass cblist gets too big or its timeout has occurred, it is
> flushed into the main cblist. However, the bypass timer is still running
> and the behavior is that it would eventually expire and wake the GP
> thread.
> 
> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
> soon as the flush happens. Otherwise, the lazy-timer will go off much
> later and the now-non-lazy cblist CBs can get stranded for the duration
> of the timer.
> 
> This is a good thing to do anyway (regardless of this series), since it
> makes the behavior consistent with other code paths, where queueing
> something onto the ->cblist quickly puts the GP kthread into a
> non-sleeping state.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree_nocb.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 0a5f0ef41484..31068dd31315 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
>  			rdp->nocb_gp_adv_time = j;
>  		}
> -		rcu_nocb_unlock_irqrestore(rdp, flags);
> +
> +		// The flush succeeded and we moved CBs into the ->cblist.
> +		// However, the bypass timer might still be running. Wakeup the
> +		// GP thread by calling a helper with was_all_done set so that
> +		// wake up happens (needed if main CB list was empty before).
> +		__call_rcu_nocb_wake(rdp, true, flags)
> +

Ok so there are two different changes here:

1) wake up nocb_gp as we just flushed the bypass list. Indeed if the regular
   callback list was empty before flushing, we would rather immediately wake
   up nocb_gp instead of waiting for the bypass timer to process them.

2) wake up nocb_gp unconditionally (ie: even if the regular queue was not empty
   before bypass flushing) so that nocb_gp_wait() is forced through another loop
   starting with cancelling the bypass timer (I suggest you put such explanation
   in the comment btw because that process may not be obvious for mortals).

The change 1) looks like a good idea to me.

The change 2) has unclear motivation. It forces nocb_gp_wait() through another
costly loop, even though the timer might have been cancelled in the near
future anyway, avoiding that extra costly loop. Also it abuses the
was_alldone stuff and we may get rcu_nocb_wake with incoherent meanings
(WakeEmpty/WakeEmptyIsDeferred) when it's actually not empty.

So you may need to clarify the purpose. And I would suggest to make two patches
here.

Thanks!


   

   

>  		return true; // Callback already enqueued.
>  	}
>  
> -- 
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
  2022-09-02  9:30     ` Vlastimil Babka
@ 2022-09-02 15:09       ` Joel Fernandes
  2022-09-03 13:53         ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-02 15:09 UTC (permalink / raw)
  To: Vlastimil Babka, rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng, Hyeonggon Yoo



On 9/2/2022 5:30 AM, Vlastimil Babka wrote:
> On 9/2/22 11:26, Vlastimil Babka wrote:
>> On 9/2/22 00:17, Joel Fernandes (Google) wrote:
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>
>>> Joel reports [1] that increasing the rcu_head size for debugging
>>> purposes used to work before struct slab was split from struct page, but
>>> now runs into the various SLAB_MATCH() sanity checks of the layout.
>>>
>>> This is because the rcu_head in struct page is in union with large
>>> sub-structures and has space to grow without exceeding their size, while
>>> in struct slab (for SLAB and SLUB) it's in union only with a list_head.
>>>
>>> On closer inspection (and after the previous patch) we can put all
>>> fields except slab_cache to a union with rcu_head, as slab_cache is
>>> sufficient for the rcu freeing callbacks to work and the rest can be
>>> overwritten by rcu_head without causing issues.
>>>
>>> This is only somewhat complicated by the need to keep SLUB's
>>> freelist+counters aligned for cmpxchg_double. As a result the fields
>>> need to be reordered so that slab_cache is first (after page flags) and
>>> the union with rcu_head follows. For consistency, do that for SLAB as
>>> well, although not necessary there.
>>>
>>> As a result, the rcu_head field in struct page and struct slab is no
>>> longer at the same offset, but that doesn't matter as there is no
>>> casting that would rely on that in the slab freeing callbacks, so we can
>>> just drop the respective SLAB_MATCH() check.
>>>
>>> Also we need to update the SLAB_MATCH() for compound_head to reflect the
>>> new ordering.
>>>
>>> While at it, also add a static_assert to check the alignment needed for
>>> cmpxchg_double so mistakes are found sooner than a runtime GPF.
>>>
>>> [1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/
>>>
>>> Reported-by: Joel Fernandes <joel@joelfernandes.org>
>>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>>
>> I've added patches 01 and 02 to slab tree for -next exposure before Joel's
>> full series posting, but it should be also ok if rcu tree carries them with
>> the whole patchset. I can then drop them from slab tree (there are no
>> dependencies with other stuff there) so we don't introduce duplicite commits
>> needlessly, just give me a heads up.
> 
> Ah but in that case please apply the reviews from my posting [1]
> 
> patch 1:
> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 
> patch 2
> Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> 
> [1] https://lore.kernel.org/all/20220826090912.11292-1-vbabka@suse.cz/


Sorry for injecting confusion - my main intent with including the mm patches in
this series is to make it easier for other reviewers/testers to backport the
series to their kernels in one shot. Some reviewers expressed interest in
trying out the series.

I think it is best to let the -mm patches in the series go through the slab
tree, as you also have the Acks/Reviews there and will make sure those
dependencies are out of the way.

My lesson here is to be clearer: I could have added some notes for context
below the "---" of those mm patches.

Thanks again for your help,

- Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-01 22:17 ` [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
@ 2022-09-02 15:21   ` Frederic Weisbecker
  2022-09-02 23:09     ` Joel Fernandes
                       ` (2 more replies)
  2022-09-03 14:03   ` Paul E. McKenney
  1 sibling, 3 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-02 15:21 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
> Implement timer-based RCU lazy callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure will flush it in a
> future patch.
> 
> To handle several corner cases automagically (such as rcu_barrier() and
> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> length has the lazy CB length included in it. A separate lazy CB length
> counter is also introduced to keep track of the number of lazy CBs.
> 
> Suggested-by: Paul McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rcupdate.h   |   6 ++
>  kernel/rcu/Kconfig         |   8 ++
>  kernel/rcu/rcu.h           |  11 +++
>  kernel/rcu/rcu_segcblist.c |   2 +-
>  kernel/rcu/tree.c          | 130 +++++++++++++++---------
>  kernel/rcu/tree.h          |  13 ++-
>  kernel/rcu/tree_nocb.h     | 198 ++++++++++++++++++++++++++++++-------
>  7 files changed, 280 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 08605ce7379d..82e8a07e0856 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)

What about an inline instead?

> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index d471d22a5e21..3128d01427cb 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default n
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +	  pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index be5979da07f5..94675f14efe8 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -474,6 +474,14 @@ enum rcutorture_type {
>  	INVALID_RCU_FLAVOR
>  };
>  
> +#if defined(CONFIG_RCU_LAZY)
> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
> +#else
> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
> +#endif
> +
>  #if defined(CONFIG_TREE_RCU)
>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>  			    unsigned long *gp_seq);
> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>  			       unsigned long c_old,
>  			       unsigned long c);
>  void rcu_gp_set_torture_wait(int duration);
> +void rcu_force_call_rcu_to_lazy(bool force);
> +
>  #else
>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>  					  int *flags, unsigned long *gp_seq)
> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>  	do { } while (0)
>  #endif
>  static inline void rcu_gp_set_torture_wait(int duration) { }
> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>  #endif
>  
>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
> index c54ea2b6a36b..55b50e592986 100644
> --- a/kernel/rcu/rcu_segcblist.c
> +++ b/kernel/rcu/rcu_segcblist.c
> @@ -38,7 +38,7 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
>   * element of the second rcu_cblist structure, but ensuring that the second
>   * rcu_cblist structure, if initially non-empty, always appears non-empty
>   * throughout the process.  If rdp is NULL, the second rcu_cblist structure
> - * is instead initialized to empty.
> + * is instead initialized to empty. Also account for lazy_len for lazy CBs.

Leftover?

> @@ -3904,7 +3943,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	rdp->barrier_head.func = rcu_barrier_callback;
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
> +		     /* wake gp thread */ true));

It's a bad sign when you need to start commenting your boolean parameters :)
Perhaps use a single two-bit flag instead of two booleans, for readability?

Also, that's a subtle change whose purpose isn't explained. It means that
rcu_barrier_entrain() used to wait for the bypass timer in the worst case
but now we force rcuog into it immediately. Should that be a separate change?

> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 31068dd31315..7e97a7b6e046 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>  }
>  
> +/*
> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> + * could be flushed much earlier for a number of other reasons
> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> + * left unsubmitted to RCU after those many jiffies.
> + */
> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;

It seems this can be made static?

> @@ -265,23 +290,39 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>  {
>  	unsigned long flags;
>  	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
> +	unsigned long mod_jif = 0;
>  
>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>  
>  	/*
> -	 * Bypass wakeup overrides previous deferments. In case
> -	 * of callback storm, no need to wake up too early.
> +	 * Bypass and lazy wakeup overrides previous deferments. In case of
> +	 * callback storm, no need to wake up too early.
>  	 */
> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
> -		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
> -		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> -	} else {
> +	switch (waketype) {
> +	case RCU_NOCB_WAKE_LAZY:
> +		if (rdp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
> +			mod_jif = jiffies_till_flush;
> +		break;

Ah, so this timer overrides all the others, and it's a several-second
timer. Then I'll have a related concern on nocb_gp_wait().

> +
> +	case RCU_NOCB_WAKE_BYPASS:
> +		mod_jif = 2;
> +		break;
> +
> +	case RCU_NOCB_WAKE:
> +	case RCU_NOCB_WAKE_FORCE:
> +		// For these, make it wake up the soonest if we
> +		// were in a bypass or lazy sleep before.
>  		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
> -			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
> -		if (rdp_gp->nocb_defer_wakeup < waketype)
> -			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> +			mod_jif = 1;
> +		break;
>  	}
>  
> +	if (mod_jif)
> +		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
> +
> +	if (rdp_gp->nocb_defer_wakeup < waketype)
> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);

So RCU_NOCB_WAKE_BYPASS and RCU_NOCB_WAKE_LAZY don't override the timer state
anymore? Looks like something is missing.

> +
>  	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
>  
>  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
[...]
> @@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	my_rdp->nocb_gp_gp = needwait_gp;
>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
>  
> -	if (bypass && !rcu_nocb_poll) {
> -		// At least one child with non-empty ->nocb_bypass, so set
> -		// timer in order to avoid stranding its callbacks.
> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> -				   TPS("WakeBypassIsDeferred"));
> +	// At least one child with non-empty ->nocb_bypass, so set
> +	// timer in order to avoid stranding its callbacks.
> +	if (!rcu_nocb_poll) {
> +		// If bypass list only has lazy CBs. Add a deferred
> +		// lazy wake up.
> +		if (lazy && !bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> +					TPS("WakeLazyIsDeferred"));

What if:

1) rdp(1) has lazy callbacks
2) all other rdp's have no callback at all
3) nocb_gp_wait() runs through all rdp's, everything is handled, except for
   these lazy callbacks
4) It reaches the above path, ready to arm the RCU_NOCB_WAKE_LAZY timer,
   but it hasn't yet called wake_nocb_gp_defer()
5) Oh but rdp(2) queues a non-lazy callback. Interrupts are disabled so it defers
   the wake up to nocb_gp_wait(), arming the timer in RCU_NOCB_WAKE mode.
6) nocb_gp_wait() finally calls wake_nocb_gp_defer() and overrides the timeout
   to several seconds ahead.
7) No more callbacks queued, the non-lazy callback will have to wait several
   seconds to complete.

Or did I miss something? Note that the race exists with RCU_NOCB_WAKE_BYPASS
but it's only about one jiffy delay, not seconds.


(further review later).

Thanks.

> +		// Otherwise add a deferred bypass wake up.
> +		} else if (bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> +					TPS("WakeBypassIsDeferred"));
> +		}
>  	}
> +
>  	if (rcu_nocb_poll) {
>  		/* Polling, so trace if first poll in the series. */
>  		if (gotcbs)
> @@ -1036,7 +1156,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>  	 * return false, which means that future calls to rcu_nocb_try_bypass()
>  	 * will refuse to put anything into the bypass.
>  	 */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
>  	/*
>  	 * Start with invoking rcu_core() early. This way if the current thread
>  	 * happens to preempt an ongoing call to rcu_core() in the middle,
> @@ -1290,6 +1410,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  	raw_spin_lock_init(&rdp->nocb_gp_lock);
>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
>  	rcu_cblist_init(&rdp->nocb_bypass);
> +	WRITE_ONCE(rdp->lazy_len, 0);
>  	mutex_init(&rdp->nocb_gp_kthread_mutex);
>  }
>  
> @@ -1571,13 +1692,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>  }
>  
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy, bool wakegp)
>  {
>  	return true;
>  }
>  
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy)
>  {
>  	return false;
>  }
> -- 
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.
  2022-09-01 22:17 ` [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation Joel Fernandes (Google)
@ 2022-09-02 16:48   ` kernel test robot
  2022-09-03 12:39     ` Paul E. McKenney
  2022-09-02 19:01   ` kernel test robot
  1 sibling, 1 reply; 73+ messages in thread
From: kernel test robot @ 2022-09-02 16:48 UTC (permalink / raw)
  To: Joel Fernandes (Google), rcu
  Cc: kbuild-all, linux-kernel, rushikesh.s.kadam, urezki,
	neeraj.iitr10, frederic, paulmck, rostedt, vineeth, boqun.feng,
	Joel Fernandes (Google)

Hi "Joel,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on paulmck-rcu/dev]
[also build test ERROR on pcmoore-selinux/next drm-intel/for-linux-next linus/master v6.0-rc3]
[cannot apply to vbabka-slab/for-next rostedt-trace/for-next tip/timers/core next-20220901]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
base:   https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev
config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220903/202209030052.20CJhjTX-lkp@intel.com/config)
compiler: mips-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/c0f09b1d42d06649680f74a78ca363e7f1c158b2
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
        git checkout c0f09b1d42d06649680f74a78ca363e7f1c158b2
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>):

   In file included from <command-line>:
   In function 'dst_hold',
       inlined from 'dst_clone' at include/net/dst.h:251:3,
       inlined from '__skb_dst_copy' at include/net/dst.h:284:3,
       inlined from 'ovs_vport_output' at net/openvswitch/actions.c:787:2:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'execute_set_action' at net/openvswitch/actions.c:1093:3,
       inlined from 'do_execute_actions' at net/openvswitch/actions.c:1377:10:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
--
   In file included from <command-line>:
   In function 'dst_hold',
       inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
       inlined from 'dn_insert_route.constprop.isra' at net/decnet/dn_route.c:334:4:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
       inlined from 'dn_insert_route.constprop.isra' at net/decnet/dn_route.c:347:2:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
       inlined from 'dn_route_input' at net/decnet/dn_route.c:1535:4:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
       inlined from '__dn_route_output_key.isra' at net/decnet/dn_route.c:1257:5:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'dst_clone' at include/net/dst.h:251:3,
       inlined from 'dn_cache_dump' at net/decnet/dn_route.c:1752:4:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
--
   In file included from <command-line>:
   In function 'dst_hold',
       inlined from 'dst_clone' at include/net/dst.h:251:3,
       inlined from 'ip6_copy_metadata' at net/ipv6/ip6_output.c:654:2:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'ip6_append_data' at net/ipv6/ip6_output.c:1838:3:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
   In function 'dst_hold',
       inlined from 'dst_clone' at include/net/dst.h:251:3,
       inlined from 'ip6_sk_dst_lookup_flow' at net/ipv6/ip6_output.c:1262:3:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~
--
   In file included from <command-line>:
   In function 'dst_hold',
       inlined from 'dst_clone' at include/net/dst.h:251:3,
       inlined from '__skb_dst_copy' at include/net/dst.h:284:3,
       inlined from 'skb_dst_copy' at include/net/dst.h:289:2,
       inlined from 'ip6_list_rcv_finish.constprop' at net/ipv6/ip6_input.c:128:4:
>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |                                             ^
   include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
     335 |                         prefix ## suffix();                             \
         |                         ^~~~~~
   include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
     354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
         |         ^~~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
      39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
         |                                     ^~~~~~~~~~~~~~~~~~
   include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
      50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
         |         ^~~~~~~~~~~~~~~~
   include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
     230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
         |         ^~~~~~~~~~~~


vim +/__compiletime_assert_490 +354 include/linux/compiler_types.h

eb5c2d4b45e3d2 Will Deacon 2020-07-21  340  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  341  #define _compiletime_assert(condition, msg, prefix, suffix) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21  342  	__compiletime_assert(condition, msg, prefix, suffix)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  343  
eb5c2d4b45e3d2 Will Deacon 2020-07-21  344  /**
eb5c2d4b45e3d2 Will Deacon 2020-07-21  345   * compiletime_assert - break build and emit msg if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  346   * @condition: a compile-time constant condition to check
eb5c2d4b45e3d2 Will Deacon 2020-07-21  347   * @msg:       a message to emit if condition is false
eb5c2d4b45e3d2 Will Deacon 2020-07-21  348   *
eb5c2d4b45e3d2 Will Deacon 2020-07-21  349   * In tradition of POSIX assert, this macro will break the build if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  350   * supplied condition is *false*, emitting the supplied error message if the
eb5c2d4b45e3d2 Will Deacon 2020-07-21  351   * compiler has support to do so.
eb5c2d4b45e3d2 Will Deacon 2020-07-21  352   */
eb5c2d4b45e3d2 Will Deacon 2020-07-21  353  #define compiletime_assert(condition, msg) \
eb5c2d4b45e3d2 Will Deacon 2020-07-21 @354  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
eb5c2d4b45e3d2 Will Deacon 2020-07-21  355  

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.
  2022-09-01 22:17 ` [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation Joel Fernandes (Google)
  2022-09-02 16:48   ` kernel test robot
@ 2022-09-02 19:01   ` kernel test robot
  1 sibling, 0 replies; 73+ messages in thread
From: kernel test robot @ 2022-09-02 19:01 UTC (permalink / raw)
  To: Joel Fernandes (Google), rcu
  Cc: kbuild-all, linux-kernel, rushikesh.s.kadam, urezki,
	neeraj.iitr10, frederic, paulmck, rostedt, vineeth, boqun.feng,
	Joel Fernandes (Google)

Hi "Joel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on paulmck-rcu/dev]
[also build test WARNING on pcmoore-selinux/next drm-intel/for-linux-next linus/master v6.0-rc3]
[cannot apply to vbabka-slab/for-next rostedt-trace/for-next tip/timers/core next-20220901]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
base:   https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev
config: sparc-allyesconfig (https://download.01.org/0day-ci/archive/20220903/202209030201.w42TEYCt-lkp@intel.com/config)
compiler: sparc64-linux-gcc (GCC) 12.1.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/c0f09b1d42d06649680f74a78ca363e7f1c158b2
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
        git checkout c0f09b1d42d06649680f74a78ca363e7f1c158b2
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=sparc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from include/linux/init.h:6,
                    from arch/sparc/kernel/head_64.S:13:
>> include/linux/types.h:233:44: warning: missing terminating ' character
     233 |         // at HZ=1000 before wrapping. That's enough for debug.
         |                                            ^


vim +233 include/linux/types.h

   230	
   231	struct cb_debug_info {
   232		// 16-bit jiffie deltas can provide about 60 seconds of resolution
 > 233		// at HZ=1000 before wrapping. That's enough for debug.
   234		u16 cb_queue_jiff;
   235		u16 first_bp_jiff;
   236		u16 cb_flush_jiff;
   237		enum cb_debug_flags flags:16;
   238	};
   239	#endif
   240	

-- 
0-DAY CI Kernel Test Service
https://01.org/lkp

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-02 15:21   ` Frederic Weisbecker
@ 2022-09-02 23:09     ` Joel Fernandes
  2022-09-05 12:59       ` Frederic Weisbecker
  2022-09-03 22:00     ` Joel Fernandes
  2022-09-06  3:05     ` Joel Fernandes
  2 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-02 23:09 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

Hi Frederic,

On 9/2/2022 11:21 AM, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
>> Implement timer-based RCU lazy callback batching. The batch is flushed
>> whenever a certain amount of time has passed, or the batch on a
>> particular CPU grows too big. Also memory pressure will flush it in a
>> future patch.
>>
>> To handle several corner cases automagically (such as rcu_barrier() and
>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
>> length has the lazy CB length included in it. A separate lazy CB length
>> counter is also introduced to keep track of the number of lazy CBs.
>>
>> Suggested-by: Paul McKenney <paulmck@kernel.org>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  include/linux/rcupdate.h   |   6 ++
>>  kernel/rcu/Kconfig         |   8 ++
>>  kernel/rcu/rcu.h           |  11 +++
>>  kernel/rcu/rcu_segcblist.c |   2 +-
>>  kernel/rcu/tree.c          | 130 +++++++++++++++---------
>>  kernel/rcu/tree.h          |  13 ++-
>>  kernel/rcu/tree_nocb.h     | 198 ++++++++++++++++++++++++++++++-------
>>  7 files changed, 280 insertions(+), 88 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 08605ce7379d..82e8a07e0856 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>>  
>>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>  
>> +#ifdef CONFIG_RCU_LAZY
>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
>> +#else
>> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> 
> What about an inline instead?

Sure, I can do that.
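
Something like this, I suppose (untested sketch of the !CONFIG_RCU_LAZY stub):

	static inline void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
	{
		call_rcu(head, func);
	}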

>> +#endif
>> +
>>  /* Internal to kernel */
>>  void rcu_init(void);
>>  extern int rcu_scheduler_active;
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index d471d22a5e21..3128d01427cb 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>>  	  Say N here if you hate read-side memory barriers.
>>  	  Take the default if you are unsure.
>>  
>> +config RCU_LAZY
>> +	bool "RCU callback lazy invocation functionality"
>> +	depends on RCU_NOCB_CPU
>> +	default n
>> +	help
>> +	  To save power, batch RCU callbacks and flush after delay, memory
>> +	  pressure or callback list growing too big.
>> +
>>  endmenu # "RCU Subsystem"
>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>> index be5979da07f5..94675f14efe8 100644
>> --- a/kernel/rcu/rcu.h
>> +++ b/kernel/rcu/rcu.h
>> @@ -474,6 +474,14 @@ enum rcutorture_type {
>>  	INVALID_RCU_FLAVOR
>>  };
>>  
>> +#if defined(CONFIG_RCU_LAZY)
>> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
>> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
>> +#else
>> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
>> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
>> +#endif
>> +
>>  #if defined(CONFIG_TREE_RCU)
>>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>>  			    unsigned long *gp_seq);
>> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>  			       unsigned long c_old,
>>  			       unsigned long c);
>>  void rcu_gp_set_torture_wait(int duration);
>> +void rcu_force_call_rcu_to_lazy(bool force);
>> +
>>  #else
>>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>>  					  int *flags, unsigned long *gp_seq)
>> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>  	do { } while (0)
>>  #endif
>>  static inline void rcu_gp_set_torture_wait(int duration) { }
>> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>>  #endif
>>  
>>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
>> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
>> index c54ea2b6a36b..55b50e592986 100644
>> --- a/kernel/rcu/rcu_segcblist.c
>> +++ b/kernel/rcu/rcu_segcblist.c
>> @@ -38,7 +38,7 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
>>   * element of the second rcu_cblist structure, but ensuring that the second
>>   * rcu_cblist structure, if initially non-empty, always appears non-empty
>>   * throughout the process.  If rdp is NULL, the second rcu_cblist structure
>> - * is instead initialized to empty.
>> + * is instead initialized to empty. Also account for lazy_len for lazy CBs.
> 
> Leftover?

Will fix, thanks.

>> @@ -3904,7 +3943,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>>  	rdp->barrier_head.func = rcu_barrier_callback;
>>  	debug_rcu_head_queue(&rdp->barrier_head);
>>  	rcu_nocb_lock(rdp);
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
>> +		     /* wake gp thread */ true));
> 
> It's a bad sign when you need to start commenting your boolean parameters :)
> Perhaps use a single two-bit flag instead of two booleans, for readability?

That's fair. What do you mean by a 2-bit flag? Are you saying we encode the
last 2 parameters to flush bypass in a u* flags word?
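
Something like this, perhaps (hypothetical flag names, just to check I follow):

	/* Flags for rcu_nocb_flush_bypass(), replacing the two booleans. */
	#define FLUSH_BP_LAZY	BIT(0)	/* The CB being queued is lazy. */
	#define FLUSH_BP_WAKE	BIT(1)	/* Wake up rcuog after flushing. */

	/* e.g. the rcu_barrier_entrain() call site would then become: */
	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));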

> Also, that's a subtle change whose purpose isn't explained. It means that
> rcu_barrier_entrain() used to wait for the bypass timer in the worst case
> but now we force rcuog into it immediately. Should that be a separate change?

It could be split, but it is laziness that amplifies the issue, so I thought of
keeping it in the same patch. I don't mind one way or the other.

>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index 31068dd31315..7e97a7b6e046 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>>  }
>>  
>> +/*
>> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
>> + * can elapse before lazy callbacks are flushed. Lazy callbacks
>> + * could be flushed much earlier for a number of other reasons
>> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
>> + * left unsubmitted to RCU after those many jiffies.
>> + */
>> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
>> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> 
> It seems this can be made static?

Yes, will do.
>> @@ -265,23 +290,39 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>>  {
>>  	unsigned long flags;
>>  	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
>> +	unsigned long mod_jif = 0;
>>  
>>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>>  
>>  	/*
>> -	 * Bypass wakeup overrides previous deferments. In case
>> -	 * of callback storm, no need to wake up too early.
>> +	 * Bypass and lazy wakeup overrides previous deferments. In case of
>> +	 * callback storm, no need to wake up too early.
>>  	 */
>> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
>> -		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>> -		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>> -	} else {
>> +	switch (waketype) {
>> +	case RCU_NOCB_WAKE_LAZY:
>> +		if (rdp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
>> +			mod_jif = jiffies_till_flush;
>> +		break;
> 
> Ah, so this timer overrides all the others, and it's a several-second
> timer. Then I'll have a related concern on nocb_gp_wait().

Ok I replied below.

>> +	case RCU_NOCB_WAKE_BYPASS:
>> +		mod_jif = 2;
>> +		break;
>> +
>> +	case RCU_NOCB_WAKE:
>> +	case RCU_NOCB_WAKE_FORCE:
>> +		// For these, make it wake up the soonest if we
>> +		// were in a bypass or lazy sleep before.
>>  		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
>> -			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
>> -		if (rdp_gp->nocb_defer_wakeup < waketype)
>> -			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>> +			mod_jif = 1;
>> +		break;
>>  	}
>>  
>> +	if (mod_jif)
>> +		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
>> +
>> +	if (rdp_gp->nocb_defer_wakeup < waketype)
>> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> 
> So RCU_NOCB_WAKE_BYPASS and RCU_NOCB_WAKE_LAZY don't override the timer state
> anymore? Looks like something is missing.

My goal was to make sure that a NOCB_WAKE_LAZY wake keeps the timer lazy. If I
don't do this, then when the CPU enters idle, it will immediately do a wake up
via this call:

	rcu_nocb_need_deferred_wakeup(rdp_gp, RCU_NOCB_WAKE)

That was almost always causing lazy CBs to become non-lazy, thus negating all
the benefits.

Note that bypass also has the same issue, where the bypass CB will not wait
for the intended bypass duration. To make it consistent with lazy, I made
bypass also not override nocb_defer_wakeup.

I agree it's not pretty, but it works and I could not find any code path where
it does not. That said, I am open to ideas for changing this, and perhaps some
of these unneeded delays with bypass CBs can be split into separate patches.

Regarding your point about the nocb_defer_wakeup state diverging from the
timer programming, that happens anyway here in the current code:

 283        } else {
 284                if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
 285                        mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
 286                if (rdp_gp->nocb_defer_wakeup < waketype)
 287                        WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 288        }

>> +
>>  	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
>>  
>>  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
> [...]
>> @@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>>  	my_rdp->nocb_gp_gp = needwait_gp;
>>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
>>  
>> -	if (bypass && !rcu_nocb_poll) {
>> -		// At least one child with non-empty ->nocb_bypass, so set
>> -		// timer in order to avoid stranding its callbacks.
>> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
>> -				   TPS("WakeBypassIsDeferred"));
>> +	// At least one child with non-empty ->nocb_bypass, so set
>> +	// timer in order to avoid stranding its callbacks.
>> +	if (!rcu_nocb_poll) {
>> +		// If bypass list only has lazy CBs. Add a deferred
>> +		// lazy wake up.
>> +		if (lazy && !bypass) {
>> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
>> +					TPS("WakeLazyIsDeferred"));
> 
> What if:
> 
> 1) rdp(1) has lazy callbacks
> 2) all other rdp's have no callback at all
> 3) nocb_gp_wait() runs through all rdp's, everything is handled, except for
>    these lazy callbacks
> 4) It reaches the above path, ready to arm the RCU_NOCB_WAKE_LAZY timer,
>    but it hasn't yet called wake_nocb_gp_defer()
> 5) Oh but rdp(2) queues a non-lazy callback. Interrupts are disabled so it defers
>    the wake up to nocb_gp_wait(), arming the timer in RCU_NOCB_WAKE mode.
> 6) nocb_gp_wait() finally calls wake_nocb_gp_defer() and overrides the timeout
>    to several seconds ahead.
> 7) No more callbacks queued, the non-lazy callback will have to wait several
>    seconds to complete.
> 
> Or did I miss something?

In theory, I can see this being an issue. In practice, I have not seen it
happen. In my view, the nocb GP thread should not go to sleep in the first
place if there are any non-bypass CBs being queued. If it does, then that
seems like an optimization-related bug.

That said, we can perhaps make wake_nocb_gp_defer() more robust by making it
not overwrite the timer if the requested wake type is weaker than
RCU_NOCB_WAKE; that change should not cause the going-into-idle issues I
pointed out. Whether the idle-time issue will actually happen, I have no idea.
But in theory, would that address your concern above?
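
Roughly something like this on top of the switch in this patch, checking
rdp_gp consistently (completely untested, just to illustrate the idea):

	case RCU_NOCB_WAKE_LAZY:
		// Never let a lazy request push out an already-armed
		// RCU_NOCB_WAKE/RCU_NOCB_WAKE_FORCE timer, and don't keep
		// re-arming if we are already in a lazy sleep.
		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE &&
		    rdp_gp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
			mod_jif = jiffies_till_flush;
		break;

	case RCU_NOCB_WAKE_BYPASS:
		// Same rule for bypass requests.
		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
			mod_jif = 2;
		break;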

> Note that the race exists with RCU_NOCB_WAKE_BYPASS
> but it's only about one jiffy delay, not seconds.

Well, 2 jiffies. But yeah.

Thanks. So far I do not see anything that cannot be fixed on top of this
patch, but you raised some good points. Maybe we ought to rewrite the idle
path to not disturb lazy CBs in a different way, or something (while keeping
the timer state consistent with the programming of the timer in
wake_nocb_gp_defer()).

> (further review later).

Ok thanks, looking forward to them.


- Joel




> 
> Thanks.
> 
>> +		// Otherwise add a deferred bypass wake up.
>> +		} else if (bypass) {
>> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
>> +					TPS("WakeBypassIsDeferred"));
>> +		}
>>  	}
>> +
>>  	if (rcu_nocb_poll) {
>>  		/* Polling, so trace if first poll in the series. */
>>  		if (gotcbs)
>> @@ -1036,7 +1156,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>>  	 * return false, which means that future calls to rcu_nocb_try_bypass()
>>  	 * will refuse to put anything into the bypass.
>>  	 */
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
>>  	/*
>>  	 * Start with invoking rcu_core() early. This way if the current thread
>>  	 * happens to preempt an ongoing call to rcu_core() in the middle,
>> @@ -1290,6 +1410,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>>  	raw_spin_lock_init(&rdp->nocb_gp_lock);
>>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
>>  	rcu_cblist_init(&rdp->nocb_bypass);
>> +	WRITE_ONCE(rdp->lazy_len, 0);
>>  	mutex_init(&rdp->nocb_gp_kthread_mutex);
>>  }
>>  
>> @@ -1571,13 +1692,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>>  }
>>  
>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				  unsigned long j)
>> +				  unsigned long j, bool lazy, bool wakegp)
>>  {
>>  	return true;
>>  }
>>  
>>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				bool *was_alldone, unsigned long flags)
>> +				bool *was_alldone, unsigned long flags,
>> +				bool lazy)
>>  {
>>  	return false;
>>  }
>> -- 
>> 2.37.2.789.g6183377224-goog
>>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-02 11:35   ` Frederic Weisbecker
@ 2022-09-02 23:58     ` Joel Fernandes
  2022-09-03 15:10       ` Paul E. McKenney
  2022-09-04 21:13       ` Frederic Weisbecker
  0 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-02 23:58 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/2/2022 7:35 AM, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
>> When the bypass cblist gets too big or its timeout has occurred, it is
>> flushed into the main cblist. However, the bypass timer is still running
>> and the behavior is that it would eventually expire and wake the GP
>> thread.
>>
>> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
>> soon as the flush happens. Otherwise, the lazy-timer will go off much
>> later and the now-non-lazy cblist CBs can get stranded for the duration
>> of the timer.
>>
>> This is a good thing to do anyway (regardless of this series), since it
>> makes the behavior consistent with behavior of other code paths where queueing
>> something into the ->cblist makes the GP kthread in a non-sleeping state
>> quickly.
>>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  kernel/rcu/tree_nocb.h | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index 0a5f0ef41484..31068dd31315 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
>>  			rdp->nocb_gp_adv_time = j;
>>  		}
>> -		rcu_nocb_unlock_irqrestore(rdp, flags);
>> +
>> +		// The flush succeeded and we moved CBs into the ->cblist.
>> +		// However, the bypass timer might still be running. Wakeup the
>> +		// GP thread by calling a helper with was_all_done set so that
>> +		// wake up happens (needed if main CB list was empty before).
>> +		__call_rcu_nocb_wake(rdp, true, flags)
>> +
> 
> Ok so there are two different changes here:
> 
> 1) wake up nocb_gp as we just flushed the bypass list. Indeed if the regular
>    callback list was empty before flushing, we rather want to immediately wake
>    up nocb_gp instead of waiting for the bypass timer to process them.
> 
> 2) wake up nocb_gp unconditionally (ie: even if the regular queue was not empty
>    before bypass flushing) so that nocb_gp_wait() is forced through another loop
>    starting with cancelling the bypass timer (I suggest you put such explanation
>    in the comment btw because that process may not be obvious for mortals).
> 
> The change 1) looks like a good idea to me.
> 
> The change 2) has unclear motivation. It forces nocb_gp_wait() through another
> costly loop even though the timer might have been cancelled into some near
> future, eventually avoiding that extra costly loop. Also it abuses the
> was_alldone stuff and we may get rcu_nocb_wake with incoherent meanings
> (WakeEmpty/WakeEmptyIsDeferred) when it's actually not empty.

Yes, good point; I think #2 can be optimized on top of this patch as follows:
=============
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index ee5924ba2f3b..24aabd723abd 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -514,12 +514,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
            ncbs >= qhimark) {
                rcu_nocb_lock(rdp);

+               *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+
                rcu_cblist_set_flush(&rdp->nocb_bypass,
                                lazy ? BIT(CB_DEBUG_BYPASS_LAZY_FLUSHED) : BIT(CB_DEBUG_BYPASS_FLUSHED),
                                (j - READ_ONCE(cb_debug_jiffies_first)));

                if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
-                       *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
                        if (*was_alldone)
                                trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
                                                    TPS("FirstQ"));
@@ -537,7 +538,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
                // However, the bypass timer might still be running. Wakeup the
                // GP thread by calling a helper with was_all_done set so that
                // wake up happens (needed if main CB list was empty before).
-               __call_rcu_nocb_wake(rdp, true, flags)
+               __call_rcu_nocb_wake(rdp, *was_alldone, flags)

                return true; // Callback already enqueued.
        }
=============

> So you may need to clarify the purpose. And I would suggest to make two patches
> here.
I guess with this change, #2 is no longer a concern? And splitting is then
not needed, as only #1 remains.

Thanks,

 - Joel






^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.
  2022-09-02 16:48   ` kernel test robot
@ 2022-09-03 12:39     ` Paul E. McKenney
  2022-09-03 14:07       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-03 12:39 UTC (permalink / raw)
  To: kernel test robot
  Cc: Joel Fernandes (Google),
	rcu, kbuild-all, linux-kernel, rushikesh.s.kadam, urezki,
	neeraj.iitr10, frederic, rostedt, vineeth, boqun.feng

On Sat, Sep 03, 2022 at 12:48:28AM +0800, kernel test robot wrote:
> Hi "Joel,
> 
> Thank you for the patch! Yet something to improve:
> 
> [auto build test ERROR on paulmck-rcu/dev]
> [also build test ERROR on pcmoore-selinux/next drm-intel/for-linux-next linus/master v6.0-rc3]
> [cannot apply to vbabka-slab/for-next rostedt-trace/for-next tip/timers/core next-20220901]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch#_base_tree_information]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
> base:   https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev
> config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220903/202209030052.20CJhjTX-lkp@intel.com/config)
> compiler: mips-linux-gcc (GCC) 12.1.0
> reproduce (this is a W=1 build):
>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # https://github.com/intel-lab-lkp/linux/commit/c0f09b1d42d06649680f74a78ca363e7f1c158b2
>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>         git fetch --no-tags linux-review Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
>         git checkout c0f09b1d42d06649680f74a78ca363e7f1c158b2
>         # save the config file
>         mkdir build_dir && cp config build_dir/.config
>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash
> 
> If you fix the issue, kindly add following tag where applicable
> Reported-by: kernel test robot <lkp@intel.com>
> 
> All errors (new ones prefixed by >>):
> 
>    In file included from <command-line>:
>    In function 'dst_hold',
>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>        inlined from '__skb_dst_copy' at include/net/dst.h:284:3,
>        inlined from 'ovs_vport_output' at net/openvswitch/actions.c:787:2:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~

This looks like fallout from the rcu_head structure changing size,
given that __refcnt comes after rcu_head on 32-bit systems.
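
To make the failure concrete, here is a rough sketch of the 32-bit layout
(everything other than rcu_head and __refcnt is elided; this is not the
real include/net/dst.h definition):

struct dst_entry {
	/* ... roughly one cache line of other fields ... */
	struct rcu_head	rcu_head;	/* grows with per-CB tracing */
#ifndef CONFIG_64BIT
	atomic_t	__refcnt;	/* the quoted BUILD_BUG_ON() requires
					 * offsetof(..., __refcnt) & 63 == 0 */
#endif
	/* ... */
};

Once rcu_head grows, __refcnt slides off its 64-byte boundary on 32-bit
builds and the assertion fires at compile time.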

It looks like the per-CB tracing code needs to be kept on the side for
the time being rather than being sent to -next (let alone mainline).

Any reason I cannot just move this one to the end of the stack, after
"fork: Move thread_stack_free_rcu() to call_rcu_lazy()"?

							Thanx, Paul

>    In function 'dst_hold',
>        inlined from 'execute_set_action' at net/openvswitch/actions.c:1093:3,
>        inlined from 'do_execute_actions' at net/openvswitch/actions.c:1377:10:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
> --
>    In file included from <command-line>:
>    In function 'dst_hold',
>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>        inlined from 'dn_insert_route.constprop.isra' at net/decnet/dn_route.c:334:4:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
>    In function 'dst_hold',
>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>        inlined from 'dn_insert_route.constprop.isra' at net/decnet/dn_route.c:347:2:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
>    In function 'dst_hold',
>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>        inlined from 'dn_route_input' at net/decnet/dn_route.c:1535:4:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
>    In function 'dst_hold',
>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>        inlined from '__dn_route_output_key.isra' at net/decnet/dn_route.c:1257:5:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
>    In function 'dst_hold',
>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>        inlined from 'dn_cache_dump' at net/decnet/dn_route.c:1752:4:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
> --
>    In file included from <command-line>:
>    In function 'dst_hold',
>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>        inlined from 'ip6_copy_metadata' at net/ipv6/ip6_output.c:654:2:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
>    In function 'dst_hold',
>        inlined from 'ip6_append_data' at net/ipv6/ip6_output.c:1838:3:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
>    In function 'dst_hold',
>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>        inlined from 'ip6_sk_dst_lookup_flow' at net/ipv6/ip6_output.c:1262:3:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
> --
>    In file included from <command-line>:
>    In function 'dst_hold',
>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>        inlined from '__skb_dst_copy' at include/net/dst.h:284:3,
>        inlined from 'skb_dst_copy' at include/net/dst.h:289:2,
>        inlined from 'ip6_list_rcv_finish.constprop' at net/ipv6/ip6_input.c:128:4:
> >> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |                                             ^
>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>      335 |                         prefix ## suffix();                             \
>          |                         ^~~~~~
>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>          |         ^~~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>          |                                     ^~~~~~~~~~~~~~~~~~
>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>          |         ^~~~~~~~~~~~~~~~
>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>          |         ^~~~~~~~~~~~
> 
> 
> vim +/__compiletime_assert_490 +354 include/linux/compiler_types.h
> 
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  340  
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  341  #define _compiletime_assert(condition, msg, prefix, suffix) \
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  342  	__compiletime_assert(condition, msg, prefix, suffix)
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  343  
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  344  /**
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  345   * compiletime_assert - break build and emit msg if condition is false
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  346   * @condition: a compile-time constant condition to check
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  347   * @msg:       a message to emit if condition is false
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  348   *
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  349   * In tradition of POSIX assert, this macro will break the build if the
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  350   * supplied condition is *false*, emitting the supplied error message if the
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  351   * compiler has support to do so.
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  352   */
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  353  #define compiletime_assert(condition, msg) \
> eb5c2d4b45e3d2 Will Deacon 2020-07-21 @354  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
> eb5c2d4b45e3d2 Will Deacon 2020-07-21  355  
> 
> -- 
> 0-DAY CI Kernel Test Service
> https://01.org/lkp

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 02/18] mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head
  2022-09-02 15:09       ` Joel Fernandes
@ 2022-09-03 13:53         ` Paul E. McKenney
  0 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-03 13:53 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Vlastimil Babka, rcu, linux-kernel, rushikesh.s.kadam, urezki,
	neeraj.iitr10, frederic, rostedt, vineeth, boqun.feng,
	Hyeonggon Yoo

On Fri, Sep 02, 2022 at 11:09:08AM -0400, Joel Fernandes wrote:
> 
> 
> On 9/2/2022 5:30 AM, Vlastimil Babka wrote:
> > On 9/2/22 11:26, Vlastimil Babka wrote:
> >> On 9/2/22 00:17, Joel Fernandes (Google) wrote:
> >>> From: Vlastimil Babka <vbabka@suse.cz>
> >>>
> >>> Joel reports [1] that increasing the rcu_head size for debugging
> >>> purposes used to work before struct slab was split from struct page, but
> >>> now runs into the various SLAB_MATCH() sanity checks of the layout.
> >>>
> >>> This is because the rcu_head in struct page is in union with large
> >>> sub-structures and has space to grow without exceeding their size, while
> >>> in struct slab (for SLAB and SLUB) it's in union only with a list_head.
> >>>
> >>> On closer inspection (and after the previous patch) we can put all
> >>> fields except slab_cache to a union with rcu_head, as slab_cache is
> >>> sufficient for the rcu freeing callbacks to work and the rest can be
> >>> overwritten by rcu_head without causing issues.
> >>>
> >>> This is only somewhat complicated by the need to keep SLUB's
> >>> freelist+counters aligned for cmpxchg_double. As a result the fields
> >>> need to be reordered so that slab_cache is first (after page flags) and
> >>> the union with rcu_head follows. For consistency, do that for SLAB as
> >>> well, although not necessary there.
> >>>
> >>> As a result, the rcu_head field in struct page and struct slab is no
> >>> longer at the same offset, but that doesn't matter as there is no
> >>> casting that would rely on that in the slab freeing callbacks, so we can
> >>> just drop the respective SLAB_MATCH() check.
> >>>
> >>> Also we need to update the SLAB_MATCH() for compound_head to reflect the
> >>> new ordering.
> >>>
> >>> While at it, also add a static_assert to check the alignment needed for
> >>> cmpxchg_double so mistakes are found sooner than a runtime GPF.
> >>>
> >>> [1] https://lore.kernel.org/all/85afd876-d8bb-0804-b2c5-48ed3055e702@joelfernandes.org/
> >>>
> >>> Reported-by: Joel Fernandes <joel@joelfernandes.org>
> >>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> >>
> >> I've added patches 01 and 02 to slab tree for -next exposure before Joel's
> >> full series posting, but it should also be ok if rcu tree carries them with
> >> the whole patchset. I can then drop them from slab tree (there are no
> >> dependencies with other stuff there) so we don't introduce duplicate commits
> >> needlessly, just give me a heads up.
> > 
> > Ah but in that case please apply the reviews from my posting [1]
> > 
> > patch 1:
> > Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > 
> > patch 2
> > Acked-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> > 
> > [1] https://lore.kernel.org/all/20220826090912.11292-1-vbabka@suse.cz/
> 
> 
> Sorry for injecting confusion - my main intent with including the mm patches in
> this series is to make it easier for other reviewers/testers to backport the
> series to their kernels in one shot. Some reviewers expressed interest in
> trying out the series.
> 
> I think it is best to let the -mm patches in the series go through the slab
> tree, as you also have the Acks/Reviews there and will make sure those
> dependencies are out of the way.
> 
> My lesson here is to be clearer; I could have added some notes for context
> below the "---" of those mm patches.
> 
> Thanks again for your help,

Hello, Vlastimil, and thank you for putting these together!

I believe that your two patches should go in via the slab tree.
I am queueing them in -rcu only temporarily and just for convenience
in testing.  I expect that I will rebase them so that I can let your
versions cover things in -next.
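
For readers following the thread, a simplified sketch of the arrangement
described in the quoted commit message (illustrative only, not the actual
patch; see the patch itself for the exact SLAB/SLUB variants):

struct slab {
	unsigned long __page_flags;
	struct kmem_cache *slab_cache;	/* outside the union: the RCU
					 * freeing callback still needs it */
	union {
		struct {
			struct list_head slab_list;
			/* double-word aligned for cmpxchg_double */
			void *freelist;
			unsigned long counters;
		};
		struct rcu_head rcu_head;	/* now free to grow for debugging */
	};
	/* ... */
};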

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-01 22:17 ` [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation Joel Fernandes (Google)
  2022-09-02 15:21   ` Frederic Weisbecker
@ 2022-09-03 14:03   ` Paul E. McKenney
  2022-09-03 14:05     ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-03 14:03 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, vineeth, boqun.feng

On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
> Implement timer-based RCU lazy callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure will flush it in a
> future patch.
> 
> To handle several corner cases automagically (such as rcu_barrier() and
> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> length has the lazy CB length included in it. A separate lazy CB length
> counter is also introduced to keep track of the number of lazy CBs.
> 
> Suggested-by: Paul McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

I got this from TREE01 and TREE04:

kernel/rcu/tree_nocb.h:439:46: error: ‘struct rcu_data’ has no member named ‘lazy_len’

So I moved the lazy_len field out from under CONFIG_RCU_LAZY.
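
That is, roughly (sketch only; exact placement as in the -rcu tree):

struct rcu_data {
	/* ... */
	long lazy_len;	/* Length of buffered lazy callbacks.  Unconditional,
			 * so that rcu_nocb_try_bypass() and friends still
			 * build when CONFIG_RCU_LAZY=n. */
	int cpu;
};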

							Thanx, Paul

> ---
>  include/linux/rcupdate.h   |   6 ++
>  kernel/rcu/Kconfig         |   8 ++
>  kernel/rcu/rcu.h           |  11 +++
>  kernel/rcu/rcu_segcblist.c |   2 +-
>  kernel/rcu/tree.c          | 130 +++++++++++++++---------
>  kernel/rcu/tree.h          |  13 ++-
>  kernel/rcu/tree_nocb.h     | 198 ++++++++++++++++++++++++++++++-------
>  7 files changed, 280 insertions(+), 88 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 08605ce7379d..82e8a07e0856 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index d471d22a5e21..3128d01427cb 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default n
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +	  pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index be5979da07f5..94675f14efe8 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -474,6 +474,14 @@ enum rcutorture_type {
>  	INVALID_RCU_FLAVOR
>  };
>  
> +#if defined(CONFIG_RCU_LAZY)
> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
> +#else
> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
> +#endif
> +
>  #if defined(CONFIG_TREE_RCU)
>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>  			    unsigned long *gp_seq);
> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>  			       unsigned long c_old,
>  			       unsigned long c);
>  void rcu_gp_set_torture_wait(int duration);
> +void rcu_force_call_rcu_to_lazy(bool force);
> +
>  #else
>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>  					  int *flags, unsigned long *gp_seq)
> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>  	do { } while (0)
>  #endif
>  static inline void rcu_gp_set_torture_wait(int duration) { }
> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>  #endif
>  
>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
> index c54ea2b6a36b..55b50e592986 100644
> --- a/kernel/rcu/rcu_segcblist.c
> +++ b/kernel/rcu/rcu_segcblist.c
> @@ -38,7 +38,7 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
>   * element of the second rcu_cblist structure, but ensuring that the second
>   * rcu_cblist structure, if initially non-empty, always appears non-empty
>   * throughout the process.  If rdp is NULL, the second rcu_cblist structure
> - * is instead initialized to empty.
> + * is instead initialized to empty. Also account for lazy_len for lazy CBs.
>   */
>  void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
>  			      struct rcu_cblist *srclp,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 9fe581be8696..aaced29a0a71 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>  	raw_spin_unlock_rcu_node(rnp);
>  }
>  
> -/**
> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
> - * @head: structure to be used for queueing the RCU updates.
> - * @func: actual callback function to be invoked after the grace period
> - *
> - * The callback function will be invoked some time after a full grace
> - * period elapses, in other words after all pre-existing RCU read-side
> - * critical sections have completed.  However, the callback function
> - * might well execute concurrently with RCU read-side critical sections
> - * that started after call_rcu() was invoked.
> - *
> - * RCU read-side critical sections are delimited by rcu_read_lock()
> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
> - * v5.0 and later, regions of code across which interrupts, preemption,
> - * or softirqs have been disabled also serve as RCU read-side critical
> - * sections.  This includes hardware interrupt handlers, softirq handlers,
> - * and NMI handlers.
> - *
> - * Note that all CPUs must agree that the grace period extended beyond
> - * all pre-existing RCU read-side critical section.  On systems with more
> - * than one CPU, this means that when "func()" is invoked, each CPU is
> - * guaranteed to have executed a full memory barrier since the end of its
> - * last RCU read-side critical section whose beginning preceded the call
> - * to call_rcu().  It also means that each CPU executing an RCU read-side
> - * critical section that continues beyond the start of "func()" must have
> - * executed a memory barrier after the call_rcu() but before the beginning
> - * of that RCU read-side critical section.  Note that these guarantees
> - * include CPUs that are offline, idle, or executing in user mode, as
> - * well as CPUs that are executing in the kernel.
> - *
> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> - * resulting RCU callback function "func()", then both CPU A and CPU B are
> - * guaranteed to execute a full memory barrier during the time interval
> - * between the call to call_rcu() and the invocation of "func()" -- even
> - * if CPU A and CPU B are the same CPU (but again only if the system has
> - * more than one CPU).
> - *
> - * Implementation of these memory-ordering guarantees is described here:
> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> - */
> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +static void
> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>  {
>  	static atomic_t doublefrees;
>  	unsigned long flags;
> @@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		trace_rcu_callback(rcu_state.name, head,
>  				   rcu_segcblist_n_cbs(&rdp->cblist));
>  
> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>  	rcu_segcblist_enqueue(&rdp->cblist, head);
> @@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		local_irq_restore(flags);
>  	}
>  }
> -EXPORT_SYMBOL_GPL(call_rcu);
>  
> +#ifdef CONFIG_RCU_LAZY
> +/**
> + * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.
> + *
> + * Use this API instead of call_rcu() if you don't mind the callback being
> + * invoked after very long periods of time on systems without memory pressure
> + * and on systems which are lightly loaded or mostly idle.
> + *
> + * Other than the extra delay in callbacks being invoked, this function is
> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
> + * details about memory ordering and other functionality.
> + */
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, true);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_lazy);
> +#endif
> +
> +static bool force_call_rcu_to_lazy;
> +
> +void rcu_force_call_rcu_to_lazy(bool force)
> +{
> +	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
> +		WRITE_ONCE(force_call_rcu_to_lazy, force);
> +}
> +EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
> +
> +/**
> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.  However, the callback function
> + * might well execute concurrently with RCU read-side critical sections
> + * that started after call_rcu() was invoked.
> + *
> + * RCU read-side critical sections are delimited by rcu_read_lock()
> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
> + * v5.0 and later, regions of code across which interrupts, preemption,
> + * or softirqs have been disabled also serve as RCU read-side critical
> + * sections.  This includes hardware interrupt handlers, softirq handlers,
> + * and NMI handlers.
> + *
> + * Note that all CPUs must agree that the grace period extended beyond
> + * all pre-existing RCU read-side critical section.  On systems with more
> + * than one CPU, this means that when "func()" is invoked, each CPU is
> + * guaranteed to have executed a full memory barrier since the end of its
> + * last RCU read-side critical section whose beginning preceded the call
> + * to call_rcu().  It also means that each CPU executing an RCU read-side
> + * critical section that continues beyond the start of "func()" must have
> + * executed a memory barrier after the call_rcu() but before the beginning
> + * of that RCU read-side critical section.  Note that these guarantees
> + * include CPUs that are offline, idle, or executing in user mode, as
> + * well as CPUs that are executing in the kernel.
> + *
> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> + * resulting RCU callback function "func()", then both CPU A and CPU B are
> + * guaranteed to execute a full memory barrier during the time interval
> + * between the call to call_rcu() and the invocation of "func()" -- even
> + * if CPU A and CPU B are the same CPU (but again only if the system has
> + * more than one CPU).
> + *
> + * Implementation of these memory-ordering guarantees is described here:
> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> + */
> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
>  
>  /* Maximum number of jiffies to wait before draining a batch. */
>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
> @@ -3904,7 +3943,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	rdp->barrier_head.func = rcu_barrier_callback;
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
> +		     /* wake gp thread */ true));
>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>  		atomic_inc(&rcu_state.barrier_cpu_count);
>  	} else {
> @@ -4325,7 +4365,7 @@ void rcutree_migrate_callbacks(int cpu)
>  	my_rdp = this_cpu_ptr(&rcu_data);
>  	my_rnp = my_rdp->mynode;
>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false, false));
>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>  	/* Leverage recent GPs and set GP for new callbacks. */
>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index d4a97e40ea9c..946d819b23fc 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -263,14 +263,18 @@ struct rcu_data {
>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>  
> +#ifdef CONFIG_RCU_LAZY
> +	long lazy_len;			/* Length of buffered lazy callbacks. */
> +#endif
>  	int cpu;
>  };
>  
>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>  #define RCU_NOCB_WAKE_NOT	0
>  #define RCU_NOCB_WAKE_BYPASS	1
> -#define RCU_NOCB_WAKE		2
> -#define RCU_NOCB_WAKE_FORCE	3
> +#define RCU_NOCB_WAKE_LAZY	2
> +#define RCU_NOCB_WAKE		3
> +#define RCU_NOCB_WAKE_FORCE	4
>  
>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>  					/* For jiffies_till_first_fqs and */
> @@ -440,9 +444,10 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>  static void rcu_init_one_nocb(struct rcu_node *rnp);
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j);
> +				  unsigned long j, bool lazy, bool wakegp);
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags);
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy);
>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>  				 unsigned long flags);
>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 31068dd31315..7e97a7b6e046 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>  }
>  
> +/*
> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> + * could be flushed much earlier for a number of other reasons
> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> + * left unsubmitted to RCU after those many jiffies.
> + */
> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> +
> +#ifdef CONFIG_RCU_LAZY
> +// To be called only from test code.
> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
> +{
> +	jiffies_till_flush = jif;
> +}
> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
> +
> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
> +{
> +	return jiffies_till_flush;
> +}
> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
> +#endif
> +
>  /*
>   * Arrange to wake the GP kthread for this NOCB group at some future
>   * time when it is safe to do so.
> @@ -265,23 +290,39 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>  {
>  	unsigned long flags;
>  	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
> +	unsigned long mod_jif = 0;
>  
>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>  
>  	/*
> -	 * Bypass wakeup overrides previous deferments. In case
> -	 * of callback storm, no need to wake up too early.
> +	 * Bypass and lazy wakeup overrides previous deferments. In case of
> +	 * callback storm, no need to wake up too early.
>  	 */
> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
> -		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
> -		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> -	} else {
> +	switch (waketype) {
> +	case RCU_NOCB_WAKE_LAZY:
> +		if (rdp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
> +			mod_jif = jiffies_till_flush;
> +		break;
> +
> +	case RCU_NOCB_WAKE_BYPASS:
> +		mod_jif = 2;
> +		break;
> +
> +	case RCU_NOCB_WAKE:
> +	case RCU_NOCB_WAKE_FORCE:
> +		// For these, make it wake up the soonest if we
> +		// were in a bypass or lazy sleep before.
>  		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
> -			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
> -		if (rdp_gp->nocb_defer_wakeup < waketype)
> -			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> +			mod_jif = 1;
> +		break;
>  	}
>  
> +	if (mod_jif)
> +		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
> +
> +	if (rdp_gp->nocb_defer_wakeup < waketype)
> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> +
>  	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
>  
>  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
> @@ -293,10 +334,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>   * proves to be initially empty, just return false because the no-CB GP
>   * kthread may need to be awakened in this case.
>   *
> + * Return true if there was something to be flushed and it succeeded, otherwise
> + * false.
> + *
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				     unsigned long j)
> +				     unsigned long j, bool lazy)
>  {
>  	struct rcu_cblist rcl;
>  
> @@ -310,7 +354,18 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
>  	if (rhp)
>  		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
> -	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +
> +	/*
> +	 * If the new CB requested was a lazy one, queue it onto the main
> +	 * ->cblist so we can take advantage of a sooner grace period.
> +	 */
> +	if (lazy && rhp) {
> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
> +		rcu_cblist_enqueue(&rcl, rhp);
> +	} else {
> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
> +	}
> +
>  	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
>  	WRITE_ONCE(rdp->nocb_bypass_first, j);
>  	rcu_nocb_bypass_unlock(rdp);
> @@ -326,13 +381,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy, bool wake_gp)
>  {
> +	bool ret;
> +
>  	if (!rcu_rdp_is_offloaded(rdp))
>  		return true;
>  	rcu_lockdep_assert_cblist_protected(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
> +
> +	if (wake_gp)
> +		wake_nocb_gp(rdp, true);
> +
> +	return ret;
>  }
>  
>  /*
> @@ -345,7 +407,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>  	if (!rcu_rdp_is_offloaded(rdp) ||
>  	    !rcu_nocb_bypass_trylock(rdp))
>  		return;
> -	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
> +	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
>  }
>  
>  /*
> @@ -367,12 +429,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>   * there is only one CPU in operation.
>   */
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy)
>  {
>  	unsigned long c;
>  	unsigned long cur_gp_seq;
>  	unsigned long j = jiffies;
>  	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
>  
>  	lockdep_assert_irqs_disabled();
>  
> @@ -417,23 +481,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// If there hasn't yet been all that many ->cblist enqueues
>  	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
>  	// ->nocb_bypass first.
> -	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
> +	// Lazy CBs throttle this back and do immediate bypass queuing.
> +	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
>  		rcu_nocb_lock(rdp);
>  		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  		if (*was_alldone)
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstQ"));
> -		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
> +
> +		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false, false));
> +
>  		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
>  		return false; // Caller must enqueue the callback.
>  	}
>  
>  	// If ->nocb_bypass has been used too long or is too full,
>  	// flush ->nocb_bypass to ->cblist.
> -	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
> +	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
> +	    (ncbs &&  bypass_is_lazy &&
> +		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
>  	    ncbs >= qhimark) {
>  		rcu_nocb_lock(rdp);
> -		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
> +
> +		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
>  			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  			if (*was_alldone)
>  				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> @@ -460,16 +530,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// We need to use the bypass.
>  	rcu_nocb_wait_contended(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> +
>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
> +
> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
> +
>  	if (!ncbs) {
>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>  	}
> +
>  	rcu_nocb_bypass_unlock(rdp);
>  	smp_mb(); /* Order enqueue before wake. */
> -	if (ncbs) {
> +
> +	// We had CBs in the bypass list before. There is nothing else to do if:
> +	// There were only non-lazy CBs before, in this case, the bypass timer
> +	// or GP-thread will handle the CBs including any new lazy ones.
> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
> +	// case the old lazy timer would have been setup. When that expires,
> +	// the new lazy one will be handled.
> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>  		local_irq_restore(flags);
>  	} else {
>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
> @@ -478,6 +561,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstBQwake"));
>  			__call_rcu_nocb_wake(rdp, true, flags);
> +		} else if (bypass_is_lazy && !lazy) {
> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> +					    TPS("FirstBQwakeLazy2Non"));
> +			__call_rcu_nocb_wake(rdp, true, flags);
>  		} else {
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstBQnoWake"));
> @@ -499,7 +586,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>  {
>  	unsigned long cur_gp_seq;
>  	unsigned long j;
> -	long len;
> +	long len, lazy_len, bypass_len;
>  	struct task_struct *t;
>  
>  	// If we are being polled or there is no kthread, just leave.
> @@ -512,9 +599,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>  	}
>  	// Need to actually to a wakeup.
>  	len = rcu_segcblist_n_cbs(&rdp->cblist);
> +	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +	lazy_len = READ_ONCE(rdp->lazy_len);
>  	if (was_alldone) {
>  		rdp->qlen_last_fqs_check = len;
> -		if (!irqs_disabled_flags(flags)) {
> +		// Only lazy CBs in bypass list
> +		if (lazy_len && bypass_len == lazy_len) {
> +			rcu_nocb_unlock_irqrestore(rdp, flags);
> +			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
> +					   TPS("WakeLazy"));
> +		} else if (!irqs_disabled_flags(flags)) {
>  			/* ... if queue was empty ... */
>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>  			wake_nocb_gp(rdp, false);
> @@ -604,8 +698,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu)
>   */
>  static void nocb_gp_wait(struct rcu_data *my_rdp)
>  {
> -	bool bypass = false;
> -	long bypass_ncbs;
> +	bool bypass = false, lazy = false;
> +	long bypass_ncbs, lazy_ncbs;
>  	int __maybe_unused cpu = my_rdp->cpu;
>  	unsigned long cur_gp_seq;
>  	unsigned long flags;
> @@ -640,24 +734,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	 * won't be ignored for long.
>  	 */
>  	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
> +		bool flush_bypass = false;
> +
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
>  		rcu_nocb_lock_irqsave(rdp, flags);
>  		lockdep_assert_held(&rdp->nocb_lock);
>  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> -		if (bypass_ncbs &&
> +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +
> +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
> +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
> +		     bypass_ncbs > 2 * qhimark)) {
> +			flush_bypass = true;
> +		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
>  		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
>  		     bypass_ncbs > 2 * qhimark)) {
> -			// Bypass full or old, so flush it.
> -			(void)rcu_nocb_try_flush_bypass(rdp, j);
> -			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +			flush_bypass = true;
>  		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>  			continue; /* No callbacks here, try next. */
>  		}
> +
> +		if (flush_bypass) {
> +			// Bypass full or old, so flush it.
> +			(void)rcu_nocb_try_flush_bypass(rdp, j);
> +			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
> +			lazy_ncbs = READ_ONCE(rdp->lazy_len);
> +		}
> +
>  		if (bypass_ncbs) {
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> -					    TPS("Bypass"));
> -			bypass = true;
> +				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
> +			if (bypass_ncbs == lazy_ncbs)
> +				lazy = true;
> +			else
> +				bypass = true;
>  		}
>  		rnp = rdp->mynode;
>  
> @@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  	my_rdp->nocb_gp_gp = needwait_gp;
>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
>  
> -	if (bypass && !rcu_nocb_poll) {
> -		// At least one child with non-empty ->nocb_bypass, so set
> -		// timer in order to avoid stranding its callbacks.
> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> -				   TPS("WakeBypassIsDeferred"));
> +	// At least one child with non-empty ->nocb_bypass, so set
> +	// timer in order to avoid stranding its callbacks.
> +	if (!rcu_nocb_poll) {
> +		// If bypass list only has lazy CBs. Add a deferred
> +		// lazy wake up.
> +		if (lazy && !bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> +					TPS("WakeLazyIsDeferred"));
> +		// Otherwise add a deferred bypass wake up.
> +		} else if (bypass) {
> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> +					TPS("WakeBypassIsDeferred"));
> +		}
>  	}
> +
>  	if (rcu_nocb_poll) {
>  		/* Polling, so trace if first poll in the series. */
>  		if (gotcbs)
> @@ -1036,7 +1156,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>  	 * return false, which means that future calls to rcu_nocb_try_bypass()
>  	 * will refuse to put anything into the bypass.
>  	 */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
>  	/*
>  	 * Start with invoking rcu_core() early. This way if the current thread
>  	 * happens to preempt an ongoing call to rcu_core() in the middle,
> @@ -1290,6 +1410,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>  	raw_spin_lock_init(&rdp->nocb_gp_lock);
>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
>  	rcu_cblist_init(&rdp->nocb_bypass);
> +	WRITE_ONCE(rdp->lazy_len, 0);
>  	mutex_init(&rdp->nocb_gp_kthread_mutex);
>  }
>  
> @@ -1571,13 +1692,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>  }
>  
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, bool lazy, bool wakegp)
>  {
>  	return true;
>  }
>  
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags)
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy)
>  {
>  	return false;
>  }
> -- 
> 2.37.2.789.g6183377224-goog
> 
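
For anyone skimming the thread, a minimal usage sketch of the API quoted
above; struct foo, its fields, and the callers are hypothetical:

struct foo {
	int data;
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rcu);

	kfree(fp);
}

static void foo_release(struct foo *fp)
{
	/* Batched: may be invoked much later than with plain call_rcu(). */
	call_rcu_lazy(&fp->rcu, foo_free_rcu);
}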

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-01 22:17 ` [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens Joel Fernandes (Google)
  2022-09-02 11:35   ` Frederic Weisbecker
@ 2022-09-03 14:04   ` Paul E. McKenney
  2022-09-03 14:05     ` Joel Fernandes
  2022-09-06  3:07   ` Joel Fernandes
  2 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-03 14:04 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, vineeth, boqun.feng

On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
> When the bypass cblist gets too big or its timeout has occurred, it is
> flushed into the main cblist. However, the bypass timer is still running
> and the behavior is that it would eventually expire and wake the GP
> thread.
> 
> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
> as soon as the flush happens. Otherwise, the lazy timer will go off much
> later and the now-non-lazy cblist CBs can get stranded for the duration
> of the timer.
> 
> This is a good thing to do anyway (regardless of this series), since it
> makes the behavior consistent with that of other code paths, where queueing
> something onto the ->cblist quickly puts the GP kthread into a non-sleeping
> state.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree_nocb.h | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 0a5f0ef41484..31068dd31315 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
>  			rdp->nocb_gp_adv_time = j;
>  		}
> -		rcu_nocb_unlock_irqrestore(rdp, flags);
> +
> +		// The flush succeeded and we moved CBs into the ->cblist.
> +		// However, the bypass timer might still be running. Wakeup the
> +		// GP thread by calling a helper with was_all_done set so that
> +		// wake up happens (needed if main CB list was empty before).
> +		__call_rcu_nocb_wake(rdp, true, flags)

TREE01 and TREE04 gripe about the missing ";".  I added it.

							Thanx, Paul

> +
>  		return true; // Callback already enqueued.
>  	}
>  
> -- 
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-03 14:04   ` Paul E. McKenney
@ 2022-09-03 14:05     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-03 14:05 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, vineeth, boqun.feng



On 9/3/2022 10:04 AM, Paul E. McKenney wrote:
> On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
>> When the bypass cblist gets too big or its timeout has occurred, it is
>> flushed into the main cblist. However, the bypass timer is still running
>> and the behavior is that it would eventually expire and wake the GP
>> thread.
>>
>> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
>> as soon as the flush happens. Otherwise, the lazy timer will go off much
>> later and the now-non-lazy cblist CBs can get stranded for the duration
>> of the timer.
>>
>> This is a good thing to do anyway (regardless of this series), since it
>> makes the behavior consistent with that of other code paths, where queueing
>> something onto the ->cblist quickly puts the GP kthread into a non-sleeping
>> state.
>>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  kernel/rcu/tree_nocb.h | 8 +++++++-
>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index 0a5f0ef41484..31068dd31315 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
>>  			rdp->nocb_gp_adv_time = j;
>>  		}
>> -		rcu_nocb_unlock_irqrestore(rdp, flags);
>> +
>> +		// The flush succeeded and we moved CBs into the ->cblist.
>> +		// However, the bypass timer might still be running. Wakeup the
>> +		// GP thread by calling a helper with was_all_done set so that
>> +		// wake up happens (needed if main CB list was empty before).
>> +		__call_rcu_nocb_wake(rdp, true, flags)
> 
> TREE01 and TREE04 gripe about the missing ";".  I added it.

Sorry about that, I must have lost the semicolon during re-basing.

 - Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-03 14:03   ` Paul E. McKenney
@ 2022-09-03 14:05     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-03 14:05 UTC (permalink / raw)
  To: paulmck
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, vineeth, boqun.feng



On 9/3/2022 10:03 AM, Paul E. McKenney wrote:
> On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
>> Implement timer-based RCU lazy callback batching. The batch is flushed
>> whenever a certain amount of time has passed, or the batch on a
>> particular CPU grows too big. A future patch will also flush the batch
>> under memory pressure.
>>
>> To handle several corner cases automagically (such as rcu_barrier() and
>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
>> length has the lazy CB length included in it. A separate lazy CB length
>> counter is also introduced to keep track of the number of lazy CBs.
>>
>> Suggested-by: Paul McKenney <paulmck@kernel.org>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> I got this from TREE01 and TREE04:
> 
> kernel/rcu/tree_nocb.h:439:46: error: ‘struct rcu_data’ has no member named ‘lazy_len’
> 
> So I moved the lazy_len field out from under CONFIG_RCU_LAZY.

Ok, thank you!

- Joel
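
A self-contained toy illustrating the fix described above (keeping lazy_len outside the CONFIG_RCU_LAZY #ifdef so that code which reads it unconditionally still builds). The structure and helper below are simplified stand-ins, not the kernel's actual rcu_data layout:

/* With the field guarded by #ifdef CONFIG_RCU_LAZY, any unconditional
 * reader reproduces "'struct rcu_data' has no member named 'lazy_len'"
 * when the option is off; declaring it unconditionally avoids that at
 * the cost of one long per CPU.
 */
#include <stdio.h>

struct rcu_data_toy {
	long lazy_len;	/* always present; stays 0 when laziness is disabled */
	int cpu;
};

static long read_lazy_len(struct rcu_data_toy *rdp)
{
	return rdp->lazy_len;	/* builds with or without CONFIG_RCU_LAZY */
}

int main(void)
{
	struct rcu_data_toy rdp = { .lazy_len = 0, .cpu = 0 };

	printf("lazy_len=%ld\n", read_lazy_len(&rdp));
	return 0;
}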

> 
> 							Thanx, Paul
> 
>> ---
>>  include/linux/rcupdate.h   |   6 ++
>>  kernel/rcu/Kconfig         |   8 ++
>>  kernel/rcu/rcu.h           |  11 +++
>>  kernel/rcu/rcu_segcblist.c |   2 +-
>>  kernel/rcu/tree.c          | 130 +++++++++++++++---------
>>  kernel/rcu/tree.h          |  13 ++-
>>  kernel/rcu/tree_nocb.h     | 198 ++++++++++++++++++++++++++++++-------
>>  7 files changed, 280 insertions(+), 88 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 08605ce7379d..82e8a07e0856 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>>  
>>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>  
>> +#ifdef CONFIG_RCU_LAZY
>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
>> +#else
>> +#define call_rcu_lazy(head, func) call_rcu(head, func)
>> +#endif
>> +
>>  /* Internal to kernel */
>>  void rcu_init(void);
>>  extern int rcu_scheduler_active;
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index d471d22a5e21..3128d01427cb 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>>  	  Say N here if you hate read-side memory barriers.
>>  	  Take the default if you are unsure.
>>  
>> +config RCU_LAZY
>> +	bool "RCU callback lazy invocation functionality"
>> +	depends on RCU_NOCB_CPU
>> +	default n
>> +	help
>> +	  To save power, batch RCU callbacks and flush after delay, memory
>> +	  pressure or callback list growing too big.
>> +
>>  endmenu # "RCU Subsystem"
>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>> index be5979da07f5..94675f14efe8 100644
>> --- a/kernel/rcu/rcu.h
>> +++ b/kernel/rcu/rcu.h
>> @@ -474,6 +474,14 @@ enum rcutorture_type {
>>  	INVALID_RCU_FLAVOR
>>  };
>>  
>> +#if defined(CONFIG_RCU_LAZY)
>> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
>> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
>> +#else
>> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
>> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
>> +#endif
>> +
>>  #if defined(CONFIG_TREE_RCU)
>>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>>  			    unsigned long *gp_seq);
>> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>  			       unsigned long c_old,
>>  			       unsigned long c);
>>  void rcu_gp_set_torture_wait(int duration);
>> +void rcu_force_call_rcu_to_lazy(bool force);
>> +
>>  #else
>>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>>  					  int *flags, unsigned long *gp_seq)
>> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>  	do { } while (0)
>>  #endif
>>  static inline void rcu_gp_set_torture_wait(int duration) { }
>> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>>  #endif
>>  
>>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
>> diff --git a/kernel/rcu/rcu_segcblist.c b/kernel/rcu/rcu_segcblist.c
>> index c54ea2b6a36b..55b50e592986 100644
>> --- a/kernel/rcu/rcu_segcblist.c
>> +++ b/kernel/rcu/rcu_segcblist.c
>> @@ -38,7 +38,7 @@ void rcu_cblist_enqueue(struct rcu_cblist *rclp, struct rcu_head *rhp)
>>   * element of the second rcu_cblist structure, but ensuring that the second
>>   * rcu_cblist structure, if initially non-empty, always appears non-empty
>>   * throughout the process.  If rdp is NULL, the second rcu_cblist structure
>> - * is instead initialized to empty.
>> + * is instead initialized to empty. Also account for lazy_len for lazy CBs.
>>   */
>>  void rcu_cblist_flush_enqueue(struct rcu_cblist *drclp,
>>  			      struct rcu_cblist *srclp,
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 9fe581be8696..aaced29a0a71 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>>  	raw_spin_unlock_rcu_node(rnp);
>>  }
>>  
>> -/**
>> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
>> - * @head: structure to be used for queueing the RCU updates.
>> - * @func: actual callback function to be invoked after the grace period
>> - *
>> - * The callback function will be invoked some time after a full grace
>> - * period elapses, in other words after all pre-existing RCU read-side
>> - * critical sections have completed.  However, the callback function
>> - * might well execute concurrently with RCU read-side critical sections
>> - * that started after call_rcu() was invoked.
>> - *
>> - * RCU read-side critical sections are delimited by rcu_read_lock()
>> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
>> - * v5.0 and later, regions of code across which interrupts, preemption,
>> - * or softirqs have been disabled also serve as RCU read-side critical
>> - * sections.  This includes hardware interrupt handlers, softirq handlers,
>> - * and NMI handlers.
>> - *
>> - * Note that all CPUs must agree that the grace period extended beyond
>> - * all pre-existing RCU read-side critical section.  On systems with more
>> - * than one CPU, this means that when "func()" is invoked, each CPU is
>> - * guaranteed to have executed a full memory barrier since the end of its
>> - * last RCU read-side critical section whose beginning preceded the call
>> - * to call_rcu().  It also means that each CPU executing an RCU read-side
>> - * critical section that continues beyond the start of "func()" must have
>> - * executed a memory barrier after the call_rcu() but before the beginning
>> - * of that RCU read-side critical section.  Note that these guarantees
>> - * include CPUs that are offline, idle, or executing in user mode, as
>> - * well as CPUs that are executing in the kernel.
>> - *
>> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
>> - * resulting RCU callback function "func()", then both CPU A and CPU B are
>> - * guaranteed to execute a full memory barrier during the time interval
>> - * between the call to call_rcu() and the invocation of "func()" -- even
>> - * if CPU A and CPU B are the same CPU (but again only if the system has
>> - * more than one CPU).
>> - *
>> - * Implementation of these memory-ordering guarantees is described here:
>> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
>> - */
>> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
>> +static void
>> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>>  {
>>  	static atomic_t doublefrees;
>>  	unsigned long flags;
>> @@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>  		trace_rcu_callback(rcu_state.name, head,
>>  				   rcu_segcblist_n_cbs(&rdp->cblist));
>>  
>> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>>  	rcu_segcblist_enqueue(&rdp->cblist, head);
>> @@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>  		local_irq_restore(flags);
>>  	}
>>  }
>> -EXPORT_SYMBOL_GPL(call_rcu);
>>  
>> +#ifdef CONFIG_RCU_LAZY
>> +/**
>> + * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
>> + * @head: structure to be used for queueing the RCU updates.
>> + * @func: actual callback function to be invoked after the grace period
>> + *
>> + * The callback function will be invoked some time after a full grace
>> + * period elapses, in other words after all pre-existing RCU read-side
>> + * critical sections have completed.
>> + *
>> + * Use this API instead of call_rcu() if you don't mind the callback being
>> + * invoked after very long periods of time on systems without memory pressure
>> + * and on systems which are lightly loaded or mostly idle.
>> + *
>> + * Other than the extra delay in callbacks being invoked, this function is
>> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
>> + * details about memory ordering and other functionality.
>> + */
>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
>> +{
>> +	return __call_rcu_common(head, func, true);
>> +}
>> +EXPORT_SYMBOL_GPL(call_rcu_lazy);
>> +#endif
>> +
>> +static bool force_call_rcu_to_lazy;
>> +
>> +void rcu_force_call_rcu_to_lazy(bool force)
>> +{
>> +	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
>> +		WRITE_ONCE(force_call_rcu_to_lazy, force);
>> +}
>> +EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
>> +
>> +/**
>> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
>> + * @head: structure to be used for queueing the RCU updates.
>> + * @func: actual callback function to be invoked after the grace period
>> + *
>> + * The callback function will be invoked some time after a full grace
>> + * period elapses, in other words after all pre-existing RCU read-side
>> + * critical sections have completed.  However, the callback function
>> + * might well execute concurrently with RCU read-side critical sections
>> + * that started after call_rcu() was invoked.
>> + *
>> + * RCU read-side critical sections are delimited by rcu_read_lock()
>> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
>> + * v5.0 and later, regions of code across which interrupts, preemption,
>> + * or softirqs have been disabled also serve as RCU read-side critical
>> + * sections.  This includes hardware interrupt handlers, softirq handlers,
>> + * and NMI handlers.
>> + *
>> + * Note that all CPUs must agree that the grace period extended beyond
>> + * all pre-existing RCU read-side critical section.  On systems with more
>> + * than one CPU, this means that when "func()" is invoked, each CPU is
>> + * guaranteed to have executed a full memory barrier since the end of its
>> + * last RCU read-side critical section whose beginning preceded the call
>> + * to call_rcu().  It also means that each CPU executing an RCU read-side
>> + * critical section that continues beyond the start of "func()" must have
>> + * executed a memory barrier after the call_rcu() but before the beginning
>> + * of that RCU read-side critical section.  Note that these guarantees
>> + * include CPUs that are offline, idle, or executing in user mode, as
>> + * well as CPUs that are executing in the kernel.
>> + *
>> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
>> + * resulting RCU callback function "func()", then both CPU A and CPU B are
>> + * guaranteed to execute a full memory barrier during the time interval
>> + * between the call to call_rcu() and the invocation of "func()" -- even
>> + * if CPU A and CPU B are the same CPU (but again only if the system has
>> + * more than one CPU).
>> + *
>> + * Implementation of these memory-ordering guarantees is described here:
>> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
>> + */
>> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
>> +{
>> +	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
>> +}
>> +EXPORT_SYMBOL_GPL(call_rcu);
>>  
>>  /* Maximum number of jiffies to wait before draining a batch. */
>>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
>> @@ -3904,7 +3943,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>>  	rdp->barrier_head.func = rcu_barrier_callback;
>>  	debug_rcu_head_queue(&rdp->barrier_head);
>>  	rcu_nocb_lock(rdp);
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
>> +		     /* wake gp thread */ true));
>>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>>  		atomic_inc(&rcu_state.barrier_cpu_count);
>>  	} else {
>> @@ -4325,7 +4365,7 @@ void rcutree_migrate_callbacks(int cpu)
>>  	my_rdp = this_cpu_ptr(&rcu_data);
>>  	my_rnp = my_rdp->mynode;
>>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, false, false));
>>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>>  	/* Leverage recent GPs and set GP for new callbacks. */
>>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>> index d4a97e40ea9c..946d819b23fc 100644
>> --- a/kernel/rcu/tree.h
>> +++ b/kernel/rcu/tree.h
>> @@ -263,14 +263,18 @@ struct rcu_data {
>>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>>  
>> +#ifdef CONFIG_RCU_LAZY
>> +	long lazy_len;			/* Length of buffered lazy callbacks. */
>> +#endif
>>  	int cpu;
>>  };
>>  
>>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>>  #define RCU_NOCB_WAKE_NOT	0
>>  #define RCU_NOCB_WAKE_BYPASS	1
>> -#define RCU_NOCB_WAKE		2
>> -#define RCU_NOCB_WAKE_FORCE	3
>> +#define RCU_NOCB_WAKE_LAZY	2
>> +#define RCU_NOCB_WAKE		3
>> +#define RCU_NOCB_WAKE_FORCE	4
>>  
>>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>>  					/* For jiffies_till_first_fqs and */
>> @@ -440,9 +444,10 @@ static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>>  static void rcu_init_one_nocb(struct rcu_node *rnp);
>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				  unsigned long j);
>> +				  unsigned long j, bool lazy, bool wakegp);
>>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				bool *was_alldone, unsigned long flags);
>> +				bool *was_alldone, unsigned long flags,
>> +				bool lazy);
>>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>>  				 unsigned long flags);
>>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index 31068dd31315..7e97a7b6e046 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>>  }
>>  
>> +/*
>> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
>> + * can elapse before lazy callbacks are flushed. Lazy callbacks
>> + * could be flushed much earlier for a number of other reasons
>> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
>> + * left unsubmitted to RCU after those many jiffies.
>> + */
>> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
>> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
>> +
>> +#ifdef CONFIG_RCU_LAZY
>> +// To be called only from test code.
>> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
>> +{
>> +	jiffies_till_flush = jif;
>> +}
>> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
>> +
>> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
>> +{
>> +	return jiffies_till_flush;
>> +}
>> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
>> +#endif
>> +
>>  /*
>>   * Arrange to wake the GP kthread for this NOCB group at some future
>>   * time when it is safe to do so.
>> @@ -265,23 +290,39 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>>  {
>>  	unsigned long flags;
>>  	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
>> +	unsigned long mod_jif = 0;
>>  
>>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>>  
>>  	/*
>> -	 * Bypass wakeup overrides previous deferments. In case
>> -	 * of callback storm, no need to wake up too early.
>> +	 * Bypass and lazy wakeup overrides previous deferments. In case of
>> +	 * callback storm, no need to wake up too early.
>>  	 */
>> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
>> -		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>> -		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>> -	} else {
>> +	switch (waketype) {
>> +	case RCU_NOCB_WAKE_LAZY:
>> +		if (rdp->nocb_defer_wakeup != RCU_NOCB_WAKE_LAZY)
>> +			mod_jif = jiffies_till_flush;
>> +		break;
>> +
>> +	case RCU_NOCB_WAKE_BYPASS:
>> +		mod_jif = 2;
>> +		break;
>> +
>> +	case RCU_NOCB_WAKE:
>> +	case RCU_NOCB_WAKE_FORCE:
>> +		// For these, make it wake up the soonest if we
>> +		// were in a bypass or lazy sleep before.
>>  		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
>> -			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
>> -		if (rdp_gp->nocb_defer_wakeup < waketype)
>> -			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>> +			mod_jif = 1;
>> +		break;
>>  	}
>>  
>> +	if (mod_jif)
>> +		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
>> +
>> +	if (rdp_gp->nocb_defer_wakeup < waketype)
>> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>> +
>>  	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
>>  
>>  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
>> @@ -293,10 +334,13 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>>   * proves to be initially empty, just return false because the no-CB GP
>>   * kthread may need to be awakened in this case.
>>   *
>> + * Return true if there was something to be flushed and it succeeded, otherwise
>> + * false.
>> + *
>>   * Note that this function always returns true if rhp is NULL.
>>   */
>>  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				     unsigned long j)
>> +				     unsigned long j, bool lazy)
>>  {
>>  	struct rcu_cblist rcl;
>>  
>> @@ -310,7 +354,18 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
>>  	if (rhp)
>>  		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>> -	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
>> +
>> +	/*
>> +	 * If the new CB requested was a lazy one, queue it onto the main
>> +	 * ->cblist so we can take advantage of a sooner grace period.
>> +	 */
>> +	if (lazy && rhp) {
>> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
>> +		rcu_cblist_enqueue(&rcl, rhp);
>> +	} else {
>> +		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
>> +	}
>> +
>>  	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
>>  	WRITE_ONCE(rdp->nocb_bypass_first, j);
>>  	rcu_nocb_bypass_unlock(rdp);
>> @@ -326,13 +381,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>   * Note that this function always returns true if rhp is NULL.
>>   */
>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				  unsigned long j)
>> +				  unsigned long j, bool lazy, bool wake_gp)
>>  {
>> +	bool ret;
>> +
>>  	if (!rcu_rdp_is_offloaded(rdp))
>>  		return true;
>>  	rcu_lockdep_assert_cblist_protected(rdp);
>>  	rcu_nocb_bypass_lock(rdp);
>> -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
>> +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, lazy);
>> +
>> +	if (wake_gp)
>> +		wake_nocb_gp(rdp, true);
>> +
>> +	return ret;
>>  }
>>  
>>  /*
>> @@ -345,7 +407,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>>  	if (!rcu_rdp_is_offloaded(rdp) ||
>>  	    !rcu_nocb_bypass_trylock(rdp))
>>  		return;
>> -	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
>> +	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, false));
>>  }
>>  
>>  /*
>> @@ -367,12 +429,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
>>   * there is only one CPU in operation.
>>   */
>>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				bool *was_alldone, unsigned long flags)
>> +				bool *was_alldone, unsigned long flags,
>> +				bool lazy)
>>  {
>>  	unsigned long c;
>>  	unsigned long cur_gp_seq;
>>  	unsigned long j = jiffies;
>>  	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>> +	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
>>  
>>  	lockdep_assert_irqs_disabled();
>>  
>> @@ -417,23 +481,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  	// If there hasn't yet been all that many ->cblist enqueues
>>  	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
>>  	// ->nocb_bypass first.
>> -	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
>> +	// Lazy CBs throttle this back and do immediate bypass queuing.
>> +	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
>>  		rcu_nocb_lock(rdp);
>>  		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>>  		if (*was_alldone)
>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>  					    TPS("FirstQ"));
>> -		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
>> +
>> +		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, false, false));
>> +
>>  		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
>>  		return false; // Caller must enqueue the callback.
>>  	}
>>  
>>  	// If ->nocb_bypass has been used too long or is too full,
>>  	// flush ->nocb_bypass to ->cblist.
>> -	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
>> +	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
>> +	    (ncbs &&  bypass_is_lazy &&
>> +		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
>>  	    ncbs >= qhimark) {
>>  		rcu_nocb_lock(rdp);
>> -		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
>> +
>> +		if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
>>  			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>>  			if (*was_alldone)
>>  				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>> @@ -460,16 +530,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  	// We need to use the bypass.
>>  	rcu_nocb_wait_contended(rdp);
>>  	rcu_nocb_bypass_lock(rdp);
>> +
>>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
>> +
>> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
>> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
>> +
>>  	if (!ncbs) {
>>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>>  	}
>> +
>>  	rcu_nocb_bypass_unlock(rdp);
>>  	smp_mb(); /* Order enqueue before wake. */
>> -	if (ncbs) {
>> +
>> +	// We had CBs in the bypass list before. There is nothing else to do if:
>> +	// There were only non-lazy CBs before, in this case, the bypass timer
>> +	// or GP-thread will handle the CBs including any new lazy ones.
>> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
>> +	// case the old lazy timer would have been setup. When that expires,
>> +	// the new lazy one will be handled.
>> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>>  		local_irq_restore(flags);
>>  	} else {
>>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
>> @@ -478,6 +561,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>  					    TPS("FirstBQwake"));
>>  			__call_rcu_nocb_wake(rdp, true, flags);
>> +		} else if (bypass_is_lazy && !lazy) {
>> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>> +					    TPS("FirstBQwakeLazy2Non"));
>> +			__call_rcu_nocb_wake(rdp, true, flags);
>>  		} else {
>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>  					    TPS("FirstBQnoWake"));
>> @@ -499,7 +586,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>>  {
>>  	unsigned long cur_gp_seq;
>>  	unsigned long j;
>> -	long len;
>> +	long len, lazy_len, bypass_len;
>>  	struct task_struct *t;
>>  
>>  	// If we are being polled or there is no kthread, just leave.
>> @@ -512,9 +599,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
>>  	}
>>  	// Need to actually to a wakeup.
>>  	len = rcu_segcblist_n_cbs(&rdp->cblist);
>> +	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>> +	lazy_len = READ_ONCE(rdp->lazy_len);
>>  	if (was_alldone) {
>>  		rdp->qlen_last_fqs_check = len;
>> -		if (!irqs_disabled_flags(flags)) {
>> +		// Only lazy CBs in bypass list
>> +		if (lazy_len && bypass_len == lazy_len) {
>> +			rcu_nocb_unlock_irqrestore(rdp, flags);
>> +			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
>> +					   TPS("WakeLazy"));
>> +		} else if (!irqs_disabled_flags(flags)) {
>>  			/* ... if queue was empty ... */
>>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>>  			wake_nocb_gp(rdp, false);
>> @@ -604,8 +698,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu)
>>   */
>>  static void nocb_gp_wait(struct rcu_data *my_rdp)
>>  {
>> -	bool bypass = false;
>> -	long bypass_ncbs;
>> +	bool bypass = false, lazy = false;
>> +	long bypass_ncbs, lazy_ncbs;
>>  	int __maybe_unused cpu = my_rdp->cpu;
>>  	unsigned long cur_gp_seq;
>>  	unsigned long flags;
>> @@ -640,24 +734,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>>  	 * won't be ignored for long.
>>  	 */
>>  	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
>> +		bool flush_bypass = false;
>> +
>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
>>  		rcu_nocb_lock_irqsave(rdp, flags);
>>  		lockdep_assert_held(&rdp->nocb_lock);
>>  		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>> -		if (bypass_ncbs &&
>> +		lazy_ncbs = READ_ONCE(rdp->lazy_len);
>> +
>> +		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
>> +		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
>> +		     bypass_ncbs > 2 * qhimark)) {
>> +			flush_bypass = true;
>> +		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
>>  		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
>>  		     bypass_ncbs > 2 * qhimark)) {
>> -			// Bypass full or old, so flush it.
>> -			(void)rcu_nocb_try_flush_bypass(rdp, j);
>> -			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>> +			flush_bypass = true;
>>  		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
>>  			rcu_nocb_unlock_irqrestore(rdp, flags);
>>  			continue; /* No callbacks here, try next. */
>>  		}
>> +
>> +		if (flush_bypass) {
>> +			// Bypass full or old, so flush it.
>> +			(void)rcu_nocb_try_flush_bypass(rdp, j);
>> +			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>> +			lazy_ncbs = READ_ONCE(rdp->lazy_len);
>> +		}
>> +
>>  		if (bypass_ncbs) {
>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>> -					    TPS("Bypass"));
>> -			bypass = true;
>> +				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
>> +			if (bypass_ncbs == lazy_ncbs)
>> +				lazy = true;
>> +			else
>> +				bypass = true;
>>  		}
>>  		rnp = rdp->mynode;
>>  
>> @@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>>  	my_rdp->nocb_gp_gp = needwait_gp;
>>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
>>  
>> -	if (bypass && !rcu_nocb_poll) {
>> -		// At least one child with non-empty ->nocb_bypass, so set
>> -		// timer in order to avoid stranding its callbacks.
>> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
>> -				   TPS("WakeBypassIsDeferred"));
>> +	// At least one child with non-empty ->nocb_bypass, so set
>> +	// timer in order to avoid stranding its callbacks.
>> +	if (!rcu_nocb_poll) {
>> +		// If bypass list only has lazy CBs. Add a deferred
>> +		// lazy wake up.
>> +		if (lazy && !bypass) {
>> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
>> +					TPS("WakeLazyIsDeferred"));
>> +		// Otherwise add a deferred bypass wake up.
>> +		} else if (bypass) {
>> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
>> +					TPS("WakeBypassIsDeferred"));
>> +		}
>>  	}
>> +
>>  	if (rcu_nocb_poll) {
>>  		/* Polling, so trace if first poll in the series. */
>>  		if (gotcbs)
>> @@ -1036,7 +1156,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
>>  	 * return false, which means that future calls to rcu_nocb_try_bypass()
>>  	 * will refuse to put anything into the bypass.
>>  	 */
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false, false));
>>  	/*
>>  	 * Start with invoking rcu_core() early. This way if the current thread
>>  	 * happens to preempt an ongoing call to rcu_core() in the middle,
>> @@ -1290,6 +1410,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
>>  	raw_spin_lock_init(&rdp->nocb_gp_lock);
>>  	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
>>  	rcu_cblist_init(&rdp->nocb_bypass);
>> +	WRITE_ONCE(rdp->lazy_len, 0);
>>  	mutex_init(&rdp->nocb_gp_kthread_mutex);
>>  }
>>  
>> @@ -1571,13 +1692,14 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
>>  }
>>  
>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				  unsigned long j)
>> +				  unsigned long j, bool lazy, bool wakegp)
>>  {
>>  	return true;
>>  }
>>  
>>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				bool *was_alldone, unsigned long flags)
>> +				bool *was_alldone, unsigned long flags,
>> +				bool lazy)
>>  {
>>  	return false;
>>  }
>> -- 
>> 2.37.2.789.g6183377224-goog
>>

^ permalink raw reply	[flat|nested] 73+ messages in thread
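
A minimal usage sketch of the call_rcu_lazy() API introduced by the patch above. The struct foo object and its helpers are hypothetical; only the call_rcu_lazy() prototype, the embedded rcu_head, and the CONFIG_RCU_LAZY=n fallback to call_rcu() come from the quoted hunks:

/* Hypothetical caller: defer a kfree() that is not latency-sensitive so it
 * can ride a lazy (batched) grace period instead of prodding the GP kthread.
 */
#include <linux/slab.h>
#include <linux/rcupdate.h>

struct foo {
	int data;
	struct rcu_head rcu;		/* must be embedded for call_rcu*() */
};

static void foo_free_rcu(struct rcu_head *head)
{
	struct foo *fp = container_of(head, struct foo, rcu);

	kfree(fp);
}

static void foo_release(struct foo *fp)
{
	/* Becomes plain call_rcu() when CONFIG_RCU_LAZY=n, per the
	 * #define in the rcupdate.h hunk above. */
	call_rcu_lazy(&fp->rcu, foo_free_rcu);
}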

* Re: [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.
  2022-09-03 12:39     ` Paul E. McKenney
@ 2022-09-03 14:07       ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-03 14:07 UTC (permalink / raw)
  To: paulmck, kernel test robot
  Cc: rcu, kbuild-all, linux-kernel, rushikesh.s.kadam, urezki,
	neeraj.iitr10, frederic, rostedt, vineeth, boqun.feng



On 9/3/2022 8:39 AM, Paul E. McKenney wrote:
> On Sat, Sep 03, 2022 at 12:48:28AM +0800, kernel test robot wrote:
>> Hi "Joel,
>>
>> Thank you for the patch! Yet something to improve:
>>
>> [auto build test ERROR on paulmck-rcu/dev]
>> [also build test ERROR on pcmoore-selinux/next drm-intel/for-linux-next linus/master v6.0-rc3]
>> [cannot apply to vbabka-slab/for-next rostedt-trace/for-next tip/timers/core next-20220901]
>> [If your patch is applied to the wrong git tree, kindly drop us a note.
>> And when submitting patch, we suggest to use '--base' as documented in
>> https://git-scm.com/docs/git-format-patch#_base_tree_information]
>>
>> url:    https://github.com/intel-lab-lkp/linux/commits/Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
>> base:   https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git dev
>> config: mips-allyesconfig (https://download.01.org/0day-ci/archive/20220903/202209030052.20CJhjTX-lkp@intel.com/config)
>> compiler: mips-linux-gcc (GCC) 12.1.0
>> reproduce (this is a W=1 build):
>>         wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
>>         chmod +x ~/bin/make.cross
>>         # https://github.com/intel-lab-lkp/linux/commit/c0f09b1d42d06649680f74a78ca363e7f1c158b2
>>         git remote add linux-review https://github.com/intel-lab-lkp/linux
>>         git fetch --no-tags linux-review Joel-Fernandes-Google/Implement-call_rcu_lazy-and-miscellaneous-fixes/20220902-062156
>>         git checkout c0f09b1d42d06649680f74a78ca363e7f1c158b2
>>         # save the config file
>>         mkdir build_dir && cp config build_dir/.config
>>         COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross W=1 O=build_dir ARCH=mips SHELL=/bin/bash
>>
>> If you fix the issue, kindly add following tag where applicable
>> Reported-by: kernel test robot <lkp@intel.com>
>>
>> All errors (new ones prefixed by >>):
>>
>>    In file included from <command-line>:
>>    In function 'dst_hold',
>>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>>        inlined from '__skb_dst_copy' at include/net/dst.h:284:3,
>>        inlined from 'ovs_vport_output' at net/openvswitch/actions.c:787:2:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
> 
> This looks like fallout from the rcu_head structure changing size,
> given that __refcnt comes after rcu_head on 32-bit systems.
> 

Yes, I also thought of that as the likely cause.

> It looks like the per-CB tracing code needs to be kept on the side for
> the time being rather than being sent to -next (let alone mainline).

True, that should be OK. It is a debug patch and can go in a bit later.

> Any reason I cannot just move this one to the end of the stack, after
> "fork: Move thread_stack_free_rcu() to call_rcu_lazy()"?

No reason, it is an independent patch.

 - Joel



> 
> 							Thanx, Paul
> 
>>    In function 'dst_hold',
>>        inlined from 'execute_set_action' at net/openvswitch/actions.c:1093:3,
>>        inlined from 'do_execute_actions' at net/openvswitch/actions.c:1377:10:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>> --
>>    In file included from <command-line>:
>>    In function 'dst_hold',
>>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>>        inlined from 'dn_insert_route.constprop.isra' at net/decnet/dn_route.c:334:4:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>    In function 'dst_hold',
>>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>>        inlined from 'dn_insert_route.constprop.isra' at net/decnet/dn_route.c:347:2:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>    In function 'dst_hold',
>>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>>        inlined from 'dn_route_input' at net/decnet/dn_route.c:1535:4:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>    In function 'dst_hold',
>>        inlined from 'dst_hold_and_use' at include/net/dst.h:244:2,
>>        inlined from '__dn_route_output_key.isra' at net/decnet/dn_route.c:1257:5:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>    In function 'dst_hold',
>>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>>        inlined from 'dn_cache_dump' at net/decnet/dn_route.c:1752:4:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>> --
>>    In file included from <command-line>:
>>    In function 'dst_hold',
>>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>>        inlined from 'ip6_copy_metadata' at net/ipv6/ip6_output.c:654:2:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>    In function 'dst_hold',
>>        inlined from 'ip6_append_data' at net/ipv6/ip6_output.c:1838:3:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>    In function 'dst_hold',
>>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>>        inlined from 'ip6_sk_dst_lookup_flow' at net/ipv6/ip6_output.c:1262:3:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>> --
>>    In file included from <command-line>:
>>    In function 'dst_hold',
>>        inlined from 'dst_clone' at include/net/dst.h:251:3,
>>        inlined from '__skb_dst_copy' at include/net/dst.h:284:3,
>>        inlined from 'skb_dst_copy' at include/net/dst.h:289:2,
>>        inlined from 'ip6_list_rcv_finish.constprop' at net/ipv6/ip6_input.c:128:4:
>>>> include/linux/compiler_types.h:354:45: error: call to '__compiletime_assert_490' declared with attribute error: BUILD_BUG_ON failed: offsetof(struct dst_entry, __refcnt) & 63
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |                                             ^
>>    include/linux/compiler_types.h:335:25: note: in definition of macro '__compiletime_assert'
>>      335 |                         prefix ## suffix();                             \
>>          |                         ^~~~~~
>>    include/linux/compiler_types.h:354:9: note: in expansion of macro '_compiletime_assert'
>>      354 |         _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>>          |         ^~~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert'
>>       39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg)
>>          |                                     ^~~~~~~~~~~~~~~~~~
>>    include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG'
>>       50 |         BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition)
>>          |         ^~~~~~~~~~~~~~~~
>>    include/net/dst.h:230:9: note: in expansion of macro 'BUILD_BUG_ON'
>>      230 |         BUILD_BUG_ON(offsetof(struct dst_entry, __refcnt) & 63);
>>          |         ^~~~~~~~~~~~
>>
>>
>> vim +/__compiletime_assert_490 +354 include/linux/compiler_types.h
>>
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  340  
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  341  #define _compiletime_assert(condition, msg, prefix, suffix) \
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  342  	__compiletime_assert(condition, msg, prefix, suffix)
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  343  
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  344  /**
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  345   * compiletime_assert - break build and emit msg if condition is false
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  346   * @condition: a compile-time constant condition to check
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  347   * @msg:       a message to emit if condition is false
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  348   *
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  349   * In tradition of POSIX assert, this macro will break the build if the
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  350   * supplied condition is *false*, emitting the supplied error message if the
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  351   * compiler has support to do so.
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  352   */
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  353  #define compiletime_assert(condition, msg) \
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21 @354  	_compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
>> eb5c2d4b45e3d2 Will Deacon 2020-07-21  355  
>>
>> -- 
>> 0-DAY CI Kernel Test Service
>> https://01.org/lkp

^ permalink raw reply	[flat|nested] 73+ messages in thread
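
To make the diagnosis above concrete: dst.h requires __refcnt to start a 64-byte cacheline, and __refcnt follows the embedded rcu_head on 32-bit, so growing rcu_head shifts __refcnt and trips the BUILD_BUG_ON. Below is a self-contained toy model; the field sizes and the PER_CB_TRACING macro are invented, only the alignment check mirrors include/net/dst.h:

/* toy_dst.c - model of the 0day failure above.
 *   cc -c toy_dst.c                   # assertion holds
 *   cc -DPER_CB_TRACING -c toy_dst.c  # fails much like the robot build
 */
#include <stddef.h>

struct rcu_head_toy {
	void *next;
	void (*func)(struct rcu_head_toy *head);
#ifdef PER_CB_TRACING
	unsigned long long queue_time;	/* extra per-CB state grows the struct */
#endif
};

struct dst_entry_toy {
	/* Stand-in for everything preceding rcu_head, sized so that
	 * __refcnt lands exactly at offset 64 with a two-pointer rcu_head. */
	char before[64 - 2 * sizeof(void *)];
	struct rcu_head_toy rcu_head;
	int __refcnt;
};

/* Same shape as the BUILD_BUG_ON() quoted from include/net/dst.h. */
_Static_assert((offsetof(struct dst_entry_toy, __refcnt) & 63) == 0,
	       "offsetof(struct dst_entry_toy, __refcnt) & 63");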

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-02 23:58     ` Joel Fernandes
@ 2022-09-03 15:10       ` Paul E. McKenney
  2022-09-04 21:13       ` Frederic Weisbecker
  1 sibling, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-03 15:10 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, rcu, linux-kernel, rushikesh.s.kadam,
	urezki, neeraj.iitr10, rostedt, vineeth, boqun.feng

On Fri, Sep 02, 2022 at 07:58:42PM -0400, Joel Fernandes wrote:
> On 9/2/2022 7:35 AM, Frederic Weisbecker wrote:
> > On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
> >> When the bypass cblist gets too big or its timeout has occurred, it is
> >> flushed into the main cblist. However, the bypass timer is still running
> >> and the behavior is that it would eventually expire and wake the GP
> >> thread.
> >>
> >> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
> >> as soon as the flush happens. Otherwise, the lazy-timer will go off much
> >> later and the now-non-lazy cblist CBs can get stranded for the duration
> >> of the timer.
> >>
> >> This is a good thing to do anyway (regardless of this series), since it
> >> makes the behavior consistent with that of other code paths, where queueing
> >> something into the ->cblist quickly puts the GP kthread into a non-sleeping
> >> state.
> >>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  kernel/rcu/tree_nocb.h | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> >> index 0a5f0ef41484..31068dd31315 100644
> >> --- a/kernel/rcu/tree_nocb.h
> >> +++ b/kernel/rcu/tree_nocb.h
> >> @@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
> >>  			rdp->nocb_gp_adv_time = j;
> >>  		}
> >> -		rcu_nocb_unlock_irqrestore(rdp, flags);
> >> +
> >> +		// The flush succeeded and we moved CBs into the ->cblist.
> >> +		// However, the bypass timer might still be running. Wakeup the
> >> +		// GP thread by calling a helper with was_all_done set so that
> >> +		// wake up happens (needed if main CB list was empty before).
> >> +		__call_rcu_nocb_wake(rdp, true, flags)
> >> +
> > 
> > Ok so there are two different changes here:
> > 
> > 1) wake up nocb_gp as we just flushed the bypass list. Indeed if the regular
> >    callback list was empty before flushing, we rather want to immediately wake
> >    up nocb_gp instead of waiting for the bypass timer to process them.
> > 
> > 2) wake up nocb_gp unconditionally (ie: even if the regular queue was not empty
> >    before bypass flushing) so that nocb_gp_wait() is forced through another loop
> >    starting with cancelling the bypass timer (I suggest you put such explanation
> >    in the comment btw because that process may not be obvious for mortals).
> > 
> > The change 1) looks like a good idea to me.
> > 
> > The change 2) has unclear motivation. It forces nocb_gp_wait() through another
> > costly loop even though the timer might have been cancelled into some near
> > future, eventually avoiding that extra costly loop. Also it abuses the
> > was_alldone stuff and we may get rcu_nocb_wake with incoherent meanings
> > (WakeEmpty/WakeEmptyIsDeferred) when it's actually not empty.
> 
> Yes #2 can be optimized as follows I think on top of this patch, good point:

I am holding off on this for the moment.

							Thanx, Paul

> =============
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index ee5924ba2f3b..24aabd723abd 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -514,12 +514,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp,
> struct rcu_head *rhp,
>             ncbs >= qhimark) {
>                 rcu_nocb_lock(rdp);
> 
> +               *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> +
>                 rcu_cblist_set_flush(&rdp->nocb_bypass,
>                                 lazy ? BIT(CB_DEBUG_BYPASS_LAZY_FLUSHED) :
> BIT(CB_DEBUG_BYPASS_FLUSHED),
>                                 (j - READ_ONCE(cb_debug_jiffies_first)));
> 
>                 if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
> -                       *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>                         if (*was_alldone)
>                                 trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>                                                     TPS("FirstQ"));
> @@ -537,7 +538,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct
> rcu_head *rhp,
>                 // However, the bypass timer might still be running. Wakeup the
>                 // GP thread by calling a helper with was_all_done set so that
>                 // wake up happens (needed if main CB list was empty before).
> -               __call_rcu_nocb_wake(rdp, true, flags)
> +               __call_rcu_nocb_wake(rdp, *was_all_done, flags)
> 
>                 return true; // Callback already enqueued.
>         }
> =============
> 
> > So you may need to clarify the purpose. And I would suggest to make two patches
> > here.
> I guess this change only #2 is no longer a concern? And splitting is not needed
> then as it is only #1.
> 
> Thanks,
> 
>  - Joel
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes
  2022-09-01 22:17 [PATCH v5 00/18] Implement call_rcu_lazy() and miscellaneous fixes Joel Fernandes (Google)
                   ` (17 preceding siblings ...)
  2022-09-01 22:17 ` [PATCH v5 18/18] fork: Move thread_stack_free_rcu() " Joel Fernandes (Google)
@ 2022-09-03 15:22 ` Paul E. McKenney
  18 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-03 15:22 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, rostedt, vineeth, boqun.feng

On Thu, Sep 01, 2022 at 10:17:02PM +0000, Joel Fernandes (Google) wrote:
> Here is v5 of call_rcu_lazy() based on the latest RCU -dev branch. Main changes are:
> - moved length field into rcu_data (Frederic suggestion)
> - added new traces to aid debugging and testing.
> 	- the new trace patch (along with the rcuscale and rcutorture tests)
> 	  gives confidence that the patches work well. Also it is tested on
> 	  real ChromeOS hardware and the boot time is looking good even though
> 	  lazy callbacks are being queued (i.e. the lazy ones do not effect the
> 	  synchronous non-lazy ones that effect boot time)
> - rewrote some parts of the core patch.
> - for rcutop, please apply the diff in the following link to the BCC repo:
>   https://lore.kernel.org/r/Yn+79v4q+1c9lXdc@google.com
>   Then, cd libbpf-tools/ and run make to build the rcutop static binary.
>   (If you need an x86 binary, ping me and I'll email you).
>   In the future, I will attempt to make rcutop built within the kernel repo.
>   This is already done for another tool (see tools/bpf/runqslower) so is doable.
> 
> The 2 mm patches are what Vlastimil pulled into slab-next. I included them in
> this series so that the tracing patch builds.
> 
> Previous series was posted here:
>  https://lore.kernel.org/all/20220713213237.1596225-1-joel@joelfernandes.org/
> 
> Linked below [1] is some power data I collected with Turbostat on an x86
> ChromeOS ADL machine. The numbers are not based on -next, but rather 5.19
> kernel as that's what booted on my ChromeOS machine).
> 
> These are output by Turbostat, by running:
> turbostat -S -s PkgWatt,CorWatt --interval 5
> PkgWatt - summary of package power in Watts 5 second interval.
> CoreWatt - summary of core power in Watts 5 second interval.
> 
> [1] https://lore.kernel.org/r/b4145008-6840-fc69-a6d6-e38bc218009d@joelfernandes.org

Thank you for all your work on this!

I have pulled these in for testing and review.  Some work is required
for them to be ready for mainline.

> Joel Fernandes (Google) (15):

I took these:

> rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
> rcu: Fix late wakeup when flush of bypass cblist happens
> rcu: Move trace_rcu_callback() before bypassing
> rcu: Introduce call_rcu_lazy() API implementation

You and Frederic need to come to agreement on "rcu: Fix late wakeup when
flush of bypass cblist happens".

These have some difficulties, so I put them on top of the stack:

> rcu: Add per-CB tracing for queuing, flush and invocation.
	This one breaks 32-bit MIPS.
> rcuscale: Add laziness and kfree tests
	I am concerned that this one has OOM problems.
> rcutorture: Add test code for call_rcu_lazy()
	This one does not need a separate scenario, but instead a separate
	rcutorture.gp_lazy boot parameter.  For a rough template for this
	sort of change, please see the rcutorture changes in this commit:

	76ea364161e7 ("rcu: Add full-sized polling for start_poll()")

	The key point is that every rcutorture.torture_type=rcu test then
	becomes a call_rcu_lazy() test.  (Unless explicitly overridden.)
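
	For illustration only, such a gate might look roughly like the
	sketch below; the gp_lazy name follows the suggestion above, and
	the wrapper is an assumption rather than actual rcutorture code:

	torture_param(bool, gp_lazy, false,
		      "Route call_rcu() through call_rcu_lazy().");

	/*
	 * Sketch of a helper the rcu torture type could use instead of
	 * calling call_rcu() directly (name and placement are assumptions).
	 */
	static void rcu_torture_call_rcu(struct rcu_head *head, rcu_callback_t func)
	{
		if (gp_lazy)
			call_rcu_lazy(head, func);
		else
			call_rcu(head, func);
	}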

I took these, though they need to go up their respective trees or get acks
from their respective maintainers.

> fs: Move call_rcu() to call_rcu_lazy() in some paths
> cred: Move call_rcu() to call_rcu_lazy()
> security: Move call_rcu() to call_rcu_lazy()
> net/core: Move call_rcu() to call_rcu_lazy()
> kernel: Move various core kernel usages to call_rcu_lazy()
> lib: Move call_rcu() to call_rcu_lazy()
> i915: Move call_rcu() to call_rcu_lazy()
> fork: Move thread_stack_free_rcu() to call_rcu_lazy()
> 
> Vineeth Pillai (1):

I took this:

> rcu: shrinker for lazy rcu
> 
> Vlastimil Babka (2):

As noted earlier, I took these strictly for testing.  I do not expect
to push them into -next, let alone into mainline.

> mm/slub: perform free consistency checks before call_rcu
> mm/sl[au]b: rearrange struct slab fields to allow larger rcu_head

These all are on -rcu not-yet-for-mainline branch lazy.2022.09.03b.

							Thanx, Paul

> drivers/gpu/drm/i915/gem/i915_gem_object.c    |   2 +-
> fs/dcache.c                                   |   4 +-
> fs/eventpoll.c                                |   2 +-
> fs/file_table.c                               |   2 +-
> fs/inode.c                                    |   2 +-
> include/linux/rcupdate.h                      |   6 +
> include/linux/types.h                         |  44 +++
> include/trace/events/rcu.h                    |  69 ++++-
> kernel/cred.c                                 |   2 +-
> kernel/exit.c                                 |   2 +-
> kernel/fork.c                                 |   6 +-
> kernel/pid.c                                  |   2 +-
> kernel/rcu/Kconfig                            |  19 ++
> kernel/rcu/rcu.h                              |  12 +
> kernel/rcu/rcu_segcblist.c                    |  23 +-
> kernel/rcu/rcu_segcblist.h                    |   8 +
> kernel/rcu/rcuscale.c                         |  74 ++++-
> kernel/rcu/rcutorture.c                       |  60 +++-
> kernel/rcu/tree.c                             | 187 ++++++++----
> kernel/rcu/tree.h                             |  13 +-
> kernel/rcu/tree_nocb.h                        | 282 +++++++++++++++---
> kernel/time/posix-timers.c                    |   2 +-
> lib/radix-tree.c                              |   2 +-
> lib/xarray.c                                  |   2 +-
> mm/slab.h                                     |  54 ++--
> mm/slub.c                                     |  20 +-
> net/core/dst.c                                |   2 +-
> security/security.c                           |   2 +-
> security/selinux/avc.c                        |   4 +-
> .../selftests/rcutorture/configs/rcu/CFLIST   |   1 +
> .../selftests/rcutorture/configs/rcu/TREE11   |  18 ++
> .../rcutorture/configs/rcu/TREE11.boot        |   8 +
> 32 files changed, 783 insertions(+), 153 deletions(-)
> create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11
> create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/TREE11.boot
> 
> --
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-02 15:21   ` Frederic Weisbecker
  2022-09-02 23:09     ` Joel Fernandes
@ 2022-09-03 22:00     ` Joel Fernandes
  2022-09-04 21:01       ` Frederic Weisbecker
  2022-09-06  3:05     ` Joel Fernandes
  2 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-03 22:00 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
[..] 
> > +
> >  	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
> >  
> >  	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
> [...]
> > @@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> >  	my_rdp->nocb_gp_gp = needwait_gp;
> >  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
> >  
> > -	if (bypass && !rcu_nocb_poll) {
> > -		// At least one child with non-empty ->nocb_bypass, so set
> > -		// timer in order to avoid stranding its callbacks.
> > -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> > -				   TPS("WakeBypassIsDeferred"));
> > +	// At least one child with non-empty ->nocb_bypass, so set
> > +	// timer in order to avoid stranding its callbacks.
> > +	if (!rcu_nocb_poll) {
> > +		// If bypass list only has lazy CBs. Add a deferred
> > +		// lazy wake up.
> > +		if (lazy && !bypass) {
> > +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> > +					TPS("WakeLazyIsDeferred"));
> 
> What if:
> 
> 1) rdp(1) has lazy callbacks
> 2) all other rdp's have no callback at all
> 3) nocb_gp_wait() runs through all rdp's, everything is handled, except for
>    these lazy callbacks
> 4) It reaches the above path, ready to arm the RCU_NOCB_WAKE_LAZY timer,
>    but it hasn't yet called wake_nocb_gp_defer()
> 5) Oh but rdp(2) queues a non-lazy callback. interrupts are disabled so it defers
>    the wake up to nocb_gp_wait() with arming the timer in RCU_NOCB_WAKE.
> 6) nocb_gp_wait() finally calls wake_nocb_gp_defer() and override the timeout
>    to several seconds ahead.
> 7) No more callbacks queued, the non-lazy callback will have to wait several
>    seconds to complete.
> 
> Or did I miss something? Note that the race exists with RCU_NOCB_WAKE_BYPASS
> but it's only about one jiffy delay, not seconds.
> 

So I think the below patch should fix that. But I have not tested it at all
and it could very well have issues. In particular, there is a likelihood of
doing a wake up while holding a lock, which I'm not sure is safe given
scheduler lock ordering. I'll test it next week. Let me know your thoughts,
though.

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] rcu: Fix race where wake_nocb_gp_defer() lazy wake can
 overwrite a non-lazy wake

Fix by holding nocb_gp_lock when observing the state of all rdps. If any
rdp queued a non-lazy CB, we would do a wake up of the main gp thread.

This should address the race Frederic reported (which could affect both
non-lazy CBs using the bypass list and lazy CBs, though lazy CBs much
more noticeably).

Quoting from Frederic's email:

1) rdp(1) has lazy callbacks
2) all other rdp's have no callback at all
3) nocb_gp_wait() runs through all rdp's, everything is handled, except for
   these lazy callbacks
4) It reaches the above path, ready to arm the RCU_NOCB_WAKE_LAZY timer,
   but it hasn't yet called wake_nocb_gp_defer()
5) Oh but rdp(2) queues a non-lazy callback. interrupts are disabled so it defers
   the wake up to nocb_gp_wait() with arming the timer in RCU_NOCB_WAKE.
6) nocb_gp_wait() finally calls wake_nocb_gp_defer() and override the timeout
   to several seconds ahead.
7) No more callbacks queued, the non-lazy callback will have to wait several
   seconds to complete.

Here, the nocb gp lock is held when #4 happens. So the deferred wakeup
on #5 has to wait till #4 finishes.

Reported-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 8b46442e4473..6690ece8fe20 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -285,14 +285,15 @@ EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
  */
-static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
-			       const char *reason)
+static void __wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
+			       const char *reason, bool locked)
 {
 	unsigned long flags;
 	struct rcu_data *rdp_gp = rdp->nocb_gp_rdp;
 	unsigned long mod_jif = 0;
 
-	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
+	if (!locked)
+		raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
 	 * Bypass and lazy wakeup overrides previous deferments. In case of
@@ -323,11 +324,23 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	if (rdp_gp->nocb_defer_wakeup < waketype)
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 
-	raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
+	if (!locked)
+		raw_spin_unlock_irqrestore(&rdp_gp->nocb_gp_lock, flags);
 
 	trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, reason);
 }
 
+static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
+		const char *reason) {
+	__wake_nocb_gp_defer(rdp, waketype, reason, false);
+}
+
+
+static void wake_nocb_gp_defer_locked(struct rcu_data *rdp, int waketype,
+		const char *reason) {
+	__wake_nocb_gp_defer(rdp, waketype, reason, true);
+}
+
 /*
  * Flush the ->nocb_bypass queue into ->cblist, enqueuing rhp if non-NULL.
  * However, if there is a callback to be enqueued and if ->nocb_bypass
@@ -754,6 +767,8 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 * is added to the list, so the skipped-over rcu_data structures
 	 * won't be ignored for long.
 	 */
+
+	raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
 	list_for_each_entry_rcu(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp, 1) {
 		bool needwake_state = false;
 		bool flush_bypass = false;
@@ -855,14 +870,15 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 		// If bypass list only has lazy CBs. Add a deferred
 		// lazy wake up.
 		if (lazy && !bypass) {
-			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+			wake_nocb_gp_defer_locked(my_rdp, RCU_NOCB_WAKE_LAZY,
 					TPS("WakeLazyIsDeferred"));
 		// Otherwise add a deferred bypass wake up.
 		} else if (bypass) {
-			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+			wake_nocb_gp_defer_locked(my_rdp, RCU_NOCB_WAKE_BYPASS,
 					TPS("WakeBypassIsDeferred"));
 		}
 	}
+	raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
 
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-03 22:00     ` Joel Fernandes
@ 2022-09-04 21:01       ` Frederic Weisbecker
  2022-09-05 20:20         ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-04 21:01 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Sat, Sep 03, 2022 at 10:00:29PM +0000, Joel Fernandes wrote:
> On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
> +
> +	raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);

This locks across the whole group iteration, potentially contending with
call_rcu(), the idle loop, and resume-to-userspace on every rdp in the group...

How about not overwriting timers, and instead only setting the RCU_NOCB_WAKE_LAZY
timer when it was previously in the RCU_NOCB_WAKE_NOT state? After all, if the
timer is armed, it's because we have regular callbacks queued, and thus we don't
need to wait before processing the lazy callbacks since we are going to start a
grace period anyway.
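
A rough sketch of what that check would look like inside wake_nocb_gp_defer()
(illustration only, reusing names from the patch under discussion, not tested
code):

	if (waketype == RCU_NOCB_WAKE_LAZY &&
	    READ_ONCE(rdp_gp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
		/* Nothing deferred yet, so the long lazy timeout is safe. */
		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, RCU_NOCB_WAKE_LAZY);
	}
	/* Otherwise leave any already-armed, sooner wakeup untouched. */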

Thanks.


>  	list_for_each_entry_rcu(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp, 1) {
>  		bool needwake_state = false;
>  		bool flush_bypass = false;
> @@ -855,14 +870,15 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>  		// If bypass list only has lazy CBs. Add a deferred
>  		// lazy wake up.
>  		if (lazy && !bypass) {
> -			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> +			wake_nocb_gp_defer_locked(my_rdp, RCU_NOCB_WAKE_LAZY,
>  					TPS("WakeLazyIsDeferred"));
>  		// Otherwise add a deferred bypass wake up.
>  		} else if (bypass) {
> -			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> +			wake_nocb_gp_defer_locked(my_rdp, RCU_NOCB_WAKE_BYPASS,
>  					TPS("WakeBypassIsDeferred"));
>  		}
>  	}
> +	raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
>  
>  	if (rcu_nocb_poll) {
>  		/* Polling, so trace if first poll in the series. */
> -- 
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-02 23:58     ` Joel Fernandes
  2022-09-03 15:10       ` Paul E. McKenney
@ 2022-09-04 21:13       ` Frederic Weisbecker
  1 sibling, 0 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-04 21:13 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Fri, Sep 02, 2022 at 07:58:42PM -0400, Joel Fernandes wrote:
> 
> 
> On 9/2/2022 7:35 AM, Frederic Weisbecker wrote:
> > On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
> >> When the bypass cblist gets too big or its timeout has occurred, it is
> >> flushed into the main cblist. However, the bypass timer is still running
> >> and the behavior is that it would eventually expire and wake the GP
> >> thread.
> >>
> >> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
> >> soon as the flush happens. Otherwise, the lazy-timer will go off much
> >> later and the now-non-lazy cblist CBs can get stranded for the duration
> >> of the timer.
> >>
> >> This is a good thing to do anyway (regardless of this series), since it
> >> makes the behavior consistent with behavior of other code paths where queueing
> >> something into the ->cblist makes the GP kthread in a non-sleeping state
> >> quickly.
> >>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  kernel/rcu/tree_nocb.h | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> >> index 0a5f0ef41484..31068dd31315 100644
> >> --- a/kernel/rcu/tree_nocb.h
> >> +++ b/kernel/rcu/tree_nocb.h
> >> @@ -447,7 +447,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
> >>  			rdp->nocb_gp_adv_time = j;
> >>  		}
> >> -		rcu_nocb_unlock_irqrestore(rdp, flags);
> >> +
> >> +		// The flush succeeded and we moved CBs into the ->cblist.
> >> +		// However, the bypass timer might still be running. Wakeup the
> >> +		// GP thread by calling a helper with was_all_done set so that
> >> +		// wake up happens (needed if main CB list was empty before).
> >> +		__call_rcu_nocb_wake(rdp, true, flags)
> >> +
> > 
> > Ok so there are two different changes here:
> > 
> > 1) wake up nocb_gp as we just flushed the bypass list. Indeed if the regular
> >    callback list was empty before flushing, we rather want to immediately wake
> >    up nocb_gp instead of waiting for the bypass timer to process them.
> > 
> > 2) wake up nocb_gp unconditionally (ie: even if the regular queue was not empty
> >    before bypass flushing) so that nocb_gp_wait() is forced through another loop
> >    starting with cancelling the bypass timer (I suggest you put such explanation
> >    in the comment btw because that process may not be obvious for mortals).
> > 
> > The change 1) looks like a good idea to me.
> > 
> > The change 2) has unclear motivation. It forces nocb_gp_wait() through another
> > costly loop even though the timer might have been cancelled into some near
> > future, eventually avoiding that extra costly loop. Also it abuses the
> > was_alldone stuff and we may get rcu_nocb_wake with incoherent meanings
> > (WakeEmpty/WakeEmptyIsDeferred) when it's actually not empty.
> 
> Yes #2 can be optimized as follows I think on top of this patch, good point:
> =============
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index ee5924ba2f3b..24aabd723abd 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -514,12 +514,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp,
> struct rcu_head *rhp,
>             ncbs >= qhimark) {
>                 rcu_nocb_lock(rdp);
> 
> +               *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> +
>                 rcu_cblist_set_flush(&rdp->nocb_bypass,
>                                 lazy ? BIT(CB_DEBUG_BYPASS_LAZY_FLUSHED) :
> BIT(CB_DEBUG_BYPASS_FLUSHED),
>                                 (j - READ_ONCE(cb_debug_jiffies_first)));
> 
>                 if (!rcu_nocb_flush_bypass(rdp, rhp, j, lazy, false)) {
> -                       *was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>                         if (*was_alldone)
>                                 trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>                                                     TPS("FirstQ"));
> @@ -537,7 +538,7 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct
> rcu_head *rhp,
>                 // However, the bypass timer might still be running. Wakeup the
>                 // GP thread by calling a helper with was_all_done set so that
>                 // wake up happens (needed if main CB list was empty before).
> -               __call_rcu_nocb_wake(rdp, true, flags)
> +               __call_rcu_nocb_wake(rdp, *was_all_done, flags)
> 
>                 return true; // Callback already enqueued.
>

That looks right!

> }
> =============
> 
> > So you may need to clarify the purpose. And I would suggest to make two patches
> > here.
> I guess this change only #2 is no longer a concern? And splitting is not needed
> then as it is only #1.

Sounds good!

Thanks!

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-02 23:09     ` Joel Fernandes
@ 2022-09-05 12:59       ` Frederic Weisbecker
  2022-09-05 20:18         ` Joel Fernandes
  2022-09-05 20:32         ` Joel Fernandes
  0 siblings, 2 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-05 12:59 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Fri, Sep 02, 2022 at 07:09:39PM -0400, Joel Fernandes wrote:
> On 9/2/2022 11:21 AM, Frederic Weisbecker wrote:
> >> @@ -3904,7 +3943,8 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
> >>  	rdp->barrier_head.func = rcu_barrier_callback;
> >>  	debug_rcu_head_queue(&rdp->barrier_head);
> >>  	rcu_nocb_lock(rdp);
> >> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> >> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, false,
> >> +		     /* wake gp thread */ true));
> > 
> > It's a bad sign when you need to start commenting your boolean parameters :)
> > Perhaps use a single two-bit flag instead of two booleans, for readability?
> 
> That's fair, what do you mean 2-bit flag? Are you saying, we encode the last 2
> parameters to flush bypass in a u*?

Yeah exactly. Such as rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_LAZY | FLUSH_WAKE)
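
Spelled out, the encoding could look something like this sketch (the
FLUSH_BP_* spellings below happen to match what the later v6 posting in
this thread ends up using; the exact names don't matter):

	#define FLUSH_BP_NONE	0
	/* Is the CB being enqueued after the flush a lazy CB? */
	#define FLUSH_BP_LAZY	BIT(0)
	/* Wake up the nocb GP kthread after the flush? */
	#define FLUSH_BP_WAKE	BIT(1)

	static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
					  unsigned long j, unsigned long flush_flags);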

> 
> > Also that's a subtle change which purpose isn't explained. It means that
> > rcu_barrier_entrain() used to wait for the bypass timer in the worst case
> > but now we force rcuog into it immediately. Should that be a separate change?
> 
> It could be split, but it is laziness that amplifies the issue so I thought of
> keeping it in the same patch. I don't mind one way or the other.

Ok then let's keep it here, but please add a comment explaining the reason to
force wake here.

> >> +	case RCU_NOCB_WAKE_BYPASS:
> >> +		mod_jif = 2;
> >> +		break;
> >> +
> >> +	case RCU_NOCB_WAKE:
> >> +	case RCU_NOCB_WAKE_FORCE:
> >> +		// For these, make it wake up the soonest if we
> >> +		// were in a bypass or lazy sleep before.
> >>  		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
> >> -			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
> >> -		if (rdp_gp->nocb_defer_wakeup < waketype)
> >> -			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> >> +			mod_jif = 1;
> >> +		break;
> >>  	}
> >>  
> >> +	if (mod_jif)
> >> +		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
> >> +
> >> +	if (rdp_gp->nocb_defer_wakeup < waketype)
> >> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> > 
> > So RCU_NOCB_WAKE_BYPASS and RCU_NOCB_WAKE_LAZY don't override the timer state
> > anymore? Looks like something is missing.
> 
> My goal was to make sure that NOCB_WAKE_LAZY wake keeps the timer lazy. If I
> don't do this, then when CPU enters idle, it will immediately do a wake up via
> this call:
> 
> 	rcu_nocb_need_deferred_wakeup(rdp_gp, RCU_NOCB_WAKE)

But if the timer is in RCU_NOCB_WAKE_LAZY mode, that shouldn't be a problem.

> 
> That was almost always causing lazy CBs to be non-lazy thus negating all the
> benefits.
> 
> Note that bypass will also have same issue where the bypass CB will not wait for
> intended bypass duration. To make it consistent with lazy, I made bypass also
> not override nocb_defer_wakeup.

I'm surprised because rcu_nocb_flush_deferred_wakeup() should only do the wake up
if the timer is RCU_NOCB_WAKE or RCU_NOCB_WAKE_FORCE. Or is that code buggy
somehow?

Actually your change is modifying the timer delay without changing the timer
mode, which may shortcut the rcu_nocb_flush_deferred_wakeup() check and
actually make the wakeup fire early upon idle-loop entry.

Or am I missing something?
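
For reference, the idle-entry check in question boils down to roughly the
following (a paraphrased sketch, not a verbatim copy of tree_nocb.h); only
deferral levels of RCU_NOCB_WAKE and above trigger the wakeup:

	if (READ_ONCE(rdp_gp->nocb_defer_wakeup) >= RCU_NOCB_WAKE)
		wake_nocb_gp(rdp, false);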


> 
> I agree its not pretty, but it works and I could not find any code path where it
> does not work. That said, I am open to ideas for changing this and perhaps some
> of these unneeded delays with bypass CBs can be split into separate patches.
> 
> Regarding your point about nocb_defer_wakeup state diverging from the timer
> programming, that happens anyway here in current code:
> 
>  283        } else {
>  284                if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
>  285                        mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
>  286                if (rdp_gp->nocb_defer_wakeup < waketype)
>  287                        WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>  288        }

How so?

> >> @@ -705,12 +816,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
> >>  	my_rdp->nocb_gp_gp = needwait_gp;
> >>  	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
> >>  
> >> -	if (bypass && !rcu_nocb_poll) {
> >> -		// At least one child with non-empty ->nocb_bypass, so set
> >> -		// timer in order to avoid stranding its callbacks.
> >> -		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
> >> -				   TPS("WakeBypassIsDeferred"));
> >> +	// At least one child with non-empty ->nocb_bypass, so set
> >> +	// timer in order to avoid stranding its callbacks.
> >> +	if (!rcu_nocb_poll) {
> >> +		// If bypass list only has lazy CBs. Add a deferred
> >> +		// lazy wake up.
> >> +		if (lazy && !bypass) {
> >> +			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
> >> +					TPS("WakeLazyIsDeferred"));
> > 
> > What if:
> > 
> > 1) rdp(1) has lazy callbacks
> > 2) all other rdp's have no callback at all
> > 3) nocb_gp_wait() runs through all rdp's, everything is handled, except for
> >    these lazy callbacks
> > 4) It reaches the above path, ready to arm the RCU_NOCB_WAKE_LAZY timer,
> >    but it hasn't yet called wake_nocb_gp_defer()
> > 5) Oh but rdp(2) queues a non-lazy callback. interrupts are disabled so it defers
> >    the wake up to nocb_gp_wait() with arming the timer in RCU_NOCB_WAKE.
> > 6) nocb_gp_wait() finally calls wake_nocb_gp_defer() and override the timeout
> >    to several seconds ahead.
> > 7) No more callbacks queued, the non-lazy callback will have to wait several
> >    seconds to complete.
> > 
> > Or did I miss something?
> 
> In theory, I can see this being an issue. In practice, I have not seen it to
> be.

What matters is that the issue looks plausible.

> In my view, the nocb GP thread should not go to sleep in the first place if
> there are any non-bypass CBs being queued. If it does, then that seems an
> optimization-related bug.

Yeah, it's a constraint introduced by the optimized delayed wake up.
But why would RCU_NOCB_WAKE_LAZY need to overwrite RCU_NOCB_WAKE in the first
place?

> 
> That said, we can make wake_nocb_gp_defer() more robust perhaps by making it not
> overwrite the timer if the wake-type requested is weaker than RCU_NOCB_WAKE,
> however that should not cause the going-into-idle issues I pointed. Whether the
> idle time issue will happen, I have no idea. But in theory, will that address
> your concern above?

Yes, but I'm still confused by this idle time issue.

> 
> > Note that the race exists with RCU_NOCB_WAKE_BYPASS
> > but it's only about one jiffy delay, not seconds.
> 
> Well, 2 jiffies. But yeah.
> 
> Thanks, so far I do not see anything that cannot be fixed on top of this patch
> but you raised some good points. Maybe we ought to rewrite the idle path to not
> disturb lazy CBs in a different way, or something (while keeping the timer state
> consistent with the programming of the timer in wake_nocb_gp_defer()).

I'd rather see an updated patch (not the whole patchset but just this one) rather
than deltas, just to make sure I'm not missing something in the whole picture.

Thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-05 12:59       ` Frederic Weisbecker
@ 2022-09-05 20:18         ` Joel Fernandes
  2022-09-05 20:32         ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-05 20:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

Hi Frederic,

On 9/5/2022 8:59 AM, Frederic Weisbecker wrote:
>>> Also that's a subtle change which purpose isn't explained. It means that
>>> rcu_barrier_entrain() used to wait for the bypass timer in the worst case
>>> but now we force rcuog into it immediately. Should that be a separate change?
>> It could be split, but it is laziness that amplifies the issue so I thought of
>> keeping it in the same patch. I don't mind one way or the other.
> Ok then lets keep it here but please add a comment for the reason to
> force wake here.

Ok will do, thanks.

>>>> +	case RCU_NOCB_WAKE_BYPASS:
>>>> +		mod_jif = 2;
>>>> +		break;
>>>> +
>>>> +	case RCU_NOCB_WAKE:
>>>> +	case RCU_NOCB_WAKE_FORCE:
>>>> +		// For these, make it wake up the soonest if we
>>>> +		// were in a bypass or lazy sleep before.
>>>>  		if (rdp_gp->nocb_defer_wakeup < RCU_NOCB_WAKE)
>>>> -			mod_timer(&rdp_gp->nocb_timer, jiffies + 1);
>>>> -		if (rdp_gp->nocb_defer_wakeup < waketype)
>>>> -			WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>>>> +			mod_jif = 1;
>>>> +		break;
>>>>  	}
>>>>  
>>>> +	if (mod_jif)
>>>> +		mod_timer(&rdp_gp->nocb_timer, jiffies + mod_jif);
>>>> +
>>>> +	if (rdp_gp->nocb_defer_wakeup < waketype)
>>>> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>>> So RCU_NOCB_WAKE_BYPASS and RCU_NOCB_WAKE_LAZY don't override the timer state
>>> anymore? Looks like something is missing.
>> My goal was to make sure that NOCB_WAKE_LAZY wake keeps the timer lazy. If I
>> don't do this, then when CPU enters idle, it will immediately do a wake up via
>> this call:
>>
>> 	rcu_nocb_need_deferred_wakeup(rdp_gp, RCU_NOCB_WAKE)
> But if the timer is in RCU_NOCB_WAKE_LAZY mode, that shouldn't be a problem.
> 
>> That was almost always causing lazy CBs to be non-lazy thus negating all the
>> benefits.
>>
>> Note that bypass will also have same issue where the bypass CB will not wait for
>> intended bypass duration. To make it consistent with lazy, I made bypass also
>> not override nocb_defer_wakeup.
> I'm surprised because rcu_nocb_flush_deferred_wakeup() should only do the wake up
> if the timer is RCU_NOCB_WAKE or RCU_NOCB_WAKE_FORCE. Or is that code buggy
> somehow?
> Actually your change is modifying the timer delay without changing the timer
> mode, which may shortcut rcu_nocb_flush_deferred_wakeup() check and actually
> make it perform early upon idle loop entry.
> 
> Or am I missing something?
> 

You could very well have a point and I am not sure now (I happened to 'forget' the
issue once the code was working). I distinctly remember not being able to stay
lazy without doing this. Maybe there is some other path. I am kicking myself for
not commenting enough in the code or changelog about the issue.

I will test again sync'ing the lazy timer and the ->nocb_defer_wakeup field
properly and see if I can trigger the issue.

Thanks!

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-04 21:01       ` Frederic Weisbecker
@ 2022-09-05 20:20         ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-05 20:20 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

Hi Frederic,

On 9/4/2022 5:01 PM, Frederic Weisbecker wrote:
> On Sat, Sep 03, 2022 at 10:00:29PM +0000, Joel Fernandes wrote:
>> On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
>> +
>> +	raw_spin_lock_irqsave(&my_rdp->nocb_gp_lock, flags);
> 
> This is locking during the whole group iteration potentially contending call_rcu(),
> idle loop, resume to userspace on all rdp in the group...
> 
> How about not overwriting timers instead and only set the RCU_NOCB_WAKE_LAZY
> timer when it is previously in RCU_NOCB_WAKE_NOT state? After all if the timer is armed,
> it's because we have regular callbacks queued and thus we don't need to wait
> before processing the lazy callbacks since we are going to start a grace period
> anyway.
> 
> Thanks.

I like your idea better. I think it will work well to resolve the race you
described.

Thanks!

 - Joel


> 
> 
>>  	list_for_each_entry_rcu(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp, 1) {
>>  		bool needwake_state = false;
>>  		bool flush_bypass = false;
>> @@ -855,14 +870,15 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
>>  		// If bypass list only has lazy CBs. Add a deferred
>>  		// lazy wake up.
>>  		if (lazy && !bypass) {
>> -			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
>> +			wake_nocb_gp_defer_locked(my_rdp, RCU_NOCB_WAKE_LAZY,
>>  					TPS("WakeLazyIsDeferred"));
>>  		// Otherwise add a deferred bypass wake up.
>>  		} else if (bypass) {
>> -			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
>> +			wake_nocb_gp_defer_locked(my_rdp, RCU_NOCB_WAKE_BYPASS,
>>  					TPS("WakeBypassIsDeferred"));
>>  		}
>>  	}
>> +	raw_spin_unlock_irqrestore(&my_rdp->nocb_gp_lock, flags);
>>  
>>  	if (rcu_nocb_poll) {
>>  		/* Polling, so trace if first poll in the series. */
>> -- 
>> 2.37.2.789.g6183377224-goog
>>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-05 12:59       ` Frederic Weisbecker
  2022-09-05 20:18         ` Joel Fernandes
@ 2022-09-05 20:32         ` Joel Fernandes
  2022-09-06  8:55           ` Paul E. McKenney
  1 sibling, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-05 20:32 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/5/2022 8:59 AM, Frederic Weisbecker wrote:
> I'd rather see an updated patch (not the whole patchset but just this one) rather
> than deltas, just to make sure I'm not missing something in the whole picture.
> 
> Thanks.

There is also the previous patch, which needs the fix-up you suggested.

I will reply to the individual patches with in-line patches (for you), as well as
refresh the whole series, since it might be easier for Paul to apply them together.

thanks,

- Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-02 15:21   ` Frederic Weisbecker
  2022-09-02 23:09     ` Joel Fernandes
  2022-09-03 22:00     ` Joel Fernandes
@ 2022-09-06  3:05     ` Joel Fernandes
  2022-09-06 15:17       ` Frederic Weisbecker
  2022-09-06 18:16       ` Uladzislau Rezki
  2 siblings, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06  3:05 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
> On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
> > Implement timer-based RCU lazy callback batching. The batch is flushed
> > whenever a certain amount of time has passed, or the batch on a
> > particular CPU grows too big. Also memory pressure will flush it in a
> > future patch.
> > 
> > To handle several corner cases automagically (such as rcu_barrier() and
> > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> > length has the lazy CB length included in it. A separate lazy CB length
> > counter is also introduced to keep track of the number of lazy CBs.
> > 
> > Suggested-by: Paul McKenney <paulmck@kernel.org>
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > ---

Here is the updated version of this patch for further testing and review.
Paul, you could consider updating your test branch. I have tested it both on
ChromeOS and with rcuscale. The laziness and boot times are looking good.
There was at least one bug that I fixed that got introduced with the move of
the length field to rcu_data.  Thanks a lot, Frederic, for the review
comments.

I will look at the rcu torture issue next... I suspect the length field issue
may have been causing it.

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH v6] rcu: Introduce call_rcu_lazy() API implementation

Implement timer-based RCU lazy callback batching. The batch is flushed
whenever a certain amount of time has passed, or the batch on a
particular CPU grows too big. Also memory pressure will flush it in a
future patch.

To handle several corner cases automagically (such as rcu_barrier() and
hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
length has the lazy CB length included in it. A separate lazy CB length
counter is also introduced to keep track of the number of lazy CBs.

v5->v6:

[ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
  deferral levels wake much earlier so for those it is not needed. ]

[ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]

[ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]

Suggested-by: Paul McKenney <paulmck@kernel.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 include/linux/rcupdate.h |   6 ++
 kernel/rcu/Kconfig       |   8 ++
 kernel/rcu/rcu.h         |  11 +++
 kernel/rcu/tree.c        | 133 +++++++++++++++++++----------
 kernel/rcu/tree.h        |  17 +++-
 kernel/rcu/tree_nocb.h   | 175 ++++++++++++++++++++++++++++++++-------
 6 files changed, 269 insertions(+), 81 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 08605ce7379d..82e8a07e0856 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
 
 #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
 
+#ifdef CONFIG_RCU_LAZY
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
+#else
+#define call_rcu_lazy(head, func) call_rcu(head, func)
+#endif
+
 /* Internal to kernel */
 void rcu_init(void);
 extern int rcu_scheduler_active;
diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
index d471d22a5e21..3128d01427cb 100644
--- a/kernel/rcu/Kconfig
+++ b/kernel/rcu/Kconfig
@@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
 	  Say N here if you hate read-side memory barriers.
 	  Take the default if you are unsure.
 
+config RCU_LAZY
+	bool "RCU callback lazy invocation functionality"
+	depends on RCU_NOCB_CPU
+	default n
+	help
+	  To save power, batch RCU callbacks and flush after delay, memory
+	  pressure or callback list growing too big.
+
 endmenu # "RCU Subsystem"
diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index be5979da07f5..94675f14efe8 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -474,6 +474,14 @@ enum rcutorture_type {
 	INVALID_RCU_FLAVOR
 };
 
+#if defined(CONFIG_RCU_LAZY)
+unsigned long rcu_lazy_get_jiffies_till_flush(void);
+void rcu_lazy_set_jiffies_till_flush(unsigned long j);
+#else
+static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
+static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
+#endif
+
 #if defined(CONFIG_TREE_RCU)
 void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
 			    unsigned long *gp_seq);
@@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 			       unsigned long c_old,
 			       unsigned long c);
 void rcu_gp_set_torture_wait(int duration);
+void rcu_force_call_rcu_to_lazy(bool force);
+
 #else
 static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
 					  int *flags, unsigned long *gp_seq)
@@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
 	do { } while (0)
 #endif
 static inline void rcu_gp_set_torture_wait(int duration) { }
+static inline void rcu_force_call_rcu_to_lazy(bool force) { }
 #endif
 
 #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9fe581be8696..dbd25b8c080e 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
 	raw_spin_unlock_rcu_node(rnp);
 }
 
-/**
- * call_rcu() - Queue an RCU callback for invocation after a grace period.
- * @head: structure to be used for queueing the RCU updates.
- * @func: actual callback function to be invoked after the grace period
- *
- * The callback function will be invoked some time after a full grace
- * period elapses, in other words after all pre-existing RCU read-side
- * critical sections have completed.  However, the callback function
- * might well execute concurrently with RCU read-side critical sections
- * that started after call_rcu() was invoked.
- *
- * RCU read-side critical sections are delimited by rcu_read_lock()
- * and rcu_read_unlock(), and may be nested.  In addition, but only in
- * v5.0 and later, regions of code across which interrupts, preemption,
- * or softirqs have been disabled also serve as RCU read-side critical
- * sections.  This includes hardware interrupt handlers, softirq handlers,
- * and NMI handlers.
- *
- * Note that all CPUs must agree that the grace period extended beyond
- * all pre-existing RCU read-side critical section.  On systems with more
- * than one CPU, this means that when "func()" is invoked, each CPU is
- * guaranteed to have executed a full memory barrier since the end of its
- * last RCU read-side critical section whose beginning preceded the call
- * to call_rcu().  It also means that each CPU executing an RCU read-side
- * critical section that continues beyond the start of "func()" must have
- * executed a memory barrier after the call_rcu() but before the beginning
- * of that RCU read-side critical section.  Note that these guarantees
- * include CPUs that are offline, idle, or executing in user mode, as
- * well as CPUs that are executing in the kernel.
- *
- * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
- * resulting RCU callback function "func()", then both CPU A and CPU B are
- * guaranteed to execute a full memory barrier during the time interval
- * between the call to call_rcu() and the invocation of "func()" -- even
- * if CPU A and CPU B are the same CPU (but again only if the system has
- * more than one CPU).
- *
- * Implementation of these memory-ordering guarantees is described here:
- * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
- */
-void call_rcu(struct rcu_head *head, rcu_callback_t func)
+static void
+__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
 {
 	static atomic_t doublefrees;
 	unsigned long flags;
@@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		trace_rcu_callback(rcu_state.name, head,
 				   rcu_segcblist_n_cbs(&rdp->cblist));
 
-	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
+	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
 		return; // Enqueued onto ->nocb_bypass, so just leave.
 	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
 	rcu_segcblist_enqueue(&rdp->cblist, head);
@@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 		local_irq_restore(flags);
 	}
 }
-EXPORT_SYMBOL_GPL(call_rcu);
 
+#ifdef CONFIG_RCU_LAZY
+/**
+ * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.
+ *
+ * Use this API instead of call_rcu() if you don't mind the callback being
+ * invoked after very long periods of time on systems without memory pressure
+ * and on systems which are lightly loaded or mostly idle.
+ *
+ * Other than the extra delay in callbacks being invoked, this function is
+ * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
+ * details about memory ordering and other functionality.
+ */
+void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, true);
+}
+EXPORT_SYMBOL_GPL(call_rcu_lazy);
+#endif
+
+static bool force_call_rcu_to_lazy;
+
+void rcu_force_call_rcu_to_lazy(bool force)
+{
+	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
+		WRITE_ONCE(force_call_rcu_to_lazy, force);
+}
+EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
+
+/**
+ * call_rcu() - Queue an RCU callback for invocation after a grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual callback function to be invoked after the grace period
+ *
+ * The callback function will be invoked some time after a full grace
+ * period elapses, in other words after all pre-existing RCU read-side
+ * critical sections have completed.  However, the callback function
+ * might well execute concurrently with RCU read-side critical sections
+ * that started after call_rcu() was invoked.
+ *
+ * RCU read-side critical sections are delimited by rcu_read_lock()
+ * and rcu_read_unlock(), and may be nested.  In addition, but only in
+ * v5.0 and later, regions of code across which interrupts, preemption,
+ * or softirqs have been disabled also serve as RCU read-side critical
+ * sections.  This includes hardware interrupt handlers, softirq handlers,
+ * and NMI handlers.
+ *
+ * Note that all CPUs must agree that the grace period extended beyond
+ * all pre-existing RCU read-side critical section.  On systems with more
+ * than one CPU, this means that when "func()" is invoked, each CPU is
+ * guaranteed to have executed a full memory barrier since the end of its
+ * last RCU read-side critical section whose beginning preceded the call
+ * to call_rcu().  It also means that each CPU executing an RCU read-side
+ * critical section that continues beyond the start of "func()" must have
+ * executed a memory barrier after the call_rcu() but before the beginning
+ * of that RCU read-side critical section.  Note that these guarantees
+ * include CPUs that are offline, idle, or executing in user mode, as
+ * well as CPUs that are executing in the kernel.
+ *
+ * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
+ * resulting RCU callback function "func()", then both CPU A and CPU B are
+ * guaranteed to execute a full memory barrier during the time interval
+ * between the call to call_rcu() and the invocation of "func()" -- even
+ * if CPU A and CPU B are the same CPU (but again only if the system has
+ * more than one CPU).
+ *
+ * Implementation of these memory-ordering guarantees is described here:
+ * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
+ */
+void call_rcu(struct rcu_head *head, rcu_callback_t func)
+{
+	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
 
 /* Maximum number of jiffies to wait before draining a batch. */
 #define KFREE_DRAIN_JIFFIES (5 * HZ)
@@ -3904,7 +3943,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
 	rdp->barrier_head.func = rcu_barrier_callback;
 	debug_rcu_head_queue(&rdp->barrier_head);
 	rcu_nocb_lock(rdp);
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+        /*
+         * Flush the bypass list, but also wake up the GP thread as otherwise
+         * bypass/lazy CBs may not be noticed, and can cause really long delays!
+         */
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
 	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
 		atomic_inc(&rcu_state.barrier_cpu_count);
 	} else {
@@ -4325,7 +4368,7 @@ void rcutree_migrate_callbacks(int cpu)
 	my_rdp = this_cpu_ptr(&rcu_data);
 	my_rnp = my_rdp->mynode;
 	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
 	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
 	/* Leverage recent GPs and set GP for new callbacks. */
 	needwake = rcu_advance_cbs(my_rnp, rdp) ||
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index d4a97e40ea9c..361c41d642c7 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -263,14 +263,16 @@ struct rcu_data {
 	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
 	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
 
+	long lazy_len;			/* Length of buffered lazy callbacks. */
 	int cpu;
 };
 
 /* Values for nocb_defer_wakeup field in struct rcu_data. */
 #define RCU_NOCB_WAKE_NOT	0
 #define RCU_NOCB_WAKE_BYPASS	1
-#define RCU_NOCB_WAKE		2
-#define RCU_NOCB_WAKE_FORCE	3
+#define RCU_NOCB_WAKE_LAZY	2
+#define RCU_NOCB_WAKE		3
+#define RCU_NOCB_WAKE_FORCE	4
 
 #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
 					/* For jiffies_till_first_fqs and */
@@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
 static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
 static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
 static void rcu_init_one_nocb(struct rcu_node *rnp);
+
+#define FLUSH_BP_NONE 0
+/* Is the CB being enqueued after the flush, a lazy CB? */
+#define FLUSH_BP_LAZY BIT(0)
+/* Wake up nocb-GP thread after flush? */
+#define FLUSH_BP_WAKE BIT(1)
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j);
+				  unsigned long j, unsigned long flush_flags);
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags);
+				bool *was_alldone, unsigned long flags,
+				bool lazy);
 static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
 				 unsigned long flags);
 static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 4dc86274b3e8..b201606f7c4f 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
 	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
 }
 
+/*
+ * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
+ * can elapse before lazy callbacks are flushed. Lazy callbacks
+ * could be flushed much earlier for a number of other reasons;
+ * however, LAZY_FLUSH_JIFFIES ensures that no lazy callbacks are
+ * left unsubmitted to RCU after that many jiffies.
+ */
+#define LAZY_FLUSH_JIFFIES (10 * HZ)
+unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
+
+#ifdef CONFIG_RCU_LAZY
+// To be called only from test code.
+void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
+{
+	jiffies_till_flush = jif;
+}
+EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
+
+unsigned long rcu_lazy_get_jiffies_till_flush(void)
+{
+	return jiffies_till_flush;
+}
+EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
+#endif
+
 /*
  * Arrange to wake the GP kthread for this NOCB group at some future
  * time when it is safe to do so.
@@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
 	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
 
 	/*
-	 * Bypass wakeup overrides previous deferments. In case
-	 * of callback storm, no need to wake up too early.
+	 * Bypass wakeup overrides previous deferments. In case of
+	 * callback storm, no need to wake up too early.
 	 */
-	if (waketype == RCU_NOCB_WAKE_BYPASS) {
+	if (waketype == RCU_NOCB_WAKE_LAZY
+		&& READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
+		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
+		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
+	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
 		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
 		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
 	} else {
@@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
  * proves to be initially empty, just return false because the no-CB GP
  * kthread may need to be awakened in this case.
  *
+ * Return true if there was something to be flushed and it succeeded, otherwise
+ * false.
+ *
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				     unsigned long j)
+				     unsigned long j, unsigned long flush_flags)
 {
 	struct rcu_cblist rcl;
+	bool lazy = flush_flags & FLUSH_BP_LAZY;
 
 	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
 	rcu_lockdep_assert_cblist_protected(rdp);
@@ -310,7 +343,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	/* Note: ->cblist.len already accounts for ->nocb_bypass contents. */
 	if (rhp)
 		rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
-	rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+
+	/*
+	 * If the new CB requested was a lazy one, queue it onto the main
+	 * ->cblist so we can take advantage of a sooner grace period.
+	 */
+	if (lazy && rhp) {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, NULL);
+		rcu_cblist_enqueue(&rcl, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	} else {
+		rcu_cblist_flush_enqueue(&rcl, &rdp->nocb_bypass, rhp);
+		WRITE_ONCE(rdp->lazy_len, 0);
+	}
+
 	rcu_segcblist_insert_pend_cbs(&rdp->cblist, &rcl);
 	WRITE_ONCE(rdp->nocb_bypass_first, j);
 	rcu_nocb_bypass_unlock(rdp);
@@ -326,13 +372,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
  * Note that this function always returns true if rhp is NULL.
  */
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, unsigned long flush_flags)
 {
+	bool ret;
+
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
+	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
+
+	if (flush_flags & FLUSH_BP_WAKE)
+		wake_nocb_gp(rdp, true);
+
+	return ret;
 }
 
 /*
@@ -345,7 +398,7 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
 	if (!rcu_rdp_is_offloaded(rdp) ||
 	    !rcu_nocb_bypass_trylock(rdp))
 		return;
-	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j));
+	WARN_ON_ONCE(!rcu_nocb_do_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE));
 }
 
 /*
@@ -367,12 +420,14 @@ static void rcu_nocb_try_flush_bypass(struct rcu_data *rdp, unsigned long j)
  * there is only one CPU in operation.
  */
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags,
+				bool lazy)
 {
 	unsigned long c;
 	unsigned long cur_gp_seq;
 	unsigned long j = jiffies;
 	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	bool bypass_is_lazy = (ncbs == READ_ONCE(rdp->lazy_len));
 
 	lockdep_assert_irqs_disabled();
 
@@ -417,25 +472,30 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// If there hasn't yet been all that many ->cblist enqueues
 	// this jiffy, tell the caller to enqueue onto ->cblist.  But flush
 	// ->nocb_bypass first.
-	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy) {
+	// Lazy CBs throttle this back and do immediate bypass queuing.
+	if (rdp->nocb_nobypass_count < nocb_nobypass_lim_per_jiffy && !lazy) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 		if (*was_alldone)
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstQ"));
-		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j));
+
+		WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, j, FLUSH_BP_NONE));
 		WARN_ON_ONCE(rcu_cblist_n_cbs(&rdp->nocb_bypass));
 		return false; // Caller must enqueue the callback.
 	}
 
 	// If ->nocb_bypass has been used too long or is too full,
 	// flush ->nocb_bypass to ->cblist.
-	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	if ((ncbs && !bypass_is_lazy && j != READ_ONCE(rdp->nocb_bypass_first)) ||
+	    (ncbs &&  bypass_is_lazy &&
+		(time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush))) ||
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
 		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 
-		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
+		if (!rcu_nocb_flush_bypass(rdp, rhp, j,
+					   lazy ? FLUSH_BP_LAZY : FLUSH_BP_NONE)) {
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 						    TPS("FirstQ"));
@@ -461,16 +521,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	// We need to use the bypass.
 	rcu_nocb_wait_contended(rdp);
 	rcu_nocb_bypass_lock(rdp);
+
 	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
 	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
 	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
+
+	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
+		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
+
 	if (!ncbs) {
 		WRITE_ONCE(rdp->nocb_bypass_first, j);
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
 	}
+
 	rcu_nocb_bypass_unlock(rdp);
 	smp_mb(); /* Order enqueue before wake. */
-	if (ncbs) {
+
+	// We had CBs in the bypass list before. There is nothing else to do if:
+	// There were only non-lazy CBs before, in this case, the bypass timer
+	// or GP-thread will handle the CBs including any new lazy ones.
+	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
+	// case the old lazy timer would have been setup. When that expires,
+	// the new lazy one will be handled.
+	if (ncbs && (!bypass_is_lazy || lazy)) {
 		local_irq_restore(flags);
 	} else {
 		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
@@ -479,6 +552,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstBQwake"));
 			__call_rcu_nocb_wake(rdp, true, flags);
+		} else if (bypass_is_lazy && !lazy) {
+			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
+					    TPS("FirstBQwakeLazy2Non"));
+			__call_rcu_nocb_wake(rdp, true, flags);
 		} else {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 					    TPS("FirstBQnoWake"));
@@ -500,7 +577,7 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 {
 	unsigned long cur_gp_seq;
 	unsigned long j;
-	long len;
+	long len, lazy_len, bypass_len;
 	struct task_struct *t;
 
 	// If we are being polled or there is no kthread, just leave.
@@ -513,9 +590,16 @@ static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_alldone,
 	}
 	// Need to actually to a wakeup.
 	len = rcu_segcblist_n_cbs(&rdp->cblist);
+	bypass_len = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+	lazy_len = READ_ONCE(rdp->lazy_len);
 	if (was_alldone) {
 		rdp->qlen_last_fqs_check = len;
-		if (!irqs_disabled_flags(flags)) {
+		// Only lazy CBs in bypass list
+		if (lazy_len && bypass_len == lazy_len) {
+			rcu_nocb_unlock_irqrestore(rdp, flags);
+			wake_nocb_gp_defer(rdp, RCU_NOCB_WAKE_LAZY,
+					   TPS("WakeLazy"));
+		} else if (!irqs_disabled_flags(flags)) {
 			/* ... if queue was empty ... */
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			wake_nocb_gp(rdp, false);
@@ -605,8 +689,8 @@ static void nocb_gp_sleep(struct rcu_data *my_rdp, int cpu)
  */
 static void nocb_gp_wait(struct rcu_data *my_rdp)
 {
-	bool bypass = false;
-	long bypass_ncbs;
+	bool bypass = false, lazy = false;
+	long bypass_ncbs, lazy_ncbs;
 	int __maybe_unused cpu = my_rdp->cpu;
 	unsigned long cur_gp_seq;
 	unsigned long flags;
@@ -641,24 +725,41 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	 * won't be ignored for long.
 	 */
 	list_for_each_entry(rdp, &my_rdp->nocb_head_rdp, nocb_entry_rdp) {
+		bool flush_bypass = false;
+
 		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("Check"));
 		rcu_nocb_lock_irqsave(rdp, flags);
 		lockdep_assert_held(&rdp->nocb_lock);
 		bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
-		if (bypass_ncbs &&
+		lazy_ncbs = READ_ONCE(rdp->lazy_len);
+
+		if (bypass_ncbs && (lazy_ncbs == bypass_ncbs) &&
+		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + jiffies_till_flush) ||
+		     bypass_ncbs > 2 * qhimark)) {
+			flush_bypass = true;
+		} else if (bypass_ncbs && (lazy_ncbs != bypass_ncbs) &&
 		    (time_after(j, READ_ONCE(rdp->nocb_bypass_first) + 1) ||
 		     bypass_ncbs > 2 * qhimark)) {
-			// Bypass full or old, so flush it.
-			(void)rcu_nocb_try_flush_bypass(rdp, j);
-			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			flush_bypass = true;
 		} else if (!bypass_ncbs && rcu_segcblist_empty(&rdp->cblist)) {
 			rcu_nocb_unlock_irqrestore(rdp, flags);
 			continue; /* No callbacks here, try next. */
 		}
+
+		if (flush_bypass) {
+			// Bypass full or old, so flush it.
+			(void)rcu_nocb_try_flush_bypass(rdp, j);
+			bypass_ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
+			lazy_ncbs = READ_ONCE(rdp->lazy_len);
+		}
+
 		if (bypass_ncbs) {
 			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
-					    TPS("Bypass"));
-			bypass = true;
+				    bypass_ncbs == lazy_ncbs ? TPS("Lazy") : TPS("Bypass"));
+			if (bypass_ncbs == lazy_ncbs)
+				lazy = true;
+			else
+				bypass = true;
 		}
 		rnp = rdp->mynode;
 
@@ -706,12 +807,21 @@ static void nocb_gp_wait(struct rcu_data *my_rdp)
 	my_rdp->nocb_gp_gp = needwait_gp;
 	my_rdp->nocb_gp_seq = needwait_gp ? wait_gp_seq : 0;
 
-	if (bypass && !rcu_nocb_poll) {
-		// At least one child with non-empty ->nocb_bypass, so set
-		// timer in order to avoid stranding its callbacks.
-		wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
-				   TPS("WakeBypassIsDeferred"));
+	// At least one child with non-empty ->nocb_bypass, so set
+	// timer in order to avoid stranding its callbacks.
+	if (!rcu_nocb_poll) {
+		// If the bypass list only has lazy CBs, add a deferred
+		// lazy wake up.
+		if (lazy && !bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_LAZY,
+					TPS("WakeLazyIsDeferred"));
+		// Otherwise add a deferred bypass wake up.
+		} else if (bypass) {
+			wake_nocb_gp_defer(my_rdp, RCU_NOCB_WAKE_BYPASS,
+					TPS("WakeBypassIsDeferred"));
+		}
 	}
+
 	if (rcu_nocb_poll) {
 		/* Polling, so trace if first poll in the series. */
 		if (gotcbs)
@@ -1037,7 +1147,7 @@ static long rcu_nocb_rdp_deoffload(void *arg)
 	 * return false, which means that future calls to rcu_nocb_try_bypass()
 	 * will refuse to put anything into the bypass.
 	 */
-	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
+	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_NONE));
 	/*
 	 * Start with invoking rcu_core() early. This way if the current thread
 	 * happens to preempt an ongoing call to rcu_core() in the middle,
@@ -1291,6 +1401,7 @@ static void __init rcu_boot_init_nocb_percpu_data(struct rcu_data *rdp)
 	raw_spin_lock_init(&rdp->nocb_gp_lock);
 	timer_setup(&rdp->nocb_timer, do_nocb_deferred_wakeup_timer, 0);
 	rcu_cblist_init(&rdp->nocb_bypass);
+	WRITE_ONCE(rdp->lazy_len, 0);
 	mutex_init(&rdp->nocb_gp_kthread_mutex);
 }
 
@@ -1572,13 +1683,13 @@ static void rcu_init_one_nocb(struct rcu_node *rnp)
 }
 
 static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				  unsigned long j)
+				  unsigned long j, unsigned long flush_flags)
 {
 	return true;
 }
 
 static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
-				bool *was_alldone, unsigned long flags)
+				bool *was_alldone, unsigned long flags, bool lazy)
 {
 	return false;
 }
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-01 22:17 ` [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens Joel Fernandes (Google)
  2022-09-02 11:35   ` Frederic Weisbecker
  2022-09-03 14:04   ` Paul E. McKenney
@ 2022-09-06  3:07   ` Joel Fernandes
  2022-09-06  9:48     ` Frederic Weisbecker
  2 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06  3:07 UTC (permalink / raw)
  To: rcu
  Cc: linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10, frederic,
	paulmck, rostedt, vineeth, boqun.feng

On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
> When the bypass cblist gets too big or its timeout has occurred, it is
> flushed into the main cblist. However, the bypass timer is still running
> and the behavior is that it would eventually expire and wake the GP
> thread.
> 
> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
> soon as the flush happens. Otherwise, the lazy-timer will go off much
> later and the now-non-lazy cblist CBs can get stranded for the duration
> of the timer.
> 
> This is a good thing to do anyway (regardless of this series), since it
> makes the behavior consistent with behavior of other code paths where queueing
> something into the ->cblist makes the GP kthread in a non-sleeping state
> quickly.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---

Here is the updated version of this patch after review comments from
Frederic:

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH v6] rcu: Fix late wakeup when flush of bypass cblist happens

When the bypass cblist gets too big or its timeout has occurred, it is
flushed into the main cblist. However, the bypass timer is still running
and the behavior is that it would eventually expire and wake the GP
thread.

Since we are going to use the bypass cblist for lazy CBs, do the wakeup
as soon as the flush happens. Otherwise, the lazy timer will go off much
later and the now-non-lazy cblist CBs can get stranded for the duration
of the timer.

This is a good thing to do anyway (regardless of this series), since it
makes the behavior consistent with that of other code paths where queueing
something onto the ->cblist quickly puts the GP kthread into a non-sleeping
state.

[ Frederic Weisbec: changes to not wake up the GP thread unless needed ].

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index 0a5f0ef41484..4dc86274b3e8 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -433,8 +433,9 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
 	    ncbs >= qhimark) {
 		rcu_nocb_lock(rdp);
+		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+
 		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
-			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
 			if (*was_alldone)
 				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
 						    TPS("FirstQ"));
@@ -447,7 +448,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 			rcu_advance_cbs_nowake(rdp->mynode, rdp);
 			rdp->nocb_gp_adv_time = j;
 		}
-		rcu_nocb_unlock_irqrestore(rdp, flags);
+
+		// The flush succeeded and we moved CBs into the ->cblist.
+		// However, the bypass timer might still be running. Wakeup the
+		// GP thread by calling a helper with was_all_done set so that
+		// wake up happens (needed if main CB list was empty before).
+		__call_rcu_nocb_wake(rdp, *was_alldone, flags);
+
 		return true; // Callback already enqueued.
 	}
 
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-05 20:32         ` Joel Fernandes
@ 2022-09-06  8:55           ` Paul E. McKenney
  2022-09-06 16:16             ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-06  8:55 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, rcu, linux-kernel, rushikesh.s.kadam,
	urezki, neeraj.iitr10, rostedt, vineeth, boqun.feng

On Mon, Sep 05, 2022 at 04:32:26PM -0400, Joel Fernandes wrote:
> 
> 
> On 9/5/2022 8:59 AM, Frederic Weisbecker wrote:
> > I'd rather see an updated patch (not the whole patchset but just this one) rather
> > than deltas, just to make sure I'm not missing something in the whole picture.
> > 
> > Thanks.
> 
> There is also the previous patch which needs a fix up you suggested.
> 
> I will reply to individual patches with in-line patch (for you), as well as
> refreshing whole series since it might be easier for Paul to apply them together.

To say nothing of the greater probability that the result will match
your intent.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-06  3:07   ` Joel Fernandes
@ 2022-09-06  9:48     ` Frederic Weisbecker
  2022-09-07  2:43       ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-06  9:48 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 03:07:05AM +0000, Joel Fernandes wrote:
> On Thu, Sep 01, 2022 at 10:17:06PM +0000, Joel Fernandes (Google) wrote:
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Subject: [PATCH v6] rcu: Fix late wakeup when flush of bypass cblist happens
> 
> When the bypass cblist gets too big or its timeout has occurred, it is
> flushed into the main cblist. However, the bypass timer is still running
> and the behavior is that it would eventually expire and wake the GP
> thread.
> 
> Since we are going to use the bypass cblist for lazy CBs, do the wakeup
> soon as the flush happens. Otherwise, the lazy-timer will go off much
> later and the now-non-lazy cblist CBs can get stranded for the duration
> of the timer.
> 
> This is a good thing to do anyway (regardless of this series), since it
> makes the behavior consistent with behavior of other code paths where queueing
> something into the ->cblist makes the GP kthread in a non-sleeping state
> quickly.
> 
> [ Frederic Weisbec: changes to not do wake up GP thread unless needed ].
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree_nocb.h | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 0a5f0ef41484..4dc86274b3e8 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -433,8 +433,9 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	if ((ncbs && j != READ_ONCE(rdp->nocb_bypass_first)) ||
>  	    ncbs >= qhimark) {
>  		rcu_nocb_lock(rdp);
> +		*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> +
>  		if (!rcu_nocb_flush_bypass(rdp, rhp, j)) {
> -			*was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>  			if (*was_alldone)
>  				trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  						    TPS("FirstQ"));
> @@ -447,7 +448,13 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  			rcu_advance_cbs_nowake(rdp->mynode, rdp);
>  			rdp->nocb_gp_adv_time = j;
>  		}
> -		rcu_nocb_unlock_irqrestore(rdp, flags);
> +
> +		// The flush succeeded and we moved CBs into the ->cblist.
> +		// However, the bypass timer might still be running. Wakeup the

That part of the comment mentioning the bypass timer looks strange...


> +		// GP thread by calling a helper with was_all_done set so that
> +		// wake up happens (needed if main CB list was empty before).

How about:

                // The flush succeeded and we moved CBs into the regular list.
		// Don't wait for the wake up timer as it may be too far ahead.
		// Wake up now GP thread instead if the cblist was empty

Thanks.

Other than that the patch looks good, thanks!

> +		__call_rcu_nocb_wake(rdp, *was_alldone, flags);
> +
>  		return true; // Callback already enqueued.
>  	}
>  
> -- 
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06  3:05     ` Joel Fernandes
@ 2022-09-06 15:17       ` Frederic Weisbecker
  2022-09-06 16:15         ` Joel Fernandes
  2022-09-07  0:06         ` Joel Fernandes
  2022-09-06 18:16       ` Uladzislau Rezki
  1 sibling, 2 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-06 15:17 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 03:05:46AM +0000, Joel Fernandes wrote:
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 4dc86274b3e8..b201606f7c4f 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>  }
>  
> +/*
> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> + * could be flushed much earlier for a number of other reasons
> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> + * left unsubmitted to RCU after those many jiffies.
> + */
> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;

Still not static.

> @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>   * proves to be initially empty, just return false because the no-CB GP
>   * kthread may need to be awakened in this case.
>   *
> + * Return true if there was something to be flushed and it succeeded, otherwise
> + * false.
> + *

This kind of contradicts the comment that follows. Not sure you need to add
that line because the existing comment seems to cover it.

>   * Note that this function always returns true if rhp is NULL.

>   */
>  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				     unsigned long j)
> +				     unsigned long j, unsigned long flush_flags)
>  {
>  	struct rcu_cblist rcl;
> +	bool lazy = flush_flags & FLUSH_BP_LAZY;
>  
>  	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
>  	rcu_lockdep_assert_cblist_protected(rdp);
> @@ -326,13 +372,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>   * Note that this function always returns true if rhp is NULL.
>   */
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j)
> +				  unsigned long j, unsigned long flush_flags)
>  {
> +	bool ret;
> +
>  	if (!rcu_rdp_is_offloaded(rdp))
>  		return true;
>  	rcu_lockdep_assert_cblist_protected(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> +
> +	if (flush_flags & FLUSH_BP_WAKE)
> +		wake_nocb_gp(rdp, true);

Why the true above?

Also should we check if the wake up is really necessary (otherwise it means we
force a wake up for all rdp's from rcu_barrier())?

       was_alldone = rcu_segcblist_pend_cbs(&rdp->cblist);
       ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
       if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist))
       	  wake_nocb_gp(rdp, false);


> @@ -461,16 +521,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  	// We need to use the bypass.
>  	rcu_nocb_wait_contended(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> +
>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
> +
> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
> +
>  	if (!ncbs) {
>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>  	}
> +
>  	rcu_nocb_bypass_unlock(rdp);
>  	smp_mb(); /* Order enqueue before wake. */
> -	if (ncbs) {
> +
> +	// We had CBs in the bypass list before. There is nothing else to do if:
> +	// There were only non-lazy CBs before, in this case, the bypass timer

Kind of misleading. I would replace "There were only non-lazy CBs before" with
"There was at least one non-lazy CBs before".

> +	// or GP-thread will handle the CBs including any new lazy ones.
> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
> +	// case the old lazy timer would have been setup. When that expires,
> +	// the new lazy one will be handled.
> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>  		local_irq_restore(flags);
>  	} else {
>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
> @@ -479,6 +552,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstBQwake"));
>  			__call_rcu_nocb_wake(rdp, true, flags);
> +		} else if (bypass_is_lazy && !lazy) {
> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
> +					    TPS("FirstBQwakeLazy2Non"));
> +			__call_rcu_nocb_wake(rdp, true, flags);

Not sure we need this chunk. Since there are pending callbacks anyway,
nocb_gp_wait() should be handling them and it will set the appropriate
timer on the next loop.

Thanks.

>  		} else {
>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>  					    TPS("FirstBQnoWake"));

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 15:17       ` Frederic Weisbecker
@ 2022-09-06 16:15         ` Joel Fernandes
  2022-09-06 16:31           ` Joel Fernandes
  2022-09-07 10:03           ` Frederic Weisbecker
  2022-09-07  0:06         ` Joel Fernandes
  1 sibling, 2 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 16:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 11:17 AM, Frederic Weisbecker wrote:
> On Tue, Sep 06, 2022 at 03:05:46AM +0000, Joel Fernandes wrote:
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index 4dc86274b3e8..b201606f7c4f 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>>  }
>>  
>> +/*
>> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
>> + * can elapse before lazy callbacks are flushed. Lazy callbacks
>> + * could be flushed much earlier for a number of other reasons
>> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
>> + * left unsubmitted to RCU after those many jiffies.
>> + */
>> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
>> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> 
> Still not static.

Oops will fix.

>> @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>>   * proves to be initially empty, just return false because the no-CB GP
>>   * kthread may need to be awakened in this case.
>>   *
>> + * Return true if there was something to be flushed and it succeeded, otherwise
>> + * false.
>> + *
> 
> This kind of contradict the comment that follows. Not sure you need to add
> that line because the existing comment seem to cover it.
> 
>>   * Note that this function always returns true if rhp is NULL.
> 
>>   */
>>  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				     unsigned long j)
>> +				     unsigned long j, unsigned long flush_flags)
>>  {
>>  	struct rcu_cblist rcl;
>> +	bool lazy = flush_flags & FLUSH_BP_LAZY;
>>  
>>  	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
>>  	rcu_lockdep_assert_cblist_protected(rdp);
>> @@ -326,13 +372,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>   * Note that this function always returns true if rhp is NULL.
>>   */
>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				  unsigned long j)
>> +				  unsigned long j, unsigned long flush_flags)
>>  {
>> +	bool ret;
>> +
>>  	if (!rcu_rdp_is_offloaded(rdp))
>>  		return true;
>>  	rcu_lockdep_assert_cblist_protected(rdp);
>>  	rcu_nocb_bypass_lock(rdp);
>> -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
>> +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
>> +
>> +	if (flush_flags & FLUSH_BP_WAKE)
>> +		wake_nocb_gp(rdp, true);
> 
> Why the true above?
> 
> Also should we check if the wake up is really necessary (otherwise it means we
> force a wake up for all rdp's from rcu_barrier())?

Good point. I need to look into your suggested optimization here more. It is
possible the wake up is not needed some of the time, but considering that
rcu_barrier() is a slow path, I will probably aim for robustness here over
mathematically correct code. And the cost of a missed wake up here can be
serious, as I saw with the RCU_SCALE test I added. Also, the wake up of the
rcuog threads scales as the square root of the number of CPUs because of the
rdp grouping, right?

Still, it'd be good to not do something unnecessary, so your point is well
taken. I'll spend some time on this and get back to you.
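
As a quick back-of-the-envelope check of that square-root claim, here is a
small userspace sketch. The "one rcuog kthread per roughly sqrt(nr_cpus)
rdps" grouping is my understanding of the default and is an assumption of
the sketch, not something this series changes:

/*
 * Rough model: rdps are grouped under rcuog (nocb GP) kthreads with a group
 * size on the order of sqrt(nr_cpus), so forcing a wake from every rdp ends
 * up waking roughly sqrt(nr_cpus) distinct rcuog kthreads.
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
	for (int nr_cpus = 4; nr_cpus <= 256; nr_cpus *= 4) {
		int group = (int)sqrt(nr_cpus);             /* rdps per rcuog */
		int rcuogs = (nr_cpus + group - 1) / group; /* distinct wakeups */

		printf("%3d CPUs -> ~%2d rdps per group, ~%2d rcuog kthreads\n",
		       nr_cpus, group, rcuogs);
	}
	return 0;
}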

>        was_alldone = rcu_segcblist_pend_cbs(&rdp->cblist);
>        ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
>        if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist))
>        	  wake_nocb_gp(rdp, false);
> 
> 
>> @@ -461,16 +521,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  	// We need to use the bypass.
>>  	rcu_nocb_wait_contended(rdp);
>>  	rcu_nocb_bypass_lock(rdp);
>> +
>>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
>> +
>> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
>> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
>> +
>>  	if (!ncbs) {
>>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>>  	}
>> +
>>  	rcu_nocb_bypass_unlock(rdp);
>>  	smp_mb(); /* Order enqueue before wake. */
>> -	if (ncbs) {
>> +
>> +	// We had CBs in the bypass list before. There is nothing else to do if:
>> +	// There were only non-lazy CBs before, in this case, the bypass timer
> 
> Kind of misleading. I would replace "There were only non-lazy CBs before" with
> "There was at least one non-lazy CBs before".

I really mean "There were only non-lazy CBs ever queued in the bypass list
before". That's the bypass_is_lazy variable. So I did not fully understand your
suggested comment change.

>> +	// or GP-thread will handle the CBs including any new lazy ones.
>> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
>> +	// case the old lazy timer would have been setup. When that expires,
>> +	// the new lazy one will be handled.
>> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>>  		local_irq_restore(flags);
>>  	} else {
>>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
>> @@ -479,6 +552,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>  					    TPS("FirstBQwake"));
>>  			__call_rcu_nocb_wake(rdp, true, flags);
>> +		} else if (bypass_is_lazy && !lazy) {
>> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>> +					    TPS("FirstBQwakeLazy2Non"));
>> +			__call_rcu_nocb_wake(rdp, true, flags);
> 
> Not sure we need this chunk. Since there are pending callbacks anyway,
> nocb_gp_wait() should be handling them and it will set the appropriate
> timer on the next loop.

We do because those pending callbacks could be because of a bypass list flush
and not because there were pending CBs before, right? I do recall missed wake
ups of non-lazy CBs, and them having to wait for the full lazy timer duration
and slowing down synchronize_rcu() which is on the ChromeOS boot critical path!
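
To spell out the case I have in mind (illustration only -- the helper and its
parameter names are made up for this sketch; the real hunk checks
rdp->nocb_bypass and rdp->lazy_len directly):

/*
 * Returns true for the transition I am worried about: the bypass list so far
 * holds only lazy CBs, so the only armed wakeup is the multi-second lazy
 * timer, and now a non-lazy CB arrives and gets queued behind them.
 */
static bool sketch_lazy_to_nonlazy(long ncbs, long lazy_len, bool new_cb_is_lazy)
{
	bool bypass_was_all_lazy = (ncbs == lazy_len);

	return ncbs && bypass_was_all_lazy && !new_cb_is_lazy;
}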

Thanks,

 - Joel




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06  8:55           ` Paul E. McKenney
@ 2022-09-06 16:16             ` Joel Fernandes
  2022-09-06 17:05               ` Paul E. McKenney
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 16:16 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, rcu, linux-kernel, rushikesh.s.kadam,
	urezki, neeraj.iitr10, rostedt, vineeth, boqun.feng

On 9/6/2022 4:55 AM, Paul E. McKenney wrote:
> On Mon, Sep 05, 2022 at 04:32:26PM -0400, Joel Fernandes wrote:
>>
>>
>> On 9/5/2022 8:59 AM, Frederic Weisbecker wrote:
>>> I'd rather see an updated patch (not the whole patchset but just this one) rather
>>> than deltas, just to make sure I'm not missing something in the whole picture.
>>>
>>> Thanks.
>>
>> There is also the previous patch which needs a fix up you suggested.
>>
>> I will reply to individual patches with in-line patch (for you), as well as
>> refreshing whole series since it might be easier for Paul to apply them together.
> 
> To say nothing of the greater probability that the result will match
> your intent.  ;-)

It is shaping up well ;-)

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 16:15         ` Joel Fernandes
@ 2022-09-06 16:31           ` Joel Fernandes
  2022-09-06 16:38             ` Joel Fernandes
  2022-09-07 10:03           ` Frederic Weisbecker
  1 sibling, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 16:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 12:15 PM, Joel Fernandes wrote:
>>> @@ -461,16 +521,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>  	// We need to use the bypass.
>>>  	rcu_nocb_wait_contended(rdp);
>>>  	rcu_nocb_bypass_lock(rdp);
>>> +
>>>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>>>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>>>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
>>> +
>>> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
>>> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
>>> +
>>>  	if (!ncbs) {
>>>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>>>  	}
>>> +
>>>  	rcu_nocb_bypass_unlock(rdp);
>>>  	smp_mb(); /* Order enqueue before wake. */
>>> -	if (ncbs) {
>>> +
>>> +	// We had CBs in the bypass list before. There is nothing else to do if:
>>> +	// There were only non-lazy CBs before, in this case, the bypass timer
>> Kind of misleading. I would replace "There were only non-lazy CBs before" with
>> "There was at least one non-lazy CBs before".
> I really mean "There were only non-lazy CBs ever queued in the bypass list
> before". That's the bypass_is_lazy variable. So I did not fully understand your
> suggested comment change.
> 
>>> +	// or GP-thread will handle the CBs including any new lazy ones.
>>> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
>>> +	// case the old lazy timer would have been setup. When that expires,
>>> +	// the new lazy one will be handled.
>>> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>>>  		local_irq_restore(flags);
>>>  	} else {
>>>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
>>> @@ -479,6 +552,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>>  					    TPS("FirstBQwake"));
>>>  			__call_rcu_nocb_wake(rdp, true, flags);
>>> +		} else if (bypass_is_lazy && !lazy) {
>>> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>> +					    TPS("FirstBQwakeLazy2Non"));
>>> +			__call_rcu_nocb_wake(rdp, true, flags);
>>
>> Not sure we need this chunk. Since there are pending callbacks anyway,
>> nocb_gp_wait() should be handling them and it will set the appropriate
>> timer on the next loop.
>
> We do because those pending callbacks could be because of a bypass list flush
> and not because there were pending CBs before, right? I do recall missed wake
> ups of non-lazy CBs, and them having to wait for the full lazy timer duration
> and slowing down synchronize_rcu() which is on the ChromeOS boot critical path!
> 

Just to add more details, consider the series of events:

1. Only lazy CBs are ever queued. Timer is armed for multiple seconds.
rcu_segcblist_pend_cbs remains false.

2. First non-lazy CB triggers the code that does the bypass rate-limit thing.

3. Bypass list is flushed because it is a non-lazy CB and we need to start GP
processing soon.

4. Due to flush, rcu_segcblist_pend_cbs() is now true.

5. We reach this "else if" clause because bypass_is_lazy means only lazy CBs
were ever buffered. We need to reprogram the timer or do an immediate wake up.
That's the intention of __call_rcu_nocb_wake().

I really saw #1 and #2 trigger during boot up itself and cause a multi-second
boot regression.

The chunk is needed to handle this case. I indeed do not see a boot regression
any more. Did I miss something?

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 16:31           ` Joel Fernandes
@ 2022-09-06 16:38             ` Joel Fernandes
  2022-09-06 16:43               ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 16:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 12:31 PM, Joel Fernandes wrote:
> 
> 
> On 9/6/2022 12:15 PM, Joel Fernandes wrote:
>>>> @@ -461,16 +521,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>>  	// We need to use the bypass.
>>>>  	rcu_nocb_wait_contended(rdp);
>>>>  	rcu_nocb_bypass_lock(rdp);
>>>> +
>>>>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>>>>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>>>>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
>>>> +
>>>> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
>>>> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
>>>> +
>>>>  	if (!ncbs) {
>>>>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>>>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>>>>  	}
>>>> +
>>>>  	rcu_nocb_bypass_unlock(rdp);
>>>>  	smp_mb(); /* Order enqueue before wake. */
>>>> -	if (ncbs) {
>>>> +
>>>> +	// We had CBs in the bypass list before. There is nothing else to do if:
>>>> +	// There were only non-lazy CBs before, in this case, the bypass timer
>>> Kind of misleading. I would replace "There were only non-lazy CBs before" with
>>> "There was at least one non-lazy CBs before".
>> I really mean "There were only non-lazy CBs ever queued in the bypass list
>> before". That's the bypass_is_lazy variable. So I did not fully understand your
>> suggested comment change.
>>
>>>> +	// or GP-thread will handle the CBs including any new lazy ones.
>>>> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
>>>> +	// case the old lazy timer would have been setup. When that expires,
>>>> +	// the new lazy one will be handled.
>>>> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>>>>  		local_irq_restore(flags);
>>>>  	} else {
>>>>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
>>>> @@ -479,6 +552,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>>>  					    TPS("FirstBQwake"));
>>>>  			__call_rcu_nocb_wake(rdp, true, flags);
>>>> +		} else if (bypass_is_lazy && !lazy) {
>>>> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>>> +					    TPS("FirstBQwakeLazy2Non"));
>>>> +			__call_rcu_nocb_wake(rdp, true, flags);
>>>
>>> Not sure we need this chunk. Since there are pending callbacks anyway,
>>> nocb_gp_wait() should be handling them and it will set the appropriate
>>> timer on the next loop.
>>
>> We do because those pending callbacks could be because of a bypass list flush
>> and not because there were pending CBs before, right? I do recall missed wake
>> ups of non-lazy CBs, and them having to wait for the full lazy timer duration
>> and slowing down synchronize_rcu() which is on the ChromeOS boot critical path!
>>
> 
> Just to add more details, consider the series of events:
> 
> 1. Only lazy CBs are ever queued. Timer is armed for multiple seconds.
> rcu_segcblist_pend_cbs remains false.
> 
> 2. First non-lazy CB triggers to code that does the bypyass rate-limit thing.
> 
> 3. By pass list is flushed because it is non-lazy CB and we need to start GP
> processing soon.

Correcting the events, #3 does not happen if we got here.

> 
> 4. Due to flush, rcu_segcblist_pend_cbs() is now true.

So rcu_segcblist_pend_cbs() cannot be true.

> 5. We reach this "else if" clause because bypass_is_lazy means only lazy CBs
> were ever buffered. We need to reprogram the timer or do an immediate wake up.
> That's the intention of __call_rcu_nocb_wake().
> 
> I really saw #1 and #2 trigger during boot up itself and cause a multi-second
> boot regression.

So maybe this hunk is not needed any more and the boot regression is fine. I
can try to drop this hunk and run the tests again...

Thanks!

 - Joel



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 16:38             ` Joel Fernandes
@ 2022-09-06 16:43               ` Joel Fernandes
  2022-09-06 19:11                 ` Frederic Weisbecker
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 16:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 12:38 PM, Joel Fernandes wrote:
> 
> 
> On 9/6/2022 12:31 PM, Joel Fernandes wrote:
>>
>>
>> On 9/6/2022 12:15 PM, Joel Fernandes wrote:
>>>>> @@ -461,16 +521,29 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>>>  	// We need to use the bypass.
>>>>>  	rcu_nocb_wait_contended(rdp);
>>>>>  	rcu_nocb_bypass_lock(rdp);
>>>>> +
>>>>>  	ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);
>>>>>  	rcu_segcblist_inc_len(&rdp->cblist); /* Must precede enqueue. */
>>>>>  	rcu_cblist_enqueue(&rdp->nocb_bypass, rhp);
>>>>> +
>>>>> +	if (IS_ENABLED(CONFIG_RCU_LAZY) && lazy)
>>>>> +		WRITE_ONCE(rdp->lazy_len, rdp->lazy_len + 1);
>>>>> +
>>>>>  	if (!ncbs) {
>>>>>  		WRITE_ONCE(rdp->nocb_bypass_first, j);
>>>>>  		trace_rcu_nocb_wake(rcu_state.name, rdp->cpu, TPS("FirstBQ"));
>>>>>  	}
>>>>> +
>>>>>  	rcu_nocb_bypass_unlock(rdp);
>>>>>  	smp_mb(); /* Order enqueue before wake. */
>>>>> -	if (ncbs) {
>>>>> +
>>>>> +	// We had CBs in the bypass list before. There is nothing else to do if:
>>>>> +	// There were only non-lazy CBs before, in this case, the bypass timer
>>>> Kind of misleading. I would replace "There were only non-lazy CBs before" with
>>>> "There was at least one non-lazy CBs before".
>>> I really mean "There were only non-lazy CBs ever queued in the bypass list
>>> before". That's the bypass_is_lazy variable. So I did not fully understand your
>>> suggested comment change.
>>>
>>>>> +	// or GP-thread will handle the CBs including any new lazy ones.
>>>>> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
>>>>> +	// case the old lazy timer would have been setup. When that expires,
>>>>> +	// the new lazy one will be handled.
>>>>> +	if (ncbs && (!bypass_is_lazy || lazy)) {
>>>>>  		local_irq_restore(flags);
>>>>>  	} else {
>>>>>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.
>>>>> @@ -479,6 +552,10 @@ static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>>>  			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>>>>  					    TPS("FirstBQwake"));
>>>>>  			__call_rcu_nocb_wake(rdp, true, flags);
>>>>> +		} else if (bypass_is_lazy && !lazy) {
>>>>> +			trace_rcu_nocb_wake(rcu_state.name, rdp->cpu,
>>>>> +					    TPS("FirstBQwakeLazy2Non"));
>>>>> +			__call_rcu_nocb_wake(rdp, true, flags);
>>>>
>>>> Not sure we need this chunk. Since there are pending callbacks anyway,
>>>> nocb_gp_wait() should be handling them and it will set the appropriate
>>>> timer on the next loop.
>>>
>>> We do because those pending callbacks could be because of a bypass list flush
>>> and not because there were pending CBs before, right? I do recall missed wake
>>> ups of non-lazy CBs, and them having to wait for the full lazy timer duration
>>> and slowing down synchronize_rcu() which is on the ChromeOS boot critical path!
>>>
>>
>> Just to add more details, consider the series of events:
>>
>> 1. Only lazy CBs are ever queued. Timer is armed for multiple seconds.
>> rcu_segcblist_pend_cbs remains false.
>>
>> 2. First non-lazy CB triggers to code that does the bypyass rate-limit thing.
>>
>> 3. By pass list is flushed because it is non-lazy CB and we need to start GP
>> processing soon.
> 
> Correcting the events, #3 does not happen if we got here.
> 
>>
>> 4. Due to flush, rcu_segcblist_pend_cbs() is now true.
> 
> So rcu_segcblist_pend_cbs() cannot be true.
> 
>> 5. We reach this "else if" clause because bypass_is_lazy means only lazy CBs
>> were ever buffered. We need to reprogram the timer or do an immediate wake up.
>> That's the intention of __call_rcu_nocb_wake().
>>
>> I really saw #1 and #2 trigger during boot up itself and cause a multi-second
>> boot regression.
> 
> So may be this hunk is needed not needed any more and the boot regression is
> fine. I can try to drop this hunk and run the tests again...

Ah, now I know why I got confused. I *used* to flush the bypass list when
!lazy CBs showed up. Paul suggested this is overkill. In this old overkill
method, I was missing a wake up, which was likely causing the boot regression.
Forcing a wake up fixed that. Now in v5 I make it so that I don't do the flush
on a !lazy rate-limit.

I am sorry for the confusion. Either way, in my defense, this is just an extra
bit of code that I have to delete. This code is hard. I have mostly relied on
test-driven development. But now, thanks to this review, I am learning the
code more and more...

Thanks,

 - Joel




^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 16:16             ` Joel Fernandes
@ 2022-09-06 17:05               ` Paul E. McKenney
  0 siblings, 0 replies; 73+ messages in thread
From: Paul E. McKenney @ 2022-09-06 17:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, rcu, linux-kernel, rushikesh.s.kadam,
	urezki, neeraj.iitr10, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 12:16:19PM -0400, Joel Fernandes wrote:
> On 9/6/2022 4:55 AM, Paul E. McKenney wrote:
> > On Mon, Sep 05, 2022 at 04:32:26PM -0400, Joel Fernandes wrote:
> >>
> >>
> >> On 9/5/2022 8:59 AM, Frederic Weisbecker wrote:
> >>> I'd rather see an updated patch (not the whole patchset but just this one) rather
> >>> than deltas, just to make sure I'm not missing something in the whole picture.
> >>>
> >>> Thanks.
> >>
> >> There is also the previous patch which needs a fix up you suggested.
> >>
> >> I will reply to individual patches with in-line patch (for you), as well as
> >> refreshing whole series since it might be easier for Paul to apply them together.
> > 
> > To say nothing of the greater probability that the result will match
> > your intent.  ;-)
> 
> It is shaping up well ;-)

Very good, looking forward to the next round!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06  3:05     ` Joel Fernandes
  2022-09-06 15:17       ` Frederic Weisbecker
@ 2022-09-06 18:16       ` Uladzislau Rezki
  2022-09-06 18:21         ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-09-06 18:16 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Frederic Weisbecker, rcu, linux-kernel, rushikesh.s.kadam,
	urezki, neeraj.iitr10, paulmck, rostedt, vineeth, boqun.feng

> On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
> > On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
> > > Implement timer-based RCU lazy callback batching. The batch is flushed
> > > whenever a certain amount of time has passed, or the batch on a
> > > particular CPU grows too big. Also memory pressure will flush it in a
> > > future patch.
> > > 
> > > To handle several corner cases automagically (such as rcu_barrier() and
> > > hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> > > length has the lazy CB length included in it. A separate lazy CB length
> > > counter is also introduced to keep track of the number of lazy CBs.
> > > 
> > > Suggested-by: Paul McKenney <paulmck@kernel.org>
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > ---
> 
> Here is the updated version of this patch for further testing and review.
> Paul, you could consider updating your test branch. I have tested it in
> ChromeOS as well, and rcuscale. The laziness and boot times are looking good.
> There was at least one bug that I fixed that got introduced with the moving
> of the length field to rcu_data.  Thanks a lot Frederic for the review
> comments.
> 
> I will look at the rcu torture issue next... I suspect the length field issue
> may have been causing it.
> 
> ---8<-----------------------
> 
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Subject: [PATCH v6] rcu: Introduce call_rcu_lazy() API implementation
> 
> Implement timer-based RCU lazy callback batching. The batch is flushed
> whenever a certain amount of time has passed, or the batch on a
> particular CPU grows too big. Also memory pressure will flush it in a
> future patch.
> 
> To handle several corner cases automagically (such as rcu_barrier() and
> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> length has the lazy CB length included in it. A separate lazy CB length
> counter is also introduced to keep track of the number of lazy CBs.
> 
> v5->v6:
> 
> [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
>   deferral levels wake much earlier so for those it is not needed. ]
> 
> [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
> 
> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
> 
> Suggested-by: Paul McKenney <paulmck@kernel.org>
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  include/linux/rcupdate.h |   6 ++
>  kernel/rcu/Kconfig       |   8 ++
>  kernel/rcu/rcu.h         |  11 +++
>  kernel/rcu/tree.c        | 133 +++++++++++++++++++----------
>  kernel/rcu/tree.h        |  17 +++-
>  kernel/rcu/tree_nocb.h   | 175 ++++++++++++++++++++++++++++++++-------
>  6 files changed, 269 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 08605ce7379d..82e8a07e0856 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>  
>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>  
> +#ifdef CONFIG_RCU_LAZY
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> +#else
> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> +#endif
> +
>  /* Internal to kernel */
>  void rcu_init(void);
>  extern int rcu_scheduler_active;
> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> index d471d22a5e21..3128d01427cb 100644
> --- a/kernel/rcu/Kconfig
> +++ b/kernel/rcu/Kconfig
> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>  	  Say N here if you hate read-side memory barriers.
>  	  Take the default if you are unsure.
>  
> +config RCU_LAZY
> +	bool "RCU callback lazy invocation functionality"
> +	depends on RCU_NOCB_CPU
> +	default n
> +	help
> +	  To save power, batch RCU callbacks and flush after delay, memory
> +	  pressure or callback list growing too big.
> +
>  endmenu # "RCU Subsystem"
> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> index be5979da07f5..94675f14efe8 100644
> --- a/kernel/rcu/rcu.h
> +++ b/kernel/rcu/rcu.h
> @@ -474,6 +474,14 @@ enum rcutorture_type {
>  	INVALID_RCU_FLAVOR
>  };
>  
> +#if defined(CONFIG_RCU_LAZY)
> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
> +#else
> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
> +#endif
> +
>  #if defined(CONFIG_TREE_RCU)
>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>  			    unsigned long *gp_seq);
> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>  			       unsigned long c_old,
>  			       unsigned long c);
>  void rcu_gp_set_torture_wait(int duration);
> +void rcu_force_call_rcu_to_lazy(bool force);
> +
>  #else
>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>  					  int *flags, unsigned long *gp_seq)
> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>  	do { } while (0)
>  #endif
>  static inline void rcu_gp_set_torture_wait(int duration) { }
> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>  #endif
>  
>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 9fe581be8696..dbd25b8c080e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>  	raw_spin_unlock_rcu_node(rnp);
>  }
>  
> -/**
> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
> - * @head: structure to be used for queueing the RCU updates.
> - * @func: actual callback function to be invoked after the grace period
> - *
> - * The callback function will be invoked some time after a full grace
> - * period elapses, in other words after all pre-existing RCU read-side
> - * critical sections have completed.  However, the callback function
> - * might well execute concurrently with RCU read-side critical sections
> - * that started after call_rcu() was invoked.
> - *
> - * RCU read-side critical sections are delimited by rcu_read_lock()
> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
> - * v5.0 and later, regions of code across which interrupts, preemption,
> - * or softirqs have been disabled also serve as RCU read-side critical
> - * sections.  This includes hardware interrupt handlers, softirq handlers,
> - * and NMI handlers.
> - *
> - * Note that all CPUs must agree that the grace period extended beyond
> - * all pre-existing RCU read-side critical section.  On systems with more
> - * than one CPU, this means that when "func()" is invoked, each CPU is
> - * guaranteed to have executed a full memory barrier since the end of its
> - * last RCU read-side critical section whose beginning preceded the call
> - * to call_rcu().  It also means that each CPU executing an RCU read-side
> - * critical section that continues beyond the start of "func()" must have
> - * executed a memory barrier after the call_rcu() but before the beginning
> - * of that RCU read-side critical section.  Note that these guarantees
> - * include CPUs that are offline, idle, or executing in user mode, as
> - * well as CPUs that are executing in the kernel.
> - *
> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> - * resulting RCU callback function "func()", then both CPU A and CPU B are
> - * guaranteed to execute a full memory barrier during the time interval
> - * between the call to call_rcu() and the invocation of "func()" -- even
> - * if CPU A and CPU B are the same CPU (but again only if the system has
> - * more than one CPU).
> - *
> - * Implementation of these memory-ordering guarantees is described here:
> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> - */
> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +static void
> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>  {
>  	static atomic_t doublefrees;
>  	unsigned long flags;
> @@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		trace_rcu_callback(rcu_state.name, head,
>  				   rcu_segcblist_n_cbs(&rdp->cblist));
>  
> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>  	rcu_segcblist_enqueue(&rdp->cblist, head);
> @@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  		local_irq_restore(flags);
>  	}
>  }
> -EXPORT_SYMBOL_GPL(call_rcu);
>  
> +#ifdef CONFIG_RCU_LAZY
> +/**
> + * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.
> + *
> + * Use this API instead of call_rcu() if you don't mind the callback being
> + * invoked after very long periods of time on systems without memory pressure
> + * and on systems which are lightly loaded or mostly idle.
> + *
> + * Other than the extra delay in callbacks being invoked, this function is
> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
> + * details about memory ordering and other functionality.
> + */
> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, true);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_lazy);
> +#endif
> +
> +static bool force_call_rcu_to_lazy;
> +
> +void rcu_force_call_rcu_to_lazy(bool force)
> +{
> +	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
> +		WRITE_ONCE(force_call_rcu_to_lazy, force);
> +}
> +EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
> +
> +/**
> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual callback function to be invoked after the grace period
> + *
> + * The callback function will be invoked some time after a full grace
> + * period elapses, in other words after all pre-existing RCU read-side
> + * critical sections have completed.  However, the callback function
> + * might well execute concurrently with RCU read-side critical sections
> + * that started after call_rcu() was invoked.
> + *
> + * RCU read-side critical sections are delimited by rcu_read_lock()
> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
> + * v5.0 and later, regions of code across which interrupts, preemption,
> + * or softirqs have been disabled also serve as RCU read-side critical
> + * sections.  This includes hardware interrupt handlers, softirq handlers,
> + * and NMI handlers.
> + *
> + * Note that all CPUs must agree that the grace period extended beyond
> + * all pre-existing RCU read-side critical section.  On systems with more
> + * than one CPU, this means that when "func()" is invoked, each CPU is
> + * guaranteed to have executed a full memory barrier since the end of its
> + * last RCU read-side critical section whose beginning preceded the call
> + * to call_rcu().  It also means that each CPU executing an RCU read-side
> + * critical section that continues beyond the start of "func()" must have
> + * executed a memory barrier after the call_rcu() but before the beginning
> + * of that RCU read-side critical section.  Note that these guarantees
> + * include CPUs that are offline, idle, or executing in user mode, as
> + * well as CPUs that are executing in the kernel.
> + *
> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> + * resulting RCU callback function "func()", then both CPU A and CPU B are
> + * guaranteed to execute a full memory barrier during the time interval
> + * between the call to call_rcu() and the invocation of "func()" -- even
> + * if CPU A and CPU B are the same CPU (but again only if the system has
> + * more than one CPU).
> + *
> + * Implementation of these memory-ordering guarantees is described here:
> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> + */
> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
> +{
> +	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
>  
>  /* Maximum number of jiffies to wait before draining a batch. */
>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
> @@ -3904,7 +3943,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>  	rdp->barrier_head.func = rcu_barrier_callback;
>  	debug_rcu_head_queue(&rdp->barrier_head);
>  	rcu_nocb_lock(rdp);
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> +        /*
> +         * Flush the bypass list, but also wake up the GP thread as otherwise
> +         * bypass/lazy CBs may not be noticed, and can cause really long delays!
> +         */
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>  		atomic_inc(&rcu_state.barrier_cpu_count);
>  	} else {
> @@ -4325,7 +4368,7 @@ void rcutree_migrate_callbacks(int cpu)
>  	my_rdp = this_cpu_ptr(&rcu_data);
>  	my_rnp = my_rdp->mynode;
>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>  	/* Leverage recent GPs and set GP for new callbacks. */
>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index d4a97e40ea9c..361c41d642c7 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -263,14 +263,16 @@ struct rcu_data {
>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>  
> +	long lazy_len;			/* Length of buffered lazy callbacks. */
>  	int cpu;
>  };
>  
>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>  #define RCU_NOCB_WAKE_NOT	0
>  #define RCU_NOCB_WAKE_BYPASS	1
> -#define RCU_NOCB_WAKE		2
> -#define RCU_NOCB_WAKE_FORCE	3
> +#define RCU_NOCB_WAKE_LAZY	2
> +#define RCU_NOCB_WAKE		3
> +#define RCU_NOCB_WAKE_FORCE	4
>  
>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>  					/* For jiffies_till_first_fqs and */
> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
>  static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>  static void rcu_init_one_nocb(struct rcu_node *rnp);
> +
> +#define FLUSH_BP_NONE 0
> +/* Is the CB being enqueued after the flush, a lazy CB? */
> +#define FLUSH_BP_LAZY BIT(0)
> +/* Wake up nocb-GP thread after flush? */
> +#define FLUSH_BP_WAKE BIT(1)
>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				  unsigned long j);
> +				  unsigned long j, unsigned long flush_flags);
>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> -				bool *was_alldone, unsigned long flags);
> +				bool *was_alldone, unsigned long flags,
> +				bool lazy);
>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>  				 unsigned long flags);
>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index 4dc86274b3e8..b201606f7c4f 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>  }
>  
> +/*
> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> + * could be flushed much earlier for a number of other reasons
> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> + * left unsubmitted to RCU after those many jiffies.
> + */
> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> +
> +#ifdef CONFIG_RCU_LAZY
> +// To be called only from test code.
> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
> +{
> +	jiffies_till_flush = jif;
> +}
> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
> +
> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
> +{
> +	return jiffies_till_flush;
> +}
> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
> +#endif
> +
>  /*
>   * Arrange to wake the GP kthread for this NOCB group at some future
>   * time when it is safe to do so.
> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>  
>  	/*
> -	 * Bypass wakeup overrides previous deferments. In case
> -	 * of callback storm, no need to wake up too early.
> +	 * Bypass wakeup overrides previous deferments. In case of
> +	 * callback storm, no need to wake up too early.
>  	 */
> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
> +	if (waketype == RCU_NOCB_WAKE_LAZY
> +		&& READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
> +		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> +	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
>  		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>  		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>  	} else {
>
Joel, I have a question here. I see that a lazy callback just makes the GP
start later, whereas the regular call_rcu() API will instead cut that delay
short. Could you please clarify how it handles a sequence like:

CPU:

1) task A -> call_rcu_lazy();
2) task B -> call_rcu();

so does task B's call_rcu() override the lazy decision to defer the GP and initiate it sooner?

Or am I missing something here?

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 18:16       ` Uladzislau Rezki
@ 2022-09-06 18:21         ` Joel Fernandes
  2022-09-07  8:52           ` Uladzislau Rezki
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 18:21 UTC (permalink / raw)
  To: Uladzislau Rezki
  Cc: Frederic Weisbecker, rcu, linux-kernel, rushikesh.s.kadam,
	neeraj.iitr10, paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 2:16 PM, Uladzislau Rezki wrote:
>> On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
>>> On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
>>>> Implement timer-based RCU lazy callback batching. The batch is flushed
>>>> whenever a certain amount of time has passed, or the batch on a
>>>> particular CPU grows too big. Also memory pressure will flush it in a
>>>> future patch.
>>>>
>>>> To handle several corner cases automagically (such as rcu_barrier() and
>>>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
>>>> length has the lazy CB length included in it. A separate lazy CB length
>>>> counter is also introduced to keep track of the number of lazy CBs.
>>>>
>>>> Suggested-by: Paul McKenney <paulmck@kernel.org>
>>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>> ---
>>
>> Here is the updated version of this patch for further testing and review.
>> Paul, you could consider updating your test branch. I have tested it in
>> ChromeOS as well, and rcuscale. The laziness and boot times are looking good.
>> There was at least one bug that I fixed that got introduced with the moving
>> of the length field to rcu_data.  Thanks a lot Frederic for the review
>> comments.
>>
>> I will look at the rcu torture issue next... I suspect the length field issue
>> may have been causing it.
>>
>> ---8<-----------------------
>>
>> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>> Subject: [PATCH v6] rcu: Introduce call_rcu_lazy() API implementation
>>
>> Implement timer-based RCU lazy callback batching. The batch is flushed
>> whenever a certain amount of time has passed, or the batch on a
>> particular CPU grows too big. Also memory pressure will flush it in a
>> future patch.
>>
>> To handle several corner cases automagically (such as rcu_barrier() and
>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
>> length has the lazy CB length included in it. A separate lazy CB length
>> counter is also introduced to keep track of the number of lazy CBs.
>>
>> v5->v6:
>>
>> [ Frederic Weisbecker: Program the lazy timer only if WAKE_NOT, since other
>>   deferral levels wake much earlier so for those it is not needed. ]
>>
>> [ Frederic Weisbecker: Use flush flags to keep bypass API code clean. ]
>>
>> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
>>
>> Suggested-by: Paul McKenney <paulmck@kernel.org>
>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>> ---
>>  include/linux/rcupdate.h |   6 ++
>>  kernel/rcu/Kconfig       |   8 ++
>>  kernel/rcu/rcu.h         |  11 +++
>>  kernel/rcu/tree.c        | 133 +++++++++++++++++++----------
>>  kernel/rcu/tree.h        |  17 +++-
>>  kernel/rcu/tree_nocb.h   | 175 ++++++++++++++++++++++++++++++++-------
>>  6 files changed, 269 insertions(+), 81 deletions(-)
>>
>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>> index 08605ce7379d..82e8a07e0856 100644
>> --- a/include/linux/rcupdate.h
>> +++ b/include/linux/rcupdate.h
>> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>>  
>>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>  
>> +#ifdef CONFIG_RCU_LAZY
>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
>> +#else
>> +#define call_rcu_lazy(head, func) call_rcu(head, func)
>> +#endif
>> +
>>  /* Internal to kernel */
>>  void rcu_init(void);
>>  extern int rcu_scheduler_active;
>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>> index d471d22a5e21..3128d01427cb 100644
>> --- a/kernel/rcu/Kconfig
>> +++ b/kernel/rcu/Kconfig
>> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>>  	  Say N here if you hate read-side memory barriers.
>>  	  Take the default if you are unsure.
>>  
>> +config RCU_LAZY
>> +	bool "RCU callback lazy invocation functionality"
>> +	depends on RCU_NOCB_CPU
>> +	default n
>> +	help
>> +	  To save power, batch RCU callbacks and flush after delay, memory
>> +	  pressure or callback list growing too big.
>> +
>>  endmenu # "RCU Subsystem"
>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>> index be5979da07f5..94675f14efe8 100644
>> --- a/kernel/rcu/rcu.h
>> +++ b/kernel/rcu/rcu.h
>> @@ -474,6 +474,14 @@ enum rcutorture_type {
>>  	INVALID_RCU_FLAVOR
>>  };
>>  
>> +#if defined(CONFIG_RCU_LAZY)
>> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
>> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
>> +#else
>> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
>> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
>> +#endif
>> +
>>  #if defined(CONFIG_TREE_RCU)
>>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>>  			    unsigned long *gp_seq);
>> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>  			       unsigned long c_old,
>>  			       unsigned long c);
>>  void rcu_gp_set_torture_wait(int duration);
>> +void rcu_force_call_rcu_to_lazy(bool force);
>> +
>>  #else
>>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>>  					  int *flags, unsigned long *gp_seq)
>> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>  	do { } while (0)
>>  #endif
>>  static inline void rcu_gp_set_torture_wait(int duration) { }
>> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>>  #endif
>>  
>>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>> index 9fe581be8696..dbd25b8c080e 100644
>> --- a/kernel/rcu/tree.c
>> +++ b/kernel/rcu/tree.c
>> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>>  	raw_spin_unlock_rcu_node(rnp);
>>  }
>>  
>> -/**
>> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
>> - * @head: structure to be used for queueing the RCU updates.
>> - * @func: actual callback function to be invoked after the grace period
>> - *
>> - * The callback function will be invoked some time after a full grace
>> - * period elapses, in other words after all pre-existing RCU read-side
>> - * critical sections have completed.  However, the callback function
>> - * might well execute concurrently with RCU read-side critical sections
>> - * that started after call_rcu() was invoked.
>> - *
>> - * RCU read-side critical sections are delimited by rcu_read_lock()
>> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
>> - * v5.0 and later, regions of code across which interrupts, preemption,
>> - * or softirqs have been disabled also serve as RCU read-side critical
>> - * sections.  This includes hardware interrupt handlers, softirq handlers,
>> - * and NMI handlers.
>> - *
>> - * Note that all CPUs must agree that the grace period extended beyond
>> - * all pre-existing RCU read-side critical section.  On systems with more
>> - * than one CPU, this means that when "func()" is invoked, each CPU is
>> - * guaranteed to have executed a full memory barrier since the end of its
>> - * last RCU read-side critical section whose beginning preceded the call
>> - * to call_rcu().  It also means that each CPU executing an RCU read-side
>> - * critical section that continues beyond the start of "func()" must have
>> - * executed a memory barrier after the call_rcu() but before the beginning
>> - * of that RCU read-side critical section.  Note that these guarantees
>> - * include CPUs that are offline, idle, or executing in user mode, as
>> - * well as CPUs that are executing in the kernel.
>> - *
>> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
>> - * resulting RCU callback function "func()", then both CPU A and CPU B are
>> - * guaranteed to execute a full memory barrier during the time interval
>> - * between the call to call_rcu() and the invocation of "func()" -- even
>> - * if CPU A and CPU B are the same CPU (but again only if the system has
>> - * more than one CPU).
>> - *
>> - * Implementation of these memory-ordering guarantees is described here:
>> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
>> - */
>> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
>> +static void
>> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>>  {
>>  	static atomic_t doublefrees;
>>  	unsigned long flags;
>> @@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>  		trace_rcu_callback(rcu_state.name, head,
>>  				   rcu_segcblist_n_cbs(&rdp->cblist));
>>  
>> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>>  	rcu_segcblist_enqueue(&rdp->cblist, head);
>> @@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>  		local_irq_restore(flags);
>>  	}
>>  }
>> -EXPORT_SYMBOL_GPL(call_rcu);
>>  
>> +#ifdef CONFIG_RCU_LAZY
>> +/**
>> + * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
>> + * @head: structure to be used for queueing the RCU updates.
>> + * @func: actual callback function to be invoked after the grace period
>> + *
>> + * The callback function will be invoked some time after a full grace
>> + * period elapses, in other words after all pre-existing RCU read-side
>> + * critical sections have completed.
>> + *
>> + * Use this API instead of call_rcu() if you don't mind the callback being
>> + * invoked after very long periods of time on systems without memory pressure
>> + * and on systems which are lightly loaded or mostly idle.
>> + *
>> + * Other than the extra delay in callbacks being invoked, this function is
>> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
>> + * details about memory ordering and other functionality.
>> + */
>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
>> +{
>> +	return __call_rcu_common(head, func, true);
>> +}
>> +EXPORT_SYMBOL_GPL(call_rcu_lazy);
>> +#endif
>> +
>> +static bool force_call_rcu_to_lazy;
>> +
>> +void rcu_force_call_rcu_to_lazy(bool force)
>> +{
>> +	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
>> +		WRITE_ONCE(force_call_rcu_to_lazy, force);
>> +}
>> +EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
>> +
>> +/**
>> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
>> + * @head: structure to be used for queueing the RCU updates.
>> + * @func: actual callback function to be invoked after the grace period
>> + *
>> + * The callback function will be invoked some time after a full grace
>> + * period elapses, in other words after all pre-existing RCU read-side
>> + * critical sections have completed.  However, the callback function
>> + * might well execute concurrently with RCU read-side critical sections
>> + * that started after call_rcu() was invoked.
>> + *
>> + * RCU read-side critical sections are delimited by rcu_read_lock()
>> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
>> + * v5.0 and later, regions of code across which interrupts, preemption,
>> + * or softirqs have been disabled also serve as RCU read-side critical
>> + * sections.  This includes hardware interrupt handlers, softirq handlers,
>> + * and NMI handlers.
>> + *
>> + * Note that all CPUs must agree that the grace period extended beyond
>> + * all pre-existing RCU read-side critical section.  On systems with more
>> + * than one CPU, this means that when "func()" is invoked, each CPU is
>> + * guaranteed to have executed a full memory barrier since the end of its
>> + * last RCU read-side critical section whose beginning preceded the call
>> + * to call_rcu().  It also means that each CPU executing an RCU read-side
>> + * critical section that continues beyond the start of "func()" must have
>> + * executed a memory barrier after the call_rcu() but before the beginning
>> + * of that RCU read-side critical section.  Note that these guarantees
>> + * include CPUs that are offline, idle, or executing in user mode, as
>> + * well as CPUs that are executing in the kernel.
>> + *
>> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
>> + * resulting RCU callback function "func()", then both CPU A and CPU B are
>> + * guaranteed to execute a full memory barrier during the time interval
>> + * between the call to call_rcu() and the invocation of "func()" -- even
>> + * if CPU A and CPU B are the same CPU (but again only if the system has
>> + * more than one CPU).
>> + *
>> + * Implementation of these memory-ordering guarantees is described here:
>> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
>> + */
>> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
>> +{
>> +	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
>> +}
>> +EXPORT_SYMBOL_GPL(call_rcu);
>>  
>>  /* Maximum number of jiffies to wait before draining a batch. */
>>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
>> @@ -3904,7 +3943,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>>  	rdp->barrier_head.func = rcu_barrier_callback;
>>  	debug_rcu_head_queue(&rdp->barrier_head);
>>  	rcu_nocb_lock(rdp);
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>> +        /*
>> +         * Flush the bypass list, but also wake up the GP thread as otherwise
>> +         * bypass/lazy CBs may not be noticed, and can cause really long delays!
>> +         */
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
>>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>>  		atomic_inc(&rcu_state.barrier_cpu_count);
>>  	} else {
>> @@ -4325,7 +4368,7 @@ void rcutree_migrate_callbacks(int cpu)
>>  	my_rdp = this_cpu_ptr(&rcu_data);
>>  	my_rnp = my_rdp->mynode;
>>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
>>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>>  	/* Leverage recent GPs and set GP for new callbacks. */
>>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>> index d4a97e40ea9c..361c41d642c7 100644
>> --- a/kernel/rcu/tree.h
>> +++ b/kernel/rcu/tree.h
>> @@ -263,14 +263,16 @@ struct rcu_data {
>>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>>  
>> +	long lazy_len;			/* Length of buffered lazy callbacks. */
>>  	int cpu;
>>  };
>>  
>>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>>  #define RCU_NOCB_WAKE_NOT	0
>>  #define RCU_NOCB_WAKE_BYPASS	1
>> -#define RCU_NOCB_WAKE		2
>> -#define RCU_NOCB_WAKE_FORCE	3
>> +#define RCU_NOCB_WAKE_LAZY	2
>> +#define RCU_NOCB_WAKE		3
>> +#define RCU_NOCB_WAKE_FORCE	4
>>  
>>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>>  					/* For jiffies_till_first_fqs and */
>> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
>>  static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>>  static void rcu_init_one_nocb(struct rcu_node *rnp);
>> +
>> +#define FLUSH_BP_NONE 0
>> +/* Is the CB being enqueued after the flush, a lazy CB? */
>> +#define FLUSH_BP_LAZY BIT(0)
>> +/* Wake up nocb-GP thread after flush? */
>> +#define FLUSH_BP_WAKE BIT(1)
>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				  unsigned long j);
>> +				  unsigned long j, unsigned long flush_flags);
>>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>> -				bool *was_alldone, unsigned long flags);
>> +				bool *was_alldone, unsigned long flags,
>> +				bool lazy);
>>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>>  				 unsigned long flags);
>>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index 4dc86274b3e8..b201606f7c4f 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>>  }
>>  
>> +/*
>> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
>> + * can elapse before lazy callbacks are flushed. Lazy callbacks
>> + * could be flushed much earlier for a number of other reasons
>> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
>> + * left unsubmitted to RCU after those many jiffies.
>> + */
>> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
>> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
>> +
>> +#ifdef CONFIG_RCU_LAZY
>> +// To be called only from test code.
>> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
>> +{
>> +	jiffies_till_flush = jif;
>> +}
>> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
>> +
>> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
>> +{
>> +	return jiffies_till_flush;
>> +}
>> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
>> +#endif
>> +
>>  /*
>>   * Arrange to wake the GP kthread for this NOCB group at some future
>>   * time when it is safe to do so.
>> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>>  
>>  	/*
>> -	 * Bypass wakeup overrides previous deferments. In case
>> -	 * of callback storm, no need to wake up too early.
>> +	 * Bypass wakeup overrides previous deferments. In case of
>> +	 * callback storm, no need to wake up too early.
>>  	 */
>> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
>> +	if (waketype == RCU_NOCB_WAKE_LAZY
>> +		&& READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
>> +		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
>> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>> +	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
>>  		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>>  		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>>  	} else {
>>
> Joel, I have a question here. I see that a lazy callback just makes the GP
> start later, whereas the regular call_rcu() API will instead cut that delay
> short. Could you please clarify how it handles a sequence like:
> 
> CPU:
> 
> 1) task A -> call_rcu_lazy();
> 2) task B -> call_rcu();
> 
> so does task B's call_rcu() override the lazy decision to defer the GP and initiate it sooner?

Is your question whether task B will make task A's CB run sooner? Yes, that's
right. The reason is that the point of call_rcu_lazy() is to delay the GP start
as much as possible to keep the system quiet. However, if a call_rcu() comes in
anyway, we have to start a GP soon. So we might as well take advantage of that
for the lazy one, as we are paying the full price for the new GP.

It is kind of like RCU's amortizing behavior in that sense; however, we put
that into overdrive to be greedy for power.
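
To make that concrete, here is a tiny standalone model of the enqueue-time
decision (plain C, made-up names, not the patch code):

#include <stdbool.h>
#include <stdio.h>

/* Toy per-CPU state: how many CBs are buffered and whether a GP was kicked. */
struct toy_cpu {
	int	n_lazy;		/* lazy CBs, allowed to sit for up to ~10s    */
	int	n_pending;	/* all CBs waiting for the next grace period  */
	bool	gp_kicked;	/* has the GP machinery been woken up?        */
};

/* The decision in miniature: lazy CBs only defer, non-lazy CBs wake the GP. */
static void toy_call_rcu(struct toy_cpu *cpu, bool lazy)
{
	cpu->n_pending++;
	if (lazy) {
		cpu->n_lazy++;	/* keep the system quiet, wait for the timer */
		return;
	}
	/* A non-lazy CB needs a GP soon, so buffered lazy CBs ride along. */
	cpu->gp_kicked = true;
	cpu->n_lazy = 0;
}

int main(void)
{
	struct toy_cpu cpu = { 0 };

	toy_call_rcu(&cpu, true);	/* task A: call_rcu_lazy()             */
	toy_call_rcu(&cpu, false);	/* task B: call_rcu() wakes the GP now */
	printf("gp_kicked=%d n_lazy=%d n_pending=%d\n",
	       cpu.gp_kicked, cpu.n_lazy, cpu.n_pending);
	return 0;
}

This prints gp_kicked=1 n_lazy=0 n_pending=2: task B's non-lazy callback ends
the laziness for task A's callback as well.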

Does that answer your question?

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 16:43               ` Joel Fernandes
@ 2022-09-06 19:11                 ` Frederic Weisbecker
  2022-09-07  2:56                   ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-06 19:11 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 12:43:52PM -0400, Joel Fernandes wrote:
> On 9/6/2022 12:38 PM, Joel Fernandes wrote:
> Ah, now I know why I got confused. I *used* to flush the bypass list before,
> whenever !lazy CBs showed up. Paul suggested this is overkill. In this old
> overkill method, I was missing a wake up, which was likely causing the boot
> regression. Forcing a wake up fixed that. Now in v5 I make it such that I
> don't do the flush on a !lazy rate-limit.
>
> I am sorry for the confusion. Either way, in my defense this is just an extra
> bit of code that I have to delete. This code is hard. I have mostly relied on
> test-driven development. But now, thanks to this review, I am learning the
> code more and more...

Yeah this code is hard.

Especially as it's possible to flush from both sides and queue the timer
from both sides. And both sides read the bypass/lazy counter locklessly.
But only call_rcu_*() can queue/increase the bypass size whereas only
nocb_gp_wait() can cancel the timer. Phew!

Among the many possible dances between rcu_nocb_try_bypass()
and nocb_gp_wait(), I haven't found a way yet for the timer to be
set to LAZY when it should be BYPASS (or other kind of accident such
as an ignored callback).

In the worst case we may arm an earlier timer than necessary
(RCU_NOCB_WAKE_BYPASS instead of RCU_NOCB_WAKE_LAZY for example).

Famous last words...
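
To illustrate only that last point with a standalone toy (not the kernel's
wake_nocb_gp_defer(), which has more states): if racing requests can only ever
pull the deferred wakeup earlier, the failure mode is a wasted early wakeup,
never an ignored callback.

#include <assert.h>
#include <stdio.h>

/* Toy deferred-wakeup timer: the earliest requested deadline always wins. */
static unsigned long timer_deadline = ~0UL;	/* no timer armed yet */

static void toy_defer_wake(unsigned long deadline)
{
	if (deadline < timer_deadline)
		timer_deadline = deadline;	/* earlier request wins */
}

int main(void)
{
	unsigned long lazy_deadline   = 10 * 250;	/* ~10s at HZ=250 */
	unsigned long bypass_deadline = 2;		/* ~2 jiffies     */

	/* Two racing callers, one lazy and one bypass; order does not matter. */
	toy_defer_wake(lazy_deadline);
	toy_defer_wake(bypass_deadline);

	/* No request ever waits longer than its own deadline. */
	assert(timer_deadline <= bypass_deadline);
	assert(timer_deadline <= lazy_deadline);
	printf("timer fires after %lu jiffies\n", timer_deadline);
	return 0;
}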


> Thanks,
> 
>  - Joel
> 
> 
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 03/18] rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
  2022-09-01 22:17 ` [PATCH v5 03/18] rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask Joel Fernandes (Google)
@ 2022-09-06 22:26   ` Boqun Feng
  2022-09-06 22:31     ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Boqun Feng @ 2022-09-06 22:26 UTC (permalink / raw)
  To: Joel Fernandes (Google)
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, paulmck, rostedt, vineeth

On Thu, Sep 01, 2022 at 10:17:05PM +0000, Joel Fernandes (Google) wrote:
> The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This
> may help avoid load/store tearing due to concurrent access, KCSAN

Nit: you can avoid only load tearing with READ_ONCE().

Regards,
Boqun

> issues, and preserve sanity of people reading the mask in tracing.
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 0ca21ac0f064..5ec97e3f7468 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2106,7 +2106,7 @@ int rcutree_dying_cpu(unsigned int cpu)
>  	if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
>  		return 0;
>  
> -	blkd = !!(rnp->qsmask & rdp->grpmask);
> +	blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask);
>  	trace_rcu_grace_period(rcu_state.name, READ_ONCE(rnp->gp_seq),
>  			       blkd ? TPS("cpuofl-bgp") : TPS("cpuofl"));
>  	return 0;
> -- 
> 2.37.2.789.g6183377224-goog
> 

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 03/18] rcu/tree: Use READ_ONCE() for lockless read of rnp->qsmask
  2022-09-06 22:26   ` Boqun Feng
@ 2022-09-06 22:31     ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-06 22:31 UTC (permalink / raw)
  To: Boqun Feng
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	frederic, paulmck, rostedt, vineeth



On 9/6/2022 6:26 PM, Boqun Feng wrote:
> On Thu, Sep 01, 2022 at 10:17:05PM +0000, Joel Fernandes (Google) wrote:
>> The rnp->qsmask is locklessly accessed from rcutree_dying_cpu(). This
>> may help avoid load/store tearing due to concurrent access, KCSAN
> 
> Nit: you can avoid only load tearing with READ_ONCE().
> 

I modified it as you suggested for the next revision, thanks.
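
For anyone following along, the distinction in a standalone toy (plain C, not
kernel code): a marked load such as READ_ONCE() only prevents the load from
being torn, refetched or fused; store tearing would need the writer to use
WRITE_ONCE() as well.

#include <stdio.h>

static unsigned long qsmask;	/* stands in for rnp->qsmask */

/*
 * Reader side: the volatile access forbids the compiler from tearing,
 * refetching or fusing this load, which is the effect READ_ONCE() is
 * after in the patch above.
 */
static unsigned long load_once(const unsigned long *p)
{
	return *(const volatile unsigned long *)p;
}

int main(void)
{
	unsigned long grpmask = 0x4;

	/* Conceptually: blkd = !!(READ_ONCE(rnp->qsmask) & rdp->grpmask); */
	int blkd = !!(load_once(&qsmask) & grpmask);

	/*
	 * A concurrent plain store to qsmask elsewhere could still be torn
	 * by the compiler; ruling that out would require marking the store
	 * itself, which is the point of the nit.
	 */
	printf("blkd=%d\n", blkd);
	return 0;
}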

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 15:17       ` Frederic Weisbecker
  2022-09-06 16:15         ` Joel Fernandes
@ 2022-09-07  0:06         ` Joel Fernandes
  2022-09-07  9:40           ` Frederic Weisbecker
  1 sibling, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07  0:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 05:17:57PM +0200, Frederic Weisbecker wrote:
> On Tue, Sep 06, 2022 at 03:05:46AM +0000, Joel Fernandes wrote:
> > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > index 4dc86274b3e8..b201606f7c4f 100644
> > --- a/kernel/rcu/tree_nocb.h
> > +++ b/kernel/rcu/tree_nocb.h
> > @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> >  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
> >  }
> >  
> > +/*
> > + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> > + * can elapse before lazy callbacks are flushed. Lazy callbacks
> > + * could be flushed much earlier for a number of other reasons
> > + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> > + * left unsubmitted to RCU after those many jiffies.
> > + */
> > +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> > +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> 
> Still not static.
> 
> > @@ -293,12 +322,16 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
> >   * proves to be initially empty, just return false because the no-CB GP
> >   * kthread may need to be awakened in this case.
> >   *
> > + * Return true if there was something to be flushed and it succeeded, otherwise
> > + * false.
> > + *
> 
> This kind of contradict the comment that follows. Not sure you need to add
> that line because the existing comment seem to cover it.
> 
> >   * Note that this function always returns true if rhp is NULL.
> 
> >   */
> >  static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > -				     unsigned long j)
> > +				     unsigned long j, unsigned long flush_flags)
> >  {
> >  	struct rcu_cblist rcl;
> > +	bool lazy = flush_flags & FLUSH_BP_LAZY;
> >  
> >  	WARN_ON_ONCE(!rcu_rdp_is_offloaded(rdp));
> >  	rcu_lockdep_assert_cblist_protected(rdp);
> > @@ -326,13 +372,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >   * Note that this function always returns true if rhp is NULL.
> >   */
> >  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > -				  unsigned long j)
> > +				  unsigned long j, unsigned long flush_flags)
> >  {
> > +	bool ret;
> > +
> >  	if (!rcu_rdp_is_offloaded(rdp))
> >  		return true;
> >  	rcu_lockdep_assert_cblist_protected(rdp);
> >  	rcu_nocb_bypass_lock(rdp);
> > -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> > +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> > +
> > +	if (flush_flags & FLUSH_BP_WAKE)
> > +		wake_nocb_gp(rdp, true);
> 
> Why the true above?
> 
> Also should we check if the wake up is really necessary (otherwise it means we
> force a wake up for all rdp's from rcu_barrier())?
> 
>        was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>        ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
>        if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist))
>        	  wake_nocb_gp(rdp, false);

You mean something like the following, right? Though I'm wondering whether it's
better to call wake_nocb_gp() from tree.c in entrain() and let that handle
the wake. That way, we can get rid of the extra FLUSH_BP flags as well and
let the flush callers deal with the wakeups.

Anyway, for testing this should be good...

---8<-----------------------

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index bd8f39ee2cd0..e3344c262672 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -382,15 +382,19 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j, unsigned long flush_flags)
 {
 	bool ret;
+	bool was_alldone;
 
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
+	if (flush_flags & FLUSH_BP_WAKE)
+		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+
 	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
 
-	if (flush_flags & FLUSH_BP_WAKE)
-		wake_nocb_gp(rdp, true);
+	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
+		wake_nocb_gp(rdp, false);
 
 	return ret;
 }
-- 
2.37.2.789.g6183377224-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 04/18] rcu: Fix late wakeup when flush of bypass cblist happens
  2022-09-06  9:48     ` Frederic Weisbecker
@ 2022-09-07  2:43       ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07  2:43 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 5:48 AM, Frederic Weisbecker wrote:
>>  		}
>> -		rcu_nocb_unlock_irqrestore(rdp, flags);
>> +
>> +		// The flush succeeded and we moved CBs into the ->cblist.
>> +		// However, the bypass timer might still be running. Wakeup the
> That part of the comment mentioning the bypass timer looks strange...
> 
> 
>> +		// GP thread by calling a helper with was_all_done set so that
>> +		// wake up happens (needed if main CB list was empty before).
> How about:
> 
>                 // The flush succeeded and we moved CBs into the regular list.
> 		// Don't wait for the wake up timer as it may be too far ahead.
> 		// Wake up the GP thread now instead if the cblist was empty
> 
> Thanks.
> 
> Other than that the patch looks good, thanks!
> 

I updated it accordingly and will send it for next revision, thank you!
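
In other words, the rule the comment describes, as a standalone toy (made-up
names): after a successful flush into the regular list, wake the GP kthread
right away if that list used to be empty, rather than waiting for the bypass
timer.

#include <stdbool.h>
#include <stdio.h>

struct toy_rdp {
	int	n_cblist;	/* CBs on the regular list            */
	int	n_bypass;	/* CBs parked on the bypass/lazy list */
	bool	gp_woken;
};

/* Move bypass CBs onto the regular list; wake the GP kthread right away    */
/* if the regular list used to be empty, instead of waiting for the timer.  */
static void toy_flush_bypass(struct toy_rdp *rdp)
{
	bool was_alldone = (rdp->n_cblist == 0);

	rdp->n_cblist += rdp->n_bypass;
	rdp->n_bypass = 0;

	if (was_alldone && rdp->n_cblist)
		rdp->gp_woken = true;
}

int main(void)
{
	struct toy_rdp rdp = { .n_cblist = 0, .n_bypass = 3 };

	toy_flush_bypass(&rdp);
	printf("cblist=%d bypass=%d gp_woken=%d\n",
	       rdp.n_cblist, rdp.n_bypass, rdp.gp_woken);
	return 0;
}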

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 19:11                 ` Frederic Weisbecker
@ 2022-09-07  2:56                   ` Joel Fernandes
  2022-09-07  9:56                     ` Frederic Weisbecker
  0 siblings, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07  2:56 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/6/2022 3:11 PM, Frederic Weisbecker wrote:
> On Tue, Sep 06, 2022 at 12:43:52PM -0400, Joel Fernandes wrote:
>> On 9/6/2022 12:38 PM, Joel Fernandes wrote:
>> Ah, now I know why I got confused. I *used* to flush the bypass list before when
>> !lazy CBs showed up. Paul suggested this is overkill. In this old overkill
>> method, I was missing a wake up which was likely causing the boot regression.
>> Forcing a wake up fixed that. Now in v5 I make it such that I don't do the flush
>> on a !lazy rate-limit.
>>
>> I am sorry for the confusion. Either way, in my defense this is just an extra
>> bit of code that I have to delete. This code is hard. I have mostly relied on a
>> test-driven development. But now thanks to this review and I am learning the
>> code more and more...
> 
> Yeah this code is hard.
> 
> Especially as it's possible to flush from both sides and queue the timer
> from both sides. And both sides read the bypass/lazy counter locklessly.
> But only call_rcu_*() can queue/increase the bypass size whereas only
> nocb_gp_wait() can cancel the timer. Phew!
> 

Haha, Indeed ;-)

> Among the many possible dances between rcu_nocb_try_bypass()
> and nocb_gp_wait(), I haven't found a way yet for the timer to be
> set to LAZY when it should be BYPASS (or other kind of accident such
> as an ignored callback).
> In the worst case we may arm an earlier timer than necessary
> (RCU_NOCB_WAKE_BYPASS instead of RCU_NOCB_WAKE_LAZY for example).
> 
> Famous last words...

Agreed.

On the issue of regressions with non-lazy things being treated as lazy, I was
thinking of adding a bounded-time check to:

[PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.

There, if a non-lazy CB takes an abnormally long time to execute (say it was
subject to a race condition), it would splat. This can be done because I am
tracking the queue time in the rcu_head in that patch.
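
Roughly, the check would look like this standalone sketch (made-up field and
threshold names; the real queue-time tracking is what that patch adds):

#include <stdbool.h>
#include <stdio.h>

/* Made-up per-CB record, standing in for the queue time kept in the rcu_head. */
struct traced_cb {
	unsigned long	queue_jiffies;	/* when the CB was queued      */
	bool		lazy;		/* queued via call_rcu_lazy()? */
};

/* Made-up bound: a non-lazy CB should never wait anywhere near the 10s lazy flush. */
#define NON_LAZY_BOUND_JIFFIES	(10 * 250)	/* ~10s at HZ=250 */

static void check_invoke_delay(const struct traced_cb *cb, unsigned long now)
{
	unsigned long waited = now - cb->queue_jiffies;

	if (!cb->lazy && waited > NON_LAZY_BOUND_JIFFIES)
		fprintf(stderr, "splat: non-lazy CB waited %lu jiffies\n", waited);
}

int main(void)
{
	struct traced_cb cb = { .queue_jiffies = 0, .lazy = false };

	check_invoke_delay(&cb, 3000);	/* 3000 > 2500, so this one splats */
	return 0;
}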

On another note, boot time regressions show up pretty quickly (at least on
ChromeOS) when non-lazy things become lazy, and so far, with the latest code,
it has fortunately been pretty well behaved.

Thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 18:21         ` Joel Fernandes
@ 2022-09-07  8:52           ` Uladzislau Rezki
  2022-09-07 15:23             ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Uladzislau Rezki @ 2022-09-07  8:52 UTC (permalink / raw)
  To: Joel Fernandes, Frederic Weisbecker
  Cc: Uladzislau Rezki, Frederic Weisbecker, rcu, linux-kernel,
	rushikesh.s.kadam, neeraj.iitr10, paulmck, rostedt, vineeth,
	boqun.feng

On Tue, Sep 06, 2022 at 02:21:46PM -0400, Joel Fernandes wrote:
> 
> 
> On 9/6/2022 2:16 PM, Uladzislau Rezki wrote:
> >> On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
> >>> On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
> >>>> Implement timer-based RCU lazy callback batching. The batch is flushed
> >>>> whenever a certain amount of time has passed, or the batch on a
> >>>> particular CPU grows too big. Also memory pressure will flush it in a
> >>>> future patch.
> >>>>
> >>>> To handle several corner cases automagically (such as rcu_barrier() and
> >>>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> >>>> length has the lazy CB length included in it. A separate lazy CB length
> >>>> counter is also introduced to keep track of the number of lazy CBs.
> >>>>
> >>>> Suggested-by: Paul McKenney <paulmck@kernel.org>
> >>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >>>> ---
> >>
> >> Here is the updated version of this patch for further testing and review.
> >> Paul, you could consider updating your test branch. I have tested it in
> >> ChromeOS as well, and rcuscale. The laziness and boot times are looking good.
> >> There was at least one bug that I fixed that got introduced with the moving
> >> of the length field to rcu_data.  Thanks a lot Frederic for the review
> >> comments.
> >>
> >> I will look at the rcu torture issue next... I suspect the length field issue
> >> may have been causing it.
> >>
> >> ---8<-----------------------
> >>
> >> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> >> Subject: [PATCH v6] rcu: Introduce call_rcu_lazy() API implementation
> >>
> >> Implement timer-based RCU lazy callback batching. The batch is flushed
> >> whenever a certain amount of time has passed, or the batch on a
> >> particular CPU grows too big. Also memory pressure will flush it in a
> >> future patch.
> >>
> >> To handle several corner cases automagically (such as rcu_barrier() and
> >> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
> >> length has the lazy CB length included in it. A separate lazy CB length
> >> counter is also introduced to keep track of the number of lazy CBs.
> >>
> >> v5->v6:
> >>
> >> [ Frederic Weisbecker: Program the lazy timer only if WAKE_NOT, since other
> >>   deferral levels wake much earlier so for those it is not needed. ]
> >>
> >> [ Frederic Weisbecker: Use flush flags to keep bypass API code clean. ]
> >>
> >> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
> >>
> >> Suggested-by: Paul McKenney <paulmck@kernel.org>
> >> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >> ---
> >>  include/linux/rcupdate.h |   6 ++
> >>  kernel/rcu/Kconfig       |   8 ++
> >>  kernel/rcu/rcu.h         |  11 +++
> >>  kernel/rcu/tree.c        | 133 +++++++++++++++++++----------
> >>  kernel/rcu/tree.h        |  17 +++-
> >>  kernel/rcu/tree_nocb.h   | 175 ++++++++++++++++++++++++++++++++-------
> >>  6 files changed, 269 insertions(+), 81 deletions(-)
> >>
> >> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> >> index 08605ce7379d..82e8a07e0856 100644
> >> --- a/include/linux/rcupdate.h
> >> +++ b/include/linux/rcupdate.h
> >> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
> >>  
> >>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
> >>  
> >> +#ifdef CONFIG_RCU_LAZY
> >> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
> >> +#else
> >> +#define call_rcu_lazy(head, func) call_rcu(head, func)
> >> +#endif
> >> +
> >>  /* Internal to kernel */
> >>  void rcu_init(void);
> >>  extern int rcu_scheduler_active;
> >> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
> >> index d471d22a5e21..3128d01427cb 100644
> >> --- a/kernel/rcu/Kconfig
> >> +++ b/kernel/rcu/Kconfig
> >> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
> >>  	  Say N here if you hate read-side memory barriers.
> >>  	  Take the default if you are unsure.
> >>  
> >> +config RCU_LAZY
> >> +	bool "RCU callback lazy invocation functionality"
> >> +	depends on RCU_NOCB_CPU
> >> +	default n
> >> +	help
> >> +	  To save power, batch RCU callbacks and flush after delay, memory
> >> +	  pressure or callback list growing too big.
> >> +
> >>  endmenu # "RCU Subsystem"
> >> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
> >> index be5979da07f5..94675f14efe8 100644
> >> --- a/kernel/rcu/rcu.h
> >> +++ b/kernel/rcu/rcu.h
> >> @@ -474,6 +474,14 @@ enum rcutorture_type {
> >>  	INVALID_RCU_FLAVOR
> >>  };
> >>  
> >> +#if defined(CONFIG_RCU_LAZY)
> >> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
> >> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
> >> +#else
> >> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
> >> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
> >> +#endif
> >> +
> >>  #if defined(CONFIG_TREE_RCU)
> >>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
> >>  			    unsigned long *gp_seq);
> >> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
> >>  			       unsigned long c_old,
> >>  			       unsigned long c);
> >>  void rcu_gp_set_torture_wait(int duration);
> >> +void rcu_force_call_rcu_to_lazy(bool force);
> >> +
> >>  #else
> >>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
> >>  					  int *flags, unsigned long *gp_seq)
> >> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
> >>  	do { } while (0)
> >>  #endif
> >>  static inline void rcu_gp_set_torture_wait(int duration) { }
> >> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
> >>  #endif
> >>  
> >>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> index 9fe581be8696..dbd25b8c080e 100644
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
> >>  	raw_spin_unlock_rcu_node(rnp);
> >>  }
> >>  
> >> -/**
> >> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
> >> - * @head: structure to be used for queueing the RCU updates.
> >> - * @func: actual callback function to be invoked after the grace period
> >> - *
> >> - * The callback function will be invoked some time after a full grace
> >> - * period elapses, in other words after all pre-existing RCU read-side
> >> - * critical sections have completed.  However, the callback function
> >> - * might well execute concurrently with RCU read-side critical sections
> >> - * that started after call_rcu() was invoked.
> >> - *
> >> - * RCU read-side critical sections are delimited by rcu_read_lock()
> >> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
> >> - * v5.0 and later, regions of code across which interrupts, preemption,
> >> - * or softirqs have been disabled also serve as RCU read-side critical
> >> - * sections.  This includes hardware interrupt handlers, softirq handlers,
> >> - * and NMI handlers.
> >> - *
> >> - * Note that all CPUs must agree that the grace period extended beyond
> >> - * all pre-existing RCU read-side critical section.  On systems with more
> >> - * than one CPU, this means that when "func()" is invoked, each CPU is
> >> - * guaranteed to have executed a full memory barrier since the end of its
> >> - * last RCU read-side critical section whose beginning preceded the call
> >> - * to call_rcu().  It also means that each CPU executing an RCU read-side
> >> - * critical section that continues beyond the start of "func()" must have
> >> - * executed a memory barrier after the call_rcu() but before the beginning
> >> - * of that RCU read-side critical section.  Note that these guarantees
> >> - * include CPUs that are offline, idle, or executing in user mode, as
> >> - * well as CPUs that are executing in the kernel.
> >> - *
> >> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> >> - * resulting RCU callback function "func()", then both CPU A and CPU B are
> >> - * guaranteed to execute a full memory barrier during the time interval
> >> - * between the call to call_rcu() and the invocation of "func()" -- even
> >> - * if CPU A and CPU B are the same CPU (but again only if the system has
> >> - * more than one CPU).
> >> - *
> >> - * Implementation of these memory-ordering guarantees is described here:
> >> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> >> - */
> >> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
> >> +static void
> >> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
> >>  {
> >>  	static atomic_t doublefrees;
> >>  	unsigned long flags;
> >> @@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
> >>  		trace_rcu_callback(rcu_state.name, head,
> >>  				   rcu_segcblist_n_cbs(&rdp->cblist));
> >>  
> >> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
> >> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
> >>  		return; // Enqueued onto ->nocb_bypass, so just leave.
> >>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
> >>  	rcu_segcblist_enqueue(&rdp->cblist, head);
> >> @@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
> >>  		local_irq_restore(flags);
> >>  	}
> >>  }
> >> -EXPORT_SYMBOL_GPL(call_rcu);
> >>  
> >> +#ifdef CONFIG_RCU_LAZY
> >> +/**
> >> + * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
> >> + * @head: structure to be used for queueing the RCU updates.
> >> + * @func: actual callback function to be invoked after the grace period
> >> + *
> >> + * The callback function will be invoked some time after a full grace
> >> + * period elapses, in other words after all pre-existing RCU read-side
> >> + * critical sections have completed.
> >> + *
> >> + * Use this API instead of call_rcu() if you don't mind the callback being
> >> + * invoked after very long periods of time on systems without memory pressure
> >> + * and on systems which are lightly loaded or mostly idle.
> >> + *
> >> + * Other than the extra delay in callbacks being invoked, this function is
> >> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
> >> + * details about memory ordering and other functionality.
> >> + */
> >> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
> >> +{
> >> +	return __call_rcu_common(head, func, true);
> >> +}
> >> +EXPORT_SYMBOL_GPL(call_rcu_lazy);
> >> +#endif
> >> +
> >> +static bool force_call_rcu_to_lazy;
> >> +
> >> +void rcu_force_call_rcu_to_lazy(bool force)
> >> +{
> >> +	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
> >> +		WRITE_ONCE(force_call_rcu_to_lazy, force);
> >> +}
> >> +EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
> >> +
> >> +/**
> >> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
> >> + * @head: structure to be used for queueing the RCU updates.
> >> + * @func: actual callback function to be invoked after the grace period
> >> + *
> >> + * The callback function will be invoked some time after a full grace
> >> + * period elapses, in other words after all pre-existing RCU read-side
> >> + * critical sections have completed.  However, the callback function
> >> + * might well execute concurrently with RCU read-side critical sections
> >> + * that started after call_rcu() was invoked.
> >> + *
> >> + * RCU read-side critical sections are delimited by rcu_read_lock()
> >> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
> >> + * v5.0 and later, regions of code across which interrupts, preemption,
> >> + * or softirqs have been disabled also serve as RCU read-side critical
> >> + * sections.  This includes hardware interrupt handlers, softirq handlers,
> >> + * and NMI handlers.
> >> + *
> >> + * Note that all CPUs must agree that the grace period extended beyond
> >> + * all pre-existing RCU read-side critical section.  On systems with more
> >> + * than one CPU, this means that when "func()" is invoked, each CPU is
> >> + * guaranteed to have executed a full memory barrier since the end of its
> >> + * last RCU read-side critical section whose beginning preceded the call
> >> + * to call_rcu().  It also means that each CPU executing an RCU read-side
> >> + * critical section that continues beyond the start of "func()" must have
> >> + * executed a memory barrier after the call_rcu() but before the beginning
> >> + * of that RCU read-side critical section.  Note that these guarantees
> >> + * include CPUs that are offline, idle, or executing in user mode, as
> >> + * well as CPUs that are executing in the kernel.
> >> + *
> >> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
> >> + * resulting RCU callback function "func()", then both CPU A and CPU B are
> >> + * guaranteed to execute a full memory barrier during the time interval
> >> + * between the call to call_rcu() and the invocation of "func()" -- even
> >> + * if CPU A and CPU B are the same CPU (but again only if the system has
> >> + * more than one CPU).
> >> + *
> >> + * Implementation of these memory-ordering guarantees is described here:
> >> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
> >> + */
> >> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
> >> +{
> >> +	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
> >> +}
> >> +EXPORT_SYMBOL_GPL(call_rcu);
> >>  
> >>  /* Maximum number of jiffies to wait before draining a batch. */
> >>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
> >> @@ -3904,7 +3943,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
> >>  	rdp->barrier_head.func = rcu_barrier_callback;
> >>  	debug_rcu_head_queue(&rdp->barrier_head);
> >>  	rcu_nocb_lock(rdp);
> >> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
> >> +        /*
> >> +         * Flush the bypass list, but also wake up the GP thread as otherwise
> >> +         * bypass/lazy CBs may not be noticed, and can cause really long delays!
> >> +         */
> >> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
> >>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
> >>  		atomic_inc(&rcu_state.barrier_cpu_count);
> >>  	} else {
> >> @@ -4325,7 +4368,7 @@ void rcutree_migrate_callbacks(int cpu)
> >>  	my_rdp = this_cpu_ptr(&rcu_data);
> >>  	my_rnp = my_rdp->mynode;
> >>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
> >> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
> >> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
> >>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
> >>  	/* Leverage recent GPs and set GP for new callbacks. */
> >>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
> >> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> >> index d4a97e40ea9c..361c41d642c7 100644
> >> --- a/kernel/rcu/tree.h
> >> +++ b/kernel/rcu/tree.h
> >> @@ -263,14 +263,16 @@ struct rcu_data {
> >>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
> >>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
> >>  
> >> +	long lazy_len;			/* Length of buffered lazy callbacks. */
> >>  	int cpu;
> >>  };
> >>  
> >>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
> >>  #define RCU_NOCB_WAKE_NOT	0
> >>  #define RCU_NOCB_WAKE_BYPASS	1
> >> -#define RCU_NOCB_WAKE		2
> >> -#define RCU_NOCB_WAKE_FORCE	3
> >> +#define RCU_NOCB_WAKE_LAZY	2
> >> +#define RCU_NOCB_WAKE		3
> >> +#define RCU_NOCB_WAKE_FORCE	4
> >>  
> >>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> >>  					/* For jiffies_till_first_fqs and */
> >> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
> >>  static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
> >>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
> >>  static void rcu_init_one_nocb(struct rcu_node *rnp);
> >> +
> >> +#define FLUSH_BP_NONE 0
> >> +/* Is the CB being enqueued after the flush, a lazy CB? */
> >> +#define FLUSH_BP_LAZY BIT(0)
> >> +/* Wake up nocb-GP thread after flush? */
> >> +#define FLUSH_BP_WAKE BIT(1)
> >>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >> -				  unsigned long j);
> >> +				  unsigned long j, unsigned long flush_flags);
> >>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >> -				bool *was_alldone, unsigned long flags);
> >> +				bool *was_alldone, unsigned long flags,
> >> +				bool lazy);
> >>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
> >>  				 unsigned long flags);
> >>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
> >> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> >> index 4dc86274b3e8..b201606f7c4f 100644
> >> --- a/kernel/rcu/tree_nocb.h
> >> +++ b/kernel/rcu/tree_nocb.h
> >> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
> >>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
> >>  }
> >>  
> >> +/*
> >> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
> >> + * can elapse before lazy callbacks are flushed. Lazy callbacks
> >> + * could be flushed much earlier for a number of other reasons
> >> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
> >> + * left unsubmitted to RCU after those many jiffies.
> >> + */
> >> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
> >> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
> >> +
> >> +#ifdef CONFIG_RCU_LAZY
> >> +// To be called only from test code.
> >> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
> >> +{
> >> +	jiffies_till_flush = jif;
> >> +}
> >> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
> >> +
> >> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
> >> +{
> >> +	return jiffies_till_flush;
> >> +}
> >> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
> >> +#endif
> >> +
> >>  /*
> >>   * Arrange to wake the GP kthread for this NOCB group at some future
> >>   * time when it is safe to do so.
> >> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
> >>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
> >>  
> >>  	/*
> >> -	 * Bypass wakeup overrides previous deferments. In case
> >> -	 * of callback storm, no need to wake up too early.
> >> +	 * Bypass wakeup overrides previous deferments. In case of
> >> +	 * callback storm, no need to wake up too early.
> >>  	 */
> >> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
> >> +	if (waketype == RCU_NOCB_WAKE_LAZY
> >> +		&& READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
> >> +		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
> >> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> >> +	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
> >>  		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
> >>  		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
> >>  	} else {
> >>
> > Joel, i have a question here. I see that lazy callback just makes the GP
> > start later where a regular call_rcu() API will instead cut it back.
> > Could you please clarify how it solves the sequence like:
> > 
> > CPU:
> > 
> > 1) task A -> call_rcu_lazy();
> > 2) task B -> call_rcu();
> > 
> > so lazily decision about running GP later a task B overrides to initiate it sooner? 
> 
> Is your question that task B will make task A's CB run sooner? Yes that's right.
> The reason is that the point of call_rcu_lazy() is to delay GP start as much as
> possible to keep system quiet. However, if a call_rcu() comes in any way, we
> have to start a GP soon. So we might as well take advantage of that for the lazy
> one as we are paying the full price for the new GP.
> 
> It is kind of like RCU's amortizing behavior in that sense, however we put that
> into overdrive to be greedy for power.
> 
> Does that answer your question?
> 
OK, so I understood it correctly then :) IMHO, the problem with such an approach
is that we may end up moving the RCU core forward in a non-lazy way even though
we queue lazily and one user does not.

According to my observations, our setup suffers from many wake-ups across the
CPUs, and mixing the non-lazy way with the lazy way will just be converted into
the regular pattern.

I think adding some kind of high_water_mark on "entry" would solve it. Any thoughts here?

Joel, do you see that v6 makes the system more idle from an RCU perspective on the Intel SoC?

From my side I will port v6 onto the 5.10 kernel and run some tests today.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07  0:06         ` Joel Fernandes
@ 2022-09-07  9:40           ` Frederic Weisbecker
  2022-09-07 13:44             ` Joel Fernandes
  2022-09-21 23:52             ` Joel Fernandes
  0 siblings, 2 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-07  9:40 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Wed, Sep 07, 2022 at 12:06:26AM +0000, Joel Fernandes wrote:
> > > @@ -326,13 +372,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > >   * Note that this function always returns true if rhp is NULL.
> > >   */
> > >  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > > -				  unsigned long j)
> > > +				  unsigned long j, unsigned long flush_flags)
> > >  {
> > > +	bool ret;
> > > +
> > >  	if (!rcu_rdp_is_offloaded(rdp))
> > >  		return true;
> > >  	rcu_lockdep_assert_cblist_protected(rdp);
> > >  	rcu_nocb_bypass_lock(rdp);
> > > -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> > > +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> > > +
> > > +	if (flush_flags & FLUSH_BP_WAKE)
> > > +		wake_nocb_gp(rdp, true);
> > 
> > Why the true above?
> > 
> > Also should we check if the wake up is really necessary (otherwise it means we
> > force a wake up for all rdp's from rcu_barrier())?
> > 
> >        was_alldone = rcu_segcblist_pend_cbs(&rdp->cblist);
> >        ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> >        if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist))
> >        	  wake_nocb_gp(rdp, false);
> 
> You mean something like the following right? Though I'm thinking if its
> better to call wake_nocb_gp() from tree.c in entrain() and let that handle
> the wake. That way, we can get rid of the extra FLUSH_BP flags as well and
> let the flush callers deal with the wakeups..

Ah yes that could make sense if only one caller cares.

> 
> Anyway, for testing this should be good...
> 
> ---8<-----------------------
> 
> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> index bd8f39ee2cd0..e3344c262672 100644
> --- a/kernel/rcu/tree_nocb.h
> +++ b/kernel/rcu/tree_nocb.h
> @@ -382,15 +382,19 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>  				  unsigned long j, unsigned long flush_flags)
>  {
>  	bool ret;
> +	bool was_alldone;
>  
>  	if (!rcu_rdp_is_offloaded(rdp))
>  		return true;
>  	rcu_lockdep_assert_cblist_protected(rdp);
>  	rcu_nocb_bypass_lock(rdp);
> +	if (flush_flags & FLUSH_BP_WAKE)
> +		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> +

You can check that outside bypass lock (but you still need nocb_lock).

>  	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
>  
> -	if (flush_flags & FLUSH_BP_WAKE)
> -		wake_nocb_gp(rdp, true);
> +	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
> +		wake_nocb_gp(rdp, false);

That doesn't check if the bypass list was empty.
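(For illustration, a minimal sketch of folding that extra check in, using only
helpers already visible in this thread; the eventual v6 fix further down in the
thread takes a slightly different angle and compares against the lazy length
instead:)

	/* Snapshot state before the flush, under the nocb lock. */
	was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
	bypass_empty = !rcu_cblist_n_cbs(&rdp->nocb_bypass);

	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);

	/* Wake only if the flush actually made callbacks pending. */
	if ((flush_flags & FLUSH_BP_WAKE) && was_alldone && !bypass_empty)
		wake_nocb_gp(rdp, false);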

Thanks.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07  2:56                   ` Joel Fernandes
@ 2022-09-07  9:56                     ` Frederic Weisbecker
  0 siblings, 0 replies; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-07  9:56 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 10:56:01PM -0400, Joel Fernandes wrote:
> On the issue of regressions with non-lazy things being treated as lazy, I was
> thinking of adding a bounded-time-check to:
> 
> [PATCH v5 08/18] rcu: Add per-CB tracing for queuing, flush and invocation.
> 
> Where, if a non-lazy CB takes an abnormally long time to execute (say it was
> subject to a race-condition), it would splat. This can be done because I am
> tracking the queue-time in the rcu_head in that patch.
> 
> On another note, boot time regressions show up pretty quickly (at least on
> ChromeOS) when non-lazy things become lazy and so far with the latest code it
> has fortunately been pretty well behaved.

Makes sense. We definitely need some sort of detection for delayed non-lazy
callbacks.
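
(For illustration, a rough sketch of what such a detection could look like at
callback-invocation time. The names below are placeholders: queue_jiffies stands
for the per-CB queueing timestamp recorded by the tracing patch, and was_lazy for
whether the CB came in via call_rcu_lazy().)

	/* Splat if a non-lazy callback sat queued far longer than the lazy
	 * flush interval would ever allow. */
	if (!was_lazy && time_after(jiffies, queue_jiffies + 2 * LAZY_FLUSH_JIFFIES))
		WARN_ONCE(1, "non-lazy RCU callback %ps delayed %lu jiffies\n",
			  rhp->func, jiffies - queue_jiffies);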

Thanks!

> 
> Thanks,
> 
>  - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-06 16:15         ` Joel Fernandes
  2022-09-06 16:31           ` Joel Fernandes
@ 2022-09-07 10:03           ` Frederic Weisbecker
  2022-09-07 14:01             ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-07 10:03 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Tue, Sep 06, 2022 at 12:15:19PM -0400, Joel Fernandes wrote:
> >> +
> >> +	// We had CBs in the bypass list before. There is nothing else to do if:
> >> +	// There were only non-lazy CBs before, in this case, the bypass timer
> > 
> > Kind of misleading. I would replace "There were only non-lazy CBs before" with
> > "There was at least one non-lazy CBs before".
> 
> I really mean "There were only non-lazy CBs ever queued in the bypass list
> before". That's the bypass_is_lazy variable. So I did not fully understand your
> suggested comment change.

I may well be missing something but to me it seems that:

bypass_is_lazy = all bypass callbacks are lazy
!bypass_is_lazy = there is at least one non-lazy bypass callback

And indeed as long as there is at least one non-lazy callback, we don't
want to rely on the LAZY timer.

Am I overlooking something?
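
(Put differently, in code: a sketch only, relying on the fact that the patch's
lazy_len counts the lazy CBs currently sitting in the bypass list.)

	long ncbs = rcu_cblist_n_cbs(&rdp->nocb_bypass);

	/* All bypass CBs are lazy iff the lazy count covers the whole list;
	 * read under the appropriate nocb/bypass locking. */
	bool bypass_is_lazy = (ncbs == rdp->lazy_len);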

> 
> >> +	// or GP-thread will handle the CBs including any new lazy ones.
> >> +	// Or, the new CB is lazy and the old bypass-CBs were also lazy. In this
> >> +	// case the old lazy timer would have been setup. When that expires,
> >> +	// the new lazy one will be handled.
> >> +	if (ncbs && (!bypass_is_lazy || lazy)) {
> >>  		local_irq_restore(flags);
> >>  	} else {
> >>  		// No-CBs GP kthread might be indefinitely asleep, if so, wake.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07  9:40           ` Frederic Weisbecker
@ 2022-09-07 13:44             ` Joel Fernandes
  2022-09-07 15:38               ` Frederic Weisbecker
  2022-09-21 23:52             ` Joel Fernandes
  1 sibling, 1 reply; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07 13:44 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

Hi Frederic,

On 9/7/2022 5:40 AM, Frederic Weisbecker wrote:
>>
>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>> index bd8f39ee2cd0..e3344c262672 100644
>> --- a/kernel/rcu/tree_nocb.h
>> +++ b/kernel/rcu/tree_nocb.h
>> @@ -382,15 +382,19 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>  				  unsigned long j, unsigned long flush_flags)
>>  {
>>  	bool ret;
>> +	bool was_alldone;
>>  
>>  	if (!rcu_rdp_is_offloaded(rdp))
>>  		return true;
>>  	rcu_lockdep_assert_cblist_protected(rdp);
>>  	rcu_nocb_bypass_lock(rdp);
>> +	if (flush_flags & FLUSH_BP_WAKE)
>> +		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>> +
> You can check that outside bypass lock (but you still need nocb_lock).

Right, ok. I can make it outside the bypass lock, and the nocb_lock is implied
by rcu_lockdep_assert_cblist_protected().

> 
>>  	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
>>  
>> -	if (flush_flags & FLUSH_BP_WAKE)
>> -		wake_nocb_gp(rdp, true);
>> +	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
>> +		wake_nocb_gp(rdp, false);
> That doesn't check if the bypass list was empty.

thanks, will add your suggested optimization, however I have a general question:

For the case where the bypass list is empty, where does rcu_barrier() do a wake
up of the nocb GP thread after entrain()?

I don't see a call to __call_rcu_nocb_wake() like rcutree_migrate_callbacks()
does. Is the wake up done in some other path?

thanks,

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07 10:03           ` Frederic Weisbecker
@ 2022-09-07 14:01             ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07 14:01 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/7/2022 6:03 AM, Frederic Weisbecker wrote:
> On Tue, Sep 06, 2022 at 12:15:19PM -0400, Joel Fernandes wrote:
>>>> +
>>>> +	// We had CBs in the bypass list before. There is nothing else to do if:
>>>> +	// There were only non-lazy CBs before, in this case, the bypass timer
>>>
>>> Kind of misleading. I would replace "There were only non-lazy CBs before" with
>>> "There was at least one non-lazy CBs before".
>>
>> I really mean "There were only non-lazy CBs ever queued in the bypass list
>> before". That's the bypass_is_lazy variable. So I did not fully understand your
>> suggested comment change.
> 
> I may well be missing something but to me it seems that:
> 
> bypass_is_lazy = all bypass callbacks are lazy
> !bypass_is_lazy = there is at least one non-lazy bypass callback
> 
> And indeed as long as there is at least one non-lazy callback, we don't
> want to rely on the LAZY timer.
> 
> Am I overlooking something?

No, you are not overlooking anything, and you are right that I may need to change
the comment.

To clarify my intent, a wake up or timer adjustment needs to be done only if:

1. Bypass list was fully empty before (this is the first bypass list entry).

Or both the below conditions are met:

1. Bypass list had only lazy CBs before.

2. The new CB is non-lazy.

Instead of saying, "nothing needs to be done if...", I will change it to:

        // A wake up of the grace period kthread or timer adjustment needs to
        // be done only if:
        // 1. Bypass list was empty before (this is the first bypass queue).
        //      Or, both the below conditions are met:
        // 1. Bypass list had only lazy CBs before.
        // 2. The new CB is non-lazy.

That sounds less confusing...
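
(For reference, the same condition expressed as code comes out as the following
sketch, mirroring the hunk under discussion; wake_or_adjust_timer() is just a
stand-in for the actual wake/timer path, not a real helper:)

	if (!ncbs || (bypass_is_lazy && !lazy)) {
		/* First CB ever on the bypass list, or a non-lazy CB landing
		 * on an all-lazy bypass list: wake the nocb GP kthread or
		 * rearm its timer accordingly. */
		wake_or_adjust_timer();
	} else {
		/* A suitable timer or wakeup is already pending: nothing
		 * more to do. */
		local_irq_restore(flags);
	}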

Or, I can just make the edit you suggested... let me know either way!

Thanks!

 - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07  8:52           ` Uladzislau Rezki
@ 2022-09-07 15:23             ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07 15:23 UTC (permalink / raw)
  To: Uladzislau Rezki, Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, neeraj.iitr10, paulmck,
	rostedt, vineeth, boqun.feng

Hi Vlad,

On 9/7/2022 4:52 AM, Uladzislau Rezki wrote:
> On Tue, Sep 06, 2022 at 02:21:46PM -0400, Joel Fernandes wrote:
>>
>>
>> On 9/6/2022 2:16 PM, Uladzislau Rezki wrote:
>>>> On Fri, Sep 02, 2022 at 05:21:32PM +0200, Frederic Weisbecker wrote:
>>>>> On Thu, Sep 01, 2022 at 10:17:08PM +0000, Joel Fernandes (Google) wrote:
>>>>>> Implement timer-based RCU lazy callback batching. The batch is flushed
>>>>>> whenever a certain amount of time has passed, or the batch on a
>>>>>> particular CPU grows too big. Also memory pressure will flush it in a
>>>>>> future patch.
>>>>>>
>>>>>> To handle several corner cases automagically (such as rcu_barrier() and
>>>>>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
>>>>>> length has the lazy CB length included in it. A separate lazy CB length
>>>>>> counter is also introduced to keep track of the number of lazy CBs.
>>>>>>
>>>>>> Suggested-by: Paul McKenney <paulmck@kernel.org>
>>>>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>>>> ---
>>>>
>>>> Here is the updated version of this patch for further testing and review.
>>>> Paul, you could consider updating your test branch. I have tested it in
>>>> ChromeOS as well, and rcuscale. The laziness and boot times are looking good.
>>>> There was at least one bug that I fixed that got introduced with the moving
>>>> of the length field to rcu_data.  Thanks a lot Frederic for the review
>>>> comments.
>>>>
>>>> I will look at the rcu torture issue next... I suspect the length field issue
>>>> may have been causing it.
>>>>
>>>> ---8<-----------------------
>>>>
>>>> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
>>>> Subject: [PATCH v6] rcu: Introduce call_rcu_lazy() API implementation
>>>>
>>>> Implement timer-based RCU lazy callback batching. The batch is flushed
>>>> whenever a certain amount of time has passed, or the batch on a
>>>> particular CPU grows too big. Also memory pressure will flush it in a
>>>> future patch.
>>>>
>>>> To handle several corner cases automagically (such as rcu_barrier() and
>>>> hotplug), we re-use bypass lists to handle lazy CBs. The bypass list
>>>> length has the lazy CB length included in it. A separate lazy CB length
>>>> counter is also introduced to keep track of the number of lazy CBs.
>>>>
>>>> v5->v6:
>>>>
>>>> [ Frederic Weisbec: Program the lazy timer only if WAKE_NOT, since other
>>>>   deferral levels wake much earlier so for those it is not needed. ]
>>>>
>>>> [ Frederic Weisbec: Use flush flags to keep bypass API code clean. ]
>>>>
>>>> [ Joel: Fix issue where I was not resetting lazy_len after moving it to rdp ]
>>>>
>>>> Suggested-by: Paul McKenney <paulmck@kernel.org>
>>>> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>>>> ---
>>>>  include/linux/rcupdate.h |   6 ++
>>>>  kernel/rcu/Kconfig       |   8 ++
>>>>  kernel/rcu/rcu.h         |  11 +++
>>>>  kernel/rcu/tree.c        | 133 +++++++++++++++++++----------
>>>>  kernel/rcu/tree.h        |  17 +++-
>>>>  kernel/rcu/tree_nocb.h   | 175 ++++++++++++++++++++++++++++++++-------
>>>>  6 files changed, 269 insertions(+), 81 deletions(-)
>>>>
>>>> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
>>>> index 08605ce7379d..82e8a07e0856 100644
>>>> --- a/include/linux/rcupdate.h
>>>> +++ b/include/linux/rcupdate.h
>>>> @@ -108,6 +108,12 @@ static inline int rcu_preempt_depth(void)
>>>>  
>>>>  #endif /* #else #ifdef CONFIG_PREEMPT_RCU */
>>>>  
>>>> +#ifdef CONFIG_RCU_LAZY
>>>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func);
>>>> +#else
>>>> +#define call_rcu_lazy(head, func) call_rcu(head, func)
>>>> +#endif
>>>> +
>>>>  /* Internal to kernel */
>>>>  void rcu_init(void);
>>>>  extern int rcu_scheduler_active;
>>>> diff --git a/kernel/rcu/Kconfig b/kernel/rcu/Kconfig
>>>> index d471d22a5e21..3128d01427cb 100644
>>>> --- a/kernel/rcu/Kconfig
>>>> +++ b/kernel/rcu/Kconfig
>>>> @@ -311,4 +311,12 @@ config TASKS_TRACE_RCU_READ_MB
>>>>  	  Say N here if you hate read-side memory barriers.
>>>>  	  Take the default if you are unsure.
>>>>  
>>>> +config RCU_LAZY
>>>> +	bool "RCU callback lazy invocation functionality"
>>>> +	depends on RCU_NOCB_CPU
>>>> +	default n
>>>> +	help
>>>> +	  To save power, batch RCU callbacks and flush after delay, memory
>>>> +	  pressure or callback list growing too big.
>>>> +
>>>>  endmenu # "RCU Subsystem"
>>>> diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
>>>> index be5979da07f5..94675f14efe8 100644
>>>> --- a/kernel/rcu/rcu.h
>>>> +++ b/kernel/rcu/rcu.h
>>>> @@ -474,6 +474,14 @@ enum rcutorture_type {
>>>>  	INVALID_RCU_FLAVOR
>>>>  };
>>>>  
>>>> +#if defined(CONFIG_RCU_LAZY)
>>>> +unsigned long rcu_lazy_get_jiffies_till_flush(void);
>>>> +void rcu_lazy_set_jiffies_till_flush(unsigned long j);
>>>> +#else
>>>> +static inline unsigned long rcu_lazy_get_jiffies_till_flush(void) { return 0; }
>>>> +static inline void rcu_lazy_set_jiffies_till_flush(unsigned long j) { }
>>>> +#endif
>>>> +
>>>>  #if defined(CONFIG_TREE_RCU)
>>>>  void rcutorture_get_gp_data(enum rcutorture_type test_type, int *flags,
>>>>  			    unsigned long *gp_seq);
>>>> @@ -483,6 +491,8 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>>>  			       unsigned long c_old,
>>>>  			       unsigned long c);
>>>>  void rcu_gp_set_torture_wait(int duration);
>>>> +void rcu_force_call_rcu_to_lazy(bool force);
>>>> +
>>>>  #else
>>>>  static inline void rcutorture_get_gp_data(enum rcutorture_type test_type,
>>>>  					  int *flags, unsigned long *gp_seq)
>>>> @@ -501,6 +511,7 @@ void do_trace_rcu_torture_read(const char *rcutorturename,
>>>>  	do { } while (0)
>>>>  #endif
>>>>  static inline void rcu_gp_set_torture_wait(int duration) { }
>>>> +static inline void rcu_force_call_rcu_to_lazy(bool force) { }
>>>>  #endif
>>>>  
>>>>  #if IS_ENABLED(CONFIG_RCU_TORTURE_TEST) || IS_MODULE(CONFIG_RCU_TORTURE_TEST)
>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>> index 9fe581be8696..dbd25b8c080e 100644
>>>> --- a/kernel/rcu/tree.c
>>>> +++ b/kernel/rcu/tree.c
>>>> @@ -2728,47 +2728,8 @@ static void check_cb_ovld(struct rcu_data *rdp)
>>>>  	raw_spin_unlock_rcu_node(rnp);
>>>>  }
>>>>  
>>>> -/**
>>>> - * call_rcu() - Queue an RCU callback for invocation after a grace period.
>>>> - * @head: structure to be used for queueing the RCU updates.
>>>> - * @func: actual callback function to be invoked after the grace period
>>>> - *
>>>> - * The callback function will be invoked some time after a full grace
>>>> - * period elapses, in other words after all pre-existing RCU read-side
>>>> - * critical sections have completed.  However, the callback function
>>>> - * might well execute concurrently with RCU read-side critical sections
>>>> - * that started after call_rcu() was invoked.
>>>> - *
>>>> - * RCU read-side critical sections are delimited by rcu_read_lock()
>>>> - * and rcu_read_unlock(), and may be nested.  In addition, but only in
>>>> - * v5.0 and later, regions of code across which interrupts, preemption,
>>>> - * or softirqs have been disabled also serve as RCU read-side critical
>>>> - * sections.  This includes hardware interrupt handlers, softirq handlers,
>>>> - * and NMI handlers.
>>>> - *
>>>> - * Note that all CPUs must agree that the grace period extended beyond
>>>> - * all pre-existing RCU read-side critical section.  On systems with more
>>>> - * than one CPU, this means that when "func()" is invoked, each CPU is
>>>> - * guaranteed to have executed a full memory barrier since the end of its
>>>> - * last RCU read-side critical section whose beginning preceded the call
>>>> - * to call_rcu().  It also means that each CPU executing an RCU read-side
>>>> - * critical section that continues beyond the start of "func()" must have
>>>> - * executed a memory barrier after the call_rcu() but before the beginning
>>>> - * of that RCU read-side critical section.  Note that these guarantees
>>>> - * include CPUs that are offline, idle, or executing in user mode, as
>>>> - * well as CPUs that are executing in the kernel.
>>>> - *
>>>> - * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
>>>> - * resulting RCU callback function "func()", then both CPU A and CPU B are
>>>> - * guaranteed to execute a full memory barrier during the time interval
>>>> - * between the call to call_rcu() and the invocation of "func()" -- even
>>>> - * if CPU A and CPU B are the same CPU (but again only if the system has
>>>> - * more than one CPU).
>>>> - *
>>>> - * Implementation of these memory-ordering guarantees is described here:
>>>> - * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
>>>> - */
>>>> -void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>>> +static void
>>>> +__call_rcu_common(struct rcu_head *head, rcu_callback_t func, bool lazy)
>>>>  {
>>>>  	static atomic_t doublefrees;
>>>>  	unsigned long flags;
>>>> @@ -2818,7 +2779,7 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>>>  		trace_rcu_callback(rcu_state.name, head,
>>>>  				   rcu_segcblist_n_cbs(&rdp->cblist));
>>>>  
>>>> -	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags))
>>>> +	if (rcu_nocb_try_bypass(rdp, head, &was_alldone, flags, lazy))
>>>>  		return; // Enqueued onto ->nocb_bypass, so just leave.
>>>>  	// If no-CBs CPU gets here, rcu_nocb_try_bypass() acquired ->nocb_lock.
>>>>  	rcu_segcblist_enqueue(&rdp->cblist, head);
>>>> @@ -2833,8 +2794,86 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>>>  		local_irq_restore(flags);
>>>>  	}
>>>>  }
>>>> -EXPORT_SYMBOL_GPL(call_rcu);
>>>>  
>>>> +#ifdef CONFIG_RCU_LAZY
>>>> +/**
>>>> + * call_rcu_lazy() - Lazily queue RCU callback for invocation after grace period.
>>>> + * @head: structure to be used for queueing the RCU updates.
>>>> + * @func: actual callback function to be invoked after the grace period
>>>> + *
>>>> + * The callback function will be invoked some time after a full grace
>>>> + * period elapses, in other words after all pre-existing RCU read-side
>>>> + * critical sections have completed.
>>>> + *
>>>> + * Use this API instead of call_rcu() if you don't mind the callback being
>>>> + * invoked after very long periods of time on systems without memory pressure
>>>> + * and on systems which are lightly loaded or mostly idle.
>>>> + *
>>>> + * Other than the extra delay in callbacks being invoked, this function is
>>>> + * identical to, and reuses call_rcu()'s logic. Refer to call_rcu() for more
>>>> + * details about memory ordering and other functionality.
>>>> + */
>>>> +void call_rcu_lazy(struct rcu_head *head, rcu_callback_t func)
>>>> +{
>>>> +	return __call_rcu_common(head, func, true);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(call_rcu_lazy);
>>>> +#endif
>>>> +
>>>> +static bool force_call_rcu_to_lazy;
>>>> +
>>>> +void rcu_force_call_rcu_to_lazy(bool force)
>>>> +{
>>>> +	if (IS_ENABLED(CONFIG_RCU_SCALE_TEST))
>>>> +		WRITE_ONCE(force_call_rcu_to_lazy, force);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(rcu_force_call_rcu_to_lazy);
>>>> +
>>>> +/**
>>>> + * call_rcu() - Queue an RCU callback for invocation after a grace period.
>>>> + * @head: structure to be used for queueing the RCU updates.
>>>> + * @func: actual callback function to be invoked after the grace period
>>>> + *
>>>> + * The callback function will be invoked some time after a full grace
>>>> + * period elapses, in other words after all pre-existing RCU read-side
>>>> + * critical sections have completed.  However, the callback function
>>>> + * might well execute concurrently with RCU read-side critical sections
>>>> + * that started after call_rcu() was invoked.
>>>> + *
>>>> + * RCU read-side critical sections are delimited by rcu_read_lock()
>>>> + * and rcu_read_unlock(), and may be nested.  In addition, but only in
>>>> + * v5.0 and later, regions of code across which interrupts, preemption,
>>>> + * or softirqs have been disabled also serve as RCU read-side critical
>>>> + * sections.  This includes hardware interrupt handlers, softirq handlers,
>>>> + * and NMI handlers.
>>>> + *
>>>> + * Note that all CPUs must agree that the grace period extended beyond
>>>> + * all pre-existing RCU read-side critical section.  On systems with more
>>>> + * than one CPU, this means that when "func()" is invoked, each CPU is
>>>> + * guaranteed to have executed a full memory barrier since the end of its
>>>> + * last RCU read-side critical section whose beginning preceded the call
>>>> + * to call_rcu().  It also means that each CPU executing an RCU read-side
>>>> + * critical section that continues beyond the start of "func()" must have
>>>> + * executed a memory barrier after the call_rcu() but before the beginning
>>>> + * of that RCU read-side critical section.  Note that these guarantees
>>>> + * include CPUs that are offline, idle, or executing in user mode, as
>>>> + * well as CPUs that are executing in the kernel.
>>>> + *
>>>> + * Furthermore, if CPU A invoked call_rcu() and CPU B invoked the
>>>> + * resulting RCU callback function "func()", then both CPU A and CPU B are
>>>> + * guaranteed to execute a full memory barrier during the time interval
>>>> + * between the call to call_rcu() and the invocation of "func()" -- even
>>>> + * if CPU A and CPU B are the same CPU (but again only if the system has
>>>> + * more than one CPU).
>>>> + *
>>>> + * Implementation of these memory-ordering guarantees is described here:
>>>> + * Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.rst.
>>>> + */
>>>> +void call_rcu(struct rcu_head *head, rcu_callback_t func)
>>>> +{
>>>> +	return __call_rcu_common(head, func, force_call_rcu_to_lazy);
>>>> +}
>>>> +EXPORT_SYMBOL_GPL(call_rcu);
>>>>  
>>>>  /* Maximum number of jiffies to wait before draining a batch. */
>>>>  #define KFREE_DRAIN_JIFFIES (5 * HZ)
>>>> @@ -3904,7 +3943,11 @@ static void rcu_barrier_entrain(struct rcu_data *rdp)
>>>>  	rdp->barrier_head.func = rcu_barrier_callback;
>>>>  	debug_rcu_head_queue(&rdp->barrier_head);
>>>>  	rcu_nocb_lock(rdp);
>>>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies));
>>>> +        /*
>>>> +         * Flush the bypass list, but also wake up the GP thread as otherwise
>>>> +         * bypass/lazy CBs may not be noticed, and can cause really long delays!
>>>> +         */
>>>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(rdp, NULL, jiffies, FLUSH_BP_WAKE));
>>>>  	if (rcu_segcblist_entrain(&rdp->cblist, &rdp->barrier_head)) {
>>>>  		atomic_inc(&rcu_state.barrier_cpu_count);
>>>>  	} else {
>>>> @@ -4325,7 +4368,7 @@ void rcutree_migrate_callbacks(int cpu)
>>>>  	my_rdp = this_cpu_ptr(&rcu_data);
>>>>  	my_rnp = my_rdp->mynode;
>>>>  	rcu_nocb_lock(my_rdp); /* irqs already disabled. */
>>>> -	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies));
>>>> +	WARN_ON_ONCE(!rcu_nocb_flush_bypass(my_rdp, NULL, jiffies, FLUSH_BP_NONE));
>>>>  	raw_spin_lock_rcu_node(my_rnp); /* irqs already disabled. */
>>>>  	/* Leverage recent GPs and set GP for new callbacks. */
>>>>  	needwake = rcu_advance_cbs(my_rnp, rdp) ||
>>>> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
>>>> index d4a97e40ea9c..361c41d642c7 100644
>>>> --- a/kernel/rcu/tree.h
>>>> +++ b/kernel/rcu/tree.h
>>>> @@ -263,14 +263,16 @@ struct rcu_data {
>>>>  	unsigned long last_fqs_resched;	/* Time of last rcu_resched(). */
>>>>  	unsigned long last_sched_clock;	/* Jiffies of last rcu_sched_clock_irq(). */
>>>>  
>>>> +	long lazy_len;			/* Length of buffered lazy callbacks. */
>>>>  	int cpu;
>>>>  };
>>>>  
>>>>  /* Values for nocb_defer_wakeup field in struct rcu_data. */
>>>>  #define RCU_NOCB_WAKE_NOT	0
>>>>  #define RCU_NOCB_WAKE_BYPASS	1
>>>> -#define RCU_NOCB_WAKE		2
>>>> -#define RCU_NOCB_WAKE_FORCE	3
>>>> +#define RCU_NOCB_WAKE_LAZY	2
>>>> +#define RCU_NOCB_WAKE		3
>>>> +#define RCU_NOCB_WAKE_FORCE	4
>>>>  
>>>>  #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
>>>>  					/* For jiffies_till_first_fqs and */
>>>> @@ -439,10 +441,17 @@ static void zero_cpu_stall_ticks(struct rcu_data *rdp);
>>>>  static struct swait_queue_head *rcu_nocb_gp_get(struct rcu_node *rnp);
>>>>  static void rcu_nocb_gp_cleanup(struct swait_queue_head *sq);
>>>>  static void rcu_init_one_nocb(struct rcu_node *rnp);
>>>> +
>>>> +#define FLUSH_BP_NONE 0
>>>> +/* Is the CB being enqueued after the flush, a lazy CB? */
>>>> +#define FLUSH_BP_LAZY BIT(0)
>>>> +/* Wake up nocb-GP thread after flush? */
>>>> +#define FLUSH_BP_WAKE BIT(1)
>>>>  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>> -				  unsigned long j);
>>>> +				  unsigned long j, unsigned long flush_flags);
>>>>  static bool rcu_nocb_try_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>> -				bool *was_alldone, unsigned long flags);
>>>> +				bool *was_alldone, unsigned long flags,
>>>> +				bool lazy);
>>>>  static void __call_rcu_nocb_wake(struct rcu_data *rdp, bool was_empty,
>>>>  				 unsigned long flags);
>>>>  static int rcu_nocb_need_deferred_wakeup(struct rcu_data *rdp, int level);
>>>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>>>> index 4dc86274b3e8..b201606f7c4f 100644
>>>> --- a/kernel/rcu/tree_nocb.h
>>>> +++ b/kernel/rcu/tree_nocb.h
>>>> @@ -256,6 +256,31 @@ static bool wake_nocb_gp(struct rcu_data *rdp, bool force)
>>>>  	return __wake_nocb_gp(rdp_gp, rdp, force, flags);
>>>>  }
>>>>  
>>>> +/*
>>>> + * LAZY_FLUSH_JIFFIES decides the maximum amount of time that
>>>> + * can elapse before lazy callbacks are flushed. Lazy callbacks
>>>> + * could be flushed much earlier for a number of other reasons
>>>> + * however, LAZY_FLUSH_JIFFIES will ensure no lazy callbacks are
>>>> + * left unsubmitted to RCU after those many jiffies.
>>>> + */
>>>> +#define LAZY_FLUSH_JIFFIES (10 * HZ)
>>>> +unsigned long jiffies_till_flush = LAZY_FLUSH_JIFFIES;
>>>> +
>>>> +#ifdef CONFIG_RCU_LAZY
>>>> +// To be called only from test code.
>>>> +void rcu_lazy_set_jiffies_till_flush(unsigned long jif)
>>>> +{
>>>> +	jiffies_till_flush = jif;
>>>> +}
>>>> +EXPORT_SYMBOL(rcu_lazy_set_jiffies_till_flush);
>>>> +
>>>> +unsigned long rcu_lazy_get_jiffies_till_flush(void)
>>>> +{
>>>> +	return jiffies_till_flush;
>>>> +}
>>>> +EXPORT_SYMBOL(rcu_lazy_get_jiffies_till_flush);
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * Arrange to wake the GP kthread for this NOCB group at some future
>>>>   * time when it is safe to do so.
>>>> @@ -269,10 +294,14 @@ static void wake_nocb_gp_defer(struct rcu_data *rdp, int waketype,
>>>>  	raw_spin_lock_irqsave(&rdp_gp->nocb_gp_lock, flags);
>>>>  
>>>>  	/*
>>>> -	 * Bypass wakeup overrides previous deferments. In case
>>>> -	 * of callback storm, no need to wake up too early.
>>>> +	 * Bypass wakeup overrides previous deferments. In case of
>>>> +	 * callback storm, no need to wake up too early.
>>>>  	 */
>>>> -	if (waketype == RCU_NOCB_WAKE_BYPASS) {
>>>> +	if (waketype == RCU_NOCB_WAKE_LAZY
>>>> +		&& READ_ONCE(rdp->nocb_defer_wakeup) == RCU_NOCB_WAKE_NOT) {
>>>> +		mod_timer(&rdp_gp->nocb_timer, jiffies + jiffies_till_flush);
>>>> +		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>>>> +	} else if (waketype == RCU_NOCB_WAKE_BYPASS) {
>>>>  		mod_timer(&rdp_gp->nocb_timer, jiffies + 2);
>>>>  		WRITE_ONCE(rdp_gp->nocb_defer_wakeup, waketype);
>>>>  	} else {
>>>>
>>> Joel, i have a question here. I see that lazy callback just makes the GP
>>> start later where a regular call_rcu() API will instead cut it back.
>>> Could you please clarify how it solves the sequence like:
>>>
>>> CPU:
>>>
>>> 1) task A -> call_rcu_lazy();
>>> 2) task B -> call_rcu();
>>>
>>> so lazily decision about running GP later a task B overrides to initiate it sooner? 
>>
>> Is your question that task B will make task A's CB run sooner? Yes that's right.
>> The reason is that the point of call_rcu_lazy() is to delay GP start as much as
>> possible to keep system quiet. However, if a call_rcu() comes in any way, we
>> have to start a GP soon. So we might as well take advantage of that for the lazy
>> one as we are paying the full price for the new GP.
>>
>> It is kind of like RCU's amortizing behavior in that sense, however we put that
>> into overdrive to be greedy for power.
>>
>> Does that answer your question?
>>
> OK, so I understood it correctly then :) IMHO, the problem with such an approach
> is that we may end up moving the RCU core forward in a non-lazy way even though
> we queue lazily and one user does not.

Yes, the best we can do seems to be to identify all call_rcu() users that can be
lazified (say, using rcutop) and apply the API change to them. It depends on the
use case, for sure; for example, if some use case is doing frequent
synchronize_rcu() for whatever reason, then that will impede the lazy users. On
the Intel ChromeOS device, we do not see such a problem.
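
(For illustration, each such conversion is a one-line change at the call site;
the struct and callback names below are placeholders, but this is the pattern
the later patches in this series follow:)

	-	call_rcu(&obj->rcu, obj_free_rcu);
	+	call_rcu_lazy(&obj->rcu, obj_free_rcu);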

> According to my observations, our setup suffers from many wake-ups across the
> CPUs, and mixing the non-lazy way with the lazy way will just be converted into
> the regular pattern.
> 
> I think adding some kind of high_water_mark on "entry" would solve it. Any thoughts here?

By high water mark, do you mean for memory? If memory, we have to consider that
we cannot assume that call_rcu() users are using RCU for memory reclaim only.
There could be use cases like per-CPU refcounts which use call_rcu() for
non-memory purposes. How would you use a memory watermark there?
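
(For instance, a generic sketch of such a non-memory user, with made-up names;
the callback only publishes a state change, so no amount of memory pressure
would ever be a reason to flush it:)

	struct my_ref {
		atomic_t	slow_mode;
		struct rcu_head	rcu;
	};

	static void my_ref_switch_done(struct rcu_head *rhp)
	{
		struct my_ref *ref = container_of(rhp, struct my_ref, rcu);

		/* Nothing is freed here; just flip a mode flag once all
		 * pre-existing readers are done. */
		atomic_set(&ref->slow_mode, 1);
	}

	/* Somewhere in the update-side path: */
	call_rcu(&ref->rcu, my_ref_switch_done);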


> Joel, do you see that v6 makes the system more idle from an RCU perspective on the Intel SoC?

I did not compare Intel with, say, ARM, so I don't know whether an ARM SoC is less
sensitive to this issue. I think it depends on the firmware as well. On Intel we
see that the firmware is not as aggressive about SoC-wide power management if
there is RCU activity... at least that's the observation.

> From my side I will port v6 onto the 5.10 kernel and run some tests today.

Great, thanks! Looking forward to it.

- Joel


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07 13:44             ` Joel Fernandes
@ 2022-09-07 15:38               ` Frederic Weisbecker
  2022-09-07 15:39                 ` Joel Fernandes
  0 siblings, 1 reply; 73+ messages in thread
From: Frederic Weisbecker @ 2022-09-07 15:38 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Wed, Sep 07, 2022 at 09:44:01AM -0400, Joel Fernandes wrote:
> Hi Frederic,
> 
> On 9/7/2022 5:40 AM, Frederic Weisbecker wrote:
> >>
> >> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> >> index bd8f39ee2cd0..e3344c262672 100644
> >> --- a/kernel/rcu/tree_nocb.h
> >> +++ b/kernel/rcu/tree_nocb.h
> >> @@ -382,15 +382,19 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >>  				  unsigned long j, unsigned long flush_flags)
> >>  {
> >>  	bool ret;
> >> +	bool was_alldone;
> >>  
> >>  	if (!rcu_rdp_is_offloaded(rdp))
> >>  		return true;
> >>  	rcu_lockdep_assert_cblist_protected(rdp);
> >>  	rcu_nocb_bypass_lock(rdp);
> >> +	if (flush_flags & FLUSH_BP_WAKE)
> >> +		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> >> +
> > You can check that outside bypass lock (but you still need nocb_lock).
> 
> Right, ok. I can make it outside the bypass lock, and the nocb_lock is implied
> by rcu_lockdep_assert_cblist_protected().
> 
> > 
> >>  	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> >>  
> >> -	if (flush_flags & FLUSH_BP_WAKE)
> >> -		wake_nocb_gp(rdp, true);
> >> +	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
> >> +		wake_nocb_gp(rdp, false);
> > That doesn't check if the bypass list was empty.
> 
> thanks, will add your suggested optimization, however I have a general question:
> 
> For the case where the bypass list is empty, where does rcu_barrier() do a wake
> up of the nocb GP thread after entrain()?
> 
> I don't see a call to __call_rcu_nocb_wake() like rcutree_migrate_callbacks()
> does. Is the wake up done in some other path?

Actually I can't find it. You're right it may be missing. It's not even just
about the bypass list but also the entrain'ed callback. Would you be willing to
cook a fix?

Thanks!

> 
> thanks,
> 
>  - Joel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07 15:38               ` Frederic Weisbecker
@ 2022-09-07 15:39                 ` Joel Fernandes
  0 siblings, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-07 15:39 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng



On 9/7/2022 11:38 AM, Frederic Weisbecker wrote:
> On Wed, Sep 07, 2022 at 09:44:01AM -0400, Joel Fernandes wrote:
>> Hi Frederic,
>>
>> On 9/7/2022 5:40 AM, Frederic Weisbecker wrote:
>>>>
>>>> diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
>>>> index bd8f39ee2cd0..e3344c262672 100644
>>>> --- a/kernel/rcu/tree_nocb.h
>>>> +++ b/kernel/rcu/tree_nocb.h
>>>> @@ -382,15 +382,19 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
>>>>  				  unsigned long j, unsigned long flush_flags)
>>>>  {
>>>>  	bool ret;
>>>> +	bool was_alldone;
>>>>  
>>>>  	if (!rcu_rdp_is_offloaded(rdp))
>>>>  		return true;
>>>>  	rcu_lockdep_assert_cblist_protected(rdp);
>>>>  	rcu_nocb_bypass_lock(rdp);
>>>> +	if (flush_flags & FLUSH_BP_WAKE)
>>>> +		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
>>>> +
>>> You can check that outside bypass lock (but you still need nocb_lock).
>>
>> Right, ok. I can make it outside the bypass lock, and the nocb_lock is implied
>> by rcu_lockdep_assert_cblist_protected().
>>
>>>
>>>>  	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
>>>>  
>>>> -	if (flush_flags & FLUSH_BP_WAKE)
>>>> -		wake_nocb_gp(rdp, true);
>>>> +	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
>>>> +		wake_nocb_gp(rdp, false);
>>> That doesn't check if the bypass list was empty.
>>
>> thanks, will add your suggested optimization, however I have a general question:
>>
>> For the case where the bypass list is empty, where does rcu_barrier() do a wake
>> up of the nocb GP thread after entrain()?
>>
>> I don't see a call to __call_rcu_nocb_wake() like rcutree_migrate_callbacks()
>> does. Is the wake up done in some other path?
> 
> Actually I can't find it. You're right it may be missing. It's not even just
> about the bypass list but also the entrain'ed callback. Would you be willing to
> cook a fix?

Sounds good, yes, I am happy to make a fix and include it in the next series
posting. Thanks for confirming the potential issue.

I will be on a plane in a few hours; I can even try to do it on the plane, but
we'll see how it goes ;)

Thanks!

 - Joel



^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH v5 06/18] rcu: Introduce call_rcu_lazy() API implementation
  2022-09-07  9:40           ` Frederic Weisbecker
  2022-09-07 13:44             ` Joel Fernandes
@ 2022-09-21 23:52             ` Joel Fernandes
  1 sibling, 0 replies; 73+ messages in thread
From: Joel Fernandes @ 2022-09-21 23:52 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: rcu, linux-kernel, rushikesh.s.kadam, urezki, neeraj.iitr10,
	paulmck, rostedt, vineeth, boqun.feng

On Wed, Sep 07, 2022 at 11:40:14AM +0200, Frederic Weisbecker wrote:
> On Wed, Sep 07, 2022 at 12:06:26AM +0000, Joel Fernandes wrote:
> > > > @@ -326,13 +372,20 @@ static bool rcu_nocb_do_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > > >   * Note that this function always returns true if rhp is NULL.
> > > >   */
> > > >  static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> > > > -				  unsigned long j)
> > > > +				  unsigned long j, unsigned long flush_flags)
> > > >  {
> > > > +	bool ret;
> > > > +
> > > >  	if (!rcu_rdp_is_offloaded(rdp))
> > > >  		return true;
> > > >  	rcu_lockdep_assert_cblist_protected(rdp);
> > > >  	rcu_nocb_bypass_lock(rdp);
> > > > -	return rcu_nocb_do_flush_bypass(rdp, rhp, j);
> > > > +	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> > > > +
> > > > +	if (flush_flags & FLUSH_BP_WAKE)
> > > > +		wake_nocb_gp(rdp, true);
> > > 
> > > Why the true above?
> > > 
> > > Also should we check if the wake up is really necessary (otherwise it means we
> > > force a wake up for all rdp's from rcu_barrier())?
> > > 
> > >        was_alldone = rcu_segcblist_pend_cbs(&rdp->cblist);
> > >        ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> > >        if (was_alldone && rcu_segcblist_pend_cbs(&rdp->cblist))
> > >        	  wake_nocb_gp(rdp, false);
> > 
> > You mean something like the following right? Though I'm thinking if its
> > better to call wake_nocb_gp() from tree.c in entrain() and let that handle
> > the wake. That way, we can get rid of the extra FLUSH_BP flags as well and
> > let the flush callers deal with the wakeups..
> 
> Ah yes that could make sense if only one caller cares.
> 
> > 
> > Anyway, for testing this should be good...
> > 
> > ---8<-----------------------
> > 
> > diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
> > index bd8f39ee2cd0..e3344c262672 100644
> > --- a/kernel/rcu/tree_nocb.h
> > +++ b/kernel/rcu/tree_nocb.h
> > @@ -382,15 +382,19 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
> >  				  unsigned long j, unsigned long flush_flags)
> >  {
> >  	bool ret;
> > +	bool was_alldone;
> >  
> >  	if (!rcu_rdp_is_offloaded(rdp))
> >  		return true;
> >  	rcu_lockdep_assert_cblist_protected(rdp);
> >  	rcu_nocb_bypass_lock(rdp);
> > +	if (flush_flags & FLUSH_BP_WAKE)
> > +		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
> > +
> 
> You can check that outside bypass lock (but you still need nocb_lock).
> 
> >  	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
> >  
> > -	if (flush_flags & FLUSH_BP_WAKE)
> > -		wake_nocb_gp(rdp, true);
> > +	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
> > +		wake_nocb_gp(rdp, false);
> 
> That doesn't check if the bypass list was empty.

I am ending up with something like the below for v6. After discussing with
Paul on IRC, he pointed out that we only need to do the rcu_barrier()-related
wakeup when all the CBs in the bypass list are lazy; otherwise the bypass
timer goes off soon enough anyway. I think Frederic mentioned something
similar above in different words.

I prefer to keep this logic in tree_nocb.h since rcu_barrier_entrain()
shouldn't have to deal with nocb internals (in theory anyway).

Looks Ok?

thanks,

 - Joel

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH for v6] fixup! rcu: Introduce call_rcu_lazy() API implementation

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree_nocb.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/rcu/tree_nocb.h b/kernel/rcu/tree_nocb.h
index c197534d0c99..fd056358f041 100644
--- a/kernel/rcu/tree_nocb.h
+++ b/kernel/rcu/tree_nocb.h
@@ -375,18 +375,26 @@ static bool rcu_nocb_flush_bypass(struct rcu_data *rdp, struct rcu_head *rhp,
 				  unsigned long j, unsigned long flush_flags)
 {
 	bool ret;
-	bool was_alldone;
+	bool was_alldone = false;
+	bool bypass_all_lazy = false;
 
 	if (!rcu_rdp_is_offloaded(rdp))
 		return true;
 	rcu_lockdep_assert_cblist_protected(rdp);
 	rcu_nocb_bypass_lock(rdp);
-	if (flush_flags & FLUSH_BP_WAKE)
+
+	if (flush_flags & FLUSH_BP_WAKE) {
 		was_alldone = !rcu_segcblist_pend_cbs(&rdp->cblist);
+		bypass_all_lazy =
+		  (rcu_cblist_n_cbs(&rdp->nocb_bypass) == rdp->lazy_len);
+	}
 
 	ret = rcu_nocb_do_flush_bypass(rdp, rhp, j, flush_flags);
 
-	if (flush_flags & FLUSH_BP_WAKE && was_alldone)
+	// Wake up the nocb GP thread if needed. GP thread could be sleeping
+	// while waiting for lazy timer to expire (otherwise rcu_barrier may
+	// end up waiting for the duration of the lazy timer).
+	if (flush_flags & FLUSH_BP_WAKE && was_alldone && bypass_all_lazy)
 		wake_nocb_gp(rdp, false);
 
 	return ret;
-- 
2.37.3.998.g577e59143f-goog


^ permalink raw reply related	[flat|nested] 73+ messages in thread
