linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/4] rcu: Fix PF_IDLE related issues, part. 1
@ 2023-10-19 23:35 Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 1/4] softirq: Rename __raise_softirq_irqoff() to raise_softirq_no_wake() Frederic Weisbecker
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Frederic Weisbecker @ 2023-10-19 23:35 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Liam R . Howlett,
	Peter Zijlstra, Sebastian Siewior, Thomas Gleixner, Zqiang,
	Lai Jiangshan

The modification of PF_IDLE semantics lately to fix a bug in rcutiny
eventually introduced new bugs in RCU-tasks. This series revert that
PF_IDLE modification and fix rcutiny in an alternate way.

More issues need to be fixed in RCU-tasks:

* The boot code preceding the idle entry is included in this
  quiescent state. Especially after the completion of kthreadd_done
  after which init/1 can launch userspace concurrently. The window
  is tiny before PF_IDLE is set but it exists.

* Similarly, the boot code preceding the idle entry on secondary
  CPUs is wrongly accounted as RCU tasks quiescent state.

And those will be fixed separetely.

Frederic Weisbecker (4):
  softirq: Rename __raise_softirq_irqoff() to raise_softirq_no_wake()
  softirq: Introduce raise_ksoftirqd_irqoff()
  rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle
  Revert "kernel/sched: Modify initial boot task idle setup"

 block/blk-mq.c            |  2 +-
 include/linux/interrupt.h |  3 ++-
 kernel/rcu/tiny.c         | 21 +++++++++++++-----
 kernel/sched/core.c       |  2 +-
 kernel/sched/idle.c       |  1 -
 kernel/softirq.c          | 45 +++++++++++++++++++++++++--------------
 lib/irq_poll.c            |  4 ++--
 net/core/dev.c            |  8 +++----
 8 files changed, 55 insertions(+), 31 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/4] softirq: Rename __raise_softirq_irqoff() to raise_softirq_no_wake()
  2023-10-19 23:35 [PATCH 0/4] rcu: Fix PF_IDLE related issues, part. 1 Frederic Weisbecker
@ 2023-10-19 23:35 ` Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 2/4] softirq: Introduce raise_ksoftirqd_irqoff() Frederic Weisbecker
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 10+ messages in thread
From: Frederic Weisbecker @ 2023-10-19 23:35 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Peter Zijlstra, Sebastian Siewior,
	Thomas Gleixner

This makes the purpose of this function clearer.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 block/blk-mq.c            | 2 +-
 include/linux/interrupt.h | 2 +-
 kernel/softirq.c          | 6 +++---
 lib/irq_poll.c            | 4 ++--
 net/core/dev.c            | 8 ++++----
 5 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1fafd54dce3c..1bda40a2aa29 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1144,7 +1144,7 @@ static int blk_softirq_cpu_dead(unsigned int cpu)
 
 static void __blk_mq_complete_request_remote(void *data)
 {
-	__raise_softirq_irqoff(BLOCK_SOFTIRQ);
+	raise_softirq_no_wake(BLOCK_SOFTIRQ);
 }
 
 static inline bool blk_mq_complete_need_ipi(struct request *rq)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 76121c2bb4f8..558a1a329da9 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -604,7 +604,7 @@ static inline void do_softirq_post_smp_call_flush(unsigned int unused)
 
 extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
-extern void __raise_softirq_irqoff(unsigned int nr);
+extern void raise_softirq_no_wake(unsigned int nr);
 
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 210cf5f8d92c..acfed6f3701d 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -664,7 +664,7 @@ void irq_exit(void)
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
-	__raise_softirq_irqoff(nr);
+	raise_softirq_no_wake(nr);
 
 	/*
 	 * If we're in an interrupt or softirq, we're done
@@ -688,7 +688,7 @@ void raise_softirq(unsigned int nr)
 	local_irq_restore(flags);
 }
 
-void __raise_softirq_irqoff(unsigned int nr)
+void raise_softirq_no_wake(unsigned int nr)
 {
 	lockdep_assert_irqs_disabled();
 	trace_softirq_raise(nr);
@@ -795,7 +795,7 @@ static void tasklet_action_common(struct softirq_action *a,
 		t->next = NULL;
 		*tl_head->tail = t;
 		tl_head->tail = &t->next;
-		__raise_softirq_irqoff(softirq_nr);
+		raise_softirq_no_wake(softirq_nr);
 		local_irq_enable();
 	}
 }
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index 2d5329a42105..193cd847fd8f 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -130,7 +130,7 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
 	}
 
 	if (rearm)
-		__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
+		raise_softirq_no_wake(IRQ_POLL_SOFTIRQ);
 
 	local_irq_enable();
 }
@@ -197,7 +197,7 @@ static int irq_poll_cpu_dead(unsigned int cpu)
 	local_irq_disable();
 	list_splice_init(&per_cpu(blk_cpu_iopoll, cpu),
 			 this_cpu_ptr(&blk_cpu_iopoll));
-	__raise_softirq_irqoff(IRQ_POLL_SOFTIRQ);
+	raise_softirq_no_wake(IRQ_POLL_SOFTIRQ);
 	local_irq_enable();
 	local_bh_enable();
 
diff --git a/net/core/dev.c b/net/core/dev.c
index 85df22f05c38..6f4622cc8939 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4459,7 +4459,7 @@ static inline void ____napi_schedule(struct softnet_data *sd,
 	 * we have to raise NET_RX_SOFTIRQ.
 	 */
 	if (!sd->in_net_rx_action)
-		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+		raise_softirq_no_wake(NET_RX_SOFTIRQ);
 }
 
 #ifdef CONFIG_RPS
@@ -4678,7 +4678,7 @@ static void trigger_rx_softirq(void *data)
 {
 	struct softnet_data *sd = data;
 
-	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+	raise_softirq_no_wake(NET_RX_SOFTIRQ);
 	smp_store_release(&sd->defer_ipi_scheduled, 0);
 }
 
@@ -4705,7 +4705,7 @@ static void napi_schedule_rps(struct softnet_data *sd)
 		 * we have to raise NET_RX_SOFTIRQ.
 		 */
 		if (!mysd->in_net_rx_action && !mysd->in_napi_threaded_poll)
-			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+			raise_softirq_no_wake(NET_RX_SOFTIRQ);
 		return;
 	}
 #endif /* CONFIG_RPS */
@@ -6743,7 +6743,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
 	list_splice_tail(&repoll, &list);
 	list_splice(&list, &sd->poll_list);
 	if (!list_empty(&sd->poll_list))
-		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+		raise_softirq_no_wake(NET_RX_SOFTIRQ);
 	else
 		sd->in_net_rx_action = false;
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/4] softirq: Introduce raise_ksoftirqd_irqoff()
  2023-10-19 23:35 [PATCH 0/4] rcu: Fix PF_IDLE related issues, part. 1 Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 1/4] softirq: Rename __raise_softirq_irqoff() to raise_softirq_no_wake() Frederic Weisbecker
@ 2023-10-19 23:35 ` Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 3/4] rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup" Frederic Weisbecker
  3 siblings, 0 replies; 10+ messages in thread
From: Frederic Weisbecker @ 2023-10-19 23:35 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Peter Zijlstra, Sebastian Siewior,
	Thomas Gleixner

Provide a function to raise a softirq vector and force the wakeup of
ksoftirqd along the way, irrespective of the current interrupt context.

This is going to be used by rcutiny to fix and optimize the triggering
of quiescent states from idle.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 include/linux/interrupt.h |  1 +
 kernel/softirq.c          | 71 +++++++++++++++++++++++----------------
 2 files changed, 43 insertions(+), 29 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 558a1a329da9..301d2956e746 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -608,6 +608,7 @@ extern void raise_softirq_no_wake(unsigned int nr);
 
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
+extern void raise_ksoftirqd_irqoff(unsigned int nr);
 
 DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index acfed6f3701d..9c29a8ced1c3 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -659,35 +659,6 @@ void irq_exit(void)
 	lockdep_hardirq_exit();
 }
 
-/*
- * This function must run with irqs disabled!
- */
-inline void raise_softirq_irqoff(unsigned int nr)
-{
-	raise_softirq_no_wake(nr);
-
-	/*
-	 * If we're in an interrupt or softirq, we're done
-	 * (this also catches softirq-disabled code). We will
-	 * actually run the softirq once we return from
-	 * the irq or softirq.
-	 *
-	 * Otherwise we wake up ksoftirqd to make sure we
-	 * schedule the softirq soon.
-	 */
-	if (!in_interrupt() && should_wake_ksoftirqd())
-		wakeup_softirqd();
-}
-
-void raise_softirq(unsigned int nr)
-{
-	unsigned long flags;
-
-	local_irq_save(flags);
-	raise_softirq_irqoff(nr);
-	local_irq_restore(flags);
-}
-
 void raise_softirq_no_wake(unsigned int nr)
 {
 	lockdep_assert_irqs_disabled();
@@ -695,6 +666,48 @@ void raise_softirq_no_wake(unsigned int nr)
 	or_softirq_pending(1UL << nr);
 }
 
+/*
+ * This function must run with irqs disabled!
+ */
+static inline void __raise_softirq_irqoff(unsigned int nr, bool threaded)
+{
+	raise_softirq_no_wake(nr);
+
+	if (threaded && should_wake_ksoftirqd())
+		wakeup_softirqd();
+}
+
+/*
+ * This function must run with irqs disabled!
+ */
+inline void raise_softirq_irqoff(unsigned int nr)
+{
+	bool threaded;
+	/*
+	 * If in an interrupt or softirq (servicing or disabled
+	 * section), the vector will be handled at the end of
+	 * the interrupt or softirq servicing/disabled section.
+	 * Otherwise the vector must rely on ksoftirqd.
+	 */
+	threaded = !in_interrupt();
+
+	__raise_softirq_irqoff(nr, threaded);
+}
+
+void raise_softirq(unsigned int nr)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	raise_softirq_irqoff(nr);
+	local_irq_restore(flags);
+}
+
+void raise_ksoftirqd_irqoff(unsigned int nr)
+{
+	__raise_softirq_irqoff(nr, true);
+}
+
 void open_softirq(int nr, void (*action)(struct softirq_action *))
 {
 	softirq_vec[nr].action = action;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 3/4] rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle
  2023-10-19 23:35 [PATCH 0/4] rcu: Fix PF_IDLE related issues, part. 1 Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 1/4] softirq: Rename __raise_softirq_irqoff() to raise_softirq_no_wake() Frederic Weisbecker
  2023-10-19 23:35 ` [PATCH 2/4] softirq: Introduce raise_ksoftirqd_irqoff() Frederic Weisbecker
@ 2023-10-19 23:35 ` Frederic Weisbecker
  2023-10-20  0:49   ` Paul E. McKenney
  2023-10-19 23:35 ` [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup" Frederic Weisbecker
  3 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2023-10-19 23:35 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Peter Zijlstra, Sebastian Siewior,
	Thomas Gleixner

The commit:

	cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")

fixed an issue where rcutiny would request a quiescent state with
setting TIF_NEED_RESCHED in early boot when init/0 has the PF_IDLE flag
set but interrupts aren't enabled yet. A subsequent call to
cond_resched() would then enable IRQs too early.

When callbacks are enqueued in idle, RCU currently performs the
following:

1) Call resched_cpu() to trigger exit from idle and go through the
   scheduler to call rcu_note_context_switch() -> rcu_qs()

2) rcu_qs() notes the quiescent state and raises RCU_SOFTIRQ if there
   is a callback, waking up ksoftirqd since it isn't called from an
   interrupt.

However the call to resched_cpu() can opportunistically be replaced and
optimized with raising RCU_SOFTIRQ and forcing ksoftirqd wakeup instead.

It's worth noting that RCU grace period polling while idle is then
suboptimized but such a usecase can be considered very rare or even
non-existent.

The advantage of this optimization is that it also works if PF_IDLE is
set early because ksoftirqd is created way after IRQs are enabled on
boot and it can't be awaken before its creation. If
raise_ksoftirqd_irqoff() is called after the first scheduling point
but before kostfirqd is created, nearby voluntary schedule calls are
expected to provide the desired quiescent state and in the worst case
the first launch of ksoftirqd is close enough on the first initcalls.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/rcu/tiny.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index fec804b79080..9460e4e9d84c 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -190,12 +190,15 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
 	local_irq_save(flags);
 	*rcu_ctrlblk.curtail = head;
 	rcu_ctrlblk.curtail = &head->next;
-	local_irq_restore(flags);
 
 	if (unlikely(is_idle_task(current))) {
-		/* force scheduling for rcu_qs() */
-		resched_cpu(0);
+		/*
+		 * Force resched to trigger a QS and handle callbacks right after.
+		 * This also takes care of avoiding too early rescheduling on boot.
+		 */
+		raise_ksoftirqd_irqoff(RCU_SOFTIRQ);
 	}
+	local_irq_restore(flags);
 }
 EXPORT_SYMBOL_GPL(call_rcu);
 
@@ -228,8 +231,16 @@ unsigned long start_poll_synchronize_rcu(void)
 	unsigned long gp_seq = get_state_synchronize_rcu();
 
 	if (unlikely(is_idle_task(current))) {
-		/* force scheduling for rcu_qs() */
-		resched_cpu(0);
+		unsigned long flags;
+
+		/*
+		 * Force resched to trigger a QS. This also takes care of avoiding
+		 * too early rescheduling on boot. It's suboptimized but GP
+		 * polling on idle isn't expected much as a usecase.
+		 */
+		local_irq_save(flags);
+		raise_ksoftirqd_irqoff(RCU_SOFTIRQ);
+		local_irq_restore(flags);
 	}
 	return gp_seq;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup"
  2023-10-19 23:35 [PATCH 0/4] rcu: Fix PF_IDLE related issues, part. 1 Frederic Weisbecker
                   ` (2 preceding siblings ...)
  2023-10-19 23:35 ` [PATCH 3/4] rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle Frederic Weisbecker
@ 2023-10-19 23:35 ` Frederic Weisbecker
  2023-10-20  8:25   ` Peter Zijlstra
  3 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2023-10-19 23:35 UTC (permalink / raw)
  To: LKML
  Cc: Frederic Weisbecker, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Peter Zijlstra, Sebastian Siewior,
	Thomas Gleixner

Now that rcutiny can deal with early boot PF_IDLE setting, revert
commit cff9b2332ab762b7e0586c793c431a8f2ea4db04.

This fixes several subtle issues introduced on RCU-tasks(-trace):

1) RCU-tasks stalls when:

   1.1 Grace period is started before init/0 had a chance to set PF_IDLE,
       keeping it stuck in the holdout list until idle ever schedules.

   1.2 Grace period is started when some possible CPUs have never been
       online, keeping their idle tasks stuck in the holdout list until
       the CPU ever boots up.

   1.3 Similar to 1.1 but with secondary CPUs: Grace period is started
       concurrently with secondary CPU booting, putting its idle task in
       the holdout list because PF_IDLE isn't yet observed on it. It
       stays then stuck in the holdout list until that CPU ever
       schedules. The effect is mitigated here by all the smpboot
       kthreads and the hotplug AP thread that must run to bring the
       CPU up.

2) Spurious warning on RCU task trace that assumes offline CPU's idle
   task is always PF_IDLE.

More issues have been found in RCU-tasks related to PF_IDLE which should
be fixed with later changes as those are not regressions:

3) The RCU-Tasks semantics consider the idle loop as a quiescent state,
   however:

   3.1 The boot code preceding the idle entry is included in this
       quiescent state. Especially after the completion of kthreadd_done
       after which init/1 can launch userspace concurrently. The window
       is tiny before PF_IDLE is set but it exists.

   3.2 Similarly, the boot code preceding the idle entry on secondary
       CPUs is wrongly accounted as RCU tasks quiescent state.

Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
---
 kernel/sched/core.c | 2 +-
 kernel/sched/idle.c | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ad960f97e4e1..b02dcbe98024 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9269,7 +9269,7 @@ void __init init_idle(struct task_struct *idle, int cpu)
 	 * PF_KTHREAD should already be set at this point; regardless, make it
 	 * look like a proper per-CPU kthread.
 	 */
-	idle->flags |= PF_KTHREAD | PF_NO_SETAFFINITY;
+	idle->flags |= PF_IDLE | PF_KTHREAD | PF_NO_SETAFFINITY;
 	kthread_set_per_cpu(idle, cpu);
 
 #ifdef CONFIG_SMP
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5007b25c5bc6..342f58a329f5 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,6 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
 
 void cpu_startup_entry(enum cpuhp_state state)
 {
-	current->flags |= PF_IDLE;
 	arch_cpu_idle_prepare();
 	cpuhp_online_idle(state);
 	while (1)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 3/4] rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle
  2023-10-19 23:35 ` [PATCH 3/4] rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle Frederic Weisbecker
@ 2023-10-20  0:49   ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2023-10-20  0:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Steven Rostedt,
	Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan, Liam R . Howlett,
	Peter Zijlstra, Sebastian Siewior, Thomas Gleixner

On Fri, Oct 20, 2023 at 01:35:42AM +0200, Frederic Weisbecker wrote:
> The commit:
> 
> 	cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
> 
> fixed an issue where rcutiny would request a quiescent state with
> setting TIF_NEED_RESCHED in early boot when init/0 has the PF_IDLE flag
> set but interrupts aren't enabled yet. A subsequent call to
> cond_resched() would then enable IRQs too early.
> 
> When callbacks are enqueued in idle, RCU currently performs the
> following:
> 
> 1) Call resched_cpu() to trigger exit from idle and go through the
>    scheduler to call rcu_note_context_switch() -> rcu_qs()
> 
> 2) rcu_qs() notes the quiescent state and raises RCU_SOFTIRQ if there
>    is a callback, waking up ksoftirqd since it isn't called from an
>    interrupt.
> 
> However the call to resched_cpu() can opportunistically be replaced and
> optimized with raising RCU_SOFTIRQ and forcing ksoftirqd wakeup instead.
> 
> It's worth noting that RCU grace period polling while idle is then
> suboptimized but such a usecase can be considered very rare or even
> non-existent.
> 
> The advantage of this optimization is that it also works if PF_IDLE is
> set early because ksoftirqd is created way after IRQs are enabled on
> boot and it can't be awaken before its creation. If
> raise_ksoftirqd_irqoff() is called after the first scheduling point
> but before kostfirqd is created, nearby voluntary schedule calls are
> expected to provide the desired quiescent state and in the worst case
> the first launch of ksoftirqd is close enough on the first initcalls.
> 
> Fixes: cff9b2332ab7 ("kernel/sched: Modify initial boot task idle setup")
> Cc: Liam R. Howlett <Liam.Howlett@oracle.com>
> Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Sebastian Siewior <bigeasy@linutronix.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Frederic Weisbecker <frederic@kernel.org>

Reviewed-by: Paul E. McKenney <paulmck@kernel.org>

I did a touch-test of the series, and have started overnight testing.

							Thanx, Paul

> ---
>  kernel/rcu/tiny.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
> index fec804b79080..9460e4e9d84c 100644
> --- a/kernel/rcu/tiny.c
> +++ b/kernel/rcu/tiny.c
> @@ -190,12 +190,15 @@ void call_rcu(struct rcu_head *head, rcu_callback_t func)
>  	local_irq_save(flags);
>  	*rcu_ctrlblk.curtail = head;
>  	rcu_ctrlblk.curtail = &head->next;
> -	local_irq_restore(flags);
>  
>  	if (unlikely(is_idle_task(current))) {
> -		/* force scheduling for rcu_qs() */
> -		resched_cpu(0);
> +		/*
> +		 * Force resched to trigger a QS and handle callbacks right after.
> +		 * This also takes care of avoiding too early rescheduling on boot.
> +		 */
> +		raise_ksoftirqd_irqoff(RCU_SOFTIRQ);
>  	}
> +	local_irq_restore(flags);
>  }
>  EXPORT_SYMBOL_GPL(call_rcu);
>  
> @@ -228,8 +231,16 @@ unsigned long start_poll_synchronize_rcu(void)
>  	unsigned long gp_seq = get_state_synchronize_rcu();
>  
>  	if (unlikely(is_idle_task(current))) {
> -		/* force scheduling for rcu_qs() */
> -		resched_cpu(0);
> +		unsigned long flags;
> +
> +		/*
> +		 * Force resched to trigger a QS. This also takes care of avoiding
> +		 * too early rescheduling on boot. It's suboptimized but GP
> +		 * polling on idle isn't expected much as a usecase.
> +		 */
> +		local_irq_save(flags);
> +		raise_ksoftirqd_irqoff(RCU_SOFTIRQ);
> +		local_irq_restore(flags);
>  	}
>  	return gp_seq;
>  }
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup"
  2023-10-19 23:35 ` [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup" Frederic Weisbecker
@ 2023-10-20  8:25   ` Peter Zijlstra
  2023-10-20 12:31     ` Frederic Weisbecker
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2023-10-20  8:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Sebastian Siewior, Thomas Gleixner

On Fri, Oct 20, 2023 at 01:35:43AM +0200, Frederic Weisbecker wrote:
> Now that rcutiny can deal with early boot PF_IDLE setting, revert
> commit cff9b2332ab762b7e0586c793c431a8f2ea4db04.
> 
> This fixes several subtle issues introduced on RCU-tasks(-trace):
> 
> 1) RCU-tasks stalls when:
> 
>    1.1 Grace period is started before init/0 had a chance to set PF_IDLE,
>        keeping it stuck in the holdout list until idle ever schedules.
> 
>    1.2 Grace period is started when some possible CPUs have never been
>        online, keeping their idle tasks stuck in the holdout list until
>        the CPU ever boots up.
> 
>    1.3 Similar to 1.1 but with secondary CPUs: Grace period is started
>        concurrently with secondary CPU booting, putting its idle task in
>        the holdout list because PF_IDLE isn't yet observed on it. It
>        stays then stuck in the holdout list until that CPU ever
>        schedules. The effect is mitigated here by all the smpboot
>        kthreads and the hotplug AP thread that must run to bring the
>        CPU up.
> 
> 2) Spurious warning on RCU task trace that assumes offline CPU's idle
>    task is always PF_IDLE.
> 
> More issues have been found in RCU-tasks related to PF_IDLE which should
> be fixed with later changes as those are not regressions:
> 
> 3) The RCU-Tasks semantics consider the idle loop as a quiescent state,
>    however:
> 
>    3.1 The boot code preceding the idle entry is included in this
>        quiescent state. Especially after the completion of kthreadd_done
>        after which init/1 can launch userspace concurrently. The window
>        is tiny before PF_IDLE is set but it exists.
> 
>    3.2 Similarly, the boot code preceding the idle entry on secondary
>        CPUs is wrongly accounted as RCU tasks quiescent state.
> 

Urgh... so the plan is to fix RCU-tasks for all of the above to not rely
on PF_IDLE ? Because I rather like the more strict PF_IDLE and
subsequently don't much like this revert.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup"
  2023-10-20  8:25   ` Peter Zijlstra
@ 2023-10-20 12:31     ` Frederic Weisbecker
  2023-10-20 13:48       ` Peter Zijlstra
  0 siblings, 1 reply; 10+ messages in thread
From: Frederic Weisbecker @ 2023-10-20 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Sebastian Siewior, Thomas Gleixner

On Fri, Oct 20, 2023 at 10:25:31AM +0200, Peter Zijlstra wrote:
> On Fri, Oct 20, 2023 at 01:35:43AM +0200, Frederic Weisbecker wrote:
> > Now that rcutiny can deal with early boot PF_IDLE setting, revert
> > commit cff9b2332ab762b7e0586c793c431a8f2ea4db04.
> > 
> > This fixes several subtle issues introduced on RCU-tasks(-trace):
> > 
> > 1) RCU-tasks stalls when:
> > 
> >    1.1 Grace period is started before init/0 had a chance to set PF_IDLE,
> >        keeping it stuck in the holdout list until idle ever schedules.
> > 
> >    1.2 Grace period is started when some possible CPUs have never been
> >        online, keeping their idle tasks stuck in the holdout list until
> >        the CPU ever boots up.
> > 
> >    1.3 Similar to 1.1 but with secondary CPUs: Grace period is started
> >        concurrently with secondary CPU booting, putting its idle task in
> >        the holdout list because PF_IDLE isn't yet observed on it. It
> >        stays then stuck in the holdout list until that CPU ever
> >        schedules. The effect is mitigated here by all the smpboot
> >        kthreads and the hotplug AP thread that must run to bring the
> >        CPU up.
> > 
> > 2) Spurious warning on RCU task trace that assumes offline CPU's idle
> >    task is always PF_IDLE.
> > 
> > More issues have been found in RCU-tasks related to PF_IDLE which should
> > be fixed with later changes as those are not regressions:
> > 
> > 3) The RCU-Tasks semantics consider the idle loop as a quiescent state,
> >    however:
> > 
> >    3.1 The boot code preceding the idle entry is included in this
> >        quiescent state. Especially after the completion of kthreadd_done
> >        after which init/1 can launch userspace concurrently. The window
> >        is tiny before PF_IDLE is set but it exists.
> > 
> >    3.2 Similarly, the boot code preceding the idle entry on secondary
> >        CPUs is wrongly accounted as RCU tasks quiescent state.
> > 
> 
> Urgh... so the plan is to fix RCU-tasks for all of the above to not rely
> on PF_IDLE ? Because I rather like the more strict PF_IDLE and
> subsequently don't much like this revert.

Yeah I can't say I really like the old coverage of PF_IDLE either. The new one
(after Liam's patch) is only halfway better defined though: it makes the boot
CPU's idle behave quite well: PF_IDLE is set on idle entry. And secondary
CPU's idle behave quite well also except when they go offline and then online
again. And then the secondary boot code becomes PF_IDLE.

We probably need something like this:

diff --git a/kernel/cpu.c b/kernel/cpu.c
index 3b9d5c7eb4a2..b24d7937b989 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
 {
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
 
+	current->flags &= ~PF_IDLE;
 	BUG_ON(st->state != CPUHP_AP_OFFLINE);
+
 	rcutree_report_cpu_dead();
 	st->state = CPUHP_AP_IDLE_DEAD;
 	/*
@@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
 {
 	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
 
+	current->flags |= PF_IDLE;
+
 	/* Happens for the boot cpu */
 	if (state != CPUHP_AP_ONLINE_IDLE)
 		return;
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 5007b25c5bc6..342f58a329f5 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -373,7 +373,6 @@ EXPORT_SYMBOL_GPL(play_idle_precise);
 
 void cpu_startup_entry(enum cpuhp_state state)
 {
-	current->flags |= PF_IDLE;
 	arch_cpu_idle_prepare();
 	cpuhp_online_idle(state);
 	while (1)


And then RCU-tasks can better rely on it as it really draws correctly
the idle coverage. As for working on top of that, we have thought about
solutions, lemme try to make them proper.

Thanks.

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup"
  2023-10-20 12:31     ` Frederic Weisbecker
@ 2023-10-20 13:48       ` Peter Zijlstra
  2023-10-20 15:05         ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2023-10-20 13:48 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: LKML, Boqun Feng, Joel Fernandes, Josh Triplett,
	Mathieu Desnoyers, Neeraj Upadhyay, Paul E . McKenney,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Sebastian Siewior, Thomas Gleixner

On Fri, Oct 20, 2023 at 02:31:13PM +0200, Frederic Weisbecker wrote:

> Yeah I can't say I really like the old coverage of PF_IDLE either. The new one
> (after Liam's patch) is only halfway better defined though: it makes the boot
> CPU's idle behave quite well: PF_IDLE is set on idle entry. And secondary
> CPU's idle behave quite well also except when they go offline and then online
> again. And then the secondary boot code becomes PF_IDLE.

Bah offline, yeah, we should just not do that :-)

> We probably need something like this:
> 
> diff --git a/kernel/cpu.c b/kernel/cpu.c
> index 3b9d5c7eb4a2..b24d7937b989 100644
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
>  {
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>  
> +	current->flags &= ~PF_IDLE;
>  	BUG_ON(st->state != CPUHP_AP_OFFLINE);
> +
>  	rcutree_report_cpu_dead();
>  	st->state = CPUHP_AP_IDLE_DEAD;
>  	/*
> @@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
>  {
>  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
>  
> +	current->flags |= PF_IDLE;
> +
>  	/* Happens for the boot cpu */
>  	if (state != CPUHP_AP_ONLINE_IDLE)
>  		return;

Yeah that works I suppose.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup"
  2023-10-20 13:48       ` Peter Zijlstra
@ 2023-10-20 15:05         ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2023-10-20 15:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, LKML, Boqun Feng, Joel Fernandes,
	Josh Triplett, Mathieu Desnoyers, Neeraj Upadhyay,
	Steven Rostedt, Uladzislau Rezki, rcu, Zqiang, Lai Jiangshan,
	Liam R . Howlett, Sebastian Siewior, Thomas Gleixner

On Fri, Oct 20, 2023 at 03:48:36PM +0200, Peter Zijlstra wrote:
> On Fri, Oct 20, 2023 at 02:31:13PM +0200, Frederic Weisbecker wrote:
> 
> > Yeah I can't say I really like the old coverage of PF_IDLE either. The new one
> > (after Liam's patch) is only halfway better defined though: it makes the boot
> > CPU's idle behave quite well: PF_IDLE is set on idle entry. And secondary
> > CPU's idle behave quite well also except when they go offline and then online
> > again. And then the secondary boot code becomes PF_IDLE.
> 
> Bah offline, yeah, we should just not do that :-)
> 
> > We probably need something like this:
> > 
> > diff --git a/kernel/cpu.c b/kernel/cpu.c
> > index 3b9d5c7eb4a2..b24d7937b989 100644
> > --- a/kernel/cpu.c
> > +++ b/kernel/cpu.c
> > @@ -1394,7 +1394,9 @@ void cpuhp_report_idle_dead(void)
> >  {
> >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >  
> > +	current->flags &= ~PF_IDLE;
> >  	BUG_ON(st->state != CPUHP_AP_OFFLINE);
> > +
> >  	rcutree_report_cpu_dead();
> >  	st->state = CPUHP_AP_IDLE_DEAD;
> >  	/*
> > @@ -1642,6 +1644,8 @@ void cpuhp_online_idle(enum cpuhp_state state)
> >  {
> >  	struct cpuhp_cpu_state *st = this_cpu_ptr(&cpuhp_state);
> >  
> > +	current->flags |= PF_IDLE;
> > +
> >  	/* Happens for the boot cpu */
> >  	if (state != CPUHP_AP_ONLINE_IDLE)
> >  		return;
> 
> Yeah that works I suppose.

Booting up kernels being what it is, there might not be a completely
pretty solution.  Not that I would say "no" to such a solution should
it appear, mind you!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-10-20 15:05 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-19 23:35 [PATCH 0/4] rcu: Fix PF_IDLE related issues, part. 1 Frederic Weisbecker
2023-10-19 23:35 ` [PATCH 1/4] softirq: Rename __raise_softirq_irqoff() to raise_softirq_no_wake() Frederic Weisbecker
2023-10-19 23:35 ` [PATCH 2/4] softirq: Introduce raise_ksoftirqd_irqoff() Frederic Weisbecker
2023-10-19 23:35 ` [PATCH 3/4] rcu: Make tiny RCU use ksoftirqd to trigger a QS from idle Frederic Weisbecker
2023-10-20  0:49   ` Paul E. McKenney
2023-10-19 23:35 ` [PATCH 4/4] Revert "kernel/sched: Modify initial boot task idle setup" Frederic Weisbecker
2023-10-20  8:25   ` Peter Zijlstra
2023-10-20 12:31     ` Frederic Weisbecker
2023-10-20 13:48       ` Peter Zijlstra
2023-10-20 15:05         ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).