* [PATCH] rcu: Delay the RCU-selftests during boot.
@ 2022-03-01 17:43 Sebastian Andrzej Siewior
  2022-03-01 17:48 ` Sebastian Andrzej Siewior
  2022-03-02  5:42 ` Paul E. McKenney
  0 siblings, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-01 17:43 UTC (permalink / raw)
  To: rcu
  Cc: Paul E. McKenney, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

The RCU-selftest expects the timer_list timer to be working. This isn't
the case on PREEMPT_RT at this point because it requires that ksoftirqd
is up and running.
There is no known requirement that synchronize_rcu() works at this
point.

Delay RCU-selftests until ksoftirqd is up and running.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---

Paul, I don't mean to be rude or anything by saying that "There is no
requirement…". If my memory serves me well here (and there is no
guarantee for that) then synchronize_rcu() was required that early by
kprobes or jump labels and this requirement was lifted by the time we
discussed this. I have no idea how this discussion back then ended but
as of v5.17-rc6 that timer is still required to function in order for
the system to make progress at boot time.

 include/linux/rcupdate.h |  6 ++++++
 init/main.c              |  1 +
 kernel/rcu/tasks.h       | 10 +++-------
 3 files changed, 10 insertions(+), 7 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 88b42eb464068..2fb1a48ca9272 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -95,6 +95,12 @@ void rcu_init_tasks_generic(void);
 static inline void rcu_init_tasks_generic(void) { }
 #endif
 
+#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_TASKS_RCU_GENERIC)
+void rcu_tasks_initiate_self_tests(void);
+#else
+static inline void rcu_tasks_initiate_self_tests(void) {}
+#endif
+
 #ifdef CONFIG_RCU_STALL_COMMON
 void rcu_sysrq_start(void);
 void rcu_sysrq_end(void);
diff --git a/init/main.c b/init/main.c
index 65fa2e41a9c09..cfc45e47b9ca4 100644
--- a/init/main.c
+++ b/init/main.c
@@ -1600,6 +1600,7 @@ static noinline void __init kernel_init_freeable(void)
 
 	rcu_init_tasks_generic();
 	do_pre_smp_initcalls();
+	rcu_tasks_initiate_self_tests();
 	lockup_detector_init();
 
 	smp_init();
diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d64f0b1d8cd3b..215c2c0f6c184 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -1659,9 +1659,10 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
 	pr_info("Callback from %s invoked.\n", rttd->name);
 
 	rttd->notrun = true;
+	WARN_ON(1);
 }
 
-static void rcu_tasks_initiate_self_tests(void)
+void rcu_tasks_initiate_self_tests(void)
 {
 	pr_info("Running RCU-tasks wait API self tests\n");
 #ifdef CONFIG_TASKS_RCU
@@ -1698,9 +1699,7 @@ static int rcu_tasks_verify_self_tests(void)
 	return ret;
 }
 late_initcall(rcu_tasks_verify_self_tests);
-#else /* #ifdef CONFIG_PROVE_RCU */
-static void rcu_tasks_initiate_self_tests(void) { }
-#endif /* #else #ifdef CONFIG_PROVE_RCU */
+#endif /* #ifdef CONFIG_PROVE_RCU */
 
 void __init rcu_init_tasks_generic(void)
 {
@@ -1715,9 +1714,6 @@ void __init rcu_init_tasks_generic(void)
 #ifdef CONFIG_TASKS_TRACE_RCU
 	rcu_spawn_tasks_trace_kthread();
 #endif
-
-	// Run the self-tests.
-	rcu_tasks_initiate_self_tests();
 }
 
 #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
-- 
2.35.1



* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-01 17:43 [PATCH] rcu: Delay the RCU-selftests during boot Sebastian Andrzej Siewior
@ 2022-03-01 17:48 ` Sebastian Andrzej Siewior
  2022-03-02  5:42 ` Paul E. McKenney
  1 sibling, 0 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-01 17:48 UTC (permalink / raw)
  To: rcu
  Cc: Paul E. McKenney, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-01 18:43:02 [+0100], To rcu@vger.kernel.org wrote:
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1659,9 +1659,10 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
>  	pr_info("Callback from %s invoked.\n", rttd->name);
>  
>  	rttd->notrun = true;
> +	WARN_ON(1);

Please ignore this piece, it sneaked in during testing.

>  }
>  

Sebastian


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-01 17:43 [PATCH] rcu: Delay the RCU-selftests during boot Sebastian Andrzej Siewior
  2022-03-01 17:48 ` Sebastian Andrzej Siewior
@ 2022-03-02  5:42 ` Paul E. McKenney
  2022-03-02 11:09   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-02  5:42 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Tue, Mar 01, 2022 at 06:43:01PM +0100, Sebastian Andrzej Siewior wrote:
> The RCU-selftest expects the timer_list timer to be working. This isn't
> the case on PREEMPT_RT at this point because it requires that ksoftirqd
> is up and running.
> There is no known requirement that synchronize_rcu() works at this
> point.
> 
> Delay RCU-selftests until ksoftirqd is up and running.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> 
> Paul, I don't mean to be rude or anything by saying that "There is no
> requirement…". If my memory serves me well here (and there is no
> guarantee for that) then synchronize_rcu() was required that early by
> kprobes or jump labels and this requirement was lifted by the time we
> discussed this. I have no idea how this discussion back then ended but
> as of v5.17-rc6 that timer is still required to function in order for
> the system to make progress at boot time.

No offense taken.

There have been several different things requiring various pieces of RCU
to be running quite early, and I need to do a bit of code archaeology to
check this out.  One of these is what forced me to move to expedited grace
periods once the scheduler started.  And the expedited grace periods do
require the timers be working, at least in their current form.

I vaguely recall tracing requiring call_rcu() extremely early (as
in before rcu_init() is called), but that was not a problem because
there was no requirement that the callback be invoked before the first
couple of sets of initcalls had completed.  The only challenge in this
case was to avoid stomping on the callback lists at rcu_init() time,
no timers required.

One straightforward approach would be to choose the call site of the
rcu_tasks_initiate_self_tests() function based on whether or not the
kernel was built with CONFIG_PREEMPT_RT.
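
For illustration, a minimal sketch of that approach in
kernel_init_freeable(), reusing the hunk from the patch above (the
IS_ENABLED() placement here is hypothetical, not a tested change):

	rcu_init_tasks_generic();
	/* On !RT, keep the self-tests at their current early point. */
	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
		rcu_tasks_initiate_self_tests();
	do_pre_smp_initcalls();	/* Spawns ksoftirqd via early_initcall(). */
	/* On RT, wait until ksoftirqd is up before running the self-tests. */
	if (IS_ENABLED(CONFIG_PREEMPT_RT))
		rcu_tasks_initiate_self_tests();
	lockup_detector_init();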

The other thought that comes to mind is that you may need the old-style
softirq until ksoftirqd is spawned.  I doubt that real-time response is
a big deal that early in boot.  ;-)

							Thanx, Paul

>  include/linux/rcupdate.h |  6 ++++++
>  init/main.c              |  1 +
>  kernel/rcu/tasks.h       | 10 +++-------
>  3 files changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
> index 88b42eb464068..2fb1a48ca9272 100644
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -95,6 +95,12 @@ void rcu_init_tasks_generic(void);
>  static inline void rcu_init_tasks_generic(void) { }
>  #endif
>  
> +#if defined(CONFIG_PROVE_RCU) && defined(CONFIG_TASKS_RCU_GENERIC)
> +void rcu_tasks_initiate_self_tests(void);
> +#else
> +static inline void rcu_tasks_initiate_self_tests(void) {}
> +#endif
> +
>  #ifdef CONFIG_RCU_STALL_COMMON
>  void rcu_sysrq_start(void);
>  void rcu_sysrq_end(void);
> diff --git a/init/main.c b/init/main.c
> index 65fa2e41a9c09..cfc45e47b9ca4 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1600,6 +1600,7 @@ static noinline void __init kernel_init_freeable(void)
>  
>  	rcu_init_tasks_generic();
>  	do_pre_smp_initcalls();
> +	rcu_tasks_initiate_self_tests();
>  	lockup_detector_init();
>  
>  	smp_init();
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index d64f0b1d8cd3b..215c2c0f6c184 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -1659,9 +1659,10 @@ static void test_rcu_tasks_callback(struct rcu_head *rhp)
>  	pr_info("Callback from %s invoked.\n", rttd->name);
>  
>  	rttd->notrun = true;
> +	WARN_ON(1);
>  }
>  
> -static void rcu_tasks_initiate_self_tests(void)
> +void rcu_tasks_initiate_self_tests(void)
>  {
>  	pr_info("Running RCU-tasks wait API self tests\n");
>  #ifdef CONFIG_TASKS_RCU
> @@ -1698,9 +1699,7 @@ static int rcu_tasks_verify_self_tests(void)
>  	return ret;
>  }
>  late_initcall(rcu_tasks_verify_self_tests);
> -#else /* #ifdef CONFIG_PROVE_RCU */
> -static void rcu_tasks_initiate_self_tests(void) { }
> -#endif /* #else #ifdef CONFIG_PROVE_RCU */
> +#endif /* #ifdef CONFIG_PROVE_RCU */
>  
>  void __init rcu_init_tasks_generic(void)
>  {
> @@ -1715,9 +1714,6 @@ void __init rcu_init_tasks_generic(void)
>  #ifdef CONFIG_TASKS_TRACE_RCU
>  	rcu_spawn_tasks_trace_kthread();
>  #endif
> -
> -	// Run the self-tests.
> -	rcu_tasks_initiate_self_tests();
>  }
>  
>  #else /* #ifdef CONFIG_TASKS_RCU_GENERIC */
> -- 
> 2.35.1
> 


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-02  5:42 ` Paul E. McKenney
@ 2022-03-02 11:09   ` Sebastian Andrzej Siewior
  2022-03-03  4:36     ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-02 11:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-01 21:42:17 [-0800], Paul E. McKenney wrote:
> There have been several different things requiring various pieces of RCU
> to be running quite early, and I need to do a bit of code archaeology to
> check this out.  One of these is what forced me to move to expedited grace
> periods once the scheduler started.  And the expedited grace periods do
> require the timers be working, at least in their current form.

Expedited as in synchronize_rcu_expedited()? Because we don't do this,
but then only after boot, so it probably doesn't change a thing and makes
other things complicated.

> I vaguely recall tracing requiring call_rcu() extremely early (as
> in before rcu_init() is called), but that was not a problem because
> there was no requirement that the callback be invoked before the first
> couple of sets of initcalls had completed.  The only challenge in this
> case was to avoid stomping on the callback lists at rcu_init() time,
> no timers required.

Okay.

> One straightforward approach would be to choose the call site of the
> rcu_tasks_initiate_self_tests() function based on whether or not the
> kernel was built with CONFIG_PREEMPT_RT.

But then we have this change between RT and !RT due to other
limitations. But my understanding is correct that RT does not have a
working synchronize_rcu() in an early_initcall() while !RT does, right?

> The other thought that comes to mind is that you may need the old-style
> softirq until ksoftirqd is spawned.  I doubt that real-time response is
> a big deal that early in boot.  ;-)

Puh. Given that the whole infrastructure is not prepared for that it is
probably more than a few lines plus a conditional jump in the IRQ
hot path. Also we do have scheduling at this point and IRQ-on/off works
which means possible deadlocks because sleeping locks are acquired in
hard-IRQ context.

But let me summarize this: I did this:

|@@ -1600,6 +1600,7 @@ static noinline void __init kernel_init_
|
|        rcu_init_tasks_generic();
|        do_pre_smp_initcalls();
|+       rcu_tasks_initiate_self_tests();
|        lockup_detector_init();
|
|        smp_init();

To simply move the test from rcu_init_tasks_generic() to after
do_pre_smp_initcalls(). If we can't move rcu_init_tasks_generic() after
do_pre_smp_initcalls() or at least the test part because we need working
synchronize_rcu() in early_initcall() then I need to move the RT
requirements before. Simple ;)

The requirements are:

--- a/init/main.c
+++ b/init/main.c
@@ -1598,6 +1601,9 @@ static noinline void __init kernel_init_freeable(void)
 
        init_mm_internals();
 
+       spawn_ksoftirqd();
+       irq_work_init_threads();
+
        rcu_init_tasks_generic();
        do_pre_smp_initcalls();
        lockup_detector_init();

spawn_ksoftirqd() is what I mentioned. What I just figured out is
irq_work_init_threads() due to
	call_rcu_tasks_iw_wakeup()

I can't move this to hard-IRQ context because it invokes wake_up() which
acquires sleeping locks. If you say that rtp->cbs_wq has only one waiter
and something like rcuwait_wait_event() / rcuwait_wake_up() would work
as well then call_rcu_tasks_iw_wakeup() could be lifted to hard-IRQ
context and we need to worry only about spawn_ksoftirqd() :)

> 							Thanx, Paul
> 

Sebastian


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-02 11:09   ` Sebastian Andrzej Siewior
@ 2022-03-03  4:36     ` Paul E. McKenney
  2022-03-03 10:10       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-03  4:36 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Wed, Mar 02, 2022 at 12:09:42PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-01 21:42:17 [-0800], Paul E. McKenney wrote:
> > There have been several different things requiring various pieces of RCU
> > to be running quite early, and I need to do a bit of code archaeology to
> > check this out.  One of these is what forced me to move to expedited grace
> > periods once the scheduler started.  And the expedited grace periods do
> > require the timers be working, at least in their current form.
> 
> Expedited as in synchronize_rcu_expedited()? Because we don't do this,
> but then only after boot, so it probably doesn't change a thing and makes
> other things complicated.

Expedited as in both synchronize_rcu_expedited() on the one hand and
also synchronize_rcu() between the time that the scheduler is active and
RCU has spawned its kthreads (AKA the "mid-boot dead zone").  I agree
that the former is not important for you, especially if you boot with
rcupdate.rcu_normal=1.  But the mid-boot dead zone uses expedited grace
periods because at that point in the process, the normal grace period
cannot function.
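
For reference, the mid-boot logic in synchronize_rcu() is roughly the
following (condensed from kernel/rcu/tree.c as of v5.17, lockdep
assertions omitted):

	void synchronize_rcu(void)
	{
		if (rcu_blocking_is_gp())
			return;	/* Pre-scheduler boot: a single context is trivially quiescent. */
		if (rcu_gp_is_expedited())
			synchronize_rcu_expedited();	/* Covers the mid-boot dead zone. */
		else
			wait_rcu_gp(call_rcu);	/* Normal grace period. */
	}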

> > I vaguely recall tracing requiring call_rcu() extremely early (as
> > in before rcu_init() is called), but that was not a problem because
> > there was no requirement that the callback be invoked before the first
> > couple of sets of initcalls had completed.  The only challenge in this
> > case was to avoid stomping on the callback lists at rcu_init() time,
> > no timers required.
> 
> Okay.
> 
> > One straightforward approach would be to choose the call site of the
> > rcu_tasks_initiate_self_tests() function based on whether or not the
> > kernel was built with CONFIG_PREEMPT_RT.
> 
> But then we have this change between RT and !RT due to other
> limitations. But my understanding is correct that RT does not have a
> working synchronize_rcu() in an early_initcall() while !RT does, right?

From what you are telling me, yes, right now RT's synchronize_rcu()
breaks in the mid-boot dead zone, but !RT's operates normally.

> > The other thought that comes to mind is that you may need the old-style
> > softirq until ksoftirqd is spawned.  I doubt that real-time response is
> > a big deal that early in boot.  ;-)
> 
> Puh. Given that the whole infrastructure is not prepared for that it is
> probably more than a few lines plus a conditional jump in the IRQ
> hot path. Also we do have scheduling at this point and IRQ-on/off works
> which means possible deadlocks because sleeping locks are acquired in
> hard-IRQ context.

This does need to work.  But let me see what you have in the following
paragraphs.

> But let me summarize this: I did this:
> 
> |@@ -1600,6 +1600,7 @@ static noinline void __init kernel_init_
> |
> |        rcu_init_tasks_generic();
> |        do_pre_smp_initcalls();
> |+       rcu_tasks_initiate_self_tests();
> |        lockup_detector_init();
> |
> |        smp_init();
> 
> To simply move the test from rcu_init_tasks_generic() to after
> do_pre_smp_initcalls(). If we can't move rcu_init_tasks_generic() after
> do_pre_smp_initcalls() or at least the test part because we need working
> synchronize_rcu() in early_initcall() then I need to move the RT
> requirements before. Simple ;)

As long as RT confines itself to configurations that do not need a
working synchronize_rcu() in the intervening code, yes, simple.  ;-)

> The requirements are:
> 
> --- a/init/main.c
> +++ b/init/main.c
> @@ -1598,6 +1601,9 @@ static noinline void __init kernel_init_freeable(void)
>  
>         init_mm_internals();
>  
> +       spawn_ksoftirqd();
> +       irq_work_init_threads();
> +
>         rcu_init_tasks_generic();
>         do_pre_smp_initcalls();
>         lockup_detector_init();
> 
> spawn_ksoftirqd() is what I mentioned. What I just figured out is
> irq_work_init_threads() due to
> 	call_rcu_tasks_iw_wakeup()
> 
> I can't move this to hard-IRQ context because it invokes wake_up() which
> acquires sleeping locks. If you say that rtp->cbs_wq has only one waiter
> and something like rcuwait_wait_event() / rcuwait_wake_up() would work
> as well then call_rcu_tasks_iw_wakeup() could be lifted to hard-IRQ
> context and we need to worry only about spawn_ksoftirqd() :)

OK, I was expecting that the swait_event_timeout_exclusive() call from
synchronize_rcu_expedited_wait_once() would be the problem.  Are you
saying that this swait_event_timeout_exclusive() works fine?  Or are you
instead saying that the call_rcu_tasks_iw_wakeup() issues cause trouble
before that swait_event_timeout_exclusive() gets a chance to cause its
own trouble?

Either way, it sounds like that irq_work_queue(&rtpcp->rtp_irq_work) in
call_rcu_tasks_generic() needs some adjustment to work in RT.  This should
be doable.  Given this, and given that the corresponding diagnostic
function rcu_tasks_verify_self_tests() is a late_initcall() function,
you don't need to move the call to rcu_init_tasks_generic(), correct?

Back over to you so that I can learn what I am still missing.  ;-)

							Thanx, Paul


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-03  4:36     ` Paul E. McKenney
@ 2022-03-03 10:10       ` Sebastian Andrzej Siewior
  2022-03-03 20:02         ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-03 10:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-02 20:36:50 [-0800], Paul E. McKenney wrote:
> > To simply move the test from rcu_init_tasks_generic() to after
> > do_pre_smp_initcalls(). If we can't move rcu_init_tasks_generic() after
> > do_pre_smp_initcalls() or at least the test part because we need working
> > synchronize_rcu() in early_initcall() then I need to move the RT
> > requirements before. Simple ;)
> 
> As long as RT confines itself to configurations that do not need a
> working synchronize_rcu() in the intervening code, yes, simple.  ;-)

;)

> > The requirements are:
> > 
> > --- a/init/main.c
> > +++ b/init/main.c
> > @@ -1598,6 +1601,9 @@ static noinline void __init kernel_init_freeable(void)
> >  
> >         init_mm_internals();
> >  
> > +       spawn_ksoftirqd();
> > +       irq_work_init_threads();
> > +
> >         rcu_init_tasks_generic();
> >         do_pre_smp_initcalls();
> >         lockup_detector_init();
> > 
> > spawn_ksoftirqd() is what I mentioned. What I just figured out is
> > irq_work_init_threads() due to
> > 	call_rcu_tasks_iw_wakeup()
> > 
> > I can't move this to hard-IRQ context because it invokes wake_up() which
> > acquires sleeping locks. If you say that rtp->cbs_wq has only one waiter
> > and something like rcuwait_wait_event() / rcuwait_wake_up() would work
> > as well then call_rcu_tasks_iw_wakeup() could be lifted to hard-IRQ
> > context and we need to worry only about spawn_ksoftirqd() :)
> 
> OK, I was expecting that the swait_event_timeout_exclusive() call from
> synchronize_rcu_expedited_wait_once() would be the problem.  Are you
> saying that this swait_event_timeout_exclusive() works fine?  

swait_event_timeout_exclusive() uses schedule_timeout() which uses a
timer_list timer and this one requires ksoftirqd to work.
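
For reference, the chain being described is roughly this, condensed from
schedule_timeout() in kernel/time/timer.c (v5.17, error handling
omitted); the expiry callback process_timeout() runs from TIMER_SOFTIRQ,
which PREEMPT_RT defers to ksoftirqd:

	struct process_timer timer;
	unsigned long expire = timeout + jiffies;

	timer.task = current;
	timer_setup_on_stack(&timer.timer, process_timeout, 0);
	__mod_timer(&timer.timer, expire, MOD_TIMER_NOTPENDING);
	schedule();	/* Nothing wakes us up without a working TIMER_SOFTIRQ. */
	del_singleshot_timer_sync(&timer.timer);
	destroy_timer_on_stack(&timer.timer);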

> Or are you
> instead saying that the call_rcu_tasks_iw_wakeup() issues cause trouble
> before that swait_event_timeout_exclusive() gets a chance to cause its
> own trouble?

Both are needed:
- ksoftirqd thread for timer_list timers handling.
- irq_work thread for irq_work which is not done in hard-IRQ context.

What I observe during boot:
| [    0.184838] cblist_init_generic: Setting adjustable number of callback queues.
| [    0.184853] cblist_init_generic: Setting shift to 2 and lim to 1.
| [    0.188116] irq_work_single() wake_up_klogd_work_func+0x0/0x70 26
| [    0.188861] cblist_init_generic: Setting shift to 2 and lim to 1.
| [    0.189671] Running RCU-tasks wait API self tests
| [    0.190254] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
| [    0.292569] expire_timers() process_timeout+0x0/0x10
| [    0.292655] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
| [    0.295082] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
| [    0.296481] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
| [    0.304685] rcu: Hierarchical SRCU implementation.
| [    0.311415] Callback from call_rcu_tasks_trace() invoked.
| [    0.344100] smp: Bringing up secondary CPUs ...

1x schedule_timeout() and 4x call_rcu_tasks_iw_wakeup(). Both are
crucial.

> Either way, it sounds like that irq_work_queue(&rtpcp->rtp_irq_work) in
> call_rcu_tasks_generic() needs some adjustment to work in RT.  This should
> be doable.  Given this, and given that the corresponding diagnostic
> function rcu_tasks_verify_self_tests() is a late_initcall() function,
> you don't need to move the call to rcu_init_tasks_generic(), correct?

#1 ksoftirqd must be spawned first in order to get timer_list timer to
   work. I'm going to do that, this should not be a problem.

#2 call_rcu_tasks_iw_wakeup. Here we have the following options:
   - Don't use, delay, …

   - if you can guarantee that there is only _one_ waiter
     => Replace rcu_tasks::cbs_wq with rcuwait. Let this irq_work run
        then in hard-IRQ context.

   - if you can't guarantee that there is only _one_ waiter
     => spawn the irq-work thread early.

As for #2, I managed to trigger the wakeup via tracing (and stumbled
into a bug unrelated to this) and I see only one waiter. Doesn't mean I
do it right and that there can't be a second waiter.

> Back over to you so that I can learn what I am still missing.  ;-)

Hope that helps.

> 							Thanx, Paul

Sebastian


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-03 10:10       ` Sebastian Andrzej Siewior
@ 2022-03-03 20:02         ` Paul E. McKenney
  2022-03-04 11:07           ` [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread() Sebastian Andrzej Siewior
  2022-03-04 15:09           ` [PATCH] rcu: Delay the RCU-selftests during boot Sebastian Andrzej Siewior
  0 siblings, 2 replies; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-03 20:02 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Thu, Mar 03, 2022 at 11:10:06AM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-02 20:36:50 [-0800], Paul E. McKenney wrote:
> > > To simply move the test from rcu_init_tasks_generic() to after
> > > do_pre_smp_initcalls(). If we can't move rcu_init_tasks_generic() after
> > > do_pre_smp_initcalls() or at least the test part because we need working
> > > synchronize_rcu() in early_initcall() then I need to move the RT
> > > requirements before. Simple ;)
> > 
> > As long as RT confines itself to configurations that do not need a
> > working synchronize_rcu() in the intervening code, yes, simple.  ;-)
> 
> ;)
> 
> > > The requirements are:
> > > 
> > > --- a/init/main.c
> > > +++ b/init/main.c
> > > @@ -1598,6 +1601,9 @@ static noinline void __init kernel_init_freeable(void)
> > >  
> > >         init_mm_internals();
> > >  
> > > +       spawn_ksoftirqd();
> > > +       irq_work_init_threads();
> > > +
> > >         rcu_init_tasks_generic();
> > >         do_pre_smp_initcalls();
> > >         lockup_detector_init();
> > > 
> > > spawn_ksoftirqd() is what I mentioned. What I just figured out is
> > > irq_work_init_threads() due to
> > > 	call_rcu_tasks_iw_wakeup()
> > > 
> > > I can't move this to hard-IRQ context because it invokes wake_up() which
> > > acquires sleeping locks. If you say that rtp->cbs_wq has only one waiter
> > > and something like rcuwait_wait_event() / rcuwait_wake_up() would work
> > > as well then call_rcu_tasks_iw_wakeup() could be lifted to hard-IRQ
> > > context and we need to worry only about spawn_ksoftirqd() :)
> > 
> > OK, I was expecting that the swait_event_timeout_exclusive() call from
> > synchronize_rcu_expedited_wait_once() would be the problem.  Are you
> > saying that this swait_event_timeout_exclusive() works fine?  
> 
> swait_event_timeout_exclusive() uses schedule_timeout() which uses a
> timer_list timer and this one requires ksoftirqd to work.

OK, at least things are consistent, then.  ;-)

> > Or are you
> > instead saying that the call_rcu_tasks_iw_wakeup() issues cause trouble
> > before that swait_event_timeout_exclusive() gets a chance to cause its
> > own trouble?
> 
> Both are needed:
> - ksoftirqd thread for timer_list timers handling.
> - irq_work thread for irq_work which is not done in hard-IRQ context.

Understood.

> What I observe during boot:
> | [    0.184838] cblist_init_generic: Setting adjustable number of callback queues.
> | [    0.184853] cblist_init_generic: Setting shift to 2 and lim to 1.
> | [    0.188116] irq_work_single() wake_up_klogd_work_func+0x0/0x70 26
> | [    0.188861] cblist_init_generic: Setting shift to 2 and lim to 1.
> | [    0.189671] Running RCU-tasks wait API self tests
> | [    0.190254] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
> | [    0.292569] expire_timers() process_timeout+0x0/0x10
> | [    0.292655] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
> | [    0.295082] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
> | [    0.296481] irq_work_single() call_rcu_tasks_iw_wakeup+0x0/0x20 22
> | [    0.304685] rcu: Hierarchical SRCU implementation.
> | [    0.311415] Callback from call_rcu_tasks_trace() invoked.
> | [    0.344100] smp: Bringing up secondary CPUs ...
> 
> 1x schedule_timeout() and 4x call_rcu_tasks_iw_wakeup(). Both are
> crucial.

Got it, thank you!

> > Either way, it sounds like that irq_work_queue(&rtpcp->rtp_irq_work) in
> > call_rcu_tasks_generic() needs some adjustment to work in RT.  This should
> > be doable.  Given this, and given that the corresponding diagnostic
> > function rcu_tasks_verify_self_tests() is a late_initcall() function,
> > you don't need to move the call to rcu_init_tasks_generic(), correct?
> 
> #1 ksoftirqd must be spawned first in order to get timer_list timer to
>    work. I'm going to do that, this should not be a problem.

I very much appreciate your flexibility on this, but it would be even
better if there was a good way to avoid the dependency on ksoftirqd,
at least during boot time.  Spawning ksoftirqd first would narrow the
window of RCU unavailability in RT, but it would be good to have RCU
work throughout, as it currently does in !RT.  (Give or take a short
time during the midst of the scheduler putting itself together.)

This might seem a bit utopian or even unreasonable, but please keep in
mind that both the scheduler and the idle loop use RCU.

However, that swait_event_timeout_exclusive() doesn't need exact timing
during boot.  Anything that let other tasks run for at least a few tens
of microseconds (up to say a millisecond) could easily work just fine.
Is there any such thing in RT?

> #2 call_rcu_tasks_iw_wakeup. Here we have the following options:
>    - Don't use, delay, …

What is the form of the delay?  All this code is trying to do is to make
the specified function execute in a clean environment so as to avoid
potential deadlocks.  So the range of possible solutions is quite broad.

>    - if you can guarantee that there is only _one_ waiter
>      => Replace rcu_tasks::cbs_wq with rcuwait. Let this irq_work run
>         then in hard-IRQ context.

There is indeed only ever one waiter.  

But rcuwait is going to acquire scheduler locks, correct?  It would be
better to avoid that potential deadlock.

>    - if you can't guarantee that there is only _one_ waiter
>      => spawn the irq-work thread early.

Spawning the irq-work kthread early still leaves a hole.

> As for #2, I managed to trigger the wakeup via tracing (and stumbled
> into a bug unrelated to this) and I see only one waiter. Doesn't mean I
> do it right and that there can't be a second waiter.

Only one waiter!  ;-)

Other approaches:

o	For the swait_event_timeout_exclusive(), I could make early
	boot uses instead do swait_event_exclusive() and make early boot
	rcu_sched_clock_irq() do an unconditional wakeup.  This would
	require a loop around one of the swait_event_exclusive()
	calls (cheaper than making rcu_sched_clock_irq() worry about
	durations).

	RCU would need to be informed of the end of "early boot",
	for example, by invoking some TBD RCU function as soon
	as the ksoftirqd kthreads are created.

	This would also require that rcu_needs_cpu() always return
	true during early boot.

	Static branches could be used if required, as they might be in
	rcu_needs_cpu() and maybe also in rcu_sched_clock_irq().

o	A similar TBD RCU function could cause call_rcu_tasks_generic()
	to avoid invoking irq_work_queue() until after the relevant
	kthread was created, but to do any needed wakeup at that point.
	If wakeups are needed before that time (which they might),
	then perhaps the combination of rcu_sched_clock_irq() and
	rcu_needs_cpu() can help out there as well.

These would be conditioned on IS_ENABLED(CONFIG_PREEMPT_RT).
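
A rough sketch of the first approach, with every name invented here
(rcu_boot_early, rcu_end_early_boot() and the tick hook are all
hypothetical):

	/* Cleared by the TBD function once the ksoftirqd kthreads exist. */
	static bool rcu_boot_early __read_mostly = true;

	void rcu_end_early_boot(void)
	{
		WRITE_ONCE(rcu_boot_early, false);
	}

	/*
	 * Called from rcu_sched_clock_irq(): an unconditional wakeup so that
	 * the expedited-GP wait can loop on swait_event_exclusive() without
	 * arming a timer_list timer.
	 */
	static void rcu_early_boot_tick(void)
	{
		if (IS_ENABLED(CONFIG_PREEMPT_RT) && READ_ONCE(rcu_boot_early))
			swake_up_one(&rcu_state.expedited_wq);
	}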

But now you are going to tell me that wakeups cannot be done from the
scheduler tick interrupt handler?  If that is the case, are there other
approaches?

> > Back over to you so that I can learn what I am still missing.  ;-)
> 
> Hope that helps.

Maybe?  You tell me!  ;-)

							Thanx, Paul


* [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-03 20:02         ` Paul E. McKenney
@ 2022-03-04 11:07           ` Sebastian Andrzej Siewior
  2022-03-04 23:21             ` Joel Fernandes
  2022-03-05  4:40             ` Paul E. McKenney
  2022-03-04 15:09           ` [PATCH] rcu: Delay the RCU-selftests during boot Sebastian Andrzej Siewior
  1 sibling, 2 replies; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-04 11:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

The waitqueue used by rcu_tasks_kthread() has always had only one waiter.
With only one waiter guaranteed, it can be replaced with rcuwait,
which is smaller and simpler. With the rcuwait-based wake counterpart, the
irqwork function (call_rcu_tasks_iw_wakeup()) can be invoked in hardirq
context because it is only a wake up and no sleeping locks are involved
(unlike with the wait_queue_head).
As a side effect, this is also one piece of the puzzle to pass the RCU
selftest at early boot on PREEMPT_RT.

Replace wait_queue_head with rcuwait and let the irqwork run in hardirq
context on PREEMPT_RT.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
On 2022-03-03 12:02:37 [-0800], Paul E. McKenney wrote:
> There is indeed only ever one waiter.  
> 
> But rcuwait is going to acquire scheduler locks, correct?  It would be
> better to avoid that potential deadlock.

This is what I had in mind. Does this work for you?

 kernel/rcu/tasks.h | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index d64f0b1d8cd3b..f804afb304135 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -46,7 +46,7 @@ struct rcu_tasks_percpu {
 
 /**
  * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism.
- * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
+ * @cbs_wait: RCU wait allowing a new callback to get kthread's attention.
  * @cbs_gbl_lock: Lock protecting callback list.
  * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
  * @gp_func: This flavor's grace-period-wait function.
@@ -77,7 +77,7 @@ struct rcu_tasks_percpu {
  * @kname: This flavor's kthread name.
  */
 struct rcu_tasks {
-	struct wait_queue_head cbs_wq;
+	struct rcuwait cbs_wait;
 	raw_spinlock_t cbs_gbl_lock;
 	int gp_state;
 	int gp_sleep;
@@ -113,11 +113,11 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
 #define DEFINE_RCU_TASKS(rt_name, gp, call, n)						\
 static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {			\
 	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),		\
-	.rtp_irq_work = IRQ_WORK_INIT(call_rcu_tasks_iw_wakeup),			\
+	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),			\
 };											\
 static struct rcu_tasks rt_name =							\
 {											\
-	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),			\
+	.cbs_wait = __RCUWAIT_INITIALIZER(rt_name.wait),				\
 	.cbs_gbl_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_gbl_lock),			\
 	.gp_func = gp,									\
 	.call_func = call,								\
@@ -261,7 +261,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
 	struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, rtp_irq_work);
 
 	rtp = rtpcp->rtpp;
-	wake_up(&rtp->cbs_wq);
+	rcuwait_wake_up(&rtp->cbs_wait);
 }
 
 // Enqueue a callback for the specified flavor of Tasks RCU.
@@ -509,7 +509,9 @@ static int __noreturn rcu_tasks_kthread(void *arg)
 		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
 
 		/* If there were none, wait a bit and start over. */
-		wait_event_idle(rtp->cbs_wq, (needgpcb = rcu_tasks_need_gpcb(rtp)));
+		rcuwait_wait_event(&rtp->cbs_wait,
+				   (needgpcb = rcu_tasks_need_gpcb(rtp)),
+				   TASK_IDLE);
 
 		if (needgpcb & 0x2) {
 			// Wait for one grace period.
-- 
2.35.1



* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-03 20:02         ` Paul E. McKenney
  2022-03-04 11:07           ` [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread() Sebastian Andrzej Siewior
@ 2022-03-04 15:09           ` Sebastian Andrzej Siewior
  2022-03-05  5:00             ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-04 15:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-03 12:02:37 [-0800], Paul E. McKenney wrote:
> > > Either way, it sounds like that irq_work_queue(&rtpcp->rtp_irq_work) in
> > > call_rcu_tasks_generic() needs some adjustment to work in RT.  This should
> > > be doable.  Given this, and given that the corresponding diagnostic
> > > function rcu_tasks_verify_self_tests() is a late_initcall() function,
> > > you don't need to move the call to rcu_init_tasks_generic(), correct?
> > 
> > #1 ksoftirqd must be spawned first in order to get timer_list timer to
> >    work. I'm going to do that, this should not be a problem.
> 
> I very much appreciate your flexibility on this, but it would be even
> better if there was a good way to avoid the dependency on ksoftirqd,
> at least during boot time.  Spawning ksoftirqd first would narrow the
> window of RCU unavailability in RT, but it would be good to have RCU
> work throughout, as it currently does in !RT.  (Give or take a short
> time during the midst of the scheduler putting itself together.)

During SYSTEM_BOOTING we could do softirqs right away but we lack the
infrastructure. Starting with SYSTEM_SCHEDULING we rely on the thread so
it needs to be spawned earlier. The problem with SYSTEM_SCHEDULING+ is
that we may deadlock if the softirqs are performed in IRQ context.

> This might seem a bit utopian or even unreasonable, but please keep in
> mind that both the scheduler and the idle loop use RCU.

But the problem is only the usage of synchronize_rcu(). So
rcu_read_lock() and call_rcu() work. Only synchronize_rcu() does not.
Couldn't we make a rule to use it at earliest within early_initcall()?

> However, that swait_event_timeout_exclusive() doesn't need exact timing
> during boot.  Anything that lets other tasks run for at least a few tens
> of microseconds (up to say a millisecond) could easily work just fine.
> Is there any such thing in RT?

swait_event_timeout_exclusive() appears not to be the culprit. It is
invoked a few times (with a 6.5ms timeout) but returns without setting
up a timer. So either my setup avoids the timer or this happens always
and is not related to my config.

rcu_tasks_wait_gp() does schedule_timeout_idle() and this is the one
that blocks. This could be replaced with schedule_hrtimeout() (just
tested). I hate the idea of using a precise delay in a timeout-like
situation. But we could use schedule_hrtimeout_range() with a HZ delta
so it kind of feels like the timer_list timer ;)
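
As a sketch, the replacement inside rcu_tasks_wait_gp() could look like
this, with timeout being the jiffies value currently handed to
schedule_timeout_idle() and the one-jiffy slack being the "HZ delta":

	/* hrtimer-based sleep: no timer_list timer, so no ksoftirqd needed. */
	ktime_t exp = ns_to_ktime(jiffies_to_nsecs(timeout));

	__set_current_state(TASK_IDLE);
	schedule_hrtimeout_range(&exp, jiffies_to_nsecs(1), HRTIMER_MODE_REL);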

Also I have no idea how often this is triggered / under which
circumstances (assuming it is bound to synchronize_rcu()).

> >    - if you can't guarantee that there is only _one_ waiter
> >      => spawn the irq-work thread early.
> 
> Spawning the irq-work kthread early still leaves a hole.

Why is spawning ksoftirqd + irq-work before early_initcall() still a
hole? If the definition is _no_ synchronize_rcu() before
early_initcall() then it works as documented.

> Other approaches:
> 
> o	For the swait_event_timeout_exclusive(), I could make early
> 	boot uses instead do swait_event_exclusive() and make early boot
> 	rcu_sched_clock_irq() do an unconditional wakeup.  This would
> 	require a loop around one of the swait_event_exclusive()
> 	calls (cheaper than making rcu_sched_clock_irq() worry about
> 	durations).
> 
> 	RCU would need to be informed of the end of "early boot",
> 	for example, by invoking some TBD RCU function as soon
> 	as the ksoftirqd kthreads are created.
> 
> 	This would also require that rcu_needs_cpu() always return
> 	true during early boot.
> 
> 	Static branches could be used if required, as they might be in
> 	rcu_needs_cpu() and maybe also in rcu_sched_clock_irq().

swait_event_timeout_exclusive() appears innocent.

> o	A similar TBD RCU function could cause call_rcu_tasks_generic()
> 	to avoid invoking irq_work_queue() until after the relevant
> 	kthread was created, but to do any needed wakeup at that point.
> 	If wakeups are needed before that time (which they might),
> 	then perhaps the combination of rcu_sched_clock_irq() and
> 	rcu_needs_cpu() can help out there as well.

IRQ-work has been addressed in a different patch.

> These would be conditioned on IS_ENABLED(CONFIG_PREEMPT_RT).
> 
> But now you are going to tell me that wakeups cannot be done from the
> scheduler tick interrupt handler?  If that is the case, are there other
> approaches?

If you buy my irqwork patch then I think we are down to:
- spawn ksoftirqd early
- use schedule_hrtimeout() during boot or the whole time (no idea how
  often this triggers).

> > > Back over to you so that I can learn what I am still missing.  ;-)
> > 
> > Hope that helps.
> 
> Maybe?  You tell me!  ;-)
> 
> 							Thanx, Paul

Sebastian


* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-04 11:07           ` [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread() Sebastian Andrzej Siewior
@ 2022-03-04 23:21             ` Joel Fernandes
  2022-03-05  4:40               ` Paul E. McKenney
  2022-03-05  4:40             ` Paul E. McKenney
  1 sibling, 1 reply; 23+ messages in thread
From: Joel Fernandes @ 2022-03-04 23:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: Paul E. McKenney, rcu, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Fri, Mar 04, 2022 at 12:07:25PM +0100, Sebastian Andrzej Siewior wrote:
> The waitqueue used by rcu_tasks_kthread() has always had only one waiter.
> With only one waiter guaranteed, it can be replaced with rcuwait,
> which is smaller and simpler. With the rcuwait-based wake counterpart, the
> irqwork function (call_rcu_tasks_iw_wakeup()) can be invoked in hardirq
> context because it is only a wake up and no sleeping locks are involved
> (unlike with the wait_queue_head).
> As a side effect, this is also one piece of the puzzle to pass the RCU
> selftest at early boot on PREEMPT_RT.
> 
> Replace wait_queue_head with rcuwait and let the irqwork run in hardirq
> context on PREEMPT_RT.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> On 2022-03-03 12:02:37 [-0800], Paul E. McKenney wrote:
> > There is indeed only ever one waiter.  
> > 
> > But rcuwait is going to acquire scheduler locks, correct?  It would be
> > better to avoid that potential deadlock.
> 
> This is what I had in mind. Does this work for you?

In theory, Sebastian's idea makes sense to me. Since scheduler locks are not
sleepable even on -RT, using rcuwait-based wakeup is OK with the added
benefit that no locking would be needed during the wake. Hence, this wake
up can happen even from hardirq context in an -RT kernel.
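
For reference, rcuwait_wake_up() is roughly the following (condensed from
kernel/exit.c, memory-ordering barriers omitted), which is why nothing
beyond RCU and wake_up_process() is involved:

	int rcuwait_wake_up(struct rcuwait *w)
	{
		int ret = 0;
		struct task_struct *task;

		rcu_read_lock();
		task = rcu_dereference(w->task);
		if (task)
			ret = wake_up_process(task);	/* Raw scheduler locks only. */
		rcu_read_unlock();

		return ret;
	}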

Correct me if I got that wrong though.

thanks,

 - Joel


>  kernel/rcu/tasks.h | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index d64f0b1d8cd3b..f804afb304135 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -46,7 +46,7 @@ struct rcu_tasks_percpu {
>  
>  /**
>   * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism.
> - * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
> + * @cbs_wait: RCU wait allowing a new callback to get kthread's attention.
>   * @cbs_gbl_lock: Lock protecting callback list.
>   * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
>   * @gp_func: This flavor's grace-period-wait function.
> @@ -77,7 +77,7 @@ struct rcu_tasks_percpu {
>   * @kname: This flavor's kthread name.
>   */
>  struct rcu_tasks {
> -	struct wait_queue_head cbs_wq;
> +	struct rcuwait cbs_wait;
>  	raw_spinlock_t cbs_gbl_lock;
>  	int gp_state;
>  	int gp_sleep;
> @@ -113,11 +113,11 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
>  #define DEFINE_RCU_TASKS(rt_name, gp, call, n)						\
>  static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {			\
>  	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),		\
> -	.rtp_irq_work = IRQ_WORK_INIT(call_rcu_tasks_iw_wakeup),			\
> +	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),			\
>  };											\
>  static struct rcu_tasks rt_name =							\
>  {											\
> -	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),			\
> +	.cbs_wait = __RCUWAIT_INITIALIZER(rt_name.wait),				\
>  	.cbs_gbl_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_gbl_lock),			\
>  	.gp_func = gp,									\
>  	.call_func = call,								\
> @@ -261,7 +261,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
>  	struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, rtp_irq_work);
>  
>  	rtp = rtpcp->rtpp;
> -	wake_up(&rtp->cbs_wq);
> +	rcuwait_wake_up(&rtp->cbs_wait);
>  }
>  
>  // Enqueue a callback for the specified flavor of Tasks RCU.
> @@ -509,7 +509,9 @@ static int __noreturn rcu_tasks_kthread(void *arg)
>  		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>  
>  		/* If there were none, wait a bit and start over. */
> -		wait_event_idle(rtp->cbs_wq, (needgpcb = rcu_tasks_need_gpcb(rtp)));
> +		rcuwait_wait_event(&rtp->cbs_wait,
> +				   (needgpcb = rcu_tasks_need_gpcb(rtp)),
> +				   TASK_IDLE);
>  
>  		if (needgpcb & 0x2) {
>  			// Wait for one grace period.
> -- 
> 2.35.1
> 


* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-04 11:07           ` [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread() Sebastian Andrzej Siewior
  2022-03-04 23:21             ` Joel Fernandes
@ 2022-03-05  4:40             ` Paul E. McKenney
  2022-03-07 11:17               ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-05  4:40 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Fri, Mar 04, 2022 at 12:07:25PM +0100, Sebastian Andrzej Siewior wrote:
> The waitqueue used by rcu_tasks_kthread() has always had only one waiter.
> With only one waiter guaranteed, it can be replaced with rcuwait,
> which is smaller and simpler. With the rcuwait-based wake counterpart, the
> irqwork function (call_rcu_tasks_iw_wakeup()) can be invoked in hardirq
> context because it is only a wake up and no sleeping locks are involved
> (unlike with the wait_queue_head).
> As a side effect, this is also one piece of the puzzle to pass the RCU
> selftest at early boot on PREEMPT_RT.
> 
> Replace wait_queue_head with rcuwait and let the irqwork run in hardirq
> context on PREEMPT_RT.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
> On 2022-03-03 12:02:37 [-0800], Paul E. McKenney wrote:
> > There is indeed only ever one waiter.  
> > 
> > But rcuwait is going to acquire scheduler locks, correct?  It would be
> > better to avoid that potential deadlock.
> 
> This is what I had in mind. Does this work for you?

This does seem to be working, thank you!  I even manually inserted
a pi_lock() acquisition across one of the calls to call_rcu_tasks(),
and it did seem to work.

>  kernel/rcu/tasks.h | 14 ++++++++------
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index d64f0b1d8cd3b..f804afb304135 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -46,7 +46,7 @@ struct rcu_tasks_percpu {
>  
>  /**
>   * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism.
> - * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
> + * @cbs_wait: RCU wait allowing a new callback to get kthread's attention.
>   * @cbs_gbl_lock: Lock protecting callback list.
>   * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
>   * @gp_func: This flavor's grace-period-wait function.
> @@ -77,7 +77,7 @@ struct rcu_tasks_percpu {
>   * @kname: This flavor's kthread name.
>   */
>  struct rcu_tasks {
> -	struct wait_queue_head cbs_wq;
> +	struct rcuwait cbs_wait;
>  	raw_spinlock_t cbs_gbl_lock;
>  	int gp_state;
>  	int gp_sleep;
> @@ -113,11 +113,11 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
>  #define DEFINE_RCU_TASKS(rt_name, gp, call, n)						\
>  static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {			\
>  	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),		\
> -	.rtp_irq_work = IRQ_WORK_INIT(call_rcu_tasks_iw_wakeup),			\
> +	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),			\

This is the key piece, right?  If so, I am wondering if there are other
irq_work instances that need to move to _HARD.
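
For context, the two initializers differ only in the IRQ_WORK_HARD_IRQ
flag (include/linux/irq_work.h):

	#define IRQ_WORK_INIT(_func)      __IRQ_WORK_INIT(_func, 0)
	#define IRQ_WORK_INIT_HARD(_func) __IRQ_WORK_INIT(_func, IRQ_WORK_HARD_IRQ)

so a _HARD instance keeps running from hard interrupt context on
PREEMPT_RT instead of being deferred to the irq_work kthread.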

							Thanx, Paul

>  };											\
>  static struct rcu_tasks rt_name =							\
>  {											\
> -	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),			\
> +	.cbs_wait = __RCUWAIT_INITIALIZER(rt_name.wait),				\
>  	.cbs_gbl_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_gbl_lock),			\
>  	.gp_func = gp,									\
>  	.call_func = call,								\
> @@ -261,7 +261,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
>  	struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, rtp_irq_work);
>  
>  	rtp = rtpcp->rtpp;
> -	wake_up(&rtp->cbs_wq);
> +	rcuwait_wake_up(&rtp->cbs_wait);
>  }
>  
>  // Enqueue a callback for the specified flavor of Tasks RCU.
> @@ -509,7 +509,9 @@ static int __noreturn rcu_tasks_kthread(void *arg)
>  		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
>  
>  		/* If there were none, wait a bit and start over. */
> -		wait_event_idle(rtp->cbs_wq, (needgpcb = rcu_tasks_need_gpcb(rtp)));
> +		rcuwait_wait_event(&rtp->cbs_wait,
> +				   (needgpcb = rcu_tasks_need_gpcb(rtp)),
> +				   TASK_IDLE);
>  
>  		if (needgpcb & 0x2) {
>  			// Wait for one grace period.
> -- 
> 2.35.1
> 


* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-04 23:21             ` Joel Fernandes
@ 2022-03-05  4:40               ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-05  4:40 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Sebastian Andrzej Siewior, rcu, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Fri, Mar 04, 2022 at 11:21:31PM +0000, Joel Fernandes wrote:
> On Fri, Mar 04, 2022 at 12:07:25PM +0100, Sebastian Andrzej Siewior wrote:
> > The waitqueue used by rcu_tasks_kthread() has always had only one waiter.
> > With only one waiter guaranteed, it can be replaced with rcuwait,
> > which is smaller and simpler. With the rcuwait-based wake counterpart, the
> > irqwork function (call_rcu_tasks_iw_wakeup()) can be invoked in hardirq
> > context because it is only a wake up and no sleeping locks are involved
> > (unlike with the wait_queue_head).
> > As a side effect, this is also one piece of the puzzle to pass the RCU
> > selftest at early boot on PREEMPT_RT.
> > 
> > Replace wait_queue_head with rcuwait and let the irqwork run in hardirq
> > context on PREEMPT_RT.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > ---
> > On 2022-03-03 12:02:37 [-0800], Paul E. McKenney wrote:
> > > There is indeed only ever one waiter.  
> > > 
> > > But rcuwait is going to acquire scheduler locks, correct?  It would be
> > > better to avoid that potential deadlock.
> > 
> > This is what I had in mind. Does this work for you?
> 
> In theory, Sebastian's idea makes sense to me. Since scheduler locks are not
> sleepable even on -RT, using rcuwait-based wakeup is OK with the added
> benefit that no locking would be needed during the wake. Hence, this wake
> up can happen even from hardirq context in an -RT kernel.

Here is hoping.  ;-)

						Thanx, Paul

> Correct me if I got that wrong though.
> 
> thanks,
> 
>  - Joel
> 
> 
> >  kernel/rcu/tasks.h | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index d64f0b1d8cd3b..f804afb304135 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -46,7 +46,7 @@ struct rcu_tasks_percpu {
> >  
> >  /**
> >   * struct rcu_tasks - Definition for a Tasks-RCU-like mechanism.
> > - * @cbs_wq: Wait queue allowing new callback to get kthread's attention.
> > + * @cbs_wait: RCU wait allowing a new callback to get kthread's attention.
> >   * @cbs_gbl_lock: Lock protecting callback list.
> >   * @kthread_ptr: This flavor's grace-period/callback-invocation kthread.
> >   * @gp_func: This flavor's grace-period-wait function.
> > @@ -77,7 +77,7 @@ struct rcu_tasks_percpu {
> >   * @kname: This flavor's kthread name.
> >   */
> >  struct rcu_tasks {
> > -	struct wait_queue_head cbs_wq;
> > +	struct rcuwait cbs_wait;
> >  	raw_spinlock_t cbs_gbl_lock;
> >  	int gp_state;
> >  	int gp_sleep;
> > @@ -113,11 +113,11 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
> >  #define DEFINE_RCU_TASKS(rt_name, gp, call, n)						\
> >  static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {			\
> >  	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),		\
> > -	.rtp_irq_work = IRQ_WORK_INIT(call_rcu_tasks_iw_wakeup),			\
> > +	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),			\
> >  };											\
> >  static struct rcu_tasks rt_name =							\
> >  {											\
> > -	.cbs_wq = __WAIT_QUEUE_HEAD_INITIALIZER(rt_name.cbs_wq),			\
> > +	.cbs_wait = __RCUWAIT_INITIALIZER(rt_name.wait),				\
> >  	.cbs_gbl_lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name.cbs_gbl_lock),			\
> >  	.gp_func = gp,									\
> >  	.call_func = call,								\
> > @@ -261,7 +261,7 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp)
> >  	struct rcu_tasks_percpu *rtpcp = container_of(iwp, struct rcu_tasks_percpu, rtp_irq_work);
> >  
> >  	rtp = rtpcp->rtpp;
> > -	wake_up(&rtp->cbs_wq);
> > +	rcuwait_wake_up(&rtp->cbs_wait);
> >  }
> >  
> >  // Enqueue a callback for the specified flavor of Tasks RCU.
> > @@ -509,7 +509,9 @@ static int __noreturn rcu_tasks_kthread(void *arg)
> >  		set_tasks_gp_state(rtp, RTGS_WAIT_CBS);
> >  
> >  		/* If there were none, wait a bit and start over. */
> > -		wait_event_idle(rtp->cbs_wq, (needgpcb = rcu_tasks_need_gpcb(rtp)));
> > +		rcuwait_wait_event(&rtp->cbs_wait,
> > +				   (needgpcb = rcu_tasks_need_gpcb(rtp)),
> > +				   TASK_IDLE);
> >  
> >  		if (needgpcb & 0x2) {
> >  			// Wait for one grace period.
> > -- 
> > 2.35.1
> > 


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-04 15:09           ` [PATCH] rcu: Delay the RCU-selftests during boot Sebastian Andrzej Siewior
@ 2022-03-05  5:00             ` Paul E. McKenney
  2022-03-07 17:54               ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-05  5:00 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Fri, Mar 04, 2022 at 04:09:42PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-03 12:02:37 [-0800], Paul E. McKenney wrote:
> > > > Either way, it sounds like that irq_work_queue(&rtpcp->rtp_irq_work) in
> > > > call_rcu_tasks_generic() needs some adjustment to work in RT.  This should
> > > > be doable.  Given this, and given that the corresponding diagnostic
> > > > function rcu_tasks_verify_self_tests() is a late_initcall() function,
> > > > you don't need to move the call to rcu_init_tasks_generic(), correct?
> > > 
> > > #1 ksoftirqd must be spawned first in order to get timer_list timer to
> > >    work. I'm going to do that, this should not be a problem.
> > 
> > I very much appreciate your flexibility on this, but it would be even
> > better if there was a good way to avoid the dependency on ksoftirqd,
> > at least during boot time.  Spawning ksoftirqd first would narrow the
> > window of RCU unavailability in RT, but it would be good to have RCU
> > work throughout, as it currently does in !RT.  (Give or take a short
> > time during the midst of the scheduler putting itself together.)
> 
> During SYSTEM_BOOTING we could do softirqs right away but we lack the
> infrastructure. Starting with SYSTEM_SCHEDULING we rely on the thread so
> it needs to be spawned earlier. The problem with SYSTEM_SCHEDULING+ is
> that we may deadlock if the softirqs are performed in IRQ context.

Understood.  My goal is to prevent RCU from being yet another odd
constraint that people writing boot-time code need to worry about.
Or at least no additional odd constraints beyond the ones that it already
presents.  :-/

> > This might seem a bit utopian or even unreasonable, but please keep in
> > mind that both the scheduler and the idle loop use RCU.
> 
> But the problem is only the usage of synchronize_rcu().

And synchronize_rcu_expedited(), but yes in that call_rcu() and so
on still work.

>                                                         So
> rcu_read_lock() and call_rcu() work. Only synchronize_rcu() does not.
> Couldn't we make a rule to use it at the earliest within early_initcall()?

Of course we could make such a rule.

And sometimes, people running into problems with that rule might be able
to move their code earlier or later and avoid problems.  But other times
they have to do something else.  Which will sometimes mean that we are
asking them to re-implement some odd special case of RCU within their
own subsystem, which just does not sound like a good idea.

In fact, my experience indicates that it is way easier to make RCU work
more globally than to work through all the issues stemming from these
sorts of limits on RCU users.  Takes less time, also.

And it probably is not all -that- hard.

> > However, that swait_event_timeout_exclusive() doesn't need exact timing
> > during boot.  Anything that let other tasks run for at least a few tens
> > of microseconds (up to say a millisecond) could easily work just fine.
> > Is there any such thing in RT?
> 
> swait_event_timeout_exclusive() appears not to be the culprit. It is
> invoked a few times (with a 6.5ms timeout) but returns without setting
> up a timer. So either my setup avoids the timer or this happens always
> and is not related to my config.

Now that you mention it, yes.  There is only one CPU, so unless you have
an odd series of preemptions, it quickly figures out that it does not
need to wait.  But that odd series of preemptions really is out there,
patiently waiting for us to lose context on this code.

> rcu_tasks_wait_gp() does schedule_timeout_idle() and this is the one
> that blocks. This could be replaced with schedule_hrtimeout() (just
> tested). I hate the idea to use a precise delay in a timeout like
> situation. But we could use schedule_hrtimeout_range() with a HZ delta
> so it kind of feels like the timer_list timer ;)

If schedule_hrtimeout_range() works, I am good with it.
And you are right, precision is not required here.  And maybe
schedule_hrtimeout_range() could also be used to create a crude
boot-time-only polling loop for the swait_event_timeout_exclusive()?

> Also I have no idea how often this is triggered / under which
> circumstances (assuming it is bound to synchronize_rcu()).

If I understand you here, it only has to happen once in a while.

> > >    - if you can't guarantee that there is only _one_ waiter
> > >      => spawn the irq-work thread early.
> > 
> > Spawning the irq-work kthread early still leaves a hole.
> 
> Why is spawning ksoftirqd + irq-work before early_initcall() still a
> hole? If the definition is _no_ synchronize_rcu() before
> early_initcall() then it works as documented.

Because based on past experience, it will be way easier to make RCU not
have that hole than to deal with that hole's existence.

> > Other approaches:
> > 
> > o	For the swait_event_timeout_exclusive(), I could make early
> > 	boot uses instead do swait_event_exclusive() and make early boot
> > 	rcu_sched_clock_irq() do an unconditional wakeup.  This would
> > 	require a loop around one of the swait_event_exclusive()
> > 	calls (cheaper than making rcu_sched_clock_irq() worry about
> > 	durations).
> > 
> > 	RCU would need to be informed of the end of "early boot",
> > 	for example, by invoking some TBD RCU function as soon
> > 	as the ksoftirqd kthreads are created.
> > 
> > 	This would also require that rcu_needs_cpu() always return
> > 	true during early boot.
> > 
> > 	Static branches could be used if required, as they might be in
> > 	rcu_needs_cpu() and maybe also in rcu_sched_clock_irq().
> 
> swait_event_timeout_exclusive() appears innocent.

I agree that it would rarely need to block, but if the task executing the
synchronize_rcu() preempted one of the readers, wouldn't it have to block?
Or am I missing some code path that excludes that possibility?

> > o	A similar TBD RCU function could cause call_rcu_tasks_generic()
> > 	to avoid invoking irq_work_queue() until after the relevant
> > 	kthread was created, but to do any needed wakeup at that point.
> > 	If wakeups are needed before that time (which they might),
> > 	then perhaps the combination of rcu_sched_clock_irq() and
> > 	rcu_needs_cpu() can help out there as well.
> 
> IRQ-work has been addressed in a different patch.

And from what I can see, that IRQ_WORK_INIT_HARD() is just the ticket.
Thank you!!!

> > These would be conditioned on IS_ENABLED(CONFIG_PREEMPT_RT).
> > 
> > But now you are going to tell me that wakeups cannot be done from the
> > scheduler tick interrupt handler?  If that is the case, are there other
> > approaches?
> 
> If you go by my irqwork patch then I think we are down to:
> - spawn ksoftirqd early
> - use schedule_hrtimeout() during boot or the whole time (no idea how
>   often this triggers).

The boot-time schedule_hrtimeout_range() seems to cover things, especially
given that most of the time there would be no need to block.  Or is
there yet another gap where schedule_hrtimeout_range() does not work?
(After the scheduler starts.)

							Thanx, Paul


* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-05  4:40             ` Paul E. McKenney
@ 2022-03-07 11:17               ` Sebastian Andrzej Siewior
  2022-03-07 15:09                 ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-07 11:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-04 20:40:07 [-0800], Paul E. McKenney wrote:
> > This is what I had in mind. Does this work for you?
> 
> This does seem to be working, thank you!  I even manually inserted
> a pi_lock() acquisition across one of the calls to call_rcu_tasks(),
> and it did seem to work.

perfect.

> >  kernel/rcu/tasks.h | 14 ++++++++------
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > index d64f0b1d8cd3b..f804afb304135 100644
> > --- a/kernel/rcu/tasks.h
> > +++ b/kernel/rcu/tasks.h
> > @@ -113,11 +113,11 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
> >  #define DEFINE_RCU_TASKS(rt_name, gp, call, n)						\
> >  static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {			\
> >  	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),		\
> > -	.rtp_irq_work = IRQ_WORK_INIT(call_rcu_tasks_iw_wakeup),			\
> > +	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),			\
> 
> This is the key piece, right? 

Correct. Possible due to the removal of wait_queue_head (plus there is
nothing else that would stand in the way otherwise).

> If so, I am wondering if there are other
> irq_work instances that need to move to _HARD.

We already have:
kernel/rcu/tree.c:      rdp->rcu_iw = IRQ_WORK_INIT_HARD(rcu_iw_handler);

;)
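
The only functional difference is the IRQ_WORK_HARD_IRQ flag. A rough
sketch of the two variants side by side (handler and names made up for
illustration, not actual kernel code):

	#include <linux/irq_work.h>

	static void my_iw_handler(struct irq_work *iwp)
	{
		/* With _HARD this runs from hard-irq context even on RT.
		 * With plain IRQ_WORK_INIT() it is instead deferred to
		 * thread context on RT.
		 */
	}

	/* Fires from hard-irq context on both RT and !RT: */
	static struct irq_work my_iw = IRQ_WORK_INIT_HARD(my_iw_handler);

	static void kick_it(void)
	{
		irq_work_queue(&my_iw);	/* fine from any atomic context */
	}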

> 							Thanx, Paul

Sebastian


* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-07 11:17               ` Sebastian Andrzej Siewior
@ 2022-03-07 15:09                 ` Paul E. McKenney
  2022-03-07 17:57                   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-07 15:09 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Mon, Mar 07, 2022 at 12:17:55PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-04 20:40:07 [-0800], Paul E. McKenney wrote:
> > > This is what I had in mind. Does this work for you?
> > 
> > This does seem to be working, thank you!  I even manually inserted
> > a pi_lock() acquisition across one of the calls to call_rcu_tasks(),
> > and it did seem to work.
> 
> perfect.

Sounds like I should add that to rcutorture.

> > >  kernel/rcu/tasks.h | 14 ++++++++------
> > >  1 file changed, 8 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> > > index d64f0b1d8cd3b..f804afb304135 100644
> > > --- a/kernel/rcu/tasks.h
> > > +++ b/kernel/rcu/tasks.h
> > > @@ -113,11 +113,11 @@ static void call_rcu_tasks_iw_wakeup(struct irq_work *iwp);
> > >  #define DEFINE_RCU_TASKS(rt_name, gp, call, n)						\
> > >  static DEFINE_PER_CPU(struct rcu_tasks_percpu, rt_name ## __percpu) = {			\
> > >  	.lock = __RAW_SPIN_LOCK_UNLOCKED(rt_name ## __percpu.cbs_pcpu_lock),		\
> > > -	.rtp_irq_work = IRQ_WORK_INIT(call_rcu_tasks_iw_wakeup),			\
> > > +	.rtp_irq_work = IRQ_WORK_INIT_HARD(call_rcu_tasks_iw_wakeup),			\
> > 
> > This is the key piece, right? 
> 
> Correct. Possible due to the removal of wait_queue_head (plus there is
> nothing else that would stand in the way otherwise).

Very good!

If the overhead becomes a problem in mainline, presumably we can do
a compile-time switch based on CONFIG_PREEMPT_RT or similar.
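
For example (sketch only, the macro name is made up; not proposing this
exact form):

	/* Hypothetical compile-time switch for the initializer: */
	#ifdef CONFIG_PREEMPT_RT
	#define RTP_IRQ_WORK_INIT(f)	IRQ_WORK_INIT_HARD(f)
	#else
	#define RTP_IRQ_WORK_INIT(f)	IRQ_WORK_INIT(f)
	#endif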

> > If so, I am wondering if there are other
> > irq_work instances that need to move to _HARD.
> 
> We already have:
> kernel/rcu/tree.c:      rdp->rcu_iw = IRQ_WORK_INIT_HARD(rcu_iw_handler);

Because you prompted me to change it from IRQ_WORK_INIT() some time
back?  ;-)

							Thanx, Paul


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-05  5:00             ` Paul E. McKenney
@ 2022-03-07 17:54               ` Sebastian Andrzej Siewior
  2022-03-07 21:53                 ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-07 17:54 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-04 21:00:25 [-0800], Paul E. McKenney wrote:
> > During SYSTEM_BOOTING we could do softirqs right away but we lack the
> > infrastructure. Starting with SYSTEM_SCHEDULING we rely on the thread so
> > it needs to be spawned earlier. The problem with SYSTEM_SCHEDULING+ is
> > that we may deadlock if the softirqs are performed in IRQ-context.
> 
> Understood.  My goal is to prevent RCU from being yet another odd
> constraint that people writing boot-time code need to worry about.
> Or at least no additional odd constraints than the ones that it already
> presents.  :-/

Do we have that many people doing boot-time code before
early_initcall()?

> > > This might seem a bit utopian or even unreasonable, but please keep in
> > > mind that both the scheduler and the idle loop use RCU.
> > 
> > But the problem is only the usage of synchronize_rcu().
> 
> And synchronize_rcu_expedited(), but yes in that call_rcu() and so
> on still work.

That should be the majority of the users.

> >                                                         So
> > rcu_read_lock() and call_rcu() work. Only synchronize_rcu() does not.
> > Couldn't we make a rule to use it at the earliest within early_initcall()?
> 
> Of course we could make such a rule.
> 
> And sometimes, people running into problems with that rule might be able
> to move their code earlier or later and avoid problems.  But other times
> they have to do something else.  Which will sometimes mean that we are
> asking them to re-implement some odd special case of RCU within their
> own subsystem, which just does not sound like a good idea.
> 
> In fact, my experience indicates that it is way easier to make RCU work
> more globally than to work through all the issues stemming from these
> sorts of limits on RCU users.  Takes less time, also.
> 
> And it probably is not all -that- hard.

We had one user _that_ early and he moved away. People might
misunderstand things or optimize for something not really needed. If
this is needed _before_ early_initcall() we could still move it right
after the scheduler is initialized. I would just prefer not to optimize
for things that might never be needed.
For instance flush_workqueue() is made "working" a few functions
earlier (before the RCU selftest). You could enqueue work items earlier;
they would just wait until workqueue_init().
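
Roughly like this (sketch only, names made up; this assumes
workqueue_init_early() has already run so the queues themselves exist):

	#include <linux/workqueue.h>

	static void my_work_fn(struct work_struct *work)
	{
		/* Runs only once workqueue_init() has created the workers. */
	}
	static DECLARE_WORK(my_work, my_work_fn);

	static void enqueue_early(void)
	{
		/* Legal between workqueue_init_early() and workqueue_init():
		 * the item just sits on the queue until the workers are up.
		 */
		schedule_work(&my_work);
	}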

> > > However, that swait_event_timeout_exclusive() doesn't need exact timing
> > > during boot.  Anything that let other tasks run for at least a few tens
> > > of microseconds (up to say a millisecond) could easily work just fine.
> > > Is there any such thing in RT?
> > 
> > swait_event_timeout_exclusive() appears not to be the culprit. It is
> > invoked a few times (with a 6.5ms timeout) but returns without setting
> > up a timer. So either my setup avoids the timer or this happens always
> > and is not related to my config.
> 
> Now that you mention it, yes.  There is only one CPU, so unless you have
> an odd series of preemptions, it quickly figures out that it does not
> need to wait.  But that odd series of preemptions really is out there,
> patiently waiting for us to lose context on this code.

Correct, verified. But this means that a task within a rcu_read_lock()
section gets preempted for > 26 seconds before that timer fires. Such a
delay during boot implies that something went wrong, while it might
happen at run-time under "normal" circumstances. So I wouldn't try to
get this case covered.

> > rcu_tasks_wait_gp() does schedule_timeout_idle() and this is the one
> > that blocks. This could be replaced with schedule_hrtimeout() (just
> > tested). I hate the idea to use a precise delay in a timeout like
> > situation. But we could use schedule_hrtimeout_range() with a HZ delta
> > so it kind of feels like the timer_list timer ;)
> 
> If schedule_hrtimeout_range() works, I am good with it.
> And you are right, precision is not required here.  And maybe
> schedule_hrtimeout_range() could also be used to create a crude
> boot-time-only polling loop for the swait_event_timeout_exclusive()?

I made something to cover the schedule_hrtimeout_range(). I wouldn't
bother with swait_event_timeout_exclusive() due to the large timeout
_and_ the fact that we are booting up.

> > swait_event_timeout_exclusive() appears innocent.
> 
> I agree that it would rarely need to block, but if the task executing the
> synchronize_rcu() preempted one of the readers, wouldn't it have to block?
> Or am I missing some code path that excludes that possibility?

As explained above, it means a ~20sec delay during bootup which I don't
see happening. Once ksoftirqd is up, it is covered.
Also: if _more_ users require a timer to expire so that the system can
continue to boot, I am willing to investigate _why_ this is needed,
because it does delay the boot-up progress of the system.

> > > These would be conditioned on IS_ENABLED(CONFIG_PREEMPT_RT).
> > > 
> > > But now you are going to tell me that wakeups cannot be done from the
> > > scheduler tick interrupt handler?  If that is the case, are there other
> > > approaches?
> > 
> > If you go by my irqwork patch then I think we are down to:
> > - spawn ksoftirqd early
> > - use schedule_hrtimeout() during boot or the whole time (no idea how
> >   often this triggers).
> 
> The boot-time schedule_hrtimeout_range() seems to cover things, especially
> given that most of the time there would be no need to block.  Or is
> there yet another gap where schedule_hrtimeout_range() does not work?
> (After the scheduler starts.)

The patch below covers it. This works once the system has a working
timer which aligns with !RT.
I've been testing this and understand that tracing is using it. I didn't
manage to trigger it after boot so I assume here that the user can't
easily trigger that timer _very_ often.

-------->8-----


From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Mon, 7 Mar 2022 17:08:23 +0100
Subject: [PATCH] rcu-tasks: Use schedule_hrtimeout_range() while waiting for
 the gp.

The RCU selftest is using schedule_timeout_idle(), which fails on
PREEMPT_RT because it is used early in the boot-up phase, at which point
ksoftirqd is not yet ready and is required for the timer to expire.

To avoid this lockup, use schedule_hrtimeout() and let the timer expire
in hardirq context. This ensures that the timer fires even on
PREEMPT_RT without any further requirement.

The timer is set to expire between fract and fract + HZ / 2 jiffies in
order to minimize the number of extra wake-ups and to align with
possible other timers which expire within this window.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/rcu/tasks.h | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index f804afb304135..e99f9e61cc7a3 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -630,12 +630,15 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	while (!list_empty(&holdouts)) {
 		bool firstreport;
 		bool needreport;
+		ktime_t exp;
 		int rtst;
 
 		/* Slowly back off waiting for holdouts */
 		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
-		schedule_timeout_idle(fract);
-
+		exp = jiffies_to_nsecs(fract);
+		__set_current_state(TASK_IDLE);
+		schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2),
+					 HRTIMER_MODE_REL_HARD);
 		if (fract < HZ)
 			fract++;
 
-- 
2.35.1



* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-07 15:09                 ` Paul E. McKenney
@ 2022-03-07 17:57                   ` Sebastian Andrzej Siewior
  2022-03-07 21:54                     ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-07 17:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-07 07:09:13 [-0800], Paul E. McKenney wrote:
> > Correct. Possible due to the removal of wait_queue_head (plus there is
> > nothing else that would stand in the way otherwise).
> 
> Very good!
> 
> If the overhead becomes a problem in mainline, presumably we can do
> a compile-time switch based on CONFIG_PREEMPT_RT or similar.

rcuwait is a lot cheaper than a waitqueue in terms of the code, data,
and locks involved. The _HARD flag is a nop on !RT.
So it should be a win-win situation here.
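
For reference, the whole pattern boils down to something like this
(sketch only, names made up; note that rcuwait supports a single
waiter):

	#include <linux/rcuwait.h>

	static struct rcuwait my_wait = __RCUWAIT_INITIALIZER(my_wait);
	static bool done;

	/* Waiter side, e.g. a kthread: */
	static void wait_side(void)
	{
		rcuwait_wait_event(&my_wait, READ_ONCE(done), TASK_IDLE);
	}

	/* Waker side, fine even from hard-irq context: */
	static void wake_side(void)
	{
		WRITE_ONCE(done, true);
		rcuwait_wake_up(&my_wait);
	}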

> > > If so, I am wondering if there are other
> > > irq_work instances that need to move to _HARD.
> > 
> > We already have:
> > kernel/rcu/tree.c:      rdp->rcu_iw = IRQ_WORK_INIT_HARD(rcu_iw_handler);
> 
> Because you prompted me to change it from IRQ_WORK_INIT() some time
> back?  ;-)

Right. We have printk(), for instance, using the !HARD irq-work, so this
is mostly a case-by-case decision, usually avoiding HARD.

> 							Thanx, Paul

Sebastian

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-07 17:54               ` Sebastian Andrzej Siewior
@ 2022-03-07 21:53                 ` Paul E. McKenney
  2022-03-08 11:29                   ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-07 21:53 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Mon, Mar 07, 2022 at 06:54:09PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-04 21:00:25 [-0800], Paul E. McKenney wrote:
> > > During SYSTEM_BOOTING we could do softirqs right away but we lack the
> > > infrastructure. Starting with SYSTEM_SCHEDULING we rely on the thread so
> > > it needs to be spawned earlier. The problem with SYSTEM_SCHEDULING+ is
> > > that we may deadlock if the softirqs are performed in IRQ-context.
> > 
> > Understood.  My goal is to prevent RCU from being yet another odd
> > constraint that people writing boot-time code need to worry about.
> > Or at least no additional odd constraints than the ones that it already
> > presents.  :-/
> 
> Do we have that many people doing boot-time code before
> early_initcall()?

It only takes one with an urgent and valid use case who happens to pop
up at a time that is highly inconvenient to all of us.

> > > > This might seem a bit utopian or even unreasonable, but please keep in
> > > > mind that both the scheduler and the idle loop use RCU.
> > > 
> > > But the problem is only the usage of synchronize_rcu().
> > 
> > And synchronize_rcu_expedited(), but yes in that call_rcu() and so
> > on still work.
> 
> That should be the majority of the users.

Again, it only takes one.

> > >                                                         So
> > > rcu_read_lock() and call_rcu() work. Only synchronize_rcu() does not.
> > > Couldn't we make a rule to use it at the earliest within early_initcall()?
> > 
> > Of course we could make such a rule.
> > 
> > And sometimes, people running into problems with that rule might be able
> > to move their code earlier or later and avoid problems.  But other times
> > they have to do something else.  Which will sometimes mean that we are
> > asking them to re-implement some odd special case of RCU within their
> > own subsystem, which just does not sound like a good idea.
> > 
> > In fact, my experience indicates that it is way easier to make RCU work
> > more globally than to work through all the issues stemming from these
> > sorts of limits on RCU users.  Takes less time, also.
> > 
> > And it probably is not all -that- hard.
> 
> We had one user _that_ early and he moved away. People might
> misunderstand things or optimize for something not really needed. If
> this is needed _before_ early_initcall() we could still move it right
> after the scheduler is initialized. I would just prefer not to optimize
> for things that might never be needed.
> For instance flush_workqueue() is made "working" a few functions
> earlier (before the RCU selftest). You could enqueue work items earlier;
> they would just wait until workqueue_init().

Or, like expedited grace periods currently do, use the current thread
instead of using workqueues during that time.

> > > > However, that swait_event_timeout_exclusive() doesn't need exact timing
> > > > during boot.  Anything that let other tasks run for at least a few tens
> > > > of microseconds (up to say a millisecond) could easily work just fine.
> > > > Is there any such thing in RT?
> > > 
> > > swait_event_timeout_exclusive() appears not to be the culprit. It is
> > > invoked a few times (with a 6.5ms timeout) but returns without setting
> > > up a timer. So either my setup avoids the timer or this happens always
> > > and is not related to my config.
> > 
> > Now that you mention it, yes.  There is only one CPU, so unless you have
> > an odd series of preemptions, it quickly figures out that it does not
> > need to wait.  But that odd series of preemptions really is out there,
> > patiently waiting for us to lose context on this code.
> 
> Correct, verified. But this means that a task within a rcu_read_lock()
> section gets preempted for > 26 seconds before that timer fires. Such a
> delay during boot implies that something went wrong, while it might
> happen at run-time under "normal" circumstances. So I wouldn't try to
> get this case covered.

The point being that there is no problem with the timer being queued,
just that it won't go off until later in the boot process.  Which in
turn means that all that is lost is the RCU CPU stall warning should one
be needed.  Is that correct, or am I missing a turn in there somewhere?

> > > rcu_tasks_wait_gp() does schedule_timeout_idle() and this is the one
> > > that blocks. This could be replaced with schedule_hrtimeout() (just
> > > tested). I hate the idea to use a precise delay in a timeout like
> > > situation. But we could use schedule_hrtimeout_range() with a HZ delta
> > > so it kind of feels like the timer_list timer ;)
> > 
> > If schedule_hrtimeout_range() works, I am good with it.
> > And you are right, precision is not required here.  And maybe
> > schedule_hrtimeout_range() could also be used to create a crude
> > boot-time-only polling loop for the swait_event_timeout_exclusive()?
> 
> I made something to cover the schedule_hrtimeout_range(). I wouldn't
> bother with swait_event_timeout_exclusive() due to the large timeout
> _and_ the fact that we are booting up.

I welcome the schedule_hrtimeout_range() patch, thank you!

> > > swait_event_timeout_exclusive() appears innocent.
> > 
> > I agree that it would rarely need to block, but if the task executing the
> > synchronize_rcu() preempted one of the readers, wouldn't it have to block?
> > Or am I missing some code path that excludes that possibility?
> 
> As explained above, it means a ~20sec delay during bootup which I don't
> see happening. Once ksoftirqd is up, it is covered.
> Also: if _more_ users require a timer to expire so that the system can
> continue to boot, I am willing to investigate _why_ this is needed,
> because it does delay the boot-up progress of the system.

Just to double-check -- in RT, ksoftirqd is up and running before the boot
sequence reaches the call to rcu_end_inkernel_boot() from kernel_init()?

Otherwise, there is also a one-jiffy wait in play, not just a 20+-second
delay.

> > > > These would be conditioned on IS_ENABLED(CONFIG_PREEMPT_RT).
> > > > 
> > > > But now you are going to tell me that wakeups cannot be done from the
> > > > scheduler tick interrupt handler?  If that is the case, are there other
> > > > approaches?
> > > 
> > > If you go by my irqwork patch then I think we are down to:
> > > - spawn ksoftirqd early
> > > - use schedule_hrtimeout() during boot or the whole time (no idea how
> > >   often this triggers).
> > 
> > The boot-time schedule_hrtimeout_range() seems to cover things, especially
> > given that most of the time there would be no need to block.  Or is
> > there yet another gap where schedule_hrtimeout_range() does not work?
> > (After the scheduler starts.)
> 
> The patch below covers it. This works once the system has a working
> timer which aligns with !RT.
> I've been testing this and understand that tracing is using it. I didn't
> manage to trigger it after boot so I assume here that the user can't
> easily trigger that timer _very_ often.

This looks good to me, independent of the other discussion.  From what
I can see, this code path is infrequent, so using hrtimer all the time
should not be a problem.

Unless there are objections, I will pull this one in.

							Thanx, Paul

> -------->8-----
> 
> 
> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date: Mon, 7 Mar 2022 17:08:23 +0100
> Subject: [PATCH] rcu-tasks: Use schedule_hrtimeout_range() while waiting for
>  the gp.
> 
> The RCU selftest is using schedule_timeout_idle(), which fails on
> PREEMPT_RT because it is used early in the boot-up phase, at which point
> ksoftirqd is not yet ready and is required for the timer to expire.
> 
> To avoid this lockup, use schedule_hrtimeout() and let the timer expire
> in hardirq context. This ensures that the timer fires even on
> PREEMPT_RT without any further requirement.
> 
> The timer is set to expire between fract and fract + HZ / 2 jiffies in
> order to minimize the number of extra wake-ups and to align with
> possible other timers which expire within this window.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/rcu/tasks.h | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
> index f804afb304135..e99f9e61cc7a3 100644
> --- a/kernel/rcu/tasks.h
> +++ b/kernel/rcu/tasks.h
> @@ -630,12 +630,15 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
>  	while (!list_empty(&holdouts)) {
>  		bool firstreport;
>  		bool needreport;
> +		ktime_t exp;
>  		int rtst;
>  
>  		/* Slowly back off waiting for holdouts */
>  		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
> -		schedule_timeout_idle(fract);
> -
> +		exp = jiffies_to_nsecs(fract);
> +		__set_current_state(TASK_IDLE);
> +		schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2),
> +					 HRTIMER_MODE_REL_HARD);
>  		if (fract < HZ)
>  			fract++;
>  
> -- 
> 2.35.1
> 


* Re: [PATCH] rcu-tasks: Use rcuwait for the rcu_tasks_kthread().
  2022-03-07 17:57                   ` Sebastian Andrzej Siewior
@ 2022-03-07 21:54                     ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-07 21:54 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Mon, Mar 07, 2022 at 06:57:59PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-07 07:09:13 [-0800], Paul E. McKenney wrote:
> > > Correct. Possible due to the removal of wait_queue_head (plus there is
> > > nothing else that would stand in the way otherwise).
> > 
> > Very good!
> > 
> > If the overhead becomes a problem in mainline, presumably we can do
> > a compile-time switch based on CONFIG_PREEMPT_RT or similar.
> 
> rcuwait is a lot cheaper than a waitqueue in terms of the code, data,
> and locks involved. The _HARD flag is a nop on !RT.
> So it should be a win-win situation here.

Even better!

> > > > If so, I am wondering if there are other
> > > > irq_work instances that need to move to _HARD.
> > > 
> > > We already have:
> > > kernel/rcu/tree.c:      rdp->rcu_iw = IRQ_WORK_INIT_HARD(rcu_iw_handler);
> > 
> > Because you prompted me to change it from IRQ_WORK_INIT() some time
> > back?  ;-)
> 
> Right. We have printk(), for instance, using the !HARD irq-work, so this
> is mostly a case-by-case decision, usually avoiding HARD.

Sounds about right.  ;-)

							Thanx, Paul


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-07 21:53                 ` Paul E. McKenney
@ 2022-03-08 11:29                   ` Sebastian Andrzej Siewior
  2022-03-08 18:10                     ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-08 11:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-07 13:53:27 [-0800], Paul E. McKenney wrote:
> > Correct, verified. But this means that a task within a rcu_read_lock()
> > section gets preempted for > 26 seconds before that timer fires. Such a
> > delay during boot implies that something went wrong, while it might
> > happen at run-time under "normal" circumstances. So I wouldn't try to
> > get this case covered.
> 
> The point being that there is no problem with the timer being queued,
> just that it won't go off until later in the boot process.  Which in
> turn means that all that is lost is the RCU CPU stall warning should one
> be needed.  Is that correct, or am I missing a turn in there somewhere?

Correct. Before do_pre_smp_initcalls() there is no ksoftirqd so timers
won't be processed. Therefore even if the 26sec timer triggers, there
will be no RCU-stall warning. Also I'm not sure how much preemption
can happen at this point since there is not much going on.
Any delay at this point points to a lockup, and at this point the lockup
detector isn't even fully operational (it does not trigger here during
testing).

> > > > swait_event_timeout_exclusive() appears innocent.
> > > 
> > > I agree that it would rarely need to block, but if the task executing the
> > > synchronize_rcu() preempted one of the readers, wouldn't it have to block?
> > > Or am I missing some code path that excludes that possibility?
> > 
> > As explained above, it means a ~20sec delay during bootup which I don't
> > see happening. Once ksoftirqd is up, it is covered.
> > Also: if _more_ users require a timer to expire so that the system can
> > continue to boot, I am willing to investigate _why_ this is needed,
> > because it does delay the boot-up progress of the system.
> 
> Just to double-check -- in RT, ksoftirqd is up and running before the boot
> sequence reaches the call to rcu_end_inkernel_boot() from kernel_init()?

Yes, this is the case.
rcu_end_inkernel_boot() is really at the end of the boot process; we
even reached SYSTEM_RUNNING ;)

> This looks good to me, independent of the other discussion.  From what
> I can see, this code path is infrequent, so using hrtimer all the time
> should not be a problem.

Okay.

> Unless there are objections, I will pull this one in.

Good. Fewer patches for me to carry around.
Thank you.

> 							Thanx, Paul
> 

Sebastian


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-08 11:29                   ` Sebastian Andrzej Siewior
@ 2022-03-08 18:10                     ` Paul E. McKenney
  2022-03-09 10:19                       ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-08 18:10 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Tue, Mar 08, 2022 at 12:29:47PM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-07 13:53:27 [-0800], Paul E. McKenney wrote:
> > > Correct, verified. But this means that a task within a rcu_read_lock()
> > > section gets preempted for > 26 seconds before that timer fires. Such a
> > > delay during boot implies that something went wrong, while it might
> > > happen at run-time under "normal" circumstances. So I wouldn't try to
> > > get this case covered.
> > 
> > The point being that there is no problem with the timer being queued,
> > just that it won't go off until later in the boot process.  Which in
> > turn means that all that is lost is the RCU CPU stall warning should one
> > be needed.  Is that correct, or am I missing a turn in there somewhere?
> 
> Correct. Before do_pre_smp_initcalls() there is no ksoftirqd so timers
> won't be processed. Therefore even if the 26sec timer triggers, there
> will be no RCU-stall warning. Also I'm not sure how much preemption
> can happen at this point since there is not much going on.
> Any delay at this point points to a lockup, and at this point the lockup
> detector isn't even fully operational (it does not trigger here during
> testing).

Is sysrq operational at this point in RT?  If so, that can be used
to diagnose the hang, at least in development environments.  If the
need to use sysrq becomes a problem, checks/wakeups can be added to
rcu_sched_clock_irq().
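
Very roughly, and purely for illustration (rcu_boot_poll_needed() is
made up here, and the real rcu_sched_clock_irq() of course does much
more):

	/* Hypothetical addition to the scheduler-tick path: */
	void rcu_sched_clock_irq(int user)
	{
		/* ... existing tick-time RCU processing ... */

		/* Boot-time fallback while timer handlers cannot run: */
		if (IS_ENABLED(CONFIG_PREEMPT_RT) && rcu_boot_poll_needed())
			rcu_gp_kthread_wake();	/* nudge the GP kthread */
	}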

> > > > > swait_event_timeout_exclusive() appears innocent.
> > > > 
> > > > I agree that it would rarely need to block, but if the task executing the
> > > > synchronize_rcu() preempted one of the readers, wouldn't it have to block?
> > > > Or am I missing some code path that excludes that possibility?
> > > 
> > > As explained above, it means a ~20sec delay during bootup which I don't
> > > see happening. Once ksoftirqd is up, it is covered.
> > > Also: if _more_ users require a timer to expire so that the system can
> > > continue to boot, I am willing to investigate _why_ this is needed,
> > > because it does delay the boot-up progress of the system.
> > 
> > Just to double-check -- in RT, ksoftirqd is up and running before the boot
> > sequence reaches the call to rcu_end_inkernel_boot() from kernel_init()?
> 
> Yes, this is the case.
> rcu_end_inkernel_boot() is really at the end of the boot process; we
> even reached SYSTEM_RUNNING ;)

Whew!  ;-)

> > This looks good to me, independent of the other discussion.  From what
> > I can see, this code path is infrequent, so using hrtimer all the time
> > should not be a problem.
> 
> Okay.
> 
> > Unless there are objections, I will pull this one in.
> 
> Good. Fewer patches for me to carry around.
> Thank you.

And here is what I ended up hand-applying, presumably due to differences
between -rt and -rcu.  I also expanded the changelog a bit.  Could you
please check to make sure that I didn't mess something up?

Oh, and I also couldn't resist taking advantage of 100 columns...
What can I say?

							Thanx, Paul

------------------------------------------------------------------------

commit 2895b3bb5f8a0ebe565c62b1d2e3e1efca669962
Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date:   Tue Mar 8 09:54:13 2022 -0800

    rcu-tasks: Use schedule_hrtimeout_range() to wait for grace periods
    
    The synchronous RCU-tasks grace-period-wait primitives invoke
    schedule_timeout_idle() to give readers a chance to exit their
    read-side critical sections.  Unfortunately, this fails during early
    boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run
    timer handlers.  Because ksoftirqd cannot operate until its kthreads
    are spawned, there is a brief period of time following scheduler
    initialization where PREEMPT_RT cannot run the timer handlers that
    schedule_timeout_idle() relies on, resulting in a hang.
    
    To avoid this boot-time hang, this commit replaces schedule_timeout_idle()
    with schedule_hrtimeout(), so that the timer expires in hardirq context.
    This ensures that the timer fires even on PREEMPT_RT throughout the
    irqs-enabled portions of boot as well as during runtime.
    
    The timer is set to expire between fract and fract + HZ / 2 jiffies in
    order to align with any other timers that might expire during that time,
    thus reducing the number of wakeups.
    
    Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

diff --git a/kernel/rcu/tasks.h b/kernel/rcu/tasks.h
index 79eb280b1cd2a..bbad6b3376947 100644
--- a/kernel/rcu/tasks.h
+++ b/kernel/rcu/tasks.h
@@ -652,13 +652,16 @@ static void rcu_tasks_wait_gp(struct rcu_tasks *rtp)
 	fract = rtp->init_fract;
 
 	while (!list_empty(&holdouts)) {
+		ktime_t exp;
 		bool firstreport;
 		bool needreport;
 		int rtst;
 
 		// Slowly back off waiting for holdouts
 		set_tasks_gp_state(rtp, RTGS_WAIT_SCAN_HOLDOUTS);
-		schedule_timeout_idle(fract);
+		exp = jiffies_to_nsecs(fract);
+		__set_current_state(TASK_IDLE);
+		schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2), HRTIMER_MODE_REL_HARD);
 
 		if (fract < HZ)
 			fract++;


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-08 18:10                     ` Paul E. McKenney
@ 2022-03-09 10:19                       ` Sebastian Andrzej Siewior
  2022-03-09 18:37                         ` Paul E. McKenney
  0 siblings, 1 reply; 23+ messages in thread
From: Sebastian Andrzej Siewior @ 2022-03-09 10:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On 2022-03-08 10:10:40 [-0800], Paul E. McKenney wrote:
> > Correct. Before do_pre_smp_initcalls() there is no ksoftirqd so timers
> > won't be processed. Therefore even if the 26sec timer triggers, there
> > will be no RCU-stall warning. Also I'm not sure how much preemption
> > can happen at this point since there is not much going on.
> > Any delay at this point points to a lockup, and at this point the lockup
> > detector isn't even fully operational (it does not trigger here during
> > testing).
> 
> Is sysrq operational at this point in RT?  If so, that can be used
> to diagnose the hang, at least in development environments.  If the
> need to use sysrq becomes a problem, checks/wakeups can be added to
> rcu_sched_clock_irq().

I doubt it (including on !RT). The UART driver isn't loaded yet, and any
kind of output is provided by an early_printk implementation, which is
output only, not input.

> And here is what I ended up hand-applying, presumably due to differences
> between -rt and -rcu.  I also expanded the changelog a bit.  Could you
> please check to make sure that I didn't mess something up?

Nope, good, I added something to the description.

> Oh, and I also couldn't resist taking advantage of 100 columns...
> What can I say?

You may not believe it, but since last week I am indeed the owner of a
wide screen. So I do enjoy the 100 columns. Also, I'm still trying to
figure out what to do with the other half of the screen ;)

> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 2895b3bb5f8a0ebe565c62b1d2e3e1efca669962
> Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Date:   Tue Mar 8 09:54:13 2022 -0800
> 
>     rcu-tasks: Use schedule_hrtimeout_range() to wait for grace periods
>     
>     The synchronous RCU-tasks grace-period-wait primitives invoke
>     schedule_timeout_idle() to give readers a chance to exit their
>     read-side critical sections.  Unfortunately, this fails during early
>     boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run
>     timer handlers.  Because ksoftirqd cannot operate until its kthreads
>     are spawned, there is a brief period of time following scheduler
>     initialization where PREEMPT_RT cannot run the timer handlers that
>     schedule_timeout_idle() relies on, resulting in a hang.

Could we add something like:
   The delay here is used infrequently, so it can be made to expire in
   hard-IRQ context on PREEMPT_RT without increasing the overall system
   latency.

My concern here is that this (the timer does not expire during boot, so
let's make it raw) should not be used as a general example of how things
should be handled in the future.
I only managed to trigger this delay via the dynamic ftrace interface/
samples, so I'm comfortable with this.
Moreover, I don't think that this timer, if batched with another one,
will increase the latency substantially, given that we trade here the
wakeup of ksoftirqd for the wakeup of the actual thread, and that one
wakeup shouldn't make much of a difference. The difference is that with
ksoftirqd, if we expire multiple timers then each timer is expired with
interrupts enabled - preemptible.
That is why I favoured spawning ksoftirqd earlier, but I think this
works, too.

>     To avoid this boot-time hang, this commit replaces schedule_timeout_idle()
>     with schedule_hrtimeout(), so that the timer expires in hardirq context.
> >     This ensures that the timer fires even on PREEMPT_RT throughout the
>     irqs-enabled portions of boot as well as during runtime.
>     
>     The timer is set to expire between fract and fract + HZ / 2 jiffies in
>     order to align with any other timers that might expire during that time,
>     thus reducing the number of wakeups.
>     
>     Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>

Sebastian


* Re: [PATCH] rcu: Delay the RCU-selftests during boot.
  2022-03-09 10:19                       ` Sebastian Andrzej Siewior
@ 2022-03-09 18:37                         ` Paul E. McKenney
  0 siblings, 0 replies; 23+ messages in thread
From: Paul E. McKenney @ 2022-03-09 18:37 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: rcu, Joel Fernandes, Josh Triplett, Lai Jiangshan,
	Mathieu Desnoyers, Steven Rostedt, Thomas Gleixner

On Wed, Mar 09, 2022 at 11:19:00AM +0100, Sebastian Andrzej Siewior wrote:
> On 2022-03-08 10:10:40 [-0800], Paul E. McKenney wrote:
> > > Correct. Before do_pre_smp_initcalls() there is no ksoftirqd so timers
> > > won't be processed. Therefore even if the 26sec timer triggers, there
> > > will be no RCU-stall warning. Also I'm not sure how much preemption
> > > can happen at this point since there is not much going on.
> > > Any delay at this point points to a lockup, and at this point the lockup
> > > detector isn't even fully operational (it does not trigger here during
> > > testing).
> > 
> > Is sysrq operational at this point in RT?  If so, that can be used
> > to diagnose the hang, at least in development environments.  If the
> > need to use sysrq becomes a problem, checks/wakeups can be added to
> > rcu_sched_clock_irq().
> 
> I doubt it (including on !RT). The UART driver isn't loaded yet, and any
> kind of output is provided by an early_printk implementation, which is
> output only, not input.

OK, so rcu_sched_clock_irq() it is, should problems arise.  Or a boot-time
switch between timer and hrtimer, but that could get a bit ugly.
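
Just to illustrate the ugliness, such a switch in rcu_tasks_wait_gp()
might look like this (sketch only; rcu_boot_timers_working() is a
made-up predicate that would have to flip once ksoftirqd is up):

	if (!IS_ENABLED(CONFIG_PREEMPT_RT) || rcu_boot_timers_working()) {
		schedule_timeout_idle(fract);
	} else {
		ktime_t exp = jiffies_to_nsecs(fract);

		__set_current_state(TASK_IDLE);
		schedule_hrtimeout_range(&exp, jiffies_to_nsecs(HZ / 2),
					 HRTIMER_MODE_REL_HARD);
	}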

> > And here is what I ended up hand-applying, presumably due to differences
> > between -rt and -rcu.  I also expanded the changelog a bit.  Could you
> > please check to make sure that I didn't mess something up?
> 
> Nope, good, I added something to the description.
> 
> > Oh, and I also couldn't resist taking advantage of 100 columns...
> > What can I say?
> 
> You may not believe it, but since last week I am indeed the owner of a
> wide screen. So I do enjoy the 100 columns. Also, I'm still trying to
> figure out what to do with the other half of the screen ;)

Very good!

And I am sure that you have seen what I do with the extra screen
real estate.  ;-)

							Thanx, Paul

> > ------------------------------------------------------------------------
> > 
> > commit 2895b3bb5f8a0ebe565c62b1d2e3e1efca669962
> > Author: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Date:   Tue Mar 8 09:54:13 2022 -0800
> > 
> >     rcu-tasks: Use schedule_hrtimeout_range() to wait for grace periods
> >     
> >     The synchronous RCU-tasks grace-period-wait primitives invoke
> >     schedule_timeout_idle() to give readers a chance to exit their
> >     read-side critical sections.  Unfortunately, this fails during early
> >     boot on PREEMPT_RT because PREEMPT_RT relies solely on ksoftirqd to run
> >     timer handlers.  Because ksoftirqd cannot operate until its kthreads
> >     are spawned, there is a brief period of time following scheduler
> >     initialization where PREEMPT_RT cannot run the timer handlers that
> >     schedule_timeout_idle() relies on, resulting in a hang.
> 
> Could we add something like:
>    The delay here is used infrequently, so it can be made to expire in
>    hard-IRQ context on PREEMPT_RT without increasing the overall system
>    latency.
> 
> My concern here is that this (the timer does not expire during boot, so
> let's make it raw) should not be used as a general example of how things
> should be handled in the future.
> I only managed to trigger this delay via the dynamic ftrace interface/
> samples, so I'm comfortable with this.
> Moreover, I don't think that this timer, if batched with another one,
> will increase the latency substantially, given that we trade here the
> wakeup of ksoftirqd for the wakeup of the actual thread, and that one
> wakeup shouldn't make much of a difference. The difference is that with
> ksoftirqd, if we expire multiple timers then each timer is expired with
> interrupts enabled - preemptible.
> That is why I favoured spawning ksoftirqd earlier, but I think this
> works, too.

Good points.

What I did was to make the addition shown below.

> >     To avoid this boot-time hang, this commit replaces schedule_timeout_idle()
> >     with schedule_hrtimeout(), so that the timer expires in hardirq context.
> >     This ensures that the timer fires even on PREEMPT_RT throughout the
> >     irqs-enabled portions of boot as well as during runtime.
> >     
> >     The timer is set to expire between fract and fract + HZ / 2 jiffies in
> >     order to align with any other timers that might expire during that time,
> >     thus reducing the number of wakeups.

	Note that RCU-tasks grace periods are infrequent, so the use
	of hrtimer should be fine.  In contrast, in common-case code,
	use of hrtimer could result in performance issues.

Does that work?

And of course if RCU-tasks grace periods do become more frequent, the
next step would be to use hrtimers at boot time and timers later on.
But one step at a time.

I also added Martin Lau and Andrii Nakryiko on CC since they have
come across the occasional RCU-tasks performance issue.

							Thanx, Paul

> >     Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> >     Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
> 
> Sebastian
