linux-kernel.vger.kernel.org archive mirror
* Re: Alleged fix for writer stall on -rcu branch dev
       [not found]       ` <20190410203302.GT14111@linux.ibm.com>
@ 2019-04-11  0:06         ` Paul E. McKenney
  2019-04-13 14:21           ` Paul E. McKenney
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-11  0:06 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel

On Wed, Apr 10, 2019 at 01:33:02PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 10, 2019 at 09:19:18PM +0200, Sebastian Andrzej Siewior wrote:
> > On 2019-04-10 11:41:05 [-0700], Paul E. McKenney wrote:
> > > On Wed, Apr 10, 2019 at 08:15:38PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2019-04-09 15:14:26 [-0700], Paul E. McKenney wrote:
> > > > > Hello, Sebastian,
> > > > Hi Paul,
> > > > 
> > > > > A few commits later, I finally have something that might work.  ;-)
> > > > 
> > > > Okay. I started with 33e04e4512797b5e0242f452d0027b096d43d006. The first
> > > > batch of 18 completed and this
> > > > 	https://breakpoint.cc/console-tree1.2.log
> > > 
> > > Another new one on me!  ;-)
> > > 
> > > It looks like CPU 11 got stuck in the cpu_stopper_thread trying to
> > > offline CPU 6.  Unfortunately, CPU 6 is not in the list receiving an
> > > NMI, perhaps because it looked like it was already offline.  It does
> > > eventually finish going offline quite some time later.
> > > 
> > > Perhaps my forward-progress testing is interacting badly with CPU
> > > hotplug, but only very occasionally?
> > 
> > Somehow, yes.
> > 
> > [ 8433.835292] Unregister pv shared memory for cpu 6
> > [ 8433.864122] smpboot: CPU 6 is now offline
> > 
> > CPU6 is offline.
> > 
> > [ 8434.934765] smpboot: Booting Node 0 Processor 6 APIC 0x6
> > [ 8434.950632] kvm-clock: cpu 6, msr 17155a181, secondary cpu clock
> > [ 8434.989124] KVM setup async PF for cpu 6
> > [ 8434.990801] kvm-stealtime: cpu 6, msr 17b195380
> > 
> > CPU6 is booting again.
> > 
> > [ 8436.035269] Unregister pv shared memory for cpu 6
> > 
> > going down again
> > 
> > [ 8462.032468] rcu: INFO: rcu_preempt self-detected stall on CPU
> > [ 8462.037385] rcu: 	11-...!: (25587 ticks this GP) idle=57e/1/0x4000000000000002 softirq=418067/418067 fqs=1 last_accelerate: 2456/89e6, Nonlazy posted: .LD
> > 
> > 25587 ticks is ~25secs with HZ=1k. And the stall occurred on CPU11,
> > correct?
> 
> Yes to both.
> 
> > [ 8525.041031] smpboot: CPU 6 is now offline
> > 
> > 63 secs later, the CPU is down. So that is too long.
> 
> Ah, I did miss the down-then-up, thank you!
> 
> > > Something for me to look at, anyway!
> > 
> > There is also
> > [ 8556.907804] WARNING: CPU: 35 PID: 833 at /home/bigeasy/linux-rt/kernel/rcu/rcutorture.c:1827 rcu_torture_fwd_prog+0x5c4/0x630
> > 
> > but I'm not sure what it is :)
> 
> The forward-progress test exceeded eight seconds, but pushed fewer
> than 100 callbacks through the system.  One way that could happen is
> if the forward-progress kthread was delayed during that eight seconds.
> Another is if grace periods were insanely fast (which they are not).
> 
> > > > I should have been alone on that server but I can't guarantee it. I try
> > > > to throw the rcu-torture at a little smaller box and check how it goes.
> > > 
> > > Even if you were not alone, I would guess that preempting a runnable
> > > process for 92 seconds would have been a visible event on the host.
> > 
> > 92 seconds?! Okay, that should have been visible on the host side. And I
> > see nothing in dmesg on the host side.
> > But why 92? 25 for RCU stall + 63 between starting to go down and
> > finishing?
> 
> Because I missed the momentary up-then-down that you spotted.  Your 63
> seconds is correct.
> 
> > > But maybe not?  The watchdog timer is still 120 seconds?  Or does your
> > > host set it to 22 seconds?
> > 
> > I didn't fiddle with it and host runs 4.19.12-1~bpo9+1. So it should be
> > the default value.
> 
> Which I believe to be 22 seconds.  Hmmm...
> 
> Either way, thank you very much for the testing!!!

[ Adding LKML back -- I think we dropped it due to an attachment? ]

One area of some concern is RCU_KTHREAD_YIELDING and the rcuc kthreads.
This shouldn't be a big deal for the intended use case (real-time
workloads with lightish load), but a two-jiffy delay could allow quite a
few callbacks to build up when doing forward-progress testing.  But not
a problem for use_softirq runs, which is what you were running.

But it might be that I need to either skip the forward-progress testing
if !use_softirq, skip the RCU_KTHREAD_YIELDING step if there are too
many callbacks, or some such.  Either way, though, I would need at least
a cond_resched() on each pass through the rcu_cpu_kthread() function.
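
For illustration, a minimal sketch of where such a per-pass cond_resched()
would sit in a kthread worker loop (the loop below is a placeholder, not
the actual rcu_cpu_kthread(), which is driven by the smpboot
infrastructure):

#include <linux/kthread.h>
#include <linux/sched.h>

/* Illustrative worker only; the real rcu_cpu_kthread() does per-CPU
 * RCU core work rather than the empty body shown here. */
static int example_cpu_kthread(void *unused)
{
	while (!kthread_should_stop()) {
		/* ... invoke one batch of RCU core work/callbacks ... */
		cond_resched();	/* yield once per pass so a callback
				 * flood cannot monopolize the CPU */
	}
	return 0;
}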

Thoughts from a real-time perspective?

Oooh...  But in both use_softirq and !use_softirq runs, if the CPU to be
offlined has a huge number of callbacks pending (assuming callbacks are
not offloaded), there could be quite the delay imposed on the offlining
of that CPU.  Now 63 seconds seems a bit much, but there can be tens
of millions of callbacks queued up to be invoked.  It -could- happen,
I suppose, though I would guess more like a couple of seconds.
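
As a rough sanity check on that guess, a tiny userspace estimate (the
per-callback cost of ~100 ns is purely an assumption; the real cost
depends on what the callbacks do):

#include <stdio.h>

int main(void)
{
	const double ns_per_cb = 100.0;	/* assumed average cost per callback */
	const double ncbs = 20e6;	/* "tens of millions" queued */

	/* 20e6 * 100 ns = 2e9 ns, so about two seconds back to back. */
	printf("~%.1f seconds to drain %.0f callbacks\n",
	       ncbs * ns_per_cb / 1e9, ncbs);
	return 0;
}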

Maybe I need to ignore the ignoring of blimit for a CPU that is going
offline...  The theory is that the callbacks will just get dumped on
some other CPU if the CPU is going away, and on the other hand invoking
callbacks sparingly would help the CPU come back all the way online
more quickly.  Unless one of the callbacks on that CPU was the one from
the synchronize_rcu() in the CPU-hotplug offline path.  :-/

So I could start going slow on callbacks on the outgoing CPU as soon as
that synchronize_rcu() completes.  Horribly fragile, given the possibility
of synchronize_rcu() in other CPU-hotplug notifiers, though.  Plus it
looks like sched_cpu_deactivate() is about the first thing that happens
when removing a CPU.

But you know...  Ignoring the world while invoking up to LONG_MAX RCU
callbacks back to back probably never was an optimal plan.  So if that
splat is at all reproducible, could you please give the patch below a try?

I have just touch-tested this on TREE01 and TREE04.  I will be testing
the rest of the scenarios overnight, Pacific Time.

Of course, if you have more failures, please let me know!

							Thanx, Paul

------------------------------------------------------------------------

commit 3245c1860cb30f85ae42c25209217bac25104faf
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Wed Apr 10 17:01:39 2019 -0700

    rcu: Set a maximum limit for back-to-back callback invocation
    
    Currently, if a CPU has more than 10,000 callbacks pending, it will
    increase rdp->blimit to LONG_MAX.  If you are lucky, LONG_MAX is only
    about two billion, but this is still a bit too many callbacks to invoke
    back-to-back while otherwise ignoring the world.
    
    This commit therefore sets a maximum limit of DEFAULT_MAX_RCU_BLIMIT,
    which is set to 10,000, for rdp->blimit.
    
    Reported-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index f53befdeecc1..afeb2751c94f 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -401,7 +401,8 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 	return __this_cpu_read(rcu_data.dynticks_nesting) == 0;
 }
 
-#define DEFAULT_RCU_BLIMIT 10     /* Maximum callbacks per rcu_do_batch. */
+#define DEFAULT_RCU_BLIMIT 10     /* Maximum callbacks per rcu_do_batch ... */
+#define DEFAULT_MAX_RCU_BLIMIT 10000 /* ... even during callback flood. */
 static long blimit = DEFAULT_RCU_BLIMIT;
 #define DEFAULT_RCU_QHIMARK 10000 /* If this many pending, ignore blimit. */
 static long qhimark = DEFAULT_RCU_QHIMARK;
@@ -2134,7 +2135,7 @@ static void rcu_do_batch(struct rcu_data *rdp)
 
 	/* Reinstate batch limit if we have worked down the excess. */
 	count = rcu_segcblist_n_cbs(&rdp->cblist);
-	if (rdp->blimit == LONG_MAX && count <= qlowmark)
+	if (rdp->blimit >= DEFAULT_MAX_RCU_BLIMIT && count <= qlowmark)
 		rdp->blimit = blimit;
 
 	/* Reset ->qlen_last_fqs_check trigger if enough CBs have drained. */
@@ -2464,7 +2465,7 @@ static void __call_rcu_core(struct rcu_data *rdp, struct rcu_head *head,
 			rcu_accelerate_cbs_unlocked(rdp->mynode, rdp);
 		} else {
 			/* Give the grace period a kick. */
-			rdp->blimit = LONG_MAX;
+			rdp->blimit = DEFAULT_MAX_RCU_BLIMIT;
 			if (rcu_state.n_force_qs == rdp->n_force_qs_snap &&
 			    rcu_segcblist_first_pend_cb(&rdp->cblist) != head)
 				rcu_force_quiescent_state();
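
For illustration, the before/after batch-limit behavior can be sketched in
userspace as follows (simplified sketch only: the real decisions live in
__call_rcu_core() and rcu_do_batch() as in the diff above, with additional
conditions omitted here):

#include <stdio.h>

#define DEFAULT_RCU_BLIMIT	10	/* normal per-batch limit */
#define DEFAULT_MAX_RCU_BLIMIT	10000	/* new cap during a callback flood */
#define QHIMARK			10000	/* flood threshold */
#define QLOWMARK		100	/* drained threshold */

/* Raise the batch limit during a flood, but only to
 * DEFAULT_MAX_RCU_BLIMIT rather than LONG_MAX, then restore the
 * normal limit once the backlog has drained. */
static long next_blimit(long blimit, long n_pending)
{
	if (n_pending > QHIMARK)
		return DEFAULT_MAX_RCU_BLIMIT;
	if (blimit >= DEFAULT_MAX_RCU_BLIMIT && n_pending <= QLOWMARK)
		return DEFAULT_RCU_BLIMIT;
	return blimit;
}

int main(void)
{
	long blimit = DEFAULT_RCU_BLIMIT;

	blimit = next_blimit(blimit, 2000000);	/* flood: 10 -> 10000 */
	printf("during flood: blimit = %ld\n", blimit);
	blimit = next_blimit(blimit, 50);	/* drained: 10000 -> 10 */
	printf("after drain:  blimit = %ld\n", blimit);
	return 0;
}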



* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-11  0:06         ` Alleged fix for writer stall on -rcu branch dev Paul E. McKenney
@ 2019-04-13 14:21           ` Paul E. McKenney
  2019-04-13 18:17             ` Paul E. McKenney
  2019-04-15 10:40             ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-13 14:21 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel

On Wed, Apr 10, 2019 at 05:06:10PM -0700, Paul E. McKenney wrote:
> On Wed, Apr 10, 2019 at 01:33:02PM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 10, 2019 at 09:19:18PM +0200, Sebastian Andrzej Siewior wrote:
> > > On 2019-04-10 11:41:05 [-0700], Paul E. McKenney wrote:
> > > > On Wed, Apr 10, 2019 at 08:15:38PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > On 2019-04-09 15:14:26 [-0700], Paul E. McKenney wrote:
> > > > > > Hello, Sebastian,
> > > > > Hi Paul,
> > > > > 
> > > > > > A few commits later, I finally have something that might work.  ;-)
> > > > > 
> > > > > Okay. I started with 33e04e4512797b5e0242f452d0027b096d43d006. The first
> > > > > batch of 18 completed and this
> > > > > 	https://breakpoint.cc/console-tree1.2.log
> > > > 
> > > > Another new one on me!  ;-)
> > > > 
> > > > It looks like CPU 11 got stuck in the cpu_stopper_thread trying to
> > > > offline CPU 6.  Unfortunately, CPU 6 is not in the list receiving an
> > > > NMI, perhaps because it looked like it was already offline.  It does
> > > > eventually finish going offline quite some time later.
> > > > 
> > > > Perhaps my forward-progress testing is interacting badly with CPU
> > > > hotplug, but only very occasionally?
> > > 
> > > Somehow, yes.
> > > 
> > > [ 8433.835292] Unregister pv shared memory for cpu 6
> > > [ 8433.864122] smpboot: CPU 6 is now offline
> > > 
> > > CPU6 is offline.
> > > 
> > > [ 8434.934765] smpboot: Booting Node 0 Processor 6 APIC 0x6
> > > [ 8434.950632] kvm-clock: cpu 6, msr 17155a181, secondary cpu clock
> > > [ 8434.989124] KVM setup async PF for cpu 6
> > > [ 8434.990801] kvm-stealtime: cpu 6, msr 17b195380
> > > 
> > > CPU6 is booting again.
> > > 
> > > [ 8436.035269] Unregister pv shared memory for cpu 6
> > > 
> > > going down again
> > > 
> > > [ 8462.032468] rcu: INFO: rcu_preempt self-detected stall on CPU
> > > [ 8462.037385] rcu: 	11-...!: (25587 ticks this GP) idle=57e/1/0x4000000000000002 softirq=418067/418067 fqs=1 last_accelerate: 2456/89e6, Nonlazy posted: .LD
> > > 
> > > 25587 ticks is ~25secs with HZ=1k. And the stall occurred on CPU11,
> > > correct?
> > 
> > Yes to both.
> > 
> > > [ 8525.041031] smpboot: CPU 6 is now offline
> > > 
> > > 63 secs later, the CPU is down. So that is too long.
> > 
> > Ah, I did miss the down-then-up, thank you!
> > 
> > > > Something for me to look at, anyway!
> > > 
> > > There is also
> > > [ 8556.907804] WARNING: CPU: 35 PID: 833 at /home/bigeasy/linux-rt/kernel/rcu/rcutorture.c:1827 rcu_torture_fwd_prog+0x5c4/0x630
> > > 
> > > but I'm not sure what it is :)
> > 
> > The forward-progress test exceeded eight seconds, but pushed fewer
> > than 100 callbacks through the system.  One way that could happen is
> > if the forward-progress kthread was delayed during that eight seconds.
> > Another is if grace periods were insanely fast (which they are not).
> > 
> > > > > I should have been alone on that server but I can't guarantee it. I try
> > > > > to throw the rcu-torture at a little smaller box and check how it goes.
> > > > 
> > > > Even if you were not alone, I would guess that preempting a runnable
> > > > process for 92 seconds would have been a visible event on the host.
> > > 
> > > 92 seconds?! Okay, that should have been visible on the host side. And I
> > > see nothing in dmesg on the host side.
> > > But why 92? 25 for RCU stall + 63 between starting to go down and
> > > finishing?
> > 
> > Because I missed the momentary up-then-down that you spotted.  Your 63
> > seconds is correct.
> > 
> > > > But maybe not?  The watchdog timer is still 120 seconds?  Or does your
> > > > host set it to 22 seconds?
> > > 
> > > I didn't fiddle with it and host runs 4.19.12-1~bpo9+1. So it should be
> > > the default value.
> > 
> > Which I believe to be 22 seconds.  Hmmm...
> > 
> > Either way, thank you very much for the testing!!!
> 
> [ Adding LKML back -- I think we dropped it due to an attachment? ]
> 
> One area of some concern is RCU_KTHREAD_YIELDING and the rcuc kthreads.
> This shouldn't be a big deal for the intended use case (real-time
> workloads with lightish load), but a two-jiffy delay could allow quite a
> few callbacks to build up when doing forward-progress testing.  But not
> a problem for use_softirq runs, which is what you were running.
> 
> But it might be that I need to either skip the forward-progress testing
> if !use_softirq, skip the RCU_KTHREAD_YIELDING step if there are too
> many callbacks, or some such.  Either way, though, I would need at least
> a cond_resched() on each pass through the rcu_cpu_kthread() function.
> 
> Thoughts from a real-time perspective?
> 
> Oooh...  But in both use_softirq and !use_softirq runs, if the CPU to be
> offlined has a huge number of callbacks pending (assuming callbacks are
> not offloaded), there could be quite the delay imposed on the offlining
> of that CPU.  Now 63 seconds seems a bit much, but there can be tens
> of millions of callbacks queued up to be invoked.  It -could- happen,
> I suppose, though I would guess more like a couple of seconds.
> 
> Maybe I need to ignore the ignoring of blimit for a CPU that is going
> offline...  The theory is that the callbacks will just get dumped on
> some other CPU if the CPU is going away, and on the other hand invoking
> callbacks sparingly would help the CPU come back all the way online
> more quickly.  Unless one of the callbacks on that CPU was the one from
> the synchronize_rcu() in the CPU-hotplug offline path.  :-/
> 
> So I could start going slow on callbacks on the outgoing CPU as soon as
> that synchronize_rcu() completes.  Horribly fragile, given the possibility
> of synchronize_rcu() in other CPU-hotplug notifiers, though.  Plus it
> looks like sched_cpu_deactivate() is about the first thing that happens
> when removing a CPU.
> 
> But you know...  Ignoring the world while invoking up to LONG_MAX RCU
> callbacks back to back probably never was an optimal plan.  So if that
> splat is at all reproducible, could you please give the patch below a try?
> 
> I have just touch-tested this on TREE01 and TREE04.  I will be testing
> the rest of the scenarios overnight, Pacific Time.
> 
> Of course, if you have more failures, please let me know!

And after ramping up the stress in rcutorture a bit, I am able to
reproduce this CPU stall involving offlining on my systems.  My plan is
my usual one:  Keep increasing rcutorture stress to get it to reproduce
in reasonable time, add diagnostics as appropriate, and continue making
forward-progress improvements in RCU itself.

Thank you very much for locating this one!!!

							Thanx, Paul



* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-13 14:21           ` Paul E. McKenney
@ 2019-04-13 18:17             ` Paul E. McKenney
  2019-04-15 10:41               ` Sebastian Andrzej Siewior
  2019-04-15 10:40             ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-13 18:17 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel

On Sat, Apr 13, 2019 at 07:21:53AM -0700, Paul E. McKenney wrote:
> On Wed, Apr 10, 2019 at 05:06:10PM -0700, Paul E. McKenney wrote:
> > On Wed, Apr 10, 2019 at 01:33:02PM -0700, Paul E. McKenney wrote:
> > > On Wed, Apr 10, 2019 at 09:19:18PM +0200, Sebastian Andrzej Siewior wrote:
> > > > On 2019-04-10 11:41:05 [-0700], Paul E. McKenney wrote:
> > > > > On Wed, Apr 10, 2019 at 08:15:38PM +0200, Sebastian Andrzej Siewior wrote:
> > > > > > On 2019-04-09 15:14:26 [-0700], Paul E. McKenney wrote:
> > > > > > > Hello, Sebastian,
> > > > > > Hi Paul,
> > > > > > 
> > > > > > > A few commits later, I finally have something that might work.  ;-)
> > > > > > 
> > > > > > Okay. I started with 33e04e4512797b5e0242f452d0027b096d43d006. The first
> > > > > > batch of 18 completed and this
> > > > > > 	https://breakpoint.cc/console-tree1.2.log
> > > > > 
> > > > > Another new one on me!  ;-)
> > > > > 
> > > > > It looks like CPU 11 got stuck in the cpu_stopper_thread trying to
> > > > > offline CPU 6.  Unfortunately, CPU 6 is not in the list receiving an
> > > > > NMI, perhaps because it looked like it was already offline.  It does
> > > > > eventually finish going offline quite some time later.
> > > > > 
> > > > > Perhaps my forward-progress testing is interacting badly with CPU
> > > > > hotplug, but only very occasionally?
> > > > 
> > > > Somehow, yes.
> > > > 
> > > > [ 8433.835292] Unregister pv shared memory for cpu 6
> > > > [ 8433.864122] smpboot: CPU 6 is now offline
> > > > 
> > > > CPU6 is offline.
> > > > 
> > > > [ 8434.934765] smpboot: Booting Node 0 Processor 6 APIC 0x6
> > > > [ 8434.950632] kvm-clock: cpu 6, msr 17155a181, secondary cpu clock
> > > > [ 8434.989124] KVM setup async PF for cpu 6
> > > > [ 8434.990801] kvm-stealtime: cpu 6, msr 17b195380
> > > > 
> > > > CPU6 is booting again.
> > > > 
> > > > [ 8436.035269] Unregister pv shared memory for cpu 6
> > > > 
> > > > going down again
> > > > 
> > > > [ 8462.032468] rcu: INFO: rcu_preempt self-detected stall on CPU
> > > > [ 8462.037385] rcu: 	11-...!: (25587 ticks this GP) idle=57e/1/0x4000000000000002 softirq=418067/418067 fqs=1 last_accelerate: 2456/89e6, Nonlazy posted: .LD
> > > > 
> > > > 25587 ticks is ~25secs with HZ=1k. And the stall occurred on CPU11,
> > > > correct?
> > > 
> > > Yes to both.
> > > 
> > > > [ 8525.041031] smpboot: CPU 6 is now offline
> > > > 
> > > > 63 secs later, the CPU is down. So that is too long.
> > > 
> > > Ah, I did miss the down-then-up, thank you!
> > > 
> > > > > Something for me to look at, anyway!
> > > > 
> > > > There is also
> > > > [ 8556.907804] WARNING: CPU: 35 PID: 833 at /home/bigeasy/linux-rt/kernel/rcu/rcutorture.c:1827 rcu_torture_fwd_prog+0x5c4/0x630
> > > > 
> > > > but I'm not sure what it is :)
> > > 
> > > The forward-progress test exceeded eight seconds, but pushed fewer
> > > than 100 callbacks through the system.  One way that could happen is
> > > if the forward-progress kthread was delayed during that eight seconds.
> > > Another is if grace periods were insanely fast (which they are not).
> > > 
> > > > > > I should have been alone on that server but I can't guarantee it. I try
> > > > > > to throw the rcu-torture at a little smaller box and check how it goes.
> > > > > 
> > > > > Even if you were not alone, I would guess that preempting a runnable
> > > > > process for 92 seconds would have been a visible event on the host.
> > > > 
> > > > 92 seconds?! Okay, that should have been visible on the host side. And I
> > > > see nothing in dmesg on the host side.
> > > > But why 92? 25 for RCU stall + 63 between starting to go down and
> > > > finishing?
> > > 
> > > Because I missed the momentary up-then-down that you spotted.  Your 63
> > > seconds is correct.
> > > 
> > > > > But maybe not?  The watchdog timer is still 120 seconds?  Or does your
> > > > > host set it to 22 seconds?
> > > > 
> > > > I didn't fiddle with it and host runs 4.19.12-1~bpo9+1. So it should be
> > > > the default value.
> > > 
> > > Which I believe to be 22 seconds.  Hmmm...
> > > 
> > > Either way, thank you very much for the testing!!!
> > 
> > [ Adding LKML back -- I think we dropped it due to an attachment? ]
> > 
> > One area of some concern is RCU_KTHREAD_YIELDING and the rcuc kthreads.
> > This shouldn't be a big deal for the intended use case (real-time
> > workloads with lightish load), but a two-jiffy delay could allow quite a
> > few callbacks to build up when doing forward-progress testing.  But not
> > a problem for use_softirq runs, which is what you were running.
> > 
> > But it might be that I need to either skip the forward-progress testing
> > if !use_softirq, skip the RCU_KTHREAD_YIELDING step if there are too
> > many callbacks, or some such.  Either way, though, I would need at least
> > a cond_resched() on each pass through the rcu_cpu_kthread() function.
> > 
> > Thoughts from a real-time perspective?
> > 
> > Oooh...  But in both use_softirq and !use_softirq runs, if the CPU to be
> > offlined has a huge number of callbacks pending (assuming callbacks are
> > not offloaded), there could be quite the delay imposed on the offlining
> > of that CPU.  Now 63 seconds seems a bit much, but there can be tens
> > of millions of callbacks queued up to be invoked.  It -could- happen,
> > I suppose, though I would guess more like a couple of seconds.
> > 
> > Maybe I need to ignore the ignoring of blimit for a CPU that is going
> > offline...  The theory is that the callbacks will just get dumped on
> > some other CPU if the CPU is going away, and on the other hand invoking
> > callbacks sparingly would help the CPU come back all the way online
> > more quickly.  Unless one of the callbacks on that CPU was the one from
> > the synchronize_rcu() in the CPU-hotplug offline path.  :-/
> > 
> > So I could start going slow on callbacks on the outgoing CPU as soon as
> > that synchronize_rcu() completes.  Horribly fragile, given the possibility
> > of synchronize_rcu() in other CPU-hotplug notifiers, though.  Plus it
> > looks like sched_cpu_deactivate() is about the first thing that happens
> > when removing a CPU.
> > 
> > But you know...  Ignoring the world while invoking up to LONG_MAX RCU
> > callbacks back to back probably never was an optimal plan.  So if that
> > splat is at all reproducible, could you please give the patch below a try?
> > 
> > I have just touch-tested this on TREE01 and TREE04.  I will be testing
> > the rest of the scenarios overnight, Pacific Time.
> > 
> > Of course, if you have more failures, please let me know!
> 
> And after ramping up the stress in rcutorture a bit, I am able to
> reproduce this CPU stall involving offlining on my systems.  My plan is
> my usual one:  Keep increasing rcutorture stress to get it to reproduce
> in reasonable time, add diagnostics as appropriate, and continue making
> forward-progress improvements in RCU itself.
> 
> Thank you very much for locating this one!!!

Oh, and probably more to the point...  I reviewed your patch, and with
the subsequent changes it looks sane to me.  Plus the test failures thus
far don't appear to have anything to do with your patch.  However, I am
too cowardly to push this into the upcoming merge window, so I will do
so on the one following (v5.3).

							Thanx, Paul



* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-13 14:21           ` Paul E. McKenney
  2019-04-13 18:17             ` Paul E. McKenney
@ 2019-04-15 10:40             ` Sebastian Andrzej Siewior
  2019-04-15 10:56               ` Paul E. McKenney
  1 sibling, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-15 10:40 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

On 2019-04-13 07:21:53 [-0700], Paul E. McKenney wrote:
> 
> And after ramping up the stress in rcutorture a bit, I am able to
> reproduce this CPU stall involving offlining on my systems.  My plan is
> my usual one:  Keep increasing rcutorture stress to get it to reproduce
> in reasonable time, add diagnostics as appropriate, and continue making
> forward-progress improvements in RCU itself.

I *think* this one is also new:

|Assertion failure in /home/bigeasy/linux-rt/tools/testing/selftests/rcutorture/res/2019.04.14-14:28:40/TREE03.37/console.log TREE03
|Summary: Warnings: 6 Call Traces: 6
https://breakpoint.cc/console-03.37.log

|Assertion failure in /home/bigeasy/linux-rt/tools/testing/selftests/rcutorture/res/2019.04.14-14:28:40/TREE03.39/console.log TREE03
|Summary: Warnings: 4 Call Traces: 4
https://breakpoint.cc/console-03.39.log

This is still commit 33e04e451279 ("torture: Allow inter-stutter interval to be
specified").

> Thank you very much for locating this one!!!

You are welcome.

> 							Thanx, Paul

Sebastian


* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-13 18:17             ` Paul E. McKenney
@ 2019-04-15 10:41               ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-15 10:41 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

On 2019-04-13 11:17:50 [-0700], Paul E. McKenney wrote:
> Oh, and probably more to the point...  I reviewed your patch, and with
> the subsequent changes it looks sane to me.  Plus the test failures thus
> far don't appear to have anything to do with your patch.  However, I am
> too cowardly to push this into the upcoming merge window, so I will do
> so on the one following (v5.3).

Thank you.

> 							Thanx, Paul

Sebastian


* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-15 10:40             ` Sebastian Andrzej Siewior
@ 2019-04-15 10:56               ` Paul E. McKenney
  2019-04-15 11:04                 ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-15 10:56 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel

On Mon, Apr 15, 2019 at 12:40:47PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-04-13 07:21:53 [-0700], Paul E. McKenney wrote:
> > 
> > And after ramping up the stress in rcutorture a bit, I am able to
> > reproduce this CPU stall involving offlining on my systems.  My plan is
> > my usual one:  Keep increasing rcutorture stress to get it to reproduce
> > in reasonable time, add diagnostics as appropriate, and continue making
> > forward-progress improvements in RCU itself.
> 
> I *think* this one is also new:
> 
> |Assertion failure in /home/bigeasy/linux-rt/tools/testing/selftests/rcutorture/res/2019.04.14-14:28:40/TREE03.37/console.log TREE03
> |Summary: Warnings: 6 Call Traces: 6
> https://breakpoint.cc/console-03.37.log

This is new in rcutorture as well.  It is complaining that we just got
done with a stutter interval (in which torturing temporarily stops)
but not everything has progressed through the callback chain to be freed.
And in this case it is happening during rcutorture shutdown, so I need
to check the callback -- I might need end-of-run checks in that WARN_ON().
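
In other words, something along these lines, with hypothetical names (this
is a sketch, not the actual rcutorture code):

#include <linux/bug.h>
#include <linux/torture.h>

/* Skip the end-of-stutter leftover-callback complaint once shutdown
 * has begun, since callbacks are still draining at that point. */
static void example_stutter_check(int cbs_still_queued)
{
	if (torture_must_stop())
		return;			/* end of run: don't warn */
	WARN_ON_ONCE(cbs_still_queued);	/* mid-run: leftovers are a bug */
}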

> |Assertion failure in /home/bigeasy/linux-rt/tools/testing/selftests/rcutorture/res/2019.04.14-14:28:40/TREE03.39/console.log TREE03
> |Summary: Warnings: 4 Call Traces: 4
> https://breakpoint.cc/console-03.39.log

And the same here.  So neither counts against you.  ;-)

							Thanx, Paul

> This is still commit 33e04e451279 ("torture: Allow inter-stutter interval to be
> specified").
> 
> > Thank you very much for locating this one!!!
> 
> You are welcome.
> 
> > 							Thanx, Paul
> 
> Sebastian
> 



* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-15 10:56               ` Paul E. McKenney
@ 2019-04-15 11:04                 ` Sebastian Andrzej Siewior
  2019-04-15 22:26                   ` Paul E. McKenney
  2019-04-23 15:25                   ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-15 11:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

On 2019-04-15 03:56:45 [-0700], Paul E. McKenney wrote:
> This is new in rcutorture as well.  It is complaining that we just got
> done with a stutter interval (in which torturing temporarily stops)
> but not everything has progressed through the callback chain to be freed.
> And in this case it is happening during rcutorture shutdown, so I need
> to check the callback -- I might need end-of-run checks in that WARN_ON().

good so nothing important so far. I hope the box gets to TREE08 soon.
Do you want me to retrigger the run with your newer dev branch once it
is done?

> 
> 							Thanx, Paul

Sebastian


* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-15 11:04                 ` Sebastian Andrzej Siewior
@ 2019-04-15 22:26                   ` Paul E. McKenney
  2019-04-16  8:03                     ` Sebastian Andrzej Siewior
  2019-04-23 15:25                   ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-15 22:26 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel

On Mon, Apr 15, 2019 at 01:04:03PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-04-15 03:56:45 [-0700], Paul E. McKenney wrote:
> > This is new in rcutorture as well.  It is complaining that we just got
> > done with a stutter interval (in which torturing temporarily stops)
> > but not everything has progressed through the callback chain to be freed.
> > And in this case it is happening during rcutorture shutdown, so I need
> > to check the callback -- I might need end-of-run checks in that WARN_ON().

And I did need end-of-run checks, which are now in -rcu on branch dev.

> good so nothing important so far. I hope the box gets to TREE08 soon.
> Do you want me to retrigger the run with your newer dev branch once it
> is done?

I feel a bit guilty monopolizing your system...

I have accumulated a collection of forward-progress ideas, so one approach
would be for me to get those written, then running well on my systems, and
then ask for some testing on your systems.  Would that be a reasonable
approach?

Might make anyone else wanting to use that system in the near term more
happy as well.  ;-)

							Thanx, Paul



* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-15 22:26                   ` Paul E. McKenney
@ 2019-04-16  8:03                     ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-16  8:03 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

On 2019-04-15 15:26:52 [-0700], Paul E. McKenney wrote:
> I feel a bit guilty monopolizing your system...

Don't be. In the end I want my stuff merged without causing any
regressions.

> I have accumulated a collection of forward-progress ideas, so one approach
> would be for me to get those written, then running well on my systems, and
> then ask for some testing on your systems.  Would that be a reasonable
> approach?

Sure.

> Might make anyone else wanting to use that system in the near term more
> happy as well.  ;-)

:)

> 							Thanx, Paul

Sebastian


* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-15 11:04                 ` Sebastian Andrzej Siewior
  2019-04-15 22:26                   ` Paul E. McKenney
@ 2019-04-23 15:25                   ` Sebastian Andrzej Siewior
  2019-04-24  5:33                     ` Paul E. McKenney
  1 sibling, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2019-04-23 15:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

On 2019-04-15 13:04:03 [+0200], To Paul E. McKenney wrote:
> 
> good so nothing important so far. I hope the box gets to TREE08 soon.

test completed. Nothing new.

> > 
> > 							Thanx, Paul
> 
Sebastian


* Re: Alleged fix for writer stall on -rcu branch dev
  2019-04-23 15:25                   ` Sebastian Andrzej Siewior
@ 2019-04-24  5:33                     ` Paul E. McKenney
  0 siblings, 0 replies; 11+ messages in thread
From: Paul E. McKenney @ 2019-04-24  5:33 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: linux-kernel

On Tue, Apr 23, 2019 at 05:25:00PM +0200, Sebastian Andrzej Siewior wrote:
> On 2019-04-15 13:04:03 [+0200], To Paul E. McKenney wrote:
> > 
> > good so nothing important so far. I hope the box gets to TREE08 soon.
> 
> test completed. Nothing new.

That took some time!  Thank you for running it!

							Thanx, Paul



Thread overview: 11+ messages
     [not found] <20190409221426.GA11212@linux.ibm.com>
     [not found] ` <20190410181538.3ocva5lmh7ghrisr@linutronix.de>
     [not found]   ` <20190410184105.GR14111@linux.ibm.com>
     [not found]     ` <20190410191918.g2yqnsmuqukmzdfe@linutronix.de>
     [not found]       ` <20190410203302.GT14111@linux.ibm.com>
2019-04-11  0:06         ` Alleged fix for writer stall on -rcu branch dev Paul E. McKenney
2019-04-13 14:21           ` Paul E. McKenney
2019-04-13 18:17             ` Paul E. McKenney
2019-04-15 10:41               ` Sebastian Andrzej Siewior
2019-04-15 10:40             ` Sebastian Andrzej Siewior
2019-04-15 10:56               ` Paul E. McKenney
2019-04-15 11:04                 ` Sebastian Andrzej Siewior
2019-04-15 22:26                   ` Paul E. McKenney
2019-04-16  8:03                     ` Sebastian Andrzej Siewior
2019-04-23 15:25                   ` Sebastian Andrzej Siewior
2019-04-24  5:33                     ` Paul E. McKenney
