* Possible reproduction of CSD locking issue
@ 2022-01-25 18:27 Corey Minyard
  2022-01-25 18:53 ` Paul E. McKenney
  2022-01-26  6:08 ` Juergen Gross
  0 siblings, 2 replies; 10+ messages in thread
From: Corey Minyard @ 2022-01-25 18:27 UTC (permalink / raw)
  To: Juergen Gross, Ingo Molnar, Paul E. McKenney; +Cc: LKML

We have a customer that had been seeing CSD lock issues on a Centos 7
kernel (unfortunately).  I couldn't find anything or any kernel changes
that might fix it, so I was considering it might be the CSD locking issue you
have been chasing for a while.

So I backported the debug patches.  And of course, they stopped seeing
the issue, at least as much, and they had trouble with the extra CPU
time the debug code took.  But they just reproduced it.  Here are the
logs:

Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34
Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.

Hopefully this is the issue you are chasing and not something else.
I've been studying them to see what they mean, but I thought you might
be interested to get them asap.

-corey
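
For readers who have not dug into kernel/smp.c: the "CSD lock" in these
messages is the per-call flag that smp_call_function_single() and friends
spin on until the target CPU's IPI handler has run the requested function
and released it.  Below is a minimal userspace sketch of that handshake
(plain pthreads and atomics, not the kernel implementation); the 5-second
warning mirrors what the backported debug patches print in the log above.

/* csd_wait_sketch.c: build with gcc -pthread csd_wait_sketch.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static atomic_int csd_locked;           /* stands in for CSD_FLAG_LOCK */

static void *target_cpu(void *arg)
{
        sleep(6);                       /* pretend the target CPU is tied up */
        printf("target: running requested function, releasing the lock\n");
        atomic_store(&csd_locked, 0);   /* csd_unlock() on the target */
        return NULL;
}

int main(void)
{
        pthread_t t;
        struct timespec start, now;

        atomic_store(&csd_locked, 1);   /* csd_lock() on the sender */
        pthread_create(&t, NULL, target_cpu, NULL);

        clock_gettime(CLOCK_MONOTONIC, &start);
        while (atomic_load(&csd_locked)) {      /* the csd_lock_wait() spin */
                clock_gettime(CLOCK_MONOTONIC, &now);
                if (now.tv_sec - start.tv_sec >= 5) {
                        printf("csd: lock unresponsive, still waiting\n");
                        start = now;
                }
        }
        printf("sender: lock released, the call returns\n");
        pthread_join(t, NULL);
        return 0;
}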

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-25 18:27 Possible reproduction of CSD locking issue Corey Minyard
@ 2022-01-25 18:53 ` Paul E. McKenney
  2022-01-25 21:52   ` Corey Minyard
  2022-01-26  6:08 ` Juergen Gross
  1 sibling, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2022-01-25 18:53 UTC (permalink / raw)
  To: Corey Minyard; +Cc: Juergen Gross, Ingo Molnar, LKML

On Tue, Jan 25, 2022 at 12:27:37PM -0600, Corey Minyard wrote:
> We have a customer that had been seeing CSD lock issues on a Centos 7
> kernel (unfortunately).  I couldn't find anything or any kernel changes
> that might fix it, so I was considering it might be the CSD locking issue you
> have been chasing for a while.
> 
> So I backported the debug patches.  And of course, they stopped seeing
> the issue, at least as much, and they had trouble with the extra CPU
> time the debug code took.  But they just reproduced it.  Here are the
> logs:
> 
> Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
> Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
> Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
> Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
> Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
> Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
> Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
> Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
> Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
> Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
> Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
> Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
> Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
> Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
> Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
> Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
> Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
> Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
> Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
> Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34

Long-running interrupt handler, maybe?  Or am I misinterpreting this
stack trace?

> Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
> Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
> Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
> Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
> Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
> Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
> Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
> Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
> Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
> Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
> Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
> Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
> Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
> Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
> Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
> Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> 
> Hopefully this is the issue you are chasing and not something else.
> I've been studying them to see what they mean, but I thought you might
> be interested to get them asap.

Well, there have been several bugs causing these CSD lock issues, so what
is one more?  ;-)

More seriously, have you tried Frederic's patch?  This fixes the issue
described here:  https://paulmck.livejournal.com/62071.html
If your stack above was due to an interrupt storm rather than a
long-running interrupt, this might well be the fix.

Oh, and Jürgen Groß reportedly found an issue about a year ago that
could potentially be related, but I see that he is already on CC.

And, unfortunately, even more seriously, this CSD-lock diagnostic code
will very likely continue to find problems, just as the infamous RCU
CPU stall warnings and hard/soft lockup warnings do.

							Thanx, Paul

------------------------------------------------------------------------

commit 53e87e3cdc155f20c3417b689df8d2ac88d79576
Author: Frederic Weisbecker <frederic@kernel.org>
Date:   Tue Oct 26 16:10:54 2021 +0200

    timers/nohz: Last resort update jiffies on nohz_full IRQ entry
    
    When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
    is guaranteed to stay online and to never stop its tick.
    
    Meanwhile on some rare case, the dedicated timekeeper may be running
    with interrupts disabled for a while, such as in stop_machine.
    
    If jiffies stop being updated, a nohz_full CPU may end up endlessly
    programming the next tick in the past, taking the last jiffies update
    monotonic timestamp as a stale base, resulting in an tick storm.
    
    Here is a scenario where it matters:
    
    0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
    
    1) A stop machine callback is queued to execute somewhere.
    
    2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
       MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
       can still take IRQs.
    
    3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
    
    4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
       last_jiffies_update as a base. But last_jiffies_update hasn't been
       updated for 2 jiffies since the timekeeper has interrupts disabled.
    
    5) clockevents_program_event(), which relies on ktime_get(), observes
       that the expiration is in the past and therefore programs the min
       delta event on the clock.
    
    6) The tick fires immediately, goto 3)
    
    7) Tick storm, the nohz_full CPU is drown and takes ages to reach
       MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
    
    Solve this with unconditionally updating jiffies if the value is stale
    on nohz_full IRQ entry. IRQs and other disturbances are expected to be
    rare enough on nohz_full for the unconditional call to ktime_get() to
    actually matter.
    
    Reported-by: Paul E. McKenney <paulmck@kernel.org>
    Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Paul E. McKenney <paulmck@kernel.org>
    Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 322b65d456767..41f470929e991 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -595,7 +595,8 @@ void irq_enter_rcu(void)
 {
 	__irq_enter_raw();
 
-	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
+	if (tick_nohz_full_cpu(smp_processor_id()) ||
+	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
 		tick_irq_enter();
 
 	account_hardirq_enter(current);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 6bffe5af8cb11..17a283ce2b20f 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
 	now = ktime_get();
 	if (ts->idle_active)
 		tick_nohz_stop_idle(ts, now);
+	/*
+	 * If all CPUs are idle. We may need to update a stale jiffies value.
+	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
+	 * alive but it might be busy looping with interrupts disabled in some
+	 * rare case (typically stop machine). So we must make sure we have a
+	 * last resort.
+	 */
 	if (ts->tick_stopped)
 		tick_nohz_update_jiffies(now);
 }

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-25 18:53 ` Paul E. McKenney
@ 2022-01-25 21:52   ` Corey Minyard
  2022-01-25 22:51     ` Paul E. McKenney
  0 siblings, 1 reply; 10+ messages in thread
From: Corey Minyard @ 2022-01-25 21:52 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Juergen Gross, Ingo Molnar, LKML

On Tue, Jan 25, 2022 at 10:53:38AM -0800, Paul E. McKenney wrote:
> On Tue, Jan 25, 2022 at 12:27:37PM -0600, Corey Minyard wrote:
> > We have a customer that had been seeing CSD lock issues on a Centos 7
> > kernel (unfortunately).  I couldn't find anything or any kernel changes
> > that might fix it, so I was considering it might be the CSD locking issue you
> > have been chasing for a while.
> > 
> > So I backported the debug patches.  And of course, they stopped seeing
> > the issue, at least as much, and they had trouble with the extra CPU
> > time the debug code took.  But they just reproduced it.  Here are the
> > logs:
> > 
> > Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> > Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
> > Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
> > Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
> > Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
> > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
> > Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
> > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
> > Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
> > Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
> > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
> > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
> > Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> > Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
> > Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> > Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
> > Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
> > Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
> > Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34
> 
> Long-running interrupt handler, maybe?  Or am I misinterpreting this
> stack trace?

Well, maybe.  But why would the condition clear up immediately when
another IPI is sent?  If I understand correctly this box is doing mostly
polled I/O, so there shouldn't be many interrupts.  But possibly a driver
bug, I hadn't thought of that.

I could have the system panic when this happens and maybe I can find
some breadcrumbs lying around in the core dump.  The trouble is it has
taken two months to reproduce this.
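
If panicking on detection is the route taken, a small hook in the debug
code would do; the sketch below is hypothetical (the function and
parameter names are invented, not part of the backported patches), but
with kdump already loaded, as the log shows, panic() would leave a
vmcore to dig through while CPU#55 is still stuck.

#include <linux/kernel.h>

/* Hypothetical hook, called from the sender's wait loop on each 5 s warning. */
static void csd_lock_timeout_check(unsigned int dest_cpu, int warnings)
{
        pr_alert("csd: CSD lock still stuck waiting for CPU#%u\n", dest_cpu);

        /* After a few warnings, stop limping along and capture a crash dump. */
        if (warnings >= 3)
                panic("csd: non-responsive CSD lock (waiting for CPU#%u)",
                      dest_cpu);
}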

> 
> > Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> > Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> > Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> > Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> > Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
> > Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
> > Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
> > Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
> > Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
> > Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
> > Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
> > Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
> > Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
> > Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
> > Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
> > Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
> > Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
> > Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
> > Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
> > Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> > 
> > Hopefully this is the issue you are chasing and not something else.
> > I've been studying them to see what they mean, but I thought you might
> > be interested to get them asap.
> 
> Well, there have been several bugs causing these CSD lock issues, so what
> is one more?  ;-)
> 
> More seriously, have you tried Frederic's patch?  This fixes the issue
> described here:  https://paulmck.livejournal.com/62071.html
> If your stack above was due to an interrupt storm rather than a
> long-running interrupt, this might well be the fix.

The code is so different that it will take me some time.  I'll have to
understand the issue more deeply to trace that down.  But thanks for the
reference.

> 
> Oh, and Jürgen Groß reportedly found an issue about a year ago that
> could potentially be related, but I see that he is already on CC.
> 
> And, unfortunately, even more seriously, this CSD-lock diagnostic code
> will very likely continue to find problems, just as the infamous RCU
> CPU stall warnings and hard/soft lockup warnings do.

Yeah, I've already fixed a couple of bugs related to CSD lockups.  So I
suppose I've handled all the low-hanging fruit.  A 104-core box makes
uncovering these sorts of issues more likely, I suppose.

-corey

> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> commit 53e87e3cdc155f20c3417b689df8d2ac88d79576
> Author: Frederic Weisbecker <frederic@kernel.org>
> Date:   Tue Oct 26 16:10:54 2021 +0200
> 
>     timers/nohz: Last resort update jiffies on nohz_full IRQ entry
>     
>     When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
>     is guaranteed to stay online and to never stop its tick.
>     
>     Meanwhile on some rare case, the dedicated timekeeper may be running
>     with interrupts disabled for a while, such as in stop_machine.
>     
>     If jiffies stop being updated, a nohz_full CPU may end up endlessly
>     programming the next tick in the past, taking the last jiffies update
>     monotonic timestamp as a stale base, resulting in an tick storm.
>     
>     Here is a scenario where it matters:
>     
>     0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
>     
>     1) A stop machine callback is queued to execute somewhere.
>     
>     2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
>        MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
>        can still take IRQs.
>     
>     3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
>     
>     4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
>        last_jiffies_update as a base. But last_jiffies_update hasn't been
>        updated for 2 jiffies since the timekeeper has interrupts disabled.
>     
>     5) clockevents_program_event(), which relies on ktime_get(), observes
>        that the expiration is in the past and therefore programs the min
>        delta event on the clock.
>     
>     6) The tick fires immediately, goto 3)
>     
>     7) Tick storm, the nohz_full CPU is drown and takes ages to reach
>        MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
>     
>     Solve this with unconditionally updating jiffies if the value is stale
>     on nohz_full IRQ entry. IRQs and other disturbances are expected to be
>     rare enough on nohz_full for the unconditional call to ktime_get() to
>     actually matter.
>     
>     Reported-by: Paul E. McKenney <paulmck@kernel.org>
>     Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>     Tested-by: Paul E. McKenney <paulmck@kernel.org>
>     Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
> 
> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 322b65d456767..41f470929e991 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -595,7 +595,8 @@ void irq_enter_rcu(void)
>  {
>  	__irq_enter_raw();
>  
> -	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
> +	if (tick_nohz_full_cpu(smp_processor_id()) ||
> +	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
>  		tick_irq_enter();
>  
>  	account_hardirq_enter(current);
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 6bffe5af8cb11..17a283ce2b20f 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
>  	now = ktime_get();
>  	if (ts->idle_active)
>  		tick_nohz_stop_idle(ts, now);
> +	/*
> +	 * If all CPUs are idle. We may need to update a stale jiffies value.
> +	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
> +	 * alive but it might be busy looping with interrupts disabled in some
> +	 * rare case (typically stop machine). So we must make sure we have a
> +	 * last resort.
> +	 */
>  	if (ts->tick_stopped)
>  		tick_nohz_update_jiffies(now);
>  }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-25 21:52   ` Corey Minyard
@ 2022-01-25 22:51     ` Paul E. McKenney
  0 siblings, 0 replies; 10+ messages in thread
From: Paul E. McKenney @ 2022-01-25 22:51 UTC (permalink / raw)
  To: Corey Minyard; +Cc: Juergen Gross, Ingo Molnar, LKML

On Tue, Jan 25, 2022 at 03:52:38PM -0600, Corey Minyard wrote:
> On Tue, Jan 25, 2022 at 10:53:38AM -0800, Paul E. McKenney wrote:
> > On Tue, Jan 25, 2022 at 12:27:37PM -0600, Corey Minyard wrote:
> > > We have a customer that had been seeing CSD lock issues on a Centos 7
> > > kernel (unfortunately).  I couldn't find anything or any kernel changes
> > > that might fix it, so I was considering it might be the CSD locking issue you
> > > have been chasing for a while.
> > > 
> > > So I backported the debug patches.  And of course, they stopped seeing
> > > the issue, at least as much, and they had trouble with the extra CPU
> > > time the debug code took.  But they just reproduced it.  Here are the
> > > logs:
> > > 
> > > Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> > > Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
> > > Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
> > > Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
> > > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
> > > Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
> > > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
> > > Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> > > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> > > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
> > > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
> > > Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
> > > Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
> > > Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
> > > Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
> > > Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34
> > 
> > Long-running interrupt handler, maybe?  Or am I misinterpreting this
> > stack trace?
> 
> Well, maybe.  But why would the condition clear up immediately when
> another IPI is sent?  If I understand correctly this box is doing mostly
> polled I/O, so there shouldn't be many interrupts.  But possibly a driver
> bug, I hadn't thought of that.

All fair points!

Me, I suspected lost interrupts for the longest time, but never found
any hard evidence of such a thing.  But maybe that is what you are seeing.

> I could have the system panic when this happens and maybe I can find
> some breadcrumbs lying around in the core dump.  The trouble is it has
> taken two months to reproduce this.

And being as it is a customer machine, you probably do not want to take
steps to make it happen more often...

But would it be possible to trace interrupt entry and exit, and then
dump the offending CPU's trace buffer when things recover?
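
A sketch of that idea against mainline ftrace interfaces (the 3.10-based
kernel may differ, and csd_lock_recovered() is an invented name for
wherever the "got unstuck" message is printed): enable the irq
tracepoints at boot, for example with
trace_event=irq:irq_handler_entry,irq:irq_handler_exit on the kernel
command line, then freeze and dump the buffers once the lock is released.

#include <linux/ftrace.h>
#include <linux/kernel.h>

static void csd_lock_recovered(unsigned int dest_cpu)
{
        pr_alert("csd: CPU#%u released the lock, dumping trace buffers\n",
                 dest_cpu);

        tracing_off();          /* stop tracing so the evidence is not overwritten */
        ftrace_dump(DUMP_ALL);  /* dump every CPU's ring buffer, including the stuck CPU's */
}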

> > > Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> > > Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> > > Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> > > Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
> > > Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
> > > Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
> > > Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
> > > Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
> > > Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
> > > Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
> > > Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
> > > Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> > > 
> > > Hopefully this is the issue you are chasing and not something else.
> > > I've been studying them to see what they mean, but I thought you might
> > > be interested to get them asap.
> > 
> > Well, there have been several bugs causing these CSD lock issues, so what
> > is one more?  ;-)
> > 
> > More seriously, have you tried Frederic's patch?  This fixes the issue
> > described here:  https://paulmck.livejournal.com/62071.html
> > If your stack above was due to an interrupt storm rather than a
> > long-running interrupt, this might well be the fix.
> 
> The code is so different that it will take me some time.  I'll have to
> understand the issue more deeply to trace that down.  But thanks for the
> reference.

Understood!

> > Oh, and Jürgen Groß reportedly found an issue about a year ago that
> > could potentially be related, but I see that he is already on CC.
> > 
> > And, unfortunately, even more seriously, this CSD-lock diagnostic code
> > will very likely continue to find problems, just as the infamous RCU
> > CPU stall warnings and hard/soft lockup warnings do.
> 
> Yeah, I've already fixed a couple of bugs related to CSD lockups.  So I
> suppose I've handled all the low-hanging fruit.  A 104-core box makes
> uncovering these sorts of issues more likely, I suppose.

Especially if that system has more than one socket.

							Thanx, Paul

> -corey
> 
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > commit 53e87e3cdc155f20c3417b689df8d2ac88d79576
> > Author: Frederic Weisbecker <frederic@kernel.org>
> > Date:   Tue Oct 26 16:10:54 2021 +0200
> > 
> >     timers/nohz: Last resort update jiffies on nohz_full IRQ entry
> >     
> >     When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
> >     is guaranteed to stay online and to never stop its tick.
> >     
> >     Meanwhile on some rare case, the dedicated timekeeper may be running
> >     with interrupts disabled for a while, such as in stop_machine.
> >     
> >     If jiffies stop being updated, a nohz_full CPU may end up endlessly
> >     programming the next tick in the past, taking the last jiffies update
> >     monotonic timestamp as a stale base, resulting in an tick storm.
> >     
> >     Here is a scenario where it matters:
> >     
> >     0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
> >     
> >     1) A stop machine callback is queued to execute somewhere.
> >     
> >     2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
> >        MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
> >        can still take IRQs.
> >     
> >     3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
> >     
> >     4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
> >        last_jiffies_update as a base. But last_jiffies_update hasn't been
> >        updated for 2 jiffies since the timekeeper has interrupts disabled.
> >     
> >     5) clockevents_program_event(), which relies on ktime_get(), observes
> >        that the expiration is in the past and therefore programs the min
> >        delta event on the clock.
> >     
> >     6) The tick fires immediately, goto 3)
> >     
> >     7) Tick storm, the nohz_full CPU is drown and takes ages to reach
> >        MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
> >     
> >     Solve this with unconditionally updating jiffies if the value is stale
> >     on nohz_full IRQ entry. IRQs and other disturbances are expected to be
> >     rare enough on nohz_full for the unconditional call to ktime_get() to
> >     actually matter.
> >     
> >     Reported-by: Paul E. McKenney <paulmck@kernel.org>
> >     Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
> >     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >     Tested-by: Paul E. McKenney <paulmck@kernel.org>
> >     Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
> > 
> > diff --git a/kernel/softirq.c b/kernel/softirq.c
> > index 322b65d456767..41f470929e991 100644
> > --- a/kernel/softirq.c
> > +++ b/kernel/softirq.c
> > @@ -595,7 +595,8 @@ void irq_enter_rcu(void)
> >  {
> >  	__irq_enter_raw();
> >  
> > -	if (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET))
> > +	if (tick_nohz_full_cpu(smp_processor_id()) ||
> > +	    (is_idle_task(current) && (irq_count() == HARDIRQ_OFFSET)))
> >  		tick_irq_enter();
> >  
> >  	account_hardirq_enter(current);
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index 6bffe5af8cb11..17a283ce2b20f 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -1375,6 +1375,13 @@ static inline void tick_nohz_irq_enter(void)
> >  	now = ktime_get();
> >  	if (ts->idle_active)
> >  		tick_nohz_stop_idle(ts, now);
> > +	/*
> > +	 * If all CPUs are idle. We may need to update a stale jiffies value.
> > +	 * Note nohz_full is a special case: a timekeeper is guaranteed to stay
> > +	 * alive but it might be busy looping with interrupts disabled in some
> > +	 * rare case (typically stop machine). So we must make sure we have a
> > +	 * last resort.
> > +	 */
> >  	if (ts->tick_stopped)
> >  		tick_nohz_update_jiffies(now);
> >  }

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-25 18:27 Possible reproduction of CSD locking issue Corey Minyard
  2022-01-25 18:53 ` Paul E. McKenney
@ 2022-01-26  6:08 ` Juergen Gross
  2022-01-26 13:56   ` Corey Minyard
  1 sibling, 1 reply; 10+ messages in thread
From: Juergen Gross @ 2022-01-26  6:08 UTC (permalink / raw)
  To: minyard, Ingo Molnar, Paul E. McKenney; +Cc: LKML


On 25.01.22 19:27, Corey Minyard wrote:
> We have a customer that had been seeing CSD lock issues on a Centos 7
> kernel (unfortunately).  I couldn't find anything or any kernel changes
> that might fix it, so I was considering it might be the CSD locking issue you
> have been chasing for a while.

Is this on bare metal or in a virtualized environment?

The bug I was hunting occurred when running under Xen, and in the end I
found a scheduling problem in the Xen hypervisor.

> 
> So I backported the debug patches.  And of course, they stopped seeing
> the issue, at least as much, and they had trouble with the extra CPU
> time the debug code took.  But they just reproduced it.  Here are the
> logs:
> 
> Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
> Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
> Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
> Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
> Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
> Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
> Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
> Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
> Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
> Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
> Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
> Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
> Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
> Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
> Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
> Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
> Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
> Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
> Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
> Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34
> Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
> Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
> Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
> Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
> Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
> Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
> Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
> Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
> Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
> Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
> Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
> Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
> Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
> Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
> Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
> Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> 
> Hopefully this is the issue you are chasing and not something else.
> I've been studying them to see what they mean, but I thought you might
> be interested to get them asap.

As said before: I've solved my problem.

But I can share some more insight from my experience:

Don't be so sure that resending the IPI woke up the cpu again. It might
have been the NMI for getting the target cpu's stacktrace which resulted
in letting it run again. I missed this possibility in the beginning and
this was the reason I dismissed the Xen scheduling issue too early.


Juergen


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-26  6:08 ` Juergen Gross
@ 2022-01-26 13:56   ` Corey Minyard
  2022-01-26 14:12     ` Corey Minyard
  2022-01-26 14:51     ` Juergen Gross
  0 siblings, 2 replies; 10+ messages in thread
From: Corey Minyard @ 2022-01-26 13:56 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ingo Molnar, Paul E. McKenney, LKML

On Wed, Jan 26, 2022 at 07:08:22AM +0100, Juergen Gross wrote:
> On 25.01.22 19:27, Corey Minyard wrote:
> > We have a customer that had been seeing CSD lock issues on a Centos 7
> > kernel (unfortunately).  I couldn't find anything or any kernel changes
> > that might fix it, so I was considering it might be the CSD locking issue you
> > have been chasing for a while.
> 
> Is this on bare metal or in a virtualized environment?

This is bare metal.

I do think I know what happened.  Here's my analysis...

csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
 csd: CSD lock (#1) unresponsive.
 csd: cnt(0000000): 0000->0000 queue
 csd: cnt(0000001): ffff->0037 idle

The above means that these events weren't seen, I think.  We can
ignore them in any case.

 csd: cnt(63d8dd8): 0003->0037 ipi
 csd: cnt(63d8dd9): 0003->0037 ping
 csd: cnt(63d8dda): 0003->ffff pinged

This is a little confusing.  The first two lines have to be from
__smp_call_single_queue_debug.  The last line has to be from
smp_call_function_many.  But you never see the pinged from
__smp_call_single_queue_debug.

 csd: cnt(63d8dea): 0035->0037 pinged

The tail of CPU 53 sending an IPI to CPU 55 in
__smp_call_single_queue_debug.

 csd: cnt(63d8deb): ffff->0037 gotipi
 csd: cnt(63d8dec): ffff->0037 handle
 csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
 csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)

CPU 55 is handling the IPI(s) it was sent earlier.

 csd: cnt(63d8e1f): 0003->0037 queue
 csd: cnt(63d8e20): 0003->0037 ipi
 csd: cnt(63d8e21): 0003->0037 ping

In __smp_call_single_queue_debug CPU 3 sends another message to
CPU 55 and sends an IPI.  But there should be a pinged entry
after this.

 csd: cnt(63d8e22): 0003->0037 queue
 csd: cnt(63d8e23): 0003->0037 noipi

In __smp_call_single_queue_debug CPU 3 sends another message to
CPU 55, but no IPI is required because one is already pending.

I deleted the timestamps and the rest of the log, but a second
IPI is sent and CPU 55 wakes up immediately.  That tells me that
CPU 55 wasn't stuck for some other reason (an interrupt storm, a long time
with interrupts disabled).  Though maybe not, given your insight
below?

Looking at the code, it appears this code runs with interrupts and
preemption enabled, so the calling task can be switched out at any time here.  I can
only see one possibility from the above sequence:

* CPU 3 queues the list item but is switched out before sending
  the IPI (no pinged in the sequence).

* Another thread on CPU 3 is switched in and finishes a send
  many.

* CPU 53 sends an IPI to CPU 55 and wakes it up.

* CPU 55 handles the IPI.

* Another thread on CPU 3 then sends another message to CPU 55,
  enqueues the link, but is switched out before actually sending
  the IPI (again, no pinged in the sequence).

* Yet another thread on CPU 3 sends a message, but the item
  is already queued and thus it does not send an IPI.

* Something on CPU 3 prevents the first two threads from sending
  the IPI.

So assuming it's not a hardware bug, it appears the problem is
process starvation on CPU 3.  Knowing our customer's
architecture, this is actually a possibility.  This could be
pretty easily solved by turning off interrupts (or preemption?) around
the code that enqueues the link and sends the IPI.
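
For reference, the shape of the sender path being reasoned about, reduced
to a sketch (queue_and_kick() and send_ipi() are invented names; per the
debug patches, the real path is __smp_call_single_queue_debug() doing an
llist_add() and, only when the list was previously empty,
send_call_function_single_ipi()):

#include <linux/llist.h>
#include <linux/types.h>

void send_ipi(unsigned int cpu);        /* stand-in for send_call_function_single_ipi() */

static void queue_and_kick(struct llist_head *dst_queue,
                           struct llist_node *node, unsigned int dst_cpu)
{
        /* Step 1: publish the request on the destination CPU's queue. */
        bool first = llist_add(node, dst_queue);

        /*
         * Hypothesized window: if the sender were delayed here, the entry
         * would sit queued with no IPI sent, and later senders would see a
         * non-empty queue and skip the IPI ("noipi" in the trace above).
         */

        /* Step 2: only the enqueuer that found the queue empty pings the target. */
        if (first)
                send_ipi(dst_cpu);
}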

> 
> The bug I was hunting occurred when running under Xen, and in the end I
> found a scheduling problem in the Xen hypervisor.
> 
> > 
> > So I backported the debug patches.  And of course, they stopped seeing
> > the issue, at least as much, and they had trouble with the extra CPU
> > time the debug code took.  But they just reproduced it.  Here are the
> > logs:
> > 
> > Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> > Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
> > Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
> > Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
> > Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
> > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
> > Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
> > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
> > Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
> > Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
> > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
> > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
> > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
> > Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> > Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
> > Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> > Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
> > Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
> > Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
> > Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34
> > Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> > Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> > Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> > Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> > Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
> > Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
> > Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
> > Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
> > Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
> > Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
> > Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
> > Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
> > Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
> > Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
> > Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
> > Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
> > Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
> > Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
> > Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
> > Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> > 
> > Hopefully this is the issue you are chasing and not something else.
> > I've been studying them to see what they mean, but I thought you might
> > be interested to get them asap.
> 
> As said before: I've solved my problem.
> 
> But I can share some more insight from my experience:
> 
> Don't be so sure that resending the IPI woke up the cpu again. It might
> have been the NMI for getting the target cpu's stacktrace which resulted
> in letting it run again. I missed this possibility in the beginning and
> this was the reason I dismissed the Xen scheduling issue too early.

Hmm.  That's non-obvious, obviously.  I'm not sure how sending an NMI
would unstick something like this, but I'll take your word for it.
I think my above analysis is correct based on the sequence, but this
gives me something else to consider.

Thanks for the input.

-corey

> 
> 
> Juergen
^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-26 13:56   ` Corey Minyard
@ 2022-01-26 14:12     ` Corey Minyard
  2022-01-26 14:51     ` Juergen Gross
  1 sibling, 0 replies; 10+ messages in thread
From: Corey Minyard @ 2022-01-26 14:12 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ingo Molnar, Paul E. McKenney, LKML

On Wed, Jan 26, 2022 at 07:56:43AM -0600, Corey Minyard wrote:
> On Wed, Jan 26, 2022 at 07:08:22AM +0100, Juergen Gross wrote:
> > On 25.01.22 19:27, Corey Minyard wrote:
> > > We have a customer that had been seeing CSD lock issues on a Centos 7
> > > kernel (unfortunately).  I couldn't find anything or any kernel changes
> > > that might fix it, so I was considering it might be the CSD locking issue you
> > > have been chasing for a while.
> > 
> > Is this on bare metal or in a virtualized environment?
> 
> This is bare metal.
> 
> I do think I know what happened.  Here's my analysis...

Well, never mind, I missed that preemption is disabled in this code.
The lack of "pinged" entries is still a little confusing.
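
(For anyone following along: in mainline, smp_call_function_single()
brackets the whole enqueue-plus-IPI sequence with get_cpu()/put_cpu(),
which is what disables preemption here.  A rough sketch of that shape,
not the exact 3.10-rt code:

  int smp_call_function_single(int cpu, smp_call_func_t func,
                               void *info, int wait)
  {
      int this_cpu = get_cpu();    /* disables preemption */

      /*
       * ... this_cpu is used for the local fast path, the csd is
       * queued on the target CPU's list, and an IPI is sent if
       * that list was previously empty; optionally wait for
       * completion ...
       */

      put_cpu();                   /* re-enables preemption */
      return 0;
  }

So the sending task cannot be scheduled out between the "queue" and
"ping" trace events.)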

-corey

> 
> csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
>  csd: CSD lock (#1) unresponsive.
>  csd: cnt(0000000): 0000->0000 queue
>  csd: cnt(0000001): ffff->0037 idle
> 
> The above means that these events weren't seen, I think.  We can
> ignore them in any case.
> 
>  csd: cnt(63d8dd8): 0003->0037 ipi
>  csd: cnt(63d8dd9): 0003->0037 ping
>  csd: cnt(63d8dda): 0003->ffff pinged
> 
> This is a little confusing.  The first two lines have to be from
> __smp_call_single_queue_debug.  The last line has to be from
> smp_call_function_many.  But you never see the pinged from
> __smp_call_single_queue_debug.
> 
>  csd: cnt(63d8dea): 0035->0037 pinged
> 
> The tail of CPU 53 sending an IPI to CPU 55 in
> __smp_call_single_queue_debug.
> 
>  csd: cnt(63d8deb): ffff->0037 gotipi
>  csd: cnt(63d8dec): ffff->0037 handle
>  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
>  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> 
> CPU 55 is handling the IPI(s) it was sent earlier.
> 
>  csd: cnt(63d8e1f): 0003->0037 queue
>  csd: cnt(63d8e20): 0003->0037 ipi
>  csd: cnt(63d8e21): 0003->0037 ping
> 
> In __smp_call_single_queue_debug CPU 3 sends another message to
> CPU 55 and sends an IPI.  But there should be a pinged entry
> after this.
> 
>  csd: cnt(63d8e22): 0003->0037 queue
>  csd: cnt(63d8e23): 0003->0037 noipi
> 
> In __smp_call_single_queue_debug CPU 3 sends another message to
> CPU 55, but no IPI is required because one is already pending.
> 
> I deleted the timestamps and the rest of the log, but a second
> IPI is sent and CPU 55 wakes up immediately.  That tells me that
> CPU 55 wasn't stuck for some reason (interrupt storm, long time
> with interrupts disabled).  Though maybe not, with your insight
> below?
> 
> Looking at the code, it appears this code runs with interrupts and
> preemption enabled, so they can be switched out at any time here.  I can
> only see one possibility from the above sequence:
> 
> * CPU 3 queues the list item but is switched out before sending
>   the IPI (no pinged in the sequence).
> 
> * Another thread on CPU 3 is switched in and finishes a send
>   many.
> 
> * CPU 53 sends an IPI to CPU 55 and wakes it up.
> 
> * CPU 55 handles the IPI.
> 
> * Another thread on CPU 3 then sends another message to CPU 55,
>   enqueues the link, but is switched out before actually sending
>   the IPI (again, no pinged in the sequence).
> 
> * Yet another thread on CPU 3 sends a message, but the item
>   is already queued and thus it does not send an IPI.
> 
> * Something on CPU 3 prevents the first two threads from sending
>   the IPI.
> 
> So assuming it's not a hardware bug, it appears the problem is
> process starvation on CPU 3.  Knowing our customer's
> architecture, this is actually a possibility.  This could be
> pretty easily solved by turning off interrupts (or preemption?) around
> the code that enqueues the link and sends the IPI.
> 
> > 
> > The bug I was hunting occurred when running under Xen, and in the end I
> > found a scheduling problem in the Xen hypervisor.
> > 
> > > 
> > > So I backported the debug patches.  And of course, they stopped seeing
> > > the issue, at least as much, and they had trouble with the extra CPU
> > > time the debug code took.  But they just reproduced it.  Here are the
> > > logs:
> > > 
> > > Jan 23 23:39:43 worker0 kernel: [285737.522743] csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
> > > Jan 23 23:39:43 worker0 kernel: [285737.522744]  csd: CSD lock (#1) unresponsive.
> > > Jan 23 23:39:43 worker0 kernel: [285737.522747]  csd: cnt(0000000): 0000->0000 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522748]  csd: cnt(0000001): ffff->0037 idle
> > > Jan 23 23:39:43 worker0 kernel: [285737.522749]  csd: cnt(63d8dd8): 0003->0037 ipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dd9): 0003->0037 ping
> > > Jan 23 23:39:43 worker0 kernel: [285737.522750]  csd: cnt(63d8dda): 0003->ffff pinged
> > > Jan 23 23:39:43 worker0 kernel: [285737.522751]  csd: cnt(63d8dea): 0035->0037 pinged
> > > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8deb): ffff->0037 gotipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522752]  csd: cnt(63d8dec): ffff->0037 handle
> > > Jan 23 23:39:43 worker0 kernel: [285737.522753]  csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
> > > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> > > Jan 23 23:39:43 worker0 kernel: [285737.522754]  csd: cnt(63d8e1f): 0003->0037 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522755]  csd: cnt(63d8e20): 0003->0037 ipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e21): 0003->0037 ping
> > > Jan 23 23:39:43 worker0 kernel: [285737.522756]  csd: cnt(63d8e22): 0003->0037 queue
> > > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt(63d8e23): 0003->0037 noipi
> > > Jan 23 23:39:43 worker0 kernel: [285737.522757]  csd: cnt now: 63fe4cd
> > > Jan 23 23:39:43 worker0 kernel: [285737.522758] Task dump for CPU 55:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522761] kubelet         R  running task        0 277695      1 0x00080000
> > > Jan 23 23:39:43 worker0 kernel: [285737.522761] Call Trace:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522769]  [<ffffffff84376b6a>] ? __schedule+0x46a/0x990
> > > Jan 23 23:39:43 worker0 kernel: [285737.522774]  [<ffffffff83db6353>] ? context_tracking_user_enter+0x13/0x20
> > > Jan 23 23:39:43 worker0 kernel: [285737.522776]  [<ffffffff843775b5>] ? schedule_user+0x45/0x50
> > > Jan 23 23:39:43 worker0 kernel: [285737.522779]  [<ffffffff8437b518>] ? retint_careful+0x16/0x34
> > > Jan 23 23:39:43 worker0 kernel: [285737.522780] csd: Re-sending CSD lock (#1) IPI from CPU#03 to CPU#55
> > > Jan 23 23:39:43 worker0 kernel: [285737.522788] CPU: 3 PID: 54671 Comm: runc:[2:INIT] Kdump: loaded Tainted: G           OE  ------------ T 3.10.0-1062.12.1.rt56.1042.mvista.test.14.el7.x86_64 #1
> > > Jan 23 23:39:43 worker0 kernel: [285737.522789] Hardware name: Dell Inc. PowerEdge R740/0YWR7D, BIOS 2.9.4 11/06/2020
> > > Jan 23 23:39:43 worker0 kernel: [285737.522789] Call Trace:
> > > Jan 23 23:39:43 worker0 kernel: [285737.522793]  [<ffffffff843718ba>] dump_stack+0x19/0x1b
> > > Jan 23 23:39:43 worker0 kernel: [285737.522798]  [<ffffffff83d0bcd8>] __csd_lock_wait+0x1a8/0x2a0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522800]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522802]  [<ffffffff83d0bfa4>] smp_call_function_single+0xc4/0x1b0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522804]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522809]  [<ffffffff83e2684b>] ? page_counter_uncharge+0x3b/0x70
> > > Jan 23 23:39:43 worker0 kernel: [285737.522811]  [<ffffffff83d0c614>] smp_call_function_many+0x344/0x380
> > > Jan 23 23:39:43 worker0 kernel: [285737.522813]  [<ffffffff83c6d870>] ? leave_mm+0x120/0x120
> > > Jan 23 23:39:43 worker0 kernel: [285737.522816]  [<ffffffff83c6da38>] native_flush_tlb_others+0xb8/0xc0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522818]  [<ffffffff83c6dc25>] flush_tlb_page+0x65/0xf0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522821]  [<ffffffff83dfdf98>] ptep_clear_flush+0x68/0xa0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522825]  [<ffffffff83de6806>] wp_page_copy.isra.83+0x3d6/0x650
> > > Jan 23 23:39:43 worker0 kernel: [285737.522828]  [<ffffffff83de8cb4>] do_wp_page+0xb4/0x710
> > > Jan 23 23:39:43 worker0 kernel: [285737.522832]  [<ffffffff83decbb4>] handle_mm_fault+0x884/0x1340
> > > Jan 23 23:39:43 worker0 kernel: [285737.522835]  [<ffffffff83cd7799>] ? update_cfs_shares+0xa9/0xf0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522839]  [<ffffffff8437efc3>] __do_page_fault+0x213/0x5a0
> > > Jan 23 23:39:43 worker0 kernel: [285737.522841]  [<ffffffff8437f385>] do_page_fault+0x35/0x90
> > > Jan 23 23:39:43 worker0 kernel: [285737.522842]  [<ffffffff8437b728>] page_fault+0x28/0x30
> > > Jan 23 23:39:43 worker0 kernel: [285737.522845] csd: CSD lock (#1) got unstuck on CPU#03, CPU#55 released the lock.
> > > 
> > > Hopefully this is the issue you are chasing and not something else.
> > > I've been studying them to see what they mean, but I thought you might
> > > be interested to get them asap.
> > 
> > As said before: I've solved my problem.
> > 
> > But I can share some more insight from my experience:
> > 
> > Don't be so sure that resending the IPI woke up the cpu again. It might
> > have been the NMI for getting the target cpu's stacktrace which resulted
> > in letting it run again. I missed this possibility in the beginning and
> > this was the reason I dismissed the Xen scheduling issue too early.
> 
> Hmm.  That's non-obvious, obviously.  I'm not sure how sending an NMI
> would unstick something like this, but I'll take your word for it.
> I think my above analysis is correct based on the sequence, but this
> gives me something else to consider.
> 
> Thanks for the input.
> 
> -corey
> 
> > 
> > 
> > Juergen
> 
> 
> 
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-26 13:56   ` Corey Minyard
  2022-01-26 14:12     ` Corey Minyard
@ 2022-01-26 14:51     ` Juergen Gross
  2022-01-26 15:31       ` Corey Minyard
  1 sibling, 1 reply; 10+ messages in thread
From: Juergen Gross @ 2022-01-26 14:51 UTC (permalink / raw)
  To: minyard; +Cc: Ingo Molnar, Paul E. McKenney, LKML


[-- Attachment #1.1.1: Type: text/plain, Size: 3550 bytes --]

On 26.01.22 14:56, Corey Minyard wrote:
> On Wed, Jan 26, 2022 at 07:08:22AM +0100, Juergen Gross wrote:
>> On 25.01.22 19:27, Corey Minyard wrote:
>>> We have a customer that had been seeing CSD lock issues on a Centos 7
>>> kernel (unfortunately).  I couldn't find anything or any kernel changes
>>> that might fix it, so I was considering it was the CSD locking issue you
>>> have been chasing for a while.
>>
>> Is this on bare metal or in a virtualized environment?
> 
> This is bare metal.
> 
> I do think I know what happened.  Here's my analysis...
> 
> csd: Detected non-responsive CSD lock (#1) on CPU#3, waiting 5000000042 ns for CPU#55 flush_tlb_func+0x0/0xb0(0xffff8e0b3e2afbe8).
>   csd: CSD lock (#1) unresponsive.
>   csd: cnt(0000000): 0000->0000 queue
>   csd: cnt(0000001): ffff->0037 idle
> 
> The above means that these events weren't seen, I think.  We can
> ignore them in any case.
> 
>   csd: cnt(63d8dd8): 0003->0037 ipi
>   csd: cnt(63d8dd9): 0003->0037 ping
>   csd: cnt(63d8dda): 0003->ffff pinged
> 
> This is a little confusing.  The first two lines have to be from
> __smp_call_single_queue_debug.  The last line has to be from
> smp_call_function_many.  But you never see the pinged from
> __smp_call_single_queue_debug.

Be careful here. For each event-type/cpu combination there is only one
entry saved. It is still possible to see some events more than once,
as each entry holds information about two events (the location _where_
the entry is stored determines what just happened, and the _data_
stored at this position tells us what happened one event before that).

So any "missing" entry whose type/cpu combination is seen further down
in the trace _might_ simply have been overwritten by that later entry.

I've used this scheme instead of simple trace buffers as on a large
system any buffer would be overwritten multiple times before the CSD
timeout is over. This way at least the trace entries of the cpus
having problems (the sender and/or the receiver) are still available.
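
A rough, self-contained illustration of that single-slot scheme (plain
userspace C with hypothetical names, not the actual debug-patch code):

  #include <stdint.h>
  #include <stdio.h>

  enum evt { QUEUE, IPI, PING, PINGED, GOTIPI, EVT_NR };

  struct ent {
      uint64_t cnt;        /* sequence number of the previous event */
      uint16_t src, dst;   /* src/dst CPU of the previous event     */
      enum evt type;       /* type of the previous event            */
  };

  #define NR_CPUS 64

  static struct ent slot[EVT_NR][NR_CPUS]; /* one entry per type/cpu */
  static struct ent last = { 0, 0xffff, 0xffff, QUEUE };
  static uint64_t seq;

  /*
   * Recording an event overwrites the single slot for (type, dst)
   * with the event seen just before it: the slot's location tells
   * you the newest event of that type/cpu, its contents describe
   * the event preceding it.  A second event of the same type/cpu
   * simply overwrites the first, which is what makes entries look
   * "missing" in the dump.
   */
  static void record(enum evt type, uint16_t src, uint16_t dst)
  {
      slot[type][dst] = last;
      last = (struct ent){ ++seq, src, dst, type };
  }

  int main(void)
  {
      record(QUEUE, 3, 55);
      record(IPI,   3, 55);
      record(QUEUE, 3, 55);   /* overwrites the first QUEUE entry */
      printf("queue/55 slot: cnt=%llu %04x->%04x\n",
             (unsigned long long)slot[QUEUE][55].cnt,
             (unsigned)slot[QUEUE][55].src,
             (unsigned)slot[QUEUE][55].dst);
      return 0;
  }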

After having found a local reproducer for my problem I could enhance
the patches by using small arrays instead of single entries for each
trace entry location, producing better traces with fewer "missing"
entries. I didn't post those patches, as they were heavily
intertwined with hypervisor patches for producing traces covering
both kernel and hypervisor events.

> 
>   csd: cnt(63d8dea): 0035->0037 pinged
> 
> The tail of CPU 53 sending an IPI to CPU 55 in
> __smp_call_single_queue_debug.
> 
>   csd: cnt(63d8deb): ffff->0037 gotipi
>   csd: cnt(63d8dec): ffff->0037 handle
>   csd: cnt(63d8ded): ffff->0037 dequeue (src CPU 0 == empty)
>   csd: cnt(63d8dee): ffff->0037 hdlend (src CPU 0 == early)
> 
> CPU 55 is handling the IPI(s) it was sent earlier.

Right. Everything up to here is completely fine and can be ignored.

> 
>   csd: cnt(63d8e1f): 0003->0037 queue
>   csd: cnt(63d8e20): 0003->0037 ipi
>   csd: cnt(63d8e21): 0003->0037 ping
> 
> In __smp_call_single_queue_debug CPU 3 sends another message to
> CPU 55 and sends an IPI.  But there should be a pinged entry
> after this.
> 
>   csd: cnt(63d8e22): 0003->0037 queue
>   csd: cnt(63d8e23): 0003->0037 noipi

This is interesting. Those are 5 consecutive entries with none missing
in between (see the counter values). Could it be that after the ping
there was an interrupt and the code was re-entered to send another
IPI? That would clearly result in a hang like the one seen.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-26 14:51     ` Juergen Gross
@ 2022-01-26 15:31       ` Corey Minyard
  2022-01-26 15:34         ` Juergen Gross
  0 siblings, 1 reply; 10+ messages in thread
From: Corey Minyard @ 2022-01-26 15:31 UTC (permalink / raw)
  To: Juergen Gross; +Cc: Ingo Molnar, Paul E. McKenney, LKML

On Wed, Jan 26, 2022 at 03:51:36PM +0100, Juergen Gross wrote:
> On 26.01.22 14:56, Corey Minyard wrote:
> > On Wed, Jan 26, 2022 at 07:08:22AM +0100, Juergen Gross wrote:

snip..

> > 
> >   csd: cnt(63d8e1f): 0003->0037 queue
> >   csd: cnt(63d8e20): 0003->0037 ipi
> >   csd: cnt(63d8e21): 0003->0037 ping
> > 
> > In __smp_call_single_queue_debug CPU 3 sends another message to
> > CPU 55 and sends an IPI.  But there should be a pinged entry
> > after this.
> > 
> >   csd: cnt(63d8e22): 0003->0037 queue
> >   csd: cnt(63d8e23): 0003->0037 noipi
> 
> This is interesting. Those are 5 consecutive entries without any
> missing in between (see the counter values). Could it be that after
> the ping there was an interrupt and the code was re-entered for
> sending another IPI? This would clearly result in a hang as seen.

Since preempt is enabled, wouldn't it eventually come back to the first
thread and send the IPI?  Unless CPU 3 is stuck in an interrupt or
interrupt storm.

-corey

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Possible reproduction of CSD locking issue
  2022-01-26 15:31       ` Corey Minyard
@ 2022-01-26 15:34         ` Juergen Gross
  0 siblings, 0 replies; 10+ messages in thread
From: Juergen Gross @ 2022-01-26 15:34 UTC (permalink / raw)
  To: minyard; +Cc: Ingo Molnar, Paul E. McKenney, LKML


[-- Attachment #1.1.1: Type: text/plain, Size: 1270 bytes --]

On 26.01.22 16:31, Corey Minyard wrote:
> On Wed, Jan 26, 2022 at 03:51:36PM +0100, Juergen Gross wrote:
>> On 26.01.22 14:56, Corey Minyard wrote:
>>> On Wed, Jan 26, 2022 at 07:08:22AM +0100, Juergen Gross wrote:
> 
> snip..
> 
>>>
>>>    csd: cnt(63d8e1f): 0003->0037 queue
>>>    csd: cnt(63d8e20): 0003->0037 ipi
>>>    csd: cnt(63d8e21): 0003->0037 ping
>>>
>>> In __smp_call_single_queue_debug CPU 3 sends another message to
>>> CPU 55 and sends an IPI.  But there should be a pinged entry
>>> after this.
>>>
>>>    csd: cnt(63d8e22): 0003->0037 queue
>>>    csd: cnt(63d8e23): 0003->0037 noipi
>>
>> This is interesting. Those are 5 consecutive entries without any
>> missing in between (see the counter values). Could it be that after
>> the ping there was an interrupt and the code was re-entered for
>> sending another IPI? This would clearly result in a hang as seen.
> 
> Since preempt is enabled, wouldn't it eventually come back to the first
> thread and send the IPI?  Unless CPU 3 is stuck in an interrupt or
> interrupt storm.

With preempt disabled (you probably meant that), only an IPI sent from
interrupt context would be possible. And it would be stuck, of course,
as it would need to wait for the CSD lock.
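
To illustrate the scenario I have in mind (pseudo-code only; the
function and field names are approximations, not the exact 3.10 code
path):

  /* Task context on CPU 3, preemption disabled: */
  llist_add(&csd->node.llist, dst_list);          /* trace: queue */
  /* the list was empty, so an IPI is needed      -- trace: ipi, ping */
  /* <-- hard interrupt hits CPU 3 here, before the IPI is sent */

  /* Interrupt context on CPU 3: */
  smp_call_function_single(55, func, info, 1);
  /*
   * This enqueues its own csd, finds the list non-empty (trace:
   * queue, noipi) and then spins in csd_lock_wait() until CPU 55
   * clears the CSD lock.  But CPU 55 never got an IPI, so it never
   * flushes its queue, and the interrupted task context that still
   * has to send that IPI cannot run under the interrupt: a hang.
   */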


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2022-01-26 15:34 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-25 18:27 Possible reproduction of CSD locking issue Corey Minyard
2022-01-25 18:53 ` Paul E. McKenney
2022-01-25 21:52   ` Corey Minyard
2022-01-25 22:51     ` Paul E. McKenney
2022-01-26  6:08 ` Juergen Gross
2022-01-26 13:56   ` Corey Minyard
2022-01-26 14:12     ` Corey Minyard
2022-01-26 14:51     ` Juergen Gross
2022-01-26 15:31       ` Corey Minyard
2022-01-26 15:34         ` Juergen Gross
