linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] powerpc: watchdog fixes
@ 2021-11-04 16:10 Nicholas Piggin
  2021-11-04 16:10 ` [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race Nicholas Piggin
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-04 16:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurent Dufour, Nicholas Piggin

These are some watchdog fixes and improvements.

This has taken a while to post out because upstream printk code seems
to have a problem with indefinitely delaying NMI context printk while
the CPU remains stuck, so that confounded testing a bit. That might
require another watchdog change after I discuss the issue with upstream
printk maintainers.

Since v1:
- Fixes noticed by Laurent in v1.
- Correct the description of the ABBA deadlock I wrote incorrectly in
  v1.
- Made several other improvements (patches 2,4,5).

Thanks,
Nick

Nicholas Piggin (5):
  powerpc/watchdog: Fix missed watchdog reset due to memory ordering
    race
  powerpc/watchdog: Tighten non-atomic read-modify-write access
  powerpc/watchdog: Avoid holding wd_smp_lock over printk and
    smp_send_nmi_ipi
  powerpc/watchdog: Read TB close to where it is used
  powerpc/watchdog: Remove backtrace print from unstuck message

 arch/powerpc/kernel/watchdog.c | 183 +++++++++++++++++++++++++--------
 1 file changed, 142 insertions(+), 41 deletions(-)

-- 
2.23.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race
  2021-11-04 16:10 [PATCH v2 0/5] powerpc: watchdog fixes Nicholas Piggin
@ 2021-11-04 16:10 ` Nicholas Piggin
  2021-11-05  9:20   ` Laurent Dufour
  2021-11-04 16:10 ` [PATCH v2 2/5] powerpc/watchdog: Tighten non-atomic read-modify-write access Nicholas Piggin
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-04 16:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurent Dufour, Nicholas Piggin

It is possible for all CPUs to miss the pending cpumask becoming clear,
and then nobody resetting it, which will cause the lockup detector to
stop working. It will eventually expire, but watchdog_smp_panic will
avoid doing anything if the pending mask is clear and it will never be
reset.

Order the cpumask clear vs the subsequent test to close this race.

Add an extra check for an empty pending mask when the watchdog fires and
finds its bit still clear, to try to catch any other possible races or
bugs here and keep the watchdog working. The extra test in
arch_touch_nmi_watchdog is required to prevent the new warning from
firing off.

Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/watchdog.c | 36 +++++++++++++++++++++++++++++++++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index f9ea0e5357f9..be80071336a4 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
 {
 	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
 	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
+	/*
+	 * See wd_smp_clear_cpu_pending()
+	 */
+	smp_mb();
 	if (cpumask_empty(&wd_smp_cpus_pending)) {
 		wd_smp_last_reset_tb = tb;
 		cpumask_andnot(&wd_smp_cpus_pending,
@@ -215,13 +219,39 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 
 			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
 			wd_smp_unlock(&flags);
+		} else {
+			/*
+			 * The last CPU to clear pending should have reset the
+			 * watchdog, yet we find it empty here. This should not
+			 * happen but we can try to recover and avoid a false
+			 * positive if it does.
+			 */
+			if (WARN_ON_ONCE(cpumask_empty(&wd_smp_cpus_pending)))
+				goto none_pending;
 		}
 		return;
 	}
+
 	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
+
+	/*
+	 * Order the store to clear pending with the load(s) to check all
+	 * words in the pending mask to check they are all empty. This orders
+	 * with the same barrier on another CPU. This prevents two CPUs
+	 * clearing the last 2 pending bits, but neither seeing the other's
+	 * store when checking if the mask is empty, and missing an empty
+	 * mask, which ends with a false positive.
+	 */
+	smp_mb();
 	if (cpumask_empty(&wd_smp_cpus_pending)) {
 		unsigned long flags;
 
+none_pending:
+		/*
+		 * Double check under lock because more than one CPU could see
+		 * a clear mask with the lockless check after clearing their
+		 * pending bits.
+		 */
 		wd_smp_lock(&flags);
 		if (cpumask_empty(&wd_smp_cpus_pending)) {
 			wd_smp_last_reset_tb = tb;
@@ -312,8 +342,12 @@ void arch_touch_nmi_watchdog(void)
 {
 	unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
 	int cpu = smp_processor_id();
-	u64 tb = get_tb();
+	u64 tb;
 
+	if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
+		return;
+
+	tb = get_tb();
 	if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
 		per_cpu(wd_timer_tb, cpu) = tb;
 		wd_smp_clear_cpu_pending(cpu, tb);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 2/5] powerpc/watchdog: Tighten non-atomic read-modify-write access
  2021-11-04 16:10 [PATCH v2 0/5] powerpc: watchdog fixes Nicholas Piggin
  2021-11-04 16:10 ` [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race Nicholas Piggin
@ 2021-11-04 16:10 ` Nicholas Piggin
  2021-11-05 16:17   ` Laurent Dufour
  2021-11-04 16:10 ` [PATCH v2 3/5] powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi Nicholas Piggin
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-04 16:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurent Dufour, Nicholas Piggin

Most updates to wd_smp_cpus_pending are under lock except the watchdog
interrupt bit clear.

This can race with non-atomic RMW updates to the mask under lock, which
can happen in two instances:

Firstly, if another CPU detects this one is stuck, removes it from the
mask, mask becomes empty and is re-filled with non-atomic stores. This
is okay because it would re-fill the mask with this CPU's bit clear
anyway (because this CPU is now stuck), so it doesn't matter that the
bit clear update got "lost". Add a comment for this.

Secondly, if another CPU detects a different CPU is stuck and removes it
from the pending mask with a non-atomic store to bytes which also
include the bit of this CPU. This case can result in the bit clear being
lost and the end result being the bit is set. This should be so rare it
hardly matters, but to make things simpler to reason about just avoid
the non-atomic access for that case.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/watchdog.c | 36 ++++++++++++++++++++++++----------
 1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index be80071336a4..1d2623230297 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -131,10 +131,10 @@ static void wd_lockup_ipi(struct pt_regs *regs)
 	/* Do not panic from here because that can recurse into NMI IPI layer */
 }
 
-static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
+static bool set_cpu_stuck(int cpu, u64 tb)
 {
-	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
-	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
+	cpumask_set_cpu(cpu, &wd_smp_cpus_stuck);
+	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
 	/*
 	 * See wd_smp_clear_cpu_pending()
 	 */
@@ -144,11 +144,9 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
 		cpumask_andnot(&wd_smp_cpus_pending,
 				&wd_cpus_enabled,
 				&wd_smp_cpus_stuck);
+		return true;
 	}
-}
-static void set_cpu_stuck(int cpu, u64 tb)
-{
-	set_cpumask_stuck(cpumask_of(cpu), tb);
+	return false;
 }
 
 static void watchdog_smp_panic(int cpu, u64 tb)
@@ -177,15 +175,17 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 		 * get a backtrace on all of them anyway.
 		 */
 		for_each_cpu(c, &wd_smp_cpus_pending) {
+			bool empty;
 			if (c == cpu)
 				continue;
+			/* Take the stuck CPUs out of the watch group */
+			empty = set_cpu_stuck(c, tb);
 			smp_send_nmi_ipi(c, wd_lockup_ipi, 1000000);
+			if (empty)
+				break;
 		}
 	}
 
-	/* Take the stuck CPUs out of the watch group */
-	set_cpumask_stuck(&wd_smp_cpus_pending, tb);
-
 	wd_smp_unlock(&flags);
 
 	if (sysctl_hardlockup_all_cpu_backtrace)
@@ -232,6 +232,22 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 		return;
 	}
 
+	/*
+	 * All other updates to wd_smp_cpus_pending are performed under
+	 * wd_smp_lock. All of them are atomic except the case where the
+	 * mask becomes empty and is reset. This will not happen here because
+	 * cpu was tested to be in the bitmap (above), and a CPU only clears
+	 * its own bit. _Except_ in the case where another CPU has detected a
+	 * hard lockup on our CPU and takes us out of the pending mask. So in
+	 * normal operation there will be no race here, no problem.
+	 *
+	 * In the lockup case, this atomic clear-bit vs a store that refills
+	 * other bits in the accessed word wll not be a problem. The bit clear
+	 * is atomic so it will not cause the store to get lost, and the store
+	 * will never set this bit so it will not overwrite the bit clear. The
+	 * only way for a stuck CPU to return to the pending bitmap is to
+	 * become unstuck itself.
+	 */
 	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
 
 	/*
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 3/5] powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi
  2021-11-04 16:10 [PATCH v2 0/5] powerpc: watchdog fixes Nicholas Piggin
  2021-11-04 16:10 ` [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race Nicholas Piggin
  2021-11-04 16:10 ` [PATCH v2 2/5] powerpc/watchdog: Tighten non-atomic read-modify-write access Nicholas Piggin
@ 2021-11-04 16:10 ` Nicholas Piggin
  2021-11-04 16:10 ` [PATCH v2 4/5] powerpc/watchdog: Read TB close to where it is used Nicholas Piggin
  2021-11-04 16:10 ` [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message Nicholas Piggin
  4 siblings, 0 replies; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-04 16:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurent Dufour, Nicholas Piggin

There is a deadlock with the console_owner lock and the wd_smp_lock:

CPU x takes the console_owner lock
CPU y takes a watchdog timer interrupt and takes __wd_smp_lock
CPU x takes a soft-NMI interrupt, detects deadlock, spins on __wd_smp_lock
CPU y detects deadlock, tries to print something and spins on console_owner
-> deadlock

Change the watchdog locking scheme so wd_smp_lock protects the watchdog
internal data, but "reporting" (printing, issuing NMI IPIs, taking any
action outside of watchdog) uses a non-waiting exclusion. If a CPU detects
a problem but can not take the reporting lock, it just returns because
something else is already reporting. It will try again at some point.

Typically hard lockup watchdog report usefulness is not impacted due to
failure to spewing a large enough amount of data in as short a time as
possible, but by messages getting garbled.

Laurent debugged this and found the deadlock, and this patch is based on
his general approach to avoid expensive operations while holding the lock.
With the addition of the reporting exclusion.

Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
[np: rework to add reporting exclusion update changelog]
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/watchdog.c | 89 ++++++++++++++++++++++++++--------
 1 file changed, 70 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 1d2623230297..0265d27340f1 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -85,10 +85,32 @@ static DEFINE_PER_CPU(u64, wd_timer_tb);
 
 /* SMP checker bits */
 static unsigned long __wd_smp_lock;
+static unsigned long __wd_reporting;
 static cpumask_t wd_smp_cpus_pending;
 static cpumask_t wd_smp_cpus_stuck;
 static u64 wd_smp_last_reset_tb;
 
+/*
+ * Try to take the exclusive watchdog action / NMI IPI / printing lock.
+ * wd_smp_lock must be held. If this fails, we should return and wait
+ * for the watchdog to kick in again (or another CPU to trigger it).
+ */
+static bool wd_try_report(void)
+{
+	if (__wd_reporting)
+		return false;
+	__wd_reporting = 1;
+	return true;
+}
+
+/* End printing after successful wd_try_report. wd_smp_lock not required. */
+static void wd_end_reporting(void)
+{
+	smp_mb(); /* End printing "critical section" */
+	WARN_ON_ONCE(__wd_reporting == 0);
+	WRITE_ONCE(__wd_reporting, 0);
+}
+
 static inline void wd_smp_lock(unsigned long *flags)
 {
 	/*
@@ -151,6 +173,7 @@ static bool set_cpu_stuck(int cpu, u64 tb)
 
 static void watchdog_smp_panic(int cpu, u64 tb)
 {
+	static cpumask_t wd_smp_cpus_ipi; // protected by reporting
 	unsigned long flags;
 	int c;
 
@@ -160,11 +183,26 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 		goto out;
 	if (cpumask_test_cpu(cpu, &wd_smp_cpus_pending))
 		goto out;
-	if (cpumask_weight(&wd_smp_cpus_pending) == 0)
+	if (!wd_try_report())
 		goto out;
+	for_each_online_cpu(c) {
+		if (!cpumask_test_cpu(c, &wd_smp_cpus_pending))
+			continue;
+		if (c == cpu)
+			continue; // should not happen
+
+		__cpumask_set_cpu(c, &wd_smp_cpus_ipi);
+		if (set_cpu_stuck(c, tb))
+			break;
+	}
+	if (cpumask_empty(&wd_smp_cpus_ipi)) {
+		wd_end_reporting();
+		goto out;
+	}
+	wd_smp_unlock(&flags);
 
 	pr_emerg("CPU %d detected hard LOCKUP on other CPUs %*pbl\n",
-		 cpu, cpumask_pr_args(&wd_smp_cpus_pending));
+		 cpu, cpumask_pr_args(&wd_smp_cpus_ipi));
 	pr_emerg("CPU %d TB:%lld, last SMP heartbeat TB:%lld (%lldms ago)\n",
 		 cpu, tb, wd_smp_last_reset_tb,
 		 tb_to_ns(tb - wd_smp_last_reset_tb) / 1000000);
@@ -174,26 +212,20 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 		 * Try to trigger the stuck CPUs, unless we are going to
 		 * get a backtrace on all of them anyway.
 		 */
-		for_each_cpu(c, &wd_smp_cpus_pending) {
-			bool empty;
-			if (c == cpu)
-				continue;
-			/* Take the stuck CPUs out of the watch group */
-			empty = set_cpu_stuck(c, tb);
+		for_each_cpu(c, &wd_smp_cpus_ipi) {
 			smp_send_nmi_ipi(c, wd_lockup_ipi, 1000000);
-			if (empty)
-				break;
+			__cpumask_clear_cpu(c, &wd_smp_cpus_ipi);
 		}
-	}
-
-	wd_smp_unlock(&flags);
-
-	if (sysctl_hardlockup_all_cpu_backtrace)
+	} else {
 		trigger_allbutself_cpu_backtrace();
+		cpumask_clear(&wd_smp_cpus_ipi);
+	}
 
 	if (hardlockup_panic)
 		nmi_panic(NULL, "Hard LOCKUP");
 
+	wd_end_reporting();
+
 	return;
 
 out:
@@ -207,8 +239,6 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 			struct pt_regs *regs = get_irq_regs();
 			unsigned long flags;
 
-			wd_smp_lock(&flags);
-
 			pr_emerg("CPU %d became unstuck TB:%lld\n",
 				 cpu, tb);
 			print_irqtrace_events(current);
@@ -217,6 +247,7 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 			else
 				dump_stack();
 
+			wd_smp_lock(&flags);
 			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
 			wd_smp_unlock(&flags);
 		} else {
@@ -307,13 +338,28 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
 
 	tb = get_tb();
 	if (tb - per_cpu(wd_timer_tb, cpu) >= wd_panic_timeout_tb) {
+		/*
+		 * Taking wd_smp_lock here means it is a soft-NMI lock, which
+		 * means we can't take any regular or irqsafe spin locks while
+		 * holding this lock. This is why timers can't printk while
+		 * holding the lock.
+		 */
 		wd_smp_lock(&flags);
 		if (cpumask_test_cpu(cpu, &wd_smp_cpus_stuck)) {
 			wd_smp_unlock(&flags);
 			return 0;
 		}
+		if (!wd_try_report()) {
+			wd_smp_unlock(&flags);
+			/* Couldn't report, try again in 100ms */
+			mtspr(SPRN_DEC, 100 * tb_ticks_per_usec * 1000);
+			return 0;
+		}
+
 		set_cpu_stuck(cpu, tb);
 
+		wd_smp_unlock(&flags);
+
 		pr_emerg("CPU %d self-detected hard LOCKUP @ %pS\n",
 			 cpu, (void *)regs->nip);
 		pr_emerg("CPU %d TB:%lld, last heartbeat TB:%lld (%lldms ago)\n",
@@ -323,14 +369,19 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
 		print_irqtrace_events(current);
 		show_regs(regs);
 
-		wd_smp_unlock(&flags);
-
 		if (sysctl_hardlockup_all_cpu_backtrace)
 			trigger_allbutself_cpu_backtrace();
 
 		if (hardlockup_panic)
 			nmi_panic(regs, "Hard LOCKUP");
+
+		wd_end_reporting();
 	}
+	/*
+	 * We are okay to change DEC in soft_nmi_interrupt because the masked
+	 * handler has marked a DEC as pending, so the timer interrupt will be
+	 * replayed as soon as local irqs are enabled again.
+	 */
 	if (wd_panic_timeout_tb < 0x7fffffff)
 		mtspr(SPRN_DEC, wd_panic_timeout_tb);
 
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 4/5] powerpc/watchdog: Read TB close to where it is used
  2021-11-04 16:10 [PATCH v2 0/5] powerpc: watchdog fixes Nicholas Piggin
                   ` (2 preceding siblings ...)
  2021-11-04 16:10 ` [PATCH v2 3/5] powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi Nicholas Piggin
@ 2021-11-04 16:10 ` Nicholas Piggin
  2021-11-05 13:39   ` Laurent Dufour
  2021-11-04 16:10 ` [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message Nicholas Piggin
  4 siblings, 1 reply; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-04 16:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurent Dufour, Nicholas Piggin

When taking watchdog actions, printing messages, comparing and
re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
and under wd_smp_lock or printing lock (if applicable).

This should keep timebase mostly monotonic with kernel log messages, and
could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
something a long way in the past, and causing other CPUs to appear to be
stuck.

These additional TB reads are all slowpath (lockup has been detected),
so performance does not matter.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/watchdog.c | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 0265d27340f1..2444cd10b61a 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -94,6 +94,10 @@ static u64 wd_smp_last_reset_tb;
  * Try to take the exclusive watchdog action / NMI IPI / printing lock.
  * wd_smp_lock must be held. If this fails, we should return and wait
  * for the watchdog to kick in again (or another CPU to trigger it).
+ *
+ * Importantly, if hardlockup_panic is set, wd_try_report failure should
+ * not delay the panic, because whichever other CPU is reporting will
+ * call panic.
  */
 static bool wd_try_report(void)
 {
@@ -153,7 +157,7 @@ static void wd_lockup_ipi(struct pt_regs *regs)
 	/* Do not panic from here because that can recurse into NMI IPI layer */
 }
 
-static bool set_cpu_stuck(int cpu, u64 tb)
+static bool set_cpu_stuck(int cpu)
 {
 	cpumask_set_cpu(cpu, &wd_smp_cpus_stuck);
 	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
@@ -162,7 +166,7 @@ static bool set_cpu_stuck(int cpu, u64 tb)
 	 */
 	smp_mb();
 	if (cpumask_empty(&wd_smp_cpus_pending)) {
-		wd_smp_last_reset_tb = tb;
+		wd_smp_last_reset_tb = get_tb();
 		cpumask_andnot(&wd_smp_cpus_pending,
 				&wd_cpus_enabled,
 				&wd_smp_cpus_stuck);
@@ -171,14 +175,16 @@ static bool set_cpu_stuck(int cpu, u64 tb)
 	return false;
 }
 
-static void watchdog_smp_panic(int cpu, u64 tb)
+static void watchdog_smp_panic(int cpu)
 {
 	static cpumask_t wd_smp_cpus_ipi; // protected by reporting
 	unsigned long flags;
+	u64 tb;
 	int c;
 
 	wd_smp_lock(&flags);
 	/* Double check some things under lock */
+	tb = get_tb();
 	if ((s64)(tb - wd_smp_last_reset_tb) < (s64)wd_smp_panic_timeout_tb)
 		goto out;
 	if (cpumask_test_cpu(cpu, &wd_smp_cpus_pending))
@@ -192,7 +198,7 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 			continue; // should not happen
 
 		__cpumask_set_cpu(c, &wd_smp_cpus_ipi);
-		if (set_cpu_stuck(c, tb))
+		if (set_cpu_stuck(c))
 			break;
 	}
 	if (cpumask_empty(&wd_smp_cpus_ipi)) {
@@ -232,7 +238,7 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 	wd_smp_unlock(&flags);
 }
 
-static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
+static void wd_smp_clear_cpu_pending(int cpu)
 {
 	if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
 		if (unlikely(cpumask_test_cpu(cpu, &wd_smp_cpus_stuck))) {
@@ -240,7 +246,7 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 			unsigned long flags;
 
 			pr_emerg("CPU %d became unstuck TB:%lld\n",
-				 cpu, tb);
+				 cpu, get_tb());
 			print_irqtrace_events(current);
 			if (regs)
 				show_regs(regs);
@@ -301,7 +307,7 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
 		 */
 		wd_smp_lock(&flags);
 		if (cpumask_empty(&wd_smp_cpus_pending)) {
-			wd_smp_last_reset_tb = tb;
+			wd_smp_last_reset_tb = get_tb();
 			cpumask_andnot(&wd_smp_cpus_pending,
 					&wd_cpus_enabled,
 					&wd_smp_cpus_stuck);
@@ -316,10 +322,10 @@ static void watchdog_timer_interrupt(int cpu)
 
 	per_cpu(wd_timer_tb, cpu) = tb;
 
-	wd_smp_clear_cpu_pending(cpu, tb);
+	wd_smp_clear_cpu_pending(cpu);
 
 	if ((s64)(tb - wd_smp_last_reset_tb) >= (s64)wd_smp_panic_timeout_tb)
-		watchdog_smp_panic(cpu, tb);
+		watchdog_smp_panic(cpu);
 }
 
 DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
@@ -356,7 +362,7 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
 			return 0;
 		}
 
-		set_cpu_stuck(cpu, tb);
+		set_cpu_stuck(cpu);
 
 		wd_smp_unlock(&flags);
 
@@ -417,7 +423,7 @@ void arch_touch_nmi_watchdog(void)
 	tb = get_tb();
 	if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
 		per_cpu(wd_timer_tb, cpu) = tb;
-		wd_smp_clear_cpu_pending(cpu, tb);
+		wd_smp_clear_cpu_pending(cpu);
 	}
 }
 EXPORT_SYMBOL(arch_touch_nmi_watchdog);
@@ -475,7 +481,7 @@ static void stop_watchdog(void *arg)
 	cpumask_clear_cpu(cpu, &wd_cpus_enabled);
 	wd_smp_unlock(&flags);
 
-	wd_smp_clear_cpu_pending(cpu, get_tb());
+	wd_smp_clear_cpu_pending(cpu);
 }
 
 static int stop_watchdog_on_cpu(unsigned int cpu)
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message
  2021-11-04 16:10 [PATCH v2 0/5] powerpc: watchdog fixes Nicholas Piggin
                   ` (3 preceding siblings ...)
  2021-11-04 16:10 ` [PATCH v2 4/5] powerpc/watchdog: Read TB close to where it is used Nicholas Piggin
@ 2021-11-04 16:10 ` Nicholas Piggin
  2021-11-04 16:48   ` Laurent Dufour
  4 siblings, 1 reply; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-04 16:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Laurent Dufour, Nicholas Piggin

The watchdog unstuck message can't be serialised with other watchdog
messages because that might prevent watchdog reporting. This removes
the big backtrace from the unstuck message, which can get mixed with
other messages and confuse logs, and just prints a single line.

Signed-of-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/kernel/watchdog.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index 2444cd10b61a..5f69ba4de1f3 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -242,16 +242,10 @@ static void wd_smp_clear_cpu_pending(int cpu)
 {
 	if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
 		if (unlikely(cpumask_test_cpu(cpu, &wd_smp_cpus_stuck))) {
-			struct pt_regs *regs = get_irq_regs();
 			unsigned long flags;
 
 			pr_emerg("CPU %d became unstuck TB:%lld\n",
 				 cpu, get_tb());
-			print_irqtrace_events(current);
-			if (regs)
-				show_regs(regs);
-			else
-				dump_stack();
 
 			wd_smp_lock(&flags);
 			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
-- 
2.23.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message
  2021-11-04 16:10 ` [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message Nicholas Piggin
@ 2021-11-04 16:48   ` Laurent Dufour
  2021-11-05  1:28     ` Nicholas Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Laurent Dufour @ 2021-11-04 16:48 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
> The watchdog unstuck message can't be serialised with other watchdog
> messages because that might prevent watchdog reporting. This removes
> the big backtrace from the unstuck message, which can get mixed with
> other messages and confuse logs, and just prints a single line.

I'm not sure that's a good idea to remove the registers and backtrace here.
I agree that this output may interleaved (and usually it does), but it is also 
providing some good information about the culprit block of code. Usually, it's 
pointing the IRQ release code, and so the IRQ blocking one which are really useful.

I don't have a good way to prevent trace interleaving here, but I think 
interleaved traces are better here than nothing.

Thanks,
Laurent.

> Signed-of-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   arch/powerpc/kernel/watchdog.c | 6 ------
>   1 file changed, 6 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index 2444cd10b61a..5f69ba4de1f3 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -242,16 +242,10 @@ static void wd_smp_clear_cpu_pending(int cpu)
>   {
>   	if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
>   		if (unlikely(cpumask_test_cpu(cpu, &wd_smp_cpus_stuck))) {
> -			struct pt_regs *regs = get_irq_regs();
>   			unsigned long flags;
>   
>   			pr_emerg("CPU %d became unstuck TB:%lld\n",
>   				 cpu, get_tb());
> -			print_irqtrace_events(current);
> -			if (regs)
> -				show_regs(regs);
> -			else
> -				dump_stack();
>   
>   			wd_smp_lock(&flags);
>   			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message
  2021-11-04 16:48   ` Laurent Dufour
@ 2021-11-05  1:28     ` Nicholas Piggin
  0 siblings, 0 replies; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-05  1:28 UTC (permalink / raw)
  To: Laurent Dufour, linuxppc-dev

Excerpts from Laurent Dufour's message of November 5, 2021 2:48 am:
> Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
>> The watchdog unstuck message can't be serialised with other watchdog
>> messages because that might prevent watchdog reporting. This removes
>> the big backtrace from the unstuck message, which can get mixed with
>> other messages and confuse logs, and just prints a single line.
> 
> I'm not sure that's a good idea to remove the registers and backtrace here.
> I agree that this output may interleaved (and usually it does), but it is also 
> providing some good information about the culprit block of code. Usually, it's 
> pointing the IRQ release code, and so the IRQ blocking one which are really useful.

Okay, I was thinking that be inferred from the context usually, but 
sometimes it's not that easy which I guess is why I added it in the
first place.

> I don't have a good way to prevent trace interleaving here, but I think 
> interleaved traces are better here than nothing.

Okay we can leave this patch off.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race
  2021-11-04 16:10 ` [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race Nicholas Piggin
@ 2021-11-05  9:20   ` Laurent Dufour
  2021-11-05 11:46     ` Nicholas Piggin
  0 siblings, 1 reply; 13+ messages in thread
From: Laurent Dufour @ 2021-11-05  9:20 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
> It is possible for all CPUs to miss the pending cpumask becoming clear,
> and then nobody resetting it, which will cause the lockup detector to
> stop working. It will eventually expire, but watchdog_smp_panic will
> avoid doing anything if the pending mask is clear and it will never be
> reset.
> 
> Order the cpumask clear vs the subsequent test to close this race.
> 
> Add an extra check for an empty pending mask when the watchdog fires and
> finds its bit still clear, to try to catch any other possible races or
> bugs here and keep the watchdog working. The extra test in
> arch_touch_nmi_watchdog is required to prevent the new warning from
> firing off.
> 
> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   arch/powerpc/kernel/watchdog.c | 36 +++++++++++++++++++++++++++++++++-
>   1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index f9ea0e5357f9..be80071336a4 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>   {
>   	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
>   	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
> +	/*
> +	 * See wd_smp_clear_cpu_pending()
> +	 */
> +	smp_mb();
>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
>   		wd_smp_last_reset_tb = tb;
>   		cpumask_andnot(&wd_smp_cpus_pending,
> @@ -215,13 +219,39 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>   
>   			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>   			wd_smp_unlock(&flags);
> +		} else {
> +			/*
> +			 * The last CPU to clear pending should have reset the
> +			 * watchdog, yet we find it empty here. This should not
> +			 * happen but we can try to recover and avoid a false
> +			 * positive if it does.
> +			 */
> +			if (WARN_ON_ONCE(cpumask_empty(&wd_smp_cpus_pending)))
> +				goto none_pending;

I run a stress on my victim node on top of this patch and hit that warning:

[  C475] ------------[ cut here ]------------
[  C475] WARNING: CPU: 475 PID: 0 at 
/home/laurent/src/linux-ppc/arch/powerpc/kernel/watchdog.c:260 
wd_smp_clear_cpu_pending+0x320/0x4b0
[  C475] Modules linked in: rpadlpar_io rpaphp xt_tcpudp iptable_filter 
ip_tables x_tables xfs pseries_rng rng_core vmx_crypto gf128mul be2net fuse 
btrfs blake2b_generic libcrc32c xor zstd_compress lzo_compress raid6_pq 
dm_service_time crc32c_vpmsum lpfc crc_t10dif crct10dif_generic crct10dif_common 
dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
[  C475] CPU: 475 PID: 0 Comm: swapper/475 Kdump: loaded Not tainted 
5.15.0-rc2-ppc-bz192129+ #72
[  C475] NIP:  c00000000003d710 LR: c00000000003d478 CTR: c00000000003e2e0
[  C475] REGS: c00006b16026f420 TRAP: 0700   Not tainted  (5.15.0-rc2-ppc-bz192129+)
[  C475] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000222  XER: 20000000
[  C475] CFAR: c00000000003d480 IRQMASK: 3
[  C475] GPR00: c00000000003e3bc c00006b16026f6c0 c000000001b28700 0000000000000800
[  C475] GPR04: 0000000000000000 0000000000000800 0000000000000800 0000000000000000
[  C475] GPR08: 0000000000000000 0000000000000000 00000000000000f8 00013f06986272e7
[  C475] GPR12: c00000000003e2e0 c000000007d3df00 0000000000000000 000000001f043b60
[  C475] GPR16: c00006b1601b7b00 0000000000000000 c00000000003e2e0 0000000000000001
[  C475] GPR20: 0000347411d4cf28 c00007adbdb0a898 0000000000000001 0000000000000000
[  C475] GPR24: 0000000000000000 0000000000000003 c000000001b6d7d0 00013f0698627d84
[  C475] GPR28: c000000001bd05c8 c000000001bd05b8 c000000001bd06c8 00000000000001db
[  C475] NIP [c00000000003d710] wd_smp_clear_cpu_pending+0x320/0x4b0
[  C475] LR [c00000000003d478] wd_smp_clear_cpu_pending+0x88/0x4b0
[  C475] Call Trace:
[  C475] [c00006b16026f6c0] [0000000000000001] 0x1 (unreliable)
[  C475] [c00006b16026f770] [c00000000003e3bc] watchdog_timer_fn+0xdc/0x5a0
[  C475] [c00006b16026f840] [c000000000245a4c] __hrtimer_run_queues+0x49c/0x700
[  C475] [c00006b16026f8f0] [c000000000246c20] hrtimer_interrupt+0x110/0x310
[  C475] [c00006b16026f9a0] [c0000000000292f8] timer_interrupt+0x1e8/0x5a0
[  C475] [c00006b16026fa00] [c000000000009a00] decrementer_common_virt+0x210/0x220
[  C475] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
[  C475] NIP:  c0000000000e5dd0 LR: c000000000c18f04 CTR: 0000000000000000
[  C475] REGS: c00006b16026fa70 TRAP: 0900   Not tainted  (5.15.0-rc2-ppc-bz192129+)
[  C475] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000824  XER: 20000000
[  C475] CFAR: 0000000000000c00 IRQMASK: 0
[  C475] GPR00: 0000000000000000 c00006b16026fd10 c000000001b28700 0000000000000000
[  C475] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  C475] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
[  C475] GPR12: 000000000000ffff c000000007d3df00 0000000000000000 000000001f043b60
[  C475] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[  C475] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001a46cd0
[  C475] GPR24: c000000001b68e38 000034738d859946 0000000000000000 0000000000000001
[  C475] GPR28: 0000000000000000 0000000000000001 c000000001472360 c000000001472368
[  C475] NIP [c0000000000e5dd0] plpar_hcall_norets_notrace+0x18/0x2c
[  C475] LR [c000000000c18f04] check_and_cede_processor.part.2+0x24/0x70
[  C475] --- interrupt: 900
[  C475] [c00006b16026fd10] [c00007adbdb0a880] 0xc00007adbdb0a880 (unreliable)
[  C475] [c00006b16026fd70] [c000000000c194f4] dedicated_cede_loop+0x174/0x200
[  C475] [c00006b16026fdb0] [c000000000c15b2c] cpuidle_enter_state+0x3ac/0x6d0
[  C475] [c00006b16026fe20] [c000000000c15ef0] cpuidle_enter+0x50/0x70
[  C475] [c00006b16026fe60] [c0000000001a7f9c] call_cpuidle+0x4c/0x90
[  C475] [c00006b16026fe80] [c0000000001a84f0] do_idle+0x310/0x3c0
[  C475] [c00006b16026ff00] [c0000000001a8948] cpu_startup_entry+0x38/0x50
[  C475] [c00006b16026ff30] [c00000000005fb5c] start_secondary+0x2bc/0x2f0
[  C475] [c00006b16026ff90] [c00000000000d254] start_secondary_prolog+0x10/0x14
[  C475] Instruction dump:
[  C475] 48eb7049 60000000 e8610068 4bfffee4 392d0918 7c20492a 482c54f1 60000000
[  C475] 4bfffe4c 60000000 60000000 60000000 <0fe00000> fb210078 fb410080 fb610088
[  C475] irq event stamp: 0
[  C475] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
[  C475] hardirqs last disabled at (0): [<c00000000014342c>] 
copy_process+0x76c/0x1e00
[  C475] softirqs last  enabled at (0): [<c00000000014342c>] 
copy_process+0x76c/0x1e00
[  C475] softirqs last disabled at (0): [<0000000000000000>] 0x0
[  C475] ---[ end trace 6e8311d1692d057b ]---

I guess there is a possible race here between watchdog_timer_interrupt() and 
another CPU watchdog_smp_panic().

>   		}
>   		return;
>   	}
> +
>   	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
> +
> +	/*
> +	 * Order the store to clear pending with the load(s) to check all
> +	 * words in the pending mask to check they are all empty. This orders
> +	 * with the same barrier on another CPU. This prevents two CPUs
> +	 * clearing the last 2 pending bits, but neither seeing the other's
> +	 * store when checking if the mask is empty, and missing an empty
> +	 * mask, which ends with a false positive.
> +	 */
> +	smp_mb();
>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
>   		unsigned long flags;
>   
> +none_pending:
> +		/*
> +		 * Double check under lock because more than one CPU could see
> +		 * a clear mask with the lockless check after clearing their
> +		 * pending bits.
> +		 */
>   		wd_smp_lock(&flags);
>   		if (cpumask_empty(&wd_smp_cpus_pending)) {
>   			wd_smp_last_reset_tb = tb;
> @@ -312,8 +342,12 @@ void arch_touch_nmi_watchdog(void)
>   {
>   	unsigned long ticks = tb_ticks_per_usec * wd_timer_period_ms * 1000;
>   	int cpu = smp_processor_id();
> -	u64 tb = get_tb();
> +	u64 tb;
>   
> +	if (!cpumask_test_cpu(cpu, &watchdog_cpumask))
> +		return;
> +
> +	tb = get_tb();
>   	if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
>   		per_cpu(wd_timer_tb, cpu) = tb;
>   		wd_smp_clear_cpu_pending(cpu, tb);
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race
  2021-11-05  9:20   ` Laurent Dufour
@ 2021-11-05 11:46     ` Nicholas Piggin
  2021-11-05 12:15       ` Laurent Dufour
  0 siblings, 1 reply; 13+ messages in thread
From: Nicholas Piggin @ 2021-11-05 11:46 UTC (permalink / raw)
  To: Laurent Dufour, linuxppc-dev

Excerpts from Laurent Dufour's message of November 5, 2021 7:20 pm:
> Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
>> It is possible for all CPUs to miss the pending cpumask becoming clear,
>> and then nobody resetting it, which will cause the lockup detector to
>> stop working. It will eventually expire, but watchdog_smp_panic will
>> avoid doing anything if the pending mask is clear and it will never be
>> reset.
>> 
>> Order the cpumask clear vs the subsequent test to close this race.
>> 
>> Add an extra check for an empty pending mask when the watchdog fires and
>> finds its bit still clear, to try to catch any other possible races or
>> bugs here and keep the watchdog working. The extra test in
>> arch_touch_nmi_watchdog is required to prevent the new warning from
>> firing off.
>> 
>> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>>   arch/powerpc/kernel/watchdog.c | 36 +++++++++++++++++++++++++++++++++-
>>   1 file changed, 35 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
>> index f9ea0e5357f9..be80071336a4 100644
>> --- a/arch/powerpc/kernel/watchdog.c
>> +++ b/arch/powerpc/kernel/watchdog.c
>> @@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>>   {
>>   	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
>>   	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
>> +	/*
>> +	 * See wd_smp_clear_cpu_pending()
>> +	 */
>> +	smp_mb();
>>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
>>   		wd_smp_last_reset_tb = tb;
>>   		cpumask_andnot(&wd_smp_cpus_pending,
>> @@ -215,13 +219,39 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>>   
>>   			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>>   			wd_smp_unlock(&flags);
>> +		} else {
>> +			/*
>> +			 * The last CPU to clear pending should have reset the
>> +			 * watchdog, yet we find it empty here. This should not
>> +			 * happen but we can try to recover and avoid a false
>> +			 * positive if it does.
>> +			 */
>> +			if (WARN_ON_ONCE(cpumask_empty(&wd_smp_cpus_pending)))
>> +				goto none_pending;
> 
> I run a stress on my victim node on top of this patch and hit that warning:
> 
> [  C475] ------------[ cut here ]------------
> [  C475] WARNING: CPU: 475 PID: 0 at 
> /home/laurent/src/linux-ppc/arch/powerpc/kernel/watchdog.c:260 
> wd_smp_clear_cpu_pending+0x320/0x4b0
> [  C475] Modules linked in: rpadlpar_io rpaphp xt_tcpudp iptable_filter 
> ip_tables x_tables xfs pseries_rng rng_core vmx_crypto gf128mul be2net fuse 
> btrfs blake2b_generic libcrc32c xor zstd_compress lzo_compress raid6_pq 
> dm_service_time crc32c_vpmsum lpfc crc_t10dif crct10dif_generic crct10dif_common 
> dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
> [  C475] CPU: 475 PID: 0 Comm: swapper/475 Kdump: loaded Not tainted 
> 5.15.0-rc2-ppc-bz192129+ #72
> [  C475] NIP:  c00000000003d710 LR: c00000000003d478 CTR: c00000000003e2e0
> [  C475] REGS: c00006b16026f420 TRAP: 0700   Not tainted  (5.15.0-rc2-ppc-bz192129+)
> [  C475] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000222  XER: 20000000
> [  C475] CFAR: c00000000003d480 IRQMASK: 3
> [  C475] GPR00: c00000000003e3bc c00006b16026f6c0 c000000001b28700 0000000000000800
> [  C475] GPR04: 0000000000000000 0000000000000800 0000000000000800 0000000000000000
> [  C475] GPR08: 0000000000000000 0000000000000000 00000000000000f8 00013f06986272e7
> [  C475] GPR12: c00000000003e2e0 c000000007d3df00 0000000000000000 000000001f043b60
> [  C475] GPR16: c00006b1601b7b00 0000000000000000 c00000000003e2e0 0000000000000001
> [  C475] GPR20: 0000347411d4cf28 c00007adbdb0a898 0000000000000001 0000000000000000
> [  C475] GPR24: 0000000000000000 0000000000000003 c000000001b6d7d0 00013f0698627d84
> [  C475] GPR28: c000000001bd05c8 c000000001bd05b8 c000000001bd06c8 00000000000001db
> [  C475] NIP [c00000000003d710] wd_smp_clear_cpu_pending+0x320/0x4b0
> [  C475] LR [c00000000003d478] wd_smp_clear_cpu_pending+0x88/0x4b0
> [  C475] Call Trace:
> [  C475] [c00006b16026f6c0] [0000000000000001] 0x1 (unreliable)
> [  C475] [c00006b16026f770] [c00000000003e3bc] watchdog_timer_fn+0xdc/0x5a0
> [  C475] [c00006b16026f840] [c000000000245a4c] __hrtimer_run_queues+0x49c/0x700
> [  C475] [c00006b16026f8f0] [c000000000246c20] hrtimer_interrupt+0x110/0x310
> [  C475] [c00006b16026f9a0] [c0000000000292f8] timer_interrupt+0x1e8/0x5a0
> [  C475] [c00006b16026fa00] [c000000000009a00] decrementer_common_virt+0x210/0x220
> [  C475] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
> [  C475] NIP:  c0000000000e5dd0 LR: c000000000c18f04 CTR: 0000000000000000
> [  C475] REGS: c00006b16026fa70 TRAP: 0900   Not tainted  (5.15.0-rc2-ppc-bz192129+)
> [  C475] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000824  XER: 20000000
> [  C475] CFAR: 0000000000000c00 IRQMASK: 0
> [  C475] GPR00: 0000000000000000 c00006b16026fd10 c000000001b28700 0000000000000000
> [  C475] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  C475] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
> [  C475] GPR12: 000000000000ffff c000000007d3df00 0000000000000000 000000001f043b60
> [  C475] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [  C475] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001a46cd0
> [  C475] GPR24: c000000001b68e38 000034738d859946 0000000000000000 0000000000000001
> [  C475] GPR28: 0000000000000000 0000000000000001 c000000001472360 c000000001472368
> [  C475] NIP [c0000000000e5dd0] plpar_hcall_norets_notrace+0x18/0x2c
> [  C475] LR [c000000000c18f04] check_and_cede_processor.part.2+0x24/0x70
> [  C475] --- interrupt: 900
> [  C475] [c00006b16026fd10] [c00007adbdb0a880] 0xc00007adbdb0a880 (unreliable)
> [  C475] [c00006b16026fd70] [c000000000c194f4] dedicated_cede_loop+0x174/0x200
> [  C475] [c00006b16026fdb0] [c000000000c15b2c] cpuidle_enter_state+0x3ac/0x6d0
> [  C475] [c00006b16026fe20] [c000000000c15ef0] cpuidle_enter+0x50/0x70
> [  C475] [c00006b16026fe60] [c0000000001a7f9c] call_cpuidle+0x4c/0x90
> [  C475] [c00006b16026fe80] [c0000000001a84f0] do_idle+0x310/0x3c0
> [  C475] [c00006b16026ff00] [c0000000001a8948] cpu_startup_entry+0x38/0x50
> [  C475] [c00006b16026ff30] [c00000000005fb5c] start_secondary+0x2bc/0x2f0
> [  C475] [c00006b16026ff90] [c00000000000d254] start_secondary_prolog+0x10/0x14
> [  C475] Instruction dump:
> [  C475] 48eb7049 60000000 e8610068 4bfffee4 392d0918 7c20492a 482c54f1 60000000
> [  C475] 4bfffe4c 60000000 60000000 60000000 <0fe00000> fb210078 fb410080 fb610088
> [  C475] irq event stamp: 0
> [  C475] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
> [  C475] hardirqs last disabled at (0): [<c00000000014342c>] 
> copy_process+0x76c/0x1e00
> [  C475] softirqs last  enabled at (0): [<c00000000014342c>] 
> copy_process+0x76c/0x1e00
> [  C475] softirqs last disabled at (0): [<0000000000000000>] 0x0
> [  C475] ---[ end trace 6e8311d1692d057b ]---
> 
> I guess there is a possible race here between watchdog_timer_interrupt() and 
> another CPU watchdog_smp_panic().

Hmm, yeah of course there would be. May have to just remove that warn.

Thanks,
Nick


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race
  2021-11-05 11:46     ` Nicholas Piggin
@ 2021-11-05 12:15       ` Laurent Dufour
  0 siblings, 0 replies; 13+ messages in thread
From: Laurent Dufour @ 2021-11-05 12:15 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

Le 05/11/2021 à 12:46, Nicholas Piggin a écrit :
> Excerpts from Laurent Dufour's message of November 5, 2021 7:20 pm:
>> Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
>>> It is possible for all CPUs to miss the pending cpumask becoming clear,
>>> and then nobody resetting it, which will cause the lockup detector to
>>> stop working. It will eventually expire, but watchdog_smp_panic will
>>> avoid doing anything if the pending mask is clear and it will never be
>>> reset.
>>>
>>> Order the cpumask clear vs the subsequent test to close this race.
>>>
>>> Add an extra check for an empty pending mask when the watchdog fires and
>>> finds its bit still clear, to try to catch any other possible races or
>>> bugs here and keep the watchdog working. The extra test in
>>> arch_touch_nmi_watchdog is required to prevent the new warning from
>>> firing off.
>>>
>>> Debugged-by: Laurent Dufour <ldufour@linux.ibm.com>
>>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>>> ---
>>>    arch/powerpc/kernel/watchdog.c | 36 +++++++++++++++++++++++++++++++++-
>>>    1 file changed, 35 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
>>> index f9ea0e5357f9..be80071336a4 100644
>>> --- a/arch/powerpc/kernel/watchdog.c
>>> +++ b/arch/powerpc/kernel/watchdog.c
>>> @@ -135,6 +135,10 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>>>    {
>>>    	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
>>>    	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
>>> +	/*
>>> +	 * See wd_smp_clear_cpu_pending()
>>> +	 */
>>> +	smp_mb();
>>>    	if (cpumask_empty(&wd_smp_cpus_pending)) {
>>>    		wd_smp_last_reset_tb = tb;
>>>    		cpumask_andnot(&wd_smp_cpus_pending,
>>> @@ -215,13 +219,39 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>>>    
>>>    			cpumask_clear_cpu(cpu, &wd_smp_cpus_stuck);
>>>    			wd_smp_unlock(&flags);
>>> +		} else {
>>> +			/*
>>> +			 * The last CPU to clear pending should have reset the
>>> +			 * watchdog, yet we find it empty here. This should not
>>> +			 * happen but we can try to recover and avoid a false
>>> +			 * positive if it does.
>>> +			 */
>>> +			if (WARN_ON_ONCE(cpumask_empty(&wd_smp_cpus_pending)))
>>> +				goto none_pending;
>>
>> I run a stress on my victim node on top of this patch and hit that warning:
>>
>> [  C475] ------------[ cut here ]------------
>> [  C475] WARNING: CPU: 475 PID: 0 at
>> /home/laurent/src/linux-ppc/arch/powerpc/kernel/watchdog.c:260
>> wd_smp_clear_cpu_pending+0x320/0x4b0
>> [  C475] Modules linked in: rpadlpar_io rpaphp xt_tcpudp iptable_filter
>> ip_tables x_tables xfs pseries_rng rng_core vmx_crypto gf128mul be2net fuse
>> btrfs blake2b_generic libcrc32c xor zstd_compress lzo_compress raid6_pq
>> dm_service_time crc32c_vpmsum lpfc crc_t10dif crct10dif_generic crct10dif_common
>> dm_mirror dm_region_hash dm_log dm_multipath scsi_dh_rdac scsi_dh_alua autofs4
>> [  C475] CPU: 475 PID: 0 Comm: swapper/475 Kdump: loaded Not tainted
>> 5.15.0-rc2-ppc-bz192129+ #72
>> [  C475] NIP:  c00000000003d710 LR: c00000000003d478 CTR: c00000000003e2e0
>> [  C475] REGS: c00006b16026f420 TRAP: 0700   Not tainted  (5.15.0-rc2-ppc-bz192129+)
>> [  C475] MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000222  XER: 20000000
>> [  C475] CFAR: c00000000003d480 IRQMASK: 3
>> [  C475] GPR00: c00000000003e3bc c00006b16026f6c0 c000000001b28700 0000000000000800
>> [  C475] GPR04: 0000000000000000 0000000000000800 0000000000000800 0000000000000000
>> [  C475] GPR08: 0000000000000000 0000000000000000 00000000000000f8 00013f06986272e7
>> [  C475] GPR12: c00000000003e2e0 c000000007d3df00 0000000000000000 000000001f043b60
>> [  C475] GPR16: c00006b1601b7b00 0000000000000000 c00000000003e2e0 0000000000000001
>> [  C475] GPR20: 0000347411d4cf28 c00007adbdb0a898 0000000000000001 0000000000000000
>> [  C475] GPR24: 0000000000000000 0000000000000003 c000000001b6d7d0 00013f0698627d84
>> [  C475] GPR28: c000000001bd05c8 c000000001bd05b8 c000000001bd06c8 00000000000001db
>> [  C475] NIP [c00000000003d710] wd_smp_clear_cpu_pending+0x320/0x4b0
>> [  C475] LR [c00000000003d478] wd_smp_clear_cpu_pending+0x88/0x4b0
>> [  C475] Call Trace:
>> [  C475] [c00006b16026f6c0] [0000000000000001] 0x1 (unreliable)
>> [  C475] [c00006b16026f770] [c00000000003e3bc] watchdog_timer_fn+0xdc/0x5a0
>> [  C475] [c00006b16026f840] [c000000000245a4c] __hrtimer_run_queues+0x49c/0x700
>> [  C475] [c00006b16026f8f0] [c000000000246c20] hrtimer_interrupt+0x110/0x310
>> [  C475] [c00006b16026f9a0] [c0000000000292f8] timer_interrupt+0x1e8/0x5a0
>> [  C475] [c00006b16026fa00] [c000000000009a00] decrementer_common_virt+0x210/0x220
>> [  C475] --- interrupt: 900 at plpar_hcall_norets_notrace+0x18/0x2c
>> [  C475] NIP:  c0000000000e5dd0 LR: c000000000c18f04 CTR: 0000000000000000
>> [  C475] REGS: c00006b16026fa70 TRAP: 0900   Not tainted  (5.15.0-rc2-ppc-bz192129+)
>> [  C475] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 22000824  XER: 20000000
>> [  C475] CFAR: 0000000000000c00 IRQMASK: 0
>> [  C475] GPR00: 0000000000000000 c00006b16026fd10 c000000001b28700 0000000000000000
>> [  C475] GPR04: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [  C475] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
>> [  C475] GPR12: 000000000000ffff c000000007d3df00 0000000000000000 000000001f043b60
>> [  C475] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [  C475] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000001a46cd0
>> [  C475] GPR24: c000000001b68e38 000034738d859946 0000000000000000 0000000000000001
>> [  C475] GPR28: 0000000000000000 0000000000000001 c000000001472360 c000000001472368
>> [  C475] NIP [c0000000000e5dd0] plpar_hcall_norets_notrace+0x18/0x2c
>> [  C475] LR [c000000000c18f04] check_and_cede_processor.part.2+0x24/0x70
>> [  C475] --- interrupt: 900
>> [  C475] [c00006b16026fd10] [c00007adbdb0a880] 0xc00007adbdb0a880 (unreliable)
>> [  C475] [c00006b16026fd70] [c000000000c194f4] dedicated_cede_loop+0x174/0x200
>> [  C475] [c00006b16026fdb0] [c000000000c15b2c] cpuidle_enter_state+0x3ac/0x6d0
>> [  C475] [c00006b16026fe20] [c000000000c15ef0] cpuidle_enter+0x50/0x70
>> [  C475] [c00006b16026fe60] [c0000000001a7f9c] call_cpuidle+0x4c/0x90
>> [  C475] [c00006b16026fe80] [c0000000001a84f0] do_idle+0x310/0x3c0
>> [  C475] [c00006b16026ff00] [c0000000001a8948] cpu_startup_entry+0x38/0x50
>> [  C475] [c00006b16026ff30] [c00000000005fb5c] start_secondary+0x2bc/0x2f0
>> [  C475] [c00006b16026ff90] [c00000000000d254] start_secondary_prolog+0x10/0x14
>> [  C475] Instruction dump:
>> [  C475] 48eb7049 60000000 e8610068 4bfffee4 392d0918 7c20492a 482c54f1 60000000
>> [  C475] 4bfffe4c 60000000 60000000 60000000 <0fe00000> fb210078 fb410080 fb610088
>> [  C475] irq event stamp: 0
>> [  C475] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
>> [  C475] hardirqs last disabled at (0): [<c00000000014342c>]
>> copy_process+0x76c/0x1e00
>> [  C475] softirqs last  enabled at (0): [<c00000000014342c>]
>> copy_process+0x76c/0x1e00
>> [  C475] softirqs last disabled at (0): [<0000000000000000>] 0x0
>> [  C475] ---[ end trace 6e8311d1692d057b ]---
>>
>> I guess there is a possible race here between watchdog_timer_interrupt() and
>> another CPU watchdog_smp_panic().
> 
> Hmm, yeah of course there would be. May have to just remove that warn.

Yes I do agree, it should be removed.

Cheers,
Laurent.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 4/5] powerpc/watchdog: Read TB close to where it is used
  2021-11-04 16:10 ` [PATCH v2 4/5] powerpc/watchdog: Read TB close to where it is used Nicholas Piggin
@ 2021-11-05 13:39   ` Laurent Dufour
  0 siblings, 0 replies; 13+ messages in thread
From: Laurent Dufour @ 2021-11-05 13:39 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
> When taking watchdog actions, printing messages, comparing and
> re-setting wd_smp_last_reset_tb, etc., read TB close to the point of use
> and under wd_smp_lock or printing lock (if applicable).
> 
> This should keep timebase mostly monotonic with kernel log messages, and
> could prevent (in theory) a laggy CPU updating wd_smp_last_reset_tb to
> something a long way in the past, and causing other CPUs to appear to be
> stuck.
> 
> These additional TB reads are all slowpath (lockup has been detected),
> so performance does not matter.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>   arch/powerpc/kernel/watchdog.c | 30 ++++++++++++++++++------------
>   1 file changed, 18 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index 0265d27340f1..2444cd10b61a 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -94,6 +94,10 @@ static u64 wd_smp_last_reset_tb;
>    * Try to take the exclusive watchdog action / NMI IPI / printing lock.
>    * wd_smp_lock must be held. If this fails, we should return and wait
>    * for the watchdog to kick in again (or another CPU to trigger it).
> + *
> + * Importantly, if hardlockup_panic is set, wd_try_report failure should
> + * not delay the panic, because whichever other CPU is reporting will
> + * call panic.
>    */

I guess this comment should be part of the previous commit in this series.

Despite that, please consider

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>

>   static bool wd_try_report(void)
>   {
> @@ -153,7 +157,7 @@ static void wd_lockup_ipi(struct pt_regs *regs)
>   	/* Do not panic from here because that can recurse into NMI IPI layer */
>   }
>   
> -static bool set_cpu_stuck(int cpu, u64 tb)
> +static bool set_cpu_stuck(int cpu)
>   {
>   	cpumask_set_cpu(cpu, &wd_smp_cpus_stuck);
>   	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
> @@ -162,7 +166,7 @@ static bool set_cpu_stuck(int cpu, u64 tb)
>   	 */
>   	smp_mb();
>   	if (cpumask_empty(&wd_smp_cpus_pending)) {
> -		wd_smp_last_reset_tb = tb;
> +		wd_smp_last_reset_tb = get_tb();
>   		cpumask_andnot(&wd_smp_cpus_pending,
>   				&wd_cpus_enabled,
>   				&wd_smp_cpus_stuck);
> @@ -171,14 +175,16 @@ static bool set_cpu_stuck(int cpu, u64 tb)
>   	return false;
>   }
>   
> -static void watchdog_smp_panic(int cpu, u64 tb)
> +static void watchdog_smp_panic(int cpu)
>   {
>   	static cpumask_t wd_smp_cpus_ipi; // protected by reporting
>   	unsigned long flags;
> +	u64 tb;
>   	int c;
>   
>   	wd_smp_lock(&flags);
>   	/* Double check some things under lock */
> +	tb = get_tb();
>   	if ((s64)(tb - wd_smp_last_reset_tb) < (s64)wd_smp_panic_timeout_tb)
>   		goto out;
>   	if (cpumask_test_cpu(cpu, &wd_smp_cpus_pending))
> @@ -192,7 +198,7 @@ static void watchdog_smp_panic(int cpu, u64 tb)
>   			continue; // should not happen
>   
>   		__cpumask_set_cpu(c, &wd_smp_cpus_ipi);
> -		if (set_cpu_stuck(c, tb))
> +		if (set_cpu_stuck(c))
>   			break;
>   	}
>   	if (cpumask_empty(&wd_smp_cpus_ipi)) {
> @@ -232,7 +238,7 @@ static void watchdog_smp_panic(int cpu, u64 tb)
>   	wd_smp_unlock(&flags);
>   }
>   
> -static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
> +static void wd_smp_clear_cpu_pending(int cpu)
>   {
>   	if (!cpumask_test_cpu(cpu, &wd_smp_cpus_pending)) {
>   		if (unlikely(cpumask_test_cpu(cpu, &wd_smp_cpus_stuck))) {
> @@ -240,7 +246,7 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>   			unsigned long flags;
>   
>   			pr_emerg("CPU %d became unstuck TB:%lld\n",
> -				 cpu, tb);
> +				 cpu, get_tb());
>   			print_irqtrace_events(current);
>   			if (regs)
>   				show_regs(regs);
> @@ -301,7 +307,7 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>   		 */
>   		wd_smp_lock(&flags);
>   		if (cpumask_empty(&wd_smp_cpus_pending)) {
> -			wd_smp_last_reset_tb = tb;
> +			wd_smp_last_reset_tb = get_tb();
>   			cpumask_andnot(&wd_smp_cpus_pending,
>   					&wd_cpus_enabled,
>   					&wd_smp_cpus_stuck);
> @@ -316,10 +322,10 @@ static void watchdog_timer_interrupt(int cpu)
>   
>   	per_cpu(wd_timer_tb, cpu) = tb;
>   
> -	wd_smp_clear_cpu_pending(cpu, tb);
> +	wd_smp_clear_cpu_pending(cpu);
>   
>   	if ((s64)(tb - wd_smp_last_reset_tb) >= (s64)wd_smp_panic_timeout_tb)
> -		watchdog_smp_panic(cpu, tb);
> +		watchdog_smp_panic(cpu);
>   }
>   
>   DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
> @@ -356,7 +362,7 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
>   			return 0;
>   		}
>   
> -		set_cpu_stuck(cpu, tb);
> +		set_cpu_stuck(cpu);
>   
>   		wd_smp_unlock(&flags);
>   
> @@ -417,7 +423,7 @@ void arch_touch_nmi_watchdog(void)
>   	tb = get_tb();
>   	if (tb - per_cpu(wd_timer_tb, cpu) >= ticks) {
>   		per_cpu(wd_timer_tb, cpu) = tb;
> -		wd_smp_clear_cpu_pending(cpu, tb);
> +		wd_smp_clear_cpu_pending(cpu);
>   	}
>   }
>   EXPORT_SYMBOL(arch_touch_nmi_watchdog);
> @@ -475,7 +481,7 @@ static void stop_watchdog(void *arg)
>   	cpumask_clear_cpu(cpu, &wd_cpus_enabled);
>   	wd_smp_unlock(&flags);
>   
> -	wd_smp_clear_cpu_pending(cpu, get_tb());
> +	wd_smp_clear_cpu_pending(cpu);
>   }
>   
>   static int stop_watchdog_on_cpu(unsigned int cpu)
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH v2 2/5] powerpc/watchdog: Tighten non-atomic read-modify-write access
  2021-11-04 16:10 ` [PATCH v2 2/5] powerpc/watchdog: Tighten non-atomic read-modify-write access Nicholas Piggin
@ 2021-11-05 16:17   ` Laurent Dufour
  0 siblings, 0 replies; 13+ messages in thread
From: Laurent Dufour @ 2021-11-05 16:17 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev

Le 04/11/2021 à 17:10, Nicholas Piggin a écrit :
> Most updates to wd_smp_cpus_pending are under lock except the watchdog
> interrupt bit clear.
> 
> This can race with non-atomic RMW updates to the mask under lock, which
> can happen in two instances:
> 
> Firstly, if another CPU detects this one is stuck, removes it from the
> mask, mask becomes empty and is re-filled with non-atomic stores. This
> is okay because it would re-fill the mask with this CPU's bit clear
> anyway (because this CPU is now stuck), so it doesn't matter that the
> bit clear update got "lost". Add a comment for this.
> 
> Secondly, if another CPU detects a different CPU is stuck and removes it
> from the pending mask with a non-atomic store to bytes which also
> include the bit of this CPU. This case can result in the bit clear being
> lost and the end result being the bit is set. This should be so rare it
> hardly matters, but to make things simpler to reason about just avoid
> the non-atomic access for that case.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>

> ---
>   arch/powerpc/kernel/watchdog.c | 36 ++++++++++++++++++++++++----------
>   1 file changed, 26 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index be80071336a4..1d2623230297 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -131,10 +131,10 @@ static void wd_lockup_ipi(struct pt_regs *regs)
>   	/* Do not panic from here because that can recurse into NMI IPI layer */
>   }
>   
> -static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
> +static bool set_cpu_stuck(int cpu, u64 tb)
>   {
> -	cpumask_or(&wd_smp_cpus_stuck, &wd_smp_cpus_stuck, cpumask);
> -	cpumask_andnot(&wd_smp_cpus_pending, &wd_smp_cpus_pending, cpumask);
> +	cpumask_set_cpu(cpu, &wd_smp_cpus_stuck);
> +	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
>   	/*
>   	 * See wd_smp_clear_cpu_pending()
>   	 */
> @@ -144,11 +144,9 @@ static void set_cpumask_stuck(const struct cpumask *cpumask, u64 tb)
>   		cpumask_andnot(&wd_smp_cpus_pending,
>   				&wd_cpus_enabled,
>   				&wd_smp_cpus_stuck);
> +		return true;
>   	}
> -}
> -static void set_cpu_stuck(int cpu, u64 tb)
> -{
> -	set_cpumask_stuck(cpumask_of(cpu), tb);
> +	return false;
>   }
>   
>   static void watchdog_smp_panic(int cpu, u64 tb)
> @@ -177,15 +175,17 @@ static void watchdog_smp_panic(int cpu, u64 tb)
>   		 * get a backtrace on all of them anyway.
>   		 */
>   		for_each_cpu(c, &wd_smp_cpus_pending) {
> +			bool empty;
>   			if (c == cpu)
>   				continue;
> +			/* Take the stuck CPUs out of the watch group */
> +			empty = set_cpu_stuck(c, tb);
>   			smp_send_nmi_ipi(c, wd_lockup_ipi, 1000000);
> +			if (empty)
> +				break;
>   		}
>   	}
>   
> -	/* Take the stuck CPUs out of the watch group */
> -	set_cpumask_stuck(&wd_smp_cpus_pending, tb);
> -
>   	wd_smp_unlock(&flags);
>   
>   	if (sysctl_hardlockup_all_cpu_backtrace)
> @@ -232,6 +232,22 @@ static void wd_smp_clear_cpu_pending(int cpu, u64 tb)
>   		return;
>   	}
>   
> +	/*
> +	 * All other updates to wd_smp_cpus_pending are performed under
> +	 * wd_smp_lock. All of them are atomic except the case where the
> +	 * mask becomes empty and is reset. This will not happen here because
> +	 * cpu was tested to be in the bitmap (above), and a CPU only clears
> +	 * its own bit. _Except_ in the case where another CPU has detected a
> +	 * hard lockup on our CPU and takes us out of the pending mask. So in
> +	 * normal operation there will be no race here, no problem.
> +	 *
> +	 * In the lockup case, this atomic clear-bit vs a store that refills
> +	 * other bits in the accessed word wll not be a problem. The bit clear
> +	 * is atomic so it will not cause the store to get lost, and the store
> +	 * will never set this bit so it will not overwrite the bit clear. The
> +	 * only way for a stuck CPU to return to the pending bitmap is to
> +	 * become unstuck itself.
> +	 */
>   	cpumask_clear_cpu(cpu, &wd_smp_cpus_pending);
>   
>   	/*
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2021-11-05 16:18 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-04 16:10 [PATCH v2 0/5] powerpc: watchdog fixes Nicholas Piggin
2021-11-04 16:10 ` [PATCH v2 1/5] powerpc/watchdog: Fix missed watchdog reset due to memory ordering race Nicholas Piggin
2021-11-05  9:20   ` Laurent Dufour
2021-11-05 11:46     ` Nicholas Piggin
2021-11-05 12:15       ` Laurent Dufour
2021-11-04 16:10 ` [PATCH v2 2/5] powerpc/watchdog: Tighten non-atomic read-modify-write access Nicholas Piggin
2021-11-05 16:17   ` Laurent Dufour
2021-11-04 16:10 ` [PATCH v2 3/5] powerpc/watchdog: Avoid holding wd_smp_lock over printk and smp_send_nmi_ipi Nicholas Piggin
2021-11-04 16:10 ` [PATCH v2 4/5] powerpc/watchdog: Read TB close to where it is used Nicholas Piggin
2021-11-05 13:39   ` Laurent Dufour
2021-11-04 16:10 ` [PATCH v2 5/5] powerpc/watchdog: Remove backtrace print from unstuck message Nicholas Piggin
2021-11-04 16:48   ` Laurent Dufour
2021-11-05  1:28     ` Nicholas Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).