[PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful
@ 2019-08-19 10:47 Petr Mladek
  2019-08-19 10:47 ` [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally Petr Mladek
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Petr Mladek @ 2019-08-19 10:47 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

( Resending this as a proper patch with updated commit messages.
  The original was
  https://lkml.kernel.org/r/20190605140954.28471-1-pmladek@suse.com )

We were analyzing logs with several softlockup reports in flush_tlb_kernel_range().
They were confusing. Especially it was not clear whether it was deadlock,
livelock, or separate softlockups.

It went out that even a simple busy loop:

	while (true)
	      cpu_relax();

is able to produce several softlockups reports:

  [  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
  [  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
  [  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
                                                              ^^^^

This patchset fixes the problem in two steps:

+ 1st patch prevents restart of the watchdog from unrelated locations.
  Each softlockup is reported only once:

  [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]


+ 2nd patch helps to distinguish several possible situations by
  regular reports. The report looks like:

  [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
  [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
  [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
  [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
                                                              ^^^^^

3rd patch provides a sample code to trigger a softlockup.


Petr Mladek (3):
  watchdog/softlockup: Preserve original timestamp when touching
    watchdog externally
  watchdog/softlockup: Report the same softlockup regularly
  Test softlockup

 fs/proc/consoles.c |  5 ++++
 fs/proc/version.c  |  7 +++++
 kernel/watchdog.c  | 85 +++++++++++++++++++++++++++++++-----------------------
 3 files changed, 61 insertions(+), 36 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally
  2019-08-19 10:47 [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
@ 2019-08-19 10:47 ` Petr Mladek
  2019-10-21 12:42   ` Peter Zijlstra
  2019-08-19 10:47 ` [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly Petr Mladek
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Petr Mladek @ 2019-08-19 10:47 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

Some bug report included the same softlockups in flush_tlb_kernel_range()
in regular intervals. Unfortunately was not clear if there was a progress
or not.

The situation can be simulated with a simply busy loop:

	while (true)
	      cpu_relax();

The softlockup detector produces:

[  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]

One would expect only one softlockup report or several reports with
an increased duration.

The problem are touch_*_watchdog() calls spread all over the code. They are
typically used in code producing many printk() messages, for example:

   + touch_softlockup_watchdog() in nmi_trigger_cpumask_backtrace()
   + touch_all_softlockup_watchdogs() in show_state_filter()
   + touch_nmi_watchdog() in printk_stack_address()

One solution would be to remove all these calls. They are actually hiding
information about that the system is not working properly.

On the other hand, the watchdogs are touched in situations when the system
prints information about another problem. It might be confusing to report
softlockups caused by printing the other problem. In the worst situation,
a softlockup report might cause softlockup on its own and livelock.

This patch takes another approach. The softlockup is detected when
softlockup_fn() is not scheduled within the given time threshold.
All the externally called touch_*_watchdog() functions restart
the period that is used to check the threshold.

The result is that each softlockup is reported only once unless
another process get scheduled:

[  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]

Note that the logic preventing repeated reports is horrible. It will
get removed in a separate patch.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/watchdog.c | 39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 7f9e7b9306fe..bd249676ee3d 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -169,7 +169,9 @@ static bool softlockup_initialized __read_mostly;
 static u64 __read_mostly sample_period;
 
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
+static DEFINE_PER_CPU(unsigned long, watchdog_period_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
+static DEFINE_PER_CPU(bool, watchdog_restart_period);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
@@ -254,10 +256,19 @@ static void set_sample_period(void)
 	watchdog_update_hrtimer_threshold(sample_period);
 }
 
+/* Commands for restarting the watchdog period */
+static void __restart_watchdog_period(void)
+{
+	__this_cpu_write(watchdog_period_ts, get_timestamp());
+	__this_cpu_write(watchdog_restart_period, false);
+}
+
 /* Commands for resetting the watchdog */
 static void __touch_watchdog(void)
 {
 	__this_cpu_write(watchdog_touch_ts, get_timestamp());
+	__restart_watchdog_period();
+	__this_cpu_write(soft_watchdog_warn, false);
 }
 
 /**
@@ -271,10 +282,10 @@ static void __touch_watchdog(void)
 notrace void touch_softlockup_watchdog_sched(void)
 {
 	/*
-	 * Preemption can be enabled.  It doesn't matter which CPU's timestamp
-	 * gets zeroed here, so use the raw_ operation.
+	 * Preemption can be enabled.  It doesn't matter which CPU's watchdog
+	 * period gets restarted here, so use the raw_ operation.
 	 */
-	raw_cpu_write(watchdog_touch_ts, 0);
+	raw_cpu_write(watchdog_restart_period, true);
 }
 
 notrace void touch_softlockup_watchdog(void)
@@ -298,23 +309,23 @@ void touch_all_softlockup_watchdogs(void)
 	 * the softlockup check.
 	 */
 	for_each_cpu(cpu, &watchdog_allowed_mask)
-		per_cpu(watchdog_touch_ts, cpu) = 0;
+		per_cpu(watchdog_restart_period, cpu) = true;
 	wq_watchdog_touch(-1);
 }
 
 void touch_softlockup_watchdog_sync(void)
 {
 	__this_cpu_write(softlockup_touch_sync, true);
-	__this_cpu_write(watchdog_touch_ts, 0);
+	__this_cpu_write(watchdog_restart_period, true);
 }
 
-static int is_softlockup(unsigned long touch_ts)
+static int is_softlockup(unsigned long touch_ts, unsigned long period_ts)
 {
 	unsigned long now = get_timestamp();
 
 	if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
 		/* Warn about unreasonable delays. */
-		if (time_after(now, touch_ts + get_softlockup_thresh()))
+		if (time_after(now, period_ts + get_softlockup_thresh()))
 			return now - touch_ts;
 	}
 	return 0;
@@ -362,6 +373,7 @@ static int softlockup_fn(void *data)
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
 	unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
+	unsigned long period_ts = __this_cpu_read(watchdog_period_ts);
 	struct pt_regs *regs = get_irq_regs();
 	int duration;
 	int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;
@@ -383,7 +395,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	/* .. and repeat */
 	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
 
-	if (touch_ts == 0) {
+	if (__this_cpu_read(watchdog_restart_period)) {
+		__restart_watchdog_period();
+
 		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
 			/*
 			 * If the time stamp was touched atomically
@@ -395,7 +409,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 
 		/* Clear the guest paused flag on watchdog reset */
 		kvm_check_and_clear_guest_paused();
-		__touch_watchdog();
+
 		return HRTIMER_RESTART;
 	}
 
@@ -405,7 +419,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 * indicate it is getting cpu time.  If it hasn't then
 	 * this is a good indication some task is hogging the cpu
 	 */
-	duration = is_softlockup(touch_ts);
+	duration = is_softlockup(touch_ts, period_ts);
 	if (unlikely(duration)) {
 		/*
 		 * If a virtual machine is stopped by the host it can look to
@@ -428,7 +442,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 			if (__this_cpu_read(softlockup_task_ptr_saved) !=
 			    current) {
 				__this_cpu_write(soft_watchdog_warn, false);
-				__touch_watchdog();
+				__restart_watchdog_period();
 			}
 			return HRTIMER_RESTART;
 		}
@@ -470,8 +484,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
 		__this_cpu_write(soft_watchdog_warn, true);
-	} else
-		__this_cpu_write(soft_watchdog_warn, false);
+	}
 
 	return HRTIMER_RESTART;
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly
  2019-08-19 10:47 [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
  2019-08-19 10:47 ` [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally Petr Mladek
@ 2019-08-19 10:47 ` Petr Mladek
  2019-10-21 12:43   ` Peter Zijlstra
  2019-08-19 10:47 ` [PATCH 3/3] Test softlockup Petr Mladek
  2019-10-21 12:32 ` [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
  3 siblings, 1 reply; 13+ messages in thread
From: Petr Mladek @ 2019-08-19 10:47 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

Softlockup report means that there is no progress on the given CPU. It
might be a "short" affair where the system gets recovered. But often
the system stops being responsive and need to get rebooted.

The softlockup might be root of the problems or just a symptom. It might
be a deadlock, livelock, or often repeated state.

Regular reports help to distinguish different situations. Fortunately,
the watchdog is finally able to show correct information how long
softlockup_fn() was not scheduled.

Report before this patch:

[  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]

And after this patch:

[  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
[  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
[  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
[  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]

Note that the horrible code never really worked before the accounting
was fixed. The last working timestamp was regularly lost by the many
touch*watchdog() calls.

Also note that the full report is useful to distinguish livelock
and deadlock.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/watchdog.c | 25 +------------------------
 1 file changed, 1 insertion(+), 24 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index bd249676ee3d..2058229ed398 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -173,10 +173,8 @@ static DEFINE_PER_CPU(unsigned long, watchdog_period_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, watchdog_restart_period);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
-static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
-static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
 static unsigned long soft_lockup_nmi_warn;
 
@@ -268,7 +266,6 @@ static void __touch_watchdog(void)
 {
 	__this_cpu_write(watchdog_touch_ts, get_timestamp());
 	__restart_watchdog_period();
-	__this_cpu_write(soft_watchdog_warn, false);
 }
 
 /**
@@ -429,31 +426,13 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		if (kvm_check_and_clear_guest_paused())
 			return HRTIMER_RESTART;
 
-		/* only warn once */
-		if (__this_cpu_read(soft_watchdog_warn) == true) {
-			/*
-			 * When multiple processes are causing softlockups the
-			 * softlockup detector only warns on the first one
-			 * because the code relies on a full quiet cycle to
-			 * re-arm.  The second process prevents the quiet cycle
-			 * and never gets reported.  Use task pointers to detect
-			 * this.
-			 */
-			if (__this_cpu_read(softlockup_task_ptr_saved) !=
-			    current) {
-				__this_cpu_write(soft_watchdog_warn, false);
-				__restart_watchdog_period();
-			}
-			return HRTIMER_RESTART;
-		}
-
 		if (softlockup_all_cpu_backtrace) {
 			/* Prevent multiple soft-lockup reports if one cpu is already
 			 * engaged in dumping cpu back traces
 			 */
 			if (test_and_set_bit(0, &soft_lockup_nmi_warn)) {
 				/* Someone else will report us. Let's give up */
-				__this_cpu_write(soft_watchdog_warn, true);
+				__restart_watchdog_period();
 				return HRTIMER_RESTART;
 			}
 		}
@@ -461,7 +440,6 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
-		__this_cpu_write(softlockup_task_ptr_saved, current);
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)
@@ -483,7 +461,6 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
-		__this_cpu_write(soft_watchdog_warn, true);
 	}
 
 	return HRTIMER_RESTART;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] Test softlockup
  2019-08-19 10:47 [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
  2019-08-19 10:47 ` [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally Petr Mladek
  2019-08-19 10:47 ` [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly Petr Mladek
@ 2019-08-19 10:47 ` Petr Mladek
  2019-10-21 12:45   ` Peter Zijlstra
  2019-10-21 12:32 ` [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
  3 siblings, 1 reply; 13+ messages in thread
From: Petr Mladek @ 2019-08-19 10:47 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

Trigger busy loop by:
$> cat /proc/version

Stop the busy loop by:
$> cat /proc/consoles

The code also shows the first touch*watchdog() function that hides
softlockup on a "well known" location.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 fs/proc/consoles.c |  5 +++++
 fs/proc/version.c  |  7 +++++++
 kernel/watchdog.c  | 25 ++++++++++++++++++++++++-
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/proc/consoles.c b/fs/proc/consoles.c
index dfe6ce3505ce..213c0a209a7c 100644
--- a/fs/proc/consoles.c
+++ b/fs/proc/consoles.c
@@ -9,6 +9,8 @@
 #include <linux/seq_file.h>
 #include <linux/tty_driver.h>
 
+extern volatile bool proc_version_wait;
+
 /*
  * This is handler for /proc/consoles
  */
@@ -30,6 +32,9 @@ static int show_console_dev(struct seq_file *m, void *v)
 	unsigned int a;
 	dev_t dev = 0;
 
+	printk("%s: Going to break /proc/version infinite loop\n", __func__);
+	proc_version_wait = false;
+
 	if (con->device) {
 		const struct tty_driver *driver;
 		int index;
diff --git a/fs/proc/version.c b/fs/proc/version.c
index b449f186577f..15ec6a502589 100644
--- a/fs/proc/version.c
+++ b/fs/proc/version.c
@@ -6,8 +6,15 @@
 #include <linux/seq_file.h>
 #include <linux/utsname.h>
 
+volatile bool proc_version_wait;
+
 static int version_proc_show(struct seq_file *m, void *v)
 {
+	printk("%s: Going to wait until stopped\n", __func__);
+	proc_version_wait = true;
+	while (proc_version_wait)
+		cpu_relax();
+
 	seq_printf(m, linux_proc_banner,
 		utsname()->sysname,
 		utsname()->release,
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2058229ed398..3bfe6fbc468b 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -172,6 +172,7 @@ static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 static DEFINE_PER_CPU(unsigned long, watchdog_period_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, watchdog_restart_period);
+static DEFINE_PER_CPU(bool, watchdog_report_restart_period);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
@@ -259,6 +260,7 @@ static void __restart_watchdog_period(void)
 {
 	__this_cpu_write(watchdog_period_ts, get_timestamp());
 	__this_cpu_write(watchdog_restart_period, false);
+	__this_cpu_write(watchdog_report_restart_period, false);
 }
 
 /* Commands for resetting the watchdog */
@@ -283,6 +285,13 @@ notrace void touch_softlockup_watchdog_sched(void)
 	 * period gets restarted here, so use the raw_ operation.
 	 */
 	raw_cpu_write(watchdog_restart_period, true);
+
+	if (raw_cpu_read(watchdog_report_restart_period)) {
+		printk_deferred("Softlockup watchdog need reset from %s\n",
+			__func__);
+		trace_dump_stack(0);
+		raw_cpu_write(watchdog_report_restart_period, false);
+	}
 }
 
 notrace void touch_softlockup_watchdog(void)
@@ -305,8 +314,15 @@ void touch_all_softlockup_watchdogs(void)
 	 * update as well, the only side effect might be a cycle delay for
 	 * the softlockup check.
 	 */
-	for_each_cpu(cpu, &watchdog_allowed_mask)
+	for_each_cpu(cpu, &watchdog_allowed_mask) {
 		per_cpu(watchdog_restart_period, cpu) = true;
+
+		if (per_cpu(watchdog_report_restart_period, cpu) == true) {
+			WARN(1, "Softlockup watchdog need reset\n");
+			per_cpu(watchdog_report_restart_period, cpu) = false;
+		}
+	}
+
 	wq_watchdog_touch(-1);
 }
 
@@ -314,6 +330,11 @@ void touch_softlockup_watchdog_sync(void)
 {
 	__this_cpu_write(softlockup_touch_sync, true);
 	__this_cpu_write(watchdog_restart_period, true);
+
+	if (raw_cpu_read(watchdog_report_restart_period)) {
+		WARN(1, "Softlockup watchdog need reset\n");
+		raw_cpu_write(watchdog_report_restart_period, false);
+	}
 }
 
 static int is_softlockup(unsigned long touch_ts, unsigned long period_ts)
@@ -461,6 +482,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
+
+		__this_cpu_write(watchdog_report_restart_period, true);
 	}
 
 	return HRTIMER_RESTART;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful
  2019-08-19 10:47 [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
                   ` (2 preceding siblings ...)
  2019-08-19 10:47 ` [PATCH 3/3] Test softlockup Petr Mladek
@ 2019-10-21 12:32 ` Petr Mladek
  3 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2019-10-21 12:32 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner
  Cc: Vincent Whitchurch, Laurence Oberman, Michal Hocko, linux-kernel

On Mon 2019-08-19 12:47:29, Petr Mladek wrote:
> ( Resending this as a proper patch with updated commit messages.
>   The original was
>   https://lkml.kernel.org/r/20190605140954.28471-1-pmladek@suse.com )
> 
> We were analyzing logs with several softlockup reports in flush_tlb_kernel_range().
> They were confusing. Especially it was not clear whether it was deadlock,
> livelock, or separate softlockups.
> 
> It went out that even a simple busy loop:
> 
> 	while (true)
> 	      cpu_relax();
> 
> is able to produce several softlockups reports:
> 
>   [  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
>   [  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
>   [  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
>                                                               ^^^^
> 
> This patchset fixes the problem in two steps:
> 
> + 1st patch prevents restart of the watchdog from unrelated locations.
>   Each softlockup is reported only once:
> 
>   [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]
> 
> 
> + 2nd patch helps to distinguish several possible situations by
>   regular reports. The report looks like:
> 
>   [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
>   [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
>   [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
>   [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]

Gently ping. Is anyone against merging the two fixes, please?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally
  2019-08-19 10:47 ` [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally Petr Mladek
@ 2019-10-21 12:42   ` Peter Zijlstra
  2019-10-21 13:04     ` Petr Mladek
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2019-10-21 12:42 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon, Aug 19, 2019 at 12:47:30PM +0200, Petr Mladek wrote:
> Some bug report included the same softlockups in flush_tlb_kernel_range()
> in regular intervals. Unfortunately was not clear if there was a progress
> or not.
> 
> The situation can be simulated with a simply busy loop:
> 
> 	while (true)
> 	      cpu_relax();
> 
> The softlockup detector produces:
> 
> [  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> [  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> [  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
> 
> One would expect only one softlockup report or several reports with
> an increased duration.

Let's just say our expectations differ.

> The result is that each softlockup is reported only once unless
> another process get scheduled:
> 
> [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]

Which would greatly confuse me; as the above would have me think the
situation got resolved (no more lockups reported) even though it is
still very much stuck there.

IOW, I don't see how this makes anything better. You're removing
information.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly
  2019-08-19 10:47 ` [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly Petr Mladek
@ 2019-10-21 12:43   ` Peter Zijlstra
  2019-10-21 13:40     ` Petr Mladek
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2019-10-21 12:43 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon, Aug 19, 2019 at 12:47:31PM +0200, Petr Mladek wrote:
> Softlockup report means that there is no progress on the given CPU. It
> might be a "short" affair where the system gets recovered. But often
> the system stops being responsive and need to get rebooted.
> 
> The softlockup might be root of the problems or just a symptom. It might
> be a deadlock, livelock, or often repeated state.
> 
> Regular reports help to distinguish different situations. Fortunately,
> the watchdog is finally able to show correct information how long
> softlockup_fn() was not scheduled.
> 
> Report before this patch:
> 
> [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]
> 
> And after this patch:
> 
> [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
> [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
> [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
> [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
> 
> Note that the horrible code never really worked before the accounting
> was fixed. The last working timestamp was regularly lost by the many
> touch*watchdog() calls.

So what's the point of patch 1? Just confusing people?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] Test softlockup
  2019-08-19 10:47 ` [PATCH 3/3] Test softlockup Petr Mladek
@ 2019-10-21 12:45   ` Peter Zijlstra
  2019-10-21 13:06     ` Petr Mladek
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2019-10-21 12:45 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon, Aug 19, 2019 at 12:47:32PM +0200, Petr Mladek wrote:
> Trigger busy loop by:
> $> cat /proc/version
> 
> Stop the busy loop by:
> $> cat /proc/consoles
> 
> The code also shows the first touch*watchdog() function that hides
> softlockup on a "well known" location.

This seems like a terrible interface...

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally
  2019-10-21 12:42   ` Peter Zijlstra
@ 2019-10-21 13:04     ` Petr Mladek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2019-10-21 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon 2019-10-21 14:42:14, Peter Zijlstra wrote:
> On Mon, Aug 19, 2019 at 12:47:30PM +0200, Petr Mladek wrote:
> > Some bug report included the same softlockups in flush_tlb_kernel_range()
> > in regular intervals. Unfortunately was not clear if there was a progress
> > or not.
> > 
> > The situation can be simulated with a simply busy loop:
> > 
> > 	while (true)
> > 	      cpu_relax();
> > 
> > The softlockup detector produces:
> > 
> > [  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> > [  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
> > [  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
> > 
> > One would expect only one softlockup report or several reports with
> > an increased duration.
> 
> Let's just say our expectations differ.
> 
> > The result is that each softlockup is reported only once unless
> > another process get scheduled:
> > 
> > [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]
> 
> Which would greatly confuse me; as the above would have me think the
> situation got resolved (no more lockups reported) even though it is
> still very much stuck there.
> 
> IOW, I don't see how this makes anything better. You're removing
> information.

The 2nd patch brings back the regular report but with correctly
counted time (stuck for XXs).

I split it into two patches because I was not sure what would be
preferred behavior. I prefer the regular reports as well.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/3] Test softlockup
  2019-10-21 12:45   ` Peter Zijlstra
@ 2019-10-21 13:06     ` Petr Mladek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2019-10-21 13:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon 2019-10-21 14:45:18, Peter Zijlstra wrote:
> On Mon, Aug 19, 2019 at 12:47:32PM +0200, Petr Mladek wrote:
> > Trigger busy loop by:
> > $> cat /proc/version
> > 
> > Stop the busy loop by:
> > $> cat /proc/consoles
> > 
> > The code also shows the first touch*watchdog() function that hides
> > softlockup on a "well known" location.
> 
> This seems like a terrible interface...

Ah, this was just a patch for testing. It was not meant to be applied.
I am sorry for confusion.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly
  2019-10-21 12:43   ` Peter Zijlstra
@ 2019-10-21 13:40     ` Petr Mladek
  2019-10-21 14:09       ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Petr Mladek @ 2019-10-21 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon 2019-10-21 14:43:39, Peter Zijlstra wrote:
> On Mon, Aug 19, 2019 at 12:47:31PM +0200, Petr Mladek wrote:
> > Softlockup report means that there is no progress on the given CPU. It
> > might be a "short" affair where the system gets recovered. But often
> > the system stops being responsive and need to get rebooted.
> > 
> > The softlockup might be root of the problems or just a symptom. It might
> > be a deadlock, livelock, or often repeated state.
> > 
> > Regular reports help to distinguish different situations. Fortunately,
> > the watchdog is finally able to show correct information how long
> > softlockup_fn() was not scheduled.
> > 
> > Report before this patch:
> > 
> > [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]
> > 
> > And after this patch:
> > 
> > [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
> > [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
> > [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
> > [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
> > 
> > Note that the horrible code never really worked before the accounting
> > was fixed. The last working timestamp was regularly lost by the many
> > touch*watchdog() calls.
> 
> So what's the point of patch 1? Just confusing people?

I was not sure what was the expected behavior. The code actually
looked like only the first report was wanted. But it probably never
worked that way.

Should I squash the two patches and send it again, please?

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly
  2019-10-21 13:40     ` Petr Mladek
@ 2019-10-21 14:09       ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2019-10-21 14:09 UTC (permalink / raw)
  To: Petr Mladek
  Cc: Thomas Gleixner, Ingo Molnar, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Mon, Oct 21, 2019 at 03:40:38PM +0200, Petr Mladek wrote:
> On Mon 2019-10-21 14:43:39, Peter Zijlstra wrote:
> > On Mon, Aug 19, 2019 at 12:47:31PM +0200, Petr Mladek wrote:
> > > Softlockup report means that there is no progress on the given CPU. It
> > > might be a "short" affair where the system gets recovered. But often
> > > the system stops being responsive and need to get rebooted.
> > > 
> > > The softlockup might be root of the problems or just a symptom. It might
> > > be a deadlock, livelock, or often repeated state.
> > > 
> > > Regular reports help to distinguish different situations. Fortunately,
> > > the watchdog is finally able to show correct information how long
> > > softlockup_fn() was not scheduled.
> > > 
> > > Report before this patch:
> > > 
> > > [  320.248948] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4916]
> > > 
> > > And after this patch:
> > > 
> > > [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
> > > [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
> > > [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
> > > [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
> > > 
> > > Note that the horrible code never really worked before the accounting
> > > was fixed. The last working timestamp was regularly lost by the many
> > > touch*watchdog() calls.
> > 
> > So what's the point of patch 1? Just confusing people?
> 
> I was not sure what was the expected behavior. The code actually
> looked like only the first report was wanted. But it probably never
> worked that way.

Not that I can remember at least :-) I normally don't bother with the
actual time, and if I do then I look at the printk timestamps to figure
out how long thing've been stuck.

But this is indeed nicer..

> Should I squash the two patches and send it again, please?

Probably makes sense to squash it. That also avoids having to ever
expose that ugleh :-)


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 3/3] Test softlockup
  2019-06-05 14:09 [RFC " Petr Mladek
@ 2019-06-05 14:09 ` Petr Mladek
  0 siblings, 0 replies; 13+ messages in thread
From: Petr Mladek @ 2019-06-05 14:09 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

Trigger busy loop by:
$> cat /proc/version

Stop the busy loop by:
$> cat /proc/console

The code also shows the first touch*watchdog() function that hides
softlockup on a "well known" location.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 fs/proc/consoles.c |  5 +++++
 fs/proc/version.c  |  7 +++++++
 kernel/watchdog.c  | 25 ++++++++++++++++++++++++-
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/fs/proc/consoles.c b/fs/proc/consoles.c
index 954caf0b7fee..288ac1ab33c6 100644
--- a/fs/proc/consoles.c
+++ b/fs/proc/consoles.c
@@ -10,6 +10,8 @@
 #include <linux/seq_file.h>
 #include <linux/tty_driver.h>
 
+extern volatile bool proc_version_wait;
+
 /*
  * This is handler for /proc/consoles
  */
@@ -31,6 +33,9 @@ static int show_console_dev(struct seq_file *m, void *v)
 	unsigned int a;
 	dev_t dev = 0;
 
+	printk("%s: Going to break /proc/version infinite loop\n", __func__);
+	proc_version_wait = false;
+
 	if (con->device) {
 		const struct tty_driver *driver;
 		int index;
diff --git a/fs/proc/version.c b/fs/proc/version.c
index b449f186577f..15ec6a502589 100644
--- a/fs/proc/version.c
+++ b/fs/proc/version.c
@@ -6,8 +6,15 @@
 #include <linux/seq_file.h>
 #include <linux/utsname.h>
 
+volatile bool proc_version_wait;
+
 static int version_proc_show(struct seq_file *m, void *v)
 {
+	printk("%s: Going to wait until stopped\n", __func__);
+	proc_version_wait = true;
+	while (proc_version_wait)
+		cpu_relax();
+
 	seq_printf(m, linux_proc_banner,
 		utsname()->sysname,
 		utsname()->release,
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 2058229ed398..3bfe6fbc468b 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -172,6 +172,7 @@ static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 static DEFINE_PER_CPU(unsigned long, watchdog_period_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, watchdog_restart_period);
+static DEFINE_PER_CPU(bool, watchdog_report_restart_period);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
@@ -259,6 +260,7 @@ static void __restart_watchdog_period(void)
 {
 	__this_cpu_write(watchdog_period_ts, get_timestamp());
 	__this_cpu_write(watchdog_restart_period, false);
+	__this_cpu_write(watchdog_report_restart_period, false);
 }
 
 /* Commands for resetting the watchdog */
@@ -283,6 +285,13 @@ notrace void touch_softlockup_watchdog_sched(void)
 	 * period gets restarted here, so use the raw_ operation.
 	 */
 	raw_cpu_write(watchdog_restart_period, true);
+
+	if (raw_cpu_read(watchdog_report_restart_period)) {
+		printk_deferred("Softlockup watchdog need reset from %s\n",
+			__func__);
+		trace_dump_stack(0);
+		raw_cpu_write(watchdog_report_restart_period, false);
+	}
 }
 
 notrace void touch_softlockup_watchdog(void)
@@ -305,8 +314,15 @@ void touch_all_softlockup_watchdogs(void)
 	 * update as well, the only side effect might be a cycle delay for
 	 * the softlockup check.
 	 */
-	for_each_cpu(cpu, &watchdog_allowed_mask)
+	for_each_cpu(cpu, &watchdog_allowed_mask) {
 		per_cpu(watchdog_restart_period, cpu) = true;
+
+		if (per_cpu(watchdog_report_restart_period, cpu) == true) {
+			WARN(1, "Softlockup watchdog need reset\n");
+			per_cpu(watchdog_report_restart_period, cpu) = false;
+		}
+	}
+
 	wq_watchdog_touch(-1);
 }
 
@@ -314,6 +330,11 @@ void touch_softlockup_watchdog_sync(void)
 {
 	__this_cpu_write(softlockup_touch_sync, true);
 	__this_cpu_write(watchdog_restart_period, true);
+
+	if (raw_cpu_read(watchdog_report_restart_period)) {
+		WARN(1, "Softlockup watchdog need reset\n");
+		raw_cpu_write(watchdog_report_restart_period, false);
+	}
 }
 
 static int is_softlockup(unsigned long touch_ts, unsigned long period_ts)
@@ -461,6 +482,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
+
+		__this_cpu_write(watchdog_report_restart_period, true);
 	}
 
 	return HRTIMER_RESTART;
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2019-10-21 14:09 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-19 10:47 [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
2019-08-19 10:47 ` [PATCH 1/3] watchdog/softlockup: Preserve original timestamp when touching watchdog externally Petr Mladek
2019-10-21 12:42   ` Peter Zijlstra
2019-10-21 13:04     ` Petr Mladek
2019-08-19 10:47 ` [PATCH 2/3] watchdog/softlockup: Report the same softlockup regularly Petr Mladek
2019-10-21 12:43   ` Peter Zijlstra
2019-10-21 13:40     ` Petr Mladek
2019-10-21 14:09       ` Peter Zijlstra
2019-08-19 10:47 ` [PATCH 3/3] Test softlockup Petr Mladek
2019-10-21 12:45   ` Peter Zijlstra
2019-10-21 13:06     ` Petr Mladek
2019-10-21 12:32 ` [PATCH 0/3] watchdog/softlockup: Make softlockup reports more reliable and useful Petr Mladek
  -- strict thread matches above, loose matches on Subject: below --
2019-06-05 14:09 [RFC " Petr Mladek
2019-06-05 14:09 ` [PATCH 3/3] Test softlockup Petr Mladek

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.