linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup
@ 2019-10-24 11:49 Petr Mladek
  2019-10-24 11:49 ` [PATCH 1/3] watchdog/softlockup: Remove obsolete check of last reported task Petr Mladek
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Petr Mladek @ 2019-10-24 11:49 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

This is rework of the softlockup improvements sent as
https://lkml.kernel.org/r/20190819104732.20966-1-pmladek@suse.com
based on feedback from Peter Zijlstra.

      + merged the two patches
      + moved some cleanup changes into separate patches
      + improved the code and commit messages

The main change is in 2nd patch. It fixes the time spent in softlockup.

Original:

  [  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
  [  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
  [  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
                                                              ^^^

New:

  [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
  [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
  [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
  [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
                                                              ^^^^^
1st and 3rd patch clean up the code.


Petr Mladek (3):
  watchdog/softlockup: Remove obsolete check of last reported task
  watchdog/softlockup: Report the overall time of softlockups
  watchdog/softlockup: Remove logic that tried to prevent repeated
    reports

 kernel/watchdog.c | 67 ++++++++++++++++++++++++-------------------------------
 1 file changed, 29 insertions(+), 38 deletions(-)

-- 
2.16.4


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/3] watchdog/softlockup: Remove obsolete check of last reported task
  2019-10-24 11:49 [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
@ 2019-10-24 11:49 ` Petr Mladek
  2020-01-16 13:55   ` [tip: core/core] " tip-bot2 for Petr Mladek
  2019-10-24 11:49 ` [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups Petr Mladek
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Petr Mladek @ 2019-10-24 11:49 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

The commit 9cf57731b63e37ed995 ("watchdog/softlockup: Replace "watchdog/%u"
 threads with cpu_stop_work") caused that the watchdog is reliably touched
during a task switch.

As a result the check for a unnoticed task switch is not longer needed.
And the commit b1a8de1f534337b398c7778 ("softlockup: make detector be aware
of task switch of processes hogging cpu") can be reverted.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/watchdog.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index f41334ef0971..b7e2046380cf 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -174,7 +174,6 @@ static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
-static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
 static unsigned long soft_lockup_nmi_warn;
 
@@ -416,22 +415,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 			return HRTIMER_RESTART;
 
 		/* only warn once */
-		if (__this_cpu_read(soft_watchdog_warn) == true) {
-			/*
-			 * When multiple processes are causing softlockups the
-			 * softlockup detector only warns on the first one
-			 * because the code relies on a full quiet cycle to
-			 * re-arm.  The second process prevents the quiet cycle
-			 * and never gets reported.  Use task pointers to detect
-			 * this.
-			 */
-			if (__this_cpu_read(softlockup_task_ptr_saved) !=
-			    current) {
-				__this_cpu_write(soft_watchdog_warn, false);
-				__touch_watchdog();
-			}
+		if (__this_cpu_read(soft_watchdog_warn) == true)
 			return HRTIMER_RESTART;
-		}
 
 		if (softlockup_all_cpu_backtrace) {
 			/* Prevent multiple soft-lockup reports if one cpu is already
@@ -447,7 +432,6 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
-		__this_cpu_write(softlockup_task_ptr_saved, current);
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups
  2019-10-24 11:49 [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
  2019-10-24 11:49 ` [PATCH 1/3] watchdog/softlockup: Remove obsolete check of last reported task Petr Mladek
@ 2019-10-24 11:49 ` Petr Mladek
  2020-01-16 13:39   ` Thomas Gleixner
  2019-10-24 11:49 ` [PATCH 3/3] watchdog/softlockup: Remove logic that tried to prevent repeated reports Petr Mladek
  2019-11-08 10:13 ` [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
  3 siblings, 1 reply; 8+ messages in thread
From: Petr Mladek @ 2019-10-24 11:49 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

The softlockup detector currently shows the time spent since the last
report. As a result it is not clear whether a CPU is infinitely hogged
by a single task or if it is a repeated event.

The situation can be simulated with a simply busy loop:

	while (true)
	      cpu_relax();

The softlockup detector produces:

[  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
[  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]

But it should be, something like:

[  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
[  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
[  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
[  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]

For the better output, an additional timestamp, of the last report, must
be stored. And only this timestamp must get reset when the watchdog
is intentionally touched from code paths that are slow for a reason.

Also the timestamp should get reset explicitly. It is done also by the code
printing the backtrace. But it is just a side effect and far from obvious.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/watchdog.c | 37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index b7e2046380cf..1eca651daf59 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -168,7 +168,10 @@ unsigned int __read_mostly softlockup_panic =
 static bool softlockup_initialized __read_mostly;
 static u64 __read_mostly sample_period;
 
+/* Timestamp taken after the last successful reschedule. */
 static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
+/* Timestamp of the last softlockup report period. */
+static DEFINE_PER_CPU(unsigned long, report_period_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
@@ -253,10 +256,16 @@ static void set_sample_period(void)
 	watchdog_update_hrtimer_threshold(sample_period);
 }
 
+static void reset_report_period_ts(void)
+{
+	__this_cpu_write(report_period_ts, get_timestamp());
+}
+
 /* Commands for resetting the watchdog */
 static void __touch_watchdog(void)
 {
 	__this_cpu_write(watchdog_touch_ts, get_timestamp());
+	reset_report_period_ts();
 }
 
 /**
@@ -270,10 +279,10 @@ static void __touch_watchdog(void)
 notrace void touch_softlockup_watchdog_sched(void)
 {
 	/*
-	 * Preemption can be enabled.  It doesn't matter which CPU's timestamp
-	 * gets zeroed here, so use the raw_ operation.
+	 * Preemption can be enabled.  It doesn't matter which CPU's watchdog
+	 * report period gets restarted here, so use the raw_ operation.
 	 */
-	raw_cpu_write(watchdog_touch_ts, 0);
+	raw_cpu_write(report_period_ts, 0);
 }
 
 notrace void touch_softlockup_watchdog(void)
@@ -297,23 +306,23 @@ void touch_all_softlockup_watchdogs(void)
 	 * the softlockup check.
 	 */
 	for_each_cpu(cpu, &watchdog_allowed_mask)
-		per_cpu(watchdog_touch_ts, cpu) = 0;
+		per_cpu(report_period_ts, cpu) = 0;
 	wq_watchdog_touch(-1);
 }
 
 void touch_softlockup_watchdog_sync(void)
 {
 	__this_cpu_write(softlockup_touch_sync, true);
-	__this_cpu_write(watchdog_touch_ts, 0);
+	__this_cpu_write(report_period_ts, 0);
 }
 
-static int is_softlockup(unsigned long touch_ts)
+static int is_softlockup(unsigned long touch_ts, unsigned long period_ts)
 {
 	unsigned long now = get_timestamp();
 
 	if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
 		/* Warn about unreasonable delays. */
-		if (time_after(now, touch_ts + get_softlockup_thresh()))
+		if (time_after(now, period_ts + get_softlockup_thresh()))
 			return now - touch_ts;
 	}
 	return 0;
@@ -361,6 +370,7 @@ static int softlockup_fn(void *data)
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
 	unsigned long touch_ts = __this_cpu_read(watchdog_touch_ts);
+	unsigned long period_ts = __this_cpu_read(report_period_ts);
 	struct pt_regs *regs = get_irq_regs();
 	int duration;
 	int softlockup_all_cpu_backtrace = sysctl_softlockup_all_cpu_backtrace;
@@ -382,7 +392,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	/* .. and repeat */
 	hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
 
-	if (touch_ts == 0) {
+	/* Was the watchdog touched externally by a known slow code? */
+	if (period_ts == 0) {
 		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
 			/*
 			 * If the time stamp was touched atomically
@@ -394,7 +405,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 
 		/* Clear the guest paused flag on watchdog reset */
 		kvm_check_and_clear_guest_paused();
-		__touch_watchdog();
+		/*
+		 * The above kvm*() function could touch the watchdog.
+		 * Set the real timestamp later to avoid an infinite loop.
+		 */
+		reset_report_period_ts();
+
 		return HRTIMER_RESTART;
 	}
 
@@ -404,8 +420,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 	 * indicate it is getting cpu time.  If it hasn't then
 	 * this is a good indication some task is hogging the cpu
 	 */
-	duration = is_softlockup(touch_ts);
+	duration = is_softlockup(touch_ts, period_ts);
 	if (unlikely(duration)) {
+		reset_report_period_ts();
 		/*
 		 * If a virtual machine is stopped by the host it can look to
 		 * the watchdog like a soft lockup, check to see if the host
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 3/3] watchdog/softlockup: Remove logic that tried to prevent repeated reports
  2019-10-24 11:49 [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
  2019-10-24 11:49 ` [PATCH 1/3] watchdog/softlockup: Remove obsolete check of last reported task Petr Mladek
  2019-10-24 11:49 ` [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups Petr Mladek
@ 2019-10-24 11:49 ` Petr Mladek
  2019-11-08 10:13 ` [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
  3 siblings, 0 replies; 8+ messages in thread
From: Petr Mladek @ 2019-10-24 11:49 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

The softlockup detector does some gymnastic with the variable
soft_watchdog_warn. It was added by the commit 58687acba59266735ad
("lockup_detector: Combine nmi_watchdog and softlockup detector").

The purpose is not completely clear. There are the following clues:

  1. The comment "only warn once".

  2. watchdog_touch_ts is not explicitly updated when the softlockup is
    reported. It means that the softlockup might get reported every time
    watchdog_timer_fn() is called.

  3. The variable is set when softlockup is reported. And it is cleared
     when the watchdog was touched and the CPU is not longer is softlockup
     state.

It probably never worked. In each case, it have not worked last many years
because the watchdog gets touched in many known slow paths. For example,
it is touched in printk_stack_address(). Therefore it is touched
when printing the report. As a result, the report is always periodic.

Simply remove the logic. People want the periodic report anyway.

Signed-off-by: Petr Mladek <pmladek@suse.com>
---
 kernel/watchdog.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 1eca651daf59..3626590c6542 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -174,7 +174,6 @@ static DEFINE_PER_CPU(unsigned long, watchdog_touch_ts);
 static DEFINE_PER_CPU(unsigned long, report_period_ts);
 static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
-static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
 static DEFINE_PER_CPU(unsigned long, soft_lockup_hrtimer_cnt);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
@@ -431,19 +430,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		if (kvm_check_and_clear_guest_paused())
 			return HRTIMER_RESTART;
 
-		/* only warn once */
-		if (__this_cpu_read(soft_watchdog_warn) == true)
-			return HRTIMER_RESTART;
-
 		if (softlockup_all_cpu_backtrace) {
 			/* Prevent multiple soft-lockup reports if one cpu is already
 			 * engaged in dumping cpu back traces
 			 */
-			if (test_and_set_bit(0, &soft_lockup_nmi_warn)) {
-				/* Someone else will report us. Let's give up */
-				__this_cpu_write(soft_watchdog_warn, true);
+			if (test_and_set_bit(0, &soft_lockup_nmi_warn))
 				return HRTIMER_RESTART;
-			}
 		}
 
 		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
@@ -470,9 +462,7 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
-		__this_cpu_write(soft_watchdog_warn, true);
-	} else
-		__this_cpu_write(soft_watchdog_warn, false);
+	}
 
 	return HRTIMER_RESTART;
 }
-- 
2.16.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup
  2019-10-24 11:49 [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
                   ` (2 preceding siblings ...)
  2019-10-24 11:49 ` [PATCH 3/3] watchdog/softlockup: Remove logic that tried to prevent repeated reports Petr Mladek
@ 2019-11-08 10:13 ` Petr Mladek
  3 siblings, 0 replies; 8+ messages in thread
From: Petr Mladek @ 2019-11-08 10:13 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel

On Thu 2019-10-24 13:49:25, Petr Mladek wrote:
> This is rework of the softlockup improvements sent as
> https://lkml.kernel.org/r/20190819104732.20966-1-pmladek@suse.com
> based on feedback from Peter Zijlstra.
> 
>       + merged the two patches
>       + moved some cleanup changes into separate patches
>       + improved the code and commit messages
> 
> The main change is in 2nd patch. It fixes the time spent in softlockup.
> 
> Original:
> 
>   [  168.277520] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
>   [  196.277604] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [cat:4865]
>   [  236.277522] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cat:4865]
>                                                               ^^^
> 
> New:
> 
>   [  480.372418] watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [cat:4943]
>   [  508.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 52s! [cat:4943]
>   [  548.372359] watchdog: BUG: soft lockup - CPU#2 stuck for 89s! [cat:4943]
>   [  576.372351] watchdog: BUG: soft lockup - CPU#2 stuck for 115s! [cat:4943]
>                                                               ^^^^^
> 1st and 3rd patch clean up the code.
> 
> 
> Petr Mladek (3):
>   watchdog/softlockup: Remove obsolete check of last reported task
>   watchdog/softlockup: Report the overall time of softlockups
>   watchdog/softlockup: Remove logic that tried to prevent repeated
>     reports
> 
>  kernel/watchdog.c | 67 ++++++++++++++++++++++++-------------------------------
>  1 file changed, 29 insertions(+), 38 deletions(-)

Gently ping. I wonder if there is still chance to get this for 5.5.

I hope that this variant is much easier to review than
the previous one.

Best Regards,
Petr

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups
  2019-10-24 11:49 ` [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups Petr Mladek
@ 2020-01-16 13:39   ` Thomas Gleixner
  2020-01-17 10:00     ` Petr Mladek
  0 siblings, 1 reply; 8+ messages in thread
From: Thomas Gleixner @ 2020-01-16 13:39 UTC (permalink / raw)
  To: Petr Mladek, Ingo Molnar, Peter Zijlstra
  Cc: Laurence Oberman, Vincent Whitchurch, Michal Hocko, linux-kernel,
	Petr Mladek

Petr,

Petr Mladek <pmladek@suse.com> writes:
> -	if (touch_ts == 0) {
> +	/* Was the watchdog touched externally by a known slow code? */
> +	if (period_ts == 0) {
>  		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
>  			/*
>  			 * If the time stamp was touched atomically
> @@ -394,7 +405,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  
>  		/* Clear the guest paused flag on watchdog reset */
>  		kvm_check_and_clear_guest_paused();
> -		__touch_watchdog();
> +		/*
> +		 * The above kvm*() function could touch the watchdog.
> +		 * Set the real timestamp later to avoid an infinite
>  		loop.

This comment makes no sense whatsoever. If period_ts is 0,
i.e. something invoked touch_softlockup_watchdog*() then it does not
make any difference whether the kvm function invokes one of those
functions once more. The result is the same. The per cpu period_ts is
still 0.

The point is that _AFTER_ a intentional watchdog reset, the reporting
base time needs to be set to now() in order to make it functional again.

> +		 */
> +		reset_report_period_ts();

Btw, the function name is misleading. I got confused several times
because I expected the reset to set the timestamp to 0, which is not the
case. update_report_period_ts() is far more intuitive.

> @@ -404,8 +420,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  	 * indicate it is getting cpu time.  If it hasn't then
>  	 * this is a good indication some task is hogging the cpu
>  	 */
> -	duration = is_softlockup(touch_ts);
> +	duration = is_softlockup(touch_ts, period_ts);
>  	if (unlikely(duration)) {

This lacks a comment. Your changelog paragraph:

 > Also the timestamp should get reset explicitly. It is done also by the code
 > printing the backtrace. But it is just a side effect and far from
 > obvious.

is probably referring to this, but it confused me more than it helped
simply because the update of the timestamp happens unconditionally even
when the backtrace code is not reached due to the KVM check

So this is a change vs. the current implementation which is not
documented and explained.

> +		reset_report_period_ts();
>  		/*
>  		 * If a virtual machine is stopped by the host it can look to
>  		 * the watchdog like a soft lockup, check to see if the host

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [tip: core/core] watchdog/softlockup: Remove obsolete check of last reported task
  2019-10-24 11:49 ` [PATCH 1/3] watchdog/softlockup: Remove obsolete check of last reported task Petr Mladek
@ 2020-01-16 13:55   ` tip-bot2 for Petr Mladek
  0 siblings, 0 replies; 8+ messages in thread
From: tip-bot2 for Petr Mladek @ 2020-01-16 13:55 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Petr Mladek, Thomas Gleixner, Peter Ziljstra, x86, LKML

The following commit has been merged into the core/core branch of tip:

Commit-ID:     3a51449b7959f68cc45abe67298e40c7eb57167b
Gitweb:        https://git.kernel.org/tip/3a51449b7959f68cc45abe67298e40c7eb57167b
Author:        Petr Mladek <pmladek@suse.com>
AuthorDate:    Thu, 24 Oct 2019 13:49:26 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Thu, 16 Jan 2020 14:52:48 +01:00

watchdog/softlockup: Remove obsolete check of last reported task

commit 9cf57731b63e ("watchdog/softlockup: Replace "watchdog/%u" threads
 with cpu_stop_work") ensures that the watchdog is reliably touched during
a task switch.

As a result the check for an unnoticed task switch is not longer needed.

Remove the relevant code, which effectively reverts commit b1a8de1f5343
("softlockup: make detector be aware of task switch of processes hogging
cpu")

Signed-off-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Ziljstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20191024114928.15377-2-pmladek@suse.com
---
 kernel/watchdog.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 0621301..e3774e9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -173,7 +173,6 @@ static DEFINE_PER_CPU(struct hrtimer, watchdog_hrtimer);
 static DEFINE_PER_CPU(bool, softlockup_touch_sync);
 static DEFINE_PER_CPU(bool, soft_watchdog_warn);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts);
-static DEFINE_PER_CPU(struct task_struct *, softlockup_task_ptr_saved);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
 static unsigned long soft_lockup_nmi_warn;
 
@@ -413,22 +412,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 			return HRTIMER_RESTART;
 
 		/* only warn once */
-		if (__this_cpu_read(soft_watchdog_warn) == true) {
-			/*
-			 * When multiple processes are causing softlockups the
-			 * softlockup detector only warns on the first one
-			 * because the code relies on a full quiet cycle to
-			 * re-arm.  The second process prevents the quiet cycle
-			 * and never gets reported.  Use task pointers to detect
-			 * this.
-			 */
-			if (__this_cpu_read(softlockup_task_ptr_saved) !=
-			    current) {
-				__this_cpu_write(soft_watchdog_warn, false);
-				__touch_watchdog();
-			}
+		if (__this_cpu_read(soft_watchdog_warn) == true)
 			return HRTIMER_RESTART;
-		}
 
 		if (softlockup_all_cpu_backtrace) {
 			/* Prevent multiple soft-lockup reports if one cpu is already
@@ -444,7 +429,6 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		pr_emerg("BUG: soft lockup - CPU#%d stuck for %us! [%s:%d]\n",
 			smp_processor_id(), duration,
 			current->comm, task_pid_nr(current));
-		__this_cpu_write(softlockup_task_ptr_saved, current);
 		print_modules();
 		print_irqtrace_events(current);
 		if (regs)

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups
  2020-01-16 13:39   ` Thomas Gleixner
@ 2020-01-17 10:00     ` Petr Mladek
  0 siblings, 0 replies; 8+ messages in thread
From: Petr Mladek @ 2020-01-17 10:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Peter Zijlstra, Laurence Oberman,
	Vincent Whitchurch, Michal Hocko, linux-kernel

On Thu 2020-01-16 14:39:31, Thomas Gleixner wrote:
> Petr,
> 
> Petr Mladek <pmladek@suse.com> writes:
> > -	if (touch_ts == 0) {
> > +	/* Was the watchdog touched externally by a known slow code? */
> > +	if (period_ts == 0) {
> >  		if (unlikely(__this_cpu_read(softlockup_touch_sync))) {
> >  			/*
> >  			 * If the time stamp was touched atomically
> > @@ -394,7 +405,12 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >  
> >  		/* Clear the guest paused flag on watchdog reset */
> >  		kvm_check_and_clear_guest_paused();
> > -		__touch_watchdog();
> > +		/*
> > +		 * The above kvm*() function could touch the watchdog.
> > +		 * Set the real timestamp later to avoid an infinite
> >  		loop.
> 
> This comment makes no sense whatsoever. If period_ts is 0,
> i.e. something invoked touch_softlockup_watchdog*() then it does not
> make any difference whether the kvm function invokes one of those
> functions once more. The result is the same. The per cpu period_ts is
> still 0.
>
> The point is that _AFTER_ a intentional watchdog reset, the reporting
> base time needs to be set to now() in order to make it functional again.

Exactly. I think that my comment is just confusing. I wanted to say
that the order was important.

It was not obvious to me that kvm_check_and_clear_guest_paused() cleared
perior_ts and must be called before update_report_period_ts(). I spent
some time with debugging why the reshufled code stopped working ;-)

What about the following?

		/*
		 * Clear the guest paused flag on watchdog. Might clear
		 *  report_period_ts.
		 */
		kvm_check_and_clear_guest_paused();

		update_report_period_ts();


> > +		 */
> > +		reset_report_period_ts();
> 
> Btw, the function name is misleading. I got confused several times
> because I expected the reset to set the timestamp to 0, which is not the
> case. update_report_period_ts() is far more intuitive.

Sounds good.


> > @@ -404,8 +420,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >  	 * indicate it is getting cpu time.  If it hasn't then
> >  	 * this is a good indication some task is hogging the cpu
> >  	 */
> > -	duration = is_softlockup(touch_ts);
> > +	duration = is_softlockup(touch_ts, period_ts);
> >  	if (unlikely(duration)) {
> 
> This lacks a comment. Your changelog paragraph:
> 
>  > Also the timestamp should get reset explicitly. It is done also by the code
>  > printing the backtrace. But it is just a side effect and far from
>  > obvious.
> 
> is probably referring to this, but it confused me more than it helped
> simply because the update of the timestamp happens unconditionally even
> when the backtrace code is not reached due to the KVM check

Where is the timestamp updated unconditionaly, please?

I found that it happens, for example, in printk_stack_address() that
is hidden "deep" in show_stack().

> So this is a change vs. the current implementation which is not
> documented and explained.

To me, it looks obvious to reset/update the period when the current
period ended and we are about to report softlockup.

OK, it might create wrong assumtion that the updated timestamp will
really get used. It is not true because it will get reset inside
the above mentioned show_stack(). But is it really guaranteed
that the watchdog will be touched there?

IMHO, the explicit call makes the code more reliable and easier
to understand. The hidden touch() might get re(moved) at any time.


> > +		reset_report_period_ts();
> >  		/*
> >  		 * If a virtual machine is stopped by the host it can look to
> >  		 * the watchdog like a soft lockup, check to see if the host

Thanks a lot for review.

Best Regards,
Petr

PS: I will have only limited internet connection the following week.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2020-01-17 10:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-24 11:49 [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek
2019-10-24 11:49 ` [PATCH 1/3] watchdog/softlockup: Remove obsolete check of last reported task Petr Mladek
2020-01-16 13:55   ` [tip: core/core] " tip-bot2 for Petr Mladek
2019-10-24 11:49 ` [PATCH 2/3] watchdog/softlockup: Report the overall time of softlockups Petr Mladek
2020-01-16 13:39   ` Thomas Gleixner
2020-01-17 10:00     ` Petr Mladek
2019-10-24 11:49 ` [PATCH 3/3] watchdog/softlockup: Remove logic that tried to prevent repeated reports Petr Mladek
2019-11-08 10:13 ` [PATCH 0/3] watchdog/softlockup: Report overall time and some cleanup Petr Mladek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).