All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] improve handling of errors returned by kthread_park()
@ 2015-09-28 20:44 Ulrich Obergfell
  2015-09-28 20:44 ` [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh() Ulrich Obergfell
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-28 20:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, dzickus, atomlin, uobergfe

The original watchdog_park_threads() function that was introduced by
commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
approach to handle errors returned by kthread_park(): It attempts to
roll back all watchdog threads to the unparked state. However, this
may be undesired behaviour from the perspective of the caller which
may want to handle errors as appropriate in its specific context.
Currently, there are two possible call chains:

- watchdog suspend/resume interface

    lockup_detector_suspend
      watchdog_park_threads

- write to parameters in /proc/sys/kernel

    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Instead of 'blindly' attempting to unpark the watchdog threads if a 
kthread_park() call fails, the new approach is to disable the lockup
detectors in the above call chains. Failure becomes visible to the
user as follows:

- error messages from lockup_detector_suspend()
                   or watchdog_enable_all_cpus()

- the state that can be read from /proc/sys/kernel/watchdog_enabled

- the 'write' system call in the latter call chain returns an error

Ulrich Obergfell (5):
  watchdog: fix error handling in proc_watchdog_thresh()
  watchdog: move watchdog_disable_all_cpus() outside of ifdef
  watchdog: implement error handling in update_watchdog_all_cpus() and
    callers
  watchdog: implement error handling in lockup_detector_suspend()
  watchdog: do not unpark threads in watchdog_park_threads() on error

 kernel/watchdog.c | 60 +++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 43 insertions(+), 17 deletions(-)

-- 
1.7.11.7


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh()
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
@ 2015-09-28 20:44 ` Ulrich Obergfell
  2015-09-29  6:39   ` Aaron Tomlin
  2015-09-28 20:44 ` [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef Ulrich Obergfell
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-28 20:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, dzickus, atomlin, uobergfe

Restore the previous value of watchdog_thresh _and_ sample_period
if proc_watchdog_update() returns an error. The variables must be
consistent to avoid false positives of the lockup detectors.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
---
 kernel/watchdog.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 64ed1c3..cd9a504 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -914,13 +914,14 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
 		goto out;
 
 	/*
-	 * Update the sample period.
-	 * Restore 'watchdog_thresh' on failure.
+	 * Update the sample period. Restore on failure.
 	 */
 	set_sample_period();
 	err = proc_watchdog_update();
-	if (err)
+	if (err) {
 		watchdog_thresh = old;
+		set_sample_period();
+	}
 out:
 	mutex_unlock(&watchdog_proc_mutex);
 	return err;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
  2015-09-28 20:44 ` [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh() Ulrich Obergfell
@ 2015-09-28 20:44 ` Ulrich Obergfell
  2015-09-29  6:40   ` Aaron Tomlin
  2015-09-29 23:37   ` Andrew Morton
  2015-09-28 20:44 ` [PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers Ulrich Obergfell
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-28 20:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, dzickus, atomlin, uobergfe

It makes sense to place watchdog_{dis|enable}_all_cpus() outside of
the ifdef so that _both_ are available even if CONFIG_SYSCTL is not
defined.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
---
 kernel/watchdog.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index cd9a504..eb9527c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -347,6 +347,9 @@ static void watchdog_interrupt_count(void)
 static int watchdog_nmi_enable(unsigned int cpu);
 static void watchdog_nmi_disable(unsigned int cpu);
 
+static int watchdog_enable_all_cpus(void);
+static void watchdog_disable_all_cpus(void);
+
 /* watchdog kicker functions */
 static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 {
@@ -756,9 +759,6 @@ static int watchdog_enable_all_cpus(void)
 	return err;
 }
 
-/* prepare/enable/disable routines */
-/* sysctl functions */
-#ifdef CONFIG_SYSCTL
 static void watchdog_disable_all_cpus(void)
 {
 	if (watchdog_running) {
@@ -767,6 +767,8 @@ static void watchdog_disable_all_cpus(void)
 	}
 }
 
+#ifdef CONFIG_SYSCTL
+
 /*
  * Update the run state of the lockup detectors.
  */
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
  2015-09-28 20:44 ` [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh() Ulrich Obergfell
  2015-09-28 20:44 ` [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef Ulrich Obergfell
@ 2015-09-28 20:44 ` Ulrich Obergfell
  2015-09-29  6:40   ` Aaron Tomlin
  2015-09-28 20:44 ` [PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend() Ulrich Obergfell
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-28 20:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, dzickus, atomlin, uobergfe

update_watchdog_all_cpus() now passes errors from watchdog_park_threads()
up to functions in the call chain. This allows watchdog_enable_all_cpus()
and proc_watchdog_update() to handle such errors too.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
---
 kernel/watchdog.c | 30 +++++++++++++++++++++++-------
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index eb9527c..457113c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -731,10 +731,17 @@ void lockup_detector_resume(void)
 	mutex_unlock(&watchdog_proc_mutex);
 }
 
-static void update_watchdog_all_cpus(void)
+static int update_watchdog_all_cpus(void)
 {
-	watchdog_park_threads();
+	int ret;
+
+	ret = watchdog_park_threads();
+	if (ret)
+		return ret;
+
 	watchdog_unpark_threads();
+
+	return 0;
 }
 
 static int watchdog_enable_all_cpus(void)
@@ -753,9 +760,17 @@ static int watchdog_enable_all_cpus(void)
 		 * Enable/disable the lockup detectors or
 		 * change the sample period 'on the fly'.
 		 */
-		update_watchdog_all_cpus();
+		err = update_watchdog_all_cpus();
+
+		if (err) {
+			watchdog_disable_all_cpus();
+			pr_err("Failed to update lockup detectors, disabled\n");
+		}
 	}
 
+	if (err)
+		watchdog_enabled = 0;
+
 	return err;
 }
 
@@ -851,12 +866,13 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
 		} while (cmpxchg(&watchdog_enabled, old, new) != old);
 
 		/*
-		 * Update the run state of the lockup detectors.
-		 * Restore 'watchdog_enabled' on failure.
+		 * Update the run state of the lockup detectors. There is _no_
+		 * need to check the value returned by proc_watchdog_update()
+		 * and to restore the previous value of 'watchdog_enabled' as
+		 * both lockup detectors are disabled if proc_watchdog_update()
+		 * returns an error.
 		 */
 		err = proc_watchdog_update();
-		if (err)
-			watchdog_enabled = old;
 	}
 out:
 	mutex_unlock(&watchdog_proc_mutex);
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend()
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
                   ` (2 preceding siblings ...)
  2015-09-28 20:44 ` [PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers Ulrich Obergfell
@ 2015-09-28 20:44 ` Ulrich Obergfell
  2015-09-29  6:41   ` Aaron Tomlin
  2015-09-28 20:44 ` [PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error Ulrich Obergfell
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-28 20:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, dzickus, atomlin, uobergfe

lockup_detector_suspend() now handles errors from watchdog_park_threads().

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
---
 kernel/watchdog.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 457113c..3bc22a9 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -707,6 +707,11 @@ int lockup_detector_suspend(void)
 
 	if (ret == 0)
 		watchdog_suspended++;
+	else {
+		watchdog_disable_all_cpus();
+		pr_err("Failed to suspend lockup detectors, disabled\n");
+		watchdog_enabled = 0;
+	}
 
 	mutex_unlock(&watchdog_proc_mutex);
 
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
                   ` (3 preceding siblings ...)
  2015-09-28 20:44 ` [PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend() Ulrich Obergfell
@ 2015-09-28 20:44 ` Ulrich Obergfell
  2015-09-29  6:41   ` Aaron Tomlin
  2015-09-29 23:30 ` [PATCH 0/5] improve handling of errors returned by kthread_park() Andrew Morton
  2015-10-12 21:13 ` Don Zickus
  6 siblings, 1 reply; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-28 20:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, dzickus, atomlin, uobergfe

If kthread_park() returns an error, watchdog_park_threads() should not
blindly 'roll back' the already parked threads to the unparked state.
Instead leave it up to the callers to handle such errors appropriately
in their context. For example, it is redundant to unpark the threads
if the lockup detectors will soon be disabled by the callers anyway.

Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
---
 kernel/watchdog.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 3bc22a9..af70bf2 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -654,6 +654,12 @@ static struct smp_hotplug_thread watchdog_threads = {
 
 /*
  * park all watchdog threads that are specified in 'watchdog_cpumask'
+ *
+ * This function returns an error if kthread_park() of a watchdog thread
+ * fails. In this situation, the watchdog threads of some CPUs can already
+ * be parked and the watchdog threads of other CPUs can still be runnable.
+ * Callers are expected to handle this special condition as appropriate in
+ * their context.
  */
 static int watchdog_park_threads(void)
 {
@@ -665,10 +671,6 @@ static int watchdog_park_threads(void)
 		if (ret)
 			break;
 	}
-	if (ret) {
-		for_each_watchdog_cpu(cpu)
-			kthread_unpark(per_cpu(softlockup_watchdog, cpu));
-	}
 	put_online_cpus();
 
 	return ret;
-- 
1.7.11.7


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh()
  2015-09-28 20:44 ` [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh() Ulrich Obergfell
@ 2015-09-29  6:39   ` Aaron Tomlin
  0 siblings, 0 replies; 15+ messages in thread
From: Aaron Tomlin @ 2015-09-29  6:39 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, akpm, dzickus

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

On Mon 2015-09-28 22:44 +0200, Ulrich Obergfell wrote:
> Restore the previous value of watchdog_thresh _and_ sample_period
> if proc_watchdog_update() returns an error. The variables must be
> consistent to avoid false positives of the lockup detectors.
> 
> Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
> ---
>  kernel/watchdog.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 64ed1c3..cd9a504 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -914,13 +914,14 @@ int proc_watchdog_thresh(struct ctl_table *table, int write,
>  		goto out;
>  
>  	/*
> -	 * Update the sample period.
> -	 * Restore 'watchdog_thresh' on failure.
> +	 * Update the sample period. Restore on failure.
>  	 */
>  	set_sample_period();
>  	err = proc_watchdog_update();
> -	if (err)
> +	if (err) {
>  		watchdog_thresh = old;
> +		set_sample_period();
> +	}
>  out:
>  	mutex_unlock(&watchdog_proc_mutex);
>  	return err;

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef
  2015-09-28 20:44 ` [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef Ulrich Obergfell
@ 2015-09-29  6:40   ` Aaron Tomlin
  2015-09-29 23:37   ` Andrew Morton
  1 sibling, 0 replies; 15+ messages in thread
From: Aaron Tomlin @ 2015-09-29  6:40 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, akpm, dzickus

[-- Attachment #1: Type: text/plain, Size: 1412 bytes --]

On Mon 2015-09-28 22:44 +0200, Ulrich Obergfell wrote:
> It makes sense to place watchdog_{dis|enable}_all_cpus() outside of
> the ifdef so that _both_ are available even if CONFIG_SYSCTL is not
> defined.
> 
> Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
> ---
>  kernel/watchdog.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index cd9a504..eb9527c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -347,6 +347,9 @@ static void watchdog_interrupt_count(void)
>  static int watchdog_nmi_enable(unsigned int cpu);
>  static void watchdog_nmi_disable(unsigned int cpu);
>  
> +static int watchdog_enable_all_cpus(void);
> +static void watchdog_disable_all_cpus(void);
> +
>  /* watchdog kicker functions */
>  static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  {
> @@ -756,9 +759,6 @@ static int watchdog_enable_all_cpus(void)
>  	return err;
>  }
>  
> -/* prepare/enable/disable routines */
> -/* sysctl functions */
> -#ifdef CONFIG_SYSCTL
>  static void watchdog_disable_all_cpus(void)
>  {
>  	if (watchdog_running) {
> @@ -767,6 +767,8 @@ static void watchdog_disable_all_cpus(void)
>  	}
>  }
>  
> +#ifdef CONFIG_SYSCTL
> +
>  /*
>   * Update the run state of the lockup detectors.
>   */

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers
  2015-09-28 20:44 ` [PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers Ulrich Obergfell
@ 2015-09-29  6:40   ` Aaron Tomlin
  0 siblings, 0 replies; 15+ messages in thread
From: Aaron Tomlin @ 2015-09-29  6:40 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, akpm, dzickus

[-- Attachment #1: Type: text/plain, Size: 2278 bytes --]

On Mon 2015-09-28 22:44 +0200, Ulrich Obergfell wrote:
> update_watchdog_all_cpus() now passes errors from watchdog_park_threads()
> up to functions in the call chain. This allows watchdog_enable_all_cpus()
> and proc_watchdog_update() to handle such errors too.
> 
> Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
> ---
>  kernel/watchdog.c | 30 +++++++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index eb9527c..457113c 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -731,10 +731,17 @@ void lockup_detector_resume(void)
>  	mutex_unlock(&watchdog_proc_mutex);
>  }
>  
> -static void update_watchdog_all_cpus(void)
> +static int update_watchdog_all_cpus(void)
>  {
> -	watchdog_park_threads();
> +	int ret;
> +
> +	ret = watchdog_park_threads();
> +	if (ret)
> +		return ret;
> +
>  	watchdog_unpark_threads();
> +
> +	return 0;
>  }
>  
>  static int watchdog_enable_all_cpus(void)
> @@ -753,9 +760,17 @@ static int watchdog_enable_all_cpus(void)
>  		 * Enable/disable the lockup detectors or
>  		 * change the sample period 'on the fly'.
>  		 */
> -		update_watchdog_all_cpus();
> +		err = update_watchdog_all_cpus();
> +
> +		if (err) {
> +			watchdog_disable_all_cpus();
> +			pr_err("Failed to update lockup detectors, disabled\n");
> +		}
>  	}
>  
> +	if (err)
> +		watchdog_enabled = 0;
> +
>  	return err;
>  }
>  
> @@ -851,12 +866,13 @@ static int proc_watchdog_common(int which, struct ctl_table *table, int write,
>  		} while (cmpxchg(&watchdog_enabled, old, new) != old);
>  
>  		/*
> -		 * Update the run state of the lockup detectors.
> -		 * Restore 'watchdog_enabled' on failure.
> +		 * Update the run state of the lockup detectors. There is _no_
> +		 * need to check the value returned by proc_watchdog_update()
> +		 * and to restore the previous value of 'watchdog_enabled' as
> +		 * both lockup detectors are disabled if proc_watchdog_update()
> +		 * returns an error.
>  		 */
>  		err = proc_watchdog_update();
> -		if (err)
> -			watchdog_enabled = old;
>  	}
>  out:
>  	mutex_unlock(&watchdog_proc_mutex);

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend()
  2015-09-28 20:44 ` [PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend() Ulrich Obergfell
@ 2015-09-29  6:41   ` Aaron Tomlin
  0 siblings, 0 replies; 15+ messages in thread
From: Aaron Tomlin @ 2015-09-29  6:41 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, akpm, dzickus

[-- Attachment #1: Type: text/plain, Size: 772 bytes --]

On Mon 2015-09-28 22:44 +0200, Ulrich Obergfell wrote:
> lockup_detector_suspend() now handles errors from watchdog_park_threads().
> 
> Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
> ---
>  kernel/watchdog.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 457113c..3bc22a9 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -707,6 +707,11 @@ int lockup_detector_suspend(void)
>  
>  	if (ret == 0)
>  		watchdog_suspended++;
> +	else {
> +		watchdog_disable_all_cpus();
> +		pr_err("Failed to suspend lockup detectors, disabled\n");
> +		watchdog_enabled = 0;
> +	}
>  
>  	mutex_unlock(&watchdog_proc_mutex);
>  

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error
  2015-09-28 20:44 ` [PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error Ulrich Obergfell
@ 2015-09-29  6:41   ` Aaron Tomlin
  0 siblings, 0 replies; 15+ messages in thread
From: Aaron Tomlin @ 2015-09-29  6:41 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, akpm, dzickus

[-- Attachment #1: Type: text/plain, Size: 1597 bytes --]

On Mon 2015-09-28 22:44 +0200, Ulrich Obergfell wrote:
> If kthread_park() returns an error, watchdog_park_threads() should not
> blindly 'roll back' the already parked threads to the unparked state.
> Instead leave it up to the callers to handle such errors appropriately
> in their context. For example, it is redundant to unpark the threads
> if the lockup detectors will soon be disabled by the callers anyway.
> 
> Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
> ---
>  kernel/watchdog.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 3bc22a9..af70bf2 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -654,6 +654,12 @@ static struct smp_hotplug_thread watchdog_threads = {
>  
>  /*
>   * park all watchdog threads that are specified in 'watchdog_cpumask'
> + *
> + * This function returns an error if kthread_park() of a watchdog thread
> + * fails. In this situation, the watchdog threads of some CPUs can already
> + * be parked and the watchdog threads of other CPUs can still be runnable.
> + * Callers are expected to handle this special condition as appropriate in
> + * their context.
>   */
>  static int watchdog_park_threads(void)
>  {
> @@ -665,10 +671,6 @@ static int watchdog_park_threads(void)
>  		if (ret)
>  			break;
>  	}
> -	if (ret) {
> -		for_each_watchdog_cpu(cpu)
> -			kthread_unpark(per_cpu(softlockup_watchdog, cpu));
> -	}
>  	put_online_cpus();
>  
>  	return ret;

Reviewed-by: Aaron Tomlin <atomlin@redhat.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
                   ` (4 preceding siblings ...)
  2015-09-28 20:44 ` [PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error Ulrich Obergfell
@ 2015-09-29 23:30 ` Andrew Morton
  2015-09-30 10:54   ` Ulrich Obergfell
  2015-10-12 21:13 ` Don Zickus
  6 siblings, 1 reply; 15+ messages in thread
From: Andrew Morton @ 2015-09-29 23:30 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, dzickus, atomlin

On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell <uobergfe@redhat.com> wrote:

> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
> 
> - watchdog suspend/resume interface
> 
>     lockup_detector_suspend
>       watchdog_park_threads
> 
> - write to parameters in /proc/sys/kernel
> 
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
> 
> Instead of 'blindly' attempting to unpark the watchdog threads if a 
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains. Failure becomes visible to the
> user as follows:
> 
> - error messages from lockup_detector_suspend()
>                    or watchdog_enable_all_cpus()
> 
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
> 
> - the 'write' system call in the latter call chain returns an error
> 

hm, you made me look at kthread parking.  Why does it exist?  What is a
"parked" thread anyway, and how does it differ from, say, a sleeping
one?  The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the
patch added no useful documentation, sigh.

Anwyay...  what inspired this patchset?  Are you experiencing
kthread_park() failures in practice?  If so, what is causing them?  And
what is the user-visible effect of these failures?  This is all pretty
important context for such a patchset.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef
  2015-09-28 20:44 ` [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef Ulrich Obergfell
  2015-09-29  6:40   ` Aaron Tomlin
@ 2015-09-29 23:37   ` Andrew Morton
  1 sibling, 0 replies; 15+ messages in thread
From: Andrew Morton @ 2015-09-29 23:37 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, dzickus, atomlin

On Mon, 28 Sep 2015 22:44:09 +0200 Ulrich Obergfell <uobergfe@redhat.com> wrote:

> It makes sense to place watchdog_{dis|enable}_all_cpus() outside of
> the ifdef so that _both_ are available even if CONFIG_SYSCTL is not
> defined.

Not really.  With CONFIG_SYSCTL=n this change will cause a compile
warning and will bloat up vmlinux.

I'll rewrite the changelog to

: Move watchdog_disable_all_cpus() outside of the ifdef so that it is
: available if CONFIG_SYSCTL is not defined.  This is preparation for
: "watchdog: implement error handling in update_watchdog_all_cpus() and
: callers".



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
  2015-09-29 23:30 ` [PATCH 0/5] improve handling of errors returned by kthread_park() Andrew Morton
@ 2015-09-30 10:54   ` Ulrich Obergfell
  0 siblings, 0 replies; 15+ messages in thread
From: Ulrich Obergfell @ 2015-09-30 10:54 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-kernel, dzickus, atomlin


Andrew,

> ... what inspired this patchset?
> Are you experiencing kthread_park() failures in practice?

I did not experience kthread_park() failures in practice.  Looking at
watchdog_park_threads() from 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28
I realized that there is a theoretical corner case which would not be
handled well. Let's assume that kthread_park() would return an error
in the following flow of execution (the user changes watchdog_thresh).

  proc_watchdog_thresh
    set_sample_period()
    //
    // The watchdog_thresh and sample_period variable are now set to
    // the new value.
    //
    proc_watchdog_update
      watchdog_enable_all_cpus
        update_watchdog_all_cpus
          watchdog_park_threads

Let's say the system has eight CPUs and that kthread_park() failed to
park watchdog/4. In this example watchdog/0 .. watchdog/3 are already
parked and watchdog/5 .. watchdog/7 are not parked yet (we don't know
exactly what happened to watchdog/4). watchdog_park_threads() unparks
the threads if kthread_park() of one thread fails.

  for_each_watchdog_cpu(cpu) {
          ret = kthread_park(per_cpu(softlockup_watchdog, cpu));
          if (ret)
                  break;
  }
  if (ret) {
          for_each_watchdog_cpu(cpu)
                  kthread_unpark(per_cpu(softlockup_watchdog, cpu));
  }

watchdog/0 .. watchdog/3 will pick up the new watchdog_thresh value
when they are unparked (please see the watchdog_enable() function),
whereas watchdog/5 .. watchdog/7 will continue to use the old value
for the hard lockup detector and begin using the new value for the
soft lockup detector (kthread_unpark() sees watchdog/5 .. watchdog/7
in the unparked state, so it skips these threads). The inconsistency
which results from using different watchdog_thresh values can cause
unexpected behaviour of the lockup detectors (e.g. false positives).

The new error handling that is introduced by this patch set aims to
handle the above corner case in a better way (this was my original
motivation to come up with a patch set). However, I also think that
_if_ kthread_park() would ever be changed in the future so that it
could return errors under various (other) conditions, the patch set
should prepare the watchdog code for this possibility.

Since I did not experience kthread_park() failures in practice, I
used some instrumentation to fake error returns from kthread_park()
in order to test the patches.


Regards,

Uli


----- Original Message -----
From: "Andrew Morton" <akpm@linux-foundation.org>
To: "Ulrich Obergfell" <uobergfe@redhat.com>
Cc: linux-kernel@vger.kernel.org, dzickus@redhat.com, atomlin@redhat.com
Sent: Wednesday, September 30, 2015 1:30:36 AM
Subject: Re: [PATCH 0/5] improve handling of errors returned by kthread_park()

On Mon, 28 Sep 2015 22:44:07 +0200 Ulrich Obergfell <uobergfe@redhat.com> wrote:

> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:
> 
> - watchdog suspend/resume interface
> 
>     lockup_detector_suspend
>       watchdog_park_threads
> 
> - write to parameters in /proc/sys/kernel
> 
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
> 
> Instead of 'blindly' attempting to unpark the watchdog threads if a 
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains. Failure becomes visible to the
> user as follows:
> 
> - error messages from lockup_detector_suspend()
>                    or watchdog_enable_all_cpus()
> 
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
> 
> - the 'write' system call in the latter call chain returns an error
> 

hm, you made me look at kthread parking.  Why does it exist?  What is a
"parked" thread anyway, and how does it differ from, say, a sleeping
one?  The 2a1d446019f9a5983ec5a335b changelog is pretty useless and the
patch added no useful documentation, sigh.

Anwyay...  what inspired this patchset?  Are you experiencing
kthread_park() failures in practice?  If so, what is causing them?  And
what is the user-visible effect of these failures?  This is all pretty
important context for such a patchset.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/5] improve handling of errors returned by kthread_park()
  2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
                   ` (5 preceding siblings ...)
  2015-09-29 23:30 ` [PATCH 0/5] improve handling of errors returned by kthread_park() Andrew Morton
@ 2015-10-12 21:13 ` Don Zickus
  6 siblings, 0 replies; 15+ messages in thread
From: Don Zickus @ 2015-10-12 21:13 UTC (permalink / raw)
  To: Ulrich Obergfell; +Cc: linux-kernel, akpm, atomlin

On Mon, Sep 28, 2015 at 10:44:07PM +0200, Ulrich Obergfell wrote:
> The original watchdog_park_threads() function that was introduced by
> commit 81a4beef91ba4a9e8ad6054ca9933dff7e25ff28 takes a very simple
> approach to handle errors returned by kthread_park(): It attempts to
> roll back all watchdog threads to the unparked state. However, this
> may be undesired behaviour from the perspective of the caller which
> may want to handle errors as appropriate in its specific context.
> Currently, there are two possible call chains:


I spent a little time cleaning up some of my test scripts for this code.
Things seem to work well.  Hardlockups and softlockups work along with
enabling/disabling the various lockup knobs.

Acked-by: Don Zickus <dzickus@redhat.com>

> 
> - watchdog suspend/resume interface
> 
>     lockup_detector_suspend
>       watchdog_park_threads
> 
> - write to parameters in /proc/sys/kernel
> 
>     proc_watchdog_update
>       watchdog_enable_all_cpus
>         update_watchdog_all_cpus
>           watchdog_park_threads
> 
> Instead of 'blindly' attempting to unpark the watchdog threads if a 
> kthread_park() call fails, the new approach is to disable the lockup
> detectors in the above call chains. Failure becomes visible to the
> user as follows:
> 
> - error messages from lockup_detector_suspend()
>                    or watchdog_enable_all_cpus()
> 
> - the state that can be read from /proc/sys/kernel/watchdog_enabled
> 
> - the 'write' system call in the latter call chain returns an error
> 
> Ulrich Obergfell (5):
>   watchdog: fix error handling in proc_watchdog_thresh()
>   watchdog: move watchdog_disable_all_cpus() outside of ifdef
>   watchdog: implement error handling in update_watchdog_all_cpus() and
>     callers
>   watchdog: implement error handling in lockup_detector_suspend()
>   watchdog: do not unpark threads in watchdog_park_threads() on error
> 
>  kernel/watchdog.c | 60 +++++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 43 insertions(+), 17 deletions(-)
> 
> -- 
> 1.7.11.7
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2015-10-12 21:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-09-28 20:44 [PATCH 0/5] improve handling of errors returned by kthread_park() Ulrich Obergfell
2015-09-28 20:44 ` [PATCH 1/5] watchdog: fix error handling in proc_watchdog_thresh() Ulrich Obergfell
2015-09-29  6:39   ` Aaron Tomlin
2015-09-28 20:44 ` [PATCH 2/5] watchdog: move watchdog_disable_all_cpus() outside of ifdef Ulrich Obergfell
2015-09-29  6:40   ` Aaron Tomlin
2015-09-29 23:37   ` Andrew Morton
2015-09-28 20:44 ` [PATCH 3/5] watchdog: implement error handling in update_watchdog_all_cpus() and callers Ulrich Obergfell
2015-09-29  6:40   ` Aaron Tomlin
2015-09-28 20:44 ` [PATCH 4/5] watchdog: implement error handling in lockup_detector_suspend() Ulrich Obergfell
2015-09-29  6:41   ` Aaron Tomlin
2015-09-28 20:44 ` [PATCH 5/5] watchdog: do not unpark threads in watchdog_park_threads() on error Ulrich Obergfell
2015-09-29  6:41   ` Aaron Tomlin
2015-09-29 23:30 ` [PATCH 0/5] improve handling of errors returned by kthread_park() Andrew Morton
2015-09-30 10:54   ` Ulrich Obergfell
2015-10-12 21:13 ` Don Zickus

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.