linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 1/2] RFC: hung_task: taint kernel
@ 2019-05-02 19:42 Daniel Vetter
  2019-05-02 19:42 ` [PATCH 2/2] RFC: soft/hardlookup: " Daniel Vetter
  2019-05-02 20:46 ` [PATCH] RFC: hung_task: " Daniel Vetter
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Vetter @ 2019-05-02 19:42 UTC (permalink / raw)
  To: Intel Graphics Development
  Cc: LKML, Daniel Vetter, Daniel Vetter, Andrew Morton, Tetsuo Handa,
	Dmitry Vyukov, Paul E. McKenney, Valdis Kletnieks,
	Vitaly Kuznetsov, Liu, Chuansheng

There's the hung_task_panic sysctl, but that's a bit an extreme measure.
As a fallback taint at least the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Liu, Chuansheng" <chuansheng.liu@intel.com>
---
 kernel/hung_task.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f108a95882c6..7fae16f1b49c 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -203,6 +203,8 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
 	if (hung_task_call_panic) {
 		trigger_all_cpu_backtrace();
 		panic("hung_task: blocked tasks");
+	} else {
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
 	}
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH 2/2] RFC: soft/hardlookup: taint kernel
  2019-05-02 19:42 [PATCH 1/2] RFC: hung_task: taint kernel Daniel Vetter
@ 2019-05-02 19:42 ` Daniel Vetter
  2019-05-03  0:00   ` Laurence Oberman
  2019-05-09  9:24   ` Sergey Senozhatsky
  2019-05-02 20:46 ` [PATCH] RFC: hung_task: " Daniel Vetter
  1 sibling, 2 replies; 8+ messages in thread
From: Daniel Vetter @ 2019-05-02 19:42 UTC (permalink / raw)
  To: Intel Graphics Development
  Cc: LKML, Daniel Vetter, Daniel Vetter, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Valdis Kletnieks, Laurence Oberman,
	Vincent Whitchurch, Don Zickus, Andrew Morton,
	Sergey Senozhatsky, Sinan Kaya

There's the soft/hardlookup_panic sysctls, but that's a bit an extreme
measure. As a fallback taint at least the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Sinan Kaya <okaya@kernel.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 kernel/watchdog.c     | 2 ++
 kernel/watchdog_hld.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6a5787233113..de7a60503517 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -469,6 +469,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
 		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
 		if (softlockup_panic)
 			panic("softlockup: hung tasks");
+		else
+			add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
 		__this_cpu_write(soft_watchdog_warn, true);
 	} else
 		__this_cpu_write(soft_watchdog_warn, false);
diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
index 247bf0b1582c..cce46cf75d76 100644
--- a/kernel/watchdog_hld.c
+++ b/kernel/watchdog_hld.c
@@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
 
 		if (hardlockup_panic)
 			nmi_panic(regs, "Hard LOCKUP");
+		else
+			add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
 
 		__this_cpu_write(hard_watchdog_warn, true);
 		return;
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH] RFC: hung_task: taint kernel
  2019-05-02 19:42 [PATCH 1/2] RFC: hung_task: taint kernel Daniel Vetter
  2019-05-02 19:42 ` [PATCH 2/2] RFC: soft/hardlookup: " Daniel Vetter
@ 2019-05-02 20:46 ` Daniel Vetter
  2019-05-03  0:47   ` Tetsuo Handa
  1 sibling, 1 reply; 8+ messages in thread
From: Daniel Vetter @ 2019-05-02 20:46 UTC (permalink / raw)
  To: Intel Graphics Development
  Cc: LKML, Daniel Vetter, Daniel Vetter, Andrew Morton, Tetsuo Handa,
	Dmitry Vyukov, Paul E. McKenney, Valdis Kletnieks,
	Vitaly Kuznetsov, Liu, Chuansheng

There's the hung_task_panic sysctl, but that's a bit an extreme measure.
As a fallback taint at least the machine.

Our CI uses this to decide when a reboot is necessary, plus to figure
out whether the kernel is still happy.

v2: Works much better when I put the else { add_taint() } at the right
place.

Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Liu, Chuansheng" <chuansheng.liu@intel.com>
---
 kernel/hung_task.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index f108a95882c6..d90d98f53ccb 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -117,6 +117,8 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout)
 		console_verbose();
 		hung_task_show_lock = true;
 		hung_task_call_panic = true;
+	} else {
+		add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
 	}
 
 	/*
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] RFC: soft/hardlookup: taint kernel
  2019-05-02 19:42 ` [PATCH 2/2] RFC: soft/hardlookup: " Daniel Vetter
@ 2019-05-03  0:00   ` Laurence Oberman
  2019-05-09  9:24   ` Sergey Senozhatsky
  1 sibling, 0 replies; 8+ messages in thread
From: Laurence Oberman @ 2019-05-03  0:00 UTC (permalink / raw)
  To: Daniel Vetter, Intel Graphics Development
  Cc: LKML, Daniel Vetter, Thomas Gleixner, Ingo Molnar,
	Peter Zijlstra, Valdis Kletnieks, Vincent Whitchurch, Don Zickus,
	Andrew Morton, Sergey Senozhatsky, Sinan Kaya

On Thu, 2019-05-02 at 21:42 +0200, Daniel Vetter wrote:
> There's the soft/hardlookup_panic sysctls, but that's a bit an
> extreme
> measure. As a fallback taint at least the machine.
> 
> Our CI uses this to decide when a reboot is necessary, plus to figure
> out whether the kernel is still happy.
> 
> Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Ingo Molnar <mingo@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Valdis Kletnieks <valdis.kletnieks@vt.edu>
> Cc: Laurence Oberman <loberman@redhat.com>
> Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
> Cc: Don Zickus <dzickus@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> Cc: Sinan Kaya <okaya@kernel.org>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  kernel/watchdog.c     | 2 ++
>  kernel/watchdog_hld.c | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 6a5787233113..de7a60503517 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -469,6 +469,8 @@ static enum hrtimer_restart
> watchdog_timer_fn(struct hrtimer *hrtimer)
>  		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
>  		if (softlockup_panic)
>  			panic("softlockup: hung tasks");
> +		else
> +			add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>  		__this_cpu_write(soft_watchdog_warn, true);
>  	} else
>  		__this_cpu_write(soft_watchdog_warn, false);
> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
> index 247bf0b1582c..cce46cf75d76 100644
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct
> perf_event *event,
>  
>  		if (hardlockup_panic)
>  			nmi_panic(regs, "Hard LOCKUP");
> +		else
> +			add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>  
>  		__this_cpu_write(hard_watchdog_warn, true);
>  		return;

This looks OK to me, could be useful to know we would have triggered
had the flags been set.

Reviewed-by: Laurence Oberman <loberman@redhat.com>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] RFC: hung_task: taint kernel
  2019-05-02 20:46 ` [PATCH] RFC: hung_task: " Daniel Vetter
@ 2019-05-03  0:47   ` Tetsuo Handa
  2019-05-03  8:51     ` Daniel Vetter
  0 siblings, 1 reply; 8+ messages in thread
From: Tetsuo Handa @ 2019-05-03  0:47 UTC (permalink / raw)
  To: Daniel Vetter, Intel Graphics Development
  Cc: LKML, Daniel Vetter, Andrew Morton, Dmitry Vyukov,
	Paul E. McKenney, Valdis Kletnieks, Vitaly Kuznetsov, Liu,
	Chuansheng

On 2019/05/03 5:46, Daniel Vetter wrote:
> There's the hung_task_panic sysctl, but that's a bit an extreme measure.
> As a fallback taint at least the machine.
> 
> Our CI uses this to decide when a reboot is necessary, plus to figure
> out whether the kernel is still happy.

Why your CI can't watch for "blocked for more than" message instead of
setting the taint flag? How does your CI decide a reboot is necessary?

There is no need to set the tainted flag when some task was just blocked
for a while. It might be due to memory pressure, it might be due to setting
very short timeout (e.g. a few seconds), it might be due to busy CPUs doing
something else...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] RFC: hung_task: taint kernel
  2019-05-03  0:47   ` Tetsuo Handa
@ 2019-05-03  8:51     ` Daniel Vetter
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Vetter @ 2019-05-03  8:51 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Daniel Vetter, Intel Graphics Development, LKML, Daniel Vetter,
	Andrew Morton, Dmitry Vyukov, Paul E. McKenney, Valdis Kletnieks,
	Vitaly Kuznetsov, Liu, Chuansheng

On Fri, May 03, 2019 at 09:47:03AM +0900, Tetsuo Handa wrote:
> On 2019/05/03 5:46, Daniel Vetter wrote:
> > There's the hung_task_panic sysctl, but that's a bit an extreme measure.
> > As a fallback taint at least the machine.
> > 
> > Our CI uses this to decide when a reboot is necessary, plus to figure
> > out whether the kernel is still happy.
> 
> Why your CI can't watch for "blocked for more than" message instead of
> setting the taint flag? How does your CI decide a reboot is necessary?

We spam an awful lot into dmesg, and at least historically had
occasionally trouble capturing it all (we're better than that now I
think). Plus the thing that parses dmesg isn't the thing that runs
testcases, hence why we started to use taint flags (or procfs lockdep
status) and similar things to check the kernel is still alive enough.

> There is no need to set the tainted flag when some task was just blocked
> for a while. It might be due to memory pressure, it might be due to setting
> very short timeout (e.g. a few seconds), it might be due to busy CPUs doing
> something else...

Yeah I realize that this probably doesn't have much use outside of our CI,
but maybe there's someone how likes the idea.

Wrt spurious taints: You can disable the hung_tasks checker outright,
which also stops the tainting.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] RFC: soft/hardlookup: taint kernel
  2019-05-02 19:42 ` [PATCH 2/2] RFC: soft/hardlookup: " Daniel Vetter
  2019-05-03  0:00   ` Laurence Oberman
@ 2019-05-09  9:24   ` Sergey Senozhatsky
  2019-05-09 10:16     ` Daniel Vetter
  1 sibling, 1 reply; 8+ messages in thread
From: Sergey Senozhatsky @ 2019-05-09  9:24 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Intel Graphics Development, LKML, Daniel Vetter, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Valdis Kletnieks, Laurence Oberman,
	Vincent Whitchurch, Don Zickus, Andrew Morton,
	Sergey Senozhatsky, Sinan Kaya

On (05/02/19 21:42), Daniel Vetter wrote:
[..]
> @@ -469,6 +469,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
>  		add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
>  		if (softlockup_panic)
>  			panic("softlockup: hung tasks");
> +		else
> +			add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>  		__this_cpu_write(soft_watchdog_warn, true);

Soft lockup sets TAINT_SOFTLOCKUP bit. Would it be enough for your CI?

[..]
> @@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
>
>  		if (hardlockup_panic)
>  			nmi_panic(regs, "Hard LOCKUP");
> +		else
> +			add_taint(TAINT_WARN, LOCKDEP_STILL_OK);

Maybe you can mirror what soft lockup does. Add a HARDLOCKUP taint bit

+++ b/include/linux/kernel.h
@@ -571,7 +571,8 @@ extern enum system_states {
 #define TAINT_LIVEPATCH                        15
 #define TAINT_AUX                      16
 #define TAINT_RANDSTRUCT               17
-#define TAINT_FLAGS_COUNT              18
+#define TAINT_HARDLOCKUP               18
+#define TAINT_FLAGS_COUNT              19

and then set TAINT_HARDLOCKUP in watchdog_overflow_callback().

Just a small idea, I'll leave this to more experienced people.

	-ss

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] RFC: soft/hardlookup: taint kernel
  2019-05-09  9:24   ` Sergey Senozhatsky
@ 2019-05-09 10:16     ` Daniel Vetter
  0 siblings, 0 replies; 8+ messages in thread
From: Daniel Vetter @ 2019-05-09 10:16 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Intel Graphics Development, LKML, Daniel Vetter, Thomas Gleixner,
	Ingo Molnar, Peter Zijlstra, Valdis Kletnieks, Laurence Oberman,
	Vincent Whitchurch, Don Zickus, Andrew Morton, Sinan Kaya

On Thu, May 9, 2019 at 11:24 AM Sergey Senozhatsky
<sergey.senozhatsky.work@gmail.com> wrote:
>
> On (05/02/19 21:42), Daniel Vetter wrote:
> [..]
> > @@ -469,6 +469,8 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer)
> >               add_taint(TAINT_SOFTLOCKUP, LOCKDEP_STILL_OK);
> >               if (softlockup_panic)
> >                       panic("softlockup: hung tasks");
> > +             else
> > +                     add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
> >               __this_cpu_write(soft_watchdog_warn, true);
>
> Soft lockup sets TAINT_SOFTLOCKUP bit. Would it be enough for your CI?

I'm blind :-/ Yes this is totally useful.

> [..]
> > @@ -154,6 +154,8 @@ static void watchdog_overflow_callback(struct perf_event *event,
> >
> >               if (hardlockup_panic)
> >                       nmi_panic(regs, "Hard LOCKUP");
> > +             else
> > +                     add_taint(TAINT_WARN, LOCKDEP_STILL_OK);
>
> Maybe you can mirror what soft lockup does. Add a HARDLOCKUP taint bit

We'd also want a taint for hung tasks (separate patch, same idea), not
sure it's a good idea to use a new taint bit for all of them? Atm we
don't check for all taint bits (some of them are set because of things
our testcases do, like module reload or setting unsafe kernel options
meant for testing only, so picking one of the bits we check already
was least resistance.

> +++ b/include/linux/kernel.h
> @@ -571,7 +571,8 @@ extern enum system_states {
>  #define TAINT_LIVEPATCH                        15
>  #define TAINT_AUX                      16
>  #define TAINT_RANDSTRUCT               17
> -#define TAINT_FLAGS_COUNT              18
> +#define TAINT_HARDLOCKUP               18
> +#define TAINT_FLAGS_COUNT              19
>
> and then set TAINT_HARDLOCKUP in watchdog_overflow_callback().
>
> Just a small idea, I'll leave this to more experienced people.

The hung_tasks taint wasn't all that positively received, I feels like
this will stay a hack private to our CI. Except if someone else pipes
up who wants this, then I'm happy to polish.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-05-09 10:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-02 19:42 [PATCH 1/2] RFC: hung_task: taint kernel Daniel Vetter
2019-05-02 19:42 ` [PATCH 2/2] RFC: soft/hardlookup: " Daniel Vetter
2019-05-03  0:00   ` Laurence Oberman
2019-05-09  9:24   ` Sergey Senozhatsky
2019-05-09 10:16     ` Daniel Vetter
2019-05-02 20:46 ` [PATCH] RFC: hung_task: " Daniel Vetter
2019-05-03  0:47   ` Tetsuo Handa
2019-05-03  8:51     ` Daniel Vetter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).