All of lore.kernel.org
 help / color / mirror / Atom feed
* Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog
@ 2012-12-14  9:33 Bjørn Mork
  2012-12-14  9:34 ` [PATCH] Revert "watchdog: Fix CPU hotplug regression" Bjørn Mork
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Bjørn Mork @ 2012-12-14  9:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: Norbert Warmuth, Joseph Salisbury, Thomas Gleixner

Hello,

I was a bit surprised by v3.7 hanging "all the time", usually just
freezing but sometimes with a stack dump pointing to timerqueue_del
calling rb_erase and oopsing there. Of course I could never make it dump
anything when prepared to capture it via netconsole, so the best I have
is this picture: http://www.mork.no/~bjorn/20121213_203557.jpg

I discovered that the freeze was related to plugging AC power and
started bisecting from v3.7-rc3, which I had been running for quite a
while and therefore was confident did not have this problem.  The bisect
log ended up as this:

bjorn@canardo:/usr/local/src/build-tmp/linux$ git bisect log
# bad: [29594404d7fe73cd80eaa4ee8c43dcc53970c60e] Linux 3.7
# good: [8f0d8163b50e01f398b14bcd4dc039ac5ab18d64] Linux 3.7-rc3
git bisect start 'v3.7' 'v3.7-rc3'
# good: [29282fde80d44e587f8c152b10049a56e61659f0] KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()
git bisect good 29282fde80d44e587f8c152b10049a56e61659f0
# good: [2654ad44b5f7a5f1b12d722a37d9b9df69d57899] Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 2654ad44b5f7a5f1b12d722a37d9b9df69d57899
# good: [086486e46e4206cfa1140fb9682ad67c8a4502fb] Merge branch 'for-linus' of git://git.samba.org/sfrench/cifs-2.6
git bisect good 086486e46e4206cfa1140fb9682ad67c8a4502fb
# good: [25a3bc6bd1ca03ab504b8c55c98f8d0135644d53] [parisc] open(2) compat bug
git bisect good 25a3bc6bd1ca03ab504b8c55c98f8d0135644d53
# bad: [cfd1f032f98e5ab3a04f23a0adbd53ff8744827d] Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad cfd1f032f98e5ab3a04f23a0adbd53ff8744827d
# good: [b69f0859dc8e633c5d8c06845811588fe17e68b3] Linux 3.7-rc8
git bisect good b69f0859dc8e633c5d8c06845811588fe17e68b3
# good: [ca50496eb487b639de1f502e77a48dde84152fb9] Merge branch 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
git bisect good ca50496eb487b639de1f502e77a48dde84152fb9
# good: [70dcc535bdf30ffaef58b867fbde45c0287f34c6] Merge tag 'upstream-3.7-rc9' of git://git.infradead.org/linux-ubi
git bisect good 70dcc535bdf30ffaef58b867fbde45c0287f34c6
# good: [df2fc246c8ee8b6067af1fa55d3bc23107457f61] Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
git bisect good df2fc246c8ee8b6067af1fa55d3bc23107457f61
# bad: [8d4516904b39507458bee8115793528e12b1d8dd] watchdog: Fix CPU hotplug regression
git bisect bad 8d4516904b39507458bee8115793528e12b1d8dd


and I can confirm that reverting 8d451690 does fix the problem.  After
having found this, it became clearer to me why the freeze was related to
plugging AC power: I use laptop-mode-tools in its default Debian
configuration, which will disable the nmi_watchdog on battery power and
enable it on AC power.

So the simple test case which will hang my (and presumably also your) PC
on plain v3.7 is:

 echo 0 > /proc/sys/kernel/nmi_watchdog
 echo 1 > /proc/sys/kernel/nmi_watchdog

without any cmdline nmi_watchdog parameters, i.e. enabled.

I'll followup with a proposed revert patch, but I guess you will want to
come up with something which fixes the original problem as well.  But
given the serious effect of the bug, causing a 100% reproducable freeze
on a standard distro laptop setup, I hope that the revert will go into
v3.7.1 unless a real fix can be prepared in time.


Thanks,
Bjørn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH] Revert "watchdog: Fix CPU hotplug regression"
  2012-12-14  9:33 Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Bjørn Mork
@ 2012-12-14  9:34 ` Bjørn Mork
  2012-12-14 13:44 ` [PATCH v2] watchdog: Fix disable/enable regression Bjørn Mork
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Bjørn Mork @ 2012-12-14  9:34 UTC (permalink / raw)
  To: linux-kernel
  Cc: Bjørn Mork, Norbert Warmuth, Joseph Salisbury, Thomas Gleixner

This reverts commit 8d4516904b39507458bee8115793528e12b1d8dd.

commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
a hard lockup when I connect AC power on my laptop.  The indirect
reason is that laptop-mode-tools is configured to do

  echo 0 > /proc/sys/kernel/nmi_watchdog

when AC power is disconnected and

  echo 1 > /proc/sys/kernel/nmi_watchdog

when AC power is connected.  This is the default configuration of
the Debian laptop-mode-tools package.

Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
 kernel/watchdog.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index c8c21be..dd4b80a 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -368,9 +368,6 @@ static void watchdog_disable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
-	if (!watchdog_enabled)
-		return;
-
 	watchdog_set_prio(SCHED_NORMAL, 0);
 	hrtimer_cancel(hrtimer);
 	/* disable the perf event */
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH v2] watchdog: Fix disable/enable regression
  2012-12-14  9:33 Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Bjørn Mork
  2012-12-14  9:34 ` [PATCH] Revert "watchdog: Fix CPU hotplug regression" Bjørn Mork
@ 2012-12-14 13:44 ` Bjørn Mork
  2012-12-16 19:41 ` Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Maciej Rutecki
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: Bjørn Mork @ 2012-12-14 13:44 UTC (permalink / raw)
  To: linux-kernel
  Cc: Bjørn Mork, Norbert Warmuth, Joseph Salisbury, Thomas Gleixner

commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
an oops or hard lockup when doing

 echo 0 > /proc/sys/kernel/nmi_watchdog
 echo 1 > /proc/sys/kernel/nmi_watchdog

and the kernel is booted with nmi_watchdog=1 (default)

Running laptop-mode-tools and disconnecting/connecting AC power
will cause this to trigger, making it a common failure scenario
on laptops.

Instead of bailing out of watchdog_disable() when !watchdog_enabled
we can initialize the hrtimer regardless of watchdog_enabled status.
This makes it safe to call watchdog_disable() in the nmi_watchdog=0
case, without the negative effect on the enabled => disabled =>
enabled case.

All these tests pass with this patch:
- nmi_watchdog=1
  echo 0 > /proc/sys/kernel/nmi_watchdog
  echo 1 > /proc/sys/kernel/nmi_watchdog

- nmi_watchdog=0
  echo 0 > /sys/devices/system/cpu/cpu1/online

- nmi_watchdog=0
  echo mem > /sys/power/state

Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
v2:
  implemented an alternate workaround for the original problem.

Hello Thomas,

I will not claim to understand this code, but it seemed to me like the
original problem was caused by the missing initialization of the hrtimer
in the disabled case. Calling hrtimer_cancel() on an initialized timer
not running should be perfectly OK.  And watchdog_nmi_disable() will 
not do anything unless the event is initialized.  So this patch looks
like a fix.


Bjørn

 kernel/watchdog.c |   11 ++++-------
 1 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index c8c21be..762081c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -343,6 +343,10 @@ static void watchdog_enable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
+	/* kick off the timer for the hardlockup detector */
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer->function = watchdog_timer_fn;
+
 	if (!watchdog_enabled) {
 		kthread_park(current);
 		return;
@@ -351,10 +355,6 @@ static void watchdog_enable(unsigned int cpu)
 	/* Enable the perf event */
 	watchdog_nmi_enable(cpu);
 
-	/* kick off the timer for the hardlockup detector */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hrtimer->function = watchdog_timer_fn;
-
 	/* done here because hrtimer_start can only pin to smp_processor_id() */
 	hrtimer_start(hrtimer, ns_to_ktime(get_sample_period()),
 		      HRTIMER_MODE_REL_PINNED);
@@ -368,9 +368,6 @@ static void watchdog_disable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
-	if (!watchdog_enabled)
-		return;
-
 	watchdog_set_prio(SCHED_NORMAL, 0);
 	hrtimer_cancel(hrtimer);
 	/* disable the perf event */
-- 
1.7.2.5


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog
  2012-12-14  9:33 Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Bjørn Mork
  2012-12-14  9:34 ` [PATCH] Revert "watchdog: Fix CPU hotplug regression" Bjørn Mork
  2012-12-14 13:44 ` [PATCH v2] watchdog: Fix disable/enable regression Bjørn Mork
@ 2012-12-16 19:41 ` Maciej Rutecki
  2012-12-18  8:13 ` [regression][PATCH v3] watchdog: Fix disable/enable regression Bjørn Mork
  2012-12-19 19:51 ` [RESEND][PATCH " Bjørn Mork
  4 siblings, 0 replies; 10+ messages in thread
From: Maciej Rutecki @ 2012-12-16 19:41 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: linux-kernel, Norbert Warmuth, Joseph Salisbury, Thomas Gleixner

On piątek, 14 grudnia 2012 o 10:33:33 Bjørn Mork wrote:
> Hello,
> 
> I was a bit surprised by v3.7 hanging "all the time", usually just
> freezing but sometimes with a stack dump pointing to timerqueue_del
> calling rb_erase and oopsing there. Of course I could never make it dump
> anything when prepared to capture it via netconsole, so the best I have
> is this picture: http://www.mork.no/~bjorn/20121213_203557.jpg
> 
> I discovered that the freeze was related to plugging AC power and
> started bisecting from v3.7-rc3, which I had been running for quite a
> while and therefore was confident did not have this problem.  The bisect
> log ended up as this:
> 
> bjorn@canardo:/usr/local/src/build-tmp/linux$ git bisect log
> # bad: [29594404d7fe73cd80eaa4ee8c43dcc53970c60e] Linux 3.7
> # good: [8f0d8163b50e01f398b14bcd4dc039ac5ab18d64] Linux 3.7-rc3
> git bisect start 'v3.7' 'v3.7-rc3'
> # good: [29282fde80d44e587f8c152b10049a56e61659f0] KVM: x86: Fix invalid
> secondary exec controls in vmx_cpuid_update() git bisect good
> 29282fde80d44e587f8c152b10049a56e61659f0
> # good: [2654ad44b5f7a5f1b12d722a37d9b9df69d57899] Merge branch
> 'x86-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect good
> 2654ad44b5f7a5f1b12d722a37d9b9df69d57899
> # good: [086486e46e4206cfa1140fb9682ad67c8a4502fb] Merge branch 'for-linus'
> of git://git.samba.org/sfrench/cifs-2.6 git bisect good
> 086486e46e4206cfa1140fb9682ad67c8a4502fb
> # good: [25a3bc6bd1ca03ab504b8c55c98f8d0135644d53] [parisc] open(2) compat
> bug git bisect good 25a3bc6bd1ca03ab504b8c55c98f8d0135644d53
> # bad: [cfd1f032f98e5ab3a04f23a0adbd53ff8744827d] Merge branch
> 'core-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip git bisect bad
> cfd1f032f98e5ab3a04f23a0adbd53ff8744827d
> # good: [b69f0859dc8e633c5d8c06845811588fe17e68b3] Linux 3.7-rc8
> git bisect good b69f0859dc8e633c5d8c06845811588fe17e68b3
> # good: [ca50496eb487b639de1f502e77a48dde84152fb9] Merge branch
> 'for-3.7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq git
> bisect good ca50496eb487b639de1f502e77a48dde84152fb9
> # good: [70dcc535bdf30ffaef58b867fbde45c0287f34c6] Merge tag
> 'upstream-3.7-rc9' of git://git.infradead.org/linux-ubi git bisect good
> 70dcc535bdf30ffaef58b867fbde45c0287f34c6
> # good: [df2fc246c8ee8b6067af1fa55d3bc23107457f61] Merge branch 'fixes' of
> git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux git bisect good
> df2fc246c8ee8b6067af1fa55d3bc23107457f61
> # bad: [8d4516904b39507458bee8115793528e12b1d8dd] watchdog: Fix CPU hotplug
> regression git bisect bad 8d4516904b39507458bee8115793528e12b1d8dd
> 
> 
> and I can confirm that reverting 8d451690 does fix the problem.  After
> having found this, it became clearer to me why the freeze was related to
> plugging AC power: I use laptop-mode-tools in its default Debian
> configuration, which will disable the nmi_watchdog on battery power and
> enable it on AC power.
> 
> So the simple test case which will hang my (and presumably also your) PC
> on plain v3.7 is:
> 
>  echo 0 > /proc/sys/kernel/nmi_watchdog
>  echo 1 > /proc/sys/kernel/nmi_watchdog
> 
> without any cmdline nmi_watchdog parameters, i.e. enabled.
> 
> I'll followup with a proposed revert patch, but I guess you will want to
> come up with something which fixes the original problem as well.  But
> given the serious effect of the bug, causing a 100% reproducable freeze
> on a standard distro laptop setup, I hope that the revert will go into
> v3.7.1 unless a real fix can be prepared in time.
> 
> 
> Thanks,
> Bjørn

Confirm similar behaviour (I didn't do bisection): kernel sometimes opps when 
plug in AC adapter.

Regards

-- 
Maciej Rutecki
http://www.mrutecki.pl

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [regression][PATCH v3] watchdog: Fix disable/enable regression
  2012-12-14  9:33 Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Bjørn Mork
                   ` (2 preceding siblings ...)
  2012-12-16 19:41 ` Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Maciej Rutecki
@ 2012-12-18  8:13 ` Bjørn Mork
  2012-12-19 19:51 ` [RESEND][PATCH " Bjørn Mork
  4 siblings, 0 replies; 10+ messages in thread
From: Bjørn Mork @ 2012-12-18  8:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Don Zickus, linux-kernel, Bjørn Mork, Norbert Warmuth,
	Joseph Salisbury, Thomas Gleixner

commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
an oops or hard lockup when doing

 echo 0 > /proc/sys/kernel/nmi_watchdog
 echo 1 > /proc/sys/kernel/nmi_watchdog

and the kernel is booted with nmi_watchdog=1 (default)

Running laptop-mode-tools and disconnecting/connecting AC power
will cause this to trigger, making it a common failure scenario
on laptops.

Instead of bailing out of watchdog_disable() when !watchdog_enabled
we can initialize the hrtimer regardless of watchdog_enabled status.
This makes it safe to call watchdog_disable() in the nmi_watchdog=0
case, without the negative effect on the enabled => disabled =>
enabled case.

All these tests pass with this patch:
- nmi_watchdog=1
  echo 0 > /proc/sys/kernel/nmi_watchdog
  echo 1 > /proc/sys/kernel/nmi_watchdog

- nmi_watchdog=0
  echo 0 > /sys/devices/system/cpu/cpu1/online

- nmi_watchdog=0
  echo mem > /sys/power/state

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661

Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
v3:
  added Bugzilla reference and additional recipients
  rebased on current mainline
v2:
  implemented an alternate workaround for the original problem.

Hello,

Sorry for nagging, but this patch or some other fix should be applied
to mainline ASAP so it can be included in the 3.7 stable series. 3.7.0
and 3.7.1 dies when plugging AC power on a large number of laptop
systems.

I will not claim to understand this code, but it seemed to me like the
original problem was caused by the missing initialization of the hrtimer
in the disabled case. Calling hrtimer_cancel() on an initialized timer
not running should be perfectly OK.  And watchdog_nmi_disable() will 
not do anything unless the event is initialized.  So this patch looks
like a fix.

At least it survives both the original test cases and the post v3.7-rc8
regression test case.


Bjørn

 kernel/watchdog.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 997c6a1..75a2ab3 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -344,6 +344,10 @@ static void watchdog_enable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
+	/* kick off the timer for the hardlockup detector */
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer->function = watchdog_timer_fn;
+
 	if (!watchdog_enabled) {
 		kthread_park(current);
 		return;
@@ -352,10 +356,6 @@ static void watchdog_enable(unsigned int cpu)
 	/* Enable the perf event */
 	watchdog_nmi_enable(cpu);
 
-	/* kick off the timer for the hardlockup detector */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hrtimer->function = watchdog_timer_fn;
-
 	/* done here because hrtimer_start can only pin to smp_processor_id() */
 	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
 		      HRTIMER_MODE_REL_PINNED);
@@ -369,9 +369,6 @@ static void watchdog_disable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
-	if (!watchdog_enabled)
-		return;
-
 	watchdog_set_prio(SCHED_NORMAL, 0);
 	hrtimer_cancel(hrtimer);
 	/* disable the perf event */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [RESEND][PATCH v3] watchdog: Fix disable/enable regression
  2012-12-14  9:33 Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Bjørn Mork
                   ` (3 preceding siblings ...)
  2012-12-18  8:13 ` [regression][PATCH v3] watchdog: Fix disable/enable regression Bjørn Mork
@ 2012-12-19 19:51 ` Bjørn Mork
  2012-12-19 20:13   ` Don Zickus
  4 siblings, 1 reply; 10+ messages in thread
From: Bjørn Mork @ 2012-12-19 19:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Don Zickus, linux-kernel, Bjørn Mork,
	Norbert Warmuth, Joseph Salisbury, Thomas Gleixner

commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
an oops or hard lockup when doing

 echo 0 > /proc/sys/kernel/nmi_watchdog
 echo 1 > /proc/sys/kernel/nmi_watchdog

and the kernel is booted with nmi_watchdog=1 (default)

Running laptop-mode-tools and disconnecting/connecting AC power
will cause this to trigger, making it a common failure scenario
on laptops.

Instead of bailing out of watchdog_disable() when !watchdog_enabled
we can initialize the hrtimer regardless of watchdog_enabled status.
This makes it safe to call watchdog_disable() in the nmi_watchdog=0
case, without the negative effect on the enabled => disabled =>
enabled case.

All these tests pass with this patch:
- nmi_watchdog=1
  echo 0 > /proc/sys/kernel/nmi_watchdog
  echo 1 > /proc/sys/kernel/nmi_watchdog

- nmi_watchdog=0
  echo 0 > /sys/devices/system/cpu/cpu1/online

- nmi_watchdog=0
  echo mem > /sys/power/state

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661

Cc: <stable@vger.kernel.org> # v3.7
Cc: Norbert Warmuth <nwarmuth@t-online.de>
Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Bjørn Mork <bjorn@mork.no>
---
Hello Linus,

This post v3.7-rc8 regression kills any machine with laptop-mode-tools
using default config, making it somewhat critical to get a fix into the
v3.7.x stable series ASAP.  I was hoping for some response from Thomas
or the reporters of the original bug, either verifying that my proposal
is OK or providing a better fix.  But I believe that this cannot wait
much longer.

Please apply.  Thanks,

Bjørn


Patch history:

v3:
  added Bugzilla reference and additional recipients
  rebased on current mainline
v2:
  implemented an alternate workaround for the original problem.
v1:
  plain revert of 8d451690


 kernel/watchdog.c |   11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 997c6a1..75a2ab3 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -344,6 +344,10 @@ static void watchdog_enable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
+	/* kick off the timer for the hardlockup detector */
+	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+	hrtimer->function = watchdog_timer_fn;
+
 	if (!watchdog_enabled) {
 		kthread_park(current);
 		return;
@@ -352,10 +356,6 @@ static void watchdog_enable(unsigned int cpu)
 	/* Enable the perf event */
 	watchdog_nmi_enable(cpu);
 
-	/* kick off the timer for the hardlockup detector */
-	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
-	hrtimer->function = watchdog_timer_fn;
-
 	/* done here because hrtimer_start can only pin to smp_processor_id() */
 	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
 		      HRTIMER_MODE_REL_PINNED);
@@ -369,9 +369,6 @@ static void watchdog_disable(unsigned int cpu)
 {
 	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
 
-	if (!watchdog_enabled)
-		return;
-
 	watchdog_set_prio(SCHED_NORMAL, 0);
 	hrtimer_cancel(hrtimer);
 	/* disable the perf event */
-- 
1.7.10.4


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [RESEND][PATCH v3] watchdog: Fix disable/enable regression
  2012-12-19 19:51 ` [RESEND][PATCH " Bjørn Mork
@ 2012-12-19 20:13   ` Don Zickus
  2012-12-19 21:17     ` Bjørn Mork
  0 siblings, 1 reply; 10+ messages in thread
From: Don Zickus @ 2012-12-19 20:13 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Norbert Warmuth,
	Joseph Salisbury, Thomas Gleixner

On Wed, Dec 19, 2012 at 08:51:31PM +0100, Bjørn Mork wrote:
> commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
> an oops or hard lockup when doing
> 
>  echo 0 > /proc/sys/kernel/nmi_watchdog
>  echo 1 > /proc/sys/kernel/nmi_watchdog
> 
> and the kernel is booted with nmi_watchdog=1 (default)
> 
> Running laptop-mode-tools and disconnecting/connecting AC power
> will cause this to trigger, making it a common failure scenario
> on laptops.
> 
> Instead of bailing out of watchdog_disable() when !watchdog_enabled
> we can initialize the hrtimer regardless of watchdog_enabled status.
> This makes it safe to call watchdog_disable() in the nmi_watchdog=0
> case, without the negative effect on the enabled => disabled =>
> enabled case.
> 
> All these tests pass with this patch:
> - nmi_watchdog=1
>   echo 0 > /proc/sys/kernel/nmi_watchdog
>   echo 1 > /proc/sys/kernel/nmi_watchdog
> 
> - nmi_watchdog=0
>   echo 0 > /sys/devices/system/cpu/cpu1/online
> 
> - nmi_watchdog=0
>   echo mem > /sys/power/state

What about the opposite cases?  
nmi_watchdog=1
echo 1 > /sys/devices/system/cpu/cpu1/online

Are we going to leak memory by re-initing hrtimer?  Or is that caught
somewhere?

Other than that, the patch seems reasonable to me.  Basically just forcing
the init of hrtimer so there is something to cancel on the disable side.
Then again, I don't fully understand the original problem.

Cheers,
Don

> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=51661
> 
> Cc: <stable@vger.kernel.org> # v3.7
> Cc: Norbert Warmuth <nwarmuth@t-online.de>
> Cc: Joseph Salisbury <joseph.salisbury@canonical.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Bjørn Mork <bjorn@mork.no>
> ---
> Hello Linus,
> 
> This post v3.7-rc8 regression kills any machine with laptop-mode-tools
> using default config, making it somewhat critical to get a fix into the
> v3.7.x stable series ASAP.  I was hoping for some response from Thomas
> or the reporters of the original bug, either verifying that my proposal
> is OK or providing a better fix.  But I believe that this cannot wait
> much longer.
> 
> Please apply.  Thanks,
> 
> Bjørn
> 
> 
> Patch history:
> 
> v3:
>   added Bugzilla reference and additional recipients
>   rebased on current mainline
> v2:
>   implemented an alternate workaround for the original problem.
> v1:
>   plain revert of 8d451690
> 
> 
>  kernel/watchdog.c |   11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 997c6a1..75a2ab3 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -344,6 +344,10 @@ static void watchdog_enable(unsigned int cpu)
>  {
>  	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
>  
> +	/* kick off the timer for the hardlockup detector */
> +	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> +	hrtimer->function = watchdog_timer_fn;
> +
>  	if (!watchdog_enabled) {
>  		kthread_park(current);
>  		return;
> @@ -352,10 +356,6 @@ static void watchdog_enable(unsigned int cpu)
>  	/* Enable the perf event */
>  	watchdog_nmi_enable(cpu);
>  
> -	/* kick off the timer for the hardlockup detector */
> -	hrtimer_init(hrtimer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> -	hrtimer->function = watchdog_timer_fn;
> -
>  	/* done here because hrtimer_start can only pin to smp_processor_id() */
>  	hrtimer_start(hrtimer, ns_to_ktime(sample_period),
>  		      HRTIMER_MODE_REL_PINNED);
> @@ -369,9 +369,6 @@ static void watchdog_disable(unsigned int cpu)
>  {
>  	struct hrtimer *hrtimer = &__raw_get_cpu_var(watchdog_hrtimer);
>  
> -	if (!watchdog_enabled)
> -		return;
> -
>  	watchdog_set_prio(SCHED_NORMAL, 0);
>  	hrtimer_cancel(hrtimer);
>  	/* disable the perf event */
> -- 
> 1.7.10.4
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RESEND][PATCH v3] watchdog: Fix disable/enable regression
  2012-12-19 20:13   ` Don Zickus
@ 2012-12-19 21:17     ` Bjørn Mork
  2012-12-19 21:44       ` Bjørn Mork
  0 siblings, 1 reply; 10+ messages in thread
From: Bjørn Mork @ 2012-12-19 21:17 UTC (permalink / raw)
  To: Don Zickus
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Norbert Warmuth,
	Joseph Salisbury, Thomas Gleixner

Don Zickus <dzickus@redhat.com> writes:
> On Wed, Dec 19, 2012 at 08:51:31PM +0100, Bjørn Mork wrote:
>> commit 8d451690 ("watchdog: Fix CPU hotplug regression") cause
>> an oops or hard lockup when doing
>> 
>>  echo 0 > /proc/sys/kernel/nmi_watchdog
>>  echo 1 > /proc/sys/kernel/nmi_watchdog
>> 
>> and the kernel is booted with nmi_watchdog=1 (default)
>> 
>> Running laptop-mode-tools and disconnecting/connecting AC power
>> will cause this to trigger, making it a common failure scenario
>> on laptops.
>> 
>> Instead of bailing out of watchdog_disable() when !watchdog_enabled
>> we can initialize the hrtimer regardless of watchdog_enabled status.
>> This makes it safe to call watchdog_disable() in the nmi_watchdog=0
>> case, without the negative effect on the enabled => disabled =>
>> enabled case.
>> 
>> All these tests pass with this patch:
>> - nmi_watchdog=1
>>   echo 0 > /proc/sys/kernel/nmi_watchdog
>>   echo 1 > /proc/sys/kernel/nmi_watchdog
>> 
>> - nmi_watchdog=0
>>   echo 0 > /sys/devices/system/cpu/cpu1/online
>> 
>> - nmi_watchdog=0
>>   echo mem > /sys/power/state
>
> What about the opposite cases?  
> nmi_watchdog=1
> echo 1 > /sys/devices/system/cpu/cpu1/online

I don't see why not. But verifying it would be nice.  I thought that it
would be a simple thing to test using qemu-kvm, but it seems that the
CPU hotplugging support there isn't quite ready. The guest just dies
with "Assertion `bus->allow_hotplug' failed."

I'll go digging for alternatives, but if anyone else could verify this
then I'd appreciate it.

> Are we going to leak memory by re-initing hrtimer?  Or is that caught
> somewhere?

There are no allocations in hrtimer_init() AFAICS? It just initializes
the preallocated hrtimer struct.

> Other than that, the patch seems reasonable to me.  Basically just forcing
> the init of hrtimer so there is something to cancel on the disable side.
> Then again, I don't fully understand the original problem.

I believe the original problem was calling hrtimer_cancel on an
uninitialized hrtimer.  That is likely to blow up here:

static inline
void unlock_hrtimer_base(const struct hrtimer *timer, unsigned long *flags)
{
        raw_spin_unlock_irqrestore(&timer->base->cpu_base->lock, *flags);
}


Note that the lock_hrtimer_base() function protects access to
timer->base with a NULL test.  That should probably be done in unlock as
well, for symmetry?  But this is another issue and not at all urgent.




Bjørn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RESEND][PATCH v3] watchdog: Fix disable/enable regression
  2012-12-19 21:17     ` Bjørn Mork
@ 2012-12-19 21:44       ` Bjørn Mork
  2012-12-20 19:17         ` Don Zickus
  0 siblings, 1 reply; 10+ messages in thread
From: Bjørn Mork @ 2012-12-19 21:44 UTC (permalink / raw)
  To: Don Zickus
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Norbert Warmuth,
	Joseph Salisbury, Thomas Gleixner

Bjørn Mork <bjorn@mork.no> writes:
> Don Zickus <dzickus@redhat.com> writes:
>
>> What about the opposite cases?  
>> nmi_watchdog=1
>> echo 1 > /sys/devices/system/cpu/cpu1/online
>
> I don't see why not. But verifying it would be nice.  I thought that it
> would be a simple thing to test using qemu-kvm, but it seems that the
> CPU hotplugging support there isn't quite ready. The guest just dies
> with "Assertion `bus->allow_hotplug' failed."
>
> I'll go digging for alternatives, but if anyone else could verify this
> then I'd appreciate it.

I just realized that you might not really want/need hotplugging, but
just booting with some cores initially offline?

That's easier to test at least:

qmitest:~# cat /proc/cmdline 
initrd=/initrd.img.test root=/dev/sda1 ro console=tty0 console=ttyS0,9600n8 maxcpus=1 BOOT_IMAGE=/vmlinuz.test 
qmitest:~#  grep . /sys/devices/system/cpu/o*line
/sys/devices/system/cpu/offline:1-3
/sys/devices/system/cpu/online:0
qmitest:~# grep . /sys/devices/system/cpu/cpu?/online 
/sys/devices/system/cpu/cpu1/online:0
/sys/devices/system/cpu/cpu2/online:0
/sys/devices/system/cpu/cpu3/online:0
qmitest:~# grep . /sys/devices/system/cpu/o*line
/sys/devices/system/cpu/offline:1-3
/sys/devices/system/cpu/online:0
qmitest:~# grep . /proc/sys/kernel/nmi_watchdog 
1


qmitest:~#  echo 1 > /sys/devices/system/cpu/cpu1/online
qmitest:~#  echo 1 > /sys/devices/system/cpu/cpu2/online
qmitest:~# grep . /sys/devices/system/cpu/cpu?/online 
/sys/devices/system/cpu/cpu1/online:1
/sys/devices/system/cpu/cpu2/online:1
/sys/devices/system/cpu/cpu3/online:0


No problem observed.  As expected.



Bjørn

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [RESEND][PATCH v3] watchdog: Fix disable/enable regression
  2012-12-19 21:44       ` Bjørn Mork
@ 2012-12-20 19:17         ` Don Zickus
  0 siblings, 0 replies; 10+ messages in thread
From: Don Zickus @ 2012-12-20 19:17 UTC (permalink / raw)
  To: Bjørn Mork
  Cc: Linus Torvalds, Andrew Morton, linux-kernel, Norbert Warmuth,
	Joseph Salisbury, Thomas Gleixner

On Wed, Dec 19, 2012 at 10:44:44PM +0100, Bjørn Mork wrote:
> Bjørn Mork <bjorn@mork.no> writes:
> > Don Zickus <dzickus@redhat.com> writes:
> >
> >> What about the opposite cases?  
> >> nmi_watchdog=1
> >> echo 1 > /sys/devices/system/cpu/cpu1/online
> >
> > I don't see why not. But verifying it would be nice.  I thought that it
> > would be a simple thing to test using qemu-kvm, but it seems that the
> > CPU hotplugging support there isn't quite ready. The guest just dies
> > with "Assertion `bus->allow_hotplug' failed."
> >
> > I'll go digging for alternatives, but if anyone else could verify this
> > then I'd appreciate it.
> 
> I just realized that you might not really want/need hotplugging, but
> just booting with some cores initially offline?

Actually, I did a poor job explaining myself but your previous email
answered my question.

The original problem was that hrtimer_cancel was being called on an
un-initialized object.  I was wondering what happens if watchdog_enable
was called multiple times, would it re-initialize a new object causing a
memory leak?

But as you pointed out that couldn't happen.

So I am fine with your changes.  Thanks for the initial and follow up
testing.

Acked-by: Don Zickus <dzickus@redhat.com>

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-12-20 19:18 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-14  9:33 Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Bjørn Mork
2012-12-14  9:34 ` [PATCH] Revert "watchdog: Fix CPU hotplug regression" Bjørn Mork
2012-12-14 13:44 ` [PATCH v2] watchdog: Fix disable/enable regression Bjørn Mork
2012-12-16 19:41 ` Bisected oops regression between v3.7-rc8 and v3.7: nmi_watchdog Maciej Rutecki
2012-12-18  8:13 ` [regression][PATCH v3] watchdog: Fix disable/enable regression Bjørn Mork
2012-12-19 19:51 ` [RESEND][PATCH " Bjørn Mork
2012-12-19 20:13   ` Don Zickus
2012-12-19 21:17     ` Bjørn Mork
2012-12-19 21:44       ` Bjørn Mork
2012-12-20 19:17         ` Don Zickus

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.