linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
@ 2008-08-23  9:48 Mikael Pettersson
  2008-08-23 16:01 ` [PATCH] rtc: fix deadlock Ingo Molnar
  2008-08-24  9:14 ` [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Vegard Nossum
  0 siblings, 2 replies; 15+ messages in thread
From: Mikael Pettersson @ 2008-08-23  9:48 UTC (permalink / raw)
  To: linux-kernel; +Cc: hpa, mingo, tglx

Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
from hpet_rtc_interrupt, usually during shutdown or reboot,
but occasionally also early in init. Today I finally managed
to capture one via a serial cable:

INIT: version 2.86 booting
		Welcome to Fedora Core
		Press 'I' to enter interactive startup.
BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
Modules linked in: ehci_hcd uhci_hcd usbcore

Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
EIP is at hpet_rtc_interrupt+0x2d2/0x310
EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000 
       00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000 
       00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400 
Call Trace:
 [<c0139cc0>] hrtimer_run_pending+0x20/0x90
 [<c01478a8>] handle_IRQ_event+0x28/0x50
 [<c0148f71>] handle_edge_irq+0xa1/0x120
 [<c010615b>] do_IRQ+0x3b/0x70
 [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
 [<c0103c4f>] common_interrupt+0x23/0x28
 [<c02c0000>] unix_release_sock+0xc0/0x220
 =======================
Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26 

This points to the following loop in hpet_rtc_interrupt:

0xc0117090 <hpet_rtc_interrupt+720>:    pause  
0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>

Note: 0xc035ba80 == &jiffies

This loop originates from asm-generic/rtc.h:get_rtc_time()

		while (jiffies - uip_watchdog < 2*HZ/100) {
			barrier();
			cpu_relax();
		}

Note: HZ == CONFIG_HZ == 100

The bug may not originate from the 2.6.27-rc series as I only recently
enabled HPET in this machine's kernels (not due to HPET problems, it
inherited its .config way back from an older machine w/o HPET).

/Mikael

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH] rtc: fix deadlock
  2008-08-23  9:48 [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Mikael Pettersson
@ 2008-08-23 16:01 ` Ingo Molnar
  2008-08-23 16:58   ` Mikael Pettersson
  2008-08-29 11:48   ` [PATCH] rtc: fix deadlock: fixes regression since 2.6.24 Frank van Maarseveen
  2008-08-24  9:14 ` [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Vegard Nossum
  1 sibling, 2 replies; 15+ messages in thread
From: Ingo Molnar @ 2008-08-23 16:01 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: linux-kernel, hpa, mingo, tglx, Andrew Morton


* Mikael Pettersson <mikpe@it.uu.se> wrote:

> Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
> from hpet_rtc_interrupt, usually during shutdown or reboot,
> but occasionally also early in init. Today I finally managed
> to capture one via a serial cable:
> 
> INIT: version 2.86 booting
> 		Welcome to Fedora Core
> 		Press 'I' to enter interactive startup.
> BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
> Modules linked in: ehci_hcd uhci_hcd usbcore
> 
> Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
> EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
> EIP is at hpet_rtc_interrupt+0x2d2/0x310
> EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
> ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
> Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000 
>        00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000 
>        00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400 
> Call Trace:
>  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
>  [<c01478a8>] handle_IRQ_event+0x28/0x50
>  [<c0148f71>] handle_edge_irq+0xa1/0x120
>  [<c010615b>] do_IRQ+0x3b/0x70
>  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
>  [<c0103c4f>] common_interrupt+0x23/0x28
>  [<c02c0000>] unix_release_sock+0xc0/0x220
>  =======================
> Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26 
> 
> This points to the following loop in hpet_rtc_interrupt:
> 
> 0xc0117090 <hpet_rtc_interrupt+720>:    pause  
> 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
> 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
> 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
> 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
> 
> Note: 0xc035ba80 == &jiffies
> 
> This loop originates from asm-generic/rtc.h:get_rtc_time()
> 
> 		while (jiffies - uip_watchdog < 2*HZ/100) {
> 			barrier();
> 			cpu_relax();
> 		}
> 
> Note: HZ == CONFIG_HZ == 100
> 
> The bug may not originate from the 2.6.27-rc series as I only recently 
> enabled HPET in this machine's kernels (not due to HPET problems, it 
> inherited its .config way back from an older machine w/o HPET).

argh, that loop in asm-generic/rtc.h:get_rtc_time looks extremely 
fragile, we'll lock up if it's ever called with hardirqs off!

Does the patch below do the trick?

	Ingo

----------------->
>From 2273cc870b52a7ed09eb225142a6db97299e4f39 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Sat, 23 Aug 2008 17:59:07 +0200
Subject: [PATCH] rtc: fix deadlock

if get_rtc_time() is _ever_ called with IRQs off, we deadlock badly
in it, waiting for jiffies to increment.

So make the code more robust by doing an explicit mdelay(20).

This solves a very hard to reproduce/debug hard lockup reported
by Mikael Pettersson.

Reported-by: Mikael Pettersson <mikpe@it.uu.se>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-generic/rtc.h |   12 ++++--------
 1 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/include/asm-generic/rtc.h b/include/asm-generic/rtc.h
index be4af00..71ef3f0 100644
--- a/include/asm-generic/rtc.h
+++ b/include/asm-generic/rtc.h
@@ -15,6 +15,7 @@
 #include <linux/mc146818rtc.h>
 #include <linux/rtc.h>
 #include <linux/bcd.h>
+#include <linux/delay.h>
 
 #define RTC_PIE 0x40		/* periodic interrupt enable */
 #define RTC_AIE 0x20		/* alarm interrupt enable */
@@ -43,7 +44,6 @@ static inline unsigned char rtc_is_updating(void)
 
 static inline unsigned int get_rtc_time(struct rtc_time *time)
 {
-	unsigned long uip_watchdog = jiffies;
 	unsigned char ctrl;
 	unsigned long flags;
 
@@ -53,19 +53,15 @@ static inline unsigned int get_rtc_time(struct rtc_time *time)
 
 	/*
 	 * read RTC once any update in progress is done. The update
-	 * can take just over 2ms. We wait 10 to 20ms. There is no need to
+	 * can take just over 2ms. We wait 20ms. There is no need to
 	 * to poll-wait (up to 1s - eeccch) for the falling edge of RTC_UIP.
 	 * If you need to know *exactly* when a second has started, enable
 	 * periodic update complete interrupts, (via ioctl) and then 
 	 * immediately read /dev/rtc which will block until you get the IRQ.
 	 * Once the read clears, read the RTC time (again via ioctl). Easy.
 	 */
-
-	if (rtc_is_updating() != 0)
-		while (jiffies - uip_watchdog < 2*HZ/100) {
-			barrier();
-			cpu_relax();
-		}
+	if (rtc_is_updating())
+		mdelay(20);
 
 	/*
 	 * Only the values that we read from the RTC are set. We leave

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] rtc: fix deadlock
  2008-08-23 16:01 ` [PATCH] rtc: fix deadlock Ingo Molnar
@ 2008-08-23 16:58   ` Mikael Pettersson
  2008-08-23 17:11     ` Ingo Molnar
  2008-08-29 11:48   ` [PATCH] rtc: fix deadlock: fixes regression since 2.6.24 Frank van Maarseveen
  1 sibling, 1 reply; 15+ messages in thread
From: Mikael Pettersson @ 2008-08-23 16:58 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mikael Pettersson, linux-kernel, hpa, mingo, tglx, Andrew Morton

Ingo Molnar writes:
 > 
 > * Mikael Pettersson <mikpe@it.uu.se> wrote:
 > 
 > > Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
 > > from hpet_rtc_interrupt, usually during shutdown or reboot,
 > > but occasionally also early in init. Today I finally managed
 > > to capture one via a serial cable:
 > > 
 > > INIT: version 2.86 booting
 > > 		Welcome to Fedora Core
 > > 		Press 'I' to enter interactive startup.
 > > BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
 > > Modules linked in: ehci_hcd uhci_hcd usbcore
 > > 
 > > Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
 > > EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
 > > EIP is at hpet_rtc_interrupt+0x2d2/0x310
 > > EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
 > > ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
 > >  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 > > Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
 > > Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000 
 > >        00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000 
 > >        00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400 
 > > Call Trace:
 > >  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
 > >  [<c01478a8>] handle_IRQ_event+0x28/0x50
 > >  [<c0148f71>] handle_edge_irq+0xa1/0x120
 > >  [<c010615b>] do_IRQ+0x3b/0x70
 > >  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
 > >  [<c0103c4f>] common_interrupt+0x23/0x28
 > >  [<c02c0000>] unix_release_sock+0xc0/0x220
 > >  =======================
 > > Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26 
 > > 
 > > This points to the following loop in hpet_rtc_interrupt:
 > > 
 > > 0xc0117090 <hpet_rtc_interrupt+720>:    pause  
 > > 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
 > > 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
 > > 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
 > > 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
 > > 
 > > Note: 0xc035ba80 == &jiffies
 > > 
 > > This loop originates from asm-generic/rtc.h:get_rtc_time()
 > > 
 > > 		while (jiffies - uip_watchdog < 2*HZ/100) {
 > > 			barrier();
 > > 			cpu_relax();
 > > 		}
 > > 
 > > Note: HZ == CONFIG_HZ == 100
 > > 
 > > The bug may not originate from the 2.6.27-rc series as I only recently 
 > > enabled HPET in this machine's kernels (not due to HPET problems, it 
 > > inherited its .config way back from an older machine w/o HPET).
 > 
 > argh, that loop in asm-generic/rtc.h:get_rtc_time looks extremely 
 > fragile, we'll lock up if it's ever called with hardirqs off!
 > 
 > Does the patch below do the trick?

Thanks for the patch, I'll give it a try ASAP.

The sporadic nature of the bug means that it will probably take a
couple of days of testing and dozens of reboots w/o problems before
I'm confident to say that the problem's been fixed.

/Mikael

 > 
 > 	Ingo
 > 
 > ----------------->
 > >From 2273cc870b52a7ed09eb225142a6db97299e4f39 Mon Sep 17 00:00:00 2001
 > From: Ingo Molnar <mingo@elte.hu>
 > Date: Sat, 23 Aug 2008 17:59:07 +0200
 > Subject: [PATCH] rtc: fix deadlock
 > 
 > if get_rtc_time() is _ever_ called with IRQs off, we deadlock badly
 > in it, waiting for jiffies to increment.
 > 
 > So make the code more robust by doing an explicit mdelay(20).
 > 
 > This solves a very hard to reproduce/debug hard lockup reported
 > by Mikael Pettersson.
 > 
 > Reported-by: Mikael Pettersson <mikpe@it.uu.se>
 > Signed-off-by: Ingo Molnar <mingo@elte.hu>
 > ---
 >  include/asm-generic/rtc.h |   12 ++++--------
 >  1 files changed, 4 insertions(+), 8 deletions(-)
 > 
 > diff --git a/include/asm-generic/rtc.h b/include/asm-generic/rtc.h
 > index be4af00..71ef3f0 100644
 > --- a/include/asm-generic/rtc.h
 > +++ b/include/asm-generic/rtc.h
 > @@ -15,6 +15,7 @@
 >  #include <linux/mc146818rtc.h>
 >  #include <linux/rtc.h>
 >  #include <linux/bcd.h>
 > +#include <linux/delay.h>
 >  
 >  #define RTC_PIE 0x40		/* periodic interrupt enable */
 >  #define RTC_AIE 0x20		/* alarm interrupt enable */
 > @@ -43,7 +44,6 @@ static inline unsigned char rtc_is_updating(void)
 >  
 >  static inline unsigned int get_rtc_time(struct rtc_time *time)
 >  {
 > -	unsigned long uip_watchdog = jiffies;
 >  	unsigned char ctrl;
 >  	unsigned long flags;
 >  
 > @@ -53,19 +53,15 @@ static inline unsigned int get_rtc_time(struct rtc_time *time)
 >  
 >  	/*
 >  	 * read RTC once any update in progress is done. The update
 > -	 * can take just over 2ms. We wait 10 to 20ms. There is no need to
 > +	 * can take just over 2ms. We wait 20ms. There is no need to
 >  	 * to poll-wait (up to 1s - eeccch) for the falling edge of RTC_UIP.
 >  	 * If you need to know *exactly* when a second has started, enable
 >  	 * periodic update complete interrupts, (via ioctl) and then 
 >  	 * immediately read /dev/rtc which will block until you get the IRQ.
 >  	 * Once the read clears, read the RTC time (again via ioctl). Easy.
 >  	 */
 > -
 > -	if (rtc_is_updating() != 0)
 > -		while (jiffies - uip_watchdog < 2*HZ/100) {
 > -			barrier();
 > -			cpu_relax();
 > -		}
 > +	if (rtc_is_updating())
 > +		mdelay(20);
 >  
 >  	/*
 >  	 * Only the values that we read from the RTC are set. We leave
 > 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] rtc: fix deadlock
  2008-08-23 16:58   ` Mikael Pettersson
@ 2008-08-23 17:11     ` Ingo Molnar
  2008-08-23 18:45       ` Maciej W. Rozycki
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2008-08-23 17:11 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: linux-kernel, hpa, mingo, tglx, Andrew Morton


* Mikael Pettersson <mikpe@it.uu.se> wrote:

> Thanks for the patch, I'll give it a try ASAP.
> 
> The sporadic nature of the bug means that it will probably take a 
> couple of days of testing and dozens of reboots w/o problems before 
> I'm confident to say that the problem's been fixed.

btw., i've seen that lockup too on a testbox - but havent been able to 
track it down yet. You went further and presented the bug on a plate by 
tracking it down to the loop in get_rtc_time() - so now we can both test 
the fix and see whether the lockups are gone.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] rtc: fix deadlock
  2008-08-23 17:11     ` Ingo Molnar
@ 2008-08-23 18:45       ` Maciej W. Rozycki
  2008-08-23 19:46         ` Mikael Pettersson
  0 siblings, 1 reply; 15+ messages in thread
From: Maciej W. Rozycki @ 2008-08-23 18:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mikael Pettersson, linux-kernel, hpa, mingo, tglx, Andrew Morton

On Sat, 23 Aug 2008, Ingo Molnar wrote:

> > The sporadic nature of the bug means that it will probably take a 
> > couple of days of testing and dozens of reboots w/o problems before 
> > I'm confident to say that the problem's been fixed.
> 
> btw., i've seen that lockup too on a testbox - but havent been able to 
> track it down yet. You went further and presented the bug on a plate by 
> tracking it down to the loop in get_rtc_time() - so now we can both test 
> the fix and see whether the lockups are gone.

 Hmm, waiting for 20ms with interrupts disabled sounds like a bad idea --
a lot of system timer ticks may be lost and some devices may not be happy
about such a delay either (serial?).  I think your proposal only papers
over the actual bug and it should simply be forbidden to call this
function with interrupts disabled.

 I can see this is a part of the legacy RTC interface, so except from the
long delay the function itself may have a chance to work with interrupts
disabled.  But with the modern RTC library this is no longer true as some
RTC chips it supports can only be accessed with interrupts enabled,
because their accessor functions may sleep.  This is for example common
with RTC chips wired through I2C.

 Therefore I think to avoid propagating the design problem with places
calling get_rtc_time() when they eventually switch to the RTC library it
would be better to put:

BUG_ON(irqs_disabled());

into get_rtc_time() now and fix the callers which happen to trigger it.

  Maciej

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] rtc: fix deadlock
  2008-08-23 18:45       ` Maciej W. Rozycki
@ 2008-08-23 19:46         ` Mikael Pettersson
  0 siblings, 0 replies; 15+ messages in thread
From: Mikael Pettersson @ 2008-08-23 19:46 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Ingo Molnar, Mikael Pettersson, linux-kernel, hpa, mingo, tglx,
	Andrew Morton

Maciej W. Rozycki writes:
 > On Sat, 23 Aug 2008, Ingo Molnar wrote:
 > 
 > > > The sporadic nature of the bug means that it will probably take a 
 > > > couple of days of testing and dozens of reboots w/o problems before 
 > > > I'm confident to say that the problem's been fixed.
 > > 
 > > btw., i've seen that lockup too on a testbox - but havent been able to 
 > > track it down yet. You went further and presented the bug on a plate by 
 > > tracking it down to the loop in get_rtc_time() - so now we can both test 
 > > the fix and see whether the lockups are gone.
 > 
 >  Hmm, waiting for 20ms with interrupts disabled sounds like a bad idea --
 > a lot of system timer ticks may be lost and some devices may not be happy
 > about such a delay either (serial?).  I think your proposal only papers
 > over the actual bug and it should simply be forbidden to call this
 > function with interrupts disabled.
 > 
 >  I can see this is a part of the legacy RTC interface, so except from the
 > long delay the function itself may have a chance to work with interrupts
 > disabled.  But with the modern RTC library this is no longer true as some
 > RTC chips it supports can only be accessed with interrupts enabled,
 > because their accessor functions may sleep.  This is for example common
 > with RTC chips wired through I2C.
 > 
 >  Therefore I think to avoid propagating the design problem with places
 > calling get_rtc_time() when they eventually switch to the RTC library it
 > would be better to put:
 > 
 > BUG_ON(irqs_disabled());
 > 
 > into get_rtc_time() now and fix the callers which happen to trigger it.

On my C2D, I've only seen the lockup happen when user-space is
using hwclock to init system time (early init) or to flush system
time to the hw (end of shutdown/reboot). And the lockup is always
in hpet_rtc_interrupt. I've never ever seen the lockup while the
system is up and running normally.

So I suspect a bad interaction between the HPET RTC stuff and
the RTC user-space interface.

/Mikael

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
  2008-08-23  9:48 [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Mikael Pettersson
  2008-08-23 16:01 ` [PATCH] rtc: fix deadlock Ingo Molnar
@ 2008-08-24  9:14 ` Vegard Nossum
  2008-08-24 10:32   ` Mikael Pettersson
  1 sibling, 1 reply; 15+ messages in thread
From: Vegard Nossum @ 2008-08-24  9:14 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: linux-kernel, hpa, mingo, tglx

Hi,

On Sat, Aug 23, 2008 at 11:48 AM, Mikael Pettersson <mikpe@it.uu.se> wrote:
> Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
> from hpet_rtc_interrupt, usually during shutdown or reboot,
> but occasionally also early in init. Today I finally managed
> to capture one via a serial cable:
>
> INIT: version 2.86 booting
>                Welcome to Fedora Core
>                Press 'I' to enter interactive startup.
> BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
> Modules linked in: ehci_hcd uhci_hcd usbcore
>
> Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
> EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
> EIP is at hpet_rtc_interrupt+0x2d2/0x310
> EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
> ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
> Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000
>       00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000
>       00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400
> Call Trace:
>  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
>  [<c01478a8>] handle_IRQ_event+0x28/0x50
>  [<c0148f71>] handle_edge_irq+0xa1/0x120
>  [<c010615b>] do_IRQ+0x3b/0x70
>  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
>  [<c0103c4f>] common_interrupt+0x23/0x28
>  [<c02c0000>] unix_release_sock+0xc0/0x220
>  =======================
> Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26
>
> This points to the following loop in hpet_rtc_interrupt:
>
> 0xc0117090 <hpet_rtc_interrupt+720>:    pause
> 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
> 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
> 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
> 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
>
> Note: 0xc035ba80 == &jiffies
>
> This loop originates from asm-generic/rtc.h:get_rtc_time()
>
>                while (jiffies - uip_watchdog < 2*HZ/100) {
>                        barrier();
>                        cpu_relax();
>                }
>
> Note: HZ == CONFIG_HZ == 100
>
> The bug may not originate from the 2.6.27-rc series as I only recently
> enabled HPET in this machine's kernels (not due to HPET problems, it
> inherited its .config way back from an older machine w/o HPET).

I also just got this during shutdown:

Syncing hardware clock to system time BUG: NMI Watchdog detected
LOCKUP on CPU0, ip c011d922, registers:
Pid: 4181, comm: hwclock Not tainted (2.6.27-rc3-00464-g1fca254-dirty #42)
EIP: 0060:[<c011d922>] EFLAGS: 00200097 CPU: 0
EIP is at hpet_rtc_interrupt+0x282/0x2e0
EAX: 00000000 EBX: 00200096 ECX: f3990000 EDX: 00010000
ESI: 000000a6 EDI: 0004f806 EBP: f3991edc ESP: f3991e98
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process hwclock (pid: 4181, ti=f3990000 task=f359d340 task.ti=f3990000)
Stack: f359d340 c08621c0 00000000 f359d340 00001d12 00000000 00000000 00000000
       00000000 00000000 00000000 00000000 00000000 00000000 f6b4f788 00000000
       00000008 f3991ef4 c017ac08 00000000 c0862180 f6b4f788 00000008 f3991f0c
Call Trace:
 [<c017ac08>] ? handle_IRQ_event+0x28/0x70
 [<c017c1cf>] ? handle_edge_irq+0xaf/0x140
 [<c0107138>] ? do_IRQ+0x48/0xa0
 [<c036e254>] ? trace_hardirqs_off_thunk+0xc/0x18
 [<c0104a3c>] ? common_interrupt+0x28/0x30
 [<c03c007b>] ? tty_get_baud_rate+0x3b/0x60
 [<c036e641>] ? copy_from_user+0x1/0x80
 [<c01bba1f>] ? sys_select+0x5f/0x190
 [<c01ba157>] ? do_vfs_ioctl+0x57/0x2b0
 [<c036e244>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c015bc64>] ? trace_hardirqs_on_caller+0xd4/0x160
 [<c0103f3b>] ? sysenter_do_call+0x12/0x3f
 =======================
Code: 65 10 25 00 89 45 d8 0f b6 45 cc e8 59 10 25 00 89 45 dc 0f b6
45 c8 e8 4d 10 25 00 83 e8 01 89 45 e0 e9 04 fe f
f ff 66 90 f3 90 <a1> 00 1b 86 c0 29 f8 83 f8 13 76 f2 e9 01 ff ff ff
83 c0 64 89
------------[ cut here ]------------
WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/mutex.c:351
mutex_trylock+0x123/0x160()
Pid: 4181, comm: hwclock Not tainted 2.6.27-rc3-00464-g1fca254-dirty #42
 [<c0137ddf>] warn_on_slowpath+0x4f/0x80
 [<c03c856d>] ? vt_console_print+0x1dd/0x2a0
 [<c01589cb>] ? trace_hardirqs_off+0xb/0x10
 [<c01500fb>] ? up+0x2b/0x40
 [<c066fc85>] ? _spin_lock_irqsave+0x85/0xa0
 [<c01383c1>] ? release_console_sem+0x1c1/0x1f0
 [<c01589cb>] ? trace_hardirqs_off+0xb/0x10
 [<c0670093>] ? _spin_unlock_irqrestore+0x43/0x70
 [<c01383d5>] ? release_console_sem+0x1d5/0x1f0
 [<c0138763>] ? vprintk+0x163/0x3c0
 [<c01051f5>] ? print_trace_address+0x45/0x50
 [<c0103f3b>] ? sysenter_do_call+0x12/0x3f
 [<c066da53>] mutex_trylock+0x123/0x160
 [<c0670093>] ? _spin_unlock_irqrestore+0x43/0x70
 [<c01694cb>] crash_kexec+0x1b/0xc0
 [<c038bffe>] ? vgacon_cursor+0x16e/0x1d0
 [<c03c8083>] ? set_cursor+0x53/0x70
 [<c03ca20b>] ? do_unblank_screen+0xbb/0x130
 [<c0105dd9>] ? die_nmi+0xb9/0x100
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c0105e16>] die_nmi+0xf6/0x100
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c0118bc5>] nmi_watchdog_tick+0x1d5/0x1e0
 [<c0106227>] do_nmi+0x97/0x2d0
 [<c06704f3>] nmi_stack_correct+0x26/0x2b
 [<c015007b>] ? srcu_read_lock+0x3b/0x50
 [<c06700d8>] ? _spin_unlock_irq+0x18/0x60
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c017ac08>] handle_IRQ_event+0x28/0x70
 [<c017c1cf>] handle_edge_irq+0xaf/0x140
 [<c0107138>] do_IRQ+0x48/0xa0
 [<c036e254>] ? trace_hardirqs_off_thunk+0xc/0x18
 [<c0104a3c>] common_interrupt+0x28/0x30
 [<c03c007b>] ? tty_get_baud_rate+0x3b/0x60
 [<c036e641>] ? copy_from_user+0x1/0x80
 [<c01bba1f>] ? sys_select+0x5f/0x190
 [<c01ba157>] ? do_vfs_ioctl+0x57/0x2b0
 [<c036e244>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c015bc64>] ? trace_hardirqs_on_caller+0xd4/0x160
 [<c0103f3b>] sysenter_do_call+0x12/0x3f
 =======================
---[ end trace de495b709f3b3b84 ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!
------------[ cut here ]------------
WARNING: at /uio/arkimedes/s29/vegardno/git-working/linux-2.6/kernel/smp.c:332
smp_call_function_mask+0x1b1/0x1c0()
Pid: 4181, comm: hwclock Tainted: G        W 2.6.27-rc3-00464-g1fca254-dirty #42
 [<c0137ddf>] warn_on_slowpath+0x4f/0x80
 [<c038be14>] ? vgacon_set_cursor_size+0xa4/0x120
 [<c01589cb>] ? trace_hardirqs_off+0xb/0x10
 [<c017dd2b>] ? __rcu_read_unlock+0x9b/0xc0
 [<c01506ec>] ? __atomic_notifier_call_chain+0x3c/0x50
 [<c03c856d>] ? vt_console_print+0x1dd/0x2a0
 [<c015071a>] ? atomic_notifier_call_chain+0x1a/0x20
 [<c037d056>] ? _raw_spin_unlock+0x46/0x80
 [<c066ff87>] ? _spin_unlock+0x27/0x50
 [<c03c856d>] ? vt_console_print+0x1dd/0x2a0
 [<c01589cb>] ? trace_hardirqs_off+0xb/0x10
 [<c01500fb>] ? up+0x2b/0x40
 [<c066fc85>] ? _spin_lock_irqsave+0x85/0xa0
 [<c01383c1>] ? release_console_sem+0x1c1/0x1f0
 [<c01589cb>] ? trace_hardirqs_off+0xb/0x10
 [<c0163371>] smp_call_function_mask+0x1b1/0x1c0
 [<c01383d5>] ? release_console_sem+0x1d5/0x1f0
 [<c01170e0>] ? stop_this_cpu+0x0/0x50
 [<c066df98>] ? mutex_unlock+0x8/0x10
 [<c01589cb>] ? trace_hardirqs_off+0xb/0x10
 [<c066ded4>] ? __mutex_unlock_slowpath+0xa4/0x160
 [<c066df98>] ? mutex_unlock+0x8/0x10
 [<c016951d>] ? crash_kexec+0x6d/0xc0
 [<c01051f5>] ? print_trace_address+0x45/0x50
 [<c0103f3b>] ? sysenter_do_call+0x12/0x3f
 [<c01170e0>] ? stop_this_cpu+0x0/0x50
 [<c01633b0>] smp_call_function+0x30/0x60
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c011718e>] native_smp_send_stop+0x1e/0x70
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c0137ccf>] panic+0x5f/0x120
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c013b282>] do_exit+0x7e2/0x880
 [<c03c8083>] ? set_cursor+0x53/0x70
 [<c03ca20b>] ? do_unblank_screen+0xbb/0x130
 [<c0105dd9>] ? die_nmi+0xb9/0x100
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c0105dff>] die_nmi+0xdf/0x100
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c0118bc5>] nmi_watchdog_tick+0x1d5/0x1e0
 [<c0106227>] do_nmi+0x97/0x2d0
 [<c06704f3>] nmi_stack_correct+0x26/0x2b
 [<c015007b>] ? srcu_read_lock+0x3b/0x50
 [<c06700d8>] ? _spin_unlock_irq+0x18/0x60
 [<c011d922>] ? hpet_rtc_interrupt+0x282/0x2e0
 [<c017ac08>] handle_IRQ_event+0x28/0x70
 [<c017c1cf>] handle_edge_irq+0xaf/0x140
 [<c0107138>] do_IRQ+0x48/0xa0
 [<c036e254>] ? trace_hardirqs_off_thunk+0xc/0x18
 [<c0104a3c>] common_interrupt+0x28/0x30
 [<c03c007b>] ? tty_get_baud_rate+0x3b/0x60
 [<c036e641>] ? copy_from_user+0x1/0x80
 [<c01bba1f>] ? sys_select+0x5f/0x190
 [<c01ba157>] ? do_vfs_ioctl+0x57/0x2b0
 [<c036e244>] ? trace_hardirqs_on_thunk+0xc/0x10
 [<c015bc64>] ? trace_hardirqs_on_caller+0xd4/0x160
 [<c0103f3b>] sysenter_do_call+0x12/0x3f
 =======================
---[ end trace de495b709f3b3b84 ]---


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
  2008-08-24  9:14 ` [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Vegard Nossum
@ 2008-08-24 10:32   ` Mikael Pettersson
  2008-08-24 11:48     ` Vegard Nossum
  0 siblings, 1 reply; 15+ messages in thread
From: Mikael Pettersson @ 2008-08-24 10:32 UTC (permalink / raw)
  To: Vegard Nossum; +Cc: Mikael Pettersson, linux-kernel, hpa, mingo, tglx

Vegard Nossum writes:
 > Hi,
 > 
 > On Sat, Aug 23, 2008 at 11:48 AM, Mikael Pettersson <mikpe@it.uu.se> wrote:
 > > Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
 > > from hpet_rtc_interrupt, usually during shutdown or reboot,
 > > but occasionally also early in init. Today I finally managed
 > > to capture one via a serial cable:
 > >
 > > INIT: version 2.86 booting
 > >                Welcome to Fedora Core
 > >                Press 'I' to enter interactive startup.
 > > BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
 > > Modules linked in: ehci_hcd uhci_hcd usbcore
 > >
 > > Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
 > > EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
 > > EIP is at hpet_rtc_interrupt+0x2d2/0x310
 > > EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
 > > ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
 > >  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 > > Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
 > > Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000
 > >       00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000
 > >       00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400
 > > Call Trace:
 > >  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
 > >  [<c01478a8>] handle_IRQ_event+0x28/0x50
 > >  [<c0148f71>] handle_edge_irq+0xa1/0x120
 > >  [<c010615b>] do_IRQ+0x3b/0x70
 > >  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
 > >  [<c0103c4f>] common_interrupt+0x23/0x28
 > >  [<c02c0000>] unix_release_sock+0xc0/0x220
 > >  =======================
 > > Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26
 > >
 > > This points to the following loop in hpet_rtc_interrupt:
 > >
 > > 0xc0117090 <hpet_rtc_interrupt+720>:    pause
 > > 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
 > > 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
 > > 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
 > > 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
 > >
 > > Note: 0xc035ba80 == &jiffies
 > >
 > > This loop originates from asm-generic/rtc.h:get_rtc_time()
 > >
 > >                while (jiffies - uip_watchdog < 2*HZ/100) {
 > >                        barrier();
 > >                        cpu_relax();
 > >                }
 > >
 > > Note: HZ == CONFIG_HZ == 100
 > >
 > > The bug may not originate from the 2.6.27-rc series as I only recently
 > > enabled HPET in this machine's kernels (not due to HPET problems, it
 > > inherited its .config way back from an older machine w/o HPET).
 > 
 > I also just got this during shutdown:
 > 
 > Syncing hardware clock to system time BUG: NMI Watchdog detected
 > LOCKUP on CPU0, ip c011d922, registers:
 > Pid: 4181, comm: hwclock Not tainted (2.6.27-rc3-00464-g1fca254-dirty #42)
 > EIP: 0060:[<c011d922>] EFLAGS: 00200097 CPU: 0
 > EIP is at hpet_rtc_interrupt+0x282/0x2e0
 > EAX: 00000000 EBX: 00200096 ECX: f3990000 EDX: 00010000
 > ESI: 000000a6 EDI: 0004f806 EBP: f3991edc ESP: f3991e98
 >  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
 > Process hwclock (pid: 4181, ti=f3990000 task=f359d340 task.ti=f3990000)
 > Stack: f359d340 c08621c0 00000000 f359d340 00001d12 00000000 00000000 00000000
 >        00000000 00000000 00000000 00000000 00000000 00000000 f6b4f788 00000000
 >        00000008 f3991ef4 c017ac08 00000000 c0862180 f6b4f788 00000008 f3991f0c
 > Call Trace:
 >  [<c017ac08>] ? handle_IRQ_event+0x28/0x70
 >  [<c017c1cf>] ? handle_edge_irq+0xaf/0x140
 >  [<c0107138>] ? do_IRQ+0x48/0xa0
 >  [<c036e254>] ? trace_hardirqs_off_thunk+0xc/0x18
 >  [<c0104a3c>] ? common_interrupt+0x28/0x30
 >  [<c03c007b>] ? tty_get_baud_rate+0x3b/0x60
 >  [<c036e641>] ? copy_from_user+0x1/0x80
 >  [<c01bba1f>] ? sys_select+0x5f/0x190
 >  [<c01ba157>] ? do_vfs_ioctl+0x57/0x2b0
 >  [<c036e244>] ? trace_hardirqs_on_thunk+0xc/0x10
 >  [<c015bc64>] ? trace_hardirqs_on_caller+0xd4/0x160
 >  [<c0103f3b>] ? sysenter_do_call+0x12/0x3f
 >  =======================
 > Code: 65 10 25 00 89 45 d8 0f b6 45 cc e8 59 10 25 00 89 45 dc 0f b6
 > 45 c8 e8 4d 10 25 00 83 e8 01 89 45 e0 e9 04 fe f
 > f ff 66 90 f3 90 <a1> 00 1b 86 c0 29 f8 83 f8 13 76 f2 e9 01 ff ff ff
 > 83 c0 64 89

See my reply in the thread following Ingo's patch
<http://lkml.org/lkml/2008/8/23/158>. I've only seen the lockup
while hwclock was setting or flushing the system time, so I suspect
broken interaction between the hpet rtc emulation and the rtc
user-space interface.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
  2008-08-24 10:32   ` Mikael Pettersson
@ 2008-08-24 11:48     ` Vegard Nossum
  2008-08-26 10:25       ` Alan Jenkins
  0 siblings, 1 reply; 15+ messages in thread
From: Vegard Nossum @ 2008-08-24 11:48 UTC (permalink / raw)
  To: Mikael Pettersson; +Cc: linux-kernel, hpa, mingo, tglx

On Sun, Aug 24, 2008 at 12:32 PM, Mikael Pettersson <mikpe@it.uu.se> wrote:
>  > I also just got this during shutdown:
>  >
>  > Syncing hardware clock to system time BUG: NMI Watchdog detected
>  > LOCKUP on CPU0, ip c011d922, registers:
>  > Pid: 4181, comm: hwclock Not tainted (2.6.27-rc3-00464-g1fca254-dirty #42)
[...]

> See my reply in the thread following Ingo's patch
> <http://lkml.org/lkml/2008/8/23/158>. I've only seen the lockup
> while hwclock was setting or flushing the system time, so I suspect
> broken interaction between the hpet rtc emulation and the rtc
> user-space interface.

Oops, I picked the wrong thread :-)

Anyway, I guess it wouldn't be too hard to make a test cast by running
"hwclock" (or the corresponding initscript) in a tight loop? (Sorry, I
am not able to test this myself at the moment). That would make it
easier to test the patch too, I guess. Maybe something like:

    $ while true; do /etc/init.d/ntpdate up; /etc/init.d/ntpdate down; done

(At least "ntpdate" is the one that runs hwclock for me on F9.)

I will try this later unless you beat me to it :-)

Thanks,


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
  2008-08-24 11:48     ` Vegard Nossum
@ 2008-08-26 10:25       ` Alan Jenkins
  2008-08-26 10:39         ` Ingo Molnar
  0 siblings, 1 reply; 15+ messages in thread
From: Alan Jenkins @ 2008-08-26 10:25 UTC (permalink / raw)
  To: Vegard Nossum; +Cc: Mikael Pettersson, hpa, mingo, tglx, linux-kernel

Vegard Nossum wrote:
> On Sun, Aug 24, 2008 at 12:32 PM, Mikael Pettersson <mikpe@it.uu.se> wrote:
>   
>>  > I also just got this during shutdown:
>>  >
>>  > Syncing hardware clock to system time BUG: NMI Watchdog detected
>>  > LOCKUP on CPU0, ip c011d922, registers:
>>  > Pid: 4181, comm: hwclock Not tainted (2.6.27-rc3-00464-g1fca254-dirty #42)
>>     
> [...]
>
>   
>> See my reply in the thread following Ingo's patch
>> <http://lkml.org/lkml/2008/8/23/158>. I've only seen the lockup
>> while hwclock was setting or flushing the system time, so I suspect
>> broken interaction between the hpet rtc emulation and the rtc
>> user-space interface.
>>     
>
> Oops, I picked the wrong thread :-)
>
> Anyway, I guess it wouldn't be too hard to make a test cast by running
> "hwclock" (or the corresponding initscript) in a tight loop? (Sorry, I
> am not able to test this myself at the moment). That would make it
> easier to test the patch too, I guess. Maybe something like:
>
>     $ while true; do /etc/init.d/ntpdate up; /etc/init.d/ntpdate down; done
>
> (At least "ntpdate" is the one that runs hwclock for me on F9.)
>
> I will try this later unless you beat me to it :-)
>
>   
I beat you :-).  I noticed the problem myself.  It does trigger if I run
hwclock in a loop.

while true; do hwclock; done

However it is fine in latest git.  Looks like Ingo's patch has been
merged, and has fixed it.

Alan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: Re: [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
  2008-08-26 10:25       ` Alan Jenkins
@ 2008-08-26 10:39         ` Ingo Molnar
  2008-08-27  8:54           ` Alan Jenkins
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2008-08-26 10:39 UTC (permalink / raw)
  To: Alan Jenkins
  Cc: Vegard Nossum, Mikael Pettersson, hpa, mingo, tglx, linux-kernel


* Alan Jenkins <alan-jenkins@tuffmail.co.uk> wrote:

> I beat you :-).  I noticed the problem myself.  It does trigger if I 
> run hwclock in a loop.
> 
> while true; do hwclock; done
> 
> However it is fine in latest git.  Looks like Ingo's patch has been 
> merged, and has fixed it.

ah, good - thanks for testing it. I havent seen a lockup on the T60 
either - but it was very sporadic to begin with. (and now we understand 
why: hwclock is running only once per reboot.)

Plus, the real bug has to be found as well: who in the rtc ioctl path 
calls this code with irqs disabled - it's not nice.

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
  2008-08-26 10:39         ` Ingo Molnar
@ 2008-08-27  8:54           ` Alan Jenkins
  0 siblings, 0 replies; 15+ messages in thread
From: Alan Jenkins @ 2008-08-27  8:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vegard Nossum, Mikael Pettersson, hpa, mingo, tglx, linux-kernel

Ingo Molnar wrote:
> * Alan Jenkins <alan-jenkins@tuffmail.co.uk> wrote:
>
>   
>> I beat you :-).  I noticed the problem myself.  It does trigger if I 
>> run hwclock in a loop.
>>
>> while true; do hwclock; done
>>
>> However it is fine in latest git.  Looks like Ingo's patch has been 
>> merged, and has fixed it.
>>     
>
> ah, good - thanks for testing it. I havent seen a lockup on the T60 
> either - but it was very sporadic to begin with. (and now we understand 
> why: hwclock is running only once per reboot.)
>
> Plus, the real bug has to be found as well: who in the rtc ioctl path 
> calls this code with irqs disabled - it's not nice.
>   

Isn't the problem that it gets called from a timer IRQ?  So irqs are
enabled globally, but the timer IRQ is masked.

 [<c0139cc0>] hrtimer_run_pending+0x20/0x90
 [<c01478a8>] handle_IRQ_event+0x28/0x50
 [<c0148f71>] handle_edge_irq+0xa1/0x120
 [<c010615b>] do_IRQ+0x3b/0x70
 [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
 [<c0103c4f>] common_interrupt+0x23/0x28

Which in turn is probably because the userspace "RTC interrupt"
interface is emulated using RTC reads within the main timer interrupt
handler.

Alan

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] rtc: fix deadlock: fixes regression since 2.6.24
  2008-08-23 16:01 ` [PATCH] rtc: fix deadlock Ingo Molnar
  2008-08-23 16:58   ` Mikael Pettersson
@ 2008-08-29 11:48   ` Frank van Maarseveen
  2008-09-06 18:32     ` Ingo Molnar
  1 sibling, 1 reply; 15+ messages in thread
From: Frank van Maarseveen @ 2008-08-29 11:48 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Mikael Pettersson, linux-kernel, hpa, mingo, tglx, Andrew Morton

On Sat, Aug 23, 2008 at 06:01:51PM +0200, Ingo Molnar wrote:
> 
> * Mikael Pettersson <mikpe@it.uu.se> wrote:
> 
> > Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
> > from hpet_rtc_interrupt, usually during shutdown or reboot,
> > but occasionally also early in init. Today I finally managed
> > to capture one via a serial cable:
> > 
> > INIT: version 2.86 booting
> > 		Welcome to Fedora Core
> > 		Press 'I' to enter interactive startup.
> > BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
> > Modules linked in: ehci_hcd uhci_hcd usbcore
> > 
> > Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
> > EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
> > EIP is at hpet_rtc_interrupt+0x2d2/0x310
> > EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
> > ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
> >  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
> > Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000 
> >        00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000 
> >        00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400 
> > Call Trace:
> >  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
> >  [<c01478a8>] handle_IRQ_event+0x28/0x50
> >  [<c0148f71>] handle_edge_irq+0xa1/0x120
> >  [<c010615b>] do_IRQ+0x3b/0x70
> >  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
> >  [<c0103c4f>] common_interrupt+0x23/0x28
> >  [<c02c0000>] unix_release_sock+0xc0/0x220
> >  =======================
> > Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26 
> > 
> > This points to the following loop in hpet_rtc_interrupt:
> > 
> > 0xc0117090 <hpet_rtc_interrupt+720>:    pause  
> > 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
> > 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
> > 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
> > 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
> > 
> > Note: 0xc035ba80 == &jiffies
> > 
> > This loop originates from asm-generic/rtc.h:get_rtc_time()
> > 
> > 		while (jiffies - uip_watchdog < 2*HZ/100) {
> > 			barrier();
> > 			cpu_relax();
> > 		}
> > 
> > Note: HZ == CONFIG_HZ == 100
> > 
> > The bug may not originate from the 2.6.27-rc series as I only recently 
> > enabled HPET in this machine's kernels (not due to HPET problems, it 
> > inherited its .config way back from an older machine w/o HPET).
> 
> argh, that loop in asm-generic/rtc.h:get_rtc_time looks extremely 
> fragile, we'll lock up if it's ever called with hardirqs off!
> 
> Does the patch below do the trick?
> 
> 	Ingo
> 
> ----------------->
> >From 2273cc870b52a7ed09eb225142a6db97299e4f39 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <mingo@elte.hu>
> Date: Sat, 23 Aug 2008 17:59:07 +0200
> Subject: [PATCH] rtc: fix deadlock
> 
> if get_rtc_time() is _ever_ called with IRQs off, we deadlock badly
> in it, waiting for jiffies to increment.
> 
> So make the code more robust by doing an explicit mdelay(20).
> 
> This solves a very hard to reproduce/debug hard lockup reported
> by Mikael Pettersson.
> 
> Reported-by: Mikael Pettersson <mikpe@it.uu.se>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  include/asm-generic/rtc.h |   12 ++++--------
>  1 files changed, 4 insertions(+), 8 deletions(-)
> 
> diff --git a/include/asm-generic/rtc.h b/include/asm-generic/rtc.h
> index be4af00..71ef3f0 100644
> --- a/include/asm-generic/rtc.h
> +++ b/include/asm-generic/rtc.h
> @@ -15,6 +15,7 @@
>  #include <linux/mc146818rtc.h>
>  #include <linux/rtc.h>
>  #include <linux/bcd.h>
> +#include <linux/delay.h>
>  
>  #define RTC_PIE 0x40		/* periodic interrupt enable */
>  #define RTC_AIE 0x20		/* alarm interrupt enable */
> @@ -43,7 +44,6 @@ static inline unsigned char rtc_is_updating(void)
>  
>  static inline unsigned int get_rtc_time(struct rtc_time *time)
>  {
> -	unsigned long uip_watchdog = jiffies;
>  	unsigned char ctrl;
>  	unsigned long flags;
>  
> @@ -53,19 +53,15 @@ static inline unsigned int get_rtc_time(struct rtc_time *time)
>  
>  	/*
>  	 * read RTC once any update in progress is done. The update
> -	 * can take just over 2ms. We wait 10 to 20ms. There is no need to
> +	 * can take just over 2ms. We wait 20ms. There is no need to
>  	 * to poll-wait (up to 1s - eeccch) for the falling edge of RTC_UIP.
>  	 * If you need to know *exactly* when a second has started, enable
>  	 * periodic update complete interrupts, (via ioctl) and then 
>  	 * immediately read /dev/rtc which will block until you get the IRQ.
>  	 * Once the read clears, read the RTC time (again via ioctl). Easy.
>  	 */
> -
> -	if (rtc_is_updating() != 0)
> -		while (jiffies - uip_watchdog < 2*HZ/100) {
> -			barrier();
> -			cpu_relax();
> -		}
> +	if (rtc_is_updating())
> +		mdelay(20);
>  
>  	/*
>  	 * Only the values that we read from the RTC are set. We leave

This patch fixes a regression since 2.6.24: 2.6.25 and 2.6.26 occasionally
locked up hard here without a trace and even alt-sysrq did not work
anymore. It's easy to reproduce with

	while :; do hwclock; done

Others are experiencing this issue too:
- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=494036
- http://kerneltrap.org/mailarchive/message-id/20080821163920.GA19140@gamma.logic.tuwien.ac.at/linux-kernel
- people (me included) experienced booting problems because of
  this (lockup after initscripts says "Setting the system clock").

maybe this is 2.6.25.x and 2.6.26.x material too?


Anyway, get_rtc_time() is called by interrupt handler(s). I think 20ms
is a awful lot of time for an interrupt handler.

-- 
Frank

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] rtc: fix deadlock: fixes regression since 2.6.24
  2008-08-29 11:48   ` [PATCH] rtc: fix deadlock: fixes regression since 2.6.24 Frank van Maarseveen
@ 2008-09-06 18:32     ` Ingo Molnar
  2008-10-01 23:51       ` [stable] " Greg KH
  0 siblings, 1 reply; 15+ messages in thread
From: Ingo Molnar @ 2008-09-06 18:32 UTC (permalink / raw)
  To: Frank van Maarseveen
  Cc: Mikael Pettersson, linux-kernel, hpa, mingo, tglx, Andrew Morton, stable


* Frank van Maarseveen <frankvm@frankvm.com> wrote:

> On Sat, Aug 23, 2008 at 06:01:51PM +0200, Ingo Molnar wrote:
> > 
> > * Mikael Pettersson <mikpe@it.uu.se> wrote:
> > 
> > > Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
> > > from hpet_rtc_interrupt, usually during shutdown or reboot,
> > > but occasionally also early in init. Today I finally managed
> > > to capture one via a serial cable:
> > > 
> > > INIT: version 2.86 booting
> > > 		Welcome to Fedora Core
> > > 		Press 'I' to enter interactive startup.
> > > BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
> > > Modules linked in: ehci_hcd uhci_hcd usbcore
> > > 
> > > Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
> > > EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
> > > EIP is at hpet_rtc_interrupt+0x2d2/0x310
> > > EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
> > > ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
> > >  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > > Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
> > > Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000 
> > >        00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000 
> > >        00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400 
> > > Call Trace:
> > >  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
> > >  [<c01478a8>] handle_IRQ_event+0x28/0x50
> > >  [<c0148f71>] handle_edge_irq+0xa1/0x120
> > >  [<c010615b>] do_IRQ+0x3b/0x70
> > >  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
> > >  [<c0103c4f>] common_interrupt+0x23/0x28
> > >  [<c02c0000>] unix_release_sock+0xc0/0x220
> > >  =======================
> > > Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26 
> > > 
> > > This points to the following loop in hpet_rtc_interrupt:
> > > 
> > > 0xc0117090 <hpet_rtc_interrupt+720>:    pause  
> > > 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
> > > 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
> > > 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
> > > 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
> > > 
> > > Note: 0xc035ba80 == &jiffies
> > > 
> > > This loop originates from asm-generic/rtc.h:get_rtc_time()
> > > 
> > > 		while (jiffies - uip_watchdog < 2*HZ/100) {
> > > 			barrier();
> > > 			cpu_relax();
> > > 		}
> > > 
> > > Note: HZ == CONFIG_HZ == 100
> > > 
> > > The bug may not originate from the 2.6.27-rc series as I only recently 
> > > enabled HPET in this machine's kernels (not due to HPET problems, it 
> > > inherited its .config way back from an older machine w/o HPET).
> > 
> > argh, that loop in asm-generic/rtc.h:get_rtc_time looks extremely 
> > fragile, we'll lock up if it's ever called with hardirqs off!
> > 
> > Does the patch below do the trick?
> > 
> > 	Ingo
> > 
> > ----------------->
> > >From 2273cc870b52a7ed09eb225142a6db97299e4f39 Mon Sep 17 00:00:00 2001
> > From: Ingo Molnar <mingo@elte.hu>
> > Date: Sat, 23 Aug 2008 17:59:07 +0200
> > Subject: [PATCH] rtc: fix deadlock
> > 
> > if get_rtc_time() is _ever_ called with IRQs off, we deadlock badly
> > in it, waiting for jiffies to increment.
> > 
> > So make the code more robust by doing an explicit mdelay(20).
> > 
> > This solves a very hard to reproduce/debug hard lockup reported
> > by Mikael Pettersson.
> > 
> > Reported-by: Mikael Pettersson <mikpe@it.uu.se>
> > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > ---
> >  include/asm-generic/rtc.h |   12 ++++--------
> >  1 files changed, 4 insertions(+), 8 deletions(-)
> > 
> > diff --git a/include/asm-generic/rtc.h b/include/asm-generic/rtc.h
> > index be4af00..71ef3f0 100644
> > --- a/include/asm-generic/rtc.h
> > +++ b/include/asm-generic/rtc.h
> > @@ -15,6 +15,7 @@
> >  #include <linux/mc146818rtc.h>
> >  #include <linux/rtc.h>
> >  #include <linux/bcd.h>
> > +#include <linux/delay.h>
> >  
> >  #define RTC_PIE 0x40		/* periodic interrupt enable */
> >  #define RTC_AIE 0x20		/* alarm interrupt enable */
> > @@ -43,7 +44,6 @@ static inline unsigned char rtc_is_updating(void)
> >  
> >  static inline unsigned int get_rtc_time(struct rtc_time *time)
> >  {
> > -	unsigned long uip_watchdog = jiffies;
> >  	unsigned char ctrl;
> >  	unsigned long flags;
> >  
> > @@ -53,19 +53,15 @@ static inline unsigned int get_rtc_time(struct rtc_time *time)
> >  
> >  	/*
> >  	 * read RTC once any update in progress is done. The update
> > -	 * can take just over 2ms. We wait 10 to 20ms. There is no need to
> > +	 * can take just over 2ms. We wait 20ms. There is no need to
> >  	 * to poll-wait (up to 1s - eeccch) for the falling edge of RTC_UIP.
> >  	 * If you need to know *exactly* when a second has started, enable
> >  	 * periodic update complete interrupts, (via ioctl) and then 
> >  	 * immediately read /dev/rtc which will block until you get the IRQ.
> >  	 * Once the read clears, read the RTC time (again via ioctl). Easy.
> >  	 */
> > -
> > -	if (rtc_is_updating() != 0)
> > -		while (jiffies - uip_watchdog < 2*HZ/100) {
> > -			barrier();
> > -			cpu_relax();
> > -		}
> > +	if (rtc_is_updating())
> > +		mdelay(20);
> >  
> >  	/*
> >  	 * Only the values that we read from the RTC are set. We leave
> 
> This patch fixes a regression since 2.6.24: 2.6.25 and 2.6.26 occasionally
> locked up hard here without a trace and even alt-sysrq did not work
> anymore. It's easy to reproduce with
> 
> 	while :; do hwclock; done
> 
> Others are experiencing this issue too:
> - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=494036
> - http://kerneltrap.org/mailarchive/message-id/20080821163920.GA19140@gamma.logic.tuwien.ac.at/linux-kernel
> - people (me included) experienced booting problems because of
>   this (lockup after initscripts says "Setting the system clock").
> 
> maybe this is 2.6.25.x and 2.6.26.x material too?

agreed - stable Cc:-ed.

It's about this upstream commit:

| commit 38c052f8cff1bd323ccfa968136a9556652ee420
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Sat Aug 23 17:59:07 2008 +0200
|
|     rtc: fix deadlock

please backport it into -stable .26 and .25. Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [stable] [PATCH] rtc: fix deadlock: fixes regression since 2.6.24
  2008-09-06 18:32     ` Ingo Molnar
@ 2008-10-01 23:51       ` Greg KH
  0 siblings, 0 replies; 15+ messages in thread
From: Greg KH @ 2008-10-01 23:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Frank van Maarseveen, Mikael Pettersson, linux-kernel, stable,
	mingo, hpa, tglx, Andrew Morton

On Sat, Sep 06, 2008 at 08:32:11PM +0200, Ingo Molnar wrote:
> 
> * Frank van Maarseveen <frankvm@frankvm.com> wrote:
> 
> > On Sat, Aug 23, 2008 at 06:01:51PM +0200, Ingo Molnar wrote:
> > > 
> > > * Mikael Pettersson <mikpe@it.uu.se> wrote:
> > > 
> > > > Since 2.6.27-rc1 my Core2Duo has been getting sporadic oopses
> > > > from hpet_rtc_interrupt, usually during shutdown or reboot,
> > > > but occasionally also early in init. Today I finally managed
> > > > to capture one via a serial cable:
> > > > 
> > > > INIT: version 2.86 booting
> > > > 		Welcome to Fedora Core
> > > > 		Press 'I' to enter interactive startup.
> > > > BUG: NMI Watchdog detected LOCKUP on CPU0, ip c0117092, registers:
> > > > Modules linked in: ehci_hcd uhci_hcd usbcore
> > > > 
> > > > Pid: 311, comm: nash-hotplug Not tainted (2.6.27-rc4 #1)
> > > > EIP: 0060:[<c0117092>] EFLAGS: 00000097 CPU: 0
> > > > EIP is at hpet_rtc_interrupt+0x2d2/0x310
> > > > EAX: 00000000 EBX: 00000002 ECX: 00000046 EDX: 00000002
> > > > ESI: 000000a6 EDI: ffff8e25 EBP: 00000008 ESP: f7bd7f28
> > > >  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> > > > Process nash-hotplug (pid: 311, ti=f7bd6000 task=f7b70460 task.ti=f7bd6000)
> > > > Stack: f7bd7f6c c0139cc0 00000000 c035ba04 00000000 00000000 00000000 00000000 
> > > >        00000000 00000000 00000000 00000000 00000000 f7b845a0 00000000 00000000 
> > > >        00000008 c01478a8 c035bf80 f7b845a0 c035bfb0 00000008 c0148f71 00000400 
> > > > Call Trace:
> > > >  [<c0139cc0>] hrtimer_run_pending+0x20/0x90
> > > >  [<c01478a8>] handle_IRQ_event+0x28/0x50
> > > >  [<c0148f71>] handle_edge_irq+0xa1/0x120
> > > >  [<c010615b>] do_IRQ+0x3b/0x70
> > > >  [<c0113225>] smp_apic_timer_interrupt+0x55/0x80
> > > >  [<c0103c4f>] common_interrupt+0x23/0x28
> > > >  [<c02c0000>] unix_release_sock+0xc0/0x220
> > > >  =======================
> > > > Code: 89 44 24 18 0f b6 c2 e8 5d 74 0c 00 8b 0d d8 9c 3b c0 89 44 24 1c 8b 44 24 0c 48 89 44 24 20 e9 84 fd ff ff 90 8d 74 26 00 f3 90 <a1> 80 ba 35 c0 29 f8 83 f8 01 76 f2 e9 e1 fe ff ff 90 8d 74 26 
> > > > 
> > > > This points to the following loop in hpet_rtc_interrupt:
> > > > 
> > > > 0xc0117090 <hpet_rtc_interrupt+720>:    pause  
> > > > 0xc0117092 <hpet_rtc_interrupt+722>:    mov    0xc035ba80,%eax
> > > > 0xc0117097 <hpet_rtc_interrupt+727>:    sub    %edi,%eax
> > > > 0xc0117099 <hpet_rtc_interrupt+729>:    cmp    $0x1,%eax
> > > > 0xc011709c <hpet_rtc_interrupt+732>:    jbe    0xc0117090 <hpet_rtc_interrupt+720>
> > > > 
> > > > Note: 0xc035ba80 == &jiffies
> > > > 
> > > > This loop originates from asm-generic/rtc.h:get_rtc_time()
> > > > 
> > > > 		while (jiffies - uip_watchdog < 2*HZ/100) {
> > > > 			barrier();
> > > > 			cpu_relax();
> > > > 		}
> > > > 
> > > > Note: HZ == CONFIG_HZ == 100
> > > > 
> > > > The bug may not originate from the 2.6.27-rc series as I only recently 
> > > > enabled HPET in this machine's kernels (not due to HPET problems, it 
> > > > inherited its .config way back from an older machine w/o HPET).
> > > 
> > > argh, that loop in asm-generic/rtc.h:get_rtc_time looks extremely 
> > > fragile, we'll lock up if it's ever called with hardirqs off!
> > > 
> > > Does the patch below do the trick?
> > > 
> > > 	Ingo
> > > 
> > > ----------------->
> > > >From 2273cc870b52a7ed09eb225142a6db97299e4f39 Mon Sep 17 00:00:00 2001
> > > From: Ingo Molnar <mingo@elte.hu>
> > > Date: Sat, 23 Aug 2008 17:59:07 +0200
> > > Subject: [PATCH] rtc: fix deadlock
> > > 
> > > if get_rtc_time() is _ever_ called with IRQs off, we deadlock badly
> > > in it, waiting for jiffies to increment.
> > > 
> > > So make the code more robust by doing an explicit mdelay(20).
> > > 
> > > This solves a very hard to reproduce/debug hard lockup reported
> > > by Mikael Pettersson.
> > > 
> > > Reported-by: Mikael Pettersson <mikpe@it.uu.se>
> > > Signed-off-by: Ingo Molnar <mingo@elte.hu>
> > > ---
> > >  include/asm-generic/rtc.h |   12 ++++--------
> > >  1 files changed, 4 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/include/asm-generic/rtc.h b/include/asm-generic/rtc.h
> > > index be4af00..71ef3f0 100644
> > > --- a/include/asm-generic/rtc.h
> > > +++ b/include/asm-generic/rtc.h
> > > @@ -15,6 +15,7 @@
> > >  #include <linux/mc146818rtc.h>
> > >  #include <linux/rtc.h>
> > >  #include <linux/bcd.h>
> > > +#include <linux/delay.h>
> > >  
> > >  #define RTC_PIE 0x40		/* periodic interrupt enable */
> > >  #define RTC_AIE 0x20		/* alarm interrupt enable */
> > > @@ -43,7 +44,6 @@ static inline unsigned char rtc_is_updating(void)
> > >  
> > >  static inline unsigned int get_rtc_time(struct rtc_time *time)
> > >  {
> > > -	unsigned long uip_watchdog = jiffies;
> > >  	unsigned char ctrl;
> > >  	unsigned long flags;
> > >  
> > > @@ -53,19 +53,15 @@ static inline unsigned int get_rtc_time(struct rtc_time *time)
> > >  
> > >  	/*
> > >  	 * read RTC once any update in progress is done. The update
> > > -	 * can take just over 2ms. We wait 10 to 20ms. There is no need to
> > > +	 * can take just over 2ms. We wait 20ms. There is no need to
> > >  	 * to poll-wait (up to 1s - eeccch) for the falling edge of RTC_UIP.
> > >  	 * If you need to know *exactly* when a second has started, enable
> > >  	 * periodic update complete interrupts, (via ioctl) and then 
> > >  	 * immediately read /dev/rtc which will block until you get the IRQ.
> > >  	 * Once the read clears, read the RTC time (again via ioctl). Easy.
> > >  	 */
> > > -
> > > -	if (rtc_is_updating() != 0)
> > > -		while (jiffies - uip_watchdog < 2*HZ/100) {
> > > -			barrier();
> > > -			cpu_relax();
> > > -		}
> > > +	if (rtc_is_updating())
> > > +		mdelay(20);
> > >  
> > >  	/*
> > >  	 * Only the values that we read from the RTC are set. We leave
> > 
> > This patch fixes a regression since 2.6.24: 2.6.25 and 2.6.26 occasionally
> > locked up hard here without a trace and even alt-sysrq did not work
> > anymore. It's easy to reproduce with
> > 
> > 	while :; do hwclock; done
> > 
> > Others are experiencing this issue too:
> > - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=494036
> > - http://kerneltrap.org/mailarchive/message-id/20080821163920.GA19140@gamma.logic.tuwien.ac.at/linux-kernel
> > - people (me included) experienced booting problems because of
> >   this (lockup after initscripts says "Setting the system clock").
> > 
> > maybe this is 2.6.25.x and 2.6.26.x material too?
> 
> agreed - stable Cc:-ed.
> 
> It's about this upstream commit:
> 
> | commit 38c052f8cff1bd323ccfa968136a9556652ee420
> | Author: Ingo Molnar <mingo@elte.hu>
> | Date:   Sat Aug 23 17:59:07 2008 +0200
> |
> |     rtc: fix deadlock
> 
> please backport it into -stable .26 and .25. Thanks,

backported.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2008-10-02  0:03 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-08-23  9:48 [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Mikael Pettersson
2008-08-23 16:01 ` [PATCH] rtc: fix deadlock Ingo Molnar
2008-08-23 16:58   ` Mikael Pettersson
2008-08-23 17:11     ` Ingo Molnar
2008-08-23 18:45       ` Maciej W. Rozycki
2008-08-23 19:46         ` Mikael Pettersson
2008-08-29 11:48   ` [PATCH] rtc: fix deadlock: fixes regression since 2.6.24 Frank van Maarseveen
2008-09-06 18:32     ` Ingo Molnar
2008-10-01 23:51       ` [stable] " Greg KH
2008-08-24  9:14 ` [BUG] get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Vegard Nossum
2008-08-24 10:32   ` Mikael Pettersson
2008-08-24 11:48     ` Vegard Nossum
2008-08-26 10:25       ` Alan Jenkins
2008-08-26 10:39         ` Ingo Molnar
2008-08-27  8:54           ` Alan Jenkins

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).