linux-kernel.vger.kernel.org archive mirror
* 2.4.0-test11pre2-ac1 and previous problem
@ 2000-11-10 17:17 Paweł Kot
  2000-11-10 17:30 ` Rik van Riel
  2000-11-11  0:25 ` Andrew Morton
  0 siblings, 2 replies; 11+ messages in thread
From: Paweł Kot @ 2000-11-10 17:17 UTC
  To: linux-kernel


Hello,

I get the following error with 2.4.0-test{9|10|11pre1-ac1|11pre2-ac1}:

NMI Watchdog detected LOCKUP on CPU3, registers:

And then the machine hangs. No response at all.
CPU3 is always the one mentioned.
The machine is:
The latest Intel motherboard for 4xCPU (ISP4040)
4xPentium III 700 (Xeon)
4GB RAM
mylex raid array (the newest controller)
eepro100 ethernet card

This machine runs only a MySQL database.

What can be wrong?

pkot
-- 
mailto:pkot@linuxnews.pl
http://urtica.linuxnews.pl/~pkot/
http://newsreader.linuxnews.pl/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-10 17:17 2.4.0-test11pre2-ac1 and previous problem Paweł Kot
@ 2000-11-10 17:30 ` Rik van Riel
  2000-11-10 17:36   ` Paweł Kot
  2000-11-11  0:25 ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: Rik van Riel @ 2000-11-10 17:30 UTC
  To: Paweł Kot; +Cc: linux-kernel

On Fri, 10 Nov 2000, Paweł Kot wrote:

> NMI Watchdog detected LOCKUP on CPU3, registers:

> What can be wrong?

You forgot to read the REPORTING-BUGS file.

You told us everything except the really
important information ... the backtrace from
the info printed by the NMI Oopser...

regards,

Rik
--
The Internet is not a network of computers. It is a network
of people. That is its real strength.

http://www.conectiva.com/		http://www.surriel.com/


* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-10 17:30 ` Rik van Riel
@ 2000-11-10 17:36   ` Paweł Kot
  0 siblings, 0 replies; 11+ messages in thread
From: Paweł Kot @ 2000-11-10 17:36 UTC
  To: Rik van Riel; +Cc: linux-kernel


> On Fri, 10 Nov 2000, Paweł Kot wrote:
>
> > NMI Watchdog detected LOCKUP on CPU3, registers:
>
> > What can be wrong?
>
> You forgot to read the REPORTING-BUGS file.
>
> You told us everything except the really
> important information ... the backtrace from
> the info printed by the NMI Oopser...

No oops was generated, or at least I can't find one. That line was the
only thing on the console, and there is nothing in the logs.

regards
pkot
-- 
mailto:pkot@linuxnews.pl
http://urtica.linuxnews.pl/~pkot/
http://newsreader.linuxnews.pl/


* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-10 17:17 2.4.0-test11pre2-ac1 and previous problem Paweł Kot
  2000-11-10 17:30 ` Rik van Riel
@ 2000-11-11  0:25 ` Andrew Morton
  2000-11-11 15:27   ` Jasper Spaans
  1 sibling, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2000-11-11  0:25 UTC
  To: Paweł Kot; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 613 bytes --]

"Paweł Kot" wrote:
> 
> Hello,
> 
> I get the following error with 2.4.0-test{9|10|11pre1-ac1|11pre2-ac1}:
> 
> NMI Watchdog detected LOCKUP on CPU3, registers:
> 
> And then the machine hangs. No response at all.
> CPU3 is always the one mentioned.
> The machine is:
> The latest Intel motherboard for 4xCPU (ISP4040)
> 4xPentium III 700 (Xeon)
> 4GB RAM
> mylex raid array (the newest controller)
> eepro100 ethernet card
> 
> This machine runs only a MySQL database.
> 
> What can be wrong?

Oh no.  Another one.  Could you please try this attached
patch (against test11-pre2) and see if the diagnostics
come out?

[-- Attachment #2: kumon.patch --]
[-- Type: text/plain, Size: 2750 bytes --]

--- linux-2.4.0-test11-pre2/arch/i386/kernel/traps.c	Fri Nov 10 20:24:02 2000
+++ linux-akpm/arch/i386/kernel/traps.c	Sat Nov 11 02:35:25 2000
@@ -382,20 +382,10 @@
 	printk("Do you have a strange power saving mode enabled?\n");
 }
 
-#if CONFIG_X86_IO_APIC
-
-int nmi_watchdog = 1;
-
-static int __init setup_nmi_watchdog(char *str)
-{
-        get_option(&str, &nmi_watchdog);
-        return 1;
-}
-
-__setup("nmi_watchdog=", setup_nmi_watchdog);
-
-extern spinlock_t console_lock, timerlist_lock;
+extern spinlock_t console_lock, timerlist_lock, runqueue_lock;
 static spinlock_t nmi_print_lock = SPIN_LOCK_UNLOCKED;
+extern wait_queue_head_t log_wait;
+static int ignore_spinlocks = -1;
 
 /*
  * Unlock any spinlocks which will prevent us from getting the
@@ -404,9 +394,30 @@
  */
 void bust_spinlocks(void)
 {
+	ignore_spinlocks = smp_processor_id();
+	global_irq_lock = 0;
 	spin_lock_init(&console_lock);
 	spin_lock_init(&timerlist_lock);
+	spin_lock_init(&runqueue_lock);
+	log_wait.lock = WAITQUEUE_RW_LOCK_UNLOCKED;
 }
+
+int no_spinlocks(void)
+{
+	return smp_processor_id() == ignore_spinlocks;
+}
+
+#if CONFIG_X86_IO_APIC
+
+int nmi_watchdog = 1;
+
+static int __init setup_nmi_watchdog(char *str)
+{
+        get_option(&str, &nmi_watchdog);
+        return 1;
+}
+
+__setup("nmi_watchdog=", setup_nmi_watchdog);
 
 inline void nmi_watchdog_tick(struct pt_regs * regs)
 {
--- linux-2.4.0-test11-pre2/include/asm-i386/spinlock.h	Sun Oct 15 01:27:46 2000
+++ linux-akpm/include/asm-i386/spinlock.h	Sat Nov 11 02:17:46 2000
@@ -8,6 +8,8 @@
 extern int printk(const char * fmt, ...)
 	__attribute__ ((format (printf, 1, 2)));
 
+extern int no_spinlocks(void);
+
 /* It seems that people are forgetting to
  * initialize their spinlocks properly, tsk tsk.
  * Remember to turn this off in 2.4. -ben
@@ -68,6 +70,8 @@
 static inline int spin_trylock(spinlock_t *lock)
 {
 	char oldval;
+	if (no_spinlocks())
+		return 0;
 	__asm__ __volatile__(
 		"xchgb %b0,%1"
 		:"=q" (oldval), "=m" (lock->lock)
@@ -85,6 +89,8 @@
 		BUG();
 	}
 #endif
+	if (no_spinlocks())
+		return;
 	__asm__ __volatile__(
 		spin_lock_string
 		:"=m" (lock->lock) : : "memory");
@@ -149,6 +155,9 @@
 	if (rw->magic != RWLOCK_MAGIC)
 		BUG();
 #endif
+	if (no_spinlocks())
+		return;
+
 	__build_read_lock(rw, "__read_lock_failed");
 }
 
@@ -158,6 +167,8 @@
 	if (rw->magic != RWLOCK_MAGIC)
 		BUG();
 #endif
+	if (no_spinlocks())
+		return;
 	__build_write_lock(rw, "__write_lock_failed");
 }
 
--- linux-2.4.0-test11-pre2/arch/i386/kernel/i386_ksyms.c	Fri Aug 11 19:06:11 2000
+++ linux-akpm/arch/i386/kernel/i386_ksyms.c	Sat Nov 11 02:24:32 2000
@@ -155,3 +155,6 @@
 #ifdef CONFIG_X86_PAE
 EXPORT_SYMBOL(empty_zero_page);
 #endif
+
+EXPORT_SYMBOL(no_spinlocks);
+


* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-11  0:25 ` Andrew Morton
@ 2000-11-11 15:27   ` Jasper Spaans
  2000-11-12  1:32     ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Jasper Spaans @ 2000-11-11 15:27 UTC
  To: Andrew Morton; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 538 bytes --]

On Sat, Nov 11, 2000 at 11:25:00AM +1100, Andrew Morton wrote:

> > NMI Watchdog detected LOCKUP on CPU3, registers:

> Oh no.  Another one.  Could you please try this attached
> patch (against test11-pre2) and see if the diagnostics
> come out?

And yet another one... I applied your patch, and ran my oopses through
ksymoops, results attached.

Kernel: 2.4.0-test11-pre2 + reiserfs-3.6.18
2 * P-II 350, 256 MB RAM, no special hardware, AFAIK.

Of course, more details are available.

Regards,
-- 
Jasper Spaans  <jasper@spaans.ds9a.nl>

[-- Attachment #2: ksymoops-output --]
[-- Type: text/plain, Size: 4120 bytes --]

ksymoops 2.3.4 on i686 2.4.0-test11.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.0-test11/ (default)
     -m /boot/System.map-2.4.0-test11-pre2 (specified)

Nov 11 13:37:08 spaans kernel: NMI Watchdog detected LOCKUP on CPU1, registers:
Nov 11 13:37:08 spaans kernel: CPU:    1
Nov 11 13:37:08 spaans kernel: EIP:    0010:[tvecs+21104/42928]
Nov 11 13:37:08 spaans kernel: EFLAGS: 00000086
Nov 11 13:37:08 spaans kernel: eax: 00000000   ebx: ce780000   ecx: 00000000   edx: ffffffff
Nov 11 13:37:08 spaans kernel: esi: 00000002   edi: 00000000   ebp: cd0f1eb0   esp: cd0f1ea8
Nov 11 13:37:08 spaans kernel: ds: 0018   es: 0018   ss: 0018
Nov 11 13:37:08 spaans kernel: Process mysqld (pid: 5799, stackpage=cd0f1000)
Nov 11 13:37:08 spaans kernel: Stack: ce780000 00000021 00000000 c01152ea ce780000 00000021 00000086 c01153c6 
Nov 11 13:37:08 spaans kernel:        00000021 cd0f1f04 ce780000 00040001 00000000 cd0f0000 00000021 c011589a 
Nov 11 13:37:08 spaans kernel:        00000021 cd0f1f04 ce780000 cd0f0000 c1458000 cd0f0000 ce780000 00000021 
Nov 11 13:37:08 spaans kernel: Call Trace: [do_ioctl+334/820] [do_ioctl+554/820] [apm+358/644] [will_become_orphaned_pgrp+14/124] [do_fork+211/2788] [do_fork+1188/2788] [do_fork+1274/2788] 
Nov 11 13:37:08 spaans kernel: Code: 80 3d 00 44 2e c0 00 f3 90 7e f5 e9 67 59 ee ff 80 3b 00 f3 
Using defaults from ksymoops -t elf32-i386 -a i386

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   80 3d 00 44 2e c0 00      cmpb   $0x0,0xc02e4400
Code;  00000007 Before first symbol
   7:   f3 90                     repz nop 
Code;  00000009 Before first symbol
   9:   7e f5                     jle    0 <_EIP>
Code;  0000000b Before first symbol
   b:   e9 67 59 ee ff            jmp    ffee5977 <_EIP+0xffee5977> ffee5977 <END_OF_CODE+2ec39fd0/????>
Code;  00000010 Before first symbol
  10:   80 3b 00                  cmpb   $0x0,(%ebx)
Code;  00000013 Before first symbol
  13:   f3 00 00                  repz add %al,(%eax)

Nov 11 13:37:08 spaans kernel: NMI Watchdog detected LOCKUP on CPU0, registers:
Nov 11 13:37:08 spaans kernel: CPU:    0
Nov 11 13:37:08 spaans kernel: EIP:    0010:[__down+339/588]
Nov 11 13:37:08 spaans kernel: EFLAGS: 00000097
Nov 11 13:37:08 spaans kernel: eax: c02e4420   ebx: ca6c0000   ecx: 00000000   edx: ffffffff
Nov 11 13:37:08 spaans kernel: esi: c02e4420   edi: ca6c1fa8   ebp: ca6c1fa8   esp: ca6c1f90
Nov 11 13:37:08 spaans kernel: ds: 0018   es: 0018   ss: 0018
Nov 11 13:37:08 spaans kernel: Process mysqld (pid: 5801, stackpage=ca6c1000)
Nov 11 13:37:08 spaans kernel: Stack: c0234797 ca6c0000 40083890 bf5ffc00 fffffff2 00000008 ca6c1fbc c0119ad4 
Nov 11 13:37:08 spaans kernel:        000016a9 00000000 bf5ff8e8 bf5ff8b4 c010b203 000016a9 00000000 bf5ff8e8 
Nov 11 13:37:08 spaans kernel:        40083890 bf5ffc00 bf5ff8b4 0000009c 0000002b 0000002b 0000009c 401a03e4 
Nov 11 13:37:08 spaans kernel: Call Trace: [tvecs+21059/42928] [proc_sel+68/140] [restore_sigcontext+63/328] 
Nov 11 13:37:08 spaans kernel: Code: 83 38 01 78 fb f0 ff 08 0f 88 ef ff ff ff c3 90 90 90 90 90 

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
   0:   83 38 01                  cmpl   $0x1,(%eax)
Code;  00000003 Before first symbol
   3:   78 fb                     js     0 <_EIP>
Code;  00000005 Before first symbol
   5:   f0 ff 08                  lock decl (%eax)
Code;  00000008 Before first symbol
   8:   0f 88 ef ff ff ff         js     fffffffd <_EIP+0xfffffffd> fffffffd <END_OF_CODE+2ed54656/????>
Code;  0000000e Before first symbol
   e:   c3                        ret    
Code;  0000000f Before first symbol
   f:   90                        nop    
Code;  00000010 Before first symbol
  10:   90                        nop    
Code;  00000011 Before first symbol
  11:   90                        nop    
Code;  00000012 Before first symbol
  12:   90                        nop    
Code;  00000013 Before first symbol
  13:   90                        nop    



* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-11 15:27   ` Jasper Spaans
@ 2000-11-12  1:32     ` Andrew Morton
  2000-11-12  2:27       ` Keith Owens
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2000-11-12  1:32 UTC
  To: Jasper Spaans; +Cc: linux-kernel

Jasper Spaans wrote:
> 
> On Sat, Nov 11, 2000 at 11:25:00AM +1100, Andrew Morton wrote:
> 
> > > NMI Watchdog detected LOCKUP on CPU3, registers:
> 
> > Oh no.  Another one.  Could you please try this attached
> > patch (against test11-pre2) and see if the diagnostics
> > come out?
> 
> And yet another one... I applied your patch, and ran my oopses through
> ksymoops, results attached.
> 
> Kernel: 2.4.0-test11-pre2 + reiserfs-3.6.18
> 2 * P-II 350, 256 MB RAM, no special hardware, AFAIK.
> 
> Of course, more details are available.

That's a pretty weird trace.  You seem to have addresses related
to the `apm' kernel thread on mysqld's stack.

Are you saying that your machine is getting NMI lockups but
is not usually able to print the register and stack
dumps (i.e. a printk deadlock)?

Do they go away if you disable APM?

* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-12  1:32     ` Andrew Morton
@ 2000-11-12  2:27       ` Keith Owens
  2000-11-13  8:58         ` Jasper Spaans
  0 siblings, 1 reply; 11+ messages in thread
From: Keith Owens @ 2000-11-12  2:27 UTC
  To: Andrew Morton; +Cc: Jasper Spaans, linux-kernel

On Sun, 12 Nov 2000 12:32:55 +1100, 
Andrew Morton <andrewm@uow.edu.au> wrote:
>> > > NMI Watchdog detected LOCKUP on CPU3, registers:
>That's a pretty weird trace.  You seem to have addresses related
>to the `apm' kernel thread on mysqld's stack.

Normal, unfortunately.  Firstly, the ix86 oops code just scans the stack
and prints anything that looks like it might be a kernel address.  It
makes no attempt to confirm that these really are return addresses, so
an ix86 oops trace gets lots of false positives.  Secondly, that trace
was converted by klogd (symbols in the call trace line instead of numbers),
not by ksymoops, and I do not trust the klogd algorithm at all.  Between
the false positives and the very real possibility of klogd having got
it wrong, you have to take any ix86 oops with a pinch of salt.

Starting klogd as "klogd -x" in /etc/rc.d/init.d/syslog will get rid
of one source of error.  Using the kdb patch[*] will give you a much
better debug environment for NMI lockups.  kdb gives an accurate back
trace because it understands ix86 stack frames as well as the out of
line lock handlers, at the expense of some very ugly code.

[*]ftp://oss.sgi.com/projects/kdb/download/ix86/
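
To illustrate the false-positive problem described above, here is a minimal
userspace sketch (not the kernel's actual oops code) of a scan that accepts
any stack word falling inside the kernel text range. The bounds TEXT_START
and TEXT_END and both function names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bounds of kernel text, standing in for the real symbols. */
#define TEXT_START 0xc0100000u
#define TEXT_END   0xc0240000u

/* Nonzero if the word merely falls inside the kernel text range. */
static int looks_like_kernel_text(uint32_t word)
{
    return word >= TEXT_START && word < TEXT_END;
}

/*
 * Naive scan: copy every stack word that falls in the text range into out[].
 * Nothing checks that the word really is a return address, so stale values
 * left on the stack by earlier calls are reported as well.
 */
static size_t scan_stack(const uint32_t *stack, size_t nwords,
                         uint32_t *out, size_t max_out)
{
    size_t found = 0;

    for (size_t i = 0; i < nwords && found < max_out; i++) {
        if (looks_like_kernel_text(stack[i]))
            out[found++] = stack[i];
    }
    return found;
}
```

Given a stack holding two genuine return addresses plus one leftover word
that happens to point into kernel text, the scan reports all three. A
debugger that walks stack frames, as kdb is said to do above, would not list
the leftover.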


* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-12  2:27       ` Keith Owens
@ 2000-11-13  8:58         ` Jasper Spaans
  2000-11-13 10:44           ` Keith Owens
  2000-11-13 12:11           ` Andrew Morton
  0 siblings, 2 replies; 11+ messages in thread
From: Jasper Spaans @ 2000-11-13  8:58 UTC
  To: Keith Owens; +Cc: Andrew Morton, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 926 bytes --]

On Sun, Nov 12, 2000 at 01:27:13PM +1100, Keith Owens wrote:

> >That's a pretty weird trace.  You seem to have addresses related
> >to the `apm' kernel thread on mysqld's stack.
> 
> Normal, unfortunately.  Firstly, the ix86 oops code just scans the stack
> and prints anything that looks like it might be a kernel address.  It
> makes no attempt to confirm that these really are return addresses, so
> an ix86 oops trace gets lots of false positives.  Secondly, that trace
> was converted by klogd (symbols in the call trace line instead of numbers),
> not by ksymoops, and I do not trust the klogd algorithm at all.

All right, here's another one, this time using the oops directly from the
console -- this seems to give better symbols. The 'console shuts up ...'
part works, so the oops from the other CPU never got printed.

Will try test11-pre3 + kdb this afternoon, if it compiles.

Regards,
-- 
Jasper Spaans  <jasper@spaans.ds9a.nl>

[-- Attachment #2: ksymoops-output-2 --]
[-- Type: text/plain, Size: 2028 bytes --]

ksymoops 2.3.4 on i686 2.4.0-test11.  Options used
     -V (default)
     -k /proc/ksyms (default)
     -l /proc/modules (default)
     -o /lib/modules/2.4.0-test11/ (default)
     -m /boot/System.map-2.4.0-test11-pre2 (specified)

NMI Watchdog detected LOCKUP on CPU1, registers:
CPU:    1
EIP:    0010:[<c02347c4>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000086
eax: 00000000   ebx: ce780000   ecx: 00000000   edx: ffffffff
esi: 00000002   edi: 00000000   ebp: cd0f1eb0   esp: cd0f1ea8
ds: 0018   es: 0018   ss: 0018
Process mysqld (pid: 5799, stackpage=cd0f1000)
Stack: ce780000 00000021 00000000 c01152ea ce780000 00000021 00000086 c01153c6 
       00000021 cd0f1f04 ce780000 00040001 00000000 cd0f0000 00000021 c011589a 
       00000021 cd0f1f04 ce780000 cd0f0000 c1458000 cd0f0000 ce780000 00000021 
Call Trace: [<c01152ea>] [<c01153c6>] [<c011589a>] [<c0122012>] [<c011e727>] [<c011eaf8>] [<c011eb4e>] 
       [<c010b203>] 
Code: 80 3d 00 44 2e c0 00 f3 90 7e f5 e9 67 59 ee ff 80 3b 00 f3 

>>EIP; c02347c4 <stext_lock+8e8/98ac>   <=====
Trace; c01152ea <deliver_signal+4a/90>
Trace; c01153c6 <send_sig_info+96/c0>
Trace; c011589a <do_notify_parent+c6/e0>
Trace; c0122012 <do_acct_process+21e/22c>
Trace; c011e727 <exit_notify+1a7/30c>
Trace; c011eaf8 <do_exit+26c/2b4>
Trace; c011eb4e <sys_exit+e/10>
Trace; c010b203 <system_call+33/38>
Code;  c02347c4 <stext_lock+8e8/98ac>
00000000 <_EIP>:
Code;  c02347c4 <stext_lock+8e8/98ac>   <=====
   0:   80 3d 00 44 2e c0 00      cmpb   $0x0,0xc02e4400   <=====
Code;  c02347cb <stext_lock+8ef/98ac>
   7:   f3 90                     repz nop 
Code;  c02347cd <stext_lock+8f1/98ac>
   9:   7e f5                     jle    0 <_EIP>
Code;  c02347cf <stext_lock+8f3/98ac>
   b:   e9 67 59 ee ff            jmp    ffee5977 <_EIP+0xffee5977> c011a13b <wake_up_process+13/68>
Code;  c02347d4 <stext_lock+8f8/98ac>
  10:   80 3b 00                  cmpb   $0x0,(%ebx)
Code;  c02347d7 <stext_lock+8fb/98ac>
  13:   f3 00 00                  repz add %al,(%eax)



* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-13  8:58         ` Jasper Spaans
@ 2000-11-13 10:44           ` Keith Owens
  2000-11-13 11:16             ` Andrew Morton
  2000-11-13 12:11           ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: Keith Owens @ 2000-11-13 10:44 UTC
  To: Jasper Spaans; +Cc: Andrew Morton, linux-kernel

On Mon, 13 Nov 2000 09:58:17 +0100, 
Jasper Spaans <jasper@spaans.ds9a.nl> wrote:
>All right, here's another one, this time using the oops directly from the
>console -- this seems to give better symbols. The 'console shuts up ...'
>part works, so the oops from the other CPU never got printed.

Ohhhh, damn!  For NMI lockups we want the console to stay live so that the
NMI reports from the other CPUs can be printed.  An NMI lockup is normally
caused by spinlock problems, and it is useful to know what the other CPUs
are doing.  Andrew, do you want to have a go at fixing this?

>Will try test11-pre3 + kdb this afternoon, if it compiles.

Patch kdb-v1.5-2.4.0-test11-pre3.gz should be OK.


* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-13 10:44           ` Keith Owens
@ 2000-11-13 11:16             ` Andrew Morton
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2000-11-13 11:16 UTC
  To: Keith Owens; +Cc: Jasper Spaans, linux-kernel

Keith Owens wrote:
> 
> On Mon, 13 Nov 2000 09:58:17 +0100,
> Jasper Spaans <jasper@spaans.ds9a.nl> wrote:
> >All right, here's another one, this time using the oops directly from the
> >console -- this seems to give better symbols. The 'console shuts up ...'
> >part works, so the oops from the other CPU never got printed.
> 
> Ohhhh, damn!  For NMI lockups we want the console to stay live so that the
> NMI reports from the other CPUs can be printed.  An NMI lockup is normally
> caused by spinlock problems, and it is useful to know what the other CPUs
> are doing.  Andrew, do you want to have a go at fixing this?

Uh, sure - I just _love_ running fsck :)  I'm working on this stuff
at present.  That wake_up in printk() is baaaaad...

> >Will try test11-pre3 + kdb this afternoon, if it compiles.
> 
> Patch kdb-v1.5-2.4.0-test11-pre3.gz should be OK.

It would be very, very interesting to see where the other CPU is.

I can see one bug from Jasper's trace: setscheduler() does:

        spin_lock_irq(&runqueue_lock);
        read_lock(&tasklist_lock);

whereas the exit_notify->do_notify_parent->send_sig_info->wake_up_process
path does:

        write_lock_irq(&tasklist_lock);
        spin_lock_irqsave(&runqueue_lock, flags);

Death by double deadlock.  But I doubt if setscheduler() is the
source - who ever calls that?

The correct locking hierarchy is, I think:

	spin_lock(runqueue_lock)
	read/write_lock(tasklist_lock)
	read/write_unlock(tasklist_lock)
	spin_unlock(runqueue_lock)
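
The ordering rule above can be sketched in userspace C as a small rank
check. This is only an illustration, not kernel code: the ranks,
struct held_locks, acquire_in_order() and release_last() are hypothetical
names, and the ranking simply assumes the hierarchy suggested above
(runqueue_lock before tasklist_lock).

```c
#include <assert.h>

/* Lower rank must be taken first, following the hierarchy suggested above. */
enum lock_rank { RUNQUEUE_LOCK = 1, TASKLIST_LOCK = 2 };

#define MAX_HELD 8

/* Locks currently held by one execution context, in acquisition order. */
struct held_locks {
    enum lock_rank stack[MAX_HELD];
    int depth;
};

/* Returns 1 if taking `rank` now respects the hierarchy, 0 on an inversion. */
static int acquire_in_order(struct held_locks *h, enum lock_rank rank)
{
    if (h->depth == MAX_HELD)
        return 0;
    if (h->depth > 0 && h->stack[h->depth - 1] >= rank)
        return 0;   /* a lock of equal or higher rank is already held */
    h->stack[h->depth++] = rank;
    return 1;
}

/* Drops the most recently taken lock, if any. */
static void release_last(struct held_locks *h)
{
    if (h->depth > 0)
        h->depth--;
}
```

Under this check the setscheduler() order (runqueue_lock, then
tasklist_lock) passes, while the exit_notify path (tasklist_lock, then
runqueue_lock) is flagged at its second acquisition. That mismatch is the
pair of orders that can deadlock when two CPUs each hold one of the locks.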

Jasper, as a random stab in the dark you may care to try this:

--- linux-2.4.0-test11-pre4/kernel/exit.c	Sun Oct 15 01:27:46 2000
+++ linux-akpm/kernel/exit.c	Mon Nov 13 22:05:37 2000
@@ -381,8 +381,10 @@
 	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
 	 */
 
-	write_lock_irq(&tasklist_lock);
+	read_lock_irq(&tasklist_lock);
 	do_notify_parent(current, current->exit_signal);
+	read_unlock_irq(&tasklist_lock);
+	write_lock_irq(&tasklist_lock);
 	while (current->p_cptr != NULL) {
 		p = current->p_cptr;
 		current->p_cptr = p->p_osptr;

* Re: 2.4.0-test11pre2-ac1 and previous problem
  2000-11-13  8:58         ` Jasper Spaans
  2000-11-13 10:44           ` Keith Owens
@ 2000-11-13 12:11           ` Andrew Morton
  1 sibling, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2000-11-13 12:11 UTC
  To: Jasper Spaans; +Cc: Keith Owens, linux-kernel

Jasper Spaans wrote:
> 
> On Sun, Nov 12, 2000 at 01:27:13PM +1100, Keith Owens wrote:
> 
> > >That's a pretty weird trace.  You seem to have addresses related
> > >to the `apm' kernel thread on mysqld's stack.
> >
> > Normal, unfortunately.  Firstly, the ix86 oops code just scans the stack
> > and prints anything that looks like it might be a kernel address.  It
> > makes no attempt to confirm that these really are return addresses, so
> > an ix86 oops trace gets lots of false positives.  Secondly, that trace
> > was converted by klogd (symbols in the call trace line instead of numbers),
> > not by ksymoops, and I do not trust the klogd algorithm at all.
> 
> All right, here's another one, this time using the oops directly from the
> console -- this seems to give better symbols. The 'console shuts up ...'
> part works, so the oops from the other CPU never got printed.

OK this should show us the backtrace on both CPUs.  Please back out
the previous patchlet (if you applied it) and apply this one.

Thanks.


--- linux-2.4.0-test11-pre4/include/linux/kernel.h	Sun Oct 15 01:27:46 2000
+++ linux-akpm/include/linux/kernel.h	Mon Nov 13 22:22:50 2000
@@ -62,6 +62,7 @@
 
 asmlinkage int printk(const char * fmt, ...)
 	__attribute__ ((format (printf, 1, 2)));
+extern int oops_in_progress;
 
 #if DEBUG
 #define pr_debug(fmt,arg...) \
--- linux-2.4.0-test11-pre4/kernel/printk.c	Sat Nov  4 16:22:49 2000
+++ linux-akpm/kernel/printk.c	Mon Nov 13 22:23:33 2000
@@ -51,6 +51,7 @@
 static unsigned long logged_chars;
 struct console_cmdline console_cmdline[MAX_CMDLINECONSOLES];
 static int preferred_console = -1;
+int oops_in_progress;
 
 /*
  *	Setup a list of consoles. Called from init/main.c
@@ -260,6 +261,8 @@
 	static signed char msg_level = -1;
 	long flags;
 
+	if (oops_in_progress)
+		spin_lock_init(&console_lock);
 	spin_lock_irqsave(&console_lock, flags);
 	va_start(args, fmt);
 	i = vsprintf(buf + 3, fmt, args); /* hopefully i < sizeof(buf)-4 */
@@ -308,7 +311,8 @@
 			msg_level = -1;
 	}
 	spin_unlock_irqrestore(&console_lock, flags);
-	wake_up_interruptible(&log_wait);
+	if (!oops_in_progress)
+		wake_up_interruptible(&log_wait);
 	return i;
 }
 
--- linux-2.4.0-test11-pre4/arch/i386/mm/fault.c	Mon Nov 13 18:23:49 2000
+++ linux-akpm/arch/i386/mm/fault.c	Mon Nov 13 22:37:37 2000
@@ -77,17 +77,31 @@
 	return 0;
 }
 
-extern spinlock_t console_lock, timerlist_lock;
+#ifdef CONFIG_SMP
+extern unsigned volatile int global_irq_lock;
+#endif
 
 /*
  * Unlock any spinlocks which will prevent us from getting the
- * message out (timerlist_lock is aquired through the
- * console unblank code)
+ * message out and tell the printk/console paths that an emergency
+ * message is coming through
  */
-void bust_spinlocks(void)
+void bust_spinlocks(int yes)
 {
-	spin_lock_init(&console_lock);
-	spin_lock_init(&timerlist_lock);
+	if (yes) {
+		oops_in_progress = 1;
+#ifdef CONFIG_SMP
+		global_irq_lock = 0;	/* Many serial drivers do __global_cli() */
+#endif
+	} else {
+		oops_in_progress = 0;
+		/*
+		 * OK, the message is on the console.  Now we call printk()
+		 * without oops_in_progress set so that printk will give syslogd
+		 * a poke.  Hold onto your hats...
+		 */
+		printk("\n");
+	}		
 }
 
 asmlinkage void do_invalid_op(struct pt_regs *, unsigned long);
@@ -264,8 +278,7 @@
  * terminate things with extreme prejudice.
  */
 
-	bust_spinlocks();
-
+	bust_spinlocks(1);
 	if (address < PAGE_SIZE)
 		printk(KERN_ALERT "Unable to handle kernel NULL pointer dereference");
 	else
@@ -283,6 +296,7 @@
 		printk(KERN_ALERT "*pte = %08lx\n", page);
 	}
 	die("Oops", regs, error_code);
+	bust_spinlocks(0);
 	do_exit(SIGKILL);
 
 /*
--- linux-2.4.0-test11-pre4/arch/i386/kernel/traps.c	Mon Nov 13 18:23:49 2000
+++ linux-akpm/arch/i386/kernel/traps.c	Mon Nov 13 22:27:13 2000
@@ -63,7 +63,7 @@
 struct desc_struct idt_table[256] __attribute__((__section__(".data.idt"))) = { {0, 0}, };
 
 extern int console_loglevel;
-extern void bust_spinlocks(void);
+extern void bust_spinlocks(int yes);
 
 static inline void console_silent(void)
 {
@@ -414,8 +414,9 @@
 	 *  here too!]
 	 */
 
-	static unsigned int last_irq_sums [NR_CPUS],
-				alert_counter [NR_CPUS];
+	static unsigned int	last_irq_sums [NR_CPUS],
+				alert_counter [NR_CPUS],
+				printed_message [NR_CPUS];
 
 	/*
 	 * Since current-> is always on the stack, and we always switch
@@ -432,17 +433,19 @@
 		 */
 		alert_counter[cpu]++;
 		if (alert_counter[cpu] == 5*HZ) {
-			spin_lock(&nmi_print_lock);
-			/*
-			 * We are in trouble anyway, lets at least try
-			 * to get a message out.
-			 */
-			bust_spinlocks();
-			printk("NMI Watchdog detected LOCKUP on CPU%d, registers:\n", cpu);
-			show_registers(regs);
-			printk("console shuts up ...\n");
-			console_silent();
-			spin_unlock(&nmi_print_lock);
+			if (printed_message[cpu] == 0) {
+				printed_message[cpu] = 1;
+				spin_lock(&nmi_print_lock);
+				/*
+				 * We are in trouble anyway, lets at least try
+				 * to get a message out.
+				 */
+				bust_spinlocks(1);
+				printk("NMI Watchdog detected LOCKUP on CPU%d, registers:\n", cpu);
+				show_registers(regs);
+				spin_unlock(&nmi_print_lock);
+				bust_spinlocks(0);
+			}
 			do_exit(SIGSEGV);
 		}
 	} else {
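
As a rough userspace model of the watchdog hunk above (an illustration only,
not the kernel function), the following sketch reports a CPU once its
interrupt count has stood still for 5*HZ ticks, and reports it only once.
watchdog_tick(), LOCKUP_TICKS, the irq_sum parameter and the HZ value of 100
are assumptions for the example; the array names are borrowed from the patch.

```c
#include <assert.h>

#define NR_CPUS 4
#define HZ 100                      /* assumed tick rate for this sketch */
#define LOCKUP_TICKS (5 * HZ)

static unsigned int last_irq_sums[NR_CPUS];
static unsigned int alert_counter[NR_CPUS];
static unsigned int printed_message[NR_CPUS];

/*
 * One watchdog tick for `cpu`. `irq_sum` stands for that CPU's running
 * interrupt count. Returns 1 only on the tick where a lockup report would
 * be printed, and 0 otherwise.
 */
static int watchdog_tick(int cpu, unsigned int irq_sum)
{
    if (irq_sum != last_irq_sums[cpu]) {
        /* Interrupts are still arriving, so the CPU is alive: start over. */
        last_irq_sums[cpu] = irq_sum;
        alert_counter[cpu] = 0;
        return 0;
    }
    if (++alert_counter[cpu] < LOCKUP_TICKS)
        return 0;
    if (printed_message[cpu])
        return 0;                   /* this CPU has already been reported */
    printed_message[cpu] = 1;
    return 1;
}
```

Because the flag is kept per CPU, a report from one CPU does not suppress a
later report from another, which is the property needed to see both
backtraces.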
