linux-kernel.vger.kernel.org archive mirror
* Oops in sched.c on PPro SMP
@ 2002-09-16 14:25 Peter Waechtler
  2002-09-16 14:49 ` Alan Cox
  2002-09-16 15:42 ` Andrea Arcangeli
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Waechtler @ 2002-09-16 14:25 UTC (permalink / raw)
  To: linux-kernel; +Cc: andrea, mingo

I suffered from lockups on PPro SMP as I switched from 2.4.18-SuSE
to 2.4.19 and 2.4.20-pre7

First I thought it was caused by bttv, overheating or a stack overflow -
but to make it short: I think it's a bug in schedule() or
something broken with CONFIG_X86_PPRO_FENCE.

I captured three Oopses, all at exactly the same instruction:
ksymoops 2.4.3 on i686 2.4.20-pre7.  Options used
      -V (default)
      -k /proc/ksyms (default)
      -l /proc/modules (default)
      -o /lib/modules/2.4.20-pre7/ (default)
      -m /boot/System.map-2.4.20-pre7 (specified)

Unable to handle kernel NULL pointer dereference at virtual address 00000020
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c0117f10>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210207
eax: 00000005   ebx: 00000014   ecx: 00000000   edx: 00000006
esi: c0be0000   edi: c51d803c   ebp: c0be1fbc   esp: c0be1f8c
ds: 0018   es: 0018   ss: 0018
Process setiathome (pid: 4674, stackpage=c0be1000)
Stack: c0be0000 000040b1 000040b0 cd406ea0 c02c1400 00000000 00000001 00000001
00000000 00000000 c0be0000 c03079c0 bffff0cc c0108d09 00000000 0000000c
409d0040 000040b1 000040b0 bffff0cc 000c40b1 0000002b 0020002b ffffffef
Call Trace:    [<c0108d09>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

 >>EIP; c0117f10 <schedule+248/550>   <=====
Trace; c0108d08 <reschedule+4/c>
Code;  c0117f10 <schedule+248/550>
00000000 <_EIP>:
Code;  c0117f10 <schedule+248/550>   <=====
    0:   8b 51 20                  mov    0x20(%ecx),%edx   <=====
Code;  c0117f12 <schedule+24a/550>
    3:   d1 fa                     sar    %edx
Code;  c0117f14 <schedule+24c/550>
    5:   89 d8                     mov    %ebx,%eax
Code;  c0117f16 <schedule+24e/550>
    7:   2b 41 24                  sub    0x24(%ecx),%eax
Code;  c0117f1a <schedule+252/550>
    a:   c1 f8 02                  sar    $0x2,%eax
Code;  c0117f1c <schedule+254/550>
    d:   8d 54 10 01               lea    0x1(%eax,%edx,1),%edx
Code;  c0117f20 <schedule+258/550>
   11:   89 51 20                  mov    %edx,0x20(%ecx)

---------- Oops 2 -------------
ksymoops 2.4.3 on i686 2.4.19.  Options used
      -V (default)
      -k /proc/ksyms (default)
      -l /proc/modules (default)
      -o /lib/modules/2.4.19/ (default)
      -m /boot/System.map-2.4.19 (default)

Unable to handle kernel NULL pointer dereference at virtual address 00000020
*pde = 00000000
Oops: 0000
CPU:    1
EIP:    0010:[<c0117f80>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210207
eax: 00000005   ebx: 00000014   ecx: 00000000   edx: 00000006
esi: c6d82000   edi: c6d8203c   ebp: c6d83fbc   esp: c6d83f8c
ds: 0018   es: 0018   ss: 0018
Process setiathome (pid: 2005, stackpage=c6d83000)
Stack: c6d82000 40ef2040 0000013c c7125080 c02b9900 00000004 c011590f 00000002
00000000 00000001 c6d82000 c02ff9e0 bffff05c c0108a99 40efa2fc 40ef7874
40ef4ed4 40ef2040 0000013c bffff05c 00000000 c010002b 0000002b ffffffef
Call Trace:    [<c011590f>] [<c0108a99>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

 >>EIP; c0117f80 <schedule+228/530>   <=====
Trace; c011590e <smp_apic_timer_interrupt+f2/114>
Trace; c0108a98 <reschedule+4/c>
Code;  c0117f80 <schedule+228/530>
00000000 <_EIP>:
Code;  c0117f80 <schedule+228/530>   <=====
    0:   8b 51 20                  mov    0x20(%ecx),%edx   <=====
---------- Oops 3 (not sure about kernel version) -------------
Unable to handle kernel NULL pointer dereference at virtual address 00000020
*pde = 00000000
Oops: 0000
CPU:    1
EIP:    0010:[<c0117f80>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210207
eax: 00000005   ebx: 00000014   ecx: 00000000   edx: 00000006
esi: c529a000   edi: c529a03c   ebp: c529bfbc   esp: c529bf8c
ds: 0018   es: 0018   ss: 0018
Process setiathome (pid: 2001, stackpage=c529b000)
Stack: c529a000 000c2703 00184e08 cd43a7a0 c02b9900 00000004 c011590f 00000002
00000000 00000001 c529a000 c02ff9e0 bffff0bc c0108a99 00184e06 0807ac20
0000009f 000c2703 00184e08 bffff0bc 001842f0 c010002b 0000002b ffffffef
Call Trace:    [<c011590f>] [<c0108a99>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

 >>EIP; c0117f80 <schedule+2b8/550>   <=====
Trace; c011590e <nmi_watchdog_tick+4e/100>
Trace; c0108a98 <do_signal+22c/294>
Code;  c0117f80 <schedule+2b8/550>
00000000 <_EIP>:
Code;  c0117f80 <schedule+2b8/550>   <=====
    0:   8b 51 20                  mov    0x20(%ecx),%edx   <=====
---------- Oops 4 -------------
While typing this: Oops with ecx!=NULL:
Unable to handle kernel paging request at virtual address e1170009
*pde = 00000000
Oops: 0000
CPU:    0
EIP:    0010:[<c0117f10>]    Not tainted
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00210206
eax: f7e60005   ebx: 00000014   ecx: e9000009   edx: fa345205
esi: c45d2000   edi: c45d203c   ebp: c45d3fbc   esp: c45d3f8c
ds: 0018   es: 0018   ss: 0018
Process setiathome (pid: 2035, stackpage=c45d3000)
Stack: c45d2000 00013e0c 40ec2040 c26efc20 c02c1400 00000000 00000001 00000001
00000000 00000000 c45d2000 c03079c0 bfffefec c0108d09 0001360c 0001320c
40f0b870 00013e0c 40ec2040 bfffefec 00012e0c 0000002b 0000002b ffffffef
Call Trace:    [<c0108d09>]
Code: 8b 51 20 d1 fa 89 d8 2b 41 24 c1 f8 02 8d 54 10 01 89 51 20

 >>EIP; c0117f10 <schedule+248/550>   <=====
Trace; c0108d08 <reschedule+4/c>
Code;  c0117f10 <schedule+248/550>
00000000 <_EIP>:
Code;  c0117f10 <schedule+248/550>   <=====
    0:   8b 51 20                  mov    0x20(%ecx),%edx   <=====
[...]


Here is the relevant intermixed assembly/source from objdump:
gcc -I ../include -I ../include/linux -I ../include/asm-i386 \
	-g -O2 -c sched.c -D__KERNEL__
objdump -S sched.o >sched.txt

     /* Do we need to re-calculate counters? */
     if (unlikely(!c)) {
      62d:   83 7d f0 00             cmpl   $0x0,0xfffffff0(%ebp)
      631:   75 6d                   jne    6a0 <schedule+0x290>
         :"0" (oldval) : "memory"

static inline void spin_unlock(spinlock_t *lock)
{
     char oldval = 1;
      633:   b0 01                   mov    $0x1,%al
#if SPINLOCK_DEBUG
     if (lock->magic != SPINLOCK_MAGIC)
         BUG();
     if (!spin_is_locked(lock))
         BUG();
#endif
     __asm__ __volatile__(
      635:   86 05 00 00 00 00       xchg   %al,0x0
         struct task_struct *p;

         spin_unlock_irq(&runqueue_lock);
      63b:   fb                      sti	<===== INTERRUPTS ENABLE
  */
/* the spinlock helpers are in arch/i386/kernel/semaphore.c */
static inline void read_lock(rwlock_t *rw)
{
      63c:   b8 00 00 00 00          mov    $0x0,%eax
#if SPINLOCK_DEBUG
     if (rw->magic != RWLOCK_MAGIC)
         BUG();
#endif
     __build_read_lock(rw, "__read_lock_failed");
      641:   f0 83 28 01             lock subl $0x1,(%eax)
      645:   0f 88 59 13 00 00       js     19a4 <Letext+0x40>
         read_lock(&tasklist_lock);
         for_each_task(p)
      64b:   8b 0d 48 00 00 00       mov    0x48,%ecx
      651:   81 f9 00 00 00 00       cmp    $0x0,%ecx
      657:   74 26                   je     67f <schedule+0x26f>
      659:   bb 14 00 00 00          mov    $0x14,%ebx
      65e:   89 f6                   mov    %esi,%esi
             p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
      660:   8b 51 20                mov    0x20(%ecx),%edx  <= CRASH
      663:   d1 fa                   sar    %edx
      665:   89 d8                   mov    %ebx,%eax
      667:   2b 41 24                sub    0x24(%ecx),%eax
      66a:   c1 f8 02                sar    $0x2,%eax
      66d:   8d 54 10 01             lea    0x1(%eax,%edx,1),%edx
      671:   89 51 20                mov    %edx,0x20(%ecx)
      674:   8b 49 48                mov    0x48(%ecx),%ecx
      677:   81 f9 00 00 00 00       cmp    $0x0,%ecx
      67d:   75 e1                   jne    660 <schedule+0x250>
         read_unlock(&tasklist_lock);
      67f:   f0 ff 05 00 00 00 00    lock incl 0x0
         spin_lock_irq(&runqueue_lock);
      686:   fa                      cli


First I thought the readlocks were broken by the compiler, due
to syntax changes. But staring at the code I wondered how
%ecx can become zero at 660: - from this code it's impossible!
But wait: we allowed interrupts again...

So my explanation is as follows: the scheduler is interrupted
and entry.S calls:

ENTRY(ret_from_intr)
	GET_CURRENT(%ebx)
ret_from_exception:
	movl EFLAGS(%esp),%eax		# mix EFLAGS and CS
	movb CS(%esp),%al
	testl $(VM_MASK | 3),%eax	# return to VM86 mode or non-supervisor?
	jne ret_from_sys_call
	jmp restore_all

	ALIGN
reschedule:
	call SYMBOL_NAME(schedule)    # test
	jmp ret_from_sys_call

ENTRY(ret_from_sys_call)
	cli				# need_resched and signals atomic test
	cmpl $0,need_resched(%ebx)
	jne reschedule
	cmpl $0,sigpending(%ebx)
	jne signal_return
restore_all:
	RESTORE_ALL


The spin_unlock_irq(&runqueue_lock) was already there in 2.4.16;
I can't check when it was introduced - perhaps it has been there since
the beginning? But this explains the crashes in the scheduler I reported
half a year ago - I think it always happened when switching from a SuSE
kernel to a Linus tree.

So there are two possibilities: either the spin_unlock_irq(&runqueue_lock)
in the scheduler is wrong - but then more SMP users should have noticed -
or CONFIG_X86_PPRO_FENCE does not work as expected.
I will replace the spin_unlock_irq with spin_unlock for a test spin.


This is a snippet from
diff -Nu linux-2.4.18.SuSE/kernel/sched.c linux-2.4.19/kernel/sched.c

         /* Do we need to re-calculate counters? */
         if (unlikely(!c)) {
-               ++rcl_curr;
-               list_for_each(tmp, numa_runqueue_head(numa_node_id())) {
-                       p = list_entry(tmp, struct task_struct, run_list);
-                       p->time_slice = TASK_TIMESLICE(p);
-               }
-#ifdef CONFIG_NUMA_SCHED
-               if (recalculate_all) {
-                       int nid;
-
-                       for (nid = 0; nid < numnodes; nid++) {
-                               if (nid == numa_node_id())
-                                       continue;
-                               list_for_each(tmp, numa_runqueue_head(nid)) {
-                                       p = list_entry(tmp, struct task_struct, run_list);
-                                       p->time_slice = TASK_TIMESLICE(p);
-                               }
-                       }
-               }
-#endif
+               struct task_struct *p;
+
+               spin_unlock_irq(&runqueue_lock);
+               read_lock(&tasklist_lock);
+               for_each_task(p)
+                       p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
+               read_unlock(&tasklist_lock);
+               spin_lock_irq(&runqueue_lock);
                 goto repeat_schedule;
         }


--- snippet .config ---
CONFIG_M686=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=5
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_PPRO_FENCE=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y

--- /proc/cpuinfo ---
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 1
model name      : Pentium Pro
stepping        : 7
cpu MHz         : 200.457
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips        : 399.76

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 1
model name      : Pentium Pro
stepping        : 9
cpu MHz         : 200.457
cache size      : 256 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
bogomips        : 400.58


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Oops in sched.c on PPro SMP
  2002-09-16 14:25 Oops in sched.c on PPro SMP Peter Waechtler
@ 2002-09-16 14:49 ` Alan Cox
  2002-09-16 15:44   ` Andrea Arcangeli
  2002-09-16 21:10   ` Peter Waechtler
  2002-09-16 15:42 ` Andrea Arcangeli
  1 sibling, 2 replies; 13+ messages in thread
From: Alan Cox @ 2002-09-16 14:49 UTC (permalink / raw)
  To: Peter Waechtler; +Cc: linux-kernel, andrea, mingo

On Mon, 2002-09-16 at 15:25, Peter Waechtler wrote:
> I suffered from lockups on PPro SMP as I switched from 2.4.18-SuSE
> to 2.4.19 and 2.4.20-pre7

What compiler did you build it with ? I've seen oopses like this from
gcc 3.0.x that went away with gcc 3.2, 2.95 or 2.96 

Also does turning off the nmi watchdog junk make the box stable ?



* Re: Oops in sched.c on PPro SMP
  2002-09-16 14:25 Oops in sched.c on PPro SMP Peter Waechtler
  2002-09-16 14:49 ` Alan Cox
@ 2002-09-16 15:42 ` Andrea Arcangeli
  2002-09-17 17:11   ` Peter Waechtler
  1 sibling, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-09-16 15:42 UTC (permalink / raw)
  To: Peter Waechtler; +Cc: linux-kernel, mingo

On Mon, Sep 16, 2002 at 04:25:03PM +0200, Peter Waechtler wrote:
> Process setiathome (pid: 2035, stackpage=c45d3000)
	  ^^^^^^^^^^ stress the cpu

>      651:   81 f9 00 00 00 00       cmp    $0x0,%ecx
				      ^^^^^^^^^^^^^^^^^
>      657:   74 26                   je     67f <schedule+0x26f>
>      659:   bb 14 00 00 00          mov    $0x14,%ebx
>      65e:   89 f6                   mov    %esi,%esi
>             p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
>      660:   8b 51 20                mov    0x20(%ecx),%edx  <= CRASH
>      663:   d1 fa                   sar    %edx
>      665:   89 d8                   mov    %ebx,%eax
>      667:   2b 41 24                sub    0x24(%ecx),%eax
>      66a:   c1 f8 02                sar    $0x2,%eax
>      66d:   8d 54 10 01             lea    0x1(%eax,%edx,1),%edx
>      671:   89 51 20                mov    %edx,0x20(%ecx)
>      674:   8b 49 48                mov    0x48(%ecx),%ecx
>      677:   81 f9 00 00 00 00       cmp    $0x0,%ecx
				      ^^^^^^^^^^^^^^^^
>      67d:   75 e1                   jne    660 <schedule+0x250>
>         read_unlock(&tasklist_lock);
>      67f:   f0 ff 05 00 00 00 00    lock incl 0x0
>         spin_lock_irq(&runqueue_lock);
>      686:   fa                      cli

As you said, apparently %ecx cannot be zero. But I bet the 0x00 is really
awaiting relocation at link time; it shouldn't be zero, in fact it
should be the address of init_task (&init_task).

Can you disassemble the .o object using objdump -Dr, or disassemble that
piece of code from the vmlinux instead of compiling with the -S flag, to
verify that? If it really checks against zero then it's a
miscompilation; it should check against &init_task as said above.

> First I thought the readlocks were broken by the compiler, due
> to syntax changes. But staring at the code I wondered how
> %ecx can become zero at 660: - from this code it's impossible!
> But wait: we allowed interrupts again...
> 
> So my explanation is as follows: the scheduler is interrupted
> and entry.S calls:

Interrupts cannot clobber %ecx or change the tasklist; if they clobbered
%ecx or modified the tasklist, that would be the real bug.

> So there are 2 possibilities: the spin_unlock_irq(&runqueue_lock)
> is wrong in the scheduler, but this should be noted by more SMP
> users then, or the CONFIG_X86_PPRO_FENCE does not work as expected.

The PPRO_FENCE variant is stricter than the non-FENCE one. However, the
corruption you notice here is in the tasklist, and the read/write locks are
not affected by the FENCE option, so FENCE isn't likely to explain it.

If anything, I would suspect something wrong in the read/write
spinlocks. To rule out that possibility you could, for example,
replace all the read_lock and write_lock calls around the tasklist with
spin_lock_irqsave/spin_unlock_irqrestore. That way you would use the FENCE
xchg functionality around the tasklist too, and you would also make sure
that no irq can happen in between, just to be 100% sure that if it
crashes it's because the memory is corrupted somehow. But really, the
read/write locks just use the lock on the bus when they modify the
spinlock value, so I'd be *very* surprised if that didn't work on a PPro.
The non-FENCE case on recent CPUs skips the lock on the bus
during the unlock operation, to exploit the release semantics of all
writes to memory in writeback cache mode on recent x86 (which allows
unrelated speculative reads from outside the critical section to enter
it).

I really suspect a hardware fault here. If you can reproduce it easily,
you could try to drop a DIMM of RAM and retest; you can also try memtst
from the cerberus testsuite and/or memtest86 from the lilo menu.

The tasklist walking very likely triggers very quick cacheline
bouncing and lots of RAM I/O; 99% of hardware bugs trigger while
walking lists, because the RAM cannot cope with the CPU dcache
thrashing. Probably the O(1) scheduler would hide the problem because it
drops that tight loop. Note that the 2.4.18 SuSE kernel scheduler
algorithm is O(N) where N is always the number of running tasks, never the
total number of tasks in the system, while in mainline the scheduler is
O(N) where N is the total number of tasks in the system. This means
that in mainline you can end up walking a list with 100-500 elements, while
with the SuSE kernel you'll walk a list of around 2-3 elements,
depending on the workload of course. The O(1) scheduler included in 8.1
reduces N to a constant 1. So if you cannot reproduce this with the SuSE 8.0
kernel, it could simply be because you've got lots of tasks in your system
but only a few of them run at the same time. That's a dramatic optimization
in the SuSE kernel but it's not a bugfix; it only hides the corruption
in your case methinks, just as the O(1) scheduler in 8.1 will hide it even
better, even if you have lots of tasks running at the same time ;). It
is true that to walk the runqueue we need the runqueue_lock, which needs
irqs disabled, but regardless, irqs must not screw up anything in the
tasklist.

So I would say it's either a hardware issue, or random memory
corruption generated by some driver. Just some guesses. And if it's the
irq clobbering %ecx or the tasklist, then something is very wrong in
the irq code or in the hardware (I'd exclude that possibility, but you
can try adding _irq to the read_lock/read_unlock of the tasklist_lock to
disable irqs and see if you can still reproduce it). If %ecx really is
checked against zero, something is very wrong as well, but in the compiler,
and that would explain things too (you're recommended to use either 2.95
or 3.2).

Hope this helps,

Andrea


* Re: Oops in sched.c on PPro SMP
  2002-09-16 14:49 ` Alan Cox
@ 2002-09-16 15:44   ` Andrea Arcangeli
  2002-09-16 21:16     ` Peter Waechtler
  2002-09-16 21:10   ` Peter Waechtler
  1 sibling, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-09-16 15:44 UTC (permalink / raw)
  To: Alan Cox; +Cc: Peter Waechtler, linux-kernel, mingo

On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
> Also does turning off the nmi watchdog junk make the box stable ?

Good idea, I didn't think about this one, since I've only heard of the
nmi locking up boxes hard after hours of load, never of it generating any
malfunction. But certainly the nmi handling probably isn't one of the
most exercised hardware paths in the cpus, so it's a good idea to try to
reproduce with it turned off (OTOH I guess you probably turned it on
explicitly only after you got these troubles, in order to debug them).

Andrea


* Re: Oops in sched.c on PPro SMP
  2002-09-16 14:49 ` Alan Cox
  2002-09-16 15:44   ` Andrea Arcangeli
@ 2002-09-16 21:10   ` Peter Waechtler
  1 sibling, 0 replies; 13+ messages in thread
From: Peter Waechtler @ 2002-09-16 21:10 UTC (permalink / raw)
  To: Alan Cox; +Cc: linux-kernel, andrea, mingo

Am Montag den, 16. September 2002, um 16:49, schrieb Alan Cox:

> On Mon, 2002-09-16 at 15:25, Peter Waechtler wrote:
>> I suffered from lockups on PPro SMP as I switched from 2.4.18-SuSE
>> to 2.4.19 and 2.4.20-pre7
>
> What compiler did you build it with ? I've seen oopses like this from
> gcc 3.0.x that went away with gcc 3.2, 2.95 or 2.96
>
> Also does turning off the nmi watchdog junk make the box stable ?
>

It's 2.95.3:

Reading specs from /usr/lib/gcc-lib/i486-suse-linux/2.95.3/specs
gcc version 2.95.3 20010315 (SuSE)

But I will check the readlocks more closely.



* Re: Oops in sched.c on PPro SMP
  2002-09-16 15:44   ` Andrea Arcangeli
@ 2002-09-16 21:16     ` Peter Waechtler
  2002-09-16 23:13       ` Andrea Arcangeli
  2002-09-17  2:51       ` Adam Kropelin
  0 siblings, 2 replies; 13+ messages in thread
From: Peter Waechtler @ 2002-09-16 21:16 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alan Cox, linux-kernel, mingo

Am Montag den, 16. September 2002, um 17:44, schrieb Andrea Arcangeli:

> On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
>> Also does turning off the nmi watchdog junk make the box stable ?
>
> Good idea, I didn't think about this one, since I've only heard of the
> nmi locking up boxes hard after hours of load, never of it generating any
> malfunction. But certainly the nmi handling probably isn't one of the
> most exercised hardware paths in the cpus, so it's a good idea to try to
> reproduce with it turned off (OTOH I guess you probably turned it on
> explicitly only after you got these troubles, in order to debug them).
>

I only turned the nmi watchdog on for the one "unknown" version Oops.

This box was running fine with 2.4.18-SuSE, with uptimes of 40+ days. _Now_
I am almost sure that it's _not_ a hardware problem (FENCE counting
as software here - since there is a software workaround).

I had 3 lockups in 2 days when I switched to 2.4.19 - and with even lower
room temperature. No, there _must_ be a bug :)

You are right about the relocation - I thought it would test against
NULL :-(

I think that the tasklist is broken in between - possibly due to broken
readlocks (no working FENCE on PPro)

Can someone explain to me the difference between labels 1 and 2?
Why is the "js 2f" there? I don't fully understand this -
it looks broken to me.

include/asm-i386/rwlock.h

#define __build_read_lock_ptr(rw, helper)   \
     asm volatile(LOCK "subl $1,(%0)\n\t" \
              "js 2f\n" \
              "1:\n" \
              LOCK_SECTION_START("") \
              "2:\tcall " helper "\n\t" \
              "jmp 1b\n" \
              LOCK_SECTION_END \
              ::"a" (rw) : "memory")



* Re: Oops in sched.c on PPro SMP
  2002-09-16 21:16     ` Peter Waechtler
@ 2002-09-16 23:13       ` Andrea Arcangeli
  2002-09-17 17:18         ` Peter Waechtler
  2002-09-17  2:51       ` Adam Kropelin
  1 sibling, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-09-16 23:13 UTC (permalink / raw)
  To: Peter Waechtler; +Cc: Alan Cox, linux-kernel, mingo

On Mon, Sep 16, 2002 at 11:16:20PM +0200, Peter Waechtler wrote:
> Am Montag den, 16. September 2002, um 17:44, schrieb Andrea Arcangeli:
> 
> >On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
> >>Also does turning off the nmi watchdog junk make the box stable ?
> >
> >Good idea, I didn't think about this one, since I've only heard of the
> >nmi locking up boxes hard after hours of load, never of it generating any
> >malfunction. But certainly the nmi handling probably isn't one of the
> >most exercised hardware paths in the cpus, so it's a good idea to try to
> >reproduce with it turned off (OTOH I guess you probably turned it on
> >explicitly only after you got these troubles, in order to debug them).
> >
> 
> I only turned the nmi watchdog on for the one "unknown" version Oops.
> 
> This box was running fine with 2.4.18-SuSE, with uptimes of 40+ days. _Now_
> I am almost sure that it's _not_ a hardware problem (FENCE counting
> as software here - since there is a software workaround).
> 
> I had 3 lockups in 2 days when I switched to 2.4.19 - and with even lower
> room temperature. No, there _must_ be a bug :)

Possible. Which kernel was the machine running before
2.4.18-SuSE?

It could be an unlucky thrashing pattern triggered by the tasklist walking
that is eliminated in the SuSE kernel, so it may not be a bug in mainline,
but we don't know for sure.

> You are right about the relocation - I thought it would test against
> NULL :-(
> 
> I think that the tasklist is broken in between - possibly due to broken
> readlocks (no working FENCE on PPro)

The SuSE kernel should have exactly the same read/write locks and you
said the SuSE kernel works fine for you (the read/write locks aren't
used only in the tasklist).

> Can someone explain to me the difference between labels 1 and 2?
> Why is the "js 2f" there? I don't fully understand this -
> it looks broken to me.

It's correct; if it weren't, we would have noticed long ago ;)

What it does is subtract 1 from the lock; if the result goes negative
(signed) it jumps into the looping slow path (label 2), and when it
finally stops looping because it acquired the lock, it jumps back to
label 1 and enters the critical section. The slow path takes care of
acquiring the lock internally, first polling without requiring the
cacheline exclusively before attempting the trylock again.

> 
> include/asm-i386/rwlock.h
> 
> #define __build_read_lock_ptr(rw, helper)   \
>     asm volatile(LOCK "subl $1,(%0)\n\t" \
>              "js 2f\n" \
>              "1:\n" \
>              LOCK_SECTION_START("") \
>              "2:\tcall " helper "\n\t" \
>              "jmp 1b\n" \
>              LOCK_SECTION_END \
>              ::"a" (rw) : "memory")

As said, the oops shows clear corruption in the tasklist, which contains
null pointers or random values. So I still tend to think this is
not seen with the SuSE kernel because it doesn't need to loop through the
whole long tasklist to make the scheduling decision, significantly reducing
the cacheline thrashing and memory I/O. You can also try to run a loop
like:

	import os

	while 1:
		x = os.listdir('/proc')

that will trigger frequent tasklist walking too, even if it's not a
tight loop like the one schedule triggers. You could also try to
simulate it with a kernel module. If you make it easily reproducible
it'll be easier to find what's wrong.

You can also try to back out the dynsched patch from the 8.0 kernel, and
see if it is still reproducible then (you should find the patch in the srpm
or in my kernel.org directory).

You can also compare the number of tasks in the system with the number
of simultaneously runnable tasks. And also have a look at whether the
problem goes away if you stop setiathome.

Hope this helps,

Andrea


* Re: Oops in sched.c on PPro SMP
  2002-09-16 21:16     ` Peter Waechtler
  2002-09-16 23:13       ` Andrea Arcangeli
@ 2002-09-17  2:51       ` Adam Kropelin
  1 sibling, 0 replies; 13+ messages in thread
From: Adam Kropelin @ 2002-09-17  2:51 UTC (permalink / raw)
  To: Peter Waechtler; +Cc: Andrea Arcangeli, Alan Cox, linux-kernel, mingo

I just built and booted 2.4.20-pre7 on my SMP ppro. If there's some specific
workload I can run to try to reproduce this, I'd be happy to try. Otherwise I'll
just run my normal tasks and see if anything goes haywire. It's been stable for
four hours so far.

(Built with gcc 2.96, BTW.)

--Adam



* Re: Oops in sched.c on PPro SMP
  2002-09-16 15:42 ` Andrea Arcangeli
@ 2002-09-17 17:11   ` Peter Waechtler
  2002-09-17 17:41     ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Waechtler @ 2002-09-17 17:11 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, mingo

Am Montag den, 16. September 2002, um 17:42, schrieb Andrea Arcangeli:

> Can you disassemble the .o object using objdump -Dr, or disassemble that
> piece of code from the vmlinux instead of compiling with the -S flag, to
> verify that? If it really checks against zero then it's a
> miscompilation; it should check against &init_task as said above.
>
>

objdump -Dr vmlinux
--- schedule snippet ---
	if (unlikely(!c)) {
c0117edb:       83 7d f0 00             cmpl   $0x0,0xfffffff0(%ebp)
c0117edf:       75 6f                   jne    c0117f50 <schedule+0x288>

		spin_unlock_irq(&runqueue_lock);
c0117ee1:       b0 01                   mov    $0x1,%al
c0117ee3:       86 05 80 79 30 c0       xchg   %al,0xc0307980
c0117ee9:       fb                      sti

		read_lock(&tasklist_lock);
c0117eea:       b8 a0 79 30 c0          mov    $0xc03079a0,%eax
c0117eef:       f0 83 28 01             lock subl $0x1,(%eax)
c0117ef3:       0f 88 90 12 00 00       js     c0119189 <.text.lock.sched+0x40>

		for_each_task(p)
			p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
c0117ef9:       8b 0d 48 c0 2d c0       mov    0xc02dc048,%ecx
c0117eff:       81 f9 00 c0 2d c0       cmp    $0xc02dc000,%ecx
c0117f05:       74 28                   je     c0117f2f <schedule+0x267>
c0117f07:       bb 14 00 00 00          mov    $0x14,%ebx
c0117f0c:       8d 74 26 00             lea    0x0(%esi,1),%esi ; for alignment

c0117f10:       8b 51 20                mov    0x20(%ecx),%edx
c0117f13:       d1 fa                   sar    %edx
c0117f15:       89 d8                   mov    %ebx,%eax
c0117f17:       2b 41 24                sub    0x24(%ecx),%eax
c0117f1a:       c1 f8 02                sar    $0x2,%eax
c0117f1d:       8d 54 10 01             lea    0x1(%eax,%edx,1),%edx
c0117f21:       89 51 20                mov    %edx,0x20(%ecx)
c0117f24:       8b 49 48                mov    0x48(%ecx),%ecx
c0117f27:       81 f9 00 c0 2d c0       cmp    $0xc02dc000,%ecx
c0117f2d:       75 e1                   jne    c0117f10 <schedule+0x248>

		read_unlock(&tasklist_lock);
c0117f2f:       f0 ff 05 a0 79 30 c0    lock incl 0xc03079a0
c0117f36:       fa                      cli

		spin_lock_irq(&runqueue_lock);
c0117f37:       f0 fe 0d 80 79 30 c0    lock decb 0xc0307980
c0117f3e:       0f 88 4f 12 00 00       js     c0119193 <.text.lock.sched+0x4a>

		goto repeat_schedule;
c0117f43:       e9 e4 fe ff ff          jmp    c0117e2c <schedule+0x164>


-----
<.text.lock.sched+0x40>:
c0119189:       e8 66 e9 fe ff          call   c0107af4 <__read_lock_failed>
c011918e:       e9 65 ed ff ff          jmp    c0117ef8 <schedule+0x230>


c0107af4 <__read_lock_failed>:
c0107af4:       f0 ff 00                lock incl (%eax)
c0107af7:       f3 90                   repz nop
c0107af9:       83 38 01                cmpl   $0x1,(%eax)
c0107afc:       78 f9                   js     c0107af7 <__read_lock_failed+0x3>
c0107afe:       f0 ff 08                lock decl (%eax)
c0107b01:       0f 88 ed ff ff ff       js     c0107af4 <__read_lock_failed>
c0107b07:       c3                      ret


--- do_fork snippet ---
	write_lock_irq(&tasklist_lock);

c011a416:       fa                      cli
c011a417:       ba a0 79 30 c0          mov    $0xc03079a0,%edx
c011a41c:       89 d0                   mov    %edx,%eax
c011a41e:       f0 81 28 00 00 00 01    lock subl $0x1000000,(%eax)
c011a425:       0f 85 6c 03 00 00       jne    c011a797 <.text.lock.fork+0xc4>
; we got the lock

	write_unlock_irq(&tasklist_lock);
c011a53b:       f0 81 05 a0 79 30 c0    lock addl $0x1000000,0xc03079a0
c011a542:       00 00 00 01
c011a546:       fb                      sti

<.text.lock.fork+0xc4>
c011a797:       e8 38 d3 fe ff          call   c0107ad4 <__write_lock_failed>
c011a79c:       e9 8a fc ff ff          jmp    c011a42b <do_fork+0x597>

c0107ad4 <__write_lock_failed>:
c0107ad4:       f0 81 00 00 00 00 01    lock addl $0x1000000,(%eax)
c0107adb:       f3 90                   repz nop
c0107add:       81 38 00 00 00 01       cmpl   $0x1000000,(%eax)
c0107ae3:       75 f6                   jne    c0107adb <__write_lock_failed+0x7>
c0107ae5:       f0 81 28 00 00 00 01    lock subl $0x1000000,(%eax)
c0107aec:       0f 85 e2 ff ff ff       jne    c0107ad4 <__write_lock_failed>
c0107af2:       c3                      ret
c0107af3:       90                      nop
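__write_lock_failed mirrors that with the full bias; a sketch of the write
side in the same single-threaded C model (atomicity elided):

```c
#include <assert.h>

#define RW_LOCK_BIAS 0x01000000

/* Model of write_lock's fast path and the undo in __write_lock_failed.
 * A writer claims the entire bias; any reader or other writer leaves
 * the result nonzero and forces the slow path. */
static int model_write_trylock(int *lock)
{
    *lock -= RW_LOCK_BIAS;        /* lock subl $0x1000000,(%eax) */
    if (*lock != 0) {             /* "jne": readers or a writer present */
        *lock += RW_LOCK_BIAS;    /* lock addl: undo our claim */
        return 0;                 /* ...then spin until *lock == RW_LOCK_BIAS */
    }
    return 1;
}
```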


--- do_exit even more complicated but looks OK ---

> I really suspect a hardware fault here. If you can reproduce easily
> you could try to drop a DIMM of RAM and retest; you can also try memtest
> from the cerberus testsuite and/or memtest86 from the lilo menu.
>

I did that in January - when I encountered these lockups the first time.
I removed some SIMMs, crashed again, exchanged the SIMMs, and so on.
memtest86 didn't find anything in one day.

Once I had a machine check exception - since then I lowered the CPU clock.
After the box was running fine at 180MHz I switched to 200MHz
(yes, I overclocked the CPUs to 233MHz for 2 or 3 years - without problems).
2.4.18-SuSE was running fine with the same load over the summer. NO lockup
in 2 months.

Now I'm supposed to believe that the silicon has aged and that this tight
loop causes errors? There are similar loops in kernel/exit.c in
2.4.18-SuSE, but it _always_ crashes in schedule().

I also checked the bttv driver and used the 2.4.18-SuSE one in 2.4.19 -
but it still crashes. I run the driver almost all the time - I have two
cards, one for watching TV or listening to the radio; the other grabs EPG
data.

But the box also crashed when I unloaded the drivers overnight.
Argh, it seems I'm forced to use the O(1) scheduler!

> So I would say it's either a hardware issue, or random memory
> corruption generated by some driver. Just some guesses. And if it's the
> irq clobbering the %ecx or the tasklist then something is very wrong in
> the irq code or in the hardware (I'd exclude such a possibility, but you
> can try adding _irq to the read_lock/read_unlock of the tasklist_lock to
> disable irqs and see if you can still reproduce). If %ecx is checked
> against zero as well then something is very wrong, but in the compiler,
> and that would explain things too (you're recommended to use either 2.95
> or 3.2).
>
I use 2.95.3 from SuSE 8.0 - everything seems to be compiled correctly.

Perhaps I should add a MAGIC to the tasklist to notice when it gets trashed?
Phew...
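A sketch of that MAGIC idea (all names and the sentinel value here are
hypothetical illustrations, not kernel code): stamp each task with a known
value and verify it while walking the list, so a trashed next_task pointer
is caught before schedule() dereferences it:

```c
#include <stddef.h>
#include <assert.h>

#define TASK_MAGIC 0x5a5a5a5aU    /* hypothetical sentinel value */

struct task_model {               /* stand-in for task_struct */
    unsigned int magic;
    struct task_model *next_task;
};

/* Walk the circular list; return 0 at the first node whose sentinel
 * is gone (in the kernel this is where you'd printk the address
 * and BUG() instead of oopsing inside schedule()). */
static int check_tasklist(struct task_model *head)
{
    struct task_model *p = head;
    do {
        if (p == NULL || p->magic != TASK_MAGIC)
            return 0;             /* list trashed: stop before faulting */
        p = p->next_task;
    } while (p != head);
    return 1;
}
```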

> Hope this helps,
>
Thanx for your effort.


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Oops in sched.c on PPro SMP
  2002-09-16 23:13       ` Andrea Arcangeli
@ 2002-09-17 17:18         ` Peter Waechtler
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Waechtler @ 2002-09-17 17:18 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Alan Cox, linux-kernel, mingo

On Tuesday, 17 September 2002, at 01:13, Andrea Arcangeli wrote:

> On Mon, Sep 16, 2002 at 11:16:20PM +0200, Peter Waechtler wrote:
>> On Monday, 16 September 2002, at 17:44, Andrea Arcangeli wrote:
>>
>>> On Mon, Sep 16, 2002 at 03:49:27PM +0100, Alan Cox wrote:
>>>> Also does turning off the nmi watchdog junk make the box stable ?
>>>
>>> good idea, I didn't think about this one since I've only heard of the
>>> nmi locking up boxes hard after hours of load, never generating any
>>> malfunction, but the nmi handling probably isn't one of the most
>>> exercised hardware paths in the cpus, so it's a good idea to
>>> reproduce with it turned off (OTOH I guess you probably turned it on
>>> explicitly only after you got these troubles, in order to debug them).
>>>
>>
>> I only turned the nmi watchdog on for the one "unknown" version Oops.
>>
>> This box was running fine with 2.4.18-SuSE, with uptimes of 40+ days.
>> _Now_ I am almost sure that it's _not_ a hardware problem (FENCE
>> counting here as software - since there is a software workaround).
>>
>> I had 3 lockups in 2 days when I switched to 2.4.19 - and even lower
>> room temperature. No, there _must_ be a bug :)
>
> possible. Which was the previous kernel running in the machine before
> 2.4.18-SuSE?
>

I guess 2.4.10-SuSE from 7.3.
In January I switched to 2.4.14 and 2.4.17 and applied the xfs patches.
At exactly the same instruction: kaboom!

http://marc.theaimsgroup.com/?l=linux-kernel&m=101113532211430&w=2


>> Can someone explain to me the difference between labels 1 and 2?
>> Why is the "js 2f" there? I don't fully understand this -
>> it looks broken to me.
>
> it's correct; if it weren't we would have noticed long ago ;)
>
> What it does is subtract 1 from the lock; if it goes negative (signed)
> it jumps into the looping slow path (label 2), then when it finally
> stops looping because it acquired the lock, it jumps back to 1 and
> enters the critical section. The slow path takes care of acquiring the
> lock internally, first polling without requiring the cacheline in
> exclusive state before retrying the trylock.

After studying the disassembly I now see the "trick" with a jump
to a new section.

>>
>> include/asm-i386/rwlock.h
>>
>> #define __build_read_lock_ptr(rw, helper)   \
>>     asm volatile(LOCK "subl $1,(%0)\n\t" \
>>              "js 2f\n" \
>>              "1:\n" \
>>              LOCK_SECTION_START("") \
>>              "2:\tcall " helper "\n\t" \
>>              "jmp 1b\n" \
>>              LOCK_SECTION_END \
>>              ::"a" (rw) : "memory")
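The same macro rendered as C control flow (a single-threaded sketch: the
gotos stand in for the asm labels, "slow" plays the out-of-line .text.lock
section, and the "writer releases" step is folded in so the model
terminates instead of spinning):

```c
#include <assert.h>

#define RW_LOCK_BIAS 0x01000000

static int slow_path_entries;     /* counts "js 2f" being taken */

/* Control-flow model of __build_read_lock_ptr: the conditional jump
 * targets an out-of-line block that calls the helper, and the helper's
 * return path jumps straight back into the fast path at label 1. */
static void model_build_read_lock(int *rw)
{
    *rw -= 1;                     /* LOCK "subl $1,(%0)" */
    if (*rw < 0)                  /* "js 2f" */
        goto slow;
taken:                            /* "1:" -- critical section follows */
    return;
slow:                             /* "2:" lives in .text.lock */
    slow_path_entries++;
    *rw += 1;                     /* helper undoes the sub, then spins... */
    *rw += RW_LOCK_BIAS;          /* model: pretend the writer released */
    *rw -= 1;                     /* helper's own retry succeeds */
    goto taken;                   /* "jmp 1b" */
}
```

Putting the slow path in a separate section keeps the uncontended case a
straight fall-through with no taken branch in the hot path.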



* Re: Oops in sched.c on PPro SMP
  2002-09-17 17:11   ` Peter Waechtler
@ 2002-09-17 17:41     ` Andrea Arcangeli
  2002-09-18 11:00       ` Peter Waechtler
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Arcangeli @ 2002-09-17 17:41 UTC (permalink / raw)
  To: Peter Waechtler; +Cc: linux-kernel, mingo

On Tue, Sep 17, 2002 at 07:11:52PM +0200, Peter Waechtler wrote:
> Once I had a machine check exception - since then I lowered the CPU clock.
> After the box was running fine with 180MHz I switched to 200MHz
> (yes, I overclocked the CPUs with 233MHz 2 or 3 years - without problems)

I guess this explains the corruption. Please make sure the cpus are not
overclocked at all and then try to reproduce. You cannot choose 180MHz
or 200MHz randomly based on which kernel crashes or not; if the cpus are
180MHz PPros you should use 180MHz only, 200MHz will break. It won't
break as easily as 233MHz does, but it will; the timings are strict on smp.
So please try to reproduce at 180MHz if the cpus are rated for 180MHz.

I don't want to sound boring, but next time please check whether you can
reproduce on non-overclocked hardware before reporting anything to l-k.

Andrea


* Re: Oops in sched.c on PPro SMP
  2002-09-17 17:41     ` Andrea Arcangeli
@ 2002-09-18 11:00       ` Peter Waechtler
  2002-09-18 17:52         ` Andrea Arcangeli
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Waechtler @ 2002-09-18 11:00 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel, mingo

On Tuesday, 17 September 2002, at 19:41, Andrea Arcangeli wrote:

> On Tue, Sep 17, 2002 at 07:11:52PM +0200, Peter Waechtler wrote:
>> Once I had a machine check exception - since then I lowered the CPU
>> clock.
>> After the box was running fine with 180MHz I switched to 200MHz
>> (yes, I overclocked the CPUs with 233MHz 2 or 3 years - without 
>> problems)
>
> I guess this explains the corruption. Please make sure the cpus are not
> overclocked at all and then try to reproduce. You cannot choose 180MHz
> or 200MHz randomly based on which kernel crashes or not; if the cpus are
> 180MHz PPros you should use 180MHz only, 200MHz will break. It won't
> break as easily as 233MHz does, but it will; the timings are strict on smp.
> So please try to reproduce at 180MHz if the cpus are rated for 180MHz.
>
> I don't want to sound boring, but next time please check whether you can
> reproduce on non-overclocked hardware before reporting anything to l-k.
>

They are 200MHz PPros. So don't be bored.
To minimize the risk, I underclocked them, eh?

I slowed down the memory timings - the box survived one night - but it
would be too early to count on that.



* Re: Oops in sched.c on PPro SMP
  2002-09-18 11:00       ` Peter Waechtler
@ 2002-09-18 17:52         ` Andrea Arcangeli
  0 siblings, 0 replies; 13+ messages in thread
From: Andrea Arcangeli @ 2002-09-18 17:52 UTC (permalink / raw)
  To: Peter Waechtler; +Cc: linux-kernel, mingo

On Wed, Sep 18, 2002 at 01:00:58PM +0200, Peter Waechtler wrote:
> They are 200MHz PPros. So don't be bored.
> To minimize the risk, I underclocked them, eh?

ok ;), I misunderstood your previous sentence then.

> I slowed down memory timings - the box survived one night - but it
> would be too early to count on this.

maybe there's hope ;)

Andrea



Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-16 14:25 Oops in sched.c on PPro SMP Peter Waechtler
2002-09-16 14:49 ` Alan Cox
2002-09-16 15:44   ` Andrea Arcangeli
2002-09-16 21:16     ` Peter Waechtler
2002-09-16 23:13       ` Andrea Arcangeli
2002-09-17 17:18         ` Peter Waechtler
2002-09-17  2:51       ` Adam Kropelin
2002-09-16 21:10   ` Peter Waechtler
2002-09-16 15:42 ` Andrea Arcangeli
2002-09-17 17:11   ` Peter Waechtler
2002-09-17 17:41     ` Andrea Arcangeli
2002-09-18 11:00       ` Peter Waechtler
2002-09-18 17:52         ` Andrea Arcangeli
