linux-kernel.vger.kernel.org archive mirror
* bad pmd filemap.c, oops; 2.4.30 and 2.4.32
@ 2005-12-27 16:58 Chris Stromsoe
  2005-12-28  0:10 ` Marcelo Tosatti
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-27 16:58 UTC (permalink / raw)
  To: linux-kernel

I have a machine that oopsed twice in the last 3 weeks.  Immediately 
before each oops was a "filemap.c:2234: bad pmd" message.  The first oops 
happened with 2.4.30, the second with 2.4.32.  The oops from 2.4.30 is 
below.  I don't have the oops from 2.4.32.

The machine is a usenet feeder and does a constant ~110mbit/s traffic.  I 
have the tg3 and bonding modules loaded.  There are 2 Adaptec controllers, 
one onboard, one pci (aic7899 and 3960D).  There are 5 disks off the first 
channel of aic7899 (comes up as scsi2), 4 of which are in a RAID5.  The 
other 3 channels are unused.  I have the .config for 2.4.30 available if 
needed.

Pointers for where to look if/when it happens again would be appreciated. 
Thanks.


-Chris

filemap.c:2234: bad pmd 00c001e3.
filemap.c:2234: bad pmd 010001e3.
Unable to handle kernel paging request at virtual address c13aef08
  printing eip:
c012d7b6
*pde = 010001e3
*pte = ce919a00
Oops: 0000
CPU:    1
EIP:    0010:[mark_page_accessed+6/48]    Not tainted
EFLAGS: 00010296
eax: c13aeef0   ebx: c13aeef0   ecx: 0005d800   edx: ee030900
esi: 0005d7a0   edi: 0005d8a9   ebp: f66b1c3c   esp: f66b1c38
ds: 0018   es: 0018   ss: 0018
Process innfeed (pid: 526, stackpage=f66b1000)
Stack: c13aeef0 f66b1c70 c012ea08 ee030900 0005d7a0 0005d8a9 0005d8a9 f7fa1d60
        f6628080 f6628144 f7628200 ee030900 c012e830 f77f4d80 f66b1cb8 c012a18e
        ee030900 63ca0000 00000000 f66b1ce4 c027404c 00000000 f77f4d80 00000106
Call Trace:    [filemap_nopage+472/544] [filemap_nopage+0/544] [do_no_page+126/608] [ip_queue_xmit+780/1424] [handle_mm_fault+121/272]
   [do_page_fault+1024/1472] [tcp_write_xmit+353/688] [tcp_new_space+137/160] [tcp_rcv_established+716/2480] [memcpy_toiovec+67/112] [do_page_fault+0/1472]
   [error_code+52/60] [csum_partial_copy_generic+61/260] [tcp_sendmsg+2367/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
   [sock_writev+79/96] [do_readv_writev+567/608] [sys_writev+88/128] [system_call+51/56]

Code: 8b 40 18 a8 80 75 07 8b 43 18 a8 04 75 0c f0 0f ba 6b 18 02
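
For reference, the two "bad pmd" values above can be decoded by hand. The sketch below is an illustration, not kernel code. It assumes a non-PAE i386 kernel (32-bit entries, 4MB large pages) and uses the architectural x86 flag bits; `decode_entry()` and `looks_like_page_table()` are hypothetical helpers, the latter mirroring what I understand the 2.4 i386 pmd_bad() test to be. Under those assumptions both values have the PSE bit set, so they read as 4MB mappings rather than as pointers to a page table.

```python
# Architectural x86 flag bits in the low 12 bits of a page-directory entry.
FLAGS = {
    0x001: "PRESENT",
    0x002: "RW",
    0x004: "USER",
    0x008: "PWT",
    0x010: "PCD",
    0x020: "ACCESSED",
    0x040: "DIRTY",
    0x080: "PSE",      # entry maps a 4MB page directly
    0x100: "GLOBAL",
}

KERNPG_TABLE = 0x063   # PRESENT | RW | ACCESSED | DIRTY
PAGE_USER = 0x004


def decode_entry(value):
    """Return (physical base, [flag names]) for a 32-bit non-PAE entry."""
    names = [name for bit, name in sorted(FLAGS.items()) if value & bit]
    # A PSE entry maps 4MB, so only the top 10 bits are the base.
    mask = 0xFFC00000 if value & 0x080 else 0xFFFFF000
    return value & mask, names


def looks_like_page_table(value):
    """Assumed inverse of pmd_bad(): flags, ignoring USER, must equal 0x063."""
    return (value & 0xFFF & ~PAGE_USER) == KERNPG_TABLE


for raw in (0x00C001E3, 0x010001E3):
    base, names = decode_entry(raw)
    verdict = "ok" if looks_like_page_table(raw) else "bad pmd"
    print(f"{raw:08x}: base={base:08x} flags={'|'.join(names)} -> {verdict}")
```

Note that the second value, 010001e3, is the same value the oops prints as *pde.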





* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-27 16:58 bad pmd filemap.c, oops; 2.4.30 and 2.4.32 Chris Stromsoe
@ 2005-12-28  0:10 ` Marcelo Tosatti
  2005-12-29  2:52   ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Marcelo Tosatti @ 2005-12-28  0:10 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: linux-kernel

Hi Chris,

On Tue, Dec 27, 2005 at 08:58:39AM -0800, Chris Stromsoe wrote:
> I have a machine that oopsed twice in the last 3 weeks.  Immediately 
> before each oops was a "filemap.c:2234: bad pmd" message.  The first oops 
> happened with 2.4.30, the second with 2.4.32.  The oops from 2.4.30 is 
> below.  I don't have the oops from 2.4.32.
> 
> The machine is a usenet feeder and does a constant ~110mbit/s traffic.  I 
> have the tg3 and bonding modules loaded.  There are 2 Adaptec controllers, 
> one onboard, one pci (aic7899 and 3960D).  There are 5 disks off the first 
> channel of aic7899 (comes up as scsi2), 4 of which are in a RAID5.  The 
> other 3 channels are unused.  I have the .config for 2.4.30 available if 
> needed.
> 
> Pointers for where to look if/when it happens again would be appreciated. 
>
> Thanks.
> 
> 
> -Chris
> 
> filemap.c:2234: bad pmd 00c001e3.
> filemap.c:2234: bad pmd 010001e3.

This is usually due to memory corruption. Please verify it with
memtest86.




* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-28  0:10 ` Marcelo Tosatti
@ 2005-12-29  2:52   ` Chris Stromsoe
  2005-12-29  5:12     ` Willy Tarreau
  2005-12-31  0:12     ` Chris Stromsoe
  0 siblings, 2 replies; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-29  2:52 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

On Tue, 27 Dec 2005, Marcelo Tosatti wrote:
> On Tue, Dec 27, 2005 at 08:58:39AM -0800, Chris Stromsoe wrote:
>>
>> filemap.c:2234: bad pmd 00c001e3.
>> filemap.c:2234: bad pmd 010001e3.
>
> This is usually due to memory corruption. Please verify it with 
> memtest86.

I've run through three complete memtest86 passes so far with no errors. 
I'll keep running, but I'm not expecting to see anything.

I caught another two bad pmd errors followed by an oops this morning. 
This is with 2.4.32, bond/tg3 loaded as modules.  Full .config available.


-Chris

Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.

The oops came in at 09:28:20

ksymoops 2.4.9 on i686 2.4.32.  Options used
      -V (default)
      -k /proc/ksyms (default)
      -l /proc/modules (default)
      -o /lib/modules/2.4.32/ (default)
      -m /boot/System.map-2.4.32 (specified)

Unable to handle kernel paging request at virtual address c22eee80
c0259bb3
*pde = 020001e3
Oops: 0002
CPU:    2
EIP:    0010:[alloc_skb+275/480]    Not tainted
EFLAGS: 00010282
eax: c22eee80   ebx: ccbdb480   ecx: 000006bc   edx: 00000680
esi: 000001f0   edi: 00000000   ebp: f663bdf0   esp: f663bddc
ds: 0018   es: 0018   ss: 0018
Process innfeed (pid: 526, stackpage=f663b000)
Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
        000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
        d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8 
Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b 
Using defaults from ksymoops -t elf32-i386 -a i386


>>eax; c22eee80 <_end+1f0d380/38650560>
>>ebx; ccbdb480 <_end+c7f9980/38650560>
>>ebp; f663bdf0 <_end+3625a2f0/38650560>
>>esp; f663bddc <_end+3625a2dc/38650560>

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
    0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
Code;  00000006 Before first symbol
    6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
Code;  0000000c Before first symbol
    c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
Code;  00000013 Before first symbol
   13:   8b 00                     mov    (%eax),%eax
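
The "Oops:" number is the x86 page-fault error code, and the faulting address can be related back to the *pde line. What follows is a minimal sketch under stated assumptions: the default 3G/1G split (PAGE_OFFSET 0xc0000000, which the posted data does not confirm), a direct-mapped lowmem address, and 4MB large pages. The function names are made up for illustration.

```python
PAGE_OFFSET = 0xC0000000      # assumed default i386 kernel/user split
LARGE_PAGE = 4 << 20          # 4MB page size (non-PAE, PSE)


def decode_error_code(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "cause": "protection violation" if code & 1 else "page not present",
        "access": "write" if code & 2 else "read",
        "mode": "user" if code & 4 else "kernel",
    }


def large_page_base(vaddr):
    """Physical base of the 4MB page that would map a lowmem address."""
    phys = vaddr - PAGE_OFFSET
    return phys & ~(LARGE_PAGE - 1)


# Values copied from the two oops reports in this thread.
reports = [
    ("2.4.30 mark_page_accessed", 0x0000, 0xC13AEF08, 0x010001E3),
    ("2.4.32 alloc_skb", 0x0002, 0xC22EEE80, 0x020001E3),
]

for label, code, vaddr, pde in reports:
    info = decode_error_code(code)
    base = large_page_base(vaddr)
    same = base == (pde & 0xFFC00000)
    print(f"{label}: {info['mode']} {info['access']}, {info['cause']}; "
          f"4MB base {base:08x}, matches pde: {same}")
```

A write fault is consistent with the first disassembled instruction, `movl $0x1,(%eax)`, which stores through eax, and eax holds the faulting address.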


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-29  2:52   ` Chris Stromsoe
@ 2005-12-29  5:12     ` Willy Tarreau
  2005-12-29  9:33       ` Chris Stromsoe
  2005-12-31  0:12     ` Chris Stromsoe
  1 sibling, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2005-12-29  5:12 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Marcelo Tosatti, linux-kernel

On Wed, Dec 28, 2005 at 06:52:06PM -0800, Chris Stromsoe wrote:
> On Tue, 27 Dec 2005, Marcelo Tosatti wrote:
> >On Tue, Dec 27, 2005 at 08:58:39AM -0800, Chris Stromsoe wrote:
> >>
> >>filemap.c:2234: bad pmd 00c001e3.
> >>filemap.c:2234: bad pmd 010001e3.
> >
> >This is usually due to memory corruption. Please verify it with 
> >memtest86.
> 
> I've run through three complete memtest86 passes so far with no errors. 
> I'll keep running, but I'm not expecting to see anything.
> 
> I caught another two bad pmd errors followed by an oops this morning. 
> This is with 2.4.32, bond/tg3 loaded as modules.  Full .config available.
> 

I have some servers running on tg3+bond with up to 70 Mbps with about one
year of uptime. OK, they're not on 2.4.32 yet, but that's just to say that
I don't suspect those drivers.

> -Chris
> 
> Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
> Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.
> 
> The oops came in at 09:28:20
> 
> ksymoops 2.4.9 on i686 2.4.32.  Options used
>      -V (default)
>      -k /proc/ksyms (default)
>      -l /proc/modules (default)
>      -o /lib/modules/2.4.32/ (default)
>      -m /boot/System.map-2.4.32 (specified)
> 
> Unable to handle kernel paging request at virtual address c22eee80
> c0259bb3
> *pde = 020001e3
> Oops: 0002
> CPU:    2
       ^^^^^
interesting, this machine is SMP.
memtest86 only involves CPU0 in tests. I've already had great difficulty
trying to detect memory problems which occurred only when more than one CPU
was accessing the RAM. Can your machine support its load with only one CPU?
Maybe you observe more I/O than pure CPU. It would be interesting to restart
it with the 'nosmp' boot option.


> EIP:    0010:[alloc_skb+275/480]    Not tainted

I'm somewhat surprised, because I've found neither a direct nor an indirect
call path from alloc_skb() to filemap_sync_pte_range(), where the error is
reported. I'm clearly missing something here.


> EFLAGS: 00010282
> eax: c22eee80   ebx: ccbdb480   ecx: 000006bc   edx: 00000680
> esi: 000001f0   edi: 00000000   ebp: f663bdf0   esp: f663bddc
> ds: 0018   es: 0018   ss: 0018
> Process innfeed (pid: 526, stackpage=f663b000)
> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
>        000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
>        d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
> Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b 
> Using defaults from ksymoops -t elf32-i386 -a i386
> 
> 
> >>eax; c22eee80 <_end+1f0d380/38650560>
> >>ebx; ccbdb480 <_end+c7f9980/38650560>
> >>ebp; f663bdf0 <_end+3625a2f0/38650560>
> >>esp; f663bddc <_end+3625a2dc/38650560>
> 
> Code;  00000000 Before first symbol
> 00000000 <_EIP>:
> Code;  00000000 Before first symbol
>    0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
> Code;  00000006 Before first symbol
>    6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
> Code;  0000000c Before first symbol
>    c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
> Code;  00000013 Before first symbol
>   13:   8b 00                     mov    (%eax),%eax

Regards,
willy



* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-29  5:12     ` Willy Tarreau
@ 2005-12-29  9:33       ` Chris Stromsoe
  2005-12-29 10:08         ` Willy Tarreau
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-29  9:33 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Marcelo Tosatti, linux-kernel

On Thu, 29 Dec 2005, Willy Tarreau wrote:
> On Wed, Dec 28, 2005 at 06:52:06PM -0800, Chris Stromsoe wrote:
>>
>> Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
>> Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.
>>
>> The oops came in at 09:28:20
>>
>> ksymoops 2.4.9 on i686 2.4.32.  Options used
>>      -V (default)
>>      -k /proc/ksyms (default)
>>      -l /proc/modules (default)
>>      -o /lib/modules/2.4.32/ (default)
>>      -m /boot/System.map-2.4.32 (specified)
>>
>> Unable to handle kernel paging request at virtual address c22eee80
>> c0259bb3
>> *pde = 020001e3
>> Oops: 0002
>> CPU:    2
>       ^^^^^
> interesting, this machine is SMP.
> memtest86 only involves CPU0 in tests. I've already had great difficulty
> trying to detect memory problems which occurred only when more than one CPU
> was accessing the RAM. Can your machine support its load with only one CPU?
> Maybe you observe more I/O than pure CPU. It would be interesting to restart
> it with the 'nosmp' boot option.

The machine is a dual P4 Xeon with hyperthreading on.  It can probably get 
by with only one cpu enabled.  If/when it goes down again, I'll boot with 
nosmp.  For what it's worth, I ran a Dell memory tester ("MP Memory") 
which claims to test all of the CPUs for a few hours and didn't come up 
with anything.  The machine feeds usenet and is seeing a lot more io than 
cpu.  (There are two Adaptec controllers, 4 channels, aic79xx, 5 drives on 
one channel, 3 unused, spool is on a 4 disk raid5, jfs formatted.)


>> EIP:  0010:[alloc_skb+275/480] Not tainted
>
> I'm somewhat surprised, because I've not found a direct nor indirect 
> call path from alloc_skb() to filemap_sync_pte_range() in which the 
> error is reported. I'm clearly missing something here.

If it helps, the oops with 2.4.30 had two "bad pmd" messages right before 
it then:

Unable to handle kernel paging request at virtual address c13aef08
  printing eip:
c012d7b6
*pde = 010001e3
*pte = ce919a00
Oops: 0000
CPU:    1
EIP:    0010:[mark_page_accessed+6/48]    Not tainted
EFLAGS: 00010296
eax: c13aeef0   ebx: c13aeef0   ecx: 0005d800   edx: ee030900
esi: 0005d7a0   edi: 0005d8a9   ebp: f66b1c3c   esp: f66b1c38
ds: 0018   es: 0018   ss: 0018
Process innfeed (pid: 526, stackpage=f66b1000)
Stack: c13aeef0 f66b1c70 c012ea08 ee030900 0005d7a0 0005d8a9 0005d8a9 f7fa1d60
        f6628080 f6628144 f7628200 ee030900 c012e830 f77f4d80 f66b1cb8 c012a18e
        ee030900 63ca0000 00000000 f66b1ce4 c027404c 00000000 f77f4d80 00000106
Call Trace:    [filemap_nopage+472/544] [filemap_nopage+0/544][do_no_page+126/608] [ip_queue_xmit+780/1424] [handle_mm_fault+121/272]
   [do_page_fault+1024/1472] [tcp_write_xmit+353/688] [tcp_new_space+137/160][tcp_rcv_established+716/2480] [memcpy_toiovec+67/112] [do_page_fault+0/1472]
   [error_code+52/60] [csum_partial_copy_generic+61/260] [tcp_sendmsg+2367/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
   [sock_writev+79/96] [do_readv_writev+567/608] [sys_writev+88/128] [system_call+51/56]

Code: 8b 40 18 a8 80 75 07 8b 43 18 a8 04 75 0c f0 0f ba 6b 18 02



-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-29  9:33       ` Chris Stromsoe
@ 2005-12-29 10:08         ` Willy Tarreau
  2005-12-29 12:01           ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2005-12-29 10:08 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Marcelo Tosatti, linux-kernel

On Thu, Dec 29, 2005 at 01:33:47AM -0800, Chris Stromsoe wrote:
> On Thu, 29 Dec 2005, Willy Tarreau wrote:
> >On Wed, Dec 28, 2005 at 06:52:06PM -0800, Chris Stromsoe wrote:
> >>
> >>Dec 27 09:28:19 filemap.c:2234: bad pmd 020001e3.
> >>Dec 27 09:28:19 filemap.c:2234: bad pmd 024001e3.
> >>
> >>The oops came in at 09:28:20
> >>
> >>ksymoops 2.4.9 on i686 2.4.32.  Options used
> >>     -V (default)
> >>     -k /proc/ksyms (default)
> >>     -l /proc/modules (default)
> >>     -o /lib/modules/2.4.32/ (default)
> >>     -m /boot/System.map-2.4.32 (specified)
> >>
> >>Unable to handle kernel paging request at virtual address c22eee80
> >>c0259bb3
> >>*pde = 020001e3
> >>Oops: 0002
> >>CPU:    2
> >      ^^^^^
> >interesting, this machine is SMP.
> >memtest86 only involves CPU0 in tests. I've already had great difficulty
> >trying to detect memory problems which occurred only when more than one CPU
> >was accessing the RAM. Can your machine support its load with only one CPU?
> >Maybe you observe more I/O than pure CPU. It would be interesting to restart
> >it with the 'nosmp' boot option.
> 
> The machine is a dual P4 Xeon with hyperthreading on.  It can probably 
> get by with only one cpu enabled.  If/when it goes down again, I'll boot 
> with nosmp.  For what it's worth, I ran a Dell memory tester ("MP 
> Memory") which claims to test all of the CPUs for a few hours and didn't 
> come up with anything.  The machine feeds usenet and is seeing a lot more 
> io than cpu.  (There are two Adaptec controllers, 4 channels, aic79xx, 5 
> drives on one channel, 3 unused, spool is on a 4 disk raid5, jfs 
> formatted.)

OK, I've found two old similar reports from people running news servers :
  http://www.ussg.iu.edu/hypermail/linux/kernel/0308.1/0807.html
  http://seclists.org/lists/linux-kernel/2004/Jan/5699.html

both were using an SMP server with an AIC7xxx adapter, and kernels varying
from 2.4.18 to 2.4.24. One of them used XFS and not JFS, so we will exclude
any potential JFS-related cause for now.

If you feel brave, you can try switching the AIC7xxx driver to Justin Gibbs'
more recent version. It has not evolved during the last year, but I have it
running reliably on production servers:

   http://people.freebsd.org/~gibbs/linux/

I also have it rediffed for recent kernels if you prefer :

   http://w.ods.org/kernel/2.4-wt/2.4.32-wt2/patches-2.4.32-wt2/pool/aic79xx-20040522-linux-2.4.30-pre3.rediff


> >>EIP:  0010:[alloc_skb+275/480] Not tainted
> >
> >I'm somewhat surprised, because I've not found a direct nor indirect 
> >call path from alloc_skb() to filemap_sync_pte_range() in which the 
> >error is reported. I'm clearly missing something here.
> 
> If it helps, the oops with 2.4.30 had two "bad pmd" messages right before 
> it then:
> 
> Unable to handle kernel paging request at virtual address c13aef08
>  printing eip:
> c012d7b6
> *pde = 010001e3
> *pte = ce919a00
> Oops: 0000
> CPU:    1
> EIP:    0010:[mark_page_accessed+6/48]    Not tainted
> EFLAGS: 00010296
> eax: c13aeef0   ebx: c13aeef0   ecx: 0005d800   edx: ee030900
> esi: 0005d7a0   edi: 0005d8a9   ebp: f66b1c3c   esp: f66b1c38
> ds: 0018   es: 0018   ss: 0018
> Process innfeed (pid: 526, stackpage=f66b1000)
> Stack: c13aeef0 f66b1c70 c012ea08 ee030900 0005d7a0 0005d8a9 0005d8a9 f7fa1d60
>        f6628080 f6628144 f7628200 ee030900 c012e830 f77f4d80 f66b1cb8 c012a18e
>        ee030900 63ca0000 00000000 f66b1ce4 c027404c 00000000 f77f4d80 00000106
> Call Trace:    [filemap_nopage+472/544] [filemap_nopage+0/544] [do_no_page+126/608] [ip_queue_xmit+780/1424] [handle_mm_fault+121/272]
>   [do_page_fault+1024/1472] [tcp_write_xmit+353/688] [tcp_new_space+137/160] [tcp_rcv_established+716/2480] [memcpy_toiovec+67/112] [do_page_fault+0/1472]
>   [error_code+52/60] [csum_partial_copy_generic+61/260] [tcp_sendmsg+2367/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
>   [sock_writev+79/96] [do_readv_writev+567/608] [sys_writev+88/128] [system_call+51/56]
> 
> Code: 8b 40 18 a8 80 75 07 8b 43 18 a8 04 75 0c f0 0f ba 6b 18 02

Out of curiosity, it would be interesting to disable swap if you have it
enabled.

> -Chris

Willy



* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-29 10:08         ` Willy Tarreau
@ 2005-12-29 12:01           ` Chris Stromsoe
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-29 12:01 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Marcelo Tosatti, linux-kernel

On Thu, 29 Dec 2005, Willy Tarreau wrote:
> On Thu, Dec 29, 2005 at 01:33:47AM -0800, Chris Stromsoe wrote:
>>
>> The machine is a dual P4 Xeon with hyperthreading on.  It can probably 
>> get by with only one cpu enabled.  If/when it goes down again, I'll 
>> boot with nosmp.  For what it's worth, I ran a Dell memory tester ("MP 
>> Memory") which claims to test all of the CPUs for a few hours and 
>> didn't come up with anything.  The machine feeds usenet and is seeing a 
>> lot more io than cpu.  (There are two Adaptec controllers, 4 channels, 
>> aic79xx, 5 drives on one channel, 3 unused, spool is on a 4 disk raid5, 
>> jfs formatted.)
>
> OK, I've found two old similar reports from people running news servers  :
>  http://www.ussg.iu.edu/hypermail/linux/kernel/0308.1/0807.html
>  http://seclists.org/lists/linux-kernel/2004/Jan/5699.html
>
> both were using an SMP server with an AIC7xxx adapter, and kernels 
> varying from 2.4.18 to 2.4.24. One of them used XFS and not JFS, so we 
> will exclude any potential JFS-related cause for now.

I am also building with highmem/4Gb support, which one of the reports 
mentioned.  I did not have any pmd messages while running 2.4.26 or 
2.4.27, built with the same set of options (make oldconfig dep clean 
bzimage .... )


> If you feel brave, you can try switching the AIC7xxx driver to Justin 
> Gibbs' more recent version. It has not evolved during the last year, 
> but I have it running reliably on production servers:
>
>   http://people.freebsd.org/~gibbs/linux/
>
> I also have it rediffed for recent kernels if you prefer :
>
> http://w.ods.org/kernel/2.4-wt/2.4.32-wt2/patches-2.4.32-wt2/pool/aic79xx-20040522-linux-2.4.30-pre3.rediff

I've pulled the patch and saved it.  I don't want to change more than one 
thing at a time.  I'll try the alternate driver if booting with nosmp 
doesn't help.

> Out of curiosity, it would be interesting to disable swap if you have it 
> enabled.

I'm running with 4G of swap, but usually don't dip more than 30M or 40M 
into it.  I'll add disabling swap to the list of things to check.



-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-29  2:52   ` Chris Stromsoe
  2005-12-29  5:12     ` Willy Tarreau
@ 2005-12-31  0:12     ` Chris Stromsoe
  2005-12-31  1:48       ` Chris Stromsoe
  1 sibling, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-31  0:12 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

I oopsed again last night with an identical EIP and Call Trace to the oops 
from the 28th.  The new oops is below, the prior below that.  I'm going to 
reboot the machine into UP and see if that helps.

-Chris

Unable to handle kernel paging request at virtual address c211ce80
c0259bb3
*pde = 020001e3
Oops: 0002
CPU:    2
EIP:    0010:[alloc_skb+275/480]    Not tainted
EFLAGS: 00010282
eax: c211ce80   ebx: f5303680   ecx: f7eeb780   edx: 00000680
esi: 000001f0   edi: 00000000   ebp: d348ddf0   esp: d348dddc
ds: 0018   es: 0018   ss: 0018
Process innfeed (pid: 25080, stackpage=d348d000)
Stack: 000006bc 000001f0 ebabc980 eb0e64d8 eb0e6400 d348de68 c027b50b 00000680
        000001f0 000005a8 00000000 d348de54 00000000 00000000 00000001 00000000
        012815b5 00000000 00000000 d7a160a0 d348c000 636686ac 000c3dec 000087c0
Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
Using defaults from ksymoops -t elf32-i386 -a i386


>>eax; c211ce80 <_end+1d3b380/38650560>
>>ebx; f5303680 <_end+34f21b80/38650560>
>>ecx; f7eeb780 <_end+37b09c80/38650560>
>>ebp; d348ddf0 <_end+130ac2f0/38650560>
>>esp; d348dddc <_end+130ac2dc/38650560>

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
    0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
Code;  00000006 Before first symbol
    6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
Code;  0000000c Before first symbol
    c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
Code;  00000013 Before first symbol
   13:   8b 00                     mov    (%eax),%eax


On Wed, 28 Dec 2005, Chris Stromsoe wrote:

> Unable to handle kernel paging request at virtual address c22eee80
> c0259bb3
> *pde = 020001e3
> Oops: 0002
> CPU:    2
> EIP:    0010:[alloc_skb+275/480]    Not tainted
> EFLAGS: 00010282
> eax: c22eee80   ebx: ccbdb480   ecx: 000006bc   edx: 00000680
> esi: 000001f0   edi: 00000000   ebp: f663bdf0   esp: f663bddc
> ds: 0018   es: 0018   ss: 0018
> Process innfeed (pid: 526, stackpage=f663b000)
> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
>        000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
>        d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
> Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
> Using defaults from ksymoops -t elf32-i386 -a i386
>
>>> eax; c22eee80 <_end+1f0d380/38650560>
>>> ebx; ccbdb480 <_end+c7f9980/38650560>
>>> ebp; f663bdf0 <_end+3625a2f0/38650560>
>>> esp; f663bddc <_end+3625a2dc/38650560>
>
> Code;  00000000 Before first symbol
> 00000000 <_EIP>:
> Code;  00000000 Before first symbol
>   0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
> Code;  00000006 Before first symbol
>   6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
> Code;  0000000c Before first symbol
>   c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
> Code;  00000013 Before first symbol
>  13:   8b 00                     mov    (%eax),%eax
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  0:12     ` Chris Stromsoe
@ 2005-12-31  1:48       ` Chris Stromsoe
  2005-12-31  4:00         ` Chris Stromsoe
                           ` (2 more replies)
  0 siblings, 3 replies; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-31  1:48 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

I'm starting to suspect bad hardware.  Booting is now hanging (with 
2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:

.....

Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-cdrom driver.
hda: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
         <Adaptec 3960D Ultra160 SCSI adapter>
         aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
         <Adaptec 3960D Ultra160 SCSI adapter>
         aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
         <Adaptec aic7899 Ultra160 SCSI adapter>
         aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)


If I wait several minutes (around 10 or 15 minutes), I get:

scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x12 0x0 0x0 0x0 0xff 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi0:0:0:0: Attempting to queue a TARGET RESET message
CDB: 0x12 0x0 0x0 0x0 0xff 0x0
scsi0:0:0:0: Is not an active device
aic7xxx_dev_reset returns 0x2002
scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi0:0:0:0: Attempting to queue an ABORT message
CDB: 0x0 0x0 0x0 0x0 0x0 0x0
scsi0:0:0:0: Command already completed
aic7xxx_abort returns 0x2002
scsi: device set offline - not ready or command retry failed after bus reset: host 0 channel 0 id 0 lun 0


The messages repeated for all 15 targets on scsi0.  It's looking like it 
will repeat for scsi1 as well.

How likely is it that a failing scsi controller contributed to the other 
problems I was seeing?


-Chris

On Fri, 30 Dec 2005, Chris Stromsoe wrote:

> I oopsed again last night with an identical EIP and Call Trace to the 
> oops from the 28th.  The new oops is below, the prior below that.  I'm 
> going to reboot the machine into UP and see if that helps.
>
> -Chris
>
> Unable to handle kernel paging request at virtual address c211ce80
> c0259bb3
> *pde = 020001e3
> Oops: 0002
> CPU:    2
> EIP:    0010:[alloc_skb+275/480]    Not tainted
> EFLAGS: 00010282
> eax: c211ce80   ebx: f5303680   ecx: f7eeb780   edx: 00000680
> esi: 000001f0   edi: 00000000   ebp: d348ddf0   esp: d348dddc
> ds: 0018   es: 0018   ss: 0018
> Process innfeed (pid: 25080, stackpage=d348d000)
> Stack: 000006bc 000001f0 ebabc980 eb0e64d8 eb0e6400 d348de68 c027b50b 00000680
>       000001f0 000005a8 00000000 d348de54 00000000 00000000 00000001 00000000
>       012815b5 00000000 00000000 d7a160a0 d348c000 636686ac 000c3dec 000087c0
> Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
> Using defaults from ksymoops -t elf32-i386 -a i386
>
>
>>> eax; c211ce80 <_end+1d3b380/38650560>
>>> ebx; f5303680 <_end+34f21b80/38650560>
>>> ecx; f7eeb780 <_end+37b09c80/38650560>
>>> ebp; d348ddf0 <_end+130ac2f0/38650560>
>>> esp; d348dddc <_end+130ac2dc/38650560>
>
> Code;  00000000 Before first symbol
> 00000000 <_EIP>:
> Code;  00000000 Before first symbol
>   0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
> Code;  00000006 Before first symbol
>   6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
> Code;  0000000c Before first symbol
>   c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
> Code;  00000013 Before first symbol
>  13:   8b 00                     mov    (%eax),%eax
>
>
> On Wed, 28 Dec 2005, Chris Stromsoe wrote:
>
>> Unable to handle kernel paging request at virtual address c22eee80
>> c0259bb3
>> *pde = 020001e3
>> Oops: 0002
>> CPU:    2
>> EIP:    0010:[alloc_skb+275/480]    Not tainted
>> EFLAGS: 00010282
>> eax: c22eee80   ebx: ccbdb480   ecx: 000006bc   edx: 00000680
>> esi: 000001f0   edi: 00000000   ebp: f663bdf0   esp: f663bddc
>> ds: 0018   es: 0018   ss: 0018
>> Process innfeed (pid: 526, stackpage=f663b000)
>> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 00000680
>>        000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 d84bec34
>>        d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 000005a8
>> Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
>> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
>> Using defaults from ksymoops -t elf32-i386 -a i386
>> 
>>>> eax; c22eee80 <_end+1f0d380/38650560>
>>>> ebx; ccbdb480 <_end+c7f9980/38650560>
>>>> ebp; f663bdf0 <_end+3625a2f0/38650560>
>>>> esp; f663bddc <_end+3625a2dc/38650560>
>> 
>> Code;  00000000 Before first symbol
>> 00000000 <_EIP>:
>> Code;  00000000 Before first symbol
>>   0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
>> Code;  00000006 Before first symbol
>>   6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
>> Code;  0000000c Before first symbol
>>   c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
>> Code;  00000013 Before first symbol
>>  13:   8b 00                     mov    (%eax),%eax


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  1:48       ` Chris Stromsoe
@ 2005-12-31  4:00         ` Chris Stromsoe
  2005-12-31  7:25           ` Willy Tarreau
  2005-12-31  7:12         ` Willy Tarreau
  2005-12-31 12:08         ` Alan Cox
  2 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-31  4:00 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: linux-kernel

I couldn't get the machine to come up with 2.4.32, 2.4.30, or 2.4.27.  It 
was hanging and then throwing the SCSI errors below.  The machine did come 
up with a vanilla 2.6.14.4 and appears to be working fine.  I'm going to 
leave it up over the weekend and see if it oopses.  If it would help, I 
can mail out the .config for the 2.4.32 and 2.6.14.4 builds, or provide 
other information of interest.

-Chris

On Fri, 30 Dec 2005, Chris Stromsoe wrote:

> I'm starting to suspect bad hardware.  Booting is now hanging (with 
> 2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:
>
> .....
>
> Floppy drive(s): fd0 is 1.44M
> FDC 0 is a National Semiconductor PC87306
> Uniform Multi-Platform E-IDE driver Revision: 7.00beta4-2.4
> ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
> hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> hda: attached ide-cdrom driver.
> hda: ATAPI 24X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.12
> SCSI subsystem driver Revision: 1.00
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
>        <Adaptec 3960D Ultra160 SCSI adapter>
>        aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
>
> scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
>        <Adaptec 3960D Ultra160 SCSI adapter>
>        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
>
> scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
>        <Adaptec aic7899 Ultra160 SCSI adapter>
>        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
>
> blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)
>
>
> If I wait several minutes (around 10 or 15 minutes), I get:
>
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue a TARGET RESET message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Is not an active device
> aic7xxx_dev_reset returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi: device set offline - not ready or command retry failed after bus reset: 
> host 0 channel 0 id 0 lun 0
>
>
> The messages repeated for all 15 targets on scsi0.  It's looking like it will 
> repeat for scsi1 as well.
>
> How likely is it that a failing scsi controller would contribute to the other 
> problems I was seeing?
>
>
> -Chris
>
> On Fri, 30 Dec 2005, Chris Stromsoe wrote:
>
>> I oopsed again last night with an identical EIP and Call Trace to the oops 
>> from the 28th.  The new oops is below, the prior below that.  I'm going to 
>> reboot the machine into UP and see if that helps.
>> 
>> -Chris
>> 
>> Unable to handle kernel paging request at virtual address c211ce80
>> c0259bb3
>> *pde = 020001e3
>> Oops: 0002
>> CPU:    2
>> EIP:    0010:[alloc_skb+275/480]    Not tainted
>> EFLAGS: 00010282
>> eax: c211ce80   ebx: f5303680   ecx: f7eeb780   edx: 00000680
>> esi: 000001f0   edi: 00000000   ebp: d348ddf0   esp: d348dddc
>> ds: 0018   es: 0018   ss: 0018
>> Process innfeed (pid: 25080, stackpage=d348d000)
>> Stack: 000006bc 000001f0 ebabc980 eb0e64d8 eb0e6400 d348de68 c027b50b 
>> 00000680
>>       000001f0 000005a8 00000000 d348de54 00000000 00000000 00000001 
>> 00000000
>>       012815b5 00000000 00000000 d7a160a0 d348c000 636686ac 000c3dec 
>> 000087c0
>> Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] 
>> [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
>> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
>> Using defaults from ksymoops -t elf32-i386 -a i386
>> 
>> 
>>>> eax; c211ce80 <_end+1d3b380/38650560>
>>>> ebx; f5303680 <_end+34f21b80/38650560>
>>>> ecx; f7eeb780 <_end+37b09c80/38650560>
>>>> ebp; d348ddf0 <_end+130ac2f0/38650560>
>>>> esp; d348dddc <_end+130ac2dc/38650560>
>> 
>> Code;  00000000 Before first symbol
>> 00000000 <_EIP>:
>> Code;  00000000 Before first symbol
>>   0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
>> Code;  00000006 Before first symbol
>>   6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
>> Code;  0000000c Before first symbol
>>   c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
>> Code;  00000013 Before first symbol
>>  13:   8b 00                     mov    (%eax),%eax
>> 
>> 
>> On Wed, 28 Dec 2005, Chris Stromsoe wrote:
>> 
>>> Unable to handle kernel paging request at virtual address c22eee80
>>> c0259bb3
>>> *pde = 020001e3
>>> Oops: 0002
>>> CPU:    2
>>> EIP:    0010:[alloc_skb+275/480]    Not tainted
>>> EFLAGS: 00010282
>>> eax: c22eee80   ebx: ccbdb480   ecx: 000006bc   edx: 00000680
>>> esi: 000001f0   edi: 00000000   ebp: f663bdf0   esp: f663bddc
>>> ds: 0018   es: 0018   ss: 0018
>>> Process innfeed (pid: 526, stackpage=f663b000)
>>> Stack: 000006bc 000001f0 ccbdb080 00000000 f7185800 f663be68 c027b50b 
>>> 00000680
>>>        000001f0 000005a8 00000000 f663be54 00000000 00000287 d84bec38 
>>> d84bec34
>>>        d84bec54 f663a000 00000000 d5fbd8a0 f663a000 586d4438 0002c774 
>>> 000005a8
>>> Call Trace:    [tcp_sendmsg+2619/4512] [inet_sendmsg+65/80] 
>>> [sock_sendmsg+102/176] [sock_readv_writev+116/176] [sock_writev+79/96]
>>> Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b
>>> Using defaults from ksymoops -t elf32-i386 -a i386
>>> 
>>>>> eax; c22eee80 <_end+1f0d380/38650560>
>>>>> ebx; ccbdb480 <_end+c7f9980/38650560>
>>>>> ebp; f663bdf0 <_end+3625a2f0/38650560>
>>>>> esp; f663bddc <_end+3625a2dc/38650560>
>>> 
>>> Code;  00000000 Before first symbol
>>> 00000000 <_EIP>:
>>> Code;  00000000 Before first symbol
>>>   0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
>>> Code;  00000006 Before first symbol
>>>   6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
>>> Code;  0000000c Before first symbol
>>>   c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
>>> Code;  00000013 Before first symbol
>>>  13:   8b 00                     mov    (%eax),%eax
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  1:48       ` Chris Stromsoe
  2005-12-31  4:00         ` Chris Stromsoe
@ 2005-12-31  7:12         ` Willy Tarreau
  2005-12-31 10:39           ` Chris Stromsoe
  2005-12-31 12:08         ` Alan Cox
  2 siblings, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2005-12-31  7:12 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Marcelo Tosatti, linux-kernel


On Fri, Dec 30, 2005 at 05:48:15PM -0800, Chris Stromsoe wrote:
> I'm starting to suspect bad hardware.  Booting is now hanging (with 
> 2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:

And nothing changed since the previous boot, except UP?

(...) 
> If I wait several minutes (around 10 or 15 minutes), I get:
> 
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue a TARGET RESET message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Is not an active device
> aic7xxx_dev_reset returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x0 0x0 0x0 0x0 0x0 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002
> scsi: device set offline - not ready or command retry failed after bus 
> reset: host 0 channel 0 id 0 lun 0
> 
> 
> The messages repeated for all 15 targets on scsi0.  It's looking like it 
> will repeat for scsi1 as well.
(...)

It brings back bad memories of my own machine a very long time ago, when
the driver was buggy :-(
It's not necessarily bad hardware. I also had trouble with one version
of the 29160 BIOS where it hung during device scan if there were
too many terminations. Oh, BTW, please check that you have disabled
"automatic" termination in the BIOS. Manually set it either to ON or
OFF (low/high depending on your setup).

> How likely is it that a failing scsi controller would contribute to the other 
> problems I was seeing?

Not much. Perhaps at worst, a failing controller could corrupt memory
by writing garbage at wrong locations, but you would not always get
the same messages. It seems to be a different problem here. To be
honest, this is where I think you should try the new driver.

Regards,
Willy


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  4:00         ` Chris Stromsoe
@ 2005-12-31  7:25           ` Willy Tarreau
  2005-12-31 11:06             ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2005-12-31  7:25 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Marcelo Tosatti, linux-kernel

On Fri, Dec 30, 2005 at 08:00:34PM -0800, Chris Stromsoe wrote:
> I couldn't get the machine to come up with 2.4.32, 2.4.30, or 2.4.27.  It 
> was hanging and then throwing the SCSI errors below.  The machine did 
> come up with a vanilla 2.6.14.4 and appears to be working fine.  I'm 
> going to leave it up over the weekend and see if it oopses.  If it would 
> help, I can mail out the .config for the 2.4.32 and 2.6.14.4 builds, or 
> provide other information of interest.

Please do post at least the 2.4.32 .config, I'll try to boot it on my
system right here. I find it amazing that it suddenly stopped working
with the same kernels as before.

> -Chris

Willy


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  7:12         ` Willy Tarreau
@ 2005-12-31 10:39           ` Chris Stromsoe
  2005-12-31 10:56             ` Willy Tarreau
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-31 10:39 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Marcelo Tosatti, linux-kernel

On Sat, 31 Dec 2005, Willy Tarreau wrote:
> On Fri, Dec 30, 2005 at 05:48:15PM -0800, Chris Stromsoe wrote:
>
>> I'm starting to suspect bad hardware.  Booting is now hanging (with 
>> 2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:
>
> And nothing changed since previous boot, except UP ?

All I changed was adding nosmp to the kernel boot line.

> It's not necessarily bad hardware. I also had trouble on one version of 
> the 29160 BIOS where it hung during device scan if there were too many 
> terminations. Oh, BTW, please check that you have disabled "automatic" 
> termination in the BIOS. Manually set it either to ON or OFF (low/high 
> depending on your setup).

I'll have to check it tomorrow or on Monday.

>> How likely is it that a failing scsi controller would contribute to the other 
>> problems I was seeing?
>
> Not much. Perhaps at worst, a failing controller could corrupt memory by 
> writing garbage at wrong locations, but you would not always get the 
> same messages. It seems to be a different problem here. To be honest, 
> it's where I think you should try the new driver.

The machine has been running 2.6.14.4 for the last 6 hours.  It came up 
fine.  I did not try booting it with nosmp.  If I have time, I will revert 
back to 2.4 with the newer driver to test.


-Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31 10:39           ` Chris Stromsoe
@ 2005-12-31 10:56             ` Willy Tarreau
  0 siblings, 0 replies; 40+ messages in thread
From: Willy Tarreau @ 2005-12-31 10:56 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Marcelo Tosatti, linux-kernel

On Sat, Dec 31, 2005 at 02:39:43AM -0800, Chris Stromsoe wrote:
> On Sat, 31 Dec 2005, Willy Tarreau wrote:
> >On Fri, Dec 30, 2005 at 05:48:15PM -0800, Chris Stromsoe wrote:
> >
> >>I'm starting to suspect bad hardware.  Booting is now hanging (with 
> >>2.4.27, 2.4.30 and 2.4.32) after scsi drivers load:
> >
> >And nothing changed since previous boot, except UP ?
> 
> All I changed was adding nosmp to the kernel boot line.

OK, maybe interrupts don't get distributed to the remaining CPU, which
would explain your timeouts.

> >It's not necessarily bad hardware. I also had trouble on one version of 
> >the 29160 BIOS where it hung during device scan if there were too many 
> >terminations. Oh, BTW, please check that you have disabled "automatic" 
> >termination in the BIOS. Manually set it either to ON or OFF (low/high 
> >depending on your setup).
> 
> I'll have to check it tomorrow or on Monday.
> 
> >>How likely is it that a failing scsi controller would contribute to the other 
> >>problems I was seeing?
> >
> >Not much. Perhaps at worst, a failing controller could corrupt memory by 
> >writing garbage at wrong locations, but you would not always get the 
> >same messages. It seems to be a different problem here. To be honest, 
> >it's where I think you should try the new driver.
> 
> The machine has been running 2.6.14.4 for the last 6 hours.  It came up 
> fine.  I did not try booting it with nosmp.  If I have time, I will 
> revert back to 2.4 with the newer driver to test.

Thanks.

> -Chris

Willy


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  7:25           ` Willy Tarreau
@ 2005-12-31 11:06             ` Chris Stromsoe
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Stromsoe @ 2005-12-31 11:06 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Marcelo Tosatti, linux-kernel

On Sat, 31 Dec 2005, Willy Tarreau wrote:
> On Fri, Dec 30, 2005 at 08:00:34PM -0800, Chris Stromsoe wrote:
>
>> I couldn't get the machine to come up with 2.4.32, 2.4.30, or 2.4.27. 
>> It was hanging and then throwing the SCSI errors below.  The machine 
>> did come up with a vanilla 2.6.14.4 and appears to be working fine. 
>> I'm going to leave it up over the weekend and see if it oopses.  If it 
>> would help, I can mail out the .config for the 2.4.32 and 2.6.14.4 
>> builds, or provide other information of interest.
>
> Please do post at least the 2.4.32 .config, I'll try to boot it on my 
> system right here. I find it amazing that it suddenly stopped working 
> with the same kernels as before.

Both configs are at <http://hashbrown.cts.ucla.edu/pub/oops-200512/>.

I have no idea why it wouldn't come up with nosmp on the command line 
(being supplied by lilo as append="nosmp").  I tried warm boot, cold boot, 
removing all power from the hardware.  I booted from a rescue cd that had 
2.6 on it and the machine came up right away.  I tried to go back to 2.4 
and it hung and then had SCSI errors again, so I installed 2.6.14.4 and 
left it running.  I can put a copy of the 2.4.32 kernel and modules up if 
that would help.


-Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31  1:48       ` Chris Stromsoe
  2005-12-31  4:00         ` Chris Stromsoe
  2005-12-31  7:12         ` Willy Tarreau
@ 2005-12-31 12:08         ` Alan Cox
  2005-12-31 13:01           ` Willy Tarreau
  2 siblings, 1 reply; 40+ messages in thread
From: Alan Cox @ 2005-12-31 12:08 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Marcelo Tosatti, linux-kernel

On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
> scsi0:0:0:0: Attempting to queue an ABORT message
> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> scsi0:0:0:0: Command already completed
> aic7xxx_abort returns 0x2002

IRQ routing by the look of that trace. Make sure that if you are using
2.4.x you have ACPI disabled and see if it looks any better
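
For anyone reading the quoted log: the first byte of each CDB is the SCSI
opcode. 0x12 is INQUIRY and 0x00 is TEST UNIT READY, so the commands being
aborted are the basic probe commands, and "Command already completed"
would fit a completion whose interrupt was never seen. A minimal Python
sketch for decoding such lines follows. The function names are
illustrative only and the opcode table is deliberately limited to the two
opcodes that appear in this thread.

```python
# Small subset of SCSI opcodes; only the two seen in the log above.
SCSI_OPCODES = {
    0x00: "TEST UNIT READY",
    0x12: "INQUIRY",
}


def parse_cdb(line):
    """Parse a 'CDB: 0x12 0x0 ...' log line into a list of byte values."""
    prefix = "CDB:"
    if not line.startswith(prefix):
        raise ValueError("not a CDB line: %r" % line)
    return [int(tok, 16) for tok in line[len(prefix):].split()]


def describe_cdb(line):
    """Return a short human-readable description of a logged CDB."""
    cdb = parse_cdb(line)
    name = SCSI_OPCODES.get(cdb[0], "unknown opcode 0x%02x" % cdb[0])
    if cdb[0] == 0x12:
        # In a 6-byte INQUIRY, byte 4 carries the allocation length.
        return "%s (allocation length %d)" % (name, cdb[4])
    return name


if __name__ == "__main__":
    print(describe_cdb("CDB: 0x12 0x0 0x0 0x0 0xff 0x0"))
    print(describe_cdb("CDB: 0x0 0x0 0x0 0x0 0x0 0x0"))
```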


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31 12:08         ` Alan Cox
@ 2005-12-31 13:01           ` Willy Tarreau
  2006-01-05  3:52             ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2005-12-31 13:01 UTC (permalink / raw)
  To: Alan Cox; +Cc: Chris Stromsoe, Marcelo Tosatti, linux-kernel

Hi Alan,

On Sat, Dec 31, 2005 at 12:08:21PM +0000, Alan Cox wrote:
> On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
> > scsi0:0:0:0: Attempting to queue an ABORT message
> > CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> > scsi0:0:0:0: Command already completed
> > aic7xxx_abort returns 0x2002
> 
> IRQ routing by the look of that trace. Make sure that if you are using
> 2.4.x you have ACPI disabled and see if it looks any better

Correct, and I came to the same conclusion; Chris told us he booted with
the "nosmp" option. I've checked his config, and he has CONFIG_ACPI_BOOT=y.
I've just tried the same here, and I confirm that my machine (dual athlon)
does not boot with "nosmp" unless I also add "acpi=off". Mine even stops
earlier, while scanning IDE devices.

So now we're back to the original problem, i.e. why does he get bad pmd
that often on 2.4. It leaves us with the following possible next steps
after the problem occurs again (if it still happens with 2.6.14 or if
Chris is OK for a few more tests) :
  - 2.4.32 nosmp acpi=off       => the easiest one
  - 2.4.32 + aic7xxx+20040522   => the more interesting one

Regards,
Willy


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2005-12-31 13:01           ` Willy Tarreau
@ 2006-01-05  3:52             ` Chris Stromsoe
  2006-01-05  5:43               ` Willy Tarreau
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-05  3:52 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Sat, 31 Dec 2005, Willy Tarreau wrote:
> On Sat, Dec 31, 2005 at 12:08:21PM +0000, Alan Cox wrote:
>> On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
>>> scsi0:0:0:0: Attempting to queue an ABORT message
>>> CDB: 0x12 0x0 0x0 0x0 0xff 0x0
>>> scsi0:0:0:0: Command already completed
>>> aic7xxx_abort returns 0x2002
>>
>> IRQ routing by the look of that trace. Make sure that if you are using 
>> 2.4.x you have ACPI disabled and see if it looks any better
>
> Correct, and I came to the same conclusion; Chris told us he booted 
> with the "nosmp" option. I've checked his config, and he has 
> CONFIG_ACPI_BOOT=y. I've just tried the same here, and I confirm that my 
> machine (dual athlon) does not boot with "nosmp" unless I also add 
> "acpi=off". Mine even stops earlier, while scanning IDE devices.

2.6.14.4 has been running stable for 4 days.  For the long term, I'll 
probably migrate the box to 2.6 and leave it there.

> So now we're back to the original problem, i.e. why does he get bad pmd
> that often on 2.4. It leaves us with the following possible next steps
> after the problem occurs again (if it still happens with 2.6.14 or if
> Chris is OK for a few more tests) :
>  - 2.4.32 nosmp acpi=off       => the easiest one
>  - 2.4.32 + aic7xxx+20040522   => the more interesting one

I booted 2.4.32 with the aic7xxx patch you pointed me at last week.  It's 
been up for a few hours.  I'll let it run for at least a week or two and 
will report back positive or negative results.  After that, I'll try 
2.4.32 with nosmp and acpi=off.


-Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-05  3:52             ` Chris Stromsoe
@ 2006-01-05  5:43               ` Willy Tarreau
  2006-01-06 21:54                 ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2006-01-05  5:43 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Wed, Jan 04, 2006 at 07:52:36PM -0800, Chris Stromsoe wrote:
> On Sat, 31 Dec 2005, Willy Tarreau wrote:
> >On Sat, Dec 31, 2005 at 12:08:21PM +0000, Alan Cox wrote:
> >>On Gwe, 2005-12-30 at 17:48 -0800, Chris Stromsoe wrote:
> >>>scsi0:0:0:0: Attempting to queue an ABORT message
> >>>CDB: 0x12 0x0 0x0 0x0 0xff 0x0
> >>>scsi0:0:0:0: Command already completed
> >>>aic7xxx_abort returns 0x2002
> >>
> >>IRQ routing by the look of that trace. Make sure that if you are using 
> >>2.4.x you have ACPI disabled and see if it looks any better
> >
> >Correct, and I came to the same conclusion; Chris told us he booted 
> >with the "nosmp" option. I've checked his config, and he has 
> >CONFIG_ACPI_BOOT=y. I've just tried the same here, and I confirm that my 
> >machine (dual athlon) does not boot with "nosmp" unless I also add 
> >"acpi=off". Mine even stops earlier, while scanning IDE devices.
> 
> 2.6.14.4 has been running stable for 4 days.  For the long term, I'll 
> probably migrate the box to 2.6 and leave it there.
> 
> >So now we're back to the original problem, i.e. why does he get bad pmd
> >that often on 2.4. It leaves us with the following possible next steps
> >after the problem occurs again (if it still happens with 2.6.14 or if
> >Chris is OK for a few more tests) :
> > - 2.4.32 nosmp acpi=off       => the easiest one
> > - 2.4.32 + aic7xxx+20040522   => the more interesting one
> 
> I booted 2.4.32 with the aic7xxx patch you pointed me at last week.  It's 
> been up for a few hours.  I'll let it run for at least a week or two and 
> will report back positive or negative results.  After that, I'll try 
> 2.4.32 with nosmp and acpi=off.

Thanks for your continued feedback, Chris. Your reports are very helpful,
they tend to prove that your hardware is OK and that there's a bug in
mainline 2.4.32 with SMP+ACPI+aic7xxx enabled. That's already a good
piece of information.

> -Chris

Regards,
Willy


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-05  5:43               ` Willy Tarreau
@ 2006-01-06 21:54                 ` Chris Stromsoe
  2006-01-06 22:14                   ` Chris Stromsoe
  2006-01-08  9:45                   ` Willy Tarreau
  0 siblings, 2 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-06 21:54 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Thu, 5 Jan 2006, Willy Tarreau wrote:
> On Wed, Jan 04, 2006 at 07:52:36PM -0800, Chris Stromsoe wrote:
> 
>> I booted 2.4.32 with the aic7xxx patch you pointed me at last week. 
>> It's been up for a few hours.  I'll let it run for at least a week or 
>> two and will report back positive or negative results.  After that, 
>> I'll try 2.4.32 with nosmp and acpi=off.
>
> Thanks for your continued feedback, Chris. Your reports are very 
> helpful, they tend to prove that your hardware is OK and that there's a 
> bug in mainline 2.4.32 with SMP+ACPI+aic7xxx enabled. That's already a 
> good piece of information.

After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got 
another bad pmd and an oops this morning at 4:23am.  I'm going to boot 
vanilla 2.4.32 with nosmp and acpi=off.


-Chris

ksymoops 2.4.9 on i686 2.4.32-aic79xx.  Options used
      -V (default)
      -k /proc/ksyms (default)
      -l /proc/modules (default)
      -o /lib/modules/2.4.32-aic79xx/ (default)
      -m /boot/System.map-2.4.32-aic79xx (specified)

Unable to handle kernel paging request at virtual address c2deee80
c025b3d3
*pde = 02c001e3
Oops: 0002
CPU:    2
EIP:    0010:[alloc_skb+275/480]    Not tainted
EFLAGS: 00010282
eax: c2deee80   ebx: e0508880   ecx: 000006bc   edx: 00000680
esi: 000001f0   edi: 00000000   ebp: f6cf7df0   esp: f6cf7ddc
ds: 0018   es: 0018   ss: 0018
Process innfeed (pid: 523, stackpage=f6cf7000)
Stack: 000006bc 000001f0 f3023b80 00000000 d307e000 f6cf7e68 c027cd2b 00000680
        000001f0 000005a8 00000000 f6cf7e54 00000000 00000283 cb3f3000 c025a339
        c8083280 00000000 00000000 c43428a0 f6cf6000 461800d6 00009bc7 00010430 
Call Trace:    [tcp_sendmsg+2619/4512] [sock_wfree+73/80] [inet_sendmsg+65/80] [sock_sendmsg+102/176] [sock_readv_writev+116/176]
Code: c7 00 01 00 00 00 8b 83 8c 00 00 00 c7 40 04 00 00 00 00 8b 
Using defaults from ksymoops -t elf32-i386 -a i386


>>eax; c2deee80 <_end+2a0b300/3864e4e0>
>>ebx; e0508880 <_end+20124d00/3864e4e0>
>>ebp; f6cf7df0 <_end+36914270/3864e4e0>
>>esp; f6cf7ddc <_end+3691425c/3864e4e0>

Code;  00000000 Before first symbol
00000000 <_EIP>:
Code;  00000000 Before first symbol
    0:   c7 00 01 00 00 00         movl   $0x1,(%eax)
Code;  00000006 Before first symbol
    6:   8b 83 8c 00 00 00         mov    0x8c(%ebx),%eax
Code;  0000000c Before first symbol
    c:   c7 40 04 00 00 00 00      movl   $0x0,0x4(%eax)
Code;  00000013 Before first symbol
   13:   8b 00                     mov    (%eax),%eax
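
As a reading aid for the dump above: on i386 the "Oops:" value is the
page-fault error code, where bit 0 distinguishes a missing page from a
protection fault, bit 1 a read from a write, and bit 2 kernel from user
mode. The low bits of the "*pde" value are the usual x86 page-directory
flags. The Python sketch below decodes both. It assumes the printed pde
can be read as a plain 32-bit entry, and the helper names are made up for
illustration; it is not part of ksymoops.

```python
# (mask, meaning if bit set, meaning if bit clear) for the i386 error code.
PF_BITS = [
    (0x1, "protection fault", "page not present"),
    (0x2, "write access", "read access"),
    (0x4, "user mode", "kernel mode"),
]

# Low flag bits of an x86 page-directory entry.
PDE_FLAGS = [
    (0x001, "present"),
    (0x002, "writable"),
    (0x004, "user"),
    (0x008, "write-through"),
    (0x010, "cache-disable"),
    (0x020, "accessed"),
    (0x040, "dirty"),
    (0x080, "large page (PSE)"),
    (0x100, "global"),
]


def decode_oops_code(code_hex):
    """Decode the hex string printed after 'Oops:' into three descriptions."""
    code = int(code_hex, 16)
    return [if_set if code & mask else if_clear
            for mask, if_set, if_clear in PF_BITS]


def decode_pde_flags(pde_hex):
    """List the flag names set in the hex string printed after '*pde ='."""
    pde = int(pde_hex, 16)
    return [name for mask, name in PDE_FLAGS if pde & mask]


if __name__ == "__main__":
    print(decode_oops_code("0002"))
    print(decode_pde_flags("02c001e3"))
```

Read this way, "Oops: 0002" is a kernel-mode write to an address reported
as not present, while the flag bits 0x1e3 describe an entry marked as a
present, writable, global large page.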


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-06 21:54                 ` Chris Stromsoe
@ 2006-01-06 22:14                   ` Chris Stromsoe
  2006-01-06 22:16                     ` Chris Stromsoe
  2006-01-07  9:19                     ` Roberto Nibali
  2006-01-08  9:45                   ` Willy Tarreau
  1 sibling, 2 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-06 22:14 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Fri, 6 Jan 2006, Chris Stromsoe wrote:

> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got 
> another bad pmd and an oops this morning at 4:23am.  I'm going to boot 
> vanilla 2.4.32 with nosmp and acpi=off.

booting with "nosmp acpi=off" did not help.  The box hung as before, at

hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: attached ide-cdrom driver.
hda: ATAPI 24X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec 3960D Ultra160 SCSI adapter>
        aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec 3960D Ultra160 SCSI adapter>
        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
        <Adaptec aic7899 Ultra160 SCSI adapter>
        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs

blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)


I waited about 10 minutes to see if it would continue, then booted back 
into 2.6.14.4.



-Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-06 22:14                   ` Chris Stromsoe
@ 2006-01-06 22:16                     ` Chris Stromsoe
  2006-01-07  9:19                     ` Roberto Nibali
  1 sibling, 0 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-06 22:16 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Fri, 6 Jan 2006, Chris Stromsoe wrote:
> On Fri, 6 Jan 2006, Chris Stromsoe wrote:
>
>> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got 
>> another bad pmd and an oops this morning at 4:23am.  I'm going to boot 
>> vanilla 2.4.32 with nosmp and acpi=off.
>
> booting with "nosmp acpi=off" did not help.  The box hung as before, at

One last datapoint; 2.6.14.4 boots fine with "nosmp acpi=off".


-Chris

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-06 22:14                   ` Chris Stromsoe
  2006-01-06 22:16                     ` Chris Stromsoe
@ 2006-01-07  9:19                     ` Roberto Nibali
  2006-01-09 18:28                       ` Chris Stromsoe
  1 sibling, 1 reply; 40+ messages in thread
From: Roberto Nibali @ 2006-01-07  9:19 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

>> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got 
>> another bad pmd and an oops this morning at 4:23am.  I'm going to boot 
>> vanilla 2.4.32 with nosmp and acpi=off.

Your oops does not make much sense, could you enable the following, please:

CONFIG_DEBUG_KERNEL=y
CONFIG_DEBUG_SLAB=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_FRAME_POINTER=y

> booting with "nosmp acpi=off" did not help.  The box hung as before, at

Could you boot with pci=noacpi and report again? The difference is that 
ACPI will still be used but not for IRQ routing. I have a few boxes out 
with 2.4.x kernels and Adaptec HBAs that need this to work reliably.

> hda: TEAC CD-ROM CD-224E, ATAPI CD/DVD-ROM drive
> ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
> hda: attached ide-cdrom driver.
> hda: ATAPI 24X CD-ROM drive, 128kB Cache
> Uniform CD-ROM driver Revision: 3.12
> SCSI subsystem driver Revision: 1.00
> scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36

What's the SCSI BIOS version?

>        <Adaptec 3960D Ultra160 SCSI adapter>
>        aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
> 
> scsi1 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
>        <Adaptec 3960D Ultra160 SCSI adapter>
>        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
> 
> scsi2 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.36
>        <Adaptec aic7899 Ultra160 SCSI adapter>
>        aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
> 
> blk: queue f7e46018, I/O limit 4095Mb (mask 0xffffffff)
>  
> I waited about 10 minutes to see if it would continue, then booted back 
> into 2.6.14.4.

What's the diff in /proc/interrupts and lspci -v between those kernels, 
once they've finished the boot sequence?
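
One way to make that comparison, sketched in Python under the assumption
that the contents of /proc/interrupts were saved to a text file under
each kernel: parse both captures and list every IRQ line that is missing
on one side or whose controller/device description differs. The function
names are hypothetical and the parser only handles the usual
"label: counts... description" layout.

```python
def parse_interrupts(text):
    """Map IRQ label -> (total count across CPUs, description)."""
    table = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # header row listing CPU names, or a blank line
        label, rest = line.split(":", 1)
        fields = rest.split()
        counts = []
        # Leading numeric fields are the per-CPU interrupt counts.
        while fields and fields[0].isdigit():
            counts.append(int(fields.pop(0)))
        table[label.strip()] = (sum(counts), " ".join(fields))
    return table


def compare_interrupts(text_a, text_b):
    """Return (label, entry_a, entry_b) for lines that differ in routing."""
    a = parse_interrupts(text_a)
    b = parse_interrupts(text_b)
    report = []
    for label in sorted(set(a) | set(b)):
        entry_a, entry_b = a.get(label), b.get(label)
        # Counts always differ between boots, so compare descriptions only.
        if entry_a is None or entry_b is None or entry_a[1] != entry_b[1]:
            report.append((label, entry_a, entry_b))
    return report
```

An IRQ whose count stays at zero for the aic7xxx line under one kernel but
not the other would be the first thing to look for in the output.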

If you find time, send me your BIOS settings and your .config in private 
email. I didn't track this thread from the beginning, so I don't know if 
you've already done this.

It might also help to carry this problem over to the linux-scsi mailing 
list, since, I believe, most SCSI guys don't read lkml too frequently.

Of course, if 2.6.x works for you and you need to go into production, then 
I'd switch to it if I were you.

Just my 2 cents,
Roberto Nibali, ratz
-- 
echo 
'[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-06 21:54                 ` Chris Stromsoe
  2006-01-06 22:14                   ` Chris Stromsoe
@ 2006-01-08  9:45                   ` Willy Tarreau
  2006-01-09 18:33                     ` Chris Stromsoe
  1 sibling, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2006-01-08  9:45 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

Hi Chris,

On Fri, Jan 06, 2006 at 01:54:45PM -0800, Chris Stromsoe wrote:
> On Thu, 5 Jan 2006, Willy Tarreau wrote:
> >On Wed, Jan 04, 2006 at 07:52:36PM -0800, Chris Stromsoe wrote:
> >
> >>I booted 2.4.32 with the aic7xxx patch you pointed me at last week. 
> >>It's been up for a few hours.  I'll let it run for at least a week or 
> >>two and will report back positive or negative results.  After that, 
> >>I'll try 2.4.32 with nosmp and acpi=off.
> >
> >Thanks for your continued feedback, Chris. Your reports are very 
> >helpful, they tend to prove that your hardware is OK and that there's a 
> >bug in mainline 2.4.32 with SMP+ACPI+aic7xxx enabled. That's already a 
> >good piece of information.
> 
> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
> another bad pmd and an oops this morning at 4:23am.  I'm going to boot 
> vanilla 2.4.32 with nosmp and acpi=off.

Well, I'm puzzled. On the one hand, your oopses don't all look the
same, so we could think it's a hardware problem. On the other hand,
your hardware tests did not find anything and 2.6.14 runs fine. BTW,
I also have other machines running in production with an Adaptec
29160 like yours and I don't encounter this. It looks like some
memory corruption, but finding what causes it seems very hard. In
fact, it somewhat reminds me of the problems encountered by Stephan
von Krawczynski 2.5 years ago. He encountered data corruption when
saving large amounts of data to a DLT connected to an AIC7xxx, and
often had freezes, and sometimes an oops. IIRC, changing the board
for something else fixed his problem.

I've compared the driver between 2.4 and 2.6, and the core has not
changed much, but its interface to the OS has changed a lot, so it's
not easy to identify a potential fix.

Even though I don't like this, I would second Roberto's advice to
upgrade to 2.6 and stick to it. If you eventually encounter the
same problem on 2.6 after a very long time, then it would be
an indication that the problem really is in your hardware.

> -Chris

Thanks for all your investigations,
Willy



* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-07  9:19                     ` Roberto Nibali
@ 2006-01-09 18:28                       ` Chris Stromsoe
  2006-01-09 20:16                         ` Roberto Nibali
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-09 18:28 UTC (permalink / raw)
  To: Roberto Nibali; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

On Sat, 7 Jan 2006, Roberto Nibali wrote:

>>> After a little more than one day up with 2.4.32 SMP+ACPI+aic7xxx, I got
>>> another bad pmd and an oops this morning at 4:23am.  I'm going to boot 
>>> vanilla 2.4.32 with nosmp and acpi=off.
>
> Your oops does not make much sense; could you enable the following, please:
>
> CONFIG_DEBUG_KERNEL=y
> CONFIG_DEBUG_SLAB=y
> CONFIG_MAGIC_SYSRQ=y
> CONFIG_FRAME_POINTER=y

kernel, sysrq, and frame_pointer were already enabled.  I'll enable 
debug_slab, as well.
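
A quick way to confirm which of the four requested options are actually set in a given .config could look like the sketch below. The function name is hypothetical; the option names are the ones listed in the quoted request:

```shell
#!/bin/sh
# Report, for each debug option requested above, whether it is set to "y"
# in the given kernel .config file. An option that is commented out
# ("# CONFIG_... is not set") or absent is reported as missing.
check_debug_options() {
    config=$1
    for opt in CONFIG_DEBUG_KERNEL CONFIG_DEBUG_SLAB \
               CONFIG_MAGIC_SYSRQ CONFIG_FRAME_POINTER; do
        if grep -q "^${opt}=y\$" "$config"; then
            echo "$opt: enabled"
        else
            echo "$opt: missing"
        fi
    done
}
```

It would be called as `check_debug_options /path/to/.config` before rebuilding, so that only the options reported as missing need to be switched on.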

>> booting with "nosmp acpi=off" did not help.  The box hung as before, at
>
> Could you boot with pci=noacpi and report again? The difference is that 
> ACPI will still be used but not for IRQ routing. I have a few boxes out 
> with 2.4.x kernels and Adaptec HBAs that need this to work reliably.

Are you interested in results from "pci=noacpi" by itself or in 
conjunction with nosmp?

> What's the SCSI BIOS version?

The SCSI controller is an onboard AIC 7899 (in a Dell PowerEdge 2650), and 
reports itself as "25309".

> What's the diff between /proc/interrupts and lspci -v on those kernels,
> when they've finished the booting sequence?

> If you find time, send me your BIOS settings and your .config in private 
> email. I didn't track this thread from the beginning, so I don't know if 
> you've already done this.

<http://hashbrown.cts.ucla.edu/pub/oops-200512/> has the .config, lspci 
-v, and /proc/interrupts for 2.6.14.4 and 2.4.32.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-08  9:45                   ` Willy Tarreau
@ 2006-01-09 18:33                     ` Chris Stromsoe
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-09 18:33 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, 8 Jan 2006, Willy Tarreau wrote:

> Even though I don't like this, I would second Roberto's advice to upgrade
> to 2.6 and stick to it. If you eventually encounter the same problem on 2.6
> after a very long time, then it would be an indication that the problem
> really is in your hardware.

I'll keep 2.4.32 with DEBUG_SLAB up until it oopses again and will report 
that.  After, I'll probably stick with 2.6.  Thanks.

-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-09 18:28                       ` Chris Stromsoe
@ 2006-01-09 20:16                         ` Roberto Nibali
  2006-01-09 20:22                           ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Roberto Nibali @ 2006-01-09 20:16 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

>> CONFIG_DEBUG_KERNEL=y
>> CONFIG_DEBUG_SLAB=y
>> CONFIG_MAGIC_SYSRQ=y
>> CONFIG_FRAME_POINTER=y
> 
> kernel, sysrq, and frame_pointer were already enabled.  I'll enable 
> debug_slab, as well.

Excellent.

>>> booting with "nosmp acpi=off" did not help.  The box hung as before, at
>>
>> Could you boot with pci=noacpi and report again? The difference is 
>> that ACPI will still be used but not for IRQ routing. I have a few 
>> boxes out with 2.4.x kernels and Adaptec HBAs that need this to work 
>> reliably.
> 
> Are you interested in results from "pci=noacpi" by itself or in 
> conjunction with nosmp?

With SMP, please.
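
Since the test only means something if the machine really came up with the intended parameters, it may be worth checking the booted command line before starting the run. A small sketch, with a hypothetical helper name:

```shell
#!/bin/sh
# Return success if a kernel command line (as read from /proc/cmdline)
# contains exactly the parameter given as the second argument.
# Padding with spaces avoids matching substrings of other parameters.
has_boot_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For the run requested here, `has_boot_param "$(cat /proc/cmdline)" pci=noacpi` should succeed and the same call with `nosmp` should fail.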

>> What's the SCSI BIOS version?
> 
> The SCSI controller is an onboard AIC 7899 (in a Dell PowerEdge 2650), 
> and reports itself as "25309".

What I meant was the SCSI BIOS revision you get to see when you cold
reset the system.

>> If you find time, send me your BIOS settings and your .config in 
>> private email. I didn't track this thread from the beginning, so I 
>> don't know if you've already done this.
> 
> <http://hashbrown.cts.ucla.edu/pub/oops-200512/> has the .config, lspci 
> -v, and /proc/interrupts for 2.6.14.4 and 2.4.32.

Thanks, I'll skim over these and get back to you if I can correlate 
anything with the issues we were having using this controller.

Regards,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-09 20:16                         ` Roberto Nibali
@ 2006-01-09 20:22                           ` Chris Stromsoe
  2006-01-09 22:22                             ` Roberto Nibali
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-09 20:22 UTC (permalink / raw)
  To: Roberto Nibali; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

On Mon, 9 Jan 2006, Roberto Nibali wrote:

>>> What's the SCSI BIOS version?
>> 
>> The SCSI controller is an onboard AIC 7899 (in a Dell PowerEdge 2650), 
>> and reports itself as "25309".
>
> What I meant was the SCSI BIOS revision you get to see when you cold
> reset the system.

That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-09 20:22                           ` Chris Stromsoe
@ 2006-01-09 22:22                             ` Roberto Nibali
  2006-01-10  0:59                               ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Roberto Nibali @ 2006-01-09 22:22 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

> That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
> that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".

Brain is engaged now, thanks ;). If you find time, could you maybe 
compile a 2.4.32 kernel using the following config (slightly changed from
yours):

http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s

And put a dmidecode[1] output onto your website. Is the BMC interface 
enabled in your BIOS?

[1] http://download.savannah.nongnu.org/releases/dmidecode/

Best regards,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq'|dc


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-09 22:22                             ` Roberto Nibali
@ 2006-01-10  0:59                               ` Chris Stromsoe
  2006-01-15 11:29                                 ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-10  0:59 UTC (permalink / raw)
  To: Roberto Nibali; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

On Mon, 9 Jan 2006, Roberto Nibali wrote:

>> That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
>> that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".
>
> Brain is engaged now, thanks ;). If you find time, could you maybe 
> compile a 2.4.32 kernel using the following config (slightly changed from
> yours):
>
> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s

If/when the current run with DEBUG_SLAB oopses, I'll reboot with the 
config modifications.

> And put a dmidecode[1] output onto your website.

http://hashbrown.cts.ucla.edu/pub/oops-200512/dmidecode.out

> Is the BMC interface enabled in your BIOS?

I haven't changed the BMC defaults and am not using it, but I believe that 
it shipped as enabled so should still be.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-10  0:59                               ` Chris Stromsoe
@ 2006-01-15 11:29                                 ` Chris Stromsoe
  2006-01-15 12:12                                   ` Willy Tarreau
  2006-01-15 22:38                                   ` Chris Stromsoe
  0 siblings, 2 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-15 11:29 UTC (permalink / raw)
  To: Roberto Nibali; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

On Mon, 9 Jan 2006, Chris Stromsoe wrote:
> On Mon, 9 Jan 2006, Roberto Nibali wrote:
>
>>> That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
>>> that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".
>> 
>> Brain is engaged now, thanks ;). If you find time, could you maybe 
>> compile a 2.4.32 kernel using the following config (slightly changed from
>> yours):
>> 
>> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
>
> If/when the current run with DEBUG_SLAB oopses, I'll reboot with the 
> config modifications.

I've been running stable with the proposed changes since the 10th.  The
original config and the currently running config are both at 
<http://hashbrown.cts.ucla.edu/pub/oops-200512/>.  This is the diff:

cbs@hashbrown:~ > diff config-2.4.32 config-2.4.32-20060115

65c65
< CONFIG_HIGHIO=y
---
> # CONFIG_HIGHIO is not set
69c69
< CONFIG_NR_CPUS=32
---
> CONFIG_NR_CPUS=4
87c87
< CONFIG_ISA=y
---
> # CONFIG_ISA is not set
109c109
< # CONFIG_ACPI is not set
---
> CONFIG_ACPI=y
110a111,127
> CONFIG_ACPI_BUS=y
> CONFIG_ACPI_INTERPRETER=y
> CONFIG_ACPI_EC=y
> CONFIG_ACPI_POWER=y
> CONFIG_ACPI_PCI=y
> CONFIG_ACPI_MMCONFIG=y
> CONFIG_ACPI_SLEEP=y
> CONFIG_ACPI_SYSTEM=y
> # CONFIG_ACPI_AC is not set
> # CONFIG_ACPI_BATTERY is not set
> # CONFIG_ACPI_BUTTON is not set
> # CONFIG_ACPI_FAN is not set
> # CONFIG_ACPI_PROCESSOR is not set
> # CONFIG_ACPI_THERMAL is not set
> # CONFIG_ACPI_ASUS is not set
> # CONFIG_ACPI_TOSHIBA is not set
> # CONFIG_ACPI_DEBUG is not set
385c402
< # CONFIG_AIC7XXX_DEBUG_ENABLE is not set
---
> CONFIG_AIC7XXX_DEBUG_ENABLE=y
387c404
< # CONFIG_AIC7XXX_REG_PRETTY_PRINT is not set
---
> CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
492,493d508
< # CONFIG_AT1700 is not set
< # CONFIG_DEPCA is not set
500d514
< # CONFIG_AC3200 is not set
585,589d598
< # Old CD-ROM drivers (not SCSI, not IDE)
< #
< # CONFIG_CD_NO_IDESCSI is not set
<
< #
864,865c873,874
< # CONFIG_DEBUG_HIGHMEM is not set
< # CONFIG_DEBUG_SLAB is not set
---
> CONFIG_DEBUG_HIGHMEM=y
> CONFIG_DEBUG_SLAB=y




-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 11:29                                 ` Chris Stromsoe
@ 2006-01-15 12:12                                   ` Willy Tarreau
  2006-01-15 21:18                                     ` Chris Stromsoe
  2006-01-15 22:38                                   ` Chris Stromsoe
  1 sibling, 1 reply; 40+ messages in thread
From: Willy Tarreau @ 2006-01-15 12:12 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Roberto Nibali, Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, Jan 15, 2006 at 03:29:15AM -0800, Chris Stromsoe wrote:
> On Mon, 9 Jan 2006, Chris Stromsoe wrote:
> >On Mon, 9 Jan 2006, Roberto Nibali wrote:
> >
> >>>That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
> >>>that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".
> >>
> >>Brain is engaged now, thanks ;). If you find time, could you maybe 
> >>compile a 2.4.32 kernel using the following config (slightly changed from
> >>yours):
> >>
> >>http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
> >
> >If/when the current run with DEBUG_SLAB oopses, I'll reboot with the 
> >config modifications.
> 
> I've been running stable with the proposed changes since the 10th.  The
> original config and the currently running config are both at 
> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>.  This is the diff:
> 
> cbs@hashbrown:~ > diff config-2.4.32 config-2.4.32-20060115
> 
> 65c65
> < CONFIG_HIGHIO=y
> ---
> ># CONFIG_HIGHIO is not set

I wonder if this change could be suspected of affecting stability. With
this unset, data will be sent from the card to low memory, then bounced
to high mem when needed. Maybe the card, northbridge or anything else
sometimes corrupts memory during direct highmem I/O from PCI ? :-/

Or perhaps it's simply too early to conclude anything.
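
The HIGHIO question only matters on a machine that actually has high memory. As a hedged sketch, the amount of highmem a kernel reports can be read from the `HighTotal` field of /proc/meminfo; the helper name below is hypothetical, and the field is simply absent on kernels built without highmem support:

```shell
#!/bin/sh
# Print the HighTotal value (in kB) from a meminfo-style file, or 0 when
# the field is absent. Defaults to /proc/meminfo when no file is given.
high_total_kb() {
    awk '$1 == "HighTotal:" { print $2; found = 1 }
         END { if (!found) print 0 }' "${1:-/proc/meminfo}"
}
```

A non-zero result means some pages live above the low-memory boundary, which is the case where bounce buffering versus direct highmem I/O makes a difference.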

Thanks for your report anyway.

Regards,
Willy



* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 12:12                                   ` Willy Tarreau
@ 2006-01-15 21:18                                     ` Chris Stromsoe
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-15 21:18 UTC (permalink / raw)
  To: Willy Tarreau; +Cc: Roberto Nibali, Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, 15 Jan 2006, Willy Tarreau wrote:
> On Sun, Jan 15, 2006 at 03:29:15AM -0800, Chris Stromsoe wrote:
>>
>> I've been running stable with the proposed changes since the 10th.  The
>> original config and the currently running config are both at 
>> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>.  This is the diff:
>>
>> cbs@hashbrown:~ > diff config-2.4.32 config-2.4.32-20060115
>>
>> 65c65
>> < CONFIG_HIGHIO=y
>> ---
>> > # CONFIG_HIGHIO is not set
>
> I wonder if this change could be suspected of affecting stability. With 
> this unset, data will be sent from the card to low memory, then bounced 
> to high mem when needed. Maybe the card, northbridge or anything else 
> sometimes corrupts memory during direct highmem I/O from PCI ? :-/

I'll let it run for another week as it is. If it would be useful 
information, I can switch CONFIG_HIGHIO back to =y and let that kernel run 
for a while.  Otherwise, I'll probably switch permanently to 2.6.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 11:29                                 ` Chris Stromsoe
  2006-01-15 12:12                                   ` Willy Tarreau
@ 2006-01-15 22:38                                   ` Chris Stromsoe
  2006-01-15 22:46                                     ` Willy TARREAU
  1 sibling, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-15 22:38 UTC (permalink / raw)
  To: Roberto Nibali; +Cc: Willy Tarreau, Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> On Mon, 9 Jan 2006, Chris Stromsoe wrote:
>> On Mon, 9 Jan 2006, Roberto Nibali wrote:
>> 
>>>> That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
>>>> that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".
>>> 
>>> Brain is engaged now, thanks ;). If you find time, could you maybe 
>>> compile a 2.4.32 kernel using the following config (slightly changed from
>>> yours):
>>> 
>>> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
>> 
>> If/when the current run with DEBUG_SLAB oopses, I'll reboot with the 
>> config modifications.
>
> I've been running stable with the proposed changes since the 10th.  The
> original config and the currently running config are both at 
> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>.  This is the diff:

I made a mistake.

The machine was /not/ booted into that config.  It is running the original 
config from http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32 
with DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.

The config with HIGHIO disabled and ACPI=y has not been tested.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 22:38                                   ` Chris Stromsoe
@ 2006-01-15 22:46                                     ` Willy TARREAU
  2006-01-15 22:54                                       ` Chris Stromsoe
  0 siblings, 1 reply; 40+ messages in thread
From: Willy TARREAU @ 2006-01-15 22:46 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Roberto Nibali, Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, Jan 15, 2006 at 02:38:51PM -0800, Chris Stromsoe wrote:
> On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> >On Mon, 9 Jan 2006, Chris Stromsoe wrote:
> >>On Mon, 9 Jan 2006, Roberto Nibali wrote:
> >>
> >>>>That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 and 
> >>>>that's the onboard AIC 7899.  It comes up as "BIOS Build 25309".
> >>>
> >>>Brain is engaged now, thanks ;). If you find time, could you maybe 
> >>>compile a 2.4.32 kernel using the following config (slightly changed from
> >>>yours):
> >>>
> >>>http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
> >>
> >>If/when the current run with DEBUG_SLAB oopses, I'll reboot with the 
> >>config modifications.
> >
> >I've been running stable with the proposed changes since the 10th.  The
> >original config and the currently running config are both at 
> ><http://hashbrown.cts.ucla.edu/pub/oops-200512/>.  This is the diff:
> 
> I made a mistake.
> 
> The machine was /not/ booted into that config.  It is running the original 
> config from http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32 
> with DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.
> 
> The config with HIGHIO disabled and ACPI=y has not been tested.

Thanks for the precision. So logically we should expect it to break sooner
or later ?

> 
> -Chris

Willy



* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 22:46                                     ` Willy TARREAU
@ 2006-01-15 22:54                                       ` Chris Stromsoe
  2006-01-16 20:52                                         ` Roberto Nibali
  2006-02-08  6:32                                         ` Chris Stromsoe
  0 siblings, 2 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-15 22:54 UTC (permalink / raw)
  To: Willy TARREAU; +Cc: Roberto Nibali, Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, 15 Jan 2006, Willy TARREAU wrote:
> On Sun, Jan 15, 2006 at 02:38:51PM -0800, Chris Stromsoe wrote:
>> On Sun, 15 Jan 2006, Chris Stromsoe wrote:
>>> On Mon, 9 Jan 2006, Chris Stromsoe wrote:
>>>> On Mon, 9 Jan 2006, Roberto Nibali wrote:
>>>>
>>>>>> That is the SCSI BIOS rev.  The machine is a Dell PowerEdge 2650 
>>>>>> and that's the onboard AIC 7899.  It comes up as "BIOS Build 
>>>>>> 25309".
>>>>>
>>>>> Brain is engaged now, thanks ;). If you find time, could you maybe 
>>>>> compile a 2.4.32 kernel using the following config (slightly changed
>>>>> from yours):
>>>>>
>>>>> http://www.drugphish.ch/patches/ratz/kernel/configs/config-2.4.32-chris_s
>>>>
>>>> If/when the current run with DEBUG_SLAB oopses, I'll reboot with the 
>>>> config modifications.
>>>
>>> I've been running stable with the proposed changes since the 10th.
>>> The original config and the currently running config are both at 
>>> <http://hashbrown.cts.ucla.edu/pub/oops-200512/>.  This is the diff:
>>
>> I made a mistake.
>>
>> The machine was /not/ booted into that config.  It is running the 
>> original config from 
>> http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32 with 
>> DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.
>>
>> The config with HIGHIO disabled and ACPI=y has not been tested.
>
> Thanks for the precision. So logically we should expect it to break 
> sooner or later ?

It is the same .config as one that crashed before, except that it has 
DEBUG_SLAB defined.  If it does not crash, then adding pci=noacpi to the 
command fixes the problem for me.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 22:54                                       ` Chris Stromsoe
@ 2006-01-16 20:52                                         ` Roberto Nibali
  2006-01-16 21:32                                           ` Chris Stromsoe
  2006-02-08  6:32                                         ` Chris Stromsoe
  1 sibling, 1 reply; 40+ messages in thread
From: Roberto Nibali @ 2006-01-16 20:52 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Willy TARREAU, Alan Cox, Marcelo Tosatti, linux-kernel

>>> The machine was /not/ booted into that config.  It is running the 
>>> original config from 
>>> http://hashbrown.cts.ucla.edu/pub/oops-200512/config-2.4.32 with 
>>> DEBUG_SLAB defined and "pci=noacpi" passed in on the command line.
>>>
>>> The config with HIGHIO disabled and ACPI=y has not been tested.

CONFIG_SMP at least sets CONFIG_ACPI_BOOT. Do you still have the boot 
messages somewhere (dmesg)? I'd be interested in the difference between 
IOAPIC PCI routing entries between pci=noacpi and normal boot.
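
If the boot messages from both runs were saved to files, the interrupt-related lines could be isolated and compared roughly as follows. This is only a sketch: the helper name is hypothetical, and the `irq|apic` filter is an assumption about which lines are relevant, so it may need tightening for a particular dmesg:

```shell
#!/bin/sh
# Extract interrupt-related lines from two saved dmesg captures and diff
# them. Returns 0 when the filtered lines are identical, 1 when they
# differ, and 2 if the temporary files could not be created.
diff_irq_lines() {
    a=$(mktemp) && b=$(mktemp) || return 2
    grep -i -E 'irq|apic' "$1" > "$a"
    grep -i -E 'irq|apic' "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return $rc
}
```

It would be called as `diff_irq_lines dmesg-normal.txt dmesg-noacpi.txt`, where both file names are placeholders for captures made right after each boot.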

>> Thanks for the precision. So logically we should expect it to break 
>> sooner or later ?
> 
> It is the same .config as one that crashed before, except that it has 
> DEBUG_SLAB defined.  If it does not crash, then adding pci=noacpi to the 
> command fixes the problem for me.

Hmm, I'm not fully convinced yet, but I'm glad that it has been a bit
more stable for you.

Sidenote: We boot our systems having built-in AIC7* SCSI on moderately 
cheap motherboards with "bad" interrupt routing using pci=noacpi on 
2.4.x kernels to evade instability.

I suggest that if you experience more problems using this setup _and_ 
would like to continue debugging the issue, we take this off-list into a 
private discussion.

[Another thing which would be interesting to test regarding the HIGHIO 
setting is a RedHat based 2.4.x kernel, since according to some SCSI 
driver's documentation, RedHat had a different HIGHIO convention.]

Thanks for your feedback,
Roberto Nibali, ratz
-- 
echo '[q]sa[ln0=aln256%Pln256/snlbx]sb3135071790101768542287578439snlbxq' | dc


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-16 20:52                                         ` Roberto Nibali
@ 2006-01-16 21:32                                           ` Chris Stromsoe
  0 siblings, 0 replies; 40+ messages in thread
From: Chris Stromsoe @ 2006-01-16 21:32 UTC (permalink / raw)
  To: Roberto Nibali; +Cc: Willy TARREAU, Alan Cox, Marcelo Tosatti, linux-kernel

On Mon, 16 Jan 2006, Roberto Nibali wrote:

>>> Thanks for the precision. So logically we should expect it to break 
>>> sooner or later ?
>> 
>> It is the same .config as one that crashed before, except that it has 
>> DEBUG_SLAB defined.  If it does not crash, then adding pci=noacpi to 
>> the command fixes the problem for me.
>
> Hmm, I'm not fully convinced yet, but I'm glad that it has been a bit
> more stable for you.

The stability only lasted for a week.  Last night I got another bad pmd 
message, an oops, and a hang.  I was not able to capture the oops.

> Sidenote: We boot our systems having built-in AIC7* SCSI on moderately 
> cheap motherboards with "bad" interrupt routing using pci=noacpi on 
> 2.4.x kernels to evade instability.
>
> I suggest that if you experience more problems using this setup _and_ 
> would like to continue debugging the issue, we take this off-list into a 
> private discussion.

At this point, I'm going to stick with 2.6.  If I get more time to debug 
this later, I'll drop back down to the modified 2.4 with HIGHIO disabled.

> [Another thing which would be interesting to test regarding the HIGHIO 
> setting is a RedHat based 2.4.x kernel, since according to some SCSI 
> driver's documentation, RedHat had a different HIGHIO convention.]

Thanks.  I'll keep that on my list of things to try if I ever get back to 
this.  I appreciate the pointers.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-01-15 22:54                                       ` Chris Stromsoe
  2006-01-16 20:52                                         ` Roberto Nibali
@ 2006-02-08  6:32                                         ` Chris Stromsoe
  2006-02-08  6:37                                           ` Willy Tarreau
  1 sibling, 1 reply; 40+ messages in thread
From: Chris Stromsoe @ 2006-02-08  6:32 UTC (permalink / raw)
  To: Willy TARREAU; +Cc: Roberto Nibali, Alan Cox, Marcelo Tosatti, linux-kernel

On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> On Sun, 15 Jan 2006, Willy TARREAU wrote:
>> 
>> Thanks for the precision. So logically we should expect it to break 
>> sooner or later ?
>
> It is the same .config as one that crashed before, except that it has 
> DEBUG_SLAB defined.  If it does not crash, then adding pci=noacpi to the 
> command fixes the problem for me.

For what it's worth, I'm fairly certain at this point that the problem was 
hardware related.  After a week of uptime with 2.6 we had another pmd 
error and oops.  We then replaced the system board and one of the CPUs and 
have not seen any problems since.


-Chris


* Re: bad pmd filemap.c, oops; 2.4.30 and 2.4.32
  2006-02-08  6:32                                         ` Chris Stromsoe
@ 2006-02-08  6:37                                           ` Willy Tarreau
  0 siblings, 0 replies; 40+ messages in thread
From: Willy Tarreau @ 2006-02-08  6:37 UTC (permalink / raw)
  To: Chris Stromsoe; +Cc: Roberto Nibali, Alan Cox, Marcelo Tosatti, linux-kernel

On Tue, Feb 07, 2006 at 10:32:45PM -0800, Chris Stromsoe wrote:
> On Sun, 15 Jan 2006, Chris Stromsoe wrote:
> >On Sun, 15 Jan 2006, Willy TARREAU wrote:
> >>
> >>Thanks for the precision. So logically we should expect it to break 
> >>sooner or later ?
> >
> >It is the same .config as one that crashed before, except that it has 
> >DEBUG_SLAB defined.  If it does not crash, then adding pci=noacpi to the 
> >command fixes the problem for me.
> 
> For what it's worth, I'm fairly certain at this point that the problem 
> was hardware related.  After a week of uptime with 2.6 we had another pmd 
> error and oops.  We then replaced the system board and one of the CPUs 
> and have not seen any problems since.

Chris, thank you very much for this useful feedback. Now we're sure that
it's not worth investigating the aic7xxx driver for a potential
memory corruption bug.

> -Chris

Regards,
Willy


