linux-kernel.vger.kernel.org archive mirror
* Re: linux-2.4.0 scsi problems on NetFinity servers
@ 2001-01-10 13:58 Ken Brunsen/Iris
  2001-01-10 15:32 ` Problem with 2.4.0 agpgart on Dell D4100 (probably) Intel i815 Charles McLachlan
  0 siblings, 1 reply; 11+ messages in thread
From: Ken Brunsen/Iris @ 2001-01-10 13:58 UTC (permalink / raw)
  To: JP Navarro <navarro; +Cc: linux-kernel

Ok, on a suggestion from JP, I've built the 2.4.0 kernel without SMP
support.  I did, however, have to make one change to a header file to get
it to compile the kernel and here is a diff

diff include/linux/linux/kernel_stat.h include/linux/linux/kernel_stat.h.sv
48d47
< #ifdef CONFIG_SMP
50d48
< #endif

The code was only for stats, but did not have the appropriate wrapper
around a for-loop clause to access an SMP only variable.
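
As an illustration of the kind of guard described above, here is a minimal
sketch that wraps a per-CPU summation loop in #ifdef CONFIG_SMP so that a
uniprocessor build never touches the SMP-only variables. Every name in it
(NR_CPUS_SKETCH, irq_counts, smp_cpu_count, cpu_map, total_irqs) is a
hypothetical stand-in; it is not the actual contents of kernel_stat.h.

```c
/* Uncomment to emulate an SMP build; the real kernel sets this from its config. */
/* #define CONFIG_SMP 1 */

#define NR_CPUS_SKETCH 4

/* Hypothetical per-CPU counters; slot 0 is the only CPU on a UP build. */
static unsigned int irq_counts[NR_CPUS_SKETCH] = { 10, 20, 30, 40 };

#ifdef CONFIG_SMP
/* These exist only when SMP support is compiled in. */
static int smp_cpu_count = NR_CPUS_SKETCH;
static int cpu_map[NR_CPUS_SKETCH] = { 0, 1, 2, 3 };
#endif

/* Sum the counters without referencing SMP-only symbols on a UP build. */
unsigned int total_irqs(void)
{
    unsigned int sum = 0;
#ifdef CONFIG_SMP
    for (int i = 0; i < smp_cpu_count; i++)
        sum += irq_counts[cpu_map[i]];
#else
    sum = irq_counts[0];
#endif
    return sum;
}
```

Without the guard, the loop would refer to smp_cpu_count and cpu_map, which
do not exist in the uniprocessor configuration, and the build would fail.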

So, with SMP support off, I started my tests and then headed home for the
evening.  This morning when I arrived, I found my machine had crashed on
the 2nd run of my copy test, but with a little bit of a different crash.
First I'm getting multiple messages of the type

I/O error:  dev 08:01

and then I'm getting messages of the type

EXT2-fs error (device sd(8,1)): read_inode_bitmap: Cannot read inode bitmap
- block_group = 34, inode_bitmap = 1114113
EXT2-fs error (device sd(8,1)): ext2_write_inode: unable to read inode
block - inode=229832, block=458756

where the numbers for the inodes, blocks, bitmaps, and groups vary; also
attempting to run any process in my root tty (even just reboot) results in
a segv and the I/O error messages some more (of note, the only filesystem
that gets trashed, ever, is the one with the test running on it, the root
partition is separate and never is affected, nor is any other partition).
So, although SMP may aggravate the situation, it is not, apparently, the
cause of the problem.
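
For readers matching up the two message styles above: "dev 08:01" and
"sd(8,1)" both name major 8, minor 1. The sketch below decodes that pair
under the conventional SCSI disk layout for major 8 (16 minors per disk,
minor 0 of each group being the whole disk). The helper names sd_decode and
sd_name are made up for this example.

```c
#include <stddef.h>
#include <stdio.h>

struct sd_dev {
    unsigned int disk;  /* 0 = sda, 1 = sdb, ... */
    unsigned int part;  /* 0 = whole disk, 1..15 = partition */
};

/* Split a minor number on major 8 into disk index and partition number. */
int sd_decode(unsigned int major, unsigned int minor, struct sd_dev *out)
{
    if (major != 8 || minor > 255 || out == NULL)
        return -1;
    out->disk = minor >> 4;
    out->part = minor & 0x0f;
    return 0;
}

/* Write a name such as "sda1" into buf; returns 0 on success. */
int sd_name(unsigned int major, unsigned int minor, char *buf, size_t len)
{
    struct sd_dev d;

    if (sd_decode(major, minor, &d) != 0)
        return -1;
    if (d.part == 0)
        snprintf(buf, len, "sd%c", (int)('a' + d.disk));
    else
        snprintf(buf, len, "sd%c%u", (int)('a' + d.disk), d.part);
    return 0;
}
```

Under that layout, 08:01 decodes to the first partition of the first SCSI
disk (sda1).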

BTW:  this test case basically destroys my test filesystem such that I've
taken to creating a new fs on the partition each time - I tried fixing it
twice with fsck, and after 2 days it had still not completed the fixup in
each case.

thanks

kenbo

______________________
Firebirds rule, `stangs serve!

Kenneth "kenbo" Brunsen
Iris Associates

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/


* Problem with 2.4.0 agpgart on Dell D4100 (probably) Intel i815
  2001-01-10 13:58 linux-2.4.0 scsi problems on NetFinity servers Ken Brunsen/Iris
@ 2001-01-10 15:32 ` Charles McLachlan
  2001-01-10 18:56   ` Jeff Hartmann
  0 siblings, 1 reply; 11+ messages in thread
From: Charles McLachlan @ 2001-01-10 15:32 UTC (permalink / raw)
  To: linux-kernel


(The ultimate cause of what I'm about to tell you may well be a chipset
problem, but I think I've uncovered a tiny bit of kernel weirdness none
the less)

Using 2.4.0

modprobe agpgart.o

/var/log/messages says
> Jan 10 14:11:56 x kernel: Linux agpgart interface v0.99 (c) Jeff
> Hartmann
> Jan 10 14:11:56 x kernel: agpgart: Maximum main memory to use for
> agp memory: 439M
> Jan 10 14:11:56 x kernel: agpgart: agpgart: Detected an Intel
> i815, but could not find the secondary device.

lspci says
> 00:00.0 Host bridge: Intel Corporation: Unknown device 1130 (rev 02)
> 00:01.0 PCI bridge: Intel Corporation: Unknown device 1131 (rev 02)
> 00:1e.0 PCI bridge: Intel Corporation: Unknown device 244e (rev 02)
> 00:1f.0 ISA bridge: Intel Corporation: Unknown device 2440 (rev 02)
> ...

http://www.datashopper.dk/~finth/pci.html says
> 1130h	82815 i815 (Solano) Host to Hub Bridge (Fully featured chipset)
> 1131h	82815 i815 (Solano) PCI to AGP Bridge
> 1132h	82815 i815 (Solano) Interal GUI Accelerator

I take it "Interal GUI Accelerator" is a built in graphics card (that I
don't have)

/usr/src/linux/drivers/char/agp/agp.h says (amongst other things)
> #define PCI_DEVICE_ID_INTEL_815_0       0x1130
> ...
> #define PCI_DEVICE_ID_INTEL_815_1       0x1132

/usr/src/linux/drivers/char/agp/agpgart_be.c says
> case PCI_DEVICE_ID_INTEL_815_0:
>		   /* The i815 can operate either as an i810 style
>		    * integrated device, or as an AGP4X motherboard.
>		    *
>		    * This only addresses the first mode:
>		    */
>
>	i810_dev = pci_find_device(PCI_VENDOR_ID_INTEL,
>				PCI_DEVICE_ID_INTEL_815_1,
>						   NULL);

It is this call that is failing and causing the error message.
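
A rough userspace sketch of why the lookup comes back empty: it searches only
the four device IDs visible in the lspci excerpt above (the real list was
longer, so this is an assumption about the rest) for the ID that agp.h calls
PCI_DEVICE_ID_INTEL_815_1. The struct, the table and find_id are hypothetical
and only mimic what a vendor/device match does; they are not the kernel's
pci_find_device.

```c
#include <stddef.h>
#include <stdint.h>

#define VENDOR_INTEL 0x8086u  /* Intel's PCI vendor ID */

struct pci_id {
    uint16_t vendor;
    uint16_t device;
};

/* Only the devices shown in the lspci excerpt; the elided ones are unknown. */
static const struct pci_id seen_devices[] = {
    { VENDOR_INTEL, 0x1130 },  /* host bridge */
    { VENDOR_INTEL, 0x1131 },  /* PCI bridge */
    { VENDOR_INTEL, 0x244e },  /* PCI bridge */
    { VENDOR_INTEL, 0x2440 },  /* ISA bridge */
};
static const size_t seen_count = sizeof seen_devices / sizeof seen_devices[0];

/* Return the first entry matching vendor and device, or NULL if none does. */
const struct pci_id *find_id(const struct pci_id *list, size_t n,
                             uint16_t vendor, uint16_t device)
{
    for (size_t i = 0; i < n; i++) {
        if (list[i].vendor == vendor && list[i].device == device)
            return &list[i];
    }
    return NULL;
}
```

Searching this table for 0x1132 finds nothing, while 0x1131 (the PCI to AGP
bridge) is present.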

Questions:
Why don't the PCI ids match in agp.h and lspci? Which one is right?
Is my i815 acting as a "AGP4X motherboard"?
If so does anyone have any suggestions as to how I get it to work?

My BIOS doesn't have many settings for jiggering around with the AGP
stuff, although it does say "AGP 4x" in big letters.

When I alter agp.h to have the "right" PCI id, then /var/log/messages
says:

> Jan 10 14:25:45 x kernel: Linux agpgart interface v0.99 (c) Jeff
> Hartmann
> Jan 10 14:25:45 x kernel: Linux agpgart interface v0.99 (c) Jeff
> agp memory: 439M
> Jan 10 14:25:45 x kernel: agpgart: agpgart: Detected an Intel
> i815 Chipset.
> Jan 10 14:25:45 x kernel: agpgart: i810 is disabled
> Jan 10 14:25:45 x kernel: agpgart: unable to detrimine aperture
> size.

The nasty bit in this case is:

> pci_read_config_dword(agp_bridge.dev, I810_SMRAM_MISCC, &smram_miscc);
>
> if ((smram_miscc & I810_GMS) == I810_GMS_DISABLE) {
>		printk(KERN_WARNING PFX "i810 is disabled\n");
>		return 0;
>	}

smram_miscc comes out as 0xa82800c whereas I810_GMS is 0xc0

This made me think that my i815 is *not* "acting like an i810" but I
carried on bodging anyway.
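
To spell out the arithmetic behind the "i810 is disabled" message, here is a
small standalone version of the mask test. I810_GMS is the 0xc0 quoted above;
I810_GMS_DISABLE is assumed to be 0 here, since its value is not quoted in
this message. The function name gms_disabled is made up for the example.

```c
#include <stdint.h>

#define I810_GMS         0x000000c0u  /* mask value quoted in the message */
#define I810_GMS_DISABLE 0x00000000u  /* assumed value, not quoted above */

/* Returns 1 when the GMS field of the register reads as "disabled". */
int gms_disabled(uint32_t smram_miscc)
{
    return (smram_miscc & I810_GMS) == I810_GMS_DISABLE;
}
```

With smram_miscc = 0x0a82800c the low byte is 0x0c, so masking with 0xc0
leaves 0 and, under the assumption above, the check reports the integrated
graphics as disabled.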

I'm pretty sure my AGP aperture size is 64MB (that's what the BIOS reckons
anyway) so I commented out the GMS check, so that intel_i810_fetch_size
would return 64MB.

after modprobe agpgart I got a *lot* of messages like
> Jan 10 14:37:31 x kernel: io mapaddr 0x1fff4 not valid at
> agpgart_be.c:898!
> Jan 10 14:37:31 x kernel: io mapaddr 0x1fff8 not valid at
> agpgart_be.c:898!
> Jan 10 14:37:31 x kernel: io mapaddr 0x1fffc not valid at
> agpgart_be.c:89

line 898 does an OUTREG32 on some i810 private registers, which (I think)
is more evidence that my chipset is not acting like an i810.

Then the (rather worrying)
> Jan 10 14:37:31 x kernel: agpgart: AGP aperture is 64M @ 0x0

So far so bad. I then insmodded my Nvidia module and started X
> Jan 10 14:38:30 herschel kernel: NVRM: Intel i810 AGP chipset
> Jan 10 14:38:30 herschel kernel: mtrr: type mismatch for 0000,4000000
> old: write-back new: write-combining
> Jan 10 14:38:30 herschel kernel: NVRM: error: unable to set mtrr
> write-combining
> Jan 10 14:38:30 herschel kernel: NVRM: error: unable to remap aperture

Which isn't very good, although X actually did run and didn't hose my
machine, as I was half expecting.

Does anyone have any idea what is going on?

Charlie - Queens' College - Cavendish Astrophysics - 07866 636318



* Re: Problem with 2.4.0 agpgart on Dell D4100 (probably) Intel i815
  2001-01-10 15:32 ` Problem with 2.4.0 agpgart on Dell D4100 (probably) Intel i815 Charles McLachlan
@ 2001-01-10 18:56   ` Jeff Hartmann
  0 siblings, 0 replies; 11+ messages in thread
From: Jeff Hartmann @ 2001-01-10 18:56 UTC (permalink / raw)
  To: Charles McLachlan; +Cc: linux-kernel

Charles McLachlan wrote:

> (The ultimate cause of what I'm about to tell you may well be a chipset
> problem, but I think I've uncovered a tiny bit of kernel weirdness none
> the less)
<snip>

>
> Does anyone have any idea what is going on?
> 
> Charlie - Queens' College - Cavendish Astrophysics - 07866 636318

Don't compile in I810/I815 Graphics support into the kernel.  Just 
compile the Intel 440LX/BX/GX/815/840/850 support.  That should make 
everything work fine.

-Jeff


* Re: linux-2.4.0 scsi problems on NetFinity servers
@ 2001-01-11 20:00 kenbo
  0 siblings, 0 replies; 11+ messages in thread
From: kenbo @ 2001-01-11 20:00 UTC (permalink / raw)
  To: JP Navarro; +Cc: linux-kernel, Tim Wright

I sent an email to the NetFinity mailing list, and here is their response
after they started testing with 2.4.0.  FYI: I've tried all suggestions
(non-SMP, flag at boot time,...) and none of them have worked yet; I did
see that someone thought they had found an nfs bug and posted a patch for
it, so I'm gonna patch and test next.

Still looking into it )

Thanks!

kenbo

______________________
Firebirds rule, `stangs serve!

Kenneth "kenbo" Brunsen
Iris Associates
----- Forwarded by Ken Brunsen/Iris on 01/11/01 02:37 PM -----

From:    "ServeRAID For Linux" <ipslinux@us.ibm.com>
To:      kenbo@iris.com
cc:
Subject: Re: linux-2.4.0 scsi problems on NetFinity servers
Date:    01/11/01 12:38 PM

We have been able to reproduce this problem in our lab (using ServeRAID).
We have also seen this same lockup at least once in a system which did
not contain any ServeRAID (only Adaptec).  Our engineers are
investigating this issue at this time and we are also notifying the Red Hat
engineers of the problem.

Thanks for bringing this to our attention.





* Re: linux-2.4.0 scsi problems on NetFinity servers
@ 2001-01-11 13:54 kenbo
  0 siblings, 0 replies; 11+ messages in thread
From: kenbo @ 2001-01-11 13:54 UTC (permalink / raw)
  To: timw; +Cc: linux-kernel, JP Navarro


The problem I'm seeing must be different.  I tried your suggestion of
booting with nmi_watchdog=0, and I still see the same crashes.  I'm now in
the process of getting a SMP Dell to try and do the same testing.

Thanks!

kenbo

______________________
Firebirds rule, `stangs serve!

Kenneth "kenbo" Brunsen
Iris Associates


                                                                                                                           
From:    Tim Wright <timw@splhi.com>
To:      JP Navarro <navarro@mcs.anl.gov>
cc:      Ken Brunsen/Iris <kenbo@iris.com>, linux-kernel@vger.kernel.org
Subject: Re: linux-2.4.0 scsi problems on NetFinity servers
Date:    01/10/01 03:49 PM
Please respond to timw




Hmmm...
it's actually not quite that simple. The card on its own doesn't cause any
problems. It's when the NMI watchdog stuff is enabled that all hell breaks
loose at least on my 8500R. Basically, every CPU in the system gets hammered
with NMIs (1000's per second). The system is slower than it should be, and in
my case it hangs after ~45 minutes (~256,000 NMIs per cpu). Booting with
nmi_watchdog=0 makes the problem go away and the machine is stable, so there's
some kind of nasty interaction with the card.

It seems a little unlikely that this is related to SCSI problems, but I could
be wrong. Anyway, I am trying to find more information on the adapter to find
out where the problem may lie.

Regards,

Tim

On Tue, Jan 09, 2001 at 03:08:03PM -0600, JP Navarro wrote:
> One possibility:
>
> When we first tested 2.4.0-test8 on NetFinity 7000s we had random crashes,
> typically within an hour of booting. The problem was identified as a Wiseman
> Systems Management adapter generated hardware interrupt that 2.4 doesn't handle
> (this was not a problem with 2.2.x).
>
> If you have these adapters installed, remove them.
>
> JP Navarro
> --
> John-Paul Navarro                                           (630) 252-1233
> Mathematics & Computer Science Division
> Argonne National Laboratory                            navarro@mcs.anl.gov
> Argonne, IL 60439                          http://www.mcs.anl.gov/~navarro
>
>
> Ken Brunsen/Iris wrote:
> >
> > Hello all,
> >
> >      I've been sorta pulling the 2.4 kernel and testing with it now for
> > awhile on my IBM NetFinity 5500 and since the test12 I've been having a
> > continuous issue with crashing the OS during a pull of source code across
> > the network (>1Gb files).  I've been trying to figure out what it may be
> > related to, but I'm relatively new with debugging the kernel so thought I'd
> > see if y'all could help.  From looking at the archives, I did not see that
> > anyone else had been seeing these issues either.  Basically, I've got 2
> > different machines which I'm working with - a NetFinity Quad CPU 5500 M20
> > with 2Gb Ram and Raid and a NetFinity Dual CPU 5500 M10 with 1Gb Ram and
> > Raid.  Both machines exhibit the same behavior.  Initially, both machines
> > had RH 6.0, now one is RH 7.0 (and I know about the compiler issue) and the
> > other is SuSE 7.0.  I downloaded the 2.4.0 release and still got the issue,
> > so thought it was time to bring it here.  Here is a stack of one crash:
> >
> >      Started getting Scsi errors on controller during NFS transfer of >1Gb
> > worth of files
> >
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731256
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731264
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731272
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731280
> > .
> > .
> > .
> >
> >      (the sector varies from run to run, is never consistent), and then
> > kernel panics with the following
> >
> > (ips0) Resetting controller.
> > NMI Watchdog detected LOCKUP on CPU1, registers:
> > CPU: 1
> > EIP: 0010:[<c0246544>]
> > EFLAGS: 00000002
> > eax: 003e240   ebx: 000612b0  ecx: 5a21a2f5   edx: 00000063
> > esi: 00000004  edi: 00000000  ebp:f7de2a78    esp: f7ddbf00
> > ds: 0018  es: 0018  ss: 0018
> > Process scsi_eh_0 (pid: 8, stackpage=f7ddb000)
> > Stack:    000003e6 c0246587 000612b0 c02465f5 000612b0 c01df470 00418570
> > ffffffff
> >      f7de2a78 00000082 00000001 200012b0 f7ddbf36 000612b0 c01dfa7c
> > f7de2a78
> >      f7de2ab8 f7de2a78 f7db1400 f7de2ab8 c01dc4ae f7de2a78 c0296220
> > c0295c67
> > Call Trace: [<c0246587>] [<c02465f5>] [<c01df470>] [<c01dfa7c>]
> > [<c01dc4ae>]
> >      [<c01bda9c>] [<c01be1db>] [<c01be4e6>] [<c01074c4>]
> >
> > Code: 39 d8 72 f8 5b c3 89 f6 8b 44 24 04 eb 0e 8d b4 26 00 00 00
> > console shuts up ...
> >
> > Thinking it could be memory related - since I see the Cache fill up and the
> > system go to just over 1mb free prior to crash - i disabled highmem
> > support.  I then disabled NFSv3 and automounter v4 support, jic.  In the
> > last test, I disabled swap - since one thing I've noticed is that the 2.4
> > kernel never touches my swap at all.  None of these changes have affected
> > the outcome; the closest I've gotten is by continually doing "sync" in
> > another window which sometimes keeps it from crashing on a run, although
> > I'll still end up with a few of the SCSI disk error messages (although not
> > nearly as many as I get before a failure).  Since this happens on multiple
> > machines, I do not believe it is.  We're also seeing failures of this same
> > type when we try to do heavy database loading on the machine, ie., intense
> > disk accesses.  Any help would be greatly appreciated, as we are really
> > needing to get this 2.4 kernel working
> >
> > Since I only get the archive list, please CC me with any responses!
> >
> > Thanks!
> >
> > kenbo
> >
> > ______________________
> > Firebirds rule, `stangs serve!
> >
> > Kenneth "kenbo" Brunsen
> > Iris Associates

--
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI





* Re: linux-2.4.0 scsi problems on NetFinity servers
  2001-01-09 21:08 ` JP Navarro
  2001-01-09 21:36   ` Miles Lane
@ 2001-01-10 20:49   ` Tim Wright
  1 sibling, 0 replies; 11+ messages in thread
From: Tim Wright @ 2001-01-10 20:49 UTC (permalink / raw)
  To: JP Navarro; +Cc: Ken Brunsen/Iris, linux-kernel

Hmmm...
it's actually not quite that simple. The card on its own doesn't cause any
problems. It's when the NMI watchdog stuff is enabled that all hell breaks
loose at least on my 8500R. Basically, every CPU in the system gets hammered
with NMIs (1000's per second). The system is slower than it should be, and in
my case it hangs after ~45 minutes (~256,000 NMIs per cpu). Booting with
nmi_watchdog=0 makes the problem go away and the machine is stable, so there's
some kind of nasty interaction with the card.
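
As a back-of-the-envelope check on the figures in this paragraph, the helper
below turns a total count and a run time into an average rate. The function
name is made up for the example, and treating ~256,000 as the total per-CPU
count over the whole ~45 minutes is an assumption about how the numbers were
read.

```c
/* Average events per second for a total count over a duration in minutes. */
double avg_rate_per_sec(double total, double minutes)
{
    if (minutes <= 0.0)
        return 0.0;
    return total / (minutes * 60.0);
}
```

Under that reading, 256,000 NMIs over 45 minutes averages to a little under
100 per second per CPU, so the "1000's per second" figure may describe bursts
or the whole system rather than a steady per-CPU rate; the message does not
say which.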

It seems a little unlikely that this is related to SCSI problems, but I could
be wrong. Anyway, I am trying to find more information on the adapter to find
out where the problem may lie.

Regards,

Tim

On Tue, Jan 09, 2001 at 03:08:03PM -0600, JP Navarro wrote:
> One possibility:
> 
> When we first tested 2.4.0-test8 on NetFinity 7000s we had random crashes,
> typically within an hour of booting. The problem was identified as a Wiseman
> Systems Management adapter generated hardware interrupt that 2.4 doesn't handle
> (this was not a problem with 2.2.x).
> 
> If you have these adapters installed, remove them.
> 
> JP Navarro
> -- 
> John-Paul Navarro                                           (630) 252-1233
> Mathematics & Computer Science Division
> Argonne National Laboratory                            navarro@mcs.anl.gov
> Argonne, IL 60439                          http://www.mcs.anl.gov/~navarro
> 
> 
> Ken Brunsen/Iris wrote:
> > 
> > Hello all,
> > 
> >      I've been sorta pulling the 2.4 kernel and testing with it now for
> > awhile on my IBM NetFinity 5500 and since the test12 I've been having a
> > continuous issue with crashing the OS during a pull of source code across
> > the network (>1Gb files).  I've been trying to figure out what it may be
> > related to, but I'm relatively new with debugging the kernel so thought I'd
> > see if y'all could help.  From looking at the archives, I did not see that
> > anyone else had been seeing these issues either.  Basically, I've got 2
> > different machines which I'm working with - a NetFinity Quad CPU 5500 M20
> > with 2Gb Ram and Raid and a NetFinity Dual CPU 5500 M10 with 1Gb Ram and
> > Raid.  Both machines exhibit the same behavior.  Initially, both machines
> > had RH 6.0, now one is RH 7.0 (and I know about the compiler issue) and the
> > other is SuSE 7.0.  I downloaded the 2.4.0 release and still got the issue,
> > so thought it was time to bring it here.  Here is a stack of one crash:
> > 
> >      Started getting Scsi errors on controller during NFS transfer of >1Gb
> > worth of files
> > 
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731256
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731264
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731272
> > SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> > I/O error: dev 08:05, sector 31731280
> > .
> > .
> > .
> > 
> >      (the sector varies from run to run, is never consistent), and then
> > kernel panics with the following
> > 
> > (ips0) Resetting controller.
> > NMI Watchdog detected LOCKUP on CPU1, registers:
> > CPU: 1
> > EIP: 0010:[<c0246544>]
> > EFLAGS: 00000002
> > eax: 003e240   ebx: 000612b0  ecx: 5a21a2f5   edx: 00000063
> > esi: 00000004  edi: 00000000  ebp:f7de2a78    esp: f7ddbf00
> > ds: 0018  es: 0018  ss: 0018
> > Process scsi_eh_0 (pid: 8, stackpage=f7ddb000)
> > Stack:    000003e6 c0246587 000612b0 c02465f5 000612b0 c01df470 00418570
> > ffffffff
> >      f7de2a78 00000082 00000001 200012b0 f7ddbf36 000612b0 c01dfa7c
> > f7de2a78
> >      f7de2ab8 f7de2a78 f7db1400 f7de2ab8 c01dc4ae f7de2a78 c0296220
> > c0295c67
> > Call Trace: [<c0246587>] [<c02465f5>] [<c01df470>] [<c01dfa7c>]
> > [<c01dc4ae>]
> >      [<c01bda9c>] [<c01be1db>] [<c01be4e6>] [<c01074c4>]
> > 
> > Code: 39 d8 72 f8 5b c3 89 f6 8b 44 24 04 eb 0e 8d b4 26 00 00 00
> > console shuts up ...
> > 
> > Thinking it could be memory related - since I see the Cache fill up and the
> > system go to just over 1mb free prior to crash - i disabled highmem
> > support.  I then disabled NFSv3 and automounter v4 support, jic.  In the
> > last test, I disabled swap - since one thing I've noticed is that the 2.4
> > kernel never touches my swap at all.  None of these changes have affected
> > the outcome; the closest I've gotten is by continually doing "sync" in
> > another window which sometimes keeps it from crashing on a run, although
> > I'll still end up with a few of the SCSI disk error messages (although not
> > nearly as many as I get before a failure).  Since this happens on multiple
> > machines, I do not believe it is.  We're also seeing failures of this same
> > type when we try to do heavy database loading on the machine, ie., intense
> > disk accesses.  Any help would be greatly appreciated, as we are really
> > needing to get this 2.4 kernel working
> > 
> > Since I only get the archive list, please CC me with any responses!
> > 
> > Thanks!
> > 
> > kenbo
> > 
> > ______________________
> > Firebirds rule, `stangs serve!
> > 
> > Kenneth "kenbo" Brunsen
> > Iris Associates

-- 
Tim Wright - timw@splhi.com or timw@aracnet.com or twright@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI

* Re: linux-2.4.0 scsi problems on NetFinity servers
  2001-01-09 21:36   ` Miles Lane
@ 2001-01-09 22:22     ` JP Navarro
  0 siblings, 0 replies; 11+ messages in thread
From: JP Navarro @ 2001-01-09 22:22 UTC (permalink / raw)
  To: Miles Lane; +Cc: Ken Brunsen/Iris, linux-kernel

Miles Lane wrote:
...
> Are you saying that this is a hardware bug that is impossible to
> develop a work-around for in the kernel?  If this is just a bug,
> shouldn't we try to fix it rather than avoid it?

This is hardware behaving as designed but not supported by the kernel. IBM was
aware of the problem and working on a solution. Since the offending hardware is
a PCI card that is useless under Linux, the simple solution is to remove it.

If IBM wants these cards to work with Linux they should do a lot more than
supply patches that keep the kernel from crashing.  At a minimum, publish specs
so someone else can patch the kernel and write drivers to make full use of the
card's features under Linux. We're still hoping.

> If you have detailed information about the interrupt problem,
> perhaps you could send it to the list and see if a fix is possible.

Wish I could have. Our machines would totally freeze.

JP Navarro

* Re: linux-2.4.0 scsi problems on NetFinity servers
  2001-01-09 21:08 ` JP Navarro
@ 2001-01-09 21:36   ` Miles Lane
  2001-01-09 22:22     ` JP Navarro
  2001-01-10 20:49   ` Tim Wright
  1 sibling, 1 reply; 11+ messages in thread
From: Miles Lane @ 2001-01-09 21:36 UTC (permalink / raw)
  To: JP Navarro; +Cc: Ken Brunsen/Iris, linux-kernel

JP Navarro wrote:

> One possibility:
> 
> When we first tested 2.4.0-test8 on NetFinity 7000s we had random crashes,
> typically within an hour of booting. The problem was identified as a Wiseman
> Systems Management adapter generated hardware interrupt that 2.4 doesn't handle
> (this was not a problem with 2.2.x).
> 
> If you have these adapters installed, remove them.

Are you saying that this is a hardware bug that is impossible to
develop a work-around for in the kernel?  If this is just a bug,
shouldn't we try to fix it rather than avoid it?

If you have detailed information about the interrupt problem,
perhaps you could send it to the list and see if a fix is possible.

	Miles


* Re: linux-2.4.0 scsi problems on NetFinity servers
@ 2001-01-09 21:27 Ken Brunsen/Iris
  0 siblings, 0 replies; 11+ messages in thread
From: Ken Brunsen/Iris @ 2001-01-09 21:27 UTC (permalink / raw)
  To: JP Navarro <navarro; +Cc: linux-kernel


We had that problem in the early 2.4 kernel as well and disabled the
adapters also.  Sorry I forgot to mention that.  The issue almost seems to
be load related, as under light use, we see no issues, it's only when we
push it (such as a very fast network copy of > 1Gb files, or heavy database
usage) that we hit the problem.  With no load, the 2.4 kernel stays up
fine.

Thanks though.

Next :-)

kenbo
______________________
Firebirds rule, `stangs serve!

Kenneth "kenbo" Brunsen
Iris Associates


                                                                                                                           
From:    JP Navarro <navarro@mcs.anl.gov>
To:      Ken Brunsen/Iris <kenbo@iris.com>
cc:      linux-kernel@vger.kernel.org
Subject: Re: linux-2.4.0 scsi problems on NetFinity servers
Date:    01/09/01 04:08 PM




One possibility:

When we first tested 2.4.0-test8 on NetFinity 7000s we had random crashes,
typically within an hour of booting. The problem was identified as a Wiseman
Systems Management adapter generated hardware interrupt that 2.4 doesn't handle
(this was not a problem with 2.2.x).

If you have these adapters installed, remove them.

JP Navarro
--
John-Paul Navarro                                           (630) 252-1233
Mathematics & Computer Science Division
Argonne National Laboratory                            navarro@mcs.anl.gov
Argonne, IL 60439                          http://www.mcs.anl.gov/~navarro


Ken Brunsen/Iris wrote:
>
> Hello all,
>
>      I've been sorta pulling the 2.4 kernel and testing with it now for
> awhile on my IBM NetFinity 5500 and since the test12 I've been having a
> continuous issue with crashing the OS during a pull of source code across
> the network (>1Gb files).  I've been trying to figure out what it may be
> related to, but I'm relatively new with debugging the kernel so thought I'd
> see if y'all could help.  From looking at the archives, I did not see that
> anyone else had been seeing these issues either.  Basically, I've got 2
> different machines which I'm working with - a NetFinity Quad CPU 5500 M20
> with 2Gb Ram and Raid and a NetFinity Dual CPU 5500 M10 with 1Gb Ram and
> Raid.  Both machines exhibit the same behavior.  Initially, both machines
> had RH 6.0, now one is RH 7.0 (and I know about the compiler issue) and the
> other is SuSE 7.0.  I downloaded the 2.4.0 release and still got the issue,
> so thought it was time to bring it here.  Here is a stack of one crash:
>
>      Started getting Scsi errors on controller during NFS transfer of >1Gb
> worth of files
>
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731256
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731264
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731272
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731280
> .
> .
> .
>
>      (the sector varies from run to run, is never consistent), and then
> kernel panics with the following
>
> (ips0) Resetting controller.
> NMI Watchdog detected LOCKUP on CPU1, registers:
> CPU: 1
> EIP: 0010:[<c0246544>]
> EFLAGS: 00000002
> eax: 003e240   ebx: 000612b0  ecx: 5a21a2f5   edx: 00000063
> esi: 00000004  edi: 00000000  ebp:f7de2a78    esp: f7ddbf00
> ds: 0018  es: 0018  ss: 0018
> Process scsi_eh_0 (pid: 8, stackpage=f7ddb000)
> Stack:    000003e6 c0246587 000612b0 c02465f5 000612b0 c01df470 00418570
> ffffffff
>      f7de2a78 00000082 00000001 200012b0 f7ddbf36 000612b0 c01dfa7c
> f7de2a78
>      f7de2ab8 f7de2a78 f7db1400 f7de2ab8 c01dc4ae f7de2a78 c0296220
> c0295c67
> Call Trace: [<c0246587>] [<c02465f5>] [<c01df470>] [<c01dfa7c>]
> [<c01dc4ae>]
>      [<c01bda9c>] [<c01be1db>] [<c01be4e6>] [<c01074c4>]
>
> Code: 39 d8 72 f8 5b c3 89 f6 8b 44 24 04 eb 0e 8d b4 26 00 00 00
> console shuts up ...
>
> Thinking it could be memory related - since I see the Cache fill up and the
> system go to just over 1mb free prior to crash - I disabled highmem
> support.  I then disabled NFSv3 and automounter v4 support, jic.  In the
> last test, I disabled swap - since one thing I've noticed is that the 2.4
> kernel never touches my swap at all.  None of these changes have affected
> the outcome; the closest I've gotten is by continually doing "sync" in
> another window which sometimes keeps it from crashing on a run, although
> I'll still end up with a few of the SCSI disk error messages (although not
> nearly as many as I get before a failure).  Since this happens on multiple
> machines, I do not believe it is.  We're also seeing failures of this same
> type when we try to do heavy database loading on the machine, i.e., intense
> disk accesses.  Any help would be greatly appreciated, as we are really
> needing to get this 2.4 kernel working
>
> Since I only get the archive list, please CC me with any responses!
>
> Thanks!
>
> kenbo
>
> ______________________
> Firebirds rule, `stangs serve!
>
> Kenneth "kenbo" Brunsen
> Iris Associates




-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: linux-2.4.0 scsi problems on NetFinity servers
  2001-01-09 20:50 Ken Brunsen/Iris
@ 2001-01-09 21:08 ` JP Navarro
  2001-01-09 21:36   ` Miles Lane
  2001-01-10 20:49   ` Tim Wright
  0 siblings, 2 replies; 11+ messages in thread
From: JP Navarro @ 2001-01-09 21:08 UTC (permalink / raw)
  To: Ken Brunsen/Iris; +Cc: linux-kernel

One possibility:

When we first tested 2.4.0-test8 on NetFinity 7000s we had random crashes,
typically within an hour of booting. The problem was identified as a Wiseman
Systems Management adapter generated hardware interrupt that 2.4 doesn't handle
(this was not a problem with 2.2.x).

If you have these adapters installed, remove them.

JP Navarro
-- 
John-Paul Navarro                                           (630) 252-1233
Mathematics & Computer Science Division
Argonne National Laboratory                            navarro@mcs.anl.gov
Argonne, IL 60439                          http://www.mcs.anl.gov/~navarro


Ken Brunsen/Iris wrote:
> 
> Hello all,
> 
>      I've been sorta pulling the 2.4 kernel and testing with it now for
> awhile on my IBM NetFinity 5500 and since the test12 I've been having a
> continuous issue with crashing the OS during a pull of source code across
> the network (>1Gb files).  I've been trying to figure out what it may be
> related to, but I'm relatively new with debugging the kernel so thought I'd
> see if y'all could help.  From looking at the archives, I did not see that
> anyone else had been seeing these issues either.  Basically, I've got 2
> different machines which I'm working with - a NetFinity Quad CPU 5500 M20
> with 2Gb Ram and Raid and a NetFinity Dual CPU 5500 M10 with 1Gb Ram and
> Raid.  Both machines exhibit the same behavior.  Initially, both machines
> had RH 6.0, now one is RH 7.0 (and I know about the compiler issue) and the
> other is SuSE 7.0.  I downloaded the 2.4.0 release and still got the issue,
> so thought it was time to bring it here.  Here is a stack of one crash:
> 
>      Started getting Scsi errors on controller during NFS transfer of >1Gb
> worth of files
> 
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731256
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731264
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731272
> SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
> I/O error: dev 08:05, sector 31731280
> .
> .
> .
> 
>      (the sector varies from run to run, is never consistent), and then
> kernel panics with the following
> 
> (ips0) Resetting controller.
> NMI Watchdog detected LOCKUP on CPU1, registers:
> CPU: 1
> EIP: 0010:[<c0246544>]
> EFLAGS: 00000002
> eax: 003e240   ebx: 000612b0  ecx: 5a21a2f5   edx: 00000063
> esi: 00000004  edi: 00000000  ebp:f7de2a78    esp: f7ddbf00
> ds: 0018  es: 0018  ss: 0018
> Process scsi_eh_0 (pid: 8, stackpage=f7ddb000)
> Stack:    000003e6 c0246587 000612b0 c02465f5 000612b0 c01df470 00418570
> ffffffff
>      f7de2a78 00000082 00000001 200012b0 f7ddbf36 000612b0 c01dfa7c
> f7de2a78
>      f7de2ab8 f7de2a78 f7db1400 f7de2ab8 c01dc4ae f7de2a78 c0296220
> c0295c67
> Call Trace: [<c0246587>] [<c02465f5>] [<c01df470>] [<c01dfa7c>]
> [<c01dc4ae>]
>      [<c01bda9c>] [<c01be1db>] [<c01be4e6>] [<c01074c4>]
> 
> Code: 39 d8 72 f8 5b c3 89 f6 8b 44 24 04 eb 0e 8d b4 26 00 00 00
> console shuts up ...
> 
> Thinking it could be memory related - since I see the Cache fill up and the
> system go to just over 1mb free prior to crash - I disabled highmem
> support.  I then disabled NFSv3 and automounter v4 support, jic.  In the
> last test, I disabled swap - since one thing I've noticed is that the 2.4
> kernel never touches my swap at all.  None of these changes have affected
> the outcome; the closest I've gotten is by continually doing "sync" in
> another window which sometimes keeps it from crashing on a run, although
> I'll still end up with a few of the SCSI disk error messages (although not
> nearly as many as I get before a failure).  Since this happens on multiple
> machines, I do not believe it is.  We're also seeing failures of this same
> type when we try to do heavy database loading on the machine, i.e., intense
> disk accesses.  Any help would be greatly appreciated, as we are really
> needing to get this 2.4 kernel working
> 
> Since I only get the archive list, please CC me with any responses!
> 
> Thanks!
> 
> kenbo
> 
> ______________________
> Firebirds rule, `stangs serve!
> 
> Kenneth "kenbo" Brunsen
> Iris Associates

^ permalink raw reply	[flat|nested] 11+ messages in thread

* linux-2.4.0 scsi problems on NetFinity servers
@ 2001-01-09 20:50 Ken Brunsen/Iris
  2001-01-09 21:08 ` JP Navarro
  0 siblings, 1 reply; 11+ messages in thread
From: Ken Brunsen/Iris @ 2001-01-09 20:50 UTC (permalink / raw)
  To: linux-kernel

Hello all,

     I've been sorta pulling the 2.4 kernel and testing with it now for
awhile on my IBM NetFinity 5500 and since the test12 I've been having a
continuous issue with crashing the OS during a pull of source code across
the network (>1Gb files).  I've been trying to figure out what it may be
related to, but I'm relatively new with debugging the kernel so thought I'd
see if y'all could help.  From looking at the archives, I did not see that
anyone else had been seeing these issues either.  Basically, I've got 2
different machines which I'm working with - a NetFinity Quad CPU 5500 M20
with 2Gb Ram and Raid and a NetFinity Dual CPU 5500 M10 with 1Gb Ram and
Raid.  Both machines exhibit the same behavior.  Initially, both machines
had RH 6.0, now one is RH 7.0 (and I know about the compiler issue) and the
other is SuSE 7.0.  I downloaded the 2.4.0 release and still got the issue,
so thought it was time to bring it here.  Here is a stack of one crash:

     Started getting Scsi errors on controller during NFS transfer of >1Gb
worth of files

SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
I/O error: dev 08:05, sector 31731256
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
I/O error: dev 08:05, sector 31731264
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
I/O error: dev 08:05, sector 31731272
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
I/O error: dev 08:05, sector 31731280
.
.
.

     (the sector varies from run to run, is never consistent), and then
kernel panics with the following

(ips0) Resetting controller.
NMI Watchdog detected LOCKUP on CPU1, registers:
CPU: 1
EIP: 0010:[<c0246544>]
EFLAGS: 00000002
eax: 003e240   ebx: 000612b0  ecx: 5a21a2f5   edx: 00000063
esi: 00000004  edi: 00000000  ebp:f7de2a78    esp: f7ddbf00
ds: 0018  es: 0018  ss: 0018
Process scsi_eh_0 (pid: 8, stackpage=f7ddb000)
Stack:    000003e6 c0246587 000612b0 c02465f5 000612b0 c01df470 00418570
ffffffff
     f7de2a78 00000082 00000001 200012b0 f7ddbf36 000612b0 c01dfa7c
f7de2a78
     f7de2ab8 f7de2a78 f7db1400 f7de2ab8 c01dc4ae f7de2a78 c0296220
c0295c67
Call Trace: [<c0246587>] [<c02465f5>] [<c01df470>] [<c01dfa7c>]
[<c01dc4ae>]
     [<c01bda9c>] [<c01be1db>] [<c01be4e6>] [<c01074c4>]

Code: 39 d8 72 f8 5b c3 89 f6 8b 44 24 04 eb 0e 8d b4 26 00 00 00
console shuts up ...
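
The bracketed addresses in the Call Trace above only become readable once
they are matched against the System.map of the exact kernel that crashed,
which is the lookup that ksymoops performs. A minimal Python sketch of that
lookup follows. The map excerpt and its symbol names are hypothetical
placeholders, not symbols from this kernel, so the names it prints mean
nothing until the real System.map contents are substituted.

```python
import bisect
import re

# HYPOTHETICAL System.map excerpt: these addresses and names are invented
# for illustration and are NOT taken from the kernel in this report.
SYSTEM_MAP = """\
c0246500 T hypothetical_sym_a
c0246580 T hypothetical_sym_b
c02465f0 T hypothetical_sym_c
"""

# Copied from the first line of the Call Trace in the report.
CALL_TRACE = "Call Trace: [<c0246587>] [<c02465f5>] [<c01df470>] [<c01dfa7c>]"


def load_symbols(text):
    """Parse 'address type name' lines into a list sorted by address."""
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            symbols.append((int(parts[0], 16), parts[2]))
    return sorted(symbols)


def resolve(addr, symbols):
    """Return 'name+0xoffset' for the nearest symbol at or below addr."""
    starts = [start for start, _ in symbols]
    idx = bisect.bisect_right(starts, addr) - 1
    if idx < 0:
        return None  # address lies below every symbol in the map
    start, name = symbols[idx]
    return f"{name}+{addr - start:#x}"


def trace_addresses(line):
    """Extract the bracketed hex addresses from an oops line."""
    return [int(m, 16) for m in re.findall(r"\[<([0-9a-fA-F]+)>\]", line)]


if __name__ == "__main__":
    symbols = load_symbols(SYSTEM_MAP)
    for addr in trace_addresses(CALL_TRACE):
        print(f"{addr:08x}  {resolve(addr, symbols) or '(not in map)'}")
```

With a complete map, the nearest symbol at or below each address is the
function that contains it. With a partial map like the placeholder above,
a match only says which listed symbol precedes the address.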


Thinking it could be memory related - since I see the Cache fill up and the
system go to just over 1mb free prior to crash - I disabled highmem
support.  I then disabled NFSv3 and automounter v4 support, jic.  In the
last test, I disabled swap - since one thing I've noticed is that the 2.4
kernel never touches my swap at all.  None of these changes have affected
the outcome; the closest I've gotten is by continually doing "sync" in
another window which sometimes keeps it from crashing on a run, although
I'll still end up with a few of the SCSI disk error messages (although not
nearly as many as I get before a failure).  Since this happens on multiple
machines, I do not believe it is.  We're also seeing failures of this same
type when we try to do heavy database loading on the machine, i.e., intense
disk accesses.  Any help would be greatly appreciated, as we are really
needing to get this 2.4 kernel working
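
The "SCSI disk error" and "I/O error" lines quoted earlier can be unpacked
a little further. The Python sketch below assumes the return code is
printed in hexadecimal and follows the usual Linux SCSI result layout
(driver byte, host byte, message byte, status byte, from high to low), that
"dev 08:05" is a hexadecimal major:minor pair with major 8 being the sd
driver and 16 minors per disk, and that sectors are 512 bytes. Under those
assumptions, 70000 carries host byte 0x07 (DID_ERROR), the device is
/dev/sda5, and the failing sectors are 4 KiB apart.

```python
import re

# Host-byte codes as defined in the Linux SCSI headers.
HOST_BYTE = {
    0x00: "DID_OK", 0x01: "DID_NO_CONNECT", 0x02: "DID_BUS_BUSY",
    0x03: "DID_TIME_OUT", 0x04: "DID_BAD_TARGET", 0x05: "DID_ABORT",
    0x06: "DID_PARITY", 0x07: "DID_ERROR", 0x08: "DID_RESET",
    0x09: "DID_BAD_INTR",
}

# Lines copied from the report.
LOG = """\
SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 70000
I/O error: dev 08:05, sector 31731256
I/O error: dev 08:05, sector 31731264
I/O error: dev 08:05, sector 31731272
I/O error: dev 08:05, sector 31731280
"""


def decode_result(code_hex):
    """Split a SCSI result word into its four bytes (assumes hex input)."""
    value = int(code_hex, 16)
    return {
        "status": value & 0xFF,
        "msg": (value >> 8) & 0xFF,
        "host": (value >> 16) & 0xFF,
        "driver": (value >> 24) & 0xFF,
    }


def device_name(dev):
    """Map a hex 'major:minor' pair to an sd device name (major 8 only)."""
    major, minor = (int(part, 16) for part in dev.split(":"))
    if major != 8:
        return None  # this sketch only handles the first sd major
    disk = chr(ord("a") + minor // 16)
    partition = minor % 16
    return f"/dev/sd{disk}{partition if partition else ''}"


def sector_strides(sectors):
    """Return the differences between consecutive sector numbers."""
    return [b - a for a, b in zip(sectors, sectors[1:])]


if __name__ == "__main__":
    code = re.search(r"return code = ([0-9a-fA-F]+)", LOG).group(1)
    fields = decode_result(code)
    print("host byte:", HOST_BYTE.get(fields["host"], "unknown"), fields)

    errors = re.findall(r"dev ([0-9a-f]{2}:[0-9a-f]{2}), sector (\d+)", LOG)
    print("device:", device_name(errors[0][0]))
    strides = sector_strides([int(sector) for _, sector in errors])
    print("stride in bytes:", [s * 512 for s in strides])
```

DID_ERROR is the generic host-adapter error code, so the decoding narrows
the failure to the host side of the SCSI stack but does not by itself
identify a cause.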

Since I only get the archive list, please CC me with any responses!

Thanks!

kenbo

______________________
Firebirds rule, `stangs serve!

Kenneth "kenbo" Brunsen
Iris Associates


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2001-01-11 19:58 UTC | newest]

Thread overview: 11+ messages
2001-01-10 13:58 linux-2.4.0 scsi problems on NetFinity servers Ken Brunsen/Iris
2001-01-10 15:32 ` Problem with 2.4.0 agpgart on Dell D4100 (probably) Intel i815 Charles McLachlan
2001-01-10 18:56   ` Jeff Hartmann
  -- strict thread matches above, loose matches on Subject: below --
2001-01-11 20:00 linux-2.4.0 scsi problems on NetFinity servers kenbo
2001-01-11 13:54 kenbo
2001-01-09 21:27 Ken Brunsen/Iris
2001-01-09 20:50 Ken Brunsen/Iris
2001-01-09 21:08 ` JP Navarro
2001-01-09 21:36   ` Miles Lane
2001-01-09 22:22     ` JP Navarro
2001-01-10 20:49   ` Tim Wright
