Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline

* Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
@ 2004-10-28  7:53 Andrew Morton
  2004-10-28 18:21 ` Phil Brutsche
  2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
  0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2004-10-28  7:53 UTC (permalink / raw)
  To: linux-scsi



Begin forwarded message:

Date: Thu, 28 Oct 2004 00:50:31 -0700
From: bugme-daemon@osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline


http://bugme.osdl.org/show_bug.cgi?id=3651

           Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
                    offline
    Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
            Status: NEW
          Severity: high
             Owner: andmike@us.ibm.com
         Submitter: oliver.polterauer@ewave.at
                CC: oliver.polterauer@ewave.at


Distribution: Debian Sarge
Hardware Environment: Dell PowerEdge 4600, 5 disks of 146 GB each in a RAID 5 in
one container, 8 GB RAM, dual Xeon 2 GHz. The PERC 3/Di controller is on
firmware version 2.80 build 6092
Software Environment: aacraid
Problem Description: 
The container on the PERC 3/Di controller goes offline under heavy I/O load with
the following error message:

SCSI:0 (0:0): rejecting I/O to offline device
Buffer I/O error due to I/O error on sda8

Steps to reproduce:

I am using bonnie++ to produce I/O load on the only volume on the PERC 3/Di
controller, with the following parameters:

bonnie++ -d /var/lib/postgres/test -s 16000 -n 150 -r 8000 -u nobody:nogroup
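
For anyone trying to reproduce this on different hardware, here is a small sketch of how the bonnie++ invocation above could be rebuilt for another RAM size. The helper name `bonnie_command` is made up for illustration. The flag meanings (`-s` and `-r` in megabytes, `-n` in multiples of 1024 files) and the "file size should be at least double the RAM" check are my reading of the bonnie++ man page, so treat them as assumptions and verify against your installed version.

```python
import shlex


def bonnie_command(directory, size_mb, files_k, ram_mb, user):
    """Build a bonnie++ argument list (hypothetical helper, not part of bonnie++).

    Assumed flag meanings:
      -d  directory to run the test in
      -s  test file size in megabytes
      -n  number of small files to create, in multiples of 1024
      -r  RAM size in megabytes
      -u  user[:group] to run as
    """
    # Assumption: the test file should be at least twice the RAM size so the
    # page cache cannot absorb the whole workload.
    if size_mb < 2 * ram_mb:
        raise ValueError("file size should be at least double the RAM size")
    return [
        "bonnie++",
        "-d", directory,
        "-s", str(size_mb),
        "-n", str(files_k),
        "-r", str(ram_mb),
        "-u", user,
    ]


# The values from the bug report: 8 GB of RAM, 16000 MB test file.
cmd = bonnie_command("/var/lib/postgres/test", 16000, 150, 8000, "nobody:nogroup")
print(shlex.join(cmd))
```

With the values from the report, this prints the same command line that is quoted above; the list form can be passed to `subprocess.run` unchanged.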

my partition table:

/dev/sda1 --> /boot (80mb)
/dev/sda2 --> swap (16 GB)
/dev/sda3 --> /root (4 GB)
/dev/sda5 --> /home (4 GB)
/dev/sda6 --> /tmp (1 GB)
/dev/sda7 --> /var (8 GB)
/dev/sda8 --> /var/lib/postgres (502 GB)

Filesystem: ext3
The kernel is a non-modular build from plain vanilla kernel source (no patches).

I have turned on remote syslog to our syslog server, but it did not log any
error messages beyond those shown above. The error messages above are printed
only on the console.


Best regards,

Oliver

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-10-28  7:53 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Andrew Morton
@ 2004-10-28 18:21 ` Phil Brutsche
  2004-11-09 20:22   ` Ryan Anderson
  2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Phil Brutsche @ 2004-10-28 18:21 UTC (permalink / raw)
  To: linux-scsi

Andrew Morton wrote:
> Distribution: Debian Sarge
> Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in 
> one container, 8 GB RAM, Dual Xeon 2GHz. The Perc 3/Di Controller is on 
> Firmware version 2.80 Build 6092
> Software Environment: aacraid
> Problem Description: 
> The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with 
> the following error message:
> 
> SCSI:0 (0:0): rejecting I/O to offline device
> Buffer I/O error due to I/O error on sda8
> 
> Steps to reproduce:
> 
> I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di 
> Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s 
> 16000 -n 150 -r 8000 -u nobody:nogroup

FYI, I have been seeing this as well.

I can trigger this card lockup at will with mkfs.ext3; for other
filesystems, I may need to extract a kernel source .tar.gz in order to
cause a lockup.

aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter appears dead
Device offlined - not ready after error recovery: host 1 channel 0 id 0
lun 0
SCSI error : <1 0 0 0> return code = 0x6000000
end_request: I/O error, dev sdb, sector 1667007
Buffer I/O error on device sdb1, logical block 208368
lost page write due to I/O error on sdb1
scsi1 (0:0): rejecting I/O to offline device
Buffer I/O error on device sdb1, logical block 208369
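
A side note on reading the log above: the disk sector in the end_request line and the logical block in the Buffer I/O line appear to describe the same location. The sketch below checks that arithmetic. The 512-byte sector size, the 4096-byte block size, and the partition start at sector 63 are assumptions (they are not stated anywhere in this thread), chosen because they make the two numbers agree.

```python
SECTOR_SIZE = 512      # bytes per disk sector (assumed)
BLOCK_SIZE = 4096      # bytes per buffer/filesystem block (assumed)
PARTITION_START = 63   # first sector of sdb1 (assumed, classic DOS layout)


def logical_block(disk_sector,
                  partition_start=PARTITION_START,
                  block_size=BLOCK_SIZE,
                  sector_size=SECTOR_SIZE):
    """Map an absolute disk sector to a block number inside the partition."""
    sectors_per_block = block_size // sector_size
    return (disk_sector - partition_start) // sectors_per_block


# "end_request: I/O error, dev sdb, sector 1667007"
print(logical_block(1667007))  # 208368, matching "logical block 208368"
```

If the assumptions hold, the first failed write sits about 814 MB into sdb1, and the next reported block (208369) is simply the following 4 KB block.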

I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives. Yes, I
know about the Seagate firmware timeout problem; these drives are brand
new with firmware rev 0006 and thus aren't affected.

This hardware has no problems with kernel 2.4.x.

-- 

Phil Brutsche
phil@brutsche.us

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-10-28 18:21 ` Phil Brutsche
@ 2004-11-09 20:22   ` Ryan Anderson
  2004-11-09 21:32     ` Otto Solares
  0 siblings, 1 reply; 12+ messages in thread
From: Ryan Anderson @ 2004-11-09 20:22 UTC (permalink / raw)
  To: Phil Brutsche; +Cc: linux-scsi

[-- Attachment #1: Type: text/plain, Size: 3596 bytes --]

On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> Andrew Morton wrote:
> > Distribution: Debian Sarge
> > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in 
> > one container, 8 GB RAM, Dual Xeon 2GHz. The Perc 3/Di Controller is on 
> > Firmware version 2.80 Build 6092
> > Software Environment: aacraid

I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives)
4 GB Ram, dual 2.4GHz Xeon

dmesg tells me I have this specific firmware:
AAC0: kernel 2.8.4 build 6092
AAC0: monitor 2.8.4 build 6092
AAC0: bios 2.8.0 build 6092
AAC0: serial 83ac41d3fafaf001
scsi0 : percraid
  Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
  Vendor: DELL      Model: PERCRAID RAID5    Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02


Currently I have 2.6.8 on this machine.  (I believe it's actually
2.6.8.1, Debian sometimes blurs things a little bit in terms of
backporting a patch.)

> > Problem Description: 
> > The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with 
> > the following error message:
> > 
> > SCSI:0 (0:0): rejecting I/O to offline device
> > Buffer I/O error due to I/O error on sda8

That's what I'm seeing.  It's rather hard to capture because the only
disks on this machine are in the RAID container that keeps going
offline.

> > Steps to reproduce:
> > 
> > I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di 
> > Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s 
> > 16000 -n 150 -r 8000 -u nobody:nogroup
> 
> FYI, I have been seeing this as well.
> 
> I can trigger this card lockup at will with mkfs.ext3; for other
> filesystems, I may need to extract a kernel source .tar.gz in order to
> cause a lockup.
> 
> aacraid: Host adapter reset request. SCSI hang ?
> aacraid: Host adapter appears dead
> Device offlined - not ready after error recovery: host 1 channel 0 id 0
> lun 0
> SCSI error : <1 0 0 0> return code = 0x6000000
> end_request: I/O error, dev sdb, sector 1667007
> Buffer I/O error on device sdb1, logical block 208368
> lost page write due to I/O error on sdb1
> scsi1 (0:0): rejecting I/O to offline device
> Buffer I/O error on device sdb1, logical block 208369
> 
> I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives - yes, I
> know about the Seagate firmware timeout problem, these drives are brand
> new with firmware rev 0006 and thus aren't affected.
> 
> This hardware has no problems with kernel 2.4.x.

I had similar, but not nearly as bad, problems with 2.4.x.
Under 2.4.x, this machine would become unavailable for approximately 20
minutes, and then would recover.  The load would be around 20 when it
came back, and would rapidly drop off.

This was with 2.4.20, using a config modeled off the Debian 2.4.18-bf2.4
config.

Due to this problem, my machine is no longer a production machine, so I
can do whatever testing is necessary to fix this.

I have gone through the hardware diagnostics process with Dell, with the
exception of completing the Elite HD diagnostics on the Fujitsu drives. 
(The program filled the boot floppy with the log, and I haven't gotten
around to rerunning it yet.)

I can debug and attempt to reproduce as much as is necessary at this
point, if anyone can give me a place to start and/or a patch to apply.


-- 

Ryan Anderson                
AutoWeb Communications, Inc. 
email: ryan@autoweb.net 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-11-09 20:22   ` Ryan Anderson
@ 2004-11-09 21:32     ` Otto Solares
  2004-11-09 23:49       ` Andrew Kinney
  2004-11-10 17:43       ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
  0 siblings, 2 replies; 12+ messages in thread
From: Otto Solares @ 2004-11-09 21:32 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: Phil Brutsche, linux-scsi

JFYI

I have exactly this same problem on 3 brand new Dell PE2650
machines with Perc3/Di controllers; my other new Dell servers
with the Perc4/Di controller have never failed.

Dell customer support sucks, they would not help me as I am
not running a supported distro/kernel.

The faulty servers have the latest BIOS, Perc3/Di firmware (6092),
latest ERA/RAC firmware, latest Debian sarge, kernel 2.6.10-rc1-bk7.
Both 2.4 and 2.6 hang the controller.

The problem appears when too much I/O is happening. The kernel
doesn't die: if I have an ssh session open I can still execute some
cached binaries like ps, bash, etc.  Everything in memory runs
fine until it touches sda, which is offlined, as you can see
from these kernel messages:

Nov  5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ? 
Nov  5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung 
Nov  5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0 
Nov  5 14:54:34 saruman kernel: Device sda not ready. 
Nov  5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537 
Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device 
Nov  5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device 
Nov  5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0 
Nov  5 14:54:34 saruman kernel:  
Nov  5 14:54:34 saruman kernel: Remounting filesystem read-only
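
To put a number on how long the error recovery ran before the device was offlined, the timestamps above can be compared directly. This is only a sketch: the year 2004 is taken from the date of this thread (syslog lines carry no year), English month abbreviations are assumed, and the helper names are made up for illustration.

```python
import re
from datetime import datetime

# Three lines copied from the log above (trailing spaces dropped).
LOG = """\
Nov  5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?
Nov  5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung
Nov  5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
"""

# Classic syslog prefix: month, day (space padded), HH:MM:SS.
STAMP = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")


def timestamp(line, year=2004):
    """Parse the syslog prefix of one line; the year is an assumption."""
    match = STAMP.match(line)
    return datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")


def seconds_between(log, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the first
    line containing second_marker."""
    lines = log.splitlines()
    first = next(line for line in lines if first_marker in line)
    second = next(line for line in lines if second_marker in line)
    return int((timestamp(second) - timestamp(first)).total_seconds())


print(seconds_between(LOG, "Host adapter reset request", "Device offlined"))  # 64
```

So in this capture roughly a minute passes between the reset request and the container being taken offline.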

-otto

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-11-09 21:32     ` Otto Solares
@ 2004-11-09 23:49       ` Andrew Kinney
  2004-11-10 17:43       ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
  1 sibling, 0 replies; 12+ messages in thread
From: Andrew Kinney @ 2004-11-09 23:49 UTC (permalink / raw)
  To: Otto Solares; +Cc: Phil Brutsche, linux-scsi

FWIW, we had the same problems on two identically configured Dell 
PE2500 machines with PERC3/DI controllers.  They were purchased about 
two years ago.  The problem surfaced when we moved to a 2.4.20 kernel 
from an older kernel and our disk system loads increased.  It seemed 
that a combination of seeking through large numbers of files spread 
all over the array (12 million or so), a sequential read of a largish 
log file (200MB or more), and a lot of random writes all over the 
array caused a single drive to become unresponsive in the array.  
When the drive became unresponsive, the upper layer drivers offlined 
the container when they gave up waiting for the controller kernel to 
finish marking the drive as dead (this is an interpretation of our 
NVRAM controller logs) and continue servicing requests to the array.

The end result is that a single misbehaving drive, cable, or 
connector can cause the entire array to be taken offline by the OS.  
Obviously, this isn't the intended operation, so there is still 
something that isn't happening correctly, but it could be that the 
controller just needs to offline the bad disk faster.

At any rate, our solution (covered by Dell warranties) was to replace 
the drive, replace the controller, replace the cables, and replace 
the SCSI backplane.  We also reinitialized the arrays with a 64KB 
stripe size instead of a 32KB stripe size to reduce the physical I/O 
overhead associated with many small files. This fixed the problem for 
us (so far).  
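
A rough illustration of why the larger stripe size might help with many small files: the sketch below counts how many stripe units a single request touches. It is a simplification under stated assumptions. It ignores RAID 5 parity updates and controller caching, the offsets are invented for the example, and `stripe_units_touched` is a hypothetical helper, not anything from the controller firmware or the aacraid driver.

```python
KIB = 1024


def stripe_units_touched(offset, length, stripe_size):
    """Number of stripe units (per-disk chunks) a request overlaps.

    offset and length are in bytes from the start of the array's data area.
    """
    if length <= 0:
        return 0
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return last - first + 1


# A 16 KiB request starting 24 KiB into the array (made-up numbers).
for stripe_kib in (32, 64):
    units = stripe_units_touched(24 * KIB, 16 * KIB, stripe_kib * KIB)
    print(f"{stripe_kib} KiB stripe: {units} unit(s)")
# prints:
# 32 KiB stripe: 2 unit(s)
# 64 KiB stripe: 1 unit(s)
```

With the smaller stripe, the same small request straddles a boundary and involves two disks; with the larger one it stays on a single disk.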

It's tough to say what exactly was the fix since we took a shotgun 
approach to the problem, but my guess is that the drive itself wasn't 
responding quickly enough.  Replacing the drive with a drive of the 
same RPM and capacity but designed for U320 operation instead of U160 
operation is what I suspect resolved the trouble for us.  The logic 
chips on the U320 drive appear to process commands faster than those 
on the U160 drive, thus limiting the possibility of getting jammed 
with commands.  Of course, the drive is also a different brand than 
the other drives in the array, so that could have been related.

Hopefully that information was useful to you and others on this list. 
 Unfortunately, I'm not a kernel programmer nor do I have time to 
contribute code, so I'm unable to offer anything other than what 
solved the issue for us.

Andrew

Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net




^ permalink raw reply	[flat|nested] 12+ messages in thread

* aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline]
  2004-11-09 21:32     ` Otto Solares
  2004-11-09 23:49       ` Andrew Kinney
@ 2004-11-10 17:43       ` Otto Solares
  2004-11-10 20:33         ` Phil Brutsche
  1 sibling, 1 reply; 12+ messages in thread
From: Otto Solares @ 2004-11-10 17:43 UTC (permalink / raw)
  To: linux-scsi

FYI again:

There is a recent KernelTrap thread showing more details:

http://kerneltrap.org/node/view/3778

Investigating I found 2 more things I'll check:

A seagate time-out issue with exactly my disks:

http://www.seagate.com/support/disc/u320_firmware.html

Unfortunately Adaptec will not give me the firmware directly,
and Dell does not respond to my support mail queries.  I'm really
impressed with Dell support, they suck so badly; at the very least
those updates should appear in their download area.  I am pretty
sure my next servers will not come from Dell.

Does anybody have these Seagate firmware updates?

And the other issue is a different driver from adaptec:

http://www.adaptec.com/worldwide/support/driverdetail.jsp?sess=no&language=English+US&cat=%2FOperating+System%2FLinux+Driver+Source+Code&filekey=aacraid-src_1.1.5.tgz

Diffing the Adaptec driver against the one in the latest 2.6 kernel
shows they come from the same base, but the Adaptec one seems more
complete. Does somebody know the difference?  The important thing is:
which is better for heavy I/O servers?  Why two drivers?

Thanks.

-otto

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline]
  2004-11-10 17:43       ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
@ 2004-11-10 20:33         ` Phil Brutsche
  2004-11-10 21:08           ` Otto Solares
  0 siblings, 1 reply; 12+ messages in thread
From: Phil Brutsche @ 2004-11-10 20:33 UTC (permalink / raw)
  To: Otto Solares; +Cc: linux-scsi

Otto Solares wrote:

> A seagate time-out issue with exactly my disks:
> 
> http://www.seagate.com/support/disc/u320_firmware.html

[...]

> Does anybody have this seagate firmware updates?

I have a pair of RAID5s with these Seagate drives; I have flashed the
firmware on the affected drives to the latest 0006 rev (they are 15k RPM
drives).

I *do not* have a problem with aacraid under vanilla 2.4.x and the
included aacraid driver.  The machine using these drives in a RAID5 with
an Adaptec controller has had uptimes of 4 months (or longer) without
any problems.

I *do* have a problem (check the archives for the error messages) with
2.6.9 and the included aacraid driver.

-- 

Phil Brutsche
phil@brutsche.us

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline]
  2004-11-10 20:33         ` Phil Brutsche
@ 2004-11-10 21:08           ` Otto Solares
  0 siblings, 0 replies; 12+ messages in thread
From: Otto Solares @ 2004-11-10 21:08 UTC (permalink / raw)
  To: Phil Brutsche; +Cc: linux-scsi

On Wed, Nov 10, 2004 at 02:33:19PM -0600, Phil Brutsche wrote:
> Otto Solares wrote:
> 
> > A seagate time-out issue with exactly my disks:
> > 
> > http://www.seagate.com/support/disc/u320_firmware.html
> 
> [...]
> 
> > Does anybody have this seagate firmware updates?
> 
> I have a pair of RAID5s with these Seagate drives; I have flashed the
> firmware on the affected drives to the latest 0006 rev (they are 15k RPM
> drives).

I have Dell servers with the following disks:

SPEED BRAND   MODEL     REV
10K   SEAGATE ST3146807 DS09
15K   SEAGATE ST373453  DX10

Clearly they are affected by the _time_out_ issue.

They are in RAID1 configurations.  I couldn't obtain the
firmware from either Dell or Seagate. How did you obtain the
firmware updates?

-otto

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-10-28  7:53 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Andrew Morton
  2004-10-28 18:21 ` Phil Brutsche
@ 2004-11-23 21:41 ` Ryan Anderson
  2004-11-23 22:58   ` Otto Solares
  2004-11-24  1:00   ` Andrew Kinney
  1 sibling, 2 replies; 12+ messages in thread
From: Ryan Anderson @ 2004-11-23 21:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-scsi

[-- Attachment #1: Type: text/plain, Size: 1601 bytes --]

On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
> 
> 
> http://bugme.osdl.org/show_bug.cgi?id=3651
> 
>            Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
>                     offline
>     Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
>             Status: NEW
>           Severity: high
>              Owner: andmike@us.ibm.com
>          Submitter: oliver.polterauer@ewave.at
>                 CC: oliver.polterauer@ewave.at

Is there any update on this problem?
To reiterate my particular hardware involved that can trigger this
problem:

Dell 2650, Dual 2.4 GHz Xeon processors (hyperthreading no, though the
problem occurred in 2.4.20 without hyperthreading disabled via "noht")

4 GB of ram
Only load is PostgreSQL related (i.e., network queries, plus twice-daily
dumps of the database to an NFS store, and an rsync back to the server for
a second copy)

Under load, I repeatedly saw containers go offline.

Dell's recommended hardware diagnostics do not turn up anything (at
all!)

The hard drives are Fujitsu drives, so the Seagate firmware issue should
not affect them.

I have since taken this server out of production.  Unfortunately, this
makes the error much harder to trigger (i.e., I have failed so far to
trigger it, even with multiple bonnie++ runs).

Suggestions, diagnostics, etc, would be greatly appreciated.


-- 

Ryan Anderson                
AutoWeb Communications, Inc. 
email: ryan@autoweb.net 

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
@ 2004-11-23 22:58   ` Otto Solares
  2004-11-24  1:00   ` Andrew Kinney
  1 sibling, 0 replies; 12+ messages in thread
From: Otto Solares @ 2004-11-23 22:58 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: Andrew Morton, linux-scsi

On Tue, Nov 23, 2004 at 04:41:41PM -0500, Ryan Anderson wrote:
> On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> > Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
> > 
> > 
> > http://bugme.osdl.org/show_bug.cgi?id=3651
> > 
> >            Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
> >                     offline
> >     Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
> >             Status: NEW
> >           Severity: high
> >              Owner: andmike@us.ibm.com
> >          Submitter: oliver.polterauer@ewave.at
> >                 CC: oliver.polterauer@ewave.at
> 
> Is there any update on this problem?
> To reiterate my particular hardware involved that can trigger this
> problem:
> 
> Dell 2650, Dual 2.4Ghz Xeon processors (hyperthreading no, though the
> problem occured in 2.4.20 without hyperthreading disabled via "noht")
> 
> 4 GB of ram
> Only load is PostgreSQL related (i.e, network queries, plus twice daily
> dumps of the database to a NFS store, and a rsync back to the server for
> a second copy)
> 
> Under load, I repeatedly saw containers go offline.
> 
> Dell's recommended hardware diagnostics do not turn up anything (at
> all!)
> 
> The harddrive are Fujitsu drives, so the Seagate Firmware issue should
> not affect them.
> 
> I have since taken this server out of production.  Unfortunately, this
> makes the error much harder to trigger (i.e, I have failed so far to
> trigger it, even with multiple bonnie++ runs)
> 
> Suggestions, diagnostics, etc, would be greatly appreciated.

I used to have this very same problem with exactly the same hardware as you:

2 x 2.4GHz Xeon processor
4GB RAM
PERC 3/Di
4 x Fujitsu MAP3147NC Rev 5608 10K RPM disks.

I tried all kernels on earth (2.4/2.6) and the controller always died with
the container offline (search this list for the past 15 days and you'll
find my problem).

Currently I'm running 2.6.10-rc1-bk20-adaptec-1.1.5-2370 _WITHOUT_ANY_
problems (ACPI on, HT enabled), my controller:

Red Hat/Adaptec aacraid driver (1.1-5[2370])
ACPI: PCI interrupt 0000:04:08.1[A] -> GSI 30 (level, low) -> IRQ 201
AAC0: kernel 2.8-0[6092] 
AAC0: monitor 2.8-0[6092]
AAC0: bios 2.8-0[6092]
AAC0: serial 3520a1d3
aacraid_setup("")
nondasd=-1 dacmode=-1 commit=-1 coalescethreshold=16 acbsize=-1
scsi0 : percraid
  Vendor: DELL      Model: PERC RAID5        Rev: V1.0
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 860149632 512-byte hdwr sectors (440397 MB)
SCSI device sda: drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0

uptime:

 16:48:22 up 13 days, 29 min,  2 users,  load average: 1.15, 1.27, 1.31

I know 13 days is not much for a server, but this server used to die
within 1-2 days, so it is a huge improvement.

You should try that driver; it works for me.

I have to thank Mark Salyzyn from Adaptec for the updated driver.  It is
my opinion that this "enhanced driver" should make it into Linus' kernel.

-otto


* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
  2004-11-23 22:58   ` Otto Solares
@ 2004-11-24  1:00   ` Andrew Kinney
  2004-11-24 18:35     ` Ryan Anderson
  1 sibling, 1 reply; 12+ messages in thread
From: Andrew Kinney @ 2004-11-24  1:00 UTC (permalink / raw)
  To: Ryan Anderson; +Cc: linux-scsi

I'll just add my two cents having gone through the exact same thing 
with a pair of PowerEdge 2500 systems with PERC 3/Di RAID 
controllers.  They both had the problem.

We could also only replicate our problem with the machine in
production under heavy RAM usage, moderate CPU usage, and many
simultaneous small I/Os mixed with large file writes.  PostgreSQL
dumps seemed to be the most likely trigger, since for a while it
crashed on schedule at the time a cron job was running a PostgreSQL
dump.  That did not cause the crash 100% of the time, though; from
what I could tell it was about 10% of the time.  I was unable to
reproduce the problem once the machine was out of production.  It
took some hocus pocus mix of system activity that only the collective
consciousness of our customers could seem to cause.  I spent weeks
using stress testing programs of all sorts to try to reproduce the
issue, but I just couldn't replicate it.

First, what enabled us to locate the problem was the following 
command run in the boot immediately following a crash.  It needs to 
be run from afacli after opening the controller.

diagnostic show history /old

That will display the series of events leading up to the container 
going offline.  Pay particular attention to the beginning of the 
problem.  It would look similar to this:

[13]: ID(0:01:0) Timeout detected on cmd[0x28]
[14]: ID(0:01:0): Timeout detected on cmd[0x2a]
[15]: ID(0:01:0): Timeout detected on cmd[0x28]
[16]:  <...repeats 2 more times>
<snip>
[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]:  509184 sec
<snip>
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec
[79]: 2 can't read mbr dev_t:1
[80]:  <...repeats 1 more times>
[81]: can't read config from slice #[1]

Note the timestamp on line 51 (continued from line 50).  Yours will
have a different timestamp and a different line number, but you'll
want to make note of the earliest timestamp visible during or just
before the problem.  Ours was almost 6 days from boot.

Then note the latest timestamp near the end of the problem (line 78
in our log).  In our case 78 seconds had elapsed between the two
(509184 to 509262); I'm pretty sure it was coincidence that it was
line 78 and 78 seconds elapsed.  This is, of course, longer than it
should take for the firmware to offline a dead drive, but that's what
happened.  This resulted in the Linux SCSI layer timing out and hosing
everything, since we ran everything (including swap, boot, and root)
from this container.
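
As a rough illustration of the arithmetic above, here is a minimal
Python sketch that pulls the "at N sec" timestamps out of the history
excerpt and works out how long the failure lasted.  The helper name
failure_times is made up for this example, and reading the figure as
seconds since the controller booted is an assumption based on our log,
not something taken from the afacli documentation.

```python
import re

# Excerpt copied from the history output quoted above.
HISTORY = """\
[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]:  509184 sec
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec
"""


def failure_times(history):
    """Return every 'at N sec' timestamp found in the history text."""
    # Drop the "[NN]:" prefixes so a timestamp that wrapped onto the
    # next entry is rejoined with the "at" that precedes it.
    flat = re.sub(r"^\[\d+\]:", " ", history, flags=re.MULTILINE)
    return [int(s) for s in re.findall(r"\bat\s+(\d+)\s+sec\b", flat)]


times = failure_times(HISTORY)
first, last = min(times), max(times)
stall_seconds = last - first          # how long the failures went on
days_since_boot = first / 86400       # assumes seconds since boot

print(f"first failure about {days_since_boot:.1f} days after boot")
print(f"failures spanned {stall_seconds} seconds")
```

Paste your own history output into HISTORY to get the same two
figures for your controller.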

Our solution was UGLY.  We replaced everything in the entire disk 
subsystem in addition to getting the latest BIOS, firmware, and 
drivers.  Fortunately, both systems were under warranty, so Dell 
provided the parts and labor to replace the errant hard drive (ID 0:1 
in our case), the backplane, the cables, and the mainboard (since the 
controller is embedded).  The good news is that we've been up 47 days 
since implementing this solution.  Previous uptimes were typically 
less than a week.

I've had many theories about what the root cause was (some better 
than others), but my latest iteration is that when the system was 
under normal load, a power component on the SCSI backplane couldn't 
supply the proper voltage to one drive for a short period. The drive 
became unresponsive upon executing a command that used the servo 
motor enough to cause the power fluctuation to affect the drive's 
command processor, causing the drive to lock up.

The problem was compounded by the firmware not marking the drive as 
bad in a reasonable time frame so it could continue processing 
commands in degraded mode before the upper layer wigged out.  The 
upper layer Linux SCSI drivers can't do anything about this, 
unfortunately, and just blindly take down your only available disk 
storage before the controller comes back from marking the disk as 
bad.

Maybe our experience will benefit someone when they go to modify the 
controller firmware to properly mark a drive as bad in a reasonable 
time so the container can continue to operate in degraded mode rather 
than being taken offline entirely by the OS.  Hint. Hint.

Andrew

Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net


* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
  2004-11-24  1:00   ` Andrew Kinney
@ 2004-11-24 18:35     ` Ryan Anderson
  0 siblings, 0 replies; 12+ messages in thread
From: Ryan Anderson @ 2004-11-24 18:35 UTC (permalink / raw)
  To: andykinney; +Cc: linux-scsi

On Tue, 2004-11-23 at 17:00 -0800, Andrew Kinney wrote:
> I'll just add my two cents having gone through the exact same thing 
> with a pair of PowerEdge 2500 systems with PERC 3/Di RAID 
> controllers.  They both had the problem.

Thanks!

[snip]

> First, what enabled us to locate the problem was the following 
> command run in the boot immediately following a crash.  It needs to 
> be run from afacli after opening the controller.
> 
> diagnostic show history /old

Ah, that helps.

I'm going to try like mad to get my system to fail again (it's rather
hard, oddly, but I was used to 10-day uptimes when this was in
production, so...)

Thanks for the tip, and I'll see what I can do to get a good diagnostic
out.

-- 

Ryan Anderson                
AutoWeb Communications, Inc. 
email: ryan@autoweb.net 


end of thread, other threads:[~2004-11-24 18:36 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-10-28  7:53 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Andrew Morton
2004-10-28 18:21 ` Phil Brutsche
2004-11-09 20:22   ` Ryan Anderson
2004-11-09 21:32     ` Otto Solares
2004-11-09 23:49       ` Andrew Kinney
2004-11-10 17:43       ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
2004-11-10 20:33         ` Phil Brutsche
2004-11-10 21:08           ` Otto Solares
2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
2004-11-23 22:58   ` Otto Solares
2004-11-24  1:00   ` Andrew Kinney
2004-11-24 18:35     ` Ryan Anderson
