* Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
@ 2004-10-28 7:53 Andrew Morton
2004-10-28 18:21 ` Phil Brutsche
2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
0 siblings, 2 replies; 12+ messages in thread
From: Andrew Morton @ 2004-10-28 7:53 UTC (permalink / raw)
To: linux-scsi
Begin forwarded message:
Date: Thu, 28 Oct 2004 00:50:31 -0700
From: bugme-daemon@osdl.org
To: bugme-new@lists.osdl.org
Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
http://bugme.osdl.org/show_bug.cgi?id=3651
Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
offline
Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
Status: NEW
Severity: high
Owner: andmike@us.ibm.com
Submitter: oliver.polterauer@ewave.at
CC: oliver.polterauer@ewave.at
Distribution: Debian Sarge
Hardware Environment: Dell PowerEdge 4600, 5 disks of 146GB each in a RAID 5 in
one container, 8 GB RAM, dual Xeon 2GHz. The PERC 3/Di controller is on
firmware version 2.80 build 6092
Software Environment: aacraid
Problem Description:
The container on the PERC 3/Di controller goes offline under heavy I/O load with
the following error message:
SCSI:0 (0:0): rejecting I/O to offline device
Buffer I/O error due to I/O error on sda8
Steps to reproduce:
I am using bonnie++ to produce I/O load on the only volume on the PERC 3/Di
controller, with the following parameters:
bonnie++ -d /var/lib/postgres/test -s 16000 -n 150 -r 8000 -u nobody:nogroup
my partition table:
/dev/sda1 --> /boot (80mb)
/dev/sda2 --> swap (16 GB)
/dev/sda3 --> /root (4 GB)
/dev/sda5 --> /home (4 GB)
/dev/sda6 --> /tmp (1 GB)
/dev/sda7 --> /var (8 GB)
/dev/sda8 --> /var/lib/postgres (502 GB)
Filesystem ext3
The kernel is a non-modular build of the plain vanilla kernel source (no patches).
I have turned on remote syslog to our syslog server, but it did not capture any
error messages beyond those shown above, and those appear only on the console.
Best regards,
Oliver
------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-10-28 7:53 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Andrew Morton
@ 2004-10-28 18:21 ` Phil Brutsche
2004-11-09 20:22 ` Ryan Anderson
2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
1 sibling, 1 reply; 12+ messages in thread
From: Phil Brutsche @ 2004-10-28 18:21 UTC (permalink / raw)
To: linux-scsi
Andrew Morton wrote:
> Distribution: Debian Sarge
> Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in
> one container, 8 GB RAM, Dual Xenon 2GHz. The Perc 3/Di Controller is on
> Firmware version 2.80 Build 6092
> Software Environment: aacraid
> Problem Description:
> The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with
> the following error message:
>
> SCSI:0 (0:0): rejecting I/O to offline device
> Buffer I/O error due to I/O error on sda8
>
> Steps to reproduce:
>
> I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di
> Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s
> 16000 -n 150 -r 8000 -u nobody:nogroup
FYI, I have been seeing this as well.
I can trigger this card lockup at will with mkfs.ext3; for other
filesystems, I may need to extract a kernel source .tar.gz in order to
cause a lockup.
aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter appears dead
Device offlined - not ready after error recovery: host 1 channel 0 id 0
lun 0
SCSI error : <1 0 0 0> return code = 0x6000000
end_request: I/O error, dev sdb, sector 1667007
Buffer I/O error on device sdb1, logical block 208368
lost page write due to I/O error on sdb1
scsi1 (0:0): rejecting I/O to offline device
Buffer I/O error on device sdb1, logical block 208369
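As a side note for anyone reading the log above: the whole-disk sector in the end_request line and the logical block numbers in the Buffer I/O lines describe the same location, which suggests the errors are one failure seen at two layers. The sketch below is only a sanity check under two assumptions that are not stated anywhere in this thread: that sdb1 starts at sector 63 (the traditional DOS partition offset) and that the filesystem uses 4096-byte blocks. The function name is made up for illustration.

```python
# Assumptions (NOT stated in the thread): sdb1 starts at sector 63 and the
# filesystem block size is 4096 bytes. Both are common defaults of the era.
SECTOR_SIZE = 512
PARTITION_START_SECTOR = 63   # assumed
FS_BLOCK_SIZE = 4096          # assumed


def block_to_first_sector(logical_block: int) -> int:
    """Return the first whole-disk sector backing a filesystem block."""
    sectors_per_block = FS_BLOCK_SIZE // SECTOR_SIZE
    return PARTITION_START_SECTOR + logical_block * sectors_per_block


# Logical block 208368 on sdb1 maps to the sector reported by end_request.
print(block_to_first_sector(208368))  # 1667007
```

Under those assumptions, block 208368 lands exactly on sector 1667007, so the two messages are consistent with each other.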
I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives. Yes, I
know about the Seagate firmware timeout problem; these drives are brand
new with firmware rev 0006 and thus aren't affected.
This hardware has no problems with kernel 2.4.x.
--
Phil Brutsche
phil@brutsche.us
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-10-28 18:21 ` Phil Brutsche
@ 2004-11-09 20:22 ` Ryan Anderson
2004-11-09 21:32 ` Otto Solares
0 siblings, 1 reply; 12+ messages in thread
From: Ryan Anderson @ 2004-11-09 20:22 UTC (permalink / raw)
To: Phil Brutsche; +Cc: linux-scsi
On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> Andrew Morton wrote:
> > Distribution: Debian Sarge
> > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in
> > one container, 8 GB RAM, Dual Xenon 2GHz. The Perc 3/Di Controller is on
> > Firmware version 2.80 Build 6092
> > Software Environment: aacraid
I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives)
4 GB Ram, dual 2.4GHz Xeon
dmesg tells me I have this specific firmware:
AAC0: kernel 2.8.4 build 6092
AAC0: monitor 2.8.4 build 6092
AAC0: bios 2.8.0 build 6092
AAC0: serial 83ac41d3fafaf001
scsi0 : percraid
Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
Currently I have 2.6.8 on this machine. (I believe it's actually
2.6.8.1, Debian sometimes blurs things a little bit in terms of
backporting a patch.)
> > Problem Description:
> > The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with
> > the following error message:
> >
> > SCSI:0 (0:0): rejecting I/O to offline device
> > Buffer I/O error due to I/O error on sda8
That's what I'm seeing. It's rather hard to capture because the only
disks on this machine are in the RAID container that keeps going
offline.
> > Steps to reproduce:
> >
> > I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di
> > Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s
> > 16000 -n 150 -r 8000 -u nobody:nogroup
>
> FYI, I have been seeing this as well.
>
> I can trigger this card lockup at will with mkfs.ext3; for other
> filesystems, I may need to extract a kernel source .tar.gz in order to
> cause a lockup.
>
> aacraid: Host adapter reset request. SCSI hang ?
> aacraid: Host adapter appears dead
> Device offlined - not ready after error recovery: host 1 channel 0 id 0
> lun 0
> SCSI error : <1 0 0 0> return code = 0x6000000
> end_request: I/O error, dev sdb, sector 1667007
> Buffer I/O error on device sdb1, logical block 208368
> lost page write due to I/O error on sdb1
> scsi1 (0:0): rejecting I/O to offline device
> Buffer I/O error on device sdb1, logical block 208369
>
> I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives - yes, I
> know about the Seagate firmware timeout problem, these drives are brand
> new with firmware rev 0006 and thus aren't affected.
>
> This hardware has no problems with kernel 2.4.x.
I had similar, but not nearly as bad, problems with 2.4.x.
Under 2.4.x, this machine would become unavailable for approximately 20
minutes, and then would recover. The load would be around 20 when it
came back, and would rapidly drop off.
This was with 2.4.20, using a config modeled off the Debian 2.4.18-bf2.4
config.
Due to this problem, my machine is no longer a production machine, so I
can do whatever testing is necessary to fix this.
I have gone through the hardware diagnostics process with Dell, with the
exception of completing the Elite HD diagnostics on the Fujitsu drives.
(The program filled the boot floppy with the log, and I haven't gotten
around to rerunning it yet.)
I can debug and attempt to reproduce as much as is necessary at this
point, if anyone can give me a place to start and/or a patch to apply.
--
Ryan Anderson
AutoWeb Communications, Inc.
email: ryan@autoweb.net
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-11-09 20:22 ` Ryan Anderson
@ 2004-11-09 21:32 ` Otto Solares
2004-11-09 23:49 ` Andrew Kinney
2004-11-10 17:43 ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
0 siblings, 2 replies; 12+ messages in thread
From: Otto Solares @ 2004-11-09 21:32 UTC (permalink / raw)
To: Ryan Anderson; +Cc: Phil Brutsche, linux-scsi
JFYI
I have exactly this same problem on 3 brand new Dell PE2650
machines with Perc3/Di controllers; my other new Dell servers
with the Perc4/Di controller have never failed.
Dell customer support sucks, they would not help me as I am
not running a supported distro/kernel.
The faulty servers have the latest BIOS, Perc3/Di firmware (6092),
latest ERA/RAC firmware, latest Debian sarge, kernel 2.6.10-rc1-bk7.
Both 2.4 and 2.6 hang the controller.
The problem appears when too much I/O is happening. The kernel
doesn't die: if I have an ssh session open I can still execute some
cached binaries like ps, bash, etc. Everything in memory runs
fine until it touches sda, which has been offlined, as you can see
from these kernel messages:
Nov 5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?
Nov 5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung
Nov 5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
Nov 5 14:54:34 saruman kernel: Device sda not ready.
Nov 5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537
Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
Nov 5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0
Nov 5 14:54:34 saruman kernel:
Nov 5 14:54:34 saruman kernel: Remounting filesystem read-only
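For anyone comparing their own logs against the ones above, here is a minimal sketch of how the gap between the driver's reset request and the device being offlined could be measured. It assumes syslog-style lines like the ones quoted here; the function names and the default year are illustrative choices (the log lines carry no year), not part of any existing tool.

```python
import re
from datetime import datetime

MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches the leading "Nov 5 14:53:30 " part of a syslog line.
STAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s")


def parse_stamp(line, year=2004):
    """Return the line's timestamp, or None if it has no syslog prefix."""
    m = STAMP.match(line)
    if m is None or m.group(1) not in MONTHS:
        return None
    mon, day, hh, mm, ss = m.groups()
    return datetime(year, MONTHS[mon], int(day), int(hh), int(mm), int(ss))


def seconds_until_offline(lines, year=2004):
    """Seconds from the first reset request to the following offlining."""
    reset_at = None
    for line in lines:
        when = parse_stamp(line, year)
        if when is None:
            continue
        if reset_at is None and "Host adapter reset request" in line:
            reset_at = when
        elif reset_at is not None and "Device offlined" in line:
            return (when - reset_at).total_seconds()
    return None


SAMPLE = [
    "Nov 5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?",
    "Nov 5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung",
    "Nov 5 14:54:34 saruman kernel: scsi: Device offlined - not ready after "
    "error recovery: host 0 channel 0 id 0 lun 0",
]

print(seconds_until_offline(SAMPLE))  # 64.0
```

On the lines quoted above this gives 64 seconds between the reset request and the container being offlined.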
-otto
On Tue, Nov 09, 2004 at 03:22:54PM -0500, Ryan Anderson wrote:
> On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> > Andrew Morton wrote:
> > > Distribution: Debian Sarge
> > > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in a Raid 5 in
> > > one container, 8 GB RAM, Dual Xenon 2GHz. The Perc 3/Di Controller is on
> > > Firmware version 2.80 Build 6092
> > > Software Environment: aacraid
>
> I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives)
> 4 GB Ram, dual 2.4GHz Xeon
>
> dmesg tells me I have this specific firmware:
> AAC0: kernel 2.8.4 build 6092
> AAC0: monitor 2.8.4 build 6092
> AAC0: bios 2.8.0 build 6092
> AAC0: serial 83ac41d3fafaf001
> scsi0 : percraid
> Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> Type: Direct-Access ANSI SCSI revision: 02
> Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> Type: Direct-Access ANSI SCSI revision: 02
>
>
> Currently I have 2.6.8 on this machine. (I believe it's actually
> 2.6.8.1, Debian sometimes blurs things a little bit in terms of
> backporting a patch.)
>
> > > Problem Description:
> > > The Container on the PERC 3/Di Controller goes offline on heavy I/O Load with
> > > the following error message:
> > >
> > > SCSI:0 (0:0): rejecting I/O to offline device
> > > Buffer I/O error due to I/O error on sda8
>
> That's what I'm seeing. It's rather hard to capture because the only
> disks on this machine are in the RAID container that keeps going
> offline.
>
> > > Steps to reproduce:
> > >
> > > I am using bonnie++ to produce I/O load on the only Volume on the Perc 3/Di
> > > Controller with the following parameters bonnie++ -d /var/lib/postgres/test -s
> > > 16000 -n 150 -r 8000 -u nobody:nogroup
> >
> > FYI, I have been seeing this as well.
> >
> > I can trigger this card lockup at will with mkfs.ext3; for other
> > filesystems, I may need to extract a kernel source .tar.gz in order to
> > cause a lockup.
> >
> > aacraid: Host adapter reset request. SCSI hang ?
> > aacraid: Host adapter appears dead
> > Device offlined - not ready after error recovery: host 1 channel 0 id 0
> > lun 0
> > SCSI error : <1 0 0 0> return code = 0x6000000
> > end_request: I/O error, dev sdb, sector 1667007
> > Buffer I/O error on device sdb1, logical block 208368
> > lost page write due to I/O error on sdb1
> > scsi1 (0:0): rejecting I/O to offline device
> > Buffer I/O error on device sdb1, logical block 208369
> >
> > I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives - yes, I
> > know about the Seagate firmware timeout problem, these drives are brand
> > new with firmware rev 0006 and thus aren't affected.
> >
> > This hardware has no problems with kernel 2.4.x.
>
> I had similar, but not nearly as bad, problems with 2.4.x.
> Under 2.4.x, this machine would become unavailable for approximately 20
> minutes, and then would recover. The load would be around 20 when it
> came back, and would rapidly drop off.
>
> This was with 2.4.20, using a config modeled off the Debian 2.4.18-bf2.4
> config.
>
> Due to this problem, my machine is no longer a production machine, so I
> can do whatever testing is necessary to fix this.
>
> I have gone through the hardware diagnostics process with Dell, with the
> exception of completing the Elite HD diagnostics on the Fujitsu drives.
> (The program filled the boot floppy with the log, and I haven't gotten
> around to rerunning it yet.)
>
> I can debug and attempt to reproduce as much as is necessary at this
> point, if anyone can give me a place to start and/or a patch to apply.
>
>
> --
>
> Ryan Anderson
> AutoWeb Communications, Inc.
> email: ryan@autoweb.net
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-11-09 21:32 ` Otto Solares
@ 2004-11-09 23:49 ` Andrew Kinney
2004-11-10 17:43 ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
1 sibling, 0 replies; 12+ messages in thread
From: Andrew Kinney @ 2004-11-09 23:49 UTC (permalink / raw)
To: Otto Solares; +Cc: Phil Brutsche, linux-scsi
FWIW, we had the same problems on two identically configured Dell
PE2500 machines with PERC3/DI controllers. They were purchased about
two years ago. The problem surfaced when we moved to a 2.4.20 kernel
from an older kernel and our disk system loads increased. It seemed
that a combination of seeking through large numbers of files spread
all over the array (12 million or so), a sequential read of a largish
log file (200MB or more), and a lot of random writes all over the
array caused a single drive to become unresponsive in the array.
When the drive became unresponsive, the upper layer drivers offlined
the container when they gave up waiting for the controller kernel to
finish marking the drive as dead (this is an interpretation of our
NVRAM controller logs) and continue servicing requests to the array.
The end result is that a single misbehaving drive, cable, or
connector can cause the entire array to be taken offline by the OS.
Obviously, this isn't the intended operation, so there is still
something that isn't happening correctly, but it could be that the
controller just needs to offline the bad disk faster.
At any rate, our solution (covered by Dell warranties) was to replace
the drive, replace the controller, replace the cables, and replace
the SCSI backplane. We also reinitialized the arrays with a 64KB
stripe size instead of a 32KB stripe size to reduce the physical I/O
overhead associated with many small files. This fixed the problem for
us (so far).
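To illustrate the stripe size reasoning, here is a rough sketch of how many stripe units a single request touches at 32KB versus 64KB. It is a deliberately simplified model: it treats the "stripe size" as the per-disk chunk, and it ignores parity placement, controller caching, and filesystem layout, none of which I can speak to for the PERC firmware. The function name is made up for the example.

```python
KIB = 1024


def stripe_units_touched(offset: int, length: int, stripe_unit: int) -> int:
    """Count the stripe units (per-disk chunks) a byte range overlaps."""
    if length <= 0:
        return 0
    first = offset // stripe_unit
    last = (offset + length - 1) // stripe_unit
    return last - first + 1


# A 48 KiB file starting at the beginning of the array.
for unit in (32 * KIB, 64 * KIB):
    print(unit // KIB, stripe_units_touched(0, 48 * KIB, unit))
# 32 2
# 64 1
```

In this model a 48 KiB file spans two chunks (and so two drives) at 32KB but only one at 64KB, which is the kind of per-file overhead the larger stripe size was meant to cut.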
It's tough to say what exactly was the fix since we took a shotgun
approach to the problem, but my guess is that the drive itself wasn't
responding quickly enough. Replacing the drive with a drive of the
same RPM and capacity but designed for U320 operation instead of U160
operation is what I suspect resolved the trouble for us. The logic
chips on the U320 drive appear to process commands faster than those
on the U160 drive, thus limiting the possibility of getting jammed
with commands. Of course, the drive is also a different brand than
the other drives in the array, so that could have been related.
Hopefully that information was useful to you and others on this list.
Unfortunately, I'm not a kernel programmer nor do I have time to
contribute code, so I'm unable to offer anything other than what
solved the issue for us.
Andrew
On 9 Nov 2004 at 15:32, Otto Solares wrote:
> JFYI
>
> I have exactly this same problem on 3 brand new Dell PE2650
> machines with Perc3/Di controllers, my other new Dell servers
> with the Perc4/Di controller have never fail.
>
> Dell customer support sucks, they would not help me as I am
> not running a supported distro/kernel.
>
> The faulty servers have the latest BIOS, Perc3/Di firmware (6092),
> latest ERA/RAC firmware, latest debian sarge, kernel 2.6.10-rc1-bk7.
> Both 2.4 and 2.6 hangs the controller.
>
> The problem appears when too many IO is happening, the kernel
> don't die, as if I have a ssh session I could execute some
> cached binaries like ps, bash, etc. Everything in memory runs
> fine until it touches sda that is offlined as you can see
> from this kernel messages:
>
> Nov 5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?
> Nov 5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung
> Nov 5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
> Nov 5 14:54:34 saruman kernel: Device sda not ready.
> Nov 5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537
> Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
> Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
> Nov 5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0
> Nov 5 14:54:34 saruman kernel:
> Nov 5 14:54:34 saruman kernel: Remounting filesystem read-only
>
> -otto
>
> On Tue, Nov 09, 2004 at 03:22:54PM -0500, Ryan Anderson wrote:
> > On Thu, 2004-10-28 at 14:21, Phil Brutsche wrote:
> > > Andrew Morton wrote:
> > > > Distribution: Debian Sarge
> > > > Hardware Environment: Dell Poweredge 4600, 5 Disks each 146GB in
> > > > a Raid 5 in one container, 8 GB RAM, Dual Xenon 2GHz. The Perc
> > > > 3/Di Controller is on Firmware version 2.80 Build 6092
> > > > Software Environment: aacraid
> >
> > I have a Dell 2650, 3 disks, RAID 5, 72GB drives (Fujitsu drives) 4
> > GB Ram, dual 2.4GHz Xeon
> >
> > dmesg tells me I have this specific firmware:
> > AAC0: kernel 2.8.4 build 6092
> > AAC0: monitor 2.8.4 build 6092
> > AAC0: bios 2.8.0 build 6092
> > AAC0: serial 83ac41d3fafaf001
> > scsi0 : percraid
> > Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> > Type: Direct-Access ANSI SCSI revision: 02
> > Vendor: DELL Model: PERCRAID RAID5 Rev: V1.0
> > Type: Direct-Access ANSI SCSI revision: 02
> >
> >
> > Currently I have 2.6.8 on this machine. (I believe it's actually
> > 2.6.8.1, Debian sometimes blurs things a little bit in terms of
> > backporting a patch.)
> >
> > > > Problem Description:
> > > > The Container on the PERC 3/Di Controller goes offline on heavy
> > > > I/O Load with the following error message:
> > > >
> > > > SCSI:0 (0:0): rejecting I/O to offline device
> > > > Buffer I/O error due to I/O error on sda8
> >
> > That's what I'm seeing. It's rather hard to capture because the
> > only disks on this machine are in the RAID container that keeps
> > going offline.
> >
> > > > Steps to reproduce:
> > > >
> > > > I am using bonnie++ to produce I/O load on the only Volume on
> > > > the Perc 3/Di Controller with the following parameters bonnie++
> > > > -d /var/lib/postgres/test -s 16000 -n 150 -r 8000 -u
> > > > nobody:nogroup
> > >
> > > FYI, I have been seeing this as well.
> > >
> > > I can trigger this card lockup at will with mkfs.ext3; for other
> > > filesystems, I may need to extract a kernel source .tar.gz in
> > > order to cause a lockup.
> > >
> > > aacraid: Host adapter reset request. SCSI hang ?
> > > aacraid: Host adapter appears dead
> > > Device offlined - not ready after error recovery: host 1 channel 0 id 0 lun 0
> > > SCSI error : <1 0 0 0> return code = 0x6000000
> > > end_request: I/O error, dev sdb, sector 1667007
> > > Buffer I/O error on device sdb1, logical block 208368
> > > lost page write due to I/O error on sdb1
> > > scsi1 (0:0): rejecting I/O to offline device
> > > Buffer I/O error on device sdb1, logical block 208369
> > >
> > > I am using an Adaptec 2120S with a RAID5 of Seagate U320 drives -
> > > yes, I know about the Seagate firmware timeout problem, these
> > > drives are brand new with firmware rev 0006 and thus aren't
> > > affected.
> > >
> > > This hardware has no problems with kernel 2.4.x.
> >
> > I had similar, but not nearly as bad, problems with 2.4.x.
> > Under 2.4.x, this machine would become unavailable for approximately
> > 20 minutes, and then would recover. The load would be around 20
> > when it came back, and would rapidly drop off.
> >
> > This was with 2.4.20, using a config modeled off the Debian
> > 2.4.18-bf2.4 config.
> >
> > Due to this problem, my machine is no longer a production machine,
> > so I can do whatever testing is necessary to fix this.
> >
> > I have gone through the hardware diagnostics process with Dell, with
> > the exception of completing the Elite HD diagnostics on the Fujitsu
> > drives. (The program filled the boot floppy with the log, and I
> > haven't gotten around to rerunning it yet.)
> >
> > I can debug and attempt to reproduce as much as is necessary at this
> > point, if anyone can give me a place to start and/or a patch to
> > apply.
> >
> >
> > --
> >
> > Ryan Anderson
> > AutoWeb Communications, Inc.
> > email: ryan@autoweb.net
>
>
>
Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net
* aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline]
2004-11-09 21:32 ` Otto Solares
2004-11-09 23:49 ` Andrew Kinney
@ 2004-11-10 17:43 ` Otto Solares
2004-11-10 20:33 ` Phil Brutsche
1 sibling, 1 reply; 12+ messages in thread
From: Otto Solares @ 2004-11-10 17:43 UTC (permalink / raw)
To: linux-scsi
FYI again:
There is a recent KernelTrap thread showing more details:
http://kerneltrap.org/node/view/3778
Investigating, I found two more things that I'll check.
The first is a Seagate time-out issue with exactly my disks:
http://www.seagate.com/support/disc/u320_firmware.html
Unfortunately Adaptec will not give me the firmware directly,
and Dell does not respond to my support mail queries. I'm really
impressed with Dell support, they suck so badly; at the very least
those updates should appear in their download area. I am pretty
sure my next servers will not come from Dell.
Does anybody have these Seagate firmware updates?
The other issue is a different driver from Adaptec:
http://www.adaptec.com/worldwide/support/driverdetail.jsp?sess=no&language=English+US&cat=%2FOperating+System%2FLinux+Driver+Source+Code&filekey=aacraid-src_1.1.5.tgz
Diffing the Adaptec driver against the one in the latest 2.6 kernel shows
that they come from the same base, but the Adaptec one seems more complete.
Does anybody know the difference? The important question is
which one is better for heavy I/O servers, and why are there two drivers?
Thanks.
-otto
On Tue, Nov 09, 2004 at 03:32:15PM -0600, Otto Solares wrote:
> JFYI
>
> I have exactly this same problem on 3 brand new Dell PE2650
> machines with Perc3/Di controllers, my other new Dell servers
> with the Perc4/Di controller have never fail.
>
> Dell customer support sucks, they would not help me as I am
> not running a supported distro/kernel.
>
> The faulty servers have the latest BIOS, Perc3/Di firmware (6092),
> latest ERA/RAC firmware, latest debian sarge, kernel 2.6.10-rc1-bk7.
> Both 2.4 and 2.6 hangs the controller.
>
> The problem appears when too many IO is happening, the kernel
> don't die, as if I have a ssh session I could execute some
> cached binaries like ps, bash, etc. Everything in memory runs
> fine until it touches sda that is offlined as you can see
> from this kernel messages:
>
> Nov 5 14:53:30 saruman kernel: aacraid: Host adapter reset request. SCSI hang ?
> Nov 5 14:54:33 saruman kernel: aacraid: SCSI bus appears hung
> Nov 5 14:54:34 saruman kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
> Nov 5 14:54:34 saruman kernel: Device sda not ready.
> Nov 5 14:54:34 saruman kernel: end_request: I/O error, dev sda, sector 127952537
> Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
> Nov 5 14:54:34 saruman kernel: scsi0 (0:0): rejecting I/O to offline device
> Nov 5 14:54:34 saruman kernel: EXT3-fs error (device sda4): ext3_find_entry: reading directory #13880243 offset 0
> Nov 5 14:54:34 saruman kernel:
> Nov 5 14:54:34 saruman kernel: Remounting filesystem read-only
>
> -otto
* Re: aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline]
2004-11-10 17:43 ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
@ 2004-11-10 20:33 ` Phil Brutsche
2004-11-10 21:08 ` Otto Solares
0 siblings, 1 reply; 12+ messages in thread
From: Phil Brutsche @ 2004-11-10 20:33 UTC (permalink / raw)
To: Otto Solares; +Cc: linux-scsi
Otto Solares wrote:
> A seagate time-out issue with exactly my disks:
>
> http://www.seagate.com/support/disc/u320_firmware.html
[...]
> Does anybody have this seagate firmware updates?
I have a pair of RAID5s with these Seagate drives; I have flashed the
firmware on the affected drives to the latest 0006 rev (they are 15k RPM
drives).
I *do not* have a problem with aacraid under vanilla 2.4.x and the
included aacraid driver. The machine using these drives in a RAID5 with
an Adaptec controller has had uptimes of 4 months (or longer) without
any problems.
I *do* have a problem (check the archives for the error messages) with
2.6.9 and the included aacraid driver.
--
Phil Brutsche
phil@brutsche.us
* Re: aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline]
2004-11-10 20:33 ` Phil Brutsche
@ 2004-11-10 21:08 ` Otto Solares
0 siblings, 0 replies; 12+ messages in thread
From: Otto Solares @ 2004-11-10 21:08 UTC (permalink / raw)
To: Phil Brutsche; +Cc: linux-scsi
On Wed, Nov 10, 2004 at 02:33:19PM -0600, Phil Brutsche wrote:
> Otto Solares wrote:
>
> > A seagate time-out issue with exactly my disks:
> >
> > http://www.seagate.com/support/disc/u320_firmware.html
>
> [...]
>
> > Does anybody have this seagate firmware updates?
>
> I have a pair of RAID5s with these Seagate drives; I have flashed the
> firmware on the affected drives to the latest 0006 rev (they are 15k RPM
> drives).
I have Dell servers with the following disks:
SPEED  BRAND    MODEL      REV
10K    SEAGATE  ST3146807  DS09
15K    SEAGATE  ST373453   DX10
Clearly they are affected by the time-out issue.
They are in RAID1 configurations. I couldn't obtain the
firmware from either Dell or Seagate; how did you obtain the
firmware updates?
-otto
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-10-28 7:53 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Andrew Morton
2004-10-28 18:21 ` Phil Brutsche
@ 2004-11-23 21:41 ` Ryan Anderson
2004-11-23 22:58 ` Otto Solares
2004-11-24 1:00 ` Andrew Kinney
1 sibling, 2 replies; 12+ messages in thread
From: Ryan Anderson @ 2004-11-23 21:41 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-scsi
On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
>
>
> http://bugme.osdl.org/show_bug.cgi?id=3651
>
> Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
> offline
> Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
> Status: NEW
> Severity: high
> Owner: andmike@us.ibm.com
> Submitter: oliver.polterauer@ewave.at
> CC: oliver.polterauer@ewave.at
Is there any update on this problem?
To reiterate, the particular hardware involved that can trigger this
problem:
Dell 2650, dual 2.4GHz Xeon processors (hyperthreading no, though the
problem occurred in 2.4.20 without hyperthreading disabled via "noht")
4 GB of RAM
The only load is PostgreSQL related (i.e., network queries, plus twice-daily
dumps of the database to an NFS store, and an rsync back to the server for
a second copy)
Under load, I repeatedly saw containers go offline.
Dell's recommended hardware diagnostics do not turn up anything (at
all!)
The hard drives are Fujitsu drives, so the Seagate firmware issue should
not affect them.
I have since taken this server out of production. Unfortunately, this
makes the error much harder to trigger (i.e., I have failed so far to
trigger it, even with multiple bonnie++ runs).
Suggestions, diagnostics, etc, would be greatly appreciated.
--
Ryan Anderson
AutoWeb Communications, Inc.
email: ryan@autoweb.net
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
@ 2004-11-23 22:58 ` Otto Solares
2004-11-24 1:00 ` Andrew Kinney
1 sibling, 0 replies; 12+ messages in thread
From: Otto Solares @ 2004-11-23 22:58 UTC (permalink / raw)
To: Ryan Anderson; +Cc: Andrew Morton, linux-scsi
On Tue, Nov 23, 2004 at 04:41:41PM -0500, Ryan Anderson wrote:
> On Thu, 2004-10-28 at 00:53 -0700, Andrew Morton wrote:
> > Subject: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
> >
> >
> > http://bugme.osdl.org/show_bug.cgi?id=3651
> >
> > Summary: dell poweredge 4600 aacraid PERC 3/Di Container goes
> > offline
> > Kernel Version: 2.6.10-rc1, 2.6.9, 2.6.8, 2.6.7, 2.6.6
> > Status: NEW
> > Severity: high
> > Owner: andmike@us.ibm.com
> > Submitter: oliver.polterauer@ewave.at
> > CC: oliver.polterauer@ewave.at
>
> Is there any update on this problem?
> To reiterate my particular hardware involved that can trigger this
> problem:
>
> Dell 2650, Dual 2.4Ghz Xeon processors (hyperthreading no, though the
> problem occured in 2.4.20 without hyperthreading disabled via "noht")
>
> 4 GB of ram
> Only load is PostgreSQL related (i.e, network queries, plus twice daily
> dumps of the database to a NFS store, and a rsync back to the server for
> a second copy)
>
> Under load, I repeatedly saw containers go offline.
>
> Dell's recommended hardware diagnostics do not turn up anything (at
> all!)
>
> The harddrive are Fujitsu drives, so the Seagate Firmware issue should
> not affect them.
>
> I have since taken this server out of production. Unfortunately, this
> makes the error much harder to trigger (i.e, I have failed so far to
> trigger it, even with multiple bonnie++ runs)
>
> Suggestions, diagnostics, etc, would be greatly appreciated.
I used to have this very same problem with exactly the same hardware as you:
2 x 2.4GHz Xeon processor
4GB RAM
PERC 3/Di
4 x Fujitsu MAP3147NC Rev 5608 10K RPM disks.
I tried all kernels on earth (2.4/2.6) and the controller always died with
the container going offline (search this list for the past 15 days and
you'll find my problem).
Currently I'm running 2.6.10-rc1-bk20-adaptec-1.1.5-2370 _WITHOUT_ANY_
problems (ACPI on, HT enabled), my controller:
Red Hat/Adaptec aacraid driver (1.1-5[2370])
ACPI: PCI interrupt 0000:04:08.1[A] -> GSI 30 (level, low) -> IRQ 201
AAC0: kernel 2.8-0[6092]
AAC0: monitor 2.8-0[6092]
AAC0: bios 2.8-0[6092]
AAC0: serial 3520a1d3
aacraid_setup("")
nondasd=-1 dacmode=-1 commit=-1 coalescethreshold=16 acbsize=-1
scsi0 : percraid
Vendor: DELL Model: PERC RAID5 Rev: V1.0
Type: Direct-Access ANSI SCSI revision: 02
SCSI device sda: 860149632 512-byte hdwr sectors (440397 MB)
SCSI device sda: drive cache: write through
sda: sda1 sda2 sda3 sda4 < sda5 sda6 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
uptime:
16:48:22 up 13 days, 29 min, 2 users, load average: 1.15, 1.27, 1.31
I know 13 days is not much for a server, but this server used to die
within 1-2 days, so it is a huge improvement.
You should try that driver; it works for me.
I have to thank Mark Salyzyn from Adaptec for the updated driver. In my
opinion this "enhanced driver" should make it into Linus' kernel.
-otto
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
2004-11-23 22:58 ` Otto Solares
@ 2004-11-24 1:00 ` Andrew Kinney
2004-11-24 18:35 ` Ryan Anderson
1 sibling, 1 reply; 12+ messages in thread
From: Andrew Kinney @ 2004-11-24 1:00 UTC (permalink / raw)
To: Ryan Anderson; +Cc: linux-scsi
I'll just add my two cents having gone through the exact same thing
with a pair of PowerEdge 2500 systems with PERC 3/Di RAID
controllers. They both had the problem.
We were also only able to replicate our problem with the machine in
production with heavy RAM usage, moderate CPU usage, and many
simultaneous small I/Os mixed with large file writes. Postgres dumps
seemed to be the most likely cause, since it crashed on schedule for
a while and a cron job was running a Postgres dump at that time. The
dump didn't cause the crash 100% of the time, though. It only
caused it about 10% of the time, from what I could tell. I was
unable to reproduce the problem when the machine was out of production. It
took some hocus pocus mix of system activity that only the collective
consciousness of our customers could seem to cause. I spent weeks
using various stress testing programs of all sorts to try to
reproduce the issue, but I just couldn't replicate it.
First, what enabled us to locate the problem was the following
command, run during the boot immediately following a crash. It needs to
be run from afacli after opening the controller.
diagnostic show history /old
That will display the series of events leading up to the container
going offline. Pay particular attention to the beginning of the
problem. It would look similar to this:
[13]: ID(0:01:0) Timeout detected on cmd[0x28]
[14]: ID(0:01:0): Timeout detected on cmd[0x2a]
[15]: ID(0:01:0): Timeout detected on cmd[0x28]
[16]: <...repeats 2 more times>
<snip>
[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]: 509184 sec
<snip>
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec
[79]: 2 can't read mbr dev_t:1
[80]: <...repeats 1 more times>
[81]: can't read config from slice #[1]
Note the timestamp on line 51 (continued from line 50). Yours will
have a different timestamp and a different line number, but you'll
want to make note of the earliest timestamp visible during or just
before the problem. Ours was almost 6 days from boot.
Then note the latest timestamp near the end of the problem (line 78
in our log). In our case 78 seconds had elapsed. I'm pretty sure
it was coincidence that it was line 78 and 78 seconds elapsed. This
is, of course, longer than it should take for the firmware to offline
a dead drive, but that's what happened. This resulted in the Linux
SCSI layer timing out and hosing everything, since we ran everything
(including swap, boot, and root) from this container.
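For anyone who wants to do the timestamp arithmetic above on a longer
history dump, here is a minimal Python sketch. It assumes the output of
"diagnostic show history /old" has been saved as plain text in the same
layout as the excerpt above, with "[NN]:" line prefixes and "at N sec"
timestamps that may wrap onto the next line. The function names are
hypothetical and this is not an Adaptec or Dell tool.

```python
import re

# Lines copied from the log excerpt above. The "[NN]:" prefixes are the
# line numbers printed in the history output.
SAMPLE = """\
[13]: ID(0:01:0) Timeout detected on cmd[0x28]
[50]: ID(0:01:0) Cmd[0x28] Fail: Block Range 3424717 : 3424718 at
[51]: 509184 sec
[78]: ID(0:01:0) Cmd[0x28] Fail: Block Range 0 : 0 at 509262 sec
[79]: 2 can't read mbr dev_t:1
"""

PREFIX = re.compile(r"^\[\d+\]:\s*")            # leading "[NN]: " marker
STAMP = re.compile(r"\bat\s+(\d+)\s+sec\b")     # "at 509184 sec"


def failure_timestamps(log_text):
    """Return every 'at N sec' timestamp (seconds since boot) in the log."""
    # Drop the "[NN]:" prefixes and join the lines with spaces, so a
    # timestamp that wrapped onto the next line becomes contiguous again.
    body = " ".join(PREFIX.sub("", line.strip())
                    for line in log_text.splitlines())
    return [int(value) for value in STAMP.findall(body)]


def summarize(log_text):
    """Summarize the first and last failure timestamps, or None if absent."""
    stamps = failure_timestamps(log_text)
    if not stamps:
        return None
    first, last = min(stamps), max(stamps)
    return {
        "first_sec": first,
        "last_sec": last,
        "elapsed_sec": last - first,          # how long the failure took
        "days_since_boot": first / 86400,     # 86400 seconds per day
    }


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On the excerpt above this gives 78 elapsed seconds, with the first
failure a little under 6 days after boot, which matches the figures
quoted in this message.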
Our solution was UGLY. We replaced everything in the entire disk
subsystem in addition to getting the latest BIOS, firmware, and
drivers. Fortunately, both systems were under warranty, so Dell
provided the parts and labor to replace the errant hard drive (ID 0:1
in our case), the backplane, the cables, and the mainboard (since the
controller is embedded). The good news is that we've been up 47 days
since implementing this solution. Previous uptimes were typically
less than a week.
I've had many theories about what the root cause was (some better
than others), but my latest iteration is that when the system was
under normal load, a power component on the SCSI backplane couldn't
supply the proper voltage to one drive for a short period. The drive
became unresponsive upon executing a command that used the servo
motor enough to cause the power fluctuation to affect the drive's
command processor, causing the drive to lock up.
The problem was compounded by the firmware not marking the drive as
bad in a reasonable time frame so it could continue processing
commands in degraded mode before the upper layer wigged out. The
upper layer Linux SCSI drivers can't do anything about this,
unfortunately, and just blindly take down your only available disk
storage before the controller comes back from marking the disk as
bad.
Maybe our experience will benefit someone when they go to modify the
controller firmware to properly mark a drive as bad in a reasonable
time so the container can continue to operate in degraded mode rather
than being taken offline entirely by the OS. Hint. Hint.
Andrew
Sincerely,
Andrew Kinney
President and
Chief Technology Officer
Advantagecom Networks, Inc.
http://www.advantagecom.net
* Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline
2004-11-24 1:00 ` Andrew Kinney
@ 2004-11-24 18:35 ` Ryan Anderson
0 siblings, 0 replies; 12+ messages in thread
From: Ryan Anderson @ 2004-11-24 18:35 UTC (permalink / raw)
To: andykinney; +Cc: linux-scsi
On Tue, 2004-11-23 at 17:00 -0800, Andrew Kinney wrote:
> I'll just add my two cents having gone through the exact same thing
> with a pair of PowerEdge 2500 systems with PERC 3/Di RAID
> controllers. They both had the problem.
Thanks!
[snip]
> First, what enabled us to locate the problem was the following
> command run in the boot immediately following a crash. It needs to
> be run from afacli after opening the controller.
>
> diagnostic show history /old
Ah, that helps.
I'm going to try like mad to get my system to fail again (it's rather
hard, oddly, but I was used to 10-day uptimes when this was in
production, so...)
Thanks for the tip, and I'll see what I can do to get a good diagnostic
out.
--
Ryan Anderson
AutoWeb Communications, Inc.
email: ryan@autoweb.net
Thread overview: 12+ messages
2004-10-28 7:53 Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Andrew Morton
2004-10-28 18:21 ` Phil Brutsche
2004-11-09 20:22 ` Ryan Anderson
2004-11-09 21:32 ` Otto Solares
2004-11-09 23:49 ` Andrew Kinney
2004-11-10 17:43 ` aacraid, seagate and adaptec issues [Was: Re: Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline] Otto Solares
2004-11-10 20:33 ` Phil Brutsche
2004-11-10 21:08 ` Otto Solares
2004-11-23 21:41 ` Fw: [Bugme-new] [Bug 3651] New: dell poweredge 4600 aacraid PERC 3/Di Container goes offline Ryan Anderson
2004-11-23 22:58 ` Otto Solares
2004-11-24 1:00 ` Andrew Kinney
2004-11-24 18:35 ` Ryan Anderson