linux-kernel.vger.kernel.org archive mirror
From: Josh Brooks <user@mail.econolodgetulsa.com>
To: linux-kernel@vger.kernel.org
Subject: aacraid (dell PERC) cannot handle a degraded mirror
Date: Sun, 9 Mar 2003 23:44:54 -0800 (PST)
Message-ID: <20030309233257.O74417-100000@mail.econolodgetulsa.com>
In-Reply-To: <5.1.1.6.2.20030101084621.00cdf9f8@pop.gmx.net>


If you are running Linux 2.4.x and using aacraid, and a mirror degrades
(i.e. one of the disks goes bad or otherwise detaches itself from the
mirror), the system will panic and crash.

This is, of course, incorrect behavior - if a mirror degrades the system
should continue running because half of the mirror is still there.

Here is a scenario I have seen about ten times in the last few months -
and it is only this frequency and consistency that has provoked me to send
this email:

1. I start getting things like this in /var/log/messages:

Mar  9 07:12:36 system kernel: aacraid:ID(0:02:0); Error Event
[command:0x28]
Mar  9 07:12:36 system kernel: aacraid:ID(0:02:0); Medium Error, Block
Range 435200 : 435327
Mar  9 07:12:36 system kernel: aacraid:ID(0:02:0); Error Too Long To
Correct
Mar  9 07:12:36 system kernel: aacraid:ID(0:02:0) Medium Error, LBN Range
435200:435327
Mar  9 07:12:36 system kernel: aacraid:ID(0:02:0) Starting BBR sequence

Ok, fair enough - disk 2 on channel 0 is bad or is going bad.  Good thing
I have a mirror ... wrong!


2. The problem gets worse:

Mar  9 07:13:00 system kernel: scsi : aborting command due to timeout :
pid
162469964, scsi0, channel 0, id 1, lun 0 Read (10) 00 00 06 a3 ff 00 00 08
00
Mar  9 07:13:06 system kernel: scsi : aborting command due to timeout :
pid
162470312, scsi0, channel 0, id 1, lun 0 Read (10) 00 03 c2 c2 fb 00 00 02
00
Mar  9 07:13:06 system kernel: scsi : aborting command due to timeout :
pid
162470320, scsi0, channel 0, id 1, lun 0 Read (10) 00 05 79 83 77 00 00 02
00
Mar  9 07:13:07 system kernel: scsi : aborting command due to timeout :
pid
162470322, scsi0, channel 0, id 1, lun 0 Read (10) 00 01 b6 c3 71 00 00 02
00
Mar  9 07:13:07 system kernel: aacraid:ID(0:02:0); Error Event
[command:0x28]
Mar  9 07:13:07 system kernel: aacraid:ID(0:02:0); Medium Error, Block
Range 435234 : 435234
Mar  9 07:13:07 system kernel: aacraid:ID(0:02:0); Error Too Long To
Correct


3. Disk 2 on channel 0 fails.  No problem, it's a mirror, right?


Mar  9 07:13:30 system kernel: SCSI host 0 abort (pid 162469964) timed out
- resetting
Mar  9 07:13:30 system kernel: SCSI bus is being reset for host 0 channel
0.
Mar  9 07:13:36 system kernel: scsi : aborting command due to timeout :
pid
162470312, scsi0, channel 0, id 1, lun 0 Read (10) 00 03 c2 c2 fb 00 00 02
00
Mar  9 07:13:36 system kernel: SCSI host 0 abort (pid 162470312) timed out
- resetting
Mar  9 07:13:36 system kernel: SCSI bus is being reset for host 0 channel
0.
Mar  9 07:13:36 system kernel: scsi : aborting command due to timeout :
pid
162470320, scsi0, channel 0, id 1, lun 0 Read (10) 00 05 79 83 77 00 00 02
00
Mar  9 07:13:36 system kernel: SCSI host 0 abort (pid 162470320) timed out
- resetting
Mar  9 07:13:36 system kernel: SCSI bus is being reset for host 0 channel
0.
Mar  9 07:13:36 system kernel: aacraid:  BBR timed out at Block 0x6a42d
Mar  9 07:13:36 system kernel: aacraid:Drive 0:2:0 returning error
Mar  9 07:13:36 system kernel: aacraid:ID(0:02:0) - IO failed, Cmd[0x28]


4. The system panics and crashes (which makes _no_ sense, because the other
disk is totally healthy, has reported no errors, and makes up the other
half of the _mirror_):


Mar  9 07:13:41 system kernel: Unable to handle kernel paging request at
virtual address 405a2200
Mar  9 07:13:41 system kernel:  printing eip:
Mar  9 07:13:41 system kernel: c0114d0f
Mar  9 07:13:41 system kernel: *pde = 14629067
Mar  9 07:13:41 system kernel: *pte = 00000000
Mar  9 07:13:41 system kernel: Oops: 0000


5. Upon system boot, the Dell PERC 3si reports that the mirror is
degraded, but that the other disk in the mirror is totally healthy, and
that the container is present.

6. The system boots _just fine_ on the broken mirror and then runs fine
on it, as it should.


So, why does the system run fine on the broken mirror, but panic and
crash when the mirror actually breaks?

This is very frustrating - one of the reasons we spent money to mirror
things was to reduce possible downtime (since a disk failure should not
crash the machine) but ... a disk failure does crash the machine.
Explanations welcome.
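
For anyone cross-checking the block numbers in the logs above, here is a
minimal sketch in Python. It assumes that the nine bytes printed after
"Read (10)" are the remainder of a standard 10-byte READ CDB once the 0x28
opcode is stripped (flags, 4-byte big-endian LBA, group number, 2-byte
big-endian transfer length, control). The helper name decode_read10 is
made up for illustration and is not part of the kernel or of aacraid.

```python
def decode_read10(tail_hex):
    """Return (lba, block_count) from the nine bytes logged after 'Read (10)'.

    Assumed layout (standard READ(10) minus the 0x28 opcode byte):
    flags, 4-byte big-endian LBA, group number,
    2-byte big-endian transfer length, control.
    """
    tail = bytes.fromhex(tail_hex)
    if len(tail) != 9:
        raise ValueError("expected 9 bytes after the opcode, got %d" % len(tail))
    lba = int.from_bytes(tail[1:5], "big")
    count = int.from_bytes(tail[6:8], "big")
    return lba, count


# First timed-out read from step 2 of the report.
lba, count = decode_read10("00 00 06 a3 ff 00 00 08 00")
print(lba, count)  # 435199 8

# Block where BBR timed out (step 3), checked against the
# Medium Error range 435200:435327 reported in step 1.
bbr_block = int("0x6a42d", 16)
print(bbr_block, 435200 <= bbr_block <= 435327)  # 435245 True
```

Under that assumption, the first timed-out read covers blocks 435199
through 435206, and the BBR timeout block 0x6a42d (435245) falls inside
the reported Medium Error range. The SCSI reads are addressed to id 1
while the aacraid messages name drive 0:02:0, so whether the two use the
same block numbering is not something the logs establish.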

