* Re: "megaraid mbox: critical hardware error" on new dell poweredge 1850, suse 9.2, kernel 2.6.8
@ 2005-01-26 19:48 Reggie Dugard
  0 siblings, 0 replies; 3+ messages in thread
From: Reggie Dugard @ 2005-01-26 19:48 UTC (permalink / raw)
  To: linux-raid

Hi Olivier,

> I'm trying to get a quite standard "suse linux 9.2" setup working
> on a brand new dell poweredge 1850 with 2 scsi disks in raid1 setup.
> 
> Installation went completely fine, everything is working. But now (and
> every time), after 2-3h of uptime and some high disk I/O load (rsync of
> some GB of data), it badly crashes with the following messages:

We're seeing something similar here on an 1850 with 2 disks under
hardware raid1 running RHEL rel. 3 with a 2.4.21-27 kernel.  It has
happened twice so far for us (about once a week or so).  It may have
been a backup of the raid (high disk i/o) that caused it to fail the
most recent time.  Below I've included data from our system
corresponding to what you've included, for comparison purposes.

Unfortunately, we have no leads as to the cause, but I thought I'd let
you know that you're not alone :) and we can share anything we find out.


megaraid: aborting-5781469 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781520 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781529 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781527 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781470 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781498 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781524 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781525 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781507 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781526 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781514 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781509 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781530 cmd=2a <c=0 t=0 l=0>
megaraid: 5781530:81, driver owner.
megaraid: aborting-5781530 cmd=2a <c=0 t=0 l=0>
megaraid: 5781530:81, driver owner.
megaraid: aborting-5781537 cmd=2a <c=0 t=0 l=0>
megaraid: 5781537:94, driver owner.
megaraid: aborting-5781537 cmd=2a <c=0 t=0 l=0>
megaraid: 5781537:94, driver owner.
megaraid: aborting-5781506 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781532 cmd=2a <c=0 t=0 l=0>
megaraid: 5781532:98, driver owner.
megaraid: aborting-5781532 cmd=2a <c=0 t=0 l=0>
megaraid: 5781532:98, driver owner.
megaraid: reset-5781504 cmd=28 <c=0 t=0 l=0>
megaraid: 49 pending cmds; max wait 180 seconds
megaraid: pending 49; remaining 180 seconds
megaraid: pending 49; remaining 175 seconds
megaraid: pending 49; remaining 170 seconds
megaraid: pending 49; remaining 165 seconds
megaraid: pending 49; remaining 160 seconds
megaraid: pending 49; remaining 155 seconds
megaraid: pending 49; remaining 150 seconds
megaraid: pending 49; remaining 145 seconds
megaraid: pending 49; remaining 140 seconds
megaraid: pending 49; remaining 135 seconds
megaraid: pending 49; remaining 130 seconds
megaraid: pending 49; remaining 125 seconds
megaraid: pending 49; remaining 120 seconds
megaraid: pending 49; remaining 115 seconds
megaraid: pending 49; remaining 110 seconds
megaraid: pending 49; remaining 105 seconds
megaraid: pending 49; remaining 100 seconds
megaraid: pending 49; remaining 95 seconds
megaraid: pending 49; remaining 90 seconds
megaraid: pending 49; remaining 85 seconds
megaraid: pending 49; remaining 80 seconds
megaraid: pending 49; remaining 75 seconds
megaraid: pending 49; remaining 70 seconds
megaraid: pending 49; remaining 65 seconds
megaraid: pending 49; remaining 60 seconds
megaraid: pending 49; remaining 55 seconds
megaraid: pending 49; remaining 50 seconds
megaraid: pending 49; remaining 45 seconds
megaraid: pending 49; remaining 40 seconds
megaraid: pending 49; remaining 35 seconds
megaraid: pending 49; remaining 30 seconds
megaraid: pending 49; remaining 25 seconds
megaraid: pending 49; remaining 20 seconds
megaraid: pending 49; remaining 15 seconds
megaraid: pending 49; remaining 10 seconds
megaraid: pending 49; remaining 5 seconds
megaraid: critical hardware error!
megaraid: reset-5781504 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
megaraid: reset-5781473 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
megaraid: reset-5781472 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
megaraid: reset-5781512 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
megaraid: reset-5781471 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
megaraid: reset-5781535 cmd=2a <c=0 t=0 l=0>
megaraid: hw error, cannot reset
megaraid: reset-5781490 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
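
To compare logs like the one above between machines, it may help to tally the abort and reset lines by SCSI opcode. What follows is only a minimal Python sketch, not anything taken from the driver: the function name `summarize` and the regular expression are assumptions based on the line format shown in this log. Opcodes 0x28 and 0x2a are the standard READ(10) and WRITE(10) commands.

```python
import re
from collections import Counter

# Standard SCSI opcodes for the two commands that appear in the log.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Assumed line format, e.g. "megaraid: aborting-5781469 cmd=28 <c=0 t=0 l=0>"
LINE_RE = re.compile(
    r"megaraid: (aborting|reset)-(\d+) cmd=([0-9a-fA-F]+) "
    r"<c=(\d+) t=(\d+) l=(\d+)>"
)

def summarize(log_text):
    """Count abort/reset attempts per (action, opcode name)."""
    counts = Counter()
    for match in LINE_RE.finditer(log_text):
        action = match.group(1)
        opcode = int(match.group(3), 16)
        counts[(action, OPCODES.get(opcode, hex(opcode)))] += 1
    return counts

# A few lines copied from the log above.
sample = """\
megaraid: aborting-5781469 cmd=28 <c=0 t=0 l=0>
megaraid: aborting-5781530 cmd=2a <c=0 t=0 l=0>
megaraid: 5781530:81, driver owner.
megaraid: aborting-5781530 cmd=2a <c=0 t=0 l=0>
megaraid: reset-5781504 cmd=28 <c=0 t=0 l=0>
megaraid: hw error, cannot reset
"""

for (action, name), n in sorted(summarize(sample).items()):
    print(f"{action:9s} {name:10s} {n}")
```

Feeding the full dmesg text to `summarize` would show whether reads, writes, or both are timing out before the controller gives up.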

Loaded modules:

sg                     37388   0 (autoclean)
ext3                   89992   2
jbd                    55092   2 [ext3]
megaraid2              38376   3
diskdumplib             5260   0 [megaraid2]
sd_mod                 13936   6
scsi_mod              115240   3 [sg megaraid2 sd_mod]

$ uname -a
Linux kijang 2.4.21-27.0.1.ELsmp #1 SMP Mon Dec 20 18:47:45 EST 2004 i686 i686 i386 GNU/Linux

SCSI output from dmesg:

SCSI subsystem driver Revision: 1.00
megaraid: v2.10.8.2-RH1 (Release Date: Mon Jul 26 12:15:51 EDT 2004)
megaraid: found 0x1028:0x0013:bus 2:slot 14:func 0
scsi0:Found MegaRAID controller at 0xf8846000, IRQ:38
megaraid: [513O:H418] detected 1 logical drives.
megaraid: supports extended CDBs.
megaraid: channel[0] is raid.
scsi0 : LSI Logic MegaRAID 513O 254 commands 16 targs 4 chans 7 luns
blk: queue f7359e18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
scsi0: scanning scsi channel 0 for logical drives.
  Vendor: MegaRAID  Model: LD 0 RAID1   69G  Rev: 513O
  Type:   Direct-Access                      ANSI SCSI revision: 02
blk: queue f7359c18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
scsi0: scanning scsi channel 1 for logical drives.
scsi0: scanning scsi channel 2 for logical drives.
scsi0: scanning scsi channel 3 for logical drives.
scsi0: scanning scsi channel 4 [P0] for physical devices.
  Vendor: PE/PV     Model: 1x2 SCSI BP       Rev: 1.0
  Type:   Processor                          ANSI SCSI revision: 02
blk: queue f7359a18, I/O limit 4294967295Mb (mask 0xffffffffffffffff)
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
SCSI device sda: 143114240 512-byte hdwr sectors (73274 MB)
Partition check:
 sda: sda1 sda2 sda3 sda4 < sda5 >

Regards,

Reggie

-- 
Reggie Dugard <reggie@merfinllc.com>
Merfin, LLC




* Re: "megaraid mbox: critical hardware error" on new dell poweredge 1850, suse 9.2, kernel 2.6.8
  2005-01-22 23:23 Olivier Mueller
@ 2005-01-24 19:49 ` Olivier Mueller
  0 siblings, 0 replies; 3+ messages in thread
From: Olivier Mueller @ 2005-01-24 19:49 UTC (permalink / raw)
  To: linux-raid

On Sun, 2005-01-23 at 00:23 +0100, Olivier Mueller wrote:
> I'm trying to get a quite standard "suse linux 9.2" setup working
> on a brand new dell poweredge 1850 with 2 scsi disks in raid1 setup.
> 
> Installation went completely fine, everything is working. But now (and
> every time), after 2-3h of uptime and some high disk I/O load (rsync of
> some GB of data), it badly crashes with the following messages:
>
> -------------------------------------------------------------------
> megaraid: aborting-1164069 cmd=2a <c=1 t=0 l=0>
> megaraid abort: 1164069:48[255:0], fw owner
> megaraid: aborting-1164070 cmd=2a <c=1 t=0 l=0>
> megaraid abort: 1164070:59[255:0], fw owner
> megaraid: aborting-1164071 cmd=2a <c=1 t=0 l=0>
> megaraid abort: 1164071:19[255:0], fw owner
> megaraid: aborting-1164072 cmd=2a <c=1 t=0 l=0>
> megaraid abort: 1164072:18[255:0], fw owner


FYI, I ran some "load tests" under Linux kernel 2.4 (knoppix, with the
megaraid2 module loaded), and it has not crashed yet after 2-3h of
work (under 2.6 with megaraid it would have crashed after 30 minutes).

Has the megaraid module been "improved" between 2.4 and 2.6?  The
server is brand new, and the disks don't seem overheated...
Is there anybody else working with dell 1850 servers? :)

regards,
Olivier
-- 
_______________________________________________________
 Olivier Müller - PGP key ID: 0x0E84D2EA - Switzerland 
    E-Mail: http://omx.ch/mail/ - AIM/iChat: swix3k


-
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* "megaraid mbox: critical hardware error" on new dell poweredge 1850, suse 9.2, kernel 2.6.8
@ 2005-01-22 23:23 Olivier Mueller
  2005-01-24 19:49 ` Olivier Mueller
  0 siblings, 1 reply; 3+ messages in thread
From: Olivier Mueller @ 2005-01-22 23:23 UTC (permalink / raw)
  To: linux-raid

Hello,

I'm trying to get a quite standard "suse linux 9.2" setup working
on a brand new dell poweredge 1850 with 2 scsi disks in raid1 setup.

Installation went completely fine, everything is working. But now (and
every time), after 2-3h of uptime and some high disk I/O load (rsync of
some GB of data), it badly crashes with the following messages:

-------------------------------------------------------------------
megaraid: aborting-1164069 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164069:48[255:0], fw owner
megaraid: aborting-1164070 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164070:59[255:0], fw owner
megaraid: aborting-1164071 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164071:19[255:0], fw owner
megaraid: aborting-1164072 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164072:18[255:0], fw owner
megaraid: aborting-1164073 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164073:20[255:0], fw owner
megaraid: aborting-1164074 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164074:32[255:0], fw owner
megaraid: aborting-1164075 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164075:13[255:0], fw owner
megaraid: aborting-1164076 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164076:8[255:0], fw owner
megaraid: aborting-1164077 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164076:8[255:0], fw owner
megaraid: aborting-1164077 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164077:33[255:0], fw owner
megaraid: aborting-1164078 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164078:60[255:0], fw owner
megaraid: aborting-1164079 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164079:0[255:0], fw owner
megaraid: aborting-1164080 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164080:63[255:0], fw owner
megaraid: aborting-1164081 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164081:44[255:0], fw owner
megaraid: aborting-1164082 cmd=2a <c=1 t=0 l=0>
megaraid abort: 1164082:53[255:0], fw owner
megaraid: reseting the host...
megaraid: 14 outstanding commands. Max wait 180 sec
megaraid mbox: Wait for 14 commands to complete:180
megaraid mbox: Wait for 14 commands to complete:175
megaraid mbox: Wait for 14 commands to complete:170
megaraid mbox: Wait for 14 commands to complete:165
megaraid mbox: Wait for 14 commands to complete:160
megaraid mbox: Wait for 14 commands to complete:155
megaraid mbox: Wait for 14 commands to complete:150
megaraid mbox: Wait for 14 commands to complete:145
megaraid mbox: Wait for 14 commands to complete:140
megaraid mbox: Wait for 14 commands to complete:135
megaraid mbox: Wait for 14 commands to complete:130
megaraid mbox: Wait for 14 commands to complete:125
megaraid mbox: Wait for 14 commands to complete:120
megaraid mbox: Wait for 14 commands to complete:115
megaraid mbox: Wait for 14 commands to complete:110
megaraid mbox: Wait for 14 commands to complete:105
megaraid mbox: Wait for 14 commands to complete:100
megaraid mbox: Wait for 14 commands to complete:95
megaraid mbox: Wait for 14 commands to complete:90
megaraid mbox: Wait for 14 commands to complete:85
megaraid mbox: Wait for 14 commands to complete:80
megaraid mbox: Wait for 14 commands to complete:75
megaraid mbox: Wait for 14 commands to complete:70
megaraid mbox: Wait for 14 commands to complete:65
megaraid mbox: Wait for 14 commands to complete:60
megaraid mbox: Wait for 14 commands to complete:55
megaraid mbox: Wait for 14 commands to complete:50
megaraid mbox: Wait for 14 commands to complete:45
megaraid mbox: Wait for 14 commands to complete:40
megaraid mbox: Wait for 14 commands to complete:35
megaraid mbox: Wait for 14 commands to complete:30
megaraid mbox: Wait for 14 commands to complete:25
megaraid mbox: Wait for 14 commands to complete:20
megaraid mbox: Wait for 14 commands to complete:15
megaraid mbox: Wait for 14 commands to complete:10
megaraid mbox: Wait for 14 commands to complete:5
megaraid mbox: Wait for 14 commands to complete:0
megaraid mbox: critical hardware error!
megaraid: reseting the host...
megaraid: hw error, cannot reset
megaraid: reseting the host...
megaraid: hw error, cannot reset
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
[...]
scsi: Device offlined - not ready after error recovery: host 0 channel 1 id 0 lun 0
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105704481
Buffer I/O error on device sda8, logical block 855051
lost page write due to I/O error on sda8
scsi0 (0:0): rejecting I/O to offline device
Buffer I/O error on device sda8, logical block 855052
lost page write due to I/O error on sda8
Buffer I/O error on device sda8, logical block 855053
lost page write due to I/O error on sda8
Buffer I/O error on device sda8, logical block 855054
lost page write due to I/O error on sda8
Buffer I/O error on device sda8, logical block 855060
lost page write due to I/O error on sda8
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105704609
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105704737
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 1 0 0> return code = 0x6000000
[...]
scsi0 (0:0): rejecting I/O to offline device
SCSI error : <0 1 0 0> return code = 0x6000000
end_request: I/O error, dev sda, sector 105705889
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device sda5) in ext3_reserve_inode_write: IO failure
scsi0 (0:0): rejecting I/O to offline device
scsi0 (0:0): rejecting I/O to offline device
EXT3-fs error (device sda5) in ext3_dirty_inode: IO failure
scsi0 (0:0): rejecting I/O to offline device
ext3_abort called.
EXT3-fs error (device sda5): ext3_journal_start: Detected aborted journal
Remounting filesystem read-only
[...]
-------------------------------------------------------------------
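
The repeated "return code = 0x6000000" can be split into its component bytes. The sketch below assumes the conventional Linux SCSI result layout of that era (driver byte, host byte, message byte, status byte, from high to low); the helper name `split_result` is hypothetical. Under that assumption only the driver byte is set, to 0x06, which as far as I know corresponds to DRIVER_TIMEOUT in the kernel headers of the time. That would match the timeouts seen before the reset, but treat the name as unverified.

```python
def split_result(result):
    """Split a 32-bit SCSI result word into its four bytes.

    Assumed layout (high to low): driver, host, message, status.
    """
    return {
        "driver": (result >> 24) & 0xFF,
        "host": (result >> 16) & 0xFF,
        "msg": (result >> 8) & 0xFF,
        "status": result & 0xFF,
    }

# Value taken from the "SCSI error : <0 1 0 0> return code" lines above.
fields = split_result(0x6000000)
for name, value in fields.items():
    print(f"{name:6s} = 0x{value:02x}")
```

With the value from the log, only the `driver` field is non-zero.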
                                                                                                                                                       
And then a complete crash: the system no longer reacts.

Not really nice, is it? :)  Now I'm trying to find a solution...
In the meantime, if you have already seen something like that,
feedback/pointers would be very welcome. Merci!  I will try with knoppix
and some *BSD, but the chances that the HW is really bad are low: on
reboot everything runs completely fine, for some hours...

A consistency check of the RAID array took about 1h, but reported
no problems.

Some more info:

Loaded modules:

ext3                  128744  5
jbd                    76964  1 ext3
megaraid_mbox          35216  6
megaraid_mm            14752  1 megaraid_mbox
sd_mod                 22144  7
scsi_mod              121412  5 sg,st,sr_mod,megaraid_mbox,sd_mod

# uname -a
Linux pe1850 2.6.8-24.10-smp #1 SMP Wed Dec 22 11:54:27 UTC 2004 i686 i686 i386 GNU/Linux

dmesg messages about scsi subsystem:

SCSI subsystem initialized
megaraid cmm: 2.20.2.0 (Release Date: Thu Aug 19 09:58:33 EDT 2004)
megaraid: 2.20.4.0 (Release Date: Mon Sep 27 22:15:07 EDT 2004)
megaraid: probe new device 0x1028:0x0013:0x1028:0x016c: bus 2:slot 14:func 0
ACPI: PCI interrupt 0000:02:0e.0[A] -> GSI 46 (level, low) -> IRQ 201
megaraid: fw version:[513O] bios version:[H418]
scsi0 : LSI Logic MegaRAID driver
scsi[0]: scanning scsi channel 0 [Phy 0] for non-raid devices
  Vendor: PE/PV     Model: 1x2 SCSI BP       Rev: 1.0
  Type:   Processor                          ANSI SCSI revision: 02
scsi[0]: scanning scsi channel 1 [virtual] for logical drives
  Vendor: MegaRAID  Model: LD 0 RAID1   69G  Rev: 513O
  Type:   Direct-Access                      ANSI SCSI revision: 02
SCSI device sda: 143114240 512-byte hdwr sectors (73274 MB)
sda: asking for cache data failed
sda: assuming drive cache: write through
 sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 >
Attached scsi disk sda at scsi0, channel 1, id 0, lun 0

regards,
Olivier

-- 
_______________________________________________________
 Olivier Müller - PGP key ID: 0x0E84D2EA - Switzerland 
    E-Mail: http://omx.ch/mail/ - AIM/iChat: swix3k


