linux-kernel.vger.kernel.org archive mirror
* Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
@ 2012-09-19 18:52 Nix
  2012-09-19 20:19 ` Chris Murphy
  2012-09-19 22:30 ` Stan Hoeppner
  0 siblings, 2 replies; 9+ messages in thread
From: Nix @ 2012-09-19 18:52 UTC (permalink / raw)
  To: linux-raid, linux-kernel

So I have this x86-64 server running Linux 3.5.1 with a SATA-on-PCIe
Areca 1210 hardware RAID-5 controller driven by arcmsr which has been
humming along happily for years -- but suddenly, today, the entire
machine froze for a couple of minutes (or at least fs access froze),
followed by this in the logs:

Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1 
[... repeated a few times at intervals over the next five minutes,
 followed by a mass of them at 16:59:29, and...]
Sep 19 16:59:25 spindle err: [3447657.821450] arcmsr: executing bus reset eh.....num_resets = 0, num_aborts = 33 
Sep 19 16:59:25 spindle notice: [3447697.878386] arcmsr0: wait 'abort all outstanding command' timeout 
Sep 19 16:59:25 spindle notice: [3447697.878628] arcmsr0: executing hw bus reset .....
Sep 19 16:59:25 spindle err: [3447698.287054] irq 16: nobody cared (try booting with the "irqpoll" option)
Sep 19 16:59:25 spindle warning: [3447698.287291] Pid: 0, comm: swapper/4 Not tainted 3.5.1-dirty #1
Sep 19 16:59:25 spindle warning: [3447698.287522] Call Trace:
Sep 19 16:59:25 spindle warning: [3447698.287754]  <IRQ>  [<ffffffff810af5ba>] __report_bad_irq+0x31/0xc2
Sep 19 16:59:25 spindle warning: [3447698.288031]  [<ffffffff810af84e>] note_interrupt+0x16a/0x1e8
Sep 19 16:59:25 spindle warning: [3447698.288263]  [<ffffffff810ad9d5>] handle_irq_event_percpu+0x163/0x1a5
Sep 19 16:59:25 spindle warning: [3447698.288497]  [<ffffffff810ada4f>] handle_irq_event+0x38/0x55
Sep 19 16:59:25 spindle warning: [3447698.288727]  [<ffffffff810b01a0>] handle_fasteoi_irq+0x78/0xab
Sep 19 16:59:25 spindle warning: [3447698.288960]  [<ffffffff8103631c>] handle_irq+0x24/0x2a
Sep 19 16:59:25 spindle warning: [3447698.289189]  [<ffffffff81036229>] do_IRQ+0x4d/0xb4
Sep 19 16:59:25 spindle warning: [3447698.289419]  [<ffffffff815070e7>] common_interrupt+0x67/0x67
Sep 19 16:59:25 spindle warning: [3447698.289648]  <EOI>  [<ffffffff812ab174>] ? acpi_idle_enter_c1+0xcb/0xf2
Sep 19 16:59:25 spindle warning: [3447698.289919]  [<ffffffff812ab152>] ? acpi_idle_enter_c1+0xa9/0xf2
Sep 19 16:59:25 spindle warning: [3447698.290152]  [<ffffffff813c1446>] cpuidle_enter+0x12/0x14
Sep 19 16:59:25 spindle warning: [3447698.290382]  [<ffffffff813c1902>] cpuidle_idle_call+0xc5/0x175
Sep 19 16:59:25 spindle warning: [3447698.290614]  [<ffffffff8103c2da>] cpu_idle+0x5b/0xa5
Sep 19 16:59:25 spindle warning: [3447698.290844]  [<ffffffff81ad4fcb>] start_secondary+0x1a2/0x1a6
Sep 19 16:59:25 spindle err: [3447698.291074] handlers:
Sep 19 16:59:25 spindle err: [3447698.291294] [<ffffffff8133b9a3>] usb_hcd_irq
Sep 19 16:59:25 spindle emerg: [3447698.291553] Disabling IRQ #16
Sep 19 16:59:25 spindle err: [3447710.888187] arcmsr0: waiting for hw bus reset return, retry=0
Sep 19 16:59:25 spindle err: [3447720.882155] arcmsr0: waiting for hw bus reset return, retry=1
Sep 19 16:59:25 spindle notice: [3447730.896410] Areca RAID Controller0: F/W V1.46 2009-01-06 & Model ARC-1210
Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi  bus reset eh returns with success

This is the first SCSI (that is, um, ATA) bus reset I have *ever* had on
this machine, hence my concern. (The IRQ disable we can ignore: it was
just bad luck that an interrupt destined for the Areca hit after the
controller had briefly vanished from the PCI bus as part of resetting.)
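
As a rough check on the "couple of minutes" estimate, the bracketed
kernel timestamps in the log above can be subtracted directly. The
following is a minimal Python sketch, not part of the original report;
the function names are made up for illustration, and it assumes the
first and last timestamped lines bracket the stall.

```python
import re

# Kernel log lines carry a timestamp in seconds since boot inside square
# brackets, e.g. "[3447524.381843]".
KERNEL_TS = re.compile(r"\[\s*(\d+\.\d+)\]")


def kernel_seconds(line):
    """Return the bracketed kernel timestamp as a float, or None if absent."""
    match = KERNEL_TS.search(line)
    return float(match.group(1)) if match else None


def stall_duration(lines):
    """Seconds elapsed between the first and last timestamped line."""
    stamps = [t for t in map(kernel_seconds, lines) if t is not None]
    return stamps[-1] - stamps[0] if stamps else 0.0


# First abort and final "reset succeeded" lines, copied from the log above.
log = [
    "Sep 19 16:55:47 spindle notice: [3447524.381843] arcmsr0: abort device command of scsi id = 0 lun = 1",
    "Sep 19 16:59:25 spindle err: [3447730.916348] arcmsr: scsi  bus reset eh returns with success",
]

print(f"{stall_duration(log):.1f} s")  # prints "206.5 s"
```

So roughly three and a half minutes passed between the first abort and
the driver reporting a successful bus reset.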

Now just last week another (surge-protected) machine on the same power
main as this one died without warning with a fried power supply, which
apparently roasted the BIOS and/or other motherboard components before
it died (the ACPI DSDT was filled with rubbish, and other things must
have been fried because even with ACPI off Linux wouldn't boot more than
one time out of a hundred, freezing solid at different places in the
boot each time). So my worry level when this SCSI bus reset turned up
today is quite high. It's higher given that the controller logs
(accessed via the Areca binary-only utility for this purpose) show no
sign of any problem at all.

EDAC shows no PCI bus problems and no memory problems, so this probably
*is* the controller.

So... is this a serious problem? Does anyone know if I'm about to lose
this controller, or indeed machine as well? (I really, really hope not.)

I'd write this off as a spurious problem and not report it at all, but
I'm jittery as heck after the catastrophic hardware failure last week,
and when this happens in close proximity, I worry.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-09-19 18:52 Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller? Nix
@ 2012-09-19 20:19 ` Chris Murphy
  2012-09-23 15:41   ` Nix
  2012-09-19 22:30 ` Stan Hoeppner
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2012-09-19 20:19 UTC (permalink / raw)
  To: Linux RAID; +Cc: linux-kernel


On Sep 19, 2012, at 12:52 PM, Nix wrote:

> So I have this x86-64 server running Linux 3.5.1 with a SATA-on-PCIe
> Areca 1210 hardware RAID-5 controller 

Did you find this? Same controller family. Weird that this just shows up now, but perhaps instead of it being "bad hardware" out the gate, something's happened to it and now it's failing as you suspect.

http://www.xtremesystems.org/forums/showthread.php?276187-Raid-Locks-Up


Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-09-19 18:52 Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller? Nix
  2012-09-19 20:19 ` Chris Murphy
@ 2012-09-19 22:30 ` Stan Hoeppner
  2012-09-20  6:51   ` Nix
  1 sibling, 1 reply; 9+ messages in thread
From: Stan Hoeppner @ 2012-09-19 22:30 UTC (permalink / raw)
  To: Nix; +Cc: linux-raid, linux-kernel

On 9/19/2012 1:52 PM, Nix wrote:
> So I have this x86-64 server running Linux 3.5.1 

When did you install 3.5.1 on this machine?  If fairly recently, does it
run without these errors when booted into the previous kernel?



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-09-19 22:30 ` Stan Hoeppner
@ 2012-09-20  6:51   ` Nix
  0 siblings, 0 replies; 9+ messages in thread
From: Nix @ 2012-09-20  6:51 UTC (permalink / raw)
  To: stan; +Cc: linux-raid, linux-kernel

On 19 Sep 2012, Stan Hoeppner stated:

> On 9/19/2012 1:52 PM, Nix wrote:
>> So I have this x86-64 server running Linux 3.5.1 
>
> When did you install 3.5.1 on this machine?

Forty days ago.

>                                              If fairly recently, does it
> run without these errors when booted into the previous kernel?

Well, since this error happened only once, and after thirty-nine days of
uptime at that, I'm not sure how I can find out. :)


-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-09-19 20:19 ` Chris Murphy
@ 2012-09-23 15:41   ` Nix
  2012-10-01 21:33     ` Pierre Beck
  0 siblings, 1 reply; 9+ messages in thread
From: Nix @ 2012-09-23 15:41 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux RAID, linux-kernel

On 19 Sep 2012, Chris Murphy outgrape:

>
> On Sep 19, 2012, at 12:52 PM, Nix wrote:
>
>> So I have this x86-64 server running Linux 3.5.1 with a SATA-on-PCIe
>> Areca 1210 hardware RAID-5 controller 
>
> Did you find this? Same controller family. Weird that this just shows
> up now, but perhaps instead of it being "bad hardware" out the gate,
> something's happened to it and now it's failing as you suspect.

Hm, it's possible I suppose. Just as possible that a disk is dying.


It looks to have been a one-off transient -- no recurrence yet, touch
wood :)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-09-23 15:41   ` Nix
@ 2012-10-01 21:33     ` Pierre Beck
  2012-10-01 22:46       ` Chris Murphy
  2012-10-02  0:10       ` Nix
  0 siblings, 2 replies; 9+ messages in thread
From: Pierre Beck @ 2012-10-01 21:33 UTC (permalink / raw)
  To: Nix; +Cc: Chris Murphy, Linux RAID, linux-kernel

Check the SMART values of the disks if possible. Watch for command
timeouts and the usual bad sector stuff. I've had similar issues with
Adaptec controllers. Bad disks seem to cause havoc. The outstanding
operation isn't answered within the SCSI timeout (default 30 seconds,
set in /sys/block/sdX/device/timeout), so Linux performs a loop reset,
eventually resetting the controller. That means between 60 and 120
seconds of zero I/O operation, varying between controllers and disk
array sizes. It's particularly annoying when in RAID and the disk
could've simply been kicked within a few seconds. Something that needs
improvement IMHO.
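
For anyone who wants to see what that timeout is currently set to, here
is a minimal Python sketch that reads the sysfs attribute named above
for every block device that has one. The function name and the
`sys_block` parameter are illustrative assumptions; only the
/sys/block/sdX/device/timeout path comes from the message itself.

```python
from pathlib import Path


def scsi_timeouts(sys_block="/sys/block"):
    """Map block device name -> SCSI command timeout in seconds.

    Devices that have no device/timeout attribute (loop devices, md
    arrays and so on) are skipped rather than reported as errors.
    """
    timeouts = {}
    root = Path(sys_block)
    if not root.is_dir():
        return timeouts
    for dev in sorted(root.iterdir()):
        attr = dev / "device" / "timeout"
        try:
            timeouts[dev.name] = int(attr.read_text().strip())
        except (OSError, ValueError):
            # Attribute missing, unreadable, or not a plain integer.
            continue
    return timeouts


if __name__ == "__main__":
    for name, seconds in scsi_timeouts().items():
        print(f"{name}: {seconds} s")
```

Note that behind a hardware RAID controller the device listed here is
the exported volume, not the individual member disks.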

On 23.09.2012 17:42, Nix wrote:
> On 19 Sep 2012, Chris Murphy outgrape:
>
>> On Sep 19, 2012, at 12:52 PM, Nix wrote:
>>
>>> So I have this x86-64 server running Linux 3.5.1 with a SATA-on-PCIe
>>> Areca 1210 hardware RAID-5 controller
>> Did you find this? Same controller family. Weird that this just shows
>> up now, but perhaps instead of it being "bad hardware" out the gate,
>> something's happened to it and now it's failing as you suspect.
> Hm, it's possible I suppose. Just as possible that a disk is dying.
>
>
> It looks to have been a one-off transient -- no recurrence yet, touch
> wood :)
>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-10-01 21:33     ` Pierre Beck
@ 2012-10-01 22:46       ` Chris Murphy
  2012-10-01 23:54         ` Pierre Beck
  2012-10-02  0:10       ` Nix
  1 sibling, 1 reply; 9+ messages in thread
From: Chris Murphy @ 2012-10-01 22:46 UTC (permalink / raw)
  To: Linux RAID; +Cc: linux-kernel


On Oct 1, 2012, at 3:33 PM, Pierre Beck wrote:
> It's particularly annoying when in RAID and the disk could've simply been kicked within few seconds. Something that needs improvement IMHO.

Except that while this helps with faster recovery, you're now degraded. You wouldn't want this "fast recovery" behavior if you're at your critical number of disks remaining, or you lose the array upon a few seconds' worth of subsequent problems. So we kinda need context-specific behavior.


Chris Murphy

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-10-01 22:46       ` Chris Murphy
@ 2012-10-01 23:54         ` Pierre Beck
  0 siblings, 0 replies; 9+ messages in thread
From: Pierre Beck @ 2012-10-01 23:54 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Linux RAID, linux-kernel

Yes. When in degraded mode, timeout should be raised to five minutes or 
so. When in clean mode, timeout should be a tunable in milliseconds. 
Commercial RAIDs offer timeouts in ranges like 200ms - 2s. Plus a disk 
which was kicked that way should be scanned for and re-added if 
possible. With write-intent bitmaps, that would make RAIDs with aging 
disks or cables much more solid.

Also, non-degraded mode: Skip loop resets. Skip all resets actually, if 
possible. Just kick the disk. Degraded mode: Perform loop resets as it 
is now. A hung-up controller would then cause an array to degrade, but 
won't hang indefinitely. Granted, always doing loop resets keeps the 
array non-degraded, but a crashed controller is rare whilst failing 
disks are common.

linux-scsi and linux-raid should talk about this one day to make it 
happen. Requires a bit of interfacing between the layers.
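
The context-specific behaviour proposed above could be sketched roughly
as follows. This is an illustrative Python sketch only, not kernel or md
code: `ErrorPolicy` and `choose_policy` are hypothetical names, the
2-second and 300-second defaults are picked from the "200ms - 2s" and
"five minutes or so" figures in the message, and "degraded" is read here
as "no redundancy left", which is exact for RAID-5 but an assumption for
other levels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPolicy:
    timeout_s: float        # how long to wait for an outstanding command
    try_resets: bool        # escalate through resets before giving up?
    kick_on_timeout: bool   # drop the disk from the array on timeout?


def choose_policy(working_disks, min_disks,
                  clean_timeout_s=2.0, degraded_timeout_s=300.0):
    """Pick error handling based on how much redundancy the array has left.

    working_disks: members currently active in the array.
    min_disks: fewest members the array can run on (N-1 for RAID-5).
    """
    if working_disks < min_disks:
        raise ValueError("array has already failed")
    if working_disks > min_disks:
        # Redundancy to spare: fail fast and kick the slow disk.
        return ErrorPolicy(clean_timeout_s, try_resets=False,
                           kick_on_timeout=True)
    # At the critical number of disks: be patient, go through the resets,
    # and do not throw away the last copy of the data.
    return ErrorPolicy(degraded_timeout_s, try_resets=True,
                       kick_on_timeout=False)


# A four-disk RAID-5 needs at least three working members.
print(choose_policy(working_disks=4, min_disks=3))
print(choose_policy(working_disks=3, min_disks=3))
```

The point of keying on remaining redundancy rather than a fixed timeout
is that the same slow disk is cheap to lose in the first case and fatal
to lose in the second.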

On 02.10.2012 00:53, Chris Murphy wrote:
> On Oct 1, 2012, at 3:33 PM, Pierre Beck wrote:
>> It's particularly annoying when in RAID and the disk could've simply been kicked within few seconds. Something that needs improvement IMHO.
> Except that while this helps with faster recovery, you're now degraded. You wouldn't want this "fast recovery" behavior if you're at your critical number of disks remaining or you lose the array upon a few seconds worth of subsequent problems. So we kinda need context specific behavior.
>
>
> Chris Murphy


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Areca hardware RAID / first-ever SCSI bus reset: am I about to lose this disk controller?
  2012-10-01 21:33     ` Pierre Beck
  2012-10-01 22:46       ` Chris Murphy
@ 2012-10-02  0:10       ` Nix
  1 sibling, 0 replies; 9+ messages in thread
From: Nix @ 2012-10-02  0:10 UTC (permalink / raw)
  To: Pierre Beck; +Cc: Chris Murphy, Linux RAID, linux-kernel

On 1 Oct 2012, Pierre Beck stated:
> On 23.09.2012 17:42, Nix wrote:
>> On 19 Sep 2012, Chris Murphy outgrape:
>>
>>> On Sep 19, 2012, at 12:52 PM, Nix wrote:
>>>
>>>> So I have this x86-64 server running Linux 3.5.1 with a SATA-on-PCIe
>>>> Areca 1210 hardware RAID-5 controller
>>> Did you find this? Same controller family. Weird that this just shows
>>> up now, but perhaps instead of it being "bad hardware" out the gate,
>>> something's happened to it and now it's failing as you suspect.
>> Hm, it's possible I suppose. Just as possible that a disk is dying.
>>
>>
>> It looks to have been a one-off transient -- no recurrence yet, touch
>> wood :)
>>
> Check the SMART values of the disks if possible. Watch for command
> timeouts and the usual bad sector stuff. I've had similar issues with
> Adaptec controllers. Bad disks seem to cause havoc. The outstanding
> operation isn't answered within [SCSI Timeout, default 30,
> /sys/block/sdX/device/timeout] seconds, so Linux performs a loop
> reset, eventually resetting the controller. That means between 60 and
> 120 seconds of zero I/O operation, varying between controllers and
> disk array sizes. It's particularly annoying when in RAID and the disk
> could've simply been kicked within few seconds. Something that needs
> improvement IMHO.

The problem has not recurred in more than three weeks. SMART says no
problems... so I guess the controller dropped off the bus for some
reason. Probably some sort of subtle firmware bug or something. (There
are hints in the driver that such bugs exist, hence the enormous amount
of code the driver devotes to resetting the thing when it goes silent.)

-- 
NULL && (void)

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2012-10-02  0:10 UTC | newest]

