* Marvell 88SE9320 SATA controller failure during heavy load
@ 2015-01-11 12:32 Jeroen Van den Keybus
  2015-01-11 17:39 ` Tim Small
  0 siblings, 1 reply; 3+ messages in thread
From: Jeroen Van den Keybus @ 2015-01-11 12:32 UTC (permalink / raw)
  To: linux-ide

I am using 4x 3TB WD HDDs (WD30EFRX-68EUZN0) on a HighPoint 640L board
with a Marvell 88SE9230 AHCI SATA controller. I am using btrfs in RAID5
on these drives. Large copy operations to the disks work fine.
Scrubbing the 4-drive array afterwards reveals 0 errors.

However, when a SMART command is issued during the transfers,
occasionally the following error occurs:

[255341.597723] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[255341.597882] ata7.00: failed command: SMART
[255341.597974] ata7.00: cmd b0/d1:01:01:4f:c2/00:00:00:00:00/00 tag 18 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[255341.598227] ata7.00: status: { DRDY }
[255341.598307] ata7: hard resetting link
[255341.917587] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[255341.919688] ata7.00: configured for UDMA/133
[255341.919773] ata7: EH complete
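
For anyone reading the trace: the "cmd" string is the ATA taskfile of the command that timed out. Below is a minimal sketch for splitting it into named fields. It assumes the usual libata field order (command/feature:nsect:lbal:lbam:lbah, then the high-order bytes, then device), and `decode_tf` is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Split a libata "cmd" taskfile string into named fields.
# Assumed field order, as printed by libata error handling:
#   command/feature:nsect:lbal:lbam:lbah/hob_*:.../device
decode_tf() {
    local tf=$1 cmd rest low dev feature nsect lbal lbam lbah
    cmd=${tf%%/*}      # text before the first "/"  -> command opcode
    rest=${tf#*/}      # everything after the first "/"
    low=${rest%%/*}    # first colon-separated group -> low-order registers
    dev=${tf##*/}      # text after the last "/"    -> device register
    IFS=: read -r feature nsect lbal lbam lbah <<< "$low"
    printf 'command=0x%s feature=0x%s nsect=0x%s lbal=0x%s lbam=0x%s lbah=0x%s device=0x%s\n' \
        "$cmd" "$feature" "$nsect" "$lbal" "$lbam" "$lbah" "$dev"
}

# The taskfile from the SMART timeout above.
decode_tf 'b0/d1:01:01:4f:c2/00:00:00:00:00/00'
```

Here command 0xb0 is SMART, and lbam/lbah 0x4f/0xc2 is the signature every SMART command carries. Feature 0xd1 should be the "read attribute thresholds" subcommand, which `smartctl -a` issues. In the "res" line the first byte is the status register, so 0x40 matches the reported { DRDY }.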

This happens primarily on ata7 but also on ata10 (the 4th drive on the
4-port board). The only other failing command I observed is IDENTIFY
DEVICE. I issued both:

$ sudo smartctl -a /dev/sde

and

$ sudo hddtemp /dev/sde

to trigger these.

Disk activity to the array (the copy operation) suspends completely
until 'EH complete'.  After the incident, the copy operations continue
as if nothing happened. Scrub is also fine. But if I hammer the array
with:

$ for i in {1..10}; do sudo smartctl -a /dev/sde; done

eventually NCQ is also disabled:

[255406.661724] ata7.00: NCQ disabled due to excessive errors
[255406.661745] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[255406.661894] ata7.00: failed command: IDENTIFY DEVICE
[255406.662000] ata7.00: cmd ec/00:01:00:00:00/00:00:00:00:00/00 tag 10 pio 512 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[255406.662253] ata7.00: status: { DRDY }
[255406.662333] ata7: hard resetting link
[255406.989862] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[255406.992147] ata7.00: configured for UDMA/133
[255406.992256] ata7: EH complete
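
To put a number on how often the loop triggers a reset, the kernel log can be tallied per port. This is a rough sketch: `count_resets` is a hypothetical helper, and it assumes the message text is exactly "hard resetting link" as in the traces above.

```shell
#!/usr/bin/env bash
# Count "hard resetting link" events per ATA port in kernel log text on stdin.
count_resets() {
    awk '/hard resetting link/ {
             # Find the "ataN:" token; timestamps may contain spaces,
             # so do not rely on a fixed field position.
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^ata[0-9]+:$/) { sub(/:$/, "", $i); n[$i]++ }
         }
         END { for (p in n) print p, n[p] }' | LC_ALL=C sort
}

# Demo on the two reset lines quoted in this message.
count_resets <<'EOF'
[255341.598307] ata7: hard resetting link
[255341.917587] ata7: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[255406.662333] ata7: hard resetting link
EOF
```

On a live system the input would come from something like `dmesg | count_resets`, run before and after the smartctl loop.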

The driver in use is ahci and the kernel version is 3.18 (on Ubuntu
14.10 server).

I found two reports of a comparable issue at

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=700975 and
https://www.centos.org/forums/viewtopic.php?f=15&t=48964

but in my case the storage system does not break down entirely. People
simply resorted to using another controller, and neither of the two
reports was ever resolved. Though I normally do not use SMART, I
feel uneasy at the prospect of having to rely on a drive array that is
known to have failed hard once.


Thanks for any advice,


Jeroen.


* Re: Marvell 88SE9320 SATA controller failure during heavy load
  2015-01-11 12:32 Marvell 88SE9320 SATA controller failure during heavy load Jeroen Van den Keybus
@ 2015-01-11 17:39 ` Tim Small
  2015-01-12  8:33   ` Jeroen Van den Keybus
  0 siblings, 1 reply; 3+ messages in thread
From: Tim Small @ 2015-01-11 17:39 UTC (permalink / raw)
  To: Jeroen Van den Keybus, linux-ide

On 11/01/15 12:32, Jeroen Van den Keybus wrote:

> when a SMART command is issued during the transfers,
> occasionally the following error occurs:

[...]

> Thanks for any advice,

I've seen several models of WD drive that fail when you issue SMART
commands during heavy I/O load (all older vintages of drive than yours,
but I've largely avoided WD since then, so I couldn't say with much
certainty whether they've fixed these issues in their drive firmware).

I've also seen multiple failures under load with multiple Marvell
controllers (all Marvell AHCI controllers) unless NCQ is disabled.  My
best guess is that this is due to a long-standing controller design fault.
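
For reference, NCQ can be switched off for a single drive at runtime by setting its queue depth to 1 through sysfs. Below is a minimal sketch: `set_queue_depth` is a hypothetical wrapper, and `SYSFS_ROOT` is an assumed override that exists only so the function can be tried against a scratch directory. Neither is part of any existing tool.

```shell
#!/usr/bin/env bash
# Set the queue depth of a disk via sysfs; a depth of 1 means no NCQ queueing.
# SYSFS_ROOT (hypothetical) defaults to /sys and can point at a test directory.
set_queue_depth() {
    local dev=$1 depth=$2
    local attr="${SYSFS_ROOT:-/sys}/block/$dev/device/queue_depth"
    # Refuse early if the attribute is missing or not writable (e.g. not root).
    [ -w "$attr" ] || { echo "cannot write $attr (need root?)" >&2; return 1; }
    echo "$depth" > "$attr"
}

# Example, as root, for the drive used in this thread:
#   set_queue_depth sde 1
```

The boot-time equivalent is the kernel parameter `libata.force=noncq`, or `libata.force=7.00:noncq` to restrict it to the ata7.00 device seen in the traces.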

If I were seeing this problem, and I had a spare PCIe slot, then I'd
swap out the Marvell based controller for either 2x Silicon Image 3132
based cards, or alternatively 2x Asmedia 1061 (e.g. SY-PEX40039), to see
if the problems go away.

I'm using both controller chips in high load situations without problems
at the moment.

Tim.




* Re: Marvell 88SE9320 SATA controller failure during heavy load
  2015-01-11 17:39 ` Tim Small
@ 2015-01-12  8:33   ` Jeroen Van den Keybus
  0 siblings, 0 replies; 3+ messages in thread
From: Jeroen Van den Keybus @ 2015-01-12  8:33 UTC (permalink / raw)
  To: Tim Small; +Cc: linux-ide

> I've seen several models of WD drive  which fail when you issue SMART
> commands during heavy I/O load, (all older vintages of drive than yours,
> but I've largely avoided WD since then so couldn't say with much
> certainty if they've fixed these issues in their drive firmware).

I will check this by attaching the disks to the motherboard SATA ports.

> I've also seen multiple failures under load with multiple Marvell
> controllers (all Marvell AHCI controllers) unless NCQ is disabled.  Best
> guess is that this is due to a long standing controller design fault.

I would still consider the possibility that there's a compatibility
issue with the AHCI driver. Given the popularity of the Marvell chips,
I think it's worth investigating where the problem really is and
ensuring Linux works reliably with these controllers as well.

> If I were seeing this problem, and I had a spare PCIe slot, then I'd
> swap out the Marvell based controller for either 2x Silicon Image 3132
> based cards, or alternatively 2x Asmedia 1061 (e.g. SY-PEX40039), to see
> if the problems go away.

Cannot do that. I have no PCIe slots left.

Thanks,


Jeroen.
