* Possible MD RAID 5 or sata_sil driver issues.
From: Jim Mills @ 2011-09-29 15:54 UTC (permalink / raw)
  To: neilb, neilb, jgarzik, linux-ide

Possible MD RAID 5 or sata_sil driver issues.

Summary:
I created an XFS filesystem on top of a MD RAID5 across 4 SATA drives
connected to a single SIL3114 PCI card.

Problem:  I am seeing errors and corrupted files, as checked by CRC
and PAR2.  This is a brand new filesystem on new drives.  The drives
show no SMART errors.  I have zeroed them out and read every block
with offline SMART tests, badblocks, and even ddrescue.  The problem
also shows up in mismatch_cnt after writing check to sync_action.
Writing repair to sync_action and then running check again does not
clear it.
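
For anyone wanting to repeat the check/mismatch_cnt test, here is a
minimal Python sketch of the sequence described above.  It assumes the
array is exposed under /sys/block/<array>/md (the array name "md0" and
the helper names are only illustrative) and that it runs as root.  It
is not a tested tool, just the steps written out.

```python
from pathlib import Path
import time


def md_dir(array: str, sysfs_root: str = "/sys/block") -> Path:
    """Directory holding the md sysfs attributes for one array."""
    return Path(sysfs_root) / array / "md"


def read_attr(array: str, name: str, sysfs_root: str = "/sys/block") -> str:
    """Read one md sysfs attribute, e.g. sync_action or mismatch_cnt."""
    return (md_dir(array, sysfs_root) / name).read_text().strip()


def start_check(array: str, sysfs_root: str = "/sys/block") -> None:
    """Ask md to start a read-only consistency check of the array."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")


def wait_idle_and_count(array: str, sysfs_root: str = "/sys/block",
                        poll_seconds: float = 10.0) -> int:
    """Poll until sync_action reads 'idle', then return mismatch_cnt."""
    while read_attr(array, "sync_action", sysfs_root) != "idle":
        time.sleep(poll_seconds)
    return int(read_attr(array, "mismatch_cnt", sysfs_root))


if __name__ == "__main__":
    # "md0" is a placeholder; substitute the real array name.
    start_check("md0")
    print("mismatch_cnt:", wait_idle_and_count("md0"))
```

Running it once, then writing repair, then running it again reproduces
the sequence above; a count that stays nonzero across runs is the
symptom being reported.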

I have seen this issue with both XFS and EXT4, so I am assuming it is
not related to the filesystem.  I did note, though, that MD didn't
start recovering after the array was created until a filesystem was
created.

I do not see these issues when using the drives without RAID.

That leaves the card and the md software as the only common pieces,
which is why I am writing to both of you.  It might be something weird
in the interaction of the two.

I have looked at the sata_sil code and don't see an easy way to enable
debugging via insmod.  I have not tried turning on any debugging in
md, and I can't unload it because my root filesystem, among others, is
on an md mirror.

Linux Kernel: SUSE 3.0.4-2-desktop.
SiI 3114 IDE BIOS	4/22/2008	5.5.0.0

Please let me know what additional details would be helpful, and if I
should point this at a particular email distribution.

--
Jim Mills


* Re: Possible MD RAID 5 or sata_sil driver issues.
From: Jim Paris @ 2011-09-29 20:50 UTC (permalink / raw)
  To: Jim Mills; +Cc: neilb, neilb, jgarzik, linux-ide

Just some random input from a bystander:

The md raid5 code and sata_sil drivers can usually be considered
really solid; they're very commonly used and well-tested.

I had similar issues once, with file corruption sometimes showing up
on a MD raid5.  The disks always tested out fine individually (writing
pseudorandom data and reading it back), and they were still fine with
a raid1 across all disks, so I thought it might be raid5 related.  It
turns out that it was actually bad RAM on one of the HDDs, and the
glitch was triggered only by certain access patterns that showed up
while writing to the raid5 array.

It's probably worth looking into hardware issues.  Maybe your power
supply isn't good enough and these particular access patterns trigger
a problem.  Or system RAM could be bad, or maybe your motherboard has
problems with heavy traffic on the PCI bus, etc.

It could help to figure out exactly what the corruption is, by writing
known data to the entire raid5 array and seeing where it differs when
you read it back.
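
A rough sketch of that idea in Python, assuming the whole array (or a
large test file on it) can be overwritten.  The block size, the seed,
and the function names are arbitrary choices for illustration.  Each
block's contents are derived from its index, so a mismatch can be tied
to an exact byte offset.

```python
import hashlib
import os

BLOCK_SIZE = 64 * 1024  # arbitrary; any convenient size works


def pattern_block(index: int, seed: bytes = b"md-test",
                  size: int = BLOCK_SIZE) -> bytes:
    """Deterministic pseudorandom bytes for the block at this index."""
    return hashlib.shake_128(seed + index.to_bytes(8, "little")).digest(size)


def write_pattern(path: str, total_blocks: int,
                  seed: bytes = b"md-test") -> None:
    """Overwrite the first total_blocks blocks of path with the pattern.

    DESTRUCTIVE: path must already exist (device node or file).
    """
    with open(path, "r+b") as f:
        for i in range(total_blocks):
            f.write(pattern_block(i, seed))
        f.flush()
        os.fsync(f.fileno())


def find_mismatches(path: str, total_blocks: int,
                    seed: bytes = b"md-test") -> list:
    """Return the byte offsets of blocks that differ from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for i in range(total_blocks):
            if f.read(BLOCK_SIZE) != pattern_block(i, seed):
                bad.append(i * BLOCK_SIZE)
    return bad
```

Note that a read straight after the write may be served from the page
cache rather than the disks, so it is worth dropping caches (or
rebooting) between the write pass and the verify pass.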

-jim


* Re: Possible MD RAID 5 or sata_sil driver issues.
From: Jim Mills @ 2011-09-30 14:14 UTC (permalink / raw)
  To: Jim Paris; +Cc: neilb, neilb, jgarzik, linux-ide

I was pretty sure both the md and sil drivers are solid; I just didn't
know if there might be some weird interactions between the two.  I
might try blktrace to narrow it down.

When I get some time, I will try downgrading to some other SIL BIOS
versions.  I do see some exceptions for certain hard drives in the SIL
kernel code, so it could be an interaction with my particular drives.

If it is RAM on the hard drive, I have no real idea how to check for
it.  Maybe disabling the read/write cache would minimize the effect.

System RAM is good, and has been tested 24+ hours with memtest+.  This
is an older system, and I am contemplating just building another
system with a motherboard that has many built-in SATA ports.

Is there any easy way to figure out how MD maps an array offset to a
particular drive, rather than me having to do the math manually?
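
For reference, the manual math is fairly short.  The Python sketch
below assumes the left-symmetric layout (md's default for RAID5) and a
512 KiB chunk; both should be confirmed with mdadm --detail.  The slot
numbers it returns are RAID slot indices, not device names, and the
member offset does not include the per-device data offset that
mdadm --examine reports.  The names here are illustrative only.

```python
from typing import NamedTuple


class ChunkLocation(NamedTuple):
    stripe: int         # stripe number within the array
    data_slot: int      # RAID slot holding this data chunk
    parity_slot: int    # RAID slot holding this stripe's parity
    member_offset: int  # byte offset in the member's data area


def locate(array_offset: int, raid_disks: int = 4,
           chunk_bytes: int = 512 * 1024) -> ChunkLocation:
    """Map a byte offset in a left-symmetric RAID5 array to a member."""
    data_disks = raid_disks - 1
    chunk_number, offset_in_chunk = divmod(array_offset, chunk_bytes)
    stripe, data_index = divmod(chunk_number, data_disks)
    # Left-symmetric: parity starts on the last slot and moves one slot
    # toward slot 0 with each stripe; data follows parity and wraps.
    parity_slot = data_disks - (stripe % raid_disks)
    data_slot = (parity_slot + 1 + data_index) % raid_disks
    member_offset = stripe * chunk_bytes + offset_in_chunk
    return ChunkLocation(stripe, data_slot, parity_slot, member_offset)
```

With four drives, the first three chunks land on slots 0, 1 and 2 with
parity on slot 3; the fourth chunk starts stripe 1 on slot 3 with
parity on slot 2.  If the bad offsets found by a write/verify pass all
map to the same slot, that points at one drive or port rather than at
md itself.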




