* Re: detection/correction of corruption with raid6
@ 2008-12-16 21:58 Piergiorgio Sartor
  2008-12-16 22:25 ` Redeeman
  2008-12-17 14:48 ` Bill Davidsen
  0 siblings, 2 replies; 26+ messages in thread
From: Piergiorgio Sartor @ 2008-12-16 21:58 UTC (permalink / raw)
  To: linux-raid

Hi all,

while I do agree that the issue needs more in-depth thinking,
I would like to tell a recent story that happened to me.

I was testing a RAID-6 array with 7 small HDs.
The intention was to get used to different situations: repair,
grow, fail, remove, etc.

After some playing, I started to check the files on the array
and I found out that they were not (always) correct.
So I started a check of the array, which returned some 1000 or
more mismatches.

After some investigation, I found out that one HD had a "flaky"
interface: data was written correctly, but sometimes, randomly,
reads returned some "wrong" bits (re-cabling solved the issue).

To check this with RAID-6, I could run the check with 6 disks,
7 times, each time with a different disk removed, until one run
returned no mismatches.
At this point, I knew which "data path" was defective.

It would have saved a lot of time, if the check could have
done this automatically...
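
The manual procedure above can be modelled in a few lines. This is only a
toy sketch, not md code: each "disk" holds one byte per stripe, the parity
follows the usual RAID-6 P/Q definition over GF(2^8) (polynomial 0x11d,
generator 2), and the names parity(), consistent_without() and
find_suspects() are made up for this illustration.

```python
# GF(2^8) exponent/log tables for polynomial 0x11d, generator 2.
EXP, LOG = [0] * 510, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def parity(data):
    """Return (P, Q) for one stripe of data bytes."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(EXP[i], d)
    return p, q


def consistent_without(stripe, missing):
    """Check a stripe (data bytes + [P, Q]) while ignoring one disk."""
    *data, p, q = stripe
    n = len(data)
    if missing == n:          # P ignored: only Q can be verified
        return parity(data)[1] == q
    if missing == n + 1:      # Q ignored: only P can be verified
        return parity(data)[0] == p
    # A data disk is ignored: rebuild it from P, then verify against Q.
    rebuilt = list(data)
    rebuilt[missing] = 0
    rebuilt[missing] = p ^ parity(rebuilt)[0]
    return parity(rebuilt)[1] == q


def find_suspects(stripes):
    """Disks whose removal makes every stripe check clean."""
    ndisks = len(stripes[0])
    return [k for k in range(ndisks)
            if all(consistent_without(s, k) for s in stripes)]
```

With a healthy array every disk is returned, because every degraded check
is clean. With one bad data path only that disk is returned. With two or
more bad paths the list is empty, which is the "undetermined" case.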

So my RFE would be, if possible, to try, during a RAID-6 check,
to find out whether an HD has the mismatch, and which one.
Ideally, at the end of the check, the system log should show
how many mismatches, if any, are likely to belong to which HD
or are undetermined.
This would help to diagnose the full data path and reduce
testing time in case of problems.
If only one HD turns out to be problematic, it could be
failed and removed, and the complete cabling, I/F and so on checked.
Of course, this goes beyond the simple "HD failure protection"
scope of RAID, nevertheless I do not see why this possibility
should be neglected, unless it is too complex/difficult to
implement and maintain.

Regarding the possibility of recovery, I have one question:

Why might a RAID system have inconsistencies?
Why do we have a "check" command at all, to run weekly or monthly?

Thanks,

bye,


-- 

piergiorgio sartor




* Re: detection/correction of corruption with raid6
@ 2008-12-19  8:40 piergiorgio.sartor
  2008-12-19 13:10 ` Redeeman
  0 siblings, 1 reply; 26+ messages in thread
From: piergiorgio.sartor @ 2008-12-19  8:40 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid

Hi,

thanks for the answer.
I've still some comments on the topic, see below.

> Suppose we agree that bit flips don't happen (undetected) on drive
> media.  But that bit flips can happen elsewhere (memory, IO bus,
> etc.).
> 
> And then suppose we discover that a bit-flip has happened.  What does
> that tell us?
> Maybe it tells us that our hardware is dodgy.  So it cannot be
> trusted to reliably do anything we tell it.  So maybe we shouldn't
> tell it to do anything. ??

Maybe I should try to clarify the concept.
There are *two* use cases.
One is the "check" and one is the "repair".
As I already wrote, I do agree that "repair" needs some deeper
thinking. It is easy to see cases where it could cause more
damage.
The "check" case is another story.
In the case of RAID-6 I would like, as an RFE, to have in the logs some
report on which "drive" or "data path" the mismatch occurs, when
detectable.
So, if the mismatch count says there are 1024 mismatches, then
it would be nice to know whether they all belong to the same drive or not.
If they do, it would be possible to fail/remove that drive and
check the hardware (change drive/cable/connector/etc.).

Ideally, at the end of the "check", the log should report how
many mismatches there were, how many are "undeterminable" (multiple
drives), and how many could belong to a specific drive.
This will help to diagnose a problem, maybe reported by
the CRC in the filesystem.
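
The kind of end-of-check report described above could look like the
following toy sketch. It is not md code and assumes the usual RAID-6 P/Q
definition over GF(2^8) (polynomial 0x11d, generator 2). The names
make_parity(), classify() and tally() are made up here, and a stripe is
just a list of equally sized data blocks plus its stored P and Q blocks.

```python
from collections import Counter

# GF(2^8) exponent/log tables for polynomial 0x11d, generator 2.
EXP, LOG = [0] * 510, [0] * 256
x = 1
for i in range(255):
    EXP[i], LOG[x] = x, i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 510):
    EXP[i] = EXP[i - 255]


def gf_mul(a, b):
    """Multiply two elements of GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


def make_parity(blocks):
    """Compute the P and Q blocks for a list of data blocks."""
    n = len(blocks[0])
    p, q = bytearray(n), bytearray(n)
    for i, blk in enumerate(blocks):
        for pos in range(n):
            p[pos] ^= blk[pos]
            q[pos] ^= gf_mul(EXP[i], blk[pos])
    return bytes(p), bytes(q)


def classify(blocks, p_blk, q_blk):
    """Return 'ok', 'P', 'Q', 'D<n>' or 'undetermined' for one stripe."""
    cp, cq = make_parity(blocks)
    verdicts = set()
    for pos in range(len(p_blk)):
        dp, dq = p_blk[pos] ^ cp[pos], q_blk[pos] ^ cq[pos]
        if dp == 0 and dq == 0:
            continue                      # this byte position is clean
        if dq == 0:
            verdicts.add("P")             # only P disagrees
        elif dp == 0:
            verdicts.add("Q")             # only Q disagrees
        else:
            z = (LOG[dq] - LOG[dp]) % 255  # candidate data disk index
            verdicts.add(f"D{z}" if z < len(blocks) else "undetermined")
    if not verdicts:
        return "ok"
    # All bad byte positions must blame the same disk, else give up.
    return verdicts.pop() if len(verdicts) == 1 else "undetermined"


def tally(stripes):
    """Count verdicts over (blocks, P, Q) tuples, as a log summary would."""
    return Counter(classify(*s) for s in stripes)
```

A result such as Counter({'D3': 1000, 'undetermined': 24}) would point at
the data path of drive 3. Note that the attribution is only a likelihood:
corruption on several drives can, by chance, look like a single-drive
error, which is one more reason to use it for diagnosis rather than for
automatic repair.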

That is for the "check". As for the "repair", the only possible
change I can see is to offer the user the option to either
"reset the parity" of the array or "recalculate the data", with
the warning that the second one can do more damage than has
already been done. We could ask on this mailing list how many
would like to have this possibility.

The conclusion, for me, is that the "check" should be
cleverer with RAID-6, and "repair/resync" *might* be more
flexible (with warnings).

I take the opportunity to wish you all Merry Christmas
and Happy New Year.

bye,

-- 

pg



* detection/correction of corruption with raid6
@ 2008-12-05 21:00 Redeeman
  2008-12-05 21:02 ` Justin Piszcz
  0 siblings, 1 reply; 26+ messages in thread
From: Redeeman @ 2008-12-05 21:00 UTC (permalink / raw)
  To: linux-raid

Hello.

I was looking at the PDFs linked to from the wiki, and found this:
http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

More specifically, section 4, starting on page 8.

Am I understanding this correctly, in that with raid6, Linux is capable
of detecting if the content on 1 disk is corrupted, and of reconstructing it
from the remaining disks?
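
For reference, the argument in that section can be summarised roughly as
follows. This is a paraphrase under assumed notation (n data blocks D_0 to
D_{n-1}, g a generator of GF(2^8), addition meaning XOR). It describes the
math only and says nothing about what the md driver actually does:

```latex
\begin{align*}
P &= D_0 \oplus D_1 \oplus \cdots \oplus D_{n-1}, \\
Q &= g^{0} D_0 \oplus g^{1} D_1 \oplus \cdots \oplus g^{n-1} D_{n-1}.
\end{align*}
% If data disk z returns X_z instead of D_z, the parities P', Q'
% recomputed from the data actually read differ from the stored ones by
\begin{align*}
P \oplus P' &= D_z \oplus X_z, \\
Q \oplus Q' &= g^{z} \, (D_z \oplus X_z),
\end{align*}
% so the failing disk and its correct content follow as
\[
z = \log_g \frac{Q \oplus Q'}{P \oplus P'}, \qquad
D_z = X_z \oplus (P \oplus P').
\]
```

If only P differs, the P block itself is the suspect, and likewise for Q.
All of this holds only under the assumption that at most one disk in the
stripe returned wrong data.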


mvh.
Kasper Sandberg



end of thread, other threads:[~2008-12-19 13:10 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-12-16 21:58 detection/correction of corruption with raid6 Piergiorgio Sartor
2008-12-16 22:25 ` Redeeman
2008-12-17 21:52   ` Piergiorgio Sartor
2008-12-19  4:39     ` Neil Brown
2008-12-19  5:38       ` Redeeman
2008-12-17 14:48 ` Bill Davidsen
2008-12-17 15:50   ` David Lethe
     [not found]     ` <494960E8.8020407@tmr.com>
2008-12-17 21:47       ` David Lethe
  -- strict thread matches above, loose matches on Subject: below --
2008-12-19  8:40 piergiorgio.sartor
2008-12-19 13:10 ` Redeeman
2008-12-05 21:00 Redeeman
2008-12-05 21:02 ` Justin Piszcz
2008-12-05 21:06   ` Redeeman
2008-12-05 21:09     ` Justin Piszcz
2008-12-05 21:12       ` Redeeman
2008-12-05 21:17         ` Justin Piszcz
2008-12-05 21:30         ` Michał Przyłuski
2008-12-05 22:12           ` Peter Rabbitson
2008-12-05 22:26             ` Michał Przyłuski
2008-12-05 22:43               ` Greg Freemyer
2008-12-06  0:39                 ` Roger Heflin
2008-12-12 15:31           ` Redeeman
2008-12-16  2:33             ` Neil Brown
2008-12-16  6:33               ` Redeeman
2008-12-16  7:59               ` Mattias Wadenstein
2008-12-16 22:20                 ` Chris Worley
