* Idea for new RAID type - background extended recovery information
From: Michael Evans @ 2009-12-09  9:06 UTC
  To: linux-raid

Summary:

A new RAID level between 0 and 1: version tracking plus 'bad sector'
recovery parity.

Rationale:
* For that extra 0.001% assurance that could be the difference between
losing data to a few bad sectors and recovering otherwise valid data.
* The possibility to 'inform' upper layers about stale or bad-checksum
copies of data, allowing better recovery decisions.

Rambling train of thought:

One of the main problems that remains unsolved in current RAID
operation is determining which set of data has gone bad.  The most
obvious choice is to use a data recovery scheme like the one PAR2
uses, which keeps a checksum for every storage segment.  However,
that conflicts with the 'zero it before creation and assume-clean
works' idea.  It also very likely has extremely poor write
performance.  A different approach, though, may be sufficient.
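
As a rough sketch of the per-segment record such a scheme implies
(the struct layout, the chunk size, and the pluggable CRC callback
are all hypothetical, not anything md has today):

#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE 65536    /* a 64k chunk, just for illustration */

struct chunk_csum {
    uint64_t version;   /* bumped on every acknowledged write */
    uint32_t crc;       /* checksum of the chunk's current contents */
};

/* Verify one chunk against its stored record; returns 0 if clean.
 * The CRC routine is passed in to keep the sketch algorithm-neutral. */
static int chunk_verify(const uint8_t *data,
                        const struct chunk_csum *rec,
                        uint32_t (*crc_fn)(const uint8_t *, size_t))
{
    return crc_fn(data, CHUNK_SIZE) == rec->crc ? 0 : -1;
}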

If stripes still in memory are buffered, the parity update might be
deferred.  Additional stripes or an external (hopefully independent)
logging device/file could be provided to record any pending changes.
Any modification which flushes an entire stripe to disk needn't be
logged once all the data has been written, so a separate ring buffer
for that section might be a good performance idea.  Ideally, lots of
small, stripe-clustered changes could be buffered until they could be
combined into a single recalculation and write, or at least until
idle CPU/IO allowed them to be written anyway.
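
A minimal sketch of such a deferral log, assuming a simple in-memory
ring of 'stripe dirty' records; every name here is invented, and this
is not the existing md write-intent bitmap:

#include <stdint.h>

#define LOG_SLOTS 1024

struct pending_stripe {
    uint64_t stripe_nr;     /* stripe whose parity still needs redoing */
    uint64_t dirty_mask;    /* which chunks within the stripe changed */
};

struct parity_log {
    struct pending_stripe slot[LOG_SLOTS];
    unsigned head, tail;    /* ring indices; full-stripe writes skip the log */
};

/* Queue a stripe for later parity recalculation.  Merging entries for
 * the same stripe would give the 'stripe-clustered' combining above. */
static int log_defer(struct parity_log *log, uint64_t stripe, uint64_t mask)
{
    unsigned next = (log->head + 1) % LOG_SLOTS;

    if (next == log->tail)
        return -1;          /* log full: caller must recalculate now */
    log->slot[log->head].stripe_nr = stripe;
    log->slot[log->head].dirty_mask = mask;
    log->head = next;
    return 0;
}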

In addition to the per-stripe approaches, deferring the calculations
might allow the PAR2-style method to work as well.  A second,
extended recovery data-set could be stored, adding to the existing
stripes of whatever type; it would only be updated on explicit
request, or during lulls in activity.  Storing N-1 (or fewer)
recovery units might also allow a copy of that device's blocks, or of
all devices' blocks, to be stored, which would make verification of
data version and consistency easier.  A bolder approach then presents
itself: using the other parity blocks in conjunction with the
extended set.  It would mean far worse on-the-fly recovery, but the
trade-off would be the ability to recover from more partial
disk-failure/unreadable-sector scenarios.  To my mind it seems a good
trade to spend an extra 0.001% of each storage device for that tiny
extra assurance against the case where losing all the normal parity
units plus one bad sector on a data drive cripples everything.
Again, the consistency/version data would make determining which
chunk to replace far easier.

Given the zeroing operation, a sparse (zero-filled) device could be
created and then cleaned up by the first recovery pass with but a
single informational message in the system log: "Detected newly
assembled pre-zeroed device, filling in missing checksum values."
All of the checksums would be identical and could be calculated at
compile time.  The parity values might differ, depending on the
algorithm, but could assuredly be cached at runtime, leading to a
series of easy-to-process asynchronous writes.  Storing ranges of
sparse information would likely defer the write operation until after
all reads are completed anyway.
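
To illustrate why the zeroed case is cheap: the checksum of an
all-zero chunk is a single constant.  A toy CRC is used below as a
stand-in, not as the algorithm a real array would pick:

#include <stddef.h>
#include <stdint.h>

static uint32_t crc32_simple(const uint8_t *buf, size_t len)
{
    uint32_t crc = ~0u;

    while (len--) {
        crc ^= *buf++;
        for (int i = 0; i < 8; i++)
            crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
    }
    return ~crc;
}

/* Computed once (or, as suggested above, at compile time) and reused
 * for every chunk of a freshly zeroed array. */
static uint32_t zero_chunk_crc(void)
{
    static const uint8_t zeros[65536];  /* zero-filled by C rules */

    return crc32_simple(zeros, sizeof(zeros));
}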


* Re: Idea for new RAID type - background extended recovery information
From: Mikael Abrahamsson @ 2009-12-09 10:53 UTC
  To: linux-raid

On Wed, 9 Dec 2009, Michael Evans wrote:

> keeps a checksum for every storage segment.  However that conflicts
> with the 'zero it before creation and assume-clean works' idea.  It
> also very likely has extremely poor write performance.

Generally, my experience has been that total disk failures are fairly
rare; instead, with the much larger disks today, I get single
block/sector failures, meaning 512 bytes (or 4k, I don't remember)
can't be read.  Is there any data to support this?

Would it make sense to add 4k to every 64k raid chunk (non-raid1) for
some kind of "parity" information?  Since I guess all writes involve
re-writing the whole chunk, adding the 4k here shouldn't make write
performance any worse.

The problem I'm trying to address is the raid5 "disk failure and then
a random single block/sector error on the rest of the drives" scenario.

For arrays with few drives this would be much more efficient than going to 
raid6...?

With an 8 disk raid6 on 1TB drives you get 6 TB of usable data; for
an 8 disk raid5p (p for parity, I just made that up) it would be
7*64/68 = 6.59 TB.

For a 6 disk raid6 = 4 TB, and raid5p makes this 5*64/68 = 4.71 TB.

For a 4 disk raid6 = 2 TB, and raid5p makes this 3*64/68 = 2.82 TB.
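
A quick program to check those figures, assuming 1TB drives, one
whole disk of ordinary stripe parity, and the hypothetical 4k of
embedded parity per 64k chunk:

#include <stdio.h>

static double raid5p_tb(int disks)
{
    /* one disk of stripe parity, then 64k usable out of every 68k */
    return (disks - 1) * 64.0 / 68.0;
}

int main(void)
{
    int sizes[] = { 8, 6, 4 };

    for (int i = 0; i < 3; i++)
        printf("%d disks: raid6 %d TB, raid5p %.2f TB\n",
               sizes[i], sizes[i] - 2, raid5p_tb(sizes[i]));
    return 0;
}

which prints 6.59, 4.71 and 2.82 TB against raid6's 6, 4 and 2 TB.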

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se


* Re: Idea for new RAID type - background extended recovery information
From: Michael Evans @ 2009-12-10  0:45 UTC
  To: Mikael Abrahamsson; +Cc: linux-raid

On Wed, Dec 9, 2009 at 2:53 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> Generally, my experience has been that total disk failures are fairly
> rare; instead, with the much larger disks today, I get single
> block/sector failures, meaning 512 bytes (or 4k, I don't remember)
> can't be read.  Is there any data to support this?
>

I agree with this failure mode.  I've seen occasional whole-disk
failures, usually on single-drive systems; far more common are single
sector failures or, in the case of laptops (and possibly also hard
drives running in more seismically active areas), occasional runs of
head crashes.  In the case of a head crash it would be a _VERY_ good
idea to copy the data off first and then recover it, but one would
expect only a moderate volume of poisoned data.  Having a layer to
identify which data is suspect, and potentially provide recovery
information, would be a great idea.  In my use cases I'd probably
dedicate an entire stripe for every 64 stripes that it backs.
Changing any information within that 64-stripe section would change
the parity data, but that layer needn't be updated constantly;
updating during idle periods would be a sufficient safety net, so
long as it was automated.  The list of checksums would of course be
updated in the same operation.

So there would be three basic functions (sketched as an interface
below):

1) Determine whether chunks match their expected checksums.
2) Determine whether chunks are the correct version (to provide upper
layers with atomic storage).
3) Provide low-density recovery data: not enough to protect against a
whole-disk loss, but whatever scale of safety net between 0 and 100%
of a drive is desired.
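
As an interface, those three might look something like this; the ops
table and every name in it are hypothetical, not an existing kernel
API:

#include <stdbool.h>
#include <stdint.h>

struct recovery_layer_ops {
    /* 1) does the chunk match its stored checksum? */
    bool (*chunk_checksum_ok)(uint64_t chunk_nr);

    /* 2) is the chunk the version upper layers expect? */
    bool (*chunk_version_ok)(uint64_t chunk_nr, uint64_t expected);

    /* 3) rebuild a bad chunk from the low-density recovery data;
     * returns 0 on success, negative if beyond the safety net. */
    int (*chunk_recover)(uint64_t chunk_nr);
};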


* Re: Idea for new RAID type - background extended recovery information
From: Kasper Sandberg @ 2009-12-12  7:22 UTC
  To: Mikael Abrahamsson; +Cc: linux-raid

On Wed, 2009-12-09 at 11:53 +0100, Mikael Abrahamsson wrote:
> On Wed, 9 Dec 2009, Michael Evans wrote:
> 
> > keeps a checksum for every storage segment.  However that conflicts
> > with the 'zero it before creation and assume-clean works' idea.  It
> > also very likely has extremely poor write performance.
> 
> Generally, my experience has been that total disk failures are fairly
> rare; instead, with the much larger disks today, I get single
> block/sector failures, meaning 512 bytes (or 4k, I don't remember)
> can't be read.  Is there any data to support this?
> 
> Would it make sense to add 4k to every 64k raid chunk (non-raid1) for
> some kind of "parity" information?  Since I guess all writes involve
> re-writing the whole chunk, adding the 4k here shouldn't make write
> performance any worse.
> 
> The problem I'm trying to address is the raid5 "disk failure and then
> a random single block/sector error on the rest of the drives" scenario.
> 
> For arrays with few drives this would be much more efficient than going to 
> raid6...?
> 
> With an 8 disk raid6 on 1TB drives you get 6 TB of usable data; for
> an 8 disk raid5p (p for parity, I just made that up) it would be
> 7*64/68 = 6.59 TB.

While this could work, I would personally far rather see raid6 gain
all the recovery/sanity options possible.  raid6 effectively has
multiple copies of the same data, and as long as you have >2 copies
you can begin to look at all the data sets and, with pretty good
probability, weed out the bad set.
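
For intuition, with more than two independent reconstructions of a
block you can majority-vote.  A toy sketch of that idea follows (this
is not the real raid6 P/Q syndrome math, which can locate a single
bad device directly):

#include <stddef.h>
#include <string.h>

/* Majority vote over three candidate copies of a block: returns the
 * index of a copy confirmed by at least one other, or -1 if no two
 * copies agree. */
static int vote3(const unsigned char *a, const unsigned char *b,
                 const unsigned char *c, size_t len)
{
    if (!memcmp(a, b, len) || !memcmp(a, c, len))
        return 0;   /* a agrees with a peer, so a is trustworthy */
    if (!memcmp(b, c, len))
        return 1;   /* b and c agree; a is the odd one out */
    return -1;
}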


> 
> For a 6 disk raid6 = 4 TB, and raid5p makes this 5*64/68 = 4.71 TB.
>
> For a 4 disk raid6 = 2 TB, and raid5p makes this 3*64/68 = 2.82 TB.
> 



* Re: Idea for new RAID type - background extended recovery information
From: Michael Evans @ 2009-12-13  3:47 UTC
  To: Kasper Sandberg; +Cc: Mikael Abrahamsson, linux-raid

On Fri, Dec 11, 2009 at 11:22 PM, Kasper Sandberg
<postmaster@metanurb.dk> wrote:
> On Wed, 2009-12-09 at 11:53 +0100, Mikael Abrahamsson wrote:
>> On Wed, 9 Dec 2009, Michael Evans wrote:
>
> While this could work, I would personally far rather see raid6 gain
> all the recovery/sanity options possible.  raid6 effectively has
> multiple copies of the same data, and as long as you have >2 copies
> you can begin to look at all the data sets and, with pretty good
> probability, weed out the bad set.
>

I, however, would still like to have a layer that any storage use,
including other raid levels, could reside within.  Imagine how much
smarter raid6 could be if it already knew in advance which stripes
had gone bad.  Or if files older than a few seconds could also gain
additional 'bad sector' survival, allowing the loss of whatever the
normal raid tolerances are plus a bad sector or two.  It would not
be required, but I believe it would be a good way of adding
assurance to long-term storage segments.

I implore you to comment on the original suggestion, and on my reply
to Mikael's reply as well.


* Re: Idea for new RAID type - background extended recovery information
From: Goswin von Brederlow @ 2009-12-16 13:13 UTC
  To: Michael Evans; +Cc: Kasper Sandberg, Mikael Abrahamsson, linux-raid

Michael Evans <mjevans1983@gmail.com> writes:

> On Fri, Dec 11, 2009 at 11:22 PM, Kasper Sandberg
> <postmaster@metanurb.dk> wrote:
>> On Wed, 2009-12-09 at 11:53 +0100, Mikael Abrahamsson wrote:
>>> On Wed, 9 Dec 2009, Michael Evans wrote:
>>
>> While this could work, I would personally far rather see raid6 gain
>> all the recovery/sanity options possible.  raid6 effectively has
>> multiple copies of the same data, and as long as you have >2 copies
>> you can begin to look at all the data sets and, with pretty good
>> probability, weed out the bad set.
>>
>
> I, however, would still like to have a layer that any storage use,
> including other raid levels, could reside within.  Imagine how much
> smarter raid6 could be if it already knew in advance which stripes
> had gone bad.  Or if files older than a few seconds could also gain
> additional 'bad sector' survival, allowing the loss of whatever the
> normal raid tolerances are plus a bad sector or two.  It would not
> be required, but I believe it would be a good way of adding
> assurance to long-term storage segments.
>
> I implore you to comment on the original suggestion, and on my reply
> to Mikael's reply as well.

I think that really belongs in the filesystem. You don't want to waste
parity on data that isn't in use and you want to be able to connect
bad data with the relevant files easily. So go use zfs or the like. :)

MfG
        Goswin


* Re: Idea for new RAID type - background extended recovery information
From: Michael Evans @ 2009-12-17  1:11 UTC
  To: Goswin von Brederlow; +Cc: Kasper Sandberg, Mikael Abrahamsson, linux-raid

On Wed, Dec 16, 2009 at 5:13 AM, Goswin von Brederlow <goswin-v-b@web.de> wrote:
> Michael Evans <mjevans1983@gmail.com> writes:
>
>> On Fri, Dec 11, 2009 at 11:22 PM, Kasper Sandberg
>> <postmaster@metanurb.dk> wrote:
>>> On Wed, 2009-12-09 at 11:53 +0100, Mikael Abrahamsson wrote:
>>>> On Wed, 9 Dec 2009, Michael Evans wrote:
>>>
>>> While this could work, I would personally far rather see raid6 gain
>>> all the recovery/sanity options possible.  raid6 effectively has
>>> multiple copies of the same data, and as long as you have >2 copies
>>> you can begin to look at all the data sets and, with pretty good
>>> probability, weed out the bad set.
>>>
>>
>> I, however, would still like to have a layer that any storage use,
>> including other raid levels, could reside within.  Imagine how much
>> smarter raid6 could be if it already knew in advance which stripes
>> had gone bad.  Or if files older than a few seconds could also gain
>> additional 'bad sector' survival, allowing the loss of whatever the
>> normal raid tolerances are plus a bad sector or two.  It would not
>> be required, but I believe it would be a good way of adding
>> assurance to long-term storage segments.
>>
>> I implore you to comment on the original suggestion, and on my reply
>> to Mikael's reply as well.
>
> I think that really belongs in the filesystem. You don't want to waste
> parity on data that isn't in use and you want to be able to connect
> bad data with the relevant files easily. So go use zfs or the like. :)
>
> MfG
>        Goswin
>

The same argument can be made against all current levels of RAID as
well.  The primary reason we are still using RAID layers is that the
majority of filesystems currently in use (virtually all of them) lack
the capability.  Additionally, it is likely that even with maturing
filesystems that do support RAID-style storage, we will still need to
rely on the protection of RAID for backwards compatibility.

I do, however, agree that the goals of the current RAID system, and
even potentially the algorithms for creating and recovering parity
blocks, can and should be shared with any portion of the kernel, and
possibly even with userspace via a library abstraction (in the case
of hardware acceleration).  It only adds failure points to have
multiple copies of very similar procedures.
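
As a toy example of the kind of routine such a shared library could
export, a plain XOR parity helper; the name and shape are invented,
and a real backend would swap in accelerated implementations:

#include <stddef.h>
#include <stdint.h>

/* XOR n data buffers of len bytes into parity (a RAID5-style P block). */
static void parity_xor(uint8_t *parity, const uint8_t *const *data,
                       size_t n, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint8_t p = 0;

        for (size_t j = 0; j < n; j++)
            p ^= data[j][i];
        parity[i] = p;
    }
}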

