From: Brian Allen Vanderburg II <brianvanderburg2@aim.com>
To: antlists <antlists@youngman.org.uk>, linux-raid@vger.kernel.org
Subject: Re: Linux raid-like idea
Date: Fri, 11 Sep 2020 16:14:51 -0400	[thread overview]
Message-ID: <38f9595b-963e-b1f5-3c29-ad8981e677a7@aim.com> (raw)
In-Reply-To: <ddd9b5b9-88e6-e730-29f4-30dfafd3a736@youngman.org.uk>


On 9/11/20 3:16 PM, antlists wrote:
>> Yes it is a bit like raid-4 since the data and parity disks are
>> separated.  In fact the idea could be better called a parity backed
>> collection of independently accessed disks. While you would not get the
>> advantage/performance increase of reads/writes going across multiple
>> disks, the idea is primarily targeted to read-heavy applications, so in
>> a typical use, read performance should be no worse than reading directly
>> from a single un-raided disk, except in case of a disk failure where the
>> parity is being used to calculated a block read on a missing disk.
>> Writes would have more overhead since they would also have to
>> calculate/update parity.
>
> Ummm...
>
> So let me word this differently. You're looking at pairing disks up,
> with a filesystem on each pair (data/parity), and then using mergefs
> on top. Compared with simple raid, that looks like a lose-lose
> scenario to me.
>
> A raid-1 will read faster than a single disk, because it optimises
> which disk to read from, and it will write faster too because your
> typical parity calculation for a two-disk scenario is a no-op, which
> might not optimise out.


Not exactly.  You can do data + parity, but you could also do data +
data + parity, or data + data + data + parity.  Or, with more than one
parity disk, data + data + data + data + parity + parity, etc.

Best viewed in a fixed-width font; it probably makes more sense read
from the bottom up:


       /data
         |
    / mergerfs  \
   /             \
/pool1         /pool2         /pool3 (or /home or /usr/local, etc)
   |             |             |
The filesystem built upon the /dev/frX devices can be used however the
user wants.
   |             |             |
----------------------------------------
   |             |             |
ext4 (etc)     ext4 (etc)    (ext4/etc, could in theory even have
multiple partitions and then filesystems)
   |             |             |
Each exposed block device /dev/frX can have a filesystem/partition table
placed on it, which lands directly on the single mapped disk.
Any damage/issues on one data disk would not affect the other data disks
at all.  However, since the collection of data disks also has parity for
them, damage to a data disk can be restored from the parity and the
other data disks.  If something prevents that restore, only the bad data
disks have an issue; the other data disks would still be fully
accessible, and any filesystem on them still intact, since the entire
filesystem on /dev/fr0 lives only on /dev/sda1, and so on.
   |             |             |
----------------------------------------
   |             |             |
/dev/fr0      /dev/fr1      /dev/fr2
   |             |             |
Individual data disks are passed through as fully exposed block devices,
minus any overhead for the 'raid' information/data structures.
A block X on /dev/fr0 maps to block X + offset on /dev/sda1, and so on.
   |             |             |
Raid/parity backed disk layer (data: /dev/sda1=/dev/fr0,
/dev/sdb1=/dev/fr1, /dev/sdc1=/dev/fr2, parity: /dev/sdd1)
   |             |             |
-----------------------------------------------------
   |             |             |                 |
/dev/sda1    /dev/sdb1     /dev/sdc1      /dev/sdd1 (parity)



So basically, at the raid (or parity-backed) layer, multiple disks, not
just a single disk, can be backed by the parity disk (ideally with
support for more than one parity disk as well).  The only difference is,
instead of joining the disks as one block device /dev/md0, each data
disk gets its own block device and so has its own filesystem(s) on it,
independently of the other disks.  A single data disk can be removed
entirely, taken to a different system, and still be read (you would need
to do losetup with an offset to get to the start of the
filesystem/partition table, though), and the other data disks would
still be readable on the original system.  So total loss of a data disk
would not affect the other data disks' files.  In this example, /data
could be missing some files if /pool1 (/dev/sda1) died, but the files on
/pool2 would still be entirely accessible, as would any filesystem on
/dev/sdc1.  There is no performance advantage to such a setup.  The
advantage is that should something really bad happen and it become
impossible to restore some data disk(s), the other disk(s) are still
accessible.

Read from /dev/fr0 = read from /dev/sda1 (adjusted for any overhead/headers)
Read from /dev/fr1 = read from /dev/sdb1 (adjusted for any overhead/headers)
Read from /dev/fr2 = read from /dev/sdc1 (adjusted for any overhead/headers)
Write to /dev/fr0 = write to /dev/sda1 (adjusted for any
overhead/headers) and parity /dev/sdd1
Write to /dev/fr1 = write to /dev/sdb1 (adjusted for any
overhead/headers) and parity /dev/sdd1
Write to /dev/fr2 = write to /dev/sdc1 (adjusted for any
overhead/headers) and parity /dev/sdd1
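
To make that mapping concrete, here is a rough user-space sketch in
Python (plain files standing in for /dev/sdX1 and the parity device; the
class and names are just made up for illustration, and it assumes the
backing files are already sized and zeroed).  A read touches only the
one backing disk; a write does a read-modify-write of the shared parity,
XOR style, like a non-striped raid-4/5 parity update:

BLOCK = 4096  # illustrative block size


class ParityBackedDisk:
    """One /dev/frX: a single data device backed by a shared XOR parity device."""

    def __init__(self, data_dev, parity_dev, offset=0):
        self.data = data_dev      # open file object standing in for e.g. /dev/sda1
        self.parity = parity_dev  # open file object for the shared parity (/dev/sdd1)
        self.offset = offset      # space reserved for the 'raid' headers/metadata

    def read_block(self, blk):
        # Read from /dev/frX == read from its one backing disk,
        # shifted by the header offset.
        self.data.seek(self.offset + blk * BLOCK)
        return self.data.read(BLOCK)

    def write_block(self, blk, new):
        assert len(new) == BLOCK
        # Read-modify-write: parity' = parity XOR old_data XOR new_data,
        # so only the target data disk and the parity disk are touched.
        old = self.read_block(blk)
        self.parity.seek(blk * BLOCK)
        parity = self.parity.read(BLOCK)
        parity = bytes(p ^ o ^ n for p, o, n in zip(parity, old, new))
        self.data.seek(self.offset + blk * BLOCK)
        self.data.write(new)
        self.parity.seek(blk * BLOCK)
        self.parity.write(parity)

The point is just that a write involves exactly two devices (the target
data disk and the parity disk) and a read involves exactly one.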

Read from /dev/fr0 (/dev/sda1 missing) = read from parity and the other
disks, recalculating the original block.
During rebuild, /dev/sdd1 dies as well (unable to rebuild from parity
now, since /dev/sda1 and /dev/sdd1 are both missing):
    Lost: /dev/sda1
    Still present: /dev/sdb1 -- some files from the pool will be missing
since /pool1 is missing, but the files on /pool2 are still present in
their entirety
    Still present: /pool3 (or /home or /usr/local, etc., whatever
/dev/fr2 was used for)
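
The degraded read in that scenario would look like this in the same toy
model (again only a sketch): with /dev/sda1 gone, a block of /dev/fr0 is
rebuilt as the XOR of the parity block and the corresponding block from
every surviving data disk.

BLOCK = 4096  # same illustrative block size as the sketch above


def degraded_read(blk, parity_dev, surviving_data_devs, offset=0):
    """Recompute a missing disk's block: XOR of parity and all other data disks."""
    parity_dev.seek(blk * BLOCK)
    out = bytearray(parity_dev.read(BLOCK))
    for dev in surviving_data_devs:
        dev.seek(offset + blk * BLOCK)
        for i, b in enumerate(dev.read(BLOCK)):
            out[i] ^= b
    return bytes(out)

If the parity disk and a data disk are both gone, that reconstruction is
impossible, but the surviving /dev/frX devices keep working untouched,
which is the whole point of the layout.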

>>
>>> Personally, I'm looking at something like raid-61 as a project. That
>>> would let you survive four disk failures ...
>>
>> Interesting.  I'll check that out more later, but from what it seems so
>> far there is a lot of overhead (10 1TB disks would only be 3TB of data
>> (2x 5 disk arrays mirrors, then raid6 on each leaving 3 disks-worth of
>> data).  My currently solution since I'ts basically just storing bulk
>> data, is mergerfs and snapraid, and from the documents of snapraid, 10
>> 1TB disks would provide 6TB if using 4 for parity.  However it's parity
>> calculations seem to be more complex as well.
>
> Actually no. Don't forget that, as far as linux is concerned, raid-10
> and raid-1+0 are two *completely* *different* things. You can raid-10
> three disks, but you need four for raid-1+0.
>
> You've mis-calculated raid-6+1 - that gives you 6TB for 10 disks (two
> 3TB arrays). I think I would probably get more with raid-61, but every
> time I think about it my brain goes "whoa!!!", and I'll need to start
> concentrating on it to work out exactly what's going on.

That's right, I get the various combinations confused.  So does raid61
allow for losing 4 disks in any order and still recovering, or would
some order of failures mean that losing just 3 disks is already fatal?
Interesting nonetheless, and I'll have to look into it.  Obviously it's
not intended as a replacement for backing up important data but, for me
anyway, just a way to minimize loss of any trivial bulk data/files.

It would be nice if the raid modules had support for methods that could
survive a larger total number of disks lost, in any order, without
losing data.  The snapraid source states that it uses a Cauchy matrix
algorithm which, in theory, could lose up to 6 disks in any order (if
using 6 parity disks) and still be able to restore the data.  I'm not
familiar with the math behind it so can't speak to the accuracy of that
claim.
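
For what it's worth, the usual way to get more than one parity that
tolerates failures in any order is Reed-Solomon style coding over
GF(2^8): md's raid6 uses a P/Q pair, and snapraid extends the same idea
with a Cauchy matrix up to six parities.  A rough sketch of just the
two-parity encode, to show the shape of the math (the generator g = 2
and field polynomial 0x11d below are the common raid6 choices, nothing
specific to this idea):

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with polynomial x^8+x^4+x^3+x^2+1 (0x11d)."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1D
        b >>= 1
    return p


def pq_parity(data_blocks):
    """P = XOR of all blocks; Q = sum over i of g^i * D_i with g = 2."""
    n = len(data_blocks[0])
    p = bytearray(n)
    q = bytearray(n)
    for i, block in enumerate(data_blocks):
        coeff = 1
        for _ in range(i):
            coeff = gf_mul(coeff, 2)  # g^i
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coeff, byte)
    return bytes(p), bytes(q)

With P and Q, any two missing devices (data or parity) can be solved
for; more parities just mean a bigger matrix to invert, which is where
the Cauchy matrix construction comes in.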

>> This is actually the main purpose of the idea.  Due to the data on the
>> disks in a traditional raid5/6 being mapped from multiple disks to a
>> single logical block device, and so the structures of any file systems
>> and their files scattered across all the disks, losing one more than the
>> number of available lost disks would make the entire filesystem(s) and
>> all files virtually unrecoverable.
>
> But raid 5/6 give you much more usable space than a mirror. What I'm
> having trouble getting to grips with in your idea is how is it an
> improvement on a mirror? It looks to me like you're proposing a 2-disk
> raid-4 as the underlying storage medium, with mergefs on top. Which is
> effectively giving you a poorly-performing mirror. A crappy raid-1+0,
> basically.

I do apologize; it seems I'm having a little difficulty clearly
explaining the idea.  Hopefully the chart above explains it better than
I have been.  Imagine raid 5 or 6, but with no striping (so the parity
goes on its own disk or disks), and each data disk passed through as its
own block device.  You lose any performance benefit of striping the
data/parity, but the data stored on any data disk is only on that data
disk, and the same for the others, so losing all parity and a data disk
would not lose the data on the other data disks.

>>
>> By keeping each data disk separate and exposed as it's own block device
>> with some parity backup, each disk contains an entire filesystem(s) on
>> it's own to be used however a user decides.  The loss of one of the
>> disks during a rebuild would not cause full data loss anymore but only
>> of the filesystem(s) on that disk.  The data on the other disks would
>> still be intact and readable, although depending on the user's usage,
>> may be missing files if they used a union/merge filesystem on top of
>> them.  A rebuild would still have the same issues, would have to read
>> all the remaining disks to rebuild the lost disk.  I'm not really sure
>> of any way around that since parity would essentially be calculated as
>> the xor of the same block on all the data disks.
>>
> And as I understand your setup, you also suffer from the same problem
> as raid-10 - lose one disk and you're fine, lose two and it's russian
> roulette whether you can recover your data. raid-6 is *any* two and
> you're fine, raid-61 would be *any* four and you're fine.

Not exactly.  Since the data disks are passed through as individual
block devices instead of joined into a single block device, if you lose
one disk (assuming only one disk of parity) then you are fine.  If you
lose two, then you've only lost the data on the lost data disk(s); the
other data disks would still have their intact filesystems on them.
Depending on how they are used, some files may be missing; e.g. a
mergerfs between two mount points would be missing any files on the lost
mount point, but the other files would still be accessible.


It may or may not (leaning more to probably not) have any use.  I'm
hoping from the above that at least the idea is better understood.  I do
apologize if it's still not clear.


Thanks,


Brian Vanderburg II



