From: Hugo Mills <hugo@carfax.org.uk>
To: waxhead <waxhead@dirtcellar.net>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: Exactly what is wrong with RAID5/6
Date: Tue, 20 Jun 2017 23:25:25 +0000
Message-ID: <20170620232525.GF7140@carfax.org.uk>
In-Reply-To: <1f5a4702-d264-51c6-aadd-d2cf521a45eb@dirtcellar.net>

On Wed, Jun 21, 2017 at 12:57:19AM +0200, waxhead wrote:
> I am trying to piece together the actual status of the RAID5/6 bit of BTRFS.
> The wiki refers to kernel 3.19, which was released in February 2015, so
> I assume that the information there is a tad outdated (the last
> update on the wiki page was July 2016).
> https://btrfs.wiki.kernel.org/index.php/RAID56
> 
> Now there are four problems listed
> 
> 1. Parity may be inconsistent after a crash (the "write hole")
> Is this still true? If yes, wouldn't this apply to RAID1/RAID10
> as well? How was it solved there, and why can't that be done for
> RAID5/6?

   Yes, it's still true, and it's specific to parity RAID, not the
other RAID levels. The issue is (I think) that if you write one block,
that block is replaced immediately, but the other blocks in the stripe
then need to be read so that the parity block can be recalculated
before the new parity is written. There's a read-modify-write cycle
involved which isn't inherent in the non-parity RAID levels (which
would just overwrite both copies).

   One of the proposed solutions for dealing with the write hole in
btrfs's parity RAID is to ensure that any new writes are written to a
completely new stripe. The problem is that this introduces a whole new
level of fragmentation if the FS has lots of small writes (because the
minimum write unit becomes a complete stripe, even for a single-byte
update).
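
   As a rough feel for the cost, assuming a 64 KiB per-device stripe
element (an assumption for illustration; it may not match what btrfs
actually uses):

    STRIPE_ELEMENT = 64 * 1024       # assumed bytes per device per stripe

    def cow_small_write(update_bytes, n_devices, parity_devices=1):
        """Model of 'every write goes to a fresh stripe': the update is
        padded out to a whole new stripe, and the old copy of the data
        becomes garbage to reclaim later."""
        data_devices = n_devices - parity_devices
        allocated = STRIPE_ELEMENT * n_devices
        slack = STRIPE_ELEMENT * data_devices - update_bytes
        return allocated, slack

    # A single 4 KiB update on a 6-device RAID5:
    allocated, slack = cow_small_write(4096, 6)
    print(allocated, slack)   # 393216 bytes allocated, 323584 of them unused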

   There are probably others here who can explain this better. :)

> 2. Parity data is not checksummed
> Why is this a problem? Does it have to do with the design of BTRFS somehow?
> Parity is, after all, just data, and BTRFS does checksum data, so
> what is the reason this is a problem?

   It increases the number of unrecoverable (or not-guaranteed-
recoverable) cases. btrfs's csums are based on individual blocks on
individual devices -- each item of data is independently checksummed
(even if it's a copy of something else). On parity RAID
configurations, if you have a device failure, you've lost a piece of
the parity-protected data. To repair it, you have to recover from n-1
data blocks (which are checksummed), and one parity block (which
isn't). This means that if the parity block happens to have an error
on it, you can't recover cleanly from the device loss, *and you can't
know that an error has happened*.
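
   A toy illustration of that failure mode (made-up structures, not
how btrfs actually stores its csums):

    import zlib
    from functools import reduce

    def xor_blocks(blocks):
        return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

    d0, d1 = b"AAAA", b"BBBB"
    parity = xor_blocks([d0, d1])
    csum_d0 = zlib.crc32(d0)             # data blocks are checksummed...
                                         # ...the parity block is not

    # The device holding d1 dies, and the parity has picked up a bit flip:
    bad_parity = bytes([parity[0] ^ 0x10]) + parity[1:]

    assert zlib.crc32(d0) == csum_d0     # the surviving data still verifies
    rebuilt_d1 = xor_blocks([d0, bad_parity])
    print(rebuilt_d1 == d1)              # False -- nothing vouched for parity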

> 3. No support for discard? (possibly -- needs confirmation with cmason)
> Does this really matter that much? Is there an update on this?
> 
> 4. The algorithm uses as many devices as are available: No support
> for a fixed-width stripe.
> What is the plan for this one? There were patches on the mailing list
> by the SnapRAID author to support up to six parity devices. Will the
> (re)design of btrfs RAID5/6 support a scheme that allows for
> multiple parity devices?

   That's a problem because it limits the practical number of devices
you can use. When the stripe size gets too large, you're having to
read/modify/(re)write every device on an update, even for very small
updates -- as the ratio of data read and rewritten to data actually
updated goes up, the FS has increasingly bad performance. Your
personal limits of what's acceptable will vary, but I'd be surprised
to find anyone with, say, 40 parity RAID devices who finds their
performance acceptable. Limit the stripe width, and you can limit the
performance degradation from lots of devices.
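
   A back-of-envelope way to see the effect (just counting the devices
a small read-modify-write update has to touch; not btrfs internals):

    def devices_touched(n_devices, stripe_width=None, parity=1):
        """Assume the stripe spans min(n_devices, stripe_width) devices and
        that parity is recomputed from the whole stripe: a small write
        rereads the other data members and rewrites the data plus parity."""
        width = n_devices if stripe_width is None else min(n_devices, stripe_width)
        reads = width - 1 - parity       # the data members we didn't change
        writes = 1 + parity              # the changed block plus parity
        return reads + writes

    print(devices_touched(40))                  # 40 -- every single device
    print(devices_touched(40, stripe_width=6))  # 6  -- capped by fixed width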

   Even with a limited stripe width, however, you're still looking at
decreasing reliability as the number of devices increases...
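
   And for a feel of the reliability side, with a crude model that
assumes independent failures at some annual rate p and ignores rebuild
windows entirely:

    from math import comb

    def p_array_failure(n, p, tolerated=1):
        """P(more than `tolerated` of n devices fail), independence assumed."""
        ok = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(tolerated + 1))
        return 1 - ok

    for n in (4, 10, 40):                       # single-parity RAID5
        print(n, round(p_array_failure(n, 0.03), 4))
    # 4 0.0052 / 10 0.0345 / 40 0.3385 -- grows quickly with device count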

   It shouldn't be *massively* hard to implement, but there's a load
of opportunities around managing RAID options in general that would
probably need to be addressed at the same time (e.g. per-subvol RAID
settings, more general RAID parameterisation). It's going to need some
fairly major properties handling, plus rewriting the chunk allocator
and pushing the allocator decisions quite a way up from where they're
currently made.

> I do have a few other questions as well...
> 
> 5. BTRFS still (as of kernel 4.9) does not seem to use the device ID
> to communicate with devices.
> 
> If, on a multi-device filesystem, you yank out a device (for example
> /dev/sdg) and it reappears as, say, /dev/sdx, btrfs will still
> happily try to write to /dev/sdg even if btrfs fi sh /mnt shows the
> correct device ID. What is the status of getting BTRFS to properly
> understand that a device is missing?

   I don't know about this one.

> 6. RAID1 always needs to be able to make two copies. E.g. if you
> have three disks you can lose one and it should still work. What
> about RAID10? If you have, for example, a 6-disk RAID10 array, lose
> one disk and reboot (due to #5 above), will RAID10 recognize that
> the array is now a 5-disk array and stripe+mirror over 2 disks (or
> possibly 2.5 disks?) instead of 3? In other words, will it work as
> long as it can create a RAID10 profile, which requires a minimum of
> four disks?

   Yes. RAID-10 will work on any number of devices (>=4), not just an
even number. Obviously, if you have a 6-device array and lose one,
you'll need to deal with the loss of redundancy -- either add a new
device and rebalance, or replace the missing device with a new one, or
(space permitting) rebalance with existing devices.
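
   If it helps to picture it, here's a toy model of the allocation
idea (assumed behaviour for illustration: take the emptiest devices,
rounded down to an even count, and mirror in pairs; the real btrfs
chunk allocator is more involved than this):

    def place_raid10_chunk(free_space, chunk_bytes):
        """free_space: dict of devid -> bytes free. Returns mirrored pairs."""
        devs = sorted(free_space, key=free_space.get, reverse=True)
        usable = devs[: len(devs) // 2 * 2]          # an even device count
        if len(usable) < 4:
            raise RuntimeError("RAID-10 needs at least four devices")
        per_dev = chunk_bytes // (len(usable) // 2)  # each pair holds one slice
        for d in usable:
            free_space[d] -= per_dev
        return [(usable[i], usable[i + 1]) for i in range(0, len(usable), 2)]

    # Six 100 GiB devices, one of which has gone missing:
    space = {devid: 100 * 2**30 for devid in (1, 2, 4, 5, 6)}
    print(place_raid10_chunk(space, 2 * 2**30))   # [(1, 2), (4, 5)] -- two copies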

   Hugo.

-- 
Hugo Mills             | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4          |                                          Ford Prefect
