Re: MD RAID 5/6 vs BTRFS RAID 5/6

From: Supercilious Dude <supercilious.dude@gmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>
Subject: Re: MD RAID 5/6 vs BTRFS RAID 5/6
Date: Fri, 18 Oct 2019 23:19:47 +0100	[thread overview]
Message-ID: <CAGmvKk67D--TSRa-BMnoAEzMEaoDMS9MnVUgun_VEfPEfhT11A@mail.gmail.com> (raw)
In-Reply-To: <CAJCQCtR=NQd6uovvAhuTdxRNJtnMFDtkTma9u8-Ep9Nq+YQY=A@mail.gmail.com>

It would be be useful to have the ability to scrub only the metadata.
In many cases the data is so large that a full scrub is not feasible.
In my "little" test system of 34TB a full scrub takes many hours and
the IOPS saturate the disks to the extent that the volume is unusable
due to the high latencies. Ideally there would be a way to rate limit
the scrub operation I/Os so that it can happen in the background
without impacting the normal workload.

On Fri, 18 Oct 2019 at 21:38, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonleon@gmail.com> wrote:
> >
> > It would be interesting to know the pros and cons of this setup that
> > you are suggesting vs zfs.
> > +zfs detects and corrects bitrot (
> > http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
> > +zfs has working raid56
> > -modules out of kernel for license incompatibilities (a big minus)
> >
> > BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
> > to find any conclusive doc about it right now)
>
> Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.
>
> > I'm one of those that is waiting for the write hole bug to be fixed in
> > order to use raid5 on my home setup. It's a shame it's taking so long.
>
> For what it's worth, the write hole is considered to be rare.
> https://lwn.net/Articles/665299/
>
> Further, the write hole means a) parity is corrupt or stale compared
> to data stripe elements which is caused by a crash or powerloss during
> writes, and b) subsequently there is a missing device or bad sector in
> the same stripe as the corrupt/stale parity stripe element. The effect
> of b) is that reconstruction from parity is necessary, and the effect
> of a) is that it's reconstructed incorrectly, thus corruption. But
> Btrfs detects this corruption, whether it's metadata or data. The
> corruption isn't propagated in any case. But it makes the filesystem
> fragile if this happens with metadata. Any parity stripe element
> staleness likely results in significantly bad reconstruction in this
> case, and just can't be worked around, even btrfs check probably can't
> fix it. If the write hole problem happens with data block group, then
> EIO. But the good news is that this isn't going to result in silent
> data or file system metadata corruption. For sure you'll know about
> it.
>
> This is why scrub after a crash or powerloss with raid56 is important,
> while the array is still whole (not degraded). The two problems with
> that are:
>
> a) the scrub isn't initiated automatically, nor is it obvious to the
> user it's necessary
> b) the scrub can take a long time, Btrfs has no partial scrubbing.
>
> Wheras mdadm arrays offer a write intent bitmap to know what blocks to
> partially scrub, and to trigger it automatically following a crash or
> powerloss.
>
> It seems Btrfs already has enough on-disk metadata to infer a
> functional equivalent to the write intent bitmap, via transid. Just
> scrub the last ~50 generations the next time it's mounted. Either do
> this every time a Btrfs raid56 is mounted. Or create some flag that
> allows Btrfs to know if the filesystem was not cleanly shutdown. It's
> possible 50 generations could be a lot of data, but since it's an
> online scrub triggered after mount, it wouldn't add much to mount
> times. I'm also picking 50 generations arbitrarily, there's no basis
> for that number.
>
> The above doesn't cover the case where partial stripe write (which
> leads to write hole problem), and a crash or powerloss, and at the
> same time one or more device failures. In that case there's no time
> for a partial scrub to fix the problem leading to the write hole. So
> even if the corruption is detected, it's too late to fix it. But at
> least an automatic partial scrub, even degraded, will mean the user
> would be flagged of the uncorrectable problem before they get too far
> along.
>
>
> --
> Chris Murphy