linux-btrfs.vger.kernel.org archive mirror
From: Supercilious Dude <supercilious.dude@gmail.com>
To: Chris Murphy <lists@colorremedies.com>
Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>,
	Qu Wenruo <quwenruo.btrfs@gmx.com>
Subject: Re: MD RAID 5/6 vs BTRFS RAID 5/6
Date: Fri, 18 Oct 2019 23:19:47 +0100
Message-ID: <CAGmvKk67D--TSRa-BMnoAEzMEaoDMS9MnVUgun_VEfPEfhT11A@mail.gmail.com>
In-Reply-To: <CAJCQCtR=NQd6uovvAhuTdxRNJtnMFDtkTma9u8-Ep9Nq+YQY=A@mail.gmail.com>

It would be useful to have the ability to scrub only the metadata.
In many cases the data is so large that a full scrub is not feasible.
In my "little" test system of 34TB a full scrub takes many hours and
the IOPS saturate the disks to the extent that the volume is unusable
due to the high latencies. Ideally there would be a way to rate limit
the scrub operation I/Os so that it can happen in the background
without impacting the normal workload.
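
To make the rate-limiting idea concrete, here is a minimal sketch of a
token bucket that a scrub loop could consult before issuing each I/O.
This is not an existing Btrfs interface; the class and parameter names
(TokenBucket, rate, burst) are made up for illustration only.

```python
import time

class TokenBucket:
    """Allow roughly `rate` I/Os per second, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)      # tokens added per second
        self.burst = float(burst)    # maximum tokens that can accumulate
        self.clock = clock           # injectable so the logic can be tested
        self.tokens = float(burst)
        self.last = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last call, capped at burst.
        now = self.clock()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_acquire(self, n=1):
        """Consume n tokens and return True if available, else return False."""
        self._refill()
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

    def delay_for(self, n=1):
        """Seconds to wait until n tokens would be available (0.0 if now)."""
        self._refill()
        return max(0.0, (n - self.tokens) / self.rate)
```

A background scrub would sleep for delay_for() whenever try_acquire()
returns False, which caps its IOPS and leaves headroom for the normal
workload.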


On Fri, 18 Oct 2019 at 21:38, Chris Murphy <lists@colorremedies.com> wrote:
>
> On Wed, Oct 16, 2019 at 10:07 PM Jon Ander MB <jonandermonleon@gmail.com> wrote:
> >
> > It would be interesting to know the pros and cons of this setup that
> > you are suggesting vs zfs.
> > +zfs detects and corrects bitrot (
> > http://www.zfsnas.com/2015/05/24/testing-bit-rot/ )
> > +zfs has working raid56
> > -modules out of kernel for license incompatibilities (a big minus)
> >
> > BTRFS can detect bitrot but... are we sure it can fix it? (can't seem
> > to find any conclusive doc about it right now)
>
> Yes. Active fixups with scrub since 3.19. Passive fixups since 4.12.
>
> > I'm one of those that is waiting for the write hole bug to be fixed in
> > order to use raid5 on my home setup. It's a shame it's taking so long.
>
> For what it's worth, the write hole is considered to be rare.
> https://lwn.net/Articles/665299/
>
> Further, the write hole means a) parity is corrupt or stale compared
> to data stripe elements, which is caused by a crash or powerloss during
> writes, and b) subsequently there is a missing device or bad sector in
> the same stripe as the corrupt/stale parity stripe element. The effect
> of b) is that reconstruction from parity is necessary, and the effect
> of a) is that it's reconstructed incorrectly, thus corruption. But
> Btrfs detects this corruption, whether it's metadata or data. The
> corruption isn't propagated in any case. But it makes the filesystem
> fragile if this happens with metadata. Any parity stripe element
> staleness likely results in significantly bad reconstruction in this
> case, and just can't be worked around; even btrfs check probably can't
> fix it. If the write hole problem happens with a data block group, then
> EIO. But the good news is that this isn't going to result in silent
> data or file system metadata corruption. For sure you'll know about
> it.
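
The a) plus b) sequence above can be sketched with plain XOR parity. This
is a toy model under stated assumptions, not how Btrfs lays out stripes:
two data elements and one parity element, with zlib.crc32 standing in
for the filesystem's checksum. All names are illustrative.

```python
import zlib

def xor_blocks(*blocks):
    """XOR equal-length byte strings together (RAID5-style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# A consistent stripe: two data elements, their parity, and a checksum for d1.
d0, d1 = b"AAAA", b"BBBB"
parity = xor_blocks(d0, d1)
d1_csum = zlib.crc32(d1)

# With consistent parity, a lost d1 is rebuilt correctly.
rebuilt_ok = xor_blocks(d0, parity)

# a) Crash mid-write: d0 is updated on disk, but parity is left stale.
d0 = b"CCCC"

# b) The device holding d1 is then lost; rebuild d1 from d0 and stale parity.
rebuilt_bad = xor_blocks(d0, parity)

# The rebuilt block is wrong, but the checksum mismatch exposes it,
# so the read fails instead of returning bad data silently.
write_hole_detected = zlib.crc32(rebuilt_bad) != d1_csum
```

Here rebuilt_ok equals the original d1, while rebuilt_bad does not and
fails the checksum comparison, which corresponds to the EIO case.
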
>
> This is why scrub after a crash or powerloss with raid56 is important,
> while the array is still whole (not degraded). The two problems with
> that are:
>
> a) the scrub isn't initiated automatically, nor is it obvious to the
> user it's necessary
> b) the scrub can take a long time; Btrfs has no partial scrubbing.
>
> Whereas mdadm arrays offer a write intent bitmap to know what blocks to
> partially scrub, and to trigger it automatically following a crash or
> powerloss.
>
> It seems Btrfs already has enough on-disk metadata to infer a
> functional equivalent to the write intent bitmap, via transid. Just
> scrub the last ~50 generations the next time it's mounted. Either do
> this every time a Btrfs raid56 is mounted, or create some flag that
> allows Btrfs to know if the filesystem was not cleanly shut down. It's
> possible 50 generations could be a lot of data, but since it's an
> online scrub triggered after mount, it wouldn't add much to mount
> times. I'm also picking 50 generations arbitrarily; there's no basis
> for that number.
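
A rough sketch of the generation-window selection described above,
assuming each extent record carries the generation (transid) in which
it was last written. The record layout and the function name are
hypothetical, and the default window of 50 is the same arbitrary
number as in the text.

```python
def extents_to_scrub(extents, current_gen, window=50):
    """Return the extents written within the last `window` generations."""
    cutoff = current_gen - window
    return [e for e in extents if e["gen"] > cutoff]

# Hypothetical extent records: byte offset plus last-written generation.
extents = [
    {"start": 0, "gen": 900},
    {"start": 4096, "gen": 960},
    {"start": 8192, "gen": 1000},
]

# Only the extents touched in generations 951..1000 are selected.
recent = extents_to_scrub(extents, current_gen=1000)
```

The extent at offset 0 is skipped because it was last written before
the window, so a post-crash scrub would only need to touch the rest.
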
>
> The above doesn't cover the case where a partial stripe write (which
> leads to the write hole problem) coincides with a crash or powerloss
> and, at the same time, one or more device failures. In that case
> there's no time for a partial scrub to fix the problem leading to the
> write hole. So even if the corruption is detected, it's too late to
> fix it. But at least an automatic partial scrub, even degraded, would
> mean the user is alerted to the uncorrectable problem before they get
> too far along.
>
>
> --
> Chris Murphy
