From: David Sterba <dsterba@suse.cz>
To: David Sterba <dsterba@suse.com>
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
Date: Tue, 25 Jun 2019 19:47:13 +0200
Message-ID: <20190625174712.GV8917@twin.jikos.cz>
In-Reply-To: <cover.1559917235.git.dsterba@suse.com>

On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> Hi,
> 
> this patchset brings the RAID1 with 3 and 4 copies as a separate
> feature as outlined in V1
> (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
> 
> This should help a bit in the raid56 situation, where the write hole
> hurts most for metadata, without a block group profile that offers 2
> device loss resistance.
> 
> I've gathered some feedback from knowledgeable people on IRC and the
> following setup is considered good enough (certainly better than what
> we have now):
> 
> - data: RAID6
> - metadata: RAID1C3
> 
> RAID1C3 and RAID6 have different characteristics in terms of space
> consumption and repair.
> 
> 
> Space consumption
> ~~~~~~~~~~~~~~~~~
> 
> * RAID6 multiplies the raw metadata usage by N/(N-2), so with more
>   devices the parity overhead ratio is small
> 
> * RAID1C3 will always consume 67% of the raw metadata chunk space for
>   redundancy
> 
> The overall size of metadata is typically in the range of gigabytes to
> hundreds of gigabytes (depending on the use case); a rough estimate is
> 1%-10% of the filesystem size. With larger filesystems the percentage
> is usually smaller.
> 
> So, for the 3-copy raid1 the cost of redundancy is better expressed as
> the absolute number of gigabytes "wasted" on redundancy than as the
> ratio, which does look scary compared to raid6.
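> 
> For illustration (the device count and metadata size below are made-up
> numbers, not measurements), with N=6 devices and 10 GiB of metadata:
> 
>   RAID6:   raw space = 10 GiB * 6/(6-2) = 15 GiB  (33% redundancy)
>   RAID1C3: raw space = 10 GiB * 3       = 30 GiB  (67% redundancy)
> 
> The absolute difference is about 15 GiB of raw space, negligible on a
> multi-terabyte array even though the percentage looks much worse.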
> 
> 
> Repair
> ~~~~~~
> 
> RAID6 needs to access all remaining devices to reconstruct data from
> the P and Q parity, whether 1 or 2 devices are missing.
> 
> RAID1C3 can utilize the independence of each copy and also the way
> RAID1 works in btrfs. In the scenario with 1 missing device, one of the
> 2 remaining correct copies is read and written to the replacement
> device.
> 
> Given how the 2-copy RAID1 works on btrfs, the block groups could be
> spread over several devices so the load during repair would be spread as
> well.
> 
> Additionally, device replace works sequentially and in big chunks so on
> a lightly used system the read pattern is seek-friendly.
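> 
> As a concrete sketch (the devid, target device and mount point below
> are placeholders), repairing a missing device boils down to a regular
> device replace:
> 
>   $ btrfs replace start 2 /dev/sdd /mnt
>   $ btrfs replace status /mnt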
> 
> 
> Compatibility
> ~~~~~~~~~~~~~
> 
> The new block group types cost an incompatibility bit, so an old kernel
> will refuse to mount a filesystem with the RAID1C3 feature, ie. with
> any chunk of the new type on the filesystem.
> 
> To upgrade existing filesystems, use the balance filters, eg. to
> convert metadata from RAID6:
> 
>   $ btrfs balance start -mconvert=raid1c3 /path
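> 
> To reach the setup suggested above (data RAID6, metadata RAID1C3) in
> one pass, the data and metadata filters can be combined, and the new
> incompat bit can be checked afterwards (a sketch, the device path is a
> placeholder):
> 
>   $ btrfs balance start -dconvert=raid6 -mconvert=raid1c3 /path
>   $ btrfs inspect-internal dump-super /dev/sdx | grep incompat_flags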
> 
> 
> Merge target
> ~~~~~~~~~~~~
> 
> I'd like to push that to misc-next for wider testing and merge it to
> 5.3, unless something bad pops up. Given that the code changes are
> small and just add new types with their constraints, while the rest is
> handled by the generic code, I'm not expecting problems that can't be
> fixed before the full release.
> 
> 
> Testing so far
> ~~~~~~~~~~~~~~
> 
> * mkfs with the profiles
> * fstests (no specific tests, only check that it does not break)
> * profile conversions between single/raid1/raid5/raid1c3/raid6/raid1c4
>   with added devices where needed (a sketch follows below)
> * scrub
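> 
> A single conversion step from the list above looks roughly like this
> (device and mount point are placeholders); raid1c4 needs at least 4
> devices, so one may have to be added first:
> 
>   $ btrfs device add /dev/sde /mnt
>   $ btrfs balance start -dconvert=raid1c4 -mconvert=raid1c4 /mnt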
> 
> TODO:
> 
> * 1 missing device followed by repair
> * 2 missing devices followed by repair

Unfortunately neither of the two cases works as expected and I don't have time
to fix it for the 5.3 deadline. As the 3-copy raid1 is supposed to be a
replacement for raid6, I consider the lack of repair capability to be a show
stopper, so the main part of the patchset is postponed.

The test I did was something like this:

- create fs with 3 devices, raid1c3
- fill with some data
- unmount
- wipe 2nd device
- mount degraded
- replace missing
- remount read-write, write data to verify that it works
- unmount
- mount as usual   <-- here it fails and the device is still reported missing

The same happens with 2 missing devices.
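
Roughly, as a shell sketch (device names, mount point and the copied data
are placeholders, and raid1c3 support in btrfs-progs is assumed):

  $ mkfs.btrfs -f -m raid1c3 -d raid1c3 /dev/vdb /dev/vdc /dev/vdd
  $ mount /dev/vdb /mnt
  $ cp -a /usr/share/doc /mnt
  $ umount /mnt
  $ wipefs -a /dev/vdc
  $ mount -o degraded /dev/vdb /mnt
  $ btrfs replace start 2 /dev/vde /mnt      # devid 2 is the wiped device
  $ mount -o remount,rw /mnt
  $ touch /mnt/write-test
  $ umount /mnt
  $ mount /dev/vdb /mnt    # <-- fails, device still reported missing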
