linux-btrfs.vger.kernel.org archive mirror
* [PATCH 0/4] 3- and 4- copy RAID1
@ 2018-07-13 18:46 David Sterba
  2018-07-13 18:46 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
                   ` (7 more replies)
  0 siblings, 8 replies; 38+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Hi,

I have some goodies that go toward the RAID56 problem; although they do not
implement all the remaining features, they can be useful independently.

This time my hackweek project

https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56

aimed to implement the fix for the write hole problem but I spent more
time with analysis and design of the solution and don't have a working
prototype for that yet.

This patchset brings a feature that will be used by the raid56 log: the
log has to be at the same redundancy level, and thus we need 3-copy
replication for raid6. As it was easy to extend to higher replication,
I've added 4-copy replication as well, which would allow a triple-parity
raid (that does not have a standardized name).

The number of copies is fixed, so it's not N-copy for an arbitrary N.
This would complicate the implementation too much, though I'd be willing
to add a 5-copy replication for a small bribe.

The new raid profiles are covered by an incompatibility bit, called
extended_raid; the (idealistic) plan is to stuff as many new
raid-related features under it as possible. Patch 4/4 mentions the 3- and
4-copy raid1, configurable stripe length, the write hole log and triple
parity. If the plan turns out to be too ambitious, the features that are
ready and implemented will be split out and merged.

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood, but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that describes the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs&m=136286324417767

Switching to this naming would be a good addition to the extended raid.

Regarding the missing raid56 features, I'll continue working on them as
time permits in the following weeks/months, as I'm not aware of anybody
else working on that actively enough, so to speak.

Anyway, git branches with the patches:

kernel: git://github.com/kdave/btrfs-devel dev/extended-raid-ncopies
progs:  git://github.com/kdave/btrfs-progs dev/extended-raid-ncopies

David Sterba (4):
  btrfs: refactor block group replication factor calculation to a helper
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompatibility bit for extended raid features

 fs/btrfs/ctree.h                |  1 +
 fs/btrfs/extent-tree.c          | 45 +++++++-----------
 fs/btrfs/relocation.c           |  1 +
 fs/btrfs/scrub.c                |  4 +-
 fs/btrfs/super.c                | 17 +++----
 fs/btrfs/sysfs.c                |  2 +
 fs/btrfs/volumes.c              | 84 ++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h              |  6 +++
 include/uapi/linux/btrfs.h      | 12 ++++-
 include/uapi/linux/btrfs_tree.h |  6 +++
 10 files changed, 134 insertions(+), 44 deletions(-)

-- 
2.18.0


* [PATCH v2 0/6] RAID1 with 3- and 4- copies
@ 2019-06-10 12:29 David Sterba
  2019-06-10 12:29 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
  0 siblings, 1 reply; 38+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Hi,

this patchset brings the RAID1 with 3 and 4 copies as a separate
feature as outlined in V1
(https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).

This should help a bit in the raid56 situation, where the write hole
hurts most for metadata, as there is currently no block group profile
that offers resistance to the loss of 2 devices.

I've gathered some feedback from knowledgeable people on IRC and the
following setup is considered good enough (certainly better than what we
have now):

- data: RAID6
- metadata: RAID1C3
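
For a new filesystem, such a setup can be created directly at mkfs time; a
minimal sketch, with purely illustrative device names (both raid6 and
raid1c3 need at least 3 devices, so 4 devices leave some slack):

  $ mkfs.btrfs -d raid6 -m raid1c3 /dev/sdb /dev/sdc /dev/sdd /dev/sde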

RAID1C3 and RAID6 have different characteristics in terms of space
consumption and repair.


Space consumption
~~~~~~~~~~~~~~~~~

* RAID6 raw metadata consumption is N/(N-2) times the useful size, so with
  more devices the parity overhead ratio is small

* RAID1C3 will always consume 67% of metadata chunks for redundancy

The overall size of metadata is typically in the range of gigabytes to
hundreds of gigabytes (depending on the use case); a rough estimate is
1%-10% of the filesystem size. With larger filesystems the percentage is
usually smaller.

So, for the 3-copy raid1 the cost of redundancy is better expressed as the
absolute number of gigabytes "wasted" on redundancy than as a ratio, which
does look scary compared to raid6.
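
A back-of-the-envelope example (the numbers are purely illustrative): take
10 GiB of useful metadata on a 6-device filesystem.

  RAID6:   raw consumption = 10 GiB * 6/(6-2) = 15 GiB  (5 GiB of parity)
  RAID1C3: raw consumption = 10 GiB * 3       = 30 GiB  (20 GiB of extra copies)

As a ratio RAID1C3 looks much worse (67% vs ~33% redundancy), but the
absolute difference is about 15 GiB, which is negligible on a
multi-terabyte filesystem.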


Repair
~~~~~~

RAID6 needs to access all available devices to recalculate P and Q,
whether 1 or 2 devices are missing.

RAID1C3 can utilize the independence of each copy and also the way RAID1
works in btrfs. In the scenario with 1 missing device, one of the 2
remaining correct copies is read and written to the repaired device.

Given how the 2-copy RAID1 works on btrfs, the block groups could be
spread over several devices so the load during repair would be spread as
well.

Additionally, device replace works sequentially and in big chunks so on
a lightly used system the read pattern is seek-friendly.


Compatibility
~~~~~~~~~~~~~

The new block group types cost an incompatibility bit, so an old kernel
will refuse to mount a filesystem with the RAID1C3 feature, i.e. one with
any chunk of the new type on it.

To upgrade existing filesystems, use the balance filters, e.g. from RAID6:

  $ btrfs balance start -mconvert=raid1c3 /path
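
A fuller sketch for switching to the setup recommended above (data RAID6,
metadata RAID1C3); the mount point is illustrative and the sysfs feature
file name is an assumption:

  # check that the running kernel advertises the feature (file name assumed)
  $ ls /sys/fs/btrfs/features/ | grep raid1c34

  # convert data and metadata in one balance run
  $ btrfs balance start -dconvert=raid6 -mconvert=raid1c3 /mnt

  # verify the resulting profiles
  $ btrfs filesystem df /mnt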


Merge target
~~~~~~~~~~~~

I'd like to push that to misc-next for wider testing and merge it to 5.3,
unless something bad pops up. Given that the code changes are small, just
the new types with their constraints while the rest is done by the generic
code, I'm not expecting problems that can't be fixed before the full
release.


Testing so far
~~~~~~~~~~~~~~

* mkfs with the profiles
* fstests (no specific tests, only check that it does not break)
* profile conversions between single/raid1/raid5/raid1c3/raid6/raid1c4
  with added devices where needed (an example sequence follows below)
* scrub
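
A sketch of the kind of conversion sequence used above (device and mount
point names are illustrative):

  # add a device so the target profile's constraints can be met
  $ btrfs device add /dev/sdx /mnt

  # convert metadata to the 4-copy profile
  $ btrfs balance start -mconvert=raid1c4 /mnt

  # scrub the result in the foreground
  $ btrfs scrub start -B /mnt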

TODO:

* 1 missing device followed by repair
* 2 missing devices followed by repair


David Sterba (6):
  btrfs: add mask for all RAID1 types
  btrfs: use mask for RAID56 profiles
  btrfs: document BTRFS_MAX_MIRRORS
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompat for raid1 with 3, 4 copies

 fs/btrfs/ctree.h                | 14 ++++++++--
 fs/btrfs/extent-tree.c          | 19 +++++++------
 fs/btrfs/scrub.c                |  2 +-
 fs/btrfs/super.c                |  6 +++++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/volumes.c              | 48 ++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h              |  4 +++
 include/uapi/linux/btrfs.h      |  5 +++-
 include/uapi/linux/btrfs_tree.h | 10 +++++++
 9 files changed, 90 insertions(+), 20 deletions(-)

-- 
2.21.0


* [PATCH v3 0/4] RAID1 with 3- and 4- copies
@ 2019-10-31 15:13 David Sterba
  2019-10-31 18:43 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
  0 siblings, 1 reply; 38+ messages in thread
From: David Sterba @ 2019-10-31 15:13 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Here it goes again: RAID1 with 3 and 4 copies. I found the bug that stopped
it from inclusion last time; it was in the test itself, so the kernel code
is effectively unchanged.

So, with 1 or 2 missing devices, replace by device id works. There's one
annoying thing, but it's not new: when replacing a missing device, some
extra single/dup block groups are created during the replace process.
Example below. This can happen on plain raid1 with a degraded read-write
mount as well.

Now, what's the merge target?

The patches almost made it to 5.3; the changes build on existing code, so
the actual addition of the new profiles is mainly in the definitions and
the additional cases. So it should be safe.

I'm for adding it to the 5.5 queue, though we're at rc5 and this can be
seen as a late time for a feature. The user benefits are noticeable:
raid1c3 can replace raid6 for metadata, which is the most problematic part
and much more complicated to fix otherwise (a write-ahead journal or
something like that). The feedback regarding plain 3-copy as a replacement
was positive on IRC, and there are mails about that too.

Further information can be found in the 5.3-time submission:
https://lore.kernel.org/linux-btrfs/cover.1559917235.git.dsterba@suse.com/

--

Example of 2 devices gone missing and replaced
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 - mkfs -d raid1c3 -m raid1c3 /dev/sda10 /dev/sda11 /dev/sda12

 - delete devices 2 and 3 from the system

              Data      Metadata  System
Id Path       RAID1C3   RAID1C3   RAID1C3  Unallocated
-- ---------- --------- --------- -------- -----------
 1 /dev/sda10   1.00GiB 256.00MiB  8.00MiB     8.74GiB
 2 missing      1.00GiB 256.00MiB  8.00MiB    -1.26GiB
 3 missing      1.00GiB 256.00MiB  8.00MiB    -1.26GiB
-- ---------- --------- --------- -------- -----------
   Total        1.00GiB 256.00MiB  8.00MiB     6.23GiB
   Used       200.31MiB 320.00KiB 16.00KiB

- mount -o degraded

- btrfs replace 2 /dev/sda13

              Data      Metadata  Metadata  System   System
Id Path       RAID1C3   single    RAID1C3   single   RAID1C3 Unallocated
-- ---------- --------- --------- --------- -------- ------- -----------
 1 /dev/sda10   1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB     8.46GiB
 2 /dev/sda13   1.00GiB         - 256.00MiB        - 8.00MiB     8.74GiB
 3 missing      1.00GiB         - 256.00MiB        - 8.00MiB    -1.26GiB
-- ---------- --------- --------- --------- -------- ------- -----------
   Total        1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB    15.95GiB
   Used       200.31MiB     0.00B 320.00KiB 16.00KiB   0.00B


- btrfs replace 3 /dev/sda14

              Data      Metadata  Metadata  System   System
Id Path       RAID1C3   single    RAID1C3   single   RAID1C3 Unallocated
-- ---------- --------- --------- --------- -------- ------- -----------
 1 /dev/sda10   1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB     8.46GiB
 2 /dev/sda13   1.00GiB         - 256.00MiB        - 8.00MiB     8.74GiB
 3 /dev/sda14   1.00GiB         - 256.00MiB        - 8.00MiB     8.74GiB
-- ---------- --------- --------- --------- -------- ------- -----------
   Total        1.00GiB 256.00MiB 256.00MiB 32.00MiB 8.00MiB    25.95GiB
   Used       200.31MiB     0.00B 320.00KiB 16.00KiB   0.00B

There you can see the metadata/single and system/single chunks, which are
otherwise unused if there are no other writes happening during replace.
Running 'balance start -mconvert=raid1c3,profiles=single' should get rid of
them.
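
After that balance, a per-device usage overview can confirm that the stray
single chunks are gone; the exact invocation here is illustrative:

  $ btrfs filesystem usage -T /path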

This is an annoyance; we have a plan to avoid it, but that needs a change
in behaviour with a degraded mount and writes enabled.

Implementation details: The new profiles are reduced from the expected ones
  (raid1 -> single or dup) to allow writes without breaking the raid
  constraints.  To relax that condition, allowing writes to "half" of the
  raid with a missing device would skip creating the extra block groups.

  This is similar to MD-RAID, which allows writing to just one of the RAID1
  devices and then syncing to the other when it's available again.

  With btrfs-style raid1 we can do better in case there are enough other
  devices that would satisfy the raid1 constraint (even with a missing device).

--

David Sterba (4):
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompat for raid1 with 3, 4 copies
  btrfs: drop incompat bit for raid1c34 after last block group is gone

 fs/btrfs/block-group.c          | 27 ++++++++++++++--------
 fs/btrfs/ctree.h                |  7 +++---
 fs/btrfs/super.c                |  4 ++++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/volumes.c              | 40 +++++++++++++++++++++++++++++++--
 fs/btrfs/volumes.h              |  4 ++++
 include/uapi/linux/btrfs.h      |  5 ++++-
 include/uapi/linux/btrfs_tree.h | 10 ++++++++-
 8 files changed, 83 insertions(+), 16 deletions(-)

-- 
2.23.0



Thread overview: 38+ messages
2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
2018-07-13 18:46 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
2018-07-13 18:46 ` [PATCH 1/4] btrfs: refactor block group replication factor calculation to a helper David Sterba
2018-07-13 18:46 ` [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3) David Sterba
2018-07-13 21:02   ` Goffredo Baroncelli
2018-07-17 16:00     ` David Sterba
2018-07-13 18:46 ` [PATCH 3/4] btrfs: add support for 4-copy replication (raid1c4) David Sterba
2018-07-13 18:46 ` [PATCH 4/4] btrfs: add incompatibility bit for extended raid features David Sterba
2018-07-15 14:37 ` [PATCH 0/4] 3- and 4- copy RAID1 waxhead
2018-07-16 18:29   ` Goffredo Baroncelli
2018-07-16 18:49     ` Austin S. Hemmelgarn
2018-07-17 21:12     ` Duncan
2018-07-18  5:59       ` Goffredo Baroncelli
2018-07-18  7:20         ` Duncan
2018-07-18  8:39           ` Duncan
2018-07-18 12:45             ` Austin S. Hemmelgarn
2018-07-18 12:50             ` Hugo Mills
2018-07-19 21:22               ` waxhead
2018-07-18 12:50           ` Austin S. Hemmelgarn
2018-07-18 19:42           ` Goffredo Baroncelli
2018-07-19 11:43             ` Austin S. Hemmelgarn
2018-07-19 17:29               ` Goffredo Baroncelli
2018-07-19 19:10                 ` Austin S. Hemmelgarn
2018-07-20 17:13                   ` Goffredo Baroncelli
2018-07-20 18:33                     ` Austin S. Hemmelgarn
2018-07-20  5:17             ` Andrei Borzenkov
2018-07-20 17:16               ` Goffredo Baroncelli
2018-07-20 18:38                 ` Andrei Borzenkov
2018-07-20 18:41                   ` Hugo Mills
2018-07-20 18:46                     ` Austin S. Hemmelgarn
2018-07-16 21:51   ` waxhead
2018-07-15 14:46 ` Hugo Mills
2018-07-19  7:27 ` Qu Wenruo
2018-07-19 11:47   ` Austin S. Hemmelgarn
2018-07-20 16:42     ` David Sterba
2018-07-20 16:35   ` David Sterba
2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
2019-06-10 12:29 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
2019-10-31 15:13 [PATCH v3 0/4] RAID1 with 3- and 4- copies David Sterba
2019-10-31 18:43 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
