* [PATCH v2 0/6] RAID1 with 3- and 4- copies
@ 2019-06-10 12:29 David Sterba
  2019-06-10 12:29 ` [PATCH v2 1/6] btrfs: add mask for all RAID1 types David Sterba
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Hi,

this patchset brings the RAID1 with 3 and 4 copies as a separate
feature as outlined in V1
(https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).

This should help a bit in the raid56 situation, where the write hole
hurts most for metadata and where we so far lack a block group profile
that offers resistance to the loss of 2 devices.

I've gathered some feedback from knowledgeable people on IRC and the
following setup is considered good enough (certainly better than what we
have now):

- data: RAID6
- metadata: RAID1C3

RAID1C3 and RAID6 have different characteristics in terms of space
consumption and repair.


Space consumption
~~~~~~~~~~~~~~~~~

* RAID6 expands metadata to N/(N-2) of its size in raw space, so with
  more devices the parity overhead ratio is small

* RAID1C3 will always consume 67% of the metadata chunk space for redundancy

The overall size of metadata is typically in the range of gigabytes to
hundreds of gigabytes (depending on the use case); a rough estimate is
1%-10% of the filesystem size. With larger filesystems the percentage is
usually smaller.

So, for the 3-copy raid1 the cost of redundancy is better expressed as
an absolute number of gigabytes "wasted" on redundancy than as a ratio,
which does look scary compared to raid6.
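
As a rough illustration (the numbers are made up, only to show the
scale): with 6 devices and 20 GiB of metadata, RAID6 needs about
20 * 6/4 = 30 GiB of raw space, while RAID1C3 needs 3 * 20 = 60 GiB.
The extra cost of the 3 copies is then roughly 30 GiB, which is small
in absolute terms on a filesystem large enough to span 6 devices.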


Repair
~~~~~~

RAID6 needs to access all of the remaining devices to recalculate the P
and Q parity, whether 1 or 2 devices are missing.

RAID1C3 can utilize the independence of each copy and also the way
RAID1 works in btrfs. In the scenario with 1 missing device, one of the
2 remaining correct copies is read and written to the replacement device.

Given how the 2-copy RAID1 works on btrfs, the block groups could be
spread over several devices so the load during repair would be spread as
well.

Additionally, device replace works sequentially and in big chunks so on
a lightly used system the read pattern is seek-friendly.
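
The repair itself uses the usual replace command, e.g. (the device id
and paths are placeholders):

  $ btrfs replace start <devid-of-missing-device> /dev/<new-device> /path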


Compatibility
~~~~~~~~~~~~~

The new block group types cost an incompatibility bit, so old kernels
will refuse to mount a filesystem with the RAID1C3 feature, i.e. with
any chunk of the new type on the filesystem.

To upgrade existing filesystems, use the balance filters, e.g. to convert metadata from RAID6:

  $ btrfs balance start -mconvert=raid1c3 /path
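
Whether the incompat bit is set can then be checked e.g. with
dump-super (the exact output format may differ):

  $ btrfs inspect-internal dump-super /dev/sda | grep -A 3 incompat_flags

which lists RAID1C34 among the flags once a chunk of the new type
exists.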


Merge target
~~~~~~~~~~~~

I'd like to push that to misc-next for wider testing and merge it for
5.3, unless something bad pops up. Given that the code changes are small,
just new types with their constraints while the rest is handled by the
generic code, I'm not expecting problems that can't be fixed before the
full release.


Testing so far
~~~~~~~~~~~~~~

* mkfs with the profiles
* fstests (no specific tests, only check that it does not break)
* profile conversions between single/raid1/raid5/raid1c3/raid6/raid1c4
  with added devices where needed
* scrub

TODO:

* 1 missing device followed by repair
* 2 missing devices followed by repair


David Sterba (6):
  btrfs: add mask for all RAID1 types
  btrfs: use mask for RAID56 profiles
  btrfs: document BTRFS_MAX_MIRRORS
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompat for raid1 with 3, 4 copies

 fs/btrfs/ctree.h                | 14 ++++++++--
 fs/btrfs/extent-tree.c          | 19 +++++++------
 fs/btrfs/scrub.c                |  2 +-
 fs/btrfs/super.c                |  6 +++++
 fs/btrfs/sysfs.c                |  2 ++
 fs/btrfs/volumes.c              | 48 ++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h              |  4 +++
 include/uapi/linux/btrfs.h      |  5 +++-
 include/uapi/linux/btrfs_tree.h | 10 +++++++
 9 files changed, 90 insertions(+), 20 deletions(-)

-- 
2.21.0



* [PATCH v2 1/6] btrfs: add mask for all RAID1 types
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:29 ` [PATCH v2 2/6] btrfs: use mask for RAID56 profiles David Sterba
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Preparatory patch for additional RAID1 profiles with more copies. The
mask will contain 3-copy and 4-copy, most of the checks for plain RAID1
work the same for the other profiles.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c          | 8 ++++----
 fs/btrfs/scrub.c                | 2 +-
 fs/btrfs/volumes.c              | 8 ++++----
 include/uapi/linux/btrfs_tree.h | 2 ++
 4 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 96628eb4b389..3413f78a3032 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7873,7 +7873,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		 */
 		if (!block_group_bits(block_group, flags)) {
 			u64 extra = BTRFS_BLOCK_GROUP_DUP |
-				BTRFS_BLOCK_GROUP_RAID1 |
+				BTRFS_BLOCK_GROUP_RAID1_MASK |
 				BTRFS_BLOCK_GROUP_RAID5 |
 				BTRFS_BLOCK_GROUP_RAID6 |
 				BTRFS_BLOCK_GROUP_RAID10;
@@ -9564,7 +9564,7 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 
 	stripped = BTRFS_BLOCK_GROUP_RAID0 |
 		BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
-		BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;
+		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
 
 	if (num_devices == 1) {
 		stripped |= BTRFS_BLOCK_GROUP_DUP;
@@ -9575,7 +9575,7 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 			return stripped;
 
 		/* turn mirroring into duplication */
-		if (flags & (BTRFS_BLOCK_GROUP_RAID1 |
+		if (flags & (BTRFS_BLOCK_GROUP_RAID1_MASK |
 			     BTRFS_BLOCK_GROUP_RAID10))
 			return stripped | BTRFS_BLOCK_GROUP_DUP;
 	} else {
@@ -10445,7 +10445,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 	list_for_each_entry_rcu(space_info, &info->space_info, list) {
 		if (!(get_alloc_profile(info, space_info->flags) &
 		      (BTRFS_BLOCK_GROUP_RAID10 |
-		       BTRFS_BLOCK_GROUP_RAID1 |
+		       BTRFS_BLOCK_GROUP_RAID1_MASK |
 		       BTRFS_BLOCK_GROUP_RAID5 |
 		       BTRFS_BLOCK_GROUP_RAID6 |
 		       BTRFS_BLOCK_GROUP_DUP)))
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 9f0297d529d4..0c99cf9fb595 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3091,7 +3091,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		offset = map->stripe_len * (num / map->sub_stripes);
 		increment = map->stripe_len * factor;
 		mirror_num = num % map->sub_stripes + 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1_MASK) {
 		increment = map->stripe_len;
 		mirror_num = num % map->num_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 776f5c7ca7c5..9e5167a0e406 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5400,7 +5400,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		return 1;
 
 	map = em->map_lookup;
-	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
+	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK))
 		ret = map->num_stripes;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		ret = map->sub_stripes;
@@ -5474,7 +5474,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	struct btrfs_device *srcdev;
 
 	ASSERT((map->type &
-		 (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
+		 (BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10)));
 
 	if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		num_stripes = map->sub_stripes;
@@ -5663,7 +5663,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 					      &remaining_stripes);
 		div_u64_rem(stripe_nr_end - 1, factor, &last_stripe);
 		last_stripe *= sub_stripes;
-	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1_MASK |
 				BTRFS_BLOCK_GROUP_DUP)) {
 		num_stripes = map->num_stripes;
 	} else {
@@ -6035,7 +6035,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 				&stripe_index);
 		if (!need_full_stripe(op))
 			mirror_num = 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1_MASK) {
 		if (need_full_stripe(op))
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 421239b98db2..34d5b34286fa 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -866,6 +866,8 @@ enum btrfs_raid_types {
 #define BTRFS_BLOCK_GROUP_RAID56_MASK	(BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6)
 
+#define BTRFS_BLOCK_GROUP_RAID1_MASK	(BTRFS_BLOCK_GROUP_RAID1)
+
 /*
  * We need a bit for restriper to be able to tell when chunks of type
  * SINGLE are available.  This "extended" profile format is used in
-- 
2.21.0



* [PATCH v2 2/6] btrfs: use mask for RAID56 profiles
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
  2019-06-10 12:29 ` [PATCH v2 1/6] btrfs: add mask for all RAID1 types David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:29 ` [PATCH v2 3/6] btrfs: document BTRFS_MAX_MIRRORS David Sterba
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

We don't need to enumerate the profiles, use the mask for consistency.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3413f78a3032..556deb03c215 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7874,8 +7874,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		if (!block_group_bits(block_group, flags)) {
 			u64 extra = BTRFS_BLOCK_GROUP_DUP |
 				BTRFS_BLOCK_GROUP_RAID1_MASK |
-				BTRFS_BLOCK_GROUP_RAID5 |
-				BTRFS_BLOCK_GROUP_RAID6 |
+				BTRFS_BLOCK_GROUP_RAID56_MASK |
 				BTRFS_BLOCK_GROUP_RAID10;
 
 			/*
@@ -9562,8 +9561,7 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 
 	num_devices = fs_info->fs_devices->rw_devices;
 
-	stripped = BTRFS_BLOCK_GROUP_RAID0 |
-		BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
+	stripped = BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID56_MASK |
 		BTRFS_BLOCK_GROUP_RAID1_MASK | BTRFS_BLOCK_GROUP_RAID10;
 
 	if (num_devices == 1) {
@@ -10446,8 +10444,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		if (!(get_alloc_profile(info, space_info->flags) &
 		      (BTRFS_BLOCK_GROUP_RAID10 |
 		       BTRFS_BLOCK_GROUP_RAID1_MASK |
-		       BTRFS_BLOCK_GROUP_RAID5 |
-		       BTRFS_BLOCK_GROUP_RAID6 |
+		       BTRFS_BLOCK_GROUP_RAID56_MASK |
 		       BTRFS_BLOCK_GROUP_DUP)))
 			continue;
 		/*
-- 
2.21.0



* [PATCH v2 3/6] btrfs: document BTRFS_MAX_MIRRORS
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
  2019-06-10 12:29 ` [PATCH v2 1/6] btrfs: add mask for all RAID1 types David Sterba
  2019-06-10 12:29 ` [PATCH v2 2/6] btrfs: use mask for RAID56 profiles David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:29 ` [PATCH v2 4/6] btrfs: add support for 3-copy replication (raid1c3) David Sterba
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

The real meaning of the constant is not clear from the context, as it
also counts the dev-replace target device.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ctree.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fb7435533478..31198499f175 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -46,7 +46,16 @@ struct btrfs_ref;
 
 #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null */
 
-#define BTRFS_MAX_MIRRORS 3
+/*
+ * Maximum number of mirrors that can be available for all profiles counting
+ * the target device of dev-replace as one. During an active device replace
+ * procedure, the target device of the copy operation is a mirror for the
+ * filesystem data as well that can be used to read data in order to repair
+ * read errors on other disks.
+ *
+ * Current value is derived from RAID1 with 2 copies.
+ */
+#define BTRFS_MAX_MIRRORS (2 + 1)
 
 #define BTRFS_MAX_LEVEL 8
 
-- 
2.21.0



* [PATCH v2 4/6] btrfs: add support for 3-copy replication (raid1c3)
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
                   ` (2 preceding siblings ...)
  2019-06-10 12:29 ` [PATCH v2 3/6] btrfs: document BTRFS_MAX_MIRRORS David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:29 ` [PATCH v2 5/6] btrfs: add support for 4-copy replication (raid1c4) David Sterba
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Add a new block group profile to store 3 copies in a similar way to
what the current RAID1 does. The profile name is temporary and may
change in the future.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ctree.h                |  4 ++--
 fs/btrfs/extent-tree.c          |  2 ++
 fs/btrfs/super.c                |  3 +++
 fs/btrfs/volumes.c              | 19 +++++++++++++++++--
 fs/btrfs/volumes.h              |  2 ++
 include/uapi/linux/btrfs.h      |  3 ++-
 include/uapi/linux/btrfs_tree.h |  6 +++++-
 7 files changed, 33 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 31198499f175..fbb7c4cf41e9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -53,9 +53,9 @@ struct btrfs_ref;
  * filesystem data as well that can be used to read data in order to repair
  * read errors on other disks.
  *
- * Current value is derived from RAID1 with 2 copies.
+ * Current value is derived from RAID1C3 with 3 copies.
  */
-#define BTRFS_MAX_MIRRORS (2 + 1)
+#define BTRFS_MAX_MIRRORS (3 + 1)
 
 #define BTRFS_MAX_LEVEL 8
 
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 556deb03c215..a4afd3a18c27 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9884,6 +9884,8 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		min_free >>= 1;
 	} else if (index == BTRFS_RAID_RAID1) {
 		dev_min = 2;
+	} else if (index == BTRFS_RAID_RAID1C3) {
+		dev_min = 3;
 	} else if (index == BTRFS_RAID_DUP) {
 		/* Multiply by 2 */
 		min_free <<= 1;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 6e196b8a0820..a51469ae1c89 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1933,6 +1933,9 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
 		min_stripes = 2;
 		num_stripes = 2;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
+		min_stripes = 3;
+		num_stripes = 3;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 9e5167a0e406..c929eb46f814 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -56,6 +56,18 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
 	},
+	[BTRFS_RAID_RAID1C3] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 3,
+		.tolerated_failures = 2,
+		.devs_increment	= 3,
+		.ncopies	= 3,
+		.raid_name	= "raid1c3",
+		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C3,
+		.mindev_error	= BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
+	},
 	[BTRFS_RAID_DUP] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 2,
@@ -5050,8 +5062,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
-	/* round down to number of usable stripes */
-	ndevs = round_down(ndevs, devs_increment);
+	/*
+	 * Round down to number of usable stripes, devs_increment can be any
+	 * number so we can't use round_down()
+	 */
+	ndevs -= ndevs % devs_increment;
 
 	if (ndevs < devs_min) {
 		ret = -ENOSPC;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fea7b65a712e..9d5ea7fae1ab 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -549,6 +549,8 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 		return BTRFS_RAID_RAID10;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
 		return BTRFS_RAID_RAID1;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+		return BTRFS_RAID_RAID1C3;
 	else if (flags & BTRFS_BLOCK_GROUP_DUP)
 		return BTRFS_RAID_DUP;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index c195896d478f..b03b386cfe2b 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -826,7 +826,8 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_TGT_REPLACE,
 	BTRFS_ERROR_DEV_MISSING_NOT_FOUND,
 	BTRFS_ERROR_DEV_ONLY_WRITABLE,
-	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
+	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
+	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
 };
 
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 34d5b34286fa..8ee1092332ed 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -839,6 +839,7 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5         (1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
+#define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -850,6 +851,7 @@ enum btrfs_raid_types {
 	BTRFS_RAID_SINGLE,
 	BTRFS_RAID_RAID5,
 	BTRFS_RAID_RAID6,
+	BTRFS_RAID_RAID1C3,
 	BTRFS_NR_RAID_TYPES
 };
 
@@ -859,6 +861,7 @@ enum btrfs_raid_types {
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
+					 BTRFS_BLOCK_GROUP_RAID1C3 | \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
 					 BTRFS_BLOCK_GROUP_DUP |     \
@@ -866,7 +869,8 @@ enum btrfs_raid_types {
 #define BTRFS_BLOCK_GROUP_RAID56_MASK	(BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6)
 
-#define BTRFS_BLOCK_GROUP_RAID1_MASK	(BTRFS_BLOCK_GROUP_RAID1)
+#define BTRFS_BLOCK_GROUP_RAID1_MASK	(BTRFS_BLOCK_GROUP_RAID1 |   \
+					 BTRFS_BLOCK_GROUP_RAID1C3)
 
 /*
  * We need a bit for restriper to be able to tell when chunks of type
-- 
2.21.0



* [PATCH v2 5/6] btrfs: add support for 4-copy replication (raid1c4)
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
                   ` (3 preceding siblings ...)
  2019-06-10 12:29 ` [PATCH v2 4/6] btrfs: add support for 3-copy replication (raid1c3) David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:29 ` [PATCH v2 6/6] btrfs: add incompat for raid1 with 3, 4 copies David Sterba
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Add a new block group profile to store 4 copies in a similar way to
what the current RAID1 does. The profile name is temporary and may
change in the future.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ctree.h                |  4 ++--
 fs/btrfs/super.c                |  3 +++
 fs/btrfs/volumes.c              | 12 ++++++++++++
 fs/btrfs/volumes.h              |  2 ++
 include/uapi/linux/btrfs.h      |  1 +
 include/uapi/linux/btrfs_tree.h |  6 +++++-
 6 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index fbb7c4cf41e9..5a17d97d81fd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -53,9 +53,9 @@ struct btrfs_ref;
  * filesystem data as well that can be used to read data in order to repair
  * read errors on other disks.
  *
- * Current value is derived from RAID1C3 with 3 copies.
+ * Current value is derived from RAID1C4 with 4 copies.
  */
-#define BTRFS_MAX_MIRRORS (3 + 1)
+#define BTRFS_MAX_MIRRORS (4 + 1)
 
 #define BTRFS_MAX_LEVEL 8
 
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index a51469ae1c89..28fcb5868160 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1936,6 +1936,9 @@ static inline int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
 		min_stripes = 3;
 		num_stripes = 3;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID1C4) {
+		min_stripes = 4;
+		num_stripes = 4;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c929eb46f814..95dab3012377 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -68,6 +68,18 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C3,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
 	},
+	[BTRFS_RAID_RAID1C4] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 4,
+		.tolerated_failures = 3,
+		.devs_increment	= 4,
+		.ncopies	= 4,
+		.raid_name	= "raid1c4",
+		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C4,
+		.mindev_error	= BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
+	},
 	[BTRFS_RAID_DUP] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 2,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 9d5ea7fae1ab..ff314c5dc772 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -551,6 +551,8 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 		return BTRFS_RAID_RAID1;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
 		return BTRFS_RAID_RAID1C3;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+		return BTRFS_RAID_RAID1C4;
 	else if (flags & BTRFS_BLOCK_GROUP_DUP)
 		return BTRFS_RAID_DUP;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b03b386cfe2b..4e32b161adfa 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -828,6 +828,7 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_ONLY_WRITABLE,
 	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
 	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
+	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
 };
 
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index 8ee1092332ed..16640355700d 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -840,6 +840,7 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID5         (1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
+#define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -852,6 +853,7 @@ enum btrfs_raid_types {
 	BTRFS_RAID_RAID5,
 	BTRFS_RAID_RAID6,
 	BTRFS_RAID_RAID1C3,
+	BTRFS_RAID_RAID1C4,
 	BTRFS_NR_RAID_TYPES
 };
 
@@ -862,6 +864,7 @@ enum btrfs_raid_types {
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
 					 BTRFS_BLOCK_GROUP_RAID1C3 | \
+					 BTRFS_BLOCK_GROUP_RAID1C4 | \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
 					 BTRFS_BLOCK_GROUP_DUP |     \
@@ -870,7 +873,8 @@ enum btrfs_raid_types {
 					 BTRFS_BLOCK_GROUP_RAID6)
 
 #define BTRFS_BLOCK_GROUP_RAID1_MASK	(BTRFS_BLOCK_GROUP_RAID1 |   \
-					 BTRFS_BLOCK_GROUP_RAID1C3)
+					 BTRFS_BLOCK_GROUP_RAID1C3 | \
+					 BTRFS_BLOCK_GROUP_RAID1C4)
 
 /*
  * We need a bit for restriper to be able to tell when chunks of type
-- 
2.21.0



* [PATCH v2 6/6] btrfs: add incompat for raid1 with 3, 4 copies
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
                   ` (4 preceding siblings ...)
  2019-06-10 12:29 ` [PATCH v2 5/6] btrfs: add support for 4-copy replication (raid1c4) David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:29 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

The RAID1 profiles with more copies will be identified by the incompat
bit 'raid1c34'.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ctree.h           | 3 ++-
 fs/btrfs/sysfs.c           | 2 ++
 fs/btrfs/volumes.c         | 9 +++++++++
 include/uapi/linux/btrfs.h | 1 +
 4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a17d97d81fd..12984c780ee9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -292,7 +292,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES	|	\
-	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
+	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID	|	\
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 2f078b77fe14..957383d759e0 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -193,6 +193,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_INCOMPAT(metadata_uuid, METADATA_UUID);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
+BTRFS_FEAT_ATTR_INCOMPAT(raid1c34, RAID1C34);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -207,6 +208,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(metadata_uuid),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
+	BTRFS_FEAT_ATTR_PTR(raid1c34),
 	NULL
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 95dab3012377..ffc895e2fc96 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4929,6 +4929,14 @@ static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 	btrfs_set_fs_incompat(info, RAID56);
 }
 
+static void check_raid1c34_incompat_flag(struct btrfs_fs_info *info, u64 type)
+{
+	if (!(type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)))
+		return;
+
+	btrfs_set_fs_incompat(info, RAID1C34);
+}
+
 static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			       u64 start, u64 type)
 {
@@ -5194,6 +5202,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 
 	free_extent_map(em);
 	check_raid56_incompat_flag(info, type);
+	check_raid1c34_incompat_flag(info, type);
 
 	kfree(devices_info);
 	return 0;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 4e32b161adfa..6ebf246e96ff 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -270,6 +270,7 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID	(1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
-- 
2.21.0



* [PATCH] btrfs-progs: add support for raid1c3 and raid1c4
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
                   ` (5 preceding siblings ...)
  2019-06-10 12:29 ` [PATCH v2 6/6] btrfs: add incompat for raid1 with 3, 4 copies David Sterba
@ 2019-06-10 12:29 ` David Sterba
  2019-06-10 12:42 ` [PATCH v2 0/6] RAID1 with 3- and 4- copies Hugo Mills
  2019-06-25 17:47 ` David Sterba
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-10 12:29 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

$ ./mkfs.btrfs -m raid1c4 -d raid1c3 /dev/sd[abcd]

Label:              (null)
UUID:               f1f988ab-6750-4bc2-957b-98a4ebe98631
Node size:          16384
Sector size:        4096
Filesystem size:    8.00GiB
Block group profiles:
  Data:             RAID1C3         273.06MiB
  Metadata:         RAID1C4         204.75MiB
  System:           RAID1C4           8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata, raid1c34
Number of devices:  4
Devices:
   ID        SIZE  PATH
    1     2.00GiB  /dev/sda
    2     2.00GiB  /dev/sdb
    3     2.00GiB  /dev/sdc
    4     2.00GiB  /dev/sdd

Signed-off-by: David Sterba <dsterba@suse.com>
---
 chunk-recover.c           |  4 ++++
 cmds-balance.c            |  4 ++++
 cmds-fi-usage.c           |  8 +++++++
 cmds-inspect-dump-super.c |  3 ++-
 ctree.h                   |  8 +++++++
 extent-tree.c             |  4 ++++
 fsfeatures.c              |  6 +++++
 ioctl.h                   |  4 +++-
 mkfs/main.c               | 11 ++++++++-
 print-tree.c              |  6 +++++
 utils.c                   | 12 +++++++++-
 volumes.c                 | 48 +++++++++++++++++++++++++++++++++++++--
 volumes.h                 |  4 ++++
 13 files changed, 116 insertions(+), 6 deletions(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index f3e7774efc0f..005032961e71 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1579,6 +1579,10 @@ static int calc_num_stripes(u64 type)
 	else if (type & (BTRFS_BLOCK_GROUP_RAID1 |
 			 BTRFS_BLOCK_GROUP_DUP))
 		return 2;
+	else if (type & (BTRFS_BLOCK_GROUP_RAID1C3))
+		return 3;
+	else if (type & (BTRFS_BLOCK_GROUP_RAID1C4))
+		return 4;
 	else
 		return 1;
 }
diff --git a/cmds-balance.c b/cmds-balance.c
index b533cf737584..854d7d4c380a 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -46,6 +46,10 @@ static int parse_one_profile(const char *profile, u64 *flags)
 		*flags |= BTRFS_BLOCK_GROUP_RAID0;
 	} else if (!strcmp(profile, "raid1")) {
 		*flags |= BTRFS_BLOCK_GROUP_RAID1;
+	} else if (!strcmp(profile, "raid1c3")) {
+		*flags |= BTRFS_BLOCK_GROUP_RAID1C3;
+	} else if (!strcmp(profile, "raid1c4")) {
+		*flags |= BTRFS_BLOCK_GROUP_RAID1C4;
 	} else if (!strcmp(profile, "raid10")) {
 		*flags |= BTRFS_BLOCK_GROUP_RAID10;
 	} else if (!strcmp(profile, "raid5")) {
diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 9a23e17633d4..6bae4e723daf 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -373,6 +373,10 @@ static int print_filesystem_usage_overall(int fd, struct chunk_info *chunkinfo,
 			ratio = 1;
 		else if (flags & BTRFS_BLOCK_GROUP_RAID1)
 			ratio = 2;
+		else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+			ratio = 3;
+		else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+			ratio = 4;
 		else if (flags & BTRFS_BLOCK_GROUP_RAID5)
 			ratio = 0;
 		else if (flags & BTRFS_BLOCK_GROUP_RAID6)
@@ -653,6 +657,10 @@ static u64 calc_chunk_size(struct chunk_info *ci)
 		return ci->size / ci->num_stripes;
 	else if (ci->type & BTRFS_BLOCK_GROUP_RAID1)
 		return ci->size ;
+	else if (ci->type & BTRFS_BLOCK_GROUP_RAID1C3)
+		return ci->size;
+	else if (ci->type & BTRFS_BLOCK_GROUP_RAID1C4)
+		return ci->size;
 	else if (ci->type & BTRFS_BLOCK_GROUP_DUP)
 		return ci->size ;
 	else if (ci->type & BTRFS_BLOCK_GROUP_RAID5)
diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index d62f0932556c..bf9539df0dd5 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -229,7 +229,8 @@ static struct readable_flag_entry incompat_flags_array[] = {
 	DEF_INCOMPAT_FLAG_ENTRY(RAID56),
 	DEF_INCOMPAT_FLAG_ENTRY(SKINNY_METADATA),
 	DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES),
-	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID)
+	DEF_INCOMPAT_FLAG_ENTRY(METADATA_UUID),
+	DEF_INCOMPAT_FLAG_ENTRY(RAID1C34),
 };
 static const int incompat_flags_num = sizeof(incompat_flags_array) /
 				      sizeof(struct readable_flag_entry);
diff --git a/ctree.h b/ctree.h
index 9156ca4de6fd..87f586991648 100644
--- a/ctree.h
+++ b/ctree.h
@@ -490,6 +490,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 #define BTRFS_FEATURE_INCOMPAT_METADATA_UUID    (1ULL << 10)
+#define BTRFS_FEATURE_INCOMPAT_RAID1C34		(1ULL << 11)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 
@@ -513,6 +514,7 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
+	 BTRFS_FEATURE_INCOMPAT_RAID1C34 |		\
 	 BTRFS_FEATURE_INCOMPAT_METADATA_UUID)
 
 /*
@@ -962,6 +964,8 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5    	(1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6    	(1ULL << 8)
+#define BTRFS_BLOCK_GROUP_RAID1C3    	(1ULL << 9)
+#define BTRFS_BLOCK_GROUP_RAID1C4    	(1ULL << 10)
 #define BTRFS_BLOCK_GROUP_RESERVED	BTRFS_AVAIL_ALLOC_BIT_SINGLE
 
 enum btrfs_raid_types {
@@ -972,6 +976,8 @@ enum btrfs_raid_types {
 	BTRFS_RAID_SINGLE,
 	BTRFS_RAID_RAID5,
 	BTRFS_RAID_RAID6,
+	BTRFS_RAID_RAID1C3,
+	BTRFS_RAID_RAID1C4,
 	BTRFS_NR_RAID_TYPES
 };
 
@@ -983,6 +989,8 @@ enum btrfs_raid_types {
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
+					 BTRFS_BLOCK_GROUP_RAID1C3 | \
+					 BTRFS_BLOCK_GROUP_RAID1C4 | \
 					 BTRFS_BLOCK_GROUP_DUP |     \
 					 BTRFS_BLOCK_GROUP_RAID10)
 
diff --git a/extent-tree.c b/extent-tree.c
index c6516b2ba445..50a38a775147 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1668,6 +1668,8 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 {
 	u64 extra_flags = flags & (BTRFS_BLOCK_GROUP_RAID0 |
 				   BTRFS_BLOCK_GROUP_RAID1 |
+				   BTRFS_BLOCK_GROUP_RAID1C3 |
+				   BTRFS_BLOCK_GROUP_RAID1C4 |
 				   BTRFS_BLOCK_GROUP_RAID10 |
 				   BTRFS_BLOCK_GROUP_RAID5 |
 				   BTRFS_BLOCK_GROUP_RAID6 |
@@ -3339,6 +3341,8 @@ static u64 get_dev_extent_len(struct map_lookup *map)
 	case 0: /* Single */
 	case BTRFS_BLOCK_GROUP_DUP:
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 		div = 1;
 		break;
 	case BTRFS_BLOCK_GROUP_RAID5:
diff --git a/fsfeatures.c b/fsfeatures.c
index 7f3ef03b8452..e3b3d63b43a5 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -86,6 +86,12 @@ static const struct btrfs_fs_feature {
 		VERSION_TO_STRING2(4,0),
 		NULL, 0,
 		"no explicit hole extents for files" },
+	{ "raid1c34", BTRFS_FEATURE_INCOMPAT_RAID1C34,
+		"raid1c34",
+		VERSION_TO_STRING2(5,3),
+		NULL, 0,
+		NULL, 0,
+		"RAID1 with 3 or 4 copies" },
 	/* Keep this one last */
 	{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
 };
diff --git a/ioctl.h b/ioctl.h
index 66ee599f7a82..d3dfd6375de1 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -775,7 +775,9 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_TGT_REPLACE,
 	BTRFS_ERROR_DEV_MISSING_NOT_FOUND,
 	BTRFS_ERROR_DEV_ONLY_WRITABLE,
-	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
+	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
+	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
+	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
 };
 
 /* An error code to error string mapping for the kernel
diff --git a/mkfs/main.c b/mkfs/main.c
index 6b47a9ee0e73..88801b89e827 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -316,7 +316,7 @@ static void print_usage(int ret)
 	printf("Usage: mkfs.btrfs [options] dev [ dev ... ]\n");
 	printf("Options:\n");
 	printf("  allocation profiles:\n");
-	printf("\t-d|--data PROFILE       data profile, raid0, raid1, raid5, raid6, raid10, dup or single\n");
+	printf("\t-d|--data PROFILE       data profile, raid0, raid1, raid1c3, raid1c4, raid5, raid6, raid10, dup or single\n");
 	printf("\t-m|--metadata PROFILE   metadata profile, values like for data profile\n");
 	printf("\t-M|--mixed              mix metadata and data together\n");
 	printf("  features:\n");
@@ -347,6 +347,10 @@ static u64 parse_profile(const char *s)
 		return BTRFS_BLOCK_GROUP_RAID0;
 	} else if (strcasecmp(s, "raid1") == 0) {
 		return BTRFS_BLOCK_GROUP_RAID1;
+	} else if (strcasecmp(s, "raid1c3") == 0) {
+		return BTRFS_BLOCK_GROUP_RAID1C3;
+	} else if (strcasecmp(s, "raid1c4") == 0) {
+		return BTRFS_BLOCK_GROUP_RAID1C4;
 	} else if (strcasecmp(s, "raid5") == 0) {
 		return BTRFS_BLOCK_GROUP_RAID5;
 	} else if (strcasecmp(s, "raid6") == 0) {
@@ -1022,6 +1026,11 @@ int main(int argc, char **argv)
 		features |= BTRFS_FEATURE_INCOMPAT_RAID56;
 	}
 
+	if ((data_profile | metadata_profile) &
+	    (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)) {
+		features |= BTRFS_FEATURE_INCOMPAT_RAID1C34;
+	}
+
 	if (btrfs_check_nodesize(nodesize, sectorsize,
 				 features))
 		goto error;
diff --git a/print-tree.c b/print-tree.c
index 0d0bb5109207..78931d2d8eb1 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -163,6 +163,12 @@ static void bg_flags_to_str(u64 flags, char *ret)
 	case BTRFS_BLOCK_GROUP_RAID1:
 		strcat(ret, "|RAID1");
 		break;
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+		strcat(ret, "|RAID1C3");
+		break;
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		strcat(ret, "|RAID1C4");
+		break;
 	case BTRFS_BLOCK_GROUP_DUP:
 		strcat(ret, "|DUP");
 		break;
diff --git a/utils.c b/utils.c
index 0b271517551b..6320d1a496cd 100644
--- a/utils.c
+++ b/utils.c
@@ -1900,8 +1900,10 @@ static int group_profile_devs_min(u64 flag)
 	case BTRFS_BLOCK_GROUP_RAID5:
 		return 2;
 	case BTRFS_BLOCK_GROUP_RAID6:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
 		return 3;
 	case BTRFS_BLOCK_GROUP_RAID10:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 		return 4;
 	default:
 		return -1;
@@ -1918,9 +1920,10 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
 	default:
 	case 4:
 		allowed |= BTRFS_BLOCK_GROUP_RAID10;
+		allowed |= BTRFS_BLOCK_GROUP_RAID10 | BTRFS_BLOCK_GROUP_RAID1C4;
 		__attribute__ ((fallthrough));
 	case 3:
-		allowed |= BTRFS_BLOCK_GROUP_RAID6;
+		allowed |= BTRFS_BLOCK_GROUP_RAID6 | BTRFS_BLOCK_GROUP_RAID1C3;
 		__attribute__ ((fallthrough));
 	case 2:
 		allowed |= BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
@@ -1975,7 +1978,10 @@ int group_profile_max_safe_loss(u64 flags)
 	case BTRFS_BLOCK_GROUP_RAID10:
 		return 1;
 	case BTRFS_BLOCK_GROUP_RAID6:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
 		return 2;
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		return 3;
 	default:
 		return -1;
 	}
@@ -2199,6 +2205,10 @@ const char* btrfs_group_profile_str(u64 flag)
 		return "RAID0";
 	case BTRFS_BLOCK_GROUP_RAID1:
 		return "RAID1";
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+		return "RAID1C3";
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		return "RAID1C4";
 	case BTRFS_BLOCK_GROUP_RAID5:
 		return "RAID5";
 	case BTRFS_BLOCK_GROUP_RAID6:
diff --git a/volumes.c b/volumes.c
index 3a91b43b378b..66a88c7b7c17 100644
--- a/volumes.c
+++ b/volumes.c
@@ -49,6 +49,24 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.devs_increment	= 2,
 		.ncopies	= 2,
 	},
+	[BTRFS_RAID_RAID1C3] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 3,
+		.tolerated_failures = 2,
+		.devs_increment	= 3,
+		.ncopies	= 3,
+	},
+	[BTRFS_RAID_RAID1C4] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 4,
+		.tolerated_failures = 3,
+		.devs_increment	= 4,
+		.ncopies	= 4,
+	},
 	[BTRFS_RAID_DUP] = {
 		.sub_stripes	= 1,
 		.dev_stripes	= 2,
@@ -826,6 +844,8 @@ static u64 chunk_bytes_by_type(u64 type, u64 calc_size, int num_stripes,
 {
 	if (type & (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_DUP))
 		return calc_size;
+	else if (type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4))
+		return calc_size;
 	else if (type & BTRFS_BLOCK_GROUP_RAID10)
 		return calc_size * (num_stripes / sub_stripes);
 	else if (type & BTRFS_BLOCK_GROUP_RAID5)
@@ -1006,6 +1026,20 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			return -ENOSPC;
 		min_stripes = 2;
 	}
+	if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
+		num_stripes = min_t(u64, 3,
+				  btrfs_super_num_devices(info->super_copy));
+		if (num_stripes < 3)
+			return -ENOSPC;
+		min_stripes = 3;
+	}
+	if (type & BTRFS_BLOCK_GROUP_RAID1C4) {
+		num_stripes = min_t(u64, 4,
+				  btrfs_super_num_devices(info->super_copy));
+		if (num_stripes < 4)
+			return -ENOSPC;
+		min_stripes = 4;
+	}
 	if (type & BTRFS_BLOCK_GROUP_DUP) {
 		num_stripes = 2;
 		min_stripes = 2;
@@ -1354,7 +1388,8 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	}
 	map = container_of(ce, struct map_lookup, ce);
 
-	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
+	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4))
 		ret = map->num_stripes;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		ret = map->sub_stripes;
@@ -1550,6 +1585,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 	if (rw == WRITE) {
 		if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+				 BTRFS_BLOCK_GROUP_RAID1C3 |
+				 BTRFS_BLOCK_GROUP_RAID1C4 |
 				 BTRFS_BLOCK_GROUP_DUP)) {
 			stripes_required = map->num_stripes;
 		} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
@@ -1592,6 +1629,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	stripe_offset = offset - stripe_offset;
 
 	if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4 |
 			 BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
 			 BTRFS_BLOCK_GROUP_RAID10 |
 			 BTRFS_BLOCK_GROUP_DUP)) {
@@ -1607,7 +1645,9 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 	multi->num_stripes = 1;
 	stripe_index = 0;
-	if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+	if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3 |
+			 BTRFS_BLOCK_GROUP_RAID1C4)) {
 		if (rw == WRITE)
 			multi->num_stripes = map->num_stripes;
 		else if (mirror_num)
@@ -1877,6 +1917,8 @@ int btrfs_check_chunk_valid(struct btrfs_fs_info *fs_info,
 	if ((type & BTRFS_BLOCK_GROUP_RAID10 && (sub_stripes != 2 ||
 		  !IS_ALIGNED(num_stripes, sub_stripes))) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
+	    (type & BTRFS_BLOCK_GROUP_RAID1C3 && num_stripes < 3) ||
+	    (type & BTRFS_BLOCK_GROUP_RAID1C4 && num_stripes < 4) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
 	    (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
@@ -2436,6 +2478,8 @@ u64 btrfs_stripe_length(struct btrfs_fs_info *fs_info,
 	switch (profile) {
 	case 0: /* Single profile */
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 	case BTRFS_BLOCK_GROUP_DUP:
 		stripe_len = chunk_len;
 		break;
diff --git a/volumes.h b/volumes.h
index dbe9d3dea647..6fa39cf31b9d 100644
--- a/volumes.h
+++ b/volumes.h
@@ -130,6 +130,10 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 		return BTRFS_RAID_RAID10;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
 		return BTRFS_RAID_RAID1;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+		return BTRFS_RAID_RAID1C3;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+		return BTRFS_RAID_RAID1C4;
 	else if (flags & BTRFS_BLOCK_GROUP_DUP)
 		return BTRFS_RAID_DUP;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
-- 
2.21.0



* Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
                   ` (6 preceding siblings ...)
  2019-06-10 12:29 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
@ 2019-06-10 12:42 ` Hugo Mills
  2019-06-10 14:02   ` David Sterba
  2019-06-25 17:47 ` David Sterba
  8 siblings, 1 reply; 14+ messages in thread
From: Hugo Mills @ 2019-06-10 12:42 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs


   Hi, David,

On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> this patchset brings the RAID1 with 3 and 4 copies as a separate
> feature as outlined in V1
> (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
[...]
> Compatibility
> ~~~~~~~~~~~~~
> 
> The new block group types cost an incompatibility bit, so old kernel
> will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> the filesystem with the new type.
> 
> To upgrade existing filesystems use the balance filters eg. from RAID6
> 
>   $ btrfs balance start -mconvert=raid1c3 /path
[...]

   If I do:

$ btrfs balance start -mprofiles=raid13c,convert=raid1 \
                      -dprofiles=raid13c,convert=raid6 /path

will that clear the incompatibility bit?

(I'm not sure if profiles= and convert= work together, but let's
assume that they do for the purposes of this question).

   Hugo.

-- 
Hugo Mills             | The enemy have elected for Death by Powerpoint.
hugo@... carfax.org.uk | That's what they shall get.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                                   gdb



* Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
  2019-06-10 12:42 ` [PATCH v2 0/6] RAID1 with 3- and 4- copies Hugo Mills
@ 2019-06-10 14:02   ` David Sterba
  2019-06-10 14:48     ` Hugo Mills
  0 siblings, 1 reply; 14+ messages in thread
From: David Sterba @ 2019-06-10 14:02 UTC (permalink / raw)
  To: Hugo Mills, David Sterba, linux-btrfs

On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
>    Hi, David,
> 
> On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> > this patchset brings the RAID1 with 3 and 4 copies as a separate
> > feature as outlined in V1
> > (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
> [...]
> > Compatibility
> > ~~~~~~~~~~~~~
> > 
> > The new block group types cost an incompatibility bit, so old kernel
> > will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> > the filesystem with the new type.
> > 
> > To upgrade existing filesystems use the balance filters eg. from RAID6
> > 
> >   $ btrfs balance start -mconvert=raid1c3 /path
> [...]
> 
>    If I do:
> 
> $ btrfs balance start -mprofiles=raid13c,convert=raid1 \
>                       -dprofiles=raid13c,convert=raid6 /path
> 
> will that clear the incompatibility bit?

No, the bit will stay, even though there are no chunks of the raid1c3
type. Same for raid5/6.

Dropping the bit would need an extra pass through all chunks after
balance, which is feasible, and I don't see usability surprises. That
you ask means that the current behaviour is probably the opposite of
what users expect.
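
(Whether any chunks of a given profile remain after the conversion can
be checked e.g. with 'btrfs filesystem df /path' or 'btrfs filesystem
usage /path', which list the allocated block groups per profile.)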

> (I'm not sure if profiles= and convert= work together, but let's
> assume that they do for the purposes of this question).

Yes they work together.


* Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
  2019-06-10 14:02   ` David Sterba
@ 2019-06-10 14:48     ` Hugo Mills
  2019-06-11  9:53       ` David Sterba
  0 siblings, 1 reply; 14+ messages in thread
From: Hugo Mills @ 2019-06-10 14:48 UTC (permalink / raw)
  To: dsterba, David Sterba, linux-btrfs


On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> >    Hi, David,
> > 
> > On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> > > this patchset brings the RAID1 with 3 and 4 copies as a separate
> > > feature as outlined in V1
> > > (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
> > [...]
> > > Compatibility
> > > ~~~~~~~~~~~~~
> > > 
> > > The new block group types cost an incompatibility bit, so old kernel
> > > will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> > > the filesystem with the new type.
> > > 
> > > To upgrade existing filesystems use the balance filters eg. from RAID6
> > > 
> > >   $ btrfs balance start -mconvert=raid1c3 /path
> > [...]
> > 
> >    If I do:
> > 
> > $ btrfs balance start -mprofiles=raid13c,convert=raid1 \
> >                       -dprofiles=raid13c,convert=raid6 /path
> > 
> > will that clear the incompatibility bit?
> 
> No the bit will stay, even though there are no chunks of the raid1c3
> type. Same for raid5/6.
> 
> Dropping the bit would need an extra pass trough all chunks after
> balance, which is feasible and I don't see usability surprises. That you
> ask means that the current behaviour is probably opposite to what users
> expect.

   We've had a couple of cases in the past where people have tried out
a new feature on a new kernel, then turned it off again and not been
able to go back to an earlier kernel. Particularly in this case, I can
see people being surprised at the trapdoor. "I don't have any RAID13C
on this filesystem: why can't I go back to 5.2?"

   Hugo.

-- 
Hugo Mills             | Great films about cricket: Forrest Stump
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |



* Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
  2019-06-10 14:48     ` Hugo Mills
@ 2019-06-11  9:53       ` David Sterba
  2019-06-11 12:03         ` David Sterba
  0 siblings, 1 reply; 14+ messages in thread
From: David Sterba @ 2019-06-11  9:53 UTC (permalink / raw)
  To: Hugo Mills, dsterba, David Sterba, linux-btrfs

On Mon, Jun 10, 2019 at 02:48:06PM +0000, Hugo Mills wrote:
> On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> > On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> > >    Hi, David,
> > > 
> > > On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> > > > this patchset brings the RAID1 with 3 and 4 copies as a separate
> > > > feature as outlined in V1
> > > > (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
> > > [...]
> > > > Compatibility
> > > > ~~~~~~~~~~~~~
> > > > 
> > > > The new block group types cost an incompatibility bit, so old kernel
> > > > will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> > > > the filesystem with the new type.
> > > > 
> > > > To upgrade existing filesystems use the balance filters eg. from RAID6
> > > > 
> > > >   $ btrfs balance start -mconvert=raid1c3 /path
> > > [...]
> > > 
> > >    If I do:
> > > 
> > > $ btrfs balance start -mprofiles=raid13c,convert=raid1 \
> > >                       -dprofiles=raid13c,convert=raid6 /path
> > > 
> > > will that clear the incompatibility bit?
> > 
> > No the bit will stay, even though there are no chunks of the raid1c3
> > type. Same for raid5/6.
> > 
> > Dropping the bit would need an extra pass trough all chunks after
> > balance, which is feasible and I don't see usability surprises. That you
> > ask means that the current behaviour is probably opposite to what users
> > expect.
> 
>    We've had a couple of cases in the past where people have tried out
> a new feature on a new kernel, then turned it off again and not been
> able to go back to an earlier kernel. Particularly in this case, I can
> see people being surprised at the trapdoor. "I don't have any RAID13C
> on this filesystem: why can't I go back to 5.2?"

Undoing the incompat bit is expensive in some cases, e.g. for ZSTD this
would mean scanning all extents, but in the case of the raid profiles
it's easy to check the list of space infos, which are per-profile.

So, my current idea is to use the sysfs interface. The /features
directory lists the files representing the features, and writing 1 to a
file followed by a sync would trigger the rescan and eventually drop
the bit.

The meaning of /sys/fs/btrfs/features/* is currently defined only for
the value 1, which means 'can be set at runtime', so the ability to
unset the feature would be e.g. 3, a bitmask of the possible actions
(0b01 set, 0b10 unset).

We do have infrastructure for changing the state in a safe manner even
from sysfs, which sets a bit somewhere and lets the commit process it.
That's why the sync is required, but I don't think that harms usability
too much.
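
A hypothetical session with such an interface (not implemented, just to
illustrate the idea; the per-filesystem path and the exact semantics of
the written value are assumptions):

  $ cat /sys/fs/btrfs/features/raid1c34
  3           # 0b01 = can be set, 0b10 = can be unset at runtime
  $ echo 1 > /sys/fs/btrfs/<UUID>/features/raid1c34
  $ sync      # the rescan runs at commit time and the incompat bit is
              # dropped once no raid1c3/raid1c4 chunks remain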


* Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
  2019-06-11  9:53       ` David Sterba
@ 2019-06-11 12:03         ` David Sterba
  0 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-11 12:03 UTC (permalink / raw)
  To: Hugo Mills, dsterba, David Sterba, linux-btrfs

On Tue, Jun 11, 2019 at 11:53:15AM +0200, David Sterba wrote:
> On Mon, Jun 10, 2019 at 02:48:06PM +0000, Hugo Mills wrote:
> > On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> > > On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> > > >    Hi, David,
> > > > 
> > > > On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> > > > > this patchset brings the RAID1 with 3 and 4 copies as a separate
> > > > > feature as outlined in V1
> > > > > (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
> > > > [...]
> > > > > Compatibility
> > > > > ~~~~~~~~~~~~~
> > > > > 
> > > > > The new block group types cost an incompatibility bit, so old kernel
> > > > > will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> > > > > the filesystem with the new type.
> > > > > 
> > > > > To upgrade existing filesystems use the balance filters eg. from RAID6
> > > > > 
> > > > >   $ btrfs balance start -mconvert=raid1c3 /path
> > > > [...]
> > > > 
> > > >    If I do:
> > > > 
> > > > $ btrfs balance start -mprofiles=raid13c,convert=raid1 \
> > > >                       -dprofiles=raid13c,convert=raid6 /path
> > > > 
> > > > will that clear the incompatibility bit?
> > > 
> > > No the bit will stay, even though there are no chunks of the raid1c3
> > > type. Same for raid5/6.
> > > 
> > > Dropping the bit would need an extra pass trough all chunks after
> > > balance, which is feasible and I don't see usability surprises. That you
> > > ask means that the current behaviour is probably opposite to what users
> > > expect.
> > 
> >    We've had a couple of cases in the past where people have tried out
> > a new feature on a new kernel, then turned it off again and not been
> > able to go back to an earlier kernel. Particularly in this case, I can
> > see people being surprised at the trapdoor. "I don't have any RAID13C
> > on this filesystem: why can't I go back to 5.2?"
> 
> Undoing the incompat bit is expensive in some cases, eg. for ZSTD this
> would mean to scan all extents, but in case of the raid profiles it's
> easy to check the list of space infos that are per-profile.
> 
> So, my current idea is to use the sysfs interface. The /features
> directory lists the files representing features and writing 1 to the
> file followed by a sync would trigger the rescan and drop the bit
> eventually.
> 
> The meaning of the /sys/fs/btrfs/features/* is defined for 1, which
> means 'can be set at runtime', so the ability to unset the feature would
> be eg. 3, as a bitmask of possible actions (0b01 set, 0b10 unset).
> 
> We do have infrastructure for changing the state in a safe manner even
> from sysfs, which sets a bit somewhere and commit processes that. That's
> why the sync is required, but I don't think that's harming usability too much.

Scratch that, there's a much simpler way that would work as expected in
the example, i.e. after removing the last block group of the given type
the incompat bit will be dropped.


* Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies
  2019-06-10 12:29 [PATCH v2 0/6] RAID1 with 3- and 4- copies David Sterba
                   ` (7 preceding siblings ...)
  2019-06-10 12:42 ` [PATCH v2 0/6] RAID1 with 3- and 4- copies Hugo Mills
@ 2019-06-25 17:47 ` David Sterba
  8 siblings, 0 replies; 14+ messages in thread
From: David Sterba @ 2019-06-25 17:47 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> Hi,
> 
> this patchset brings the RAID1 with 3 and 4 copies as a separate
> feature as outlined in V1
> (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dsterba@suse.com/).
> 
> This should help a bit in the raid56 situation, where the write hole
> hurts most for metadata, without a block group profile that offers 2
> device loss resistance.
> 
> I've gathered some feedback from knowlegeable poeople on IRC and the
> following setup is considered good enough (certainly better than what we
> have now):
> 
> - data: RAID6
> - metadata: RAID1C3
> 
> The RAID1C3 vs RAID6 have different characteristics in terms of space
> consumption and repair.
> 
> 
> Space consumption
> ~~~~~~~~~~~~~~~~~
> 
> * RAID6 reduces overall metadata by N/(N-2), so with more devices the
>   parity overhead ratio is small
> 
> * RAID1C3 will allways consume 67% of metadata chunks for redundancy
> 
> The overall size of metadata is typically in range of gigabytes to
> hundreds of gigabytes (depends on usecase), rough estimate is from
> 1%-10%. With larger filesystem the percentage is usually smaller.
> 
> So, for the 3-copy raid1 the cost of redundancy is better expressed in
> the absolute value of gigabytes "wasted" on redundancy than as the
> ratio that does look scary compared to raid6.
> 
> 
> Repair
> ~~~~~~
> 
> RAID6 needs to access all available devices to calculate the P and Q,
> either 1 or 2 missing devices.
> 
> RAID1C3 can utilize the independence of each copy and also the way the
> RAID1 works in btrfs. In the scenario with 1 missing device, one of the
> 2 correct copies is read and written to the repaired devices.
> 
> Given how the 2-copy RAID1 works on btrfs, the block groups could be
> spread over several devices so the load during repair would be spread as
> well.
> 
> Additionally, device replace works sequentially and in big chunks so on
> a lightly used system the read pattern is seek-friendly.
> 
> 
> Compatibility
> ~~~~~~~~~~~~~
> 
> The new block group types cost an incompatibility bit, so old kernel
> will refuse to mount filesystem with RAID1C3 feature, ie. any chunk on
> the filesystem with the new type.
> 
> To upgrade existing filesystems use the balance filters eg. from RAID6
> 
>   $ btrfs balance start -mconvert=raid1c3 /path
> 
> 
> Merge target
> ~~~~~~~~~~~~
> 
> I'd like to push that to misc-next for wider testing and merge to 5.3,
> unless something bad pops up. Given that the code changes are small and
> just a new types with the constraints, the rest is done by the generic
> code, I'm not expecting problems that can't be fixed before full
> release.
> 
> 
> Testing so far
> ~~~~~~~~~~~~~~
> 
> * mkfs with the profiles
> * fstests (no specific tests, only check that it does not break)
> * profile conversions between single/raid1/raid5/raid1c3/raid6/raid1c4/raid1c4
>   with added devices where needed
> * scrub
> 
> TODO:
> 
> * 1 missing device followed by repair
> * 2 missing devices followed by repair

Unfortunately, neither of the two cases works as expected and I don't have
time to fix it before the 5.3 deadline. As the 3-copy profile is supposed to
be a replacement for raid6, I consider the lack of repair capability a show
stopper, so the main part of the patchset is postponed.

The test I did was something like this (a rough command sketch follows the
list):

- create fs with 3 devices, raid1c3
- fill with some data
- unmount
- wipe 2nd device
- mount degraded
- replace missing
- remount read-write, write data to verify that it works
- unmount
- mount as usual   <-- here it fails and the device is still reported missing

The same happens for the 2 missing devices.
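
A rough sketch of the steps above as commands; the device names, mount
point and test data are placeholders:

  $ mkfs.btrfs -f -m raid1c3 -d raid1c3 /dev/sda /dev/sdb /dev/sdc
  $ mount /dev/sda /mnt
  $ cp -a /usr/share /mnt; umount /mnt
  $ wipefs -a /dev/sdb
  $ mount -o degraded /dev/sda /mnt
  $ btrfs replace start 2 /dev/sdd /mnt
  $ mount -o remount,rw /mnt; touch /mnt/test; umount /mnt
  $ mount /dev/sda /mnt      # fails, the device is still reported missing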


