* [PATCH 0/4] 3- and 4- copy RAID1
@ 2018-07-13 18:46 David Sterba
  2018-07-13 18:46 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
                   ` (7 more replies)
  0 siblings, 8 replies; 36+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Hi,

I have some goodies that go toward solving the RAID56 problem. Although
this does not implement all the remaining features, it is useful
independently.

This time my hackweek project

https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56

aimed to implement a fix for the write hole problem, but I spent more
time on analysis and design of the solution and don't have a working
prototype for that yet.

This patchset brings a feature that will be used by the raid56 log: the
log has to provide the same redundancy level as the data, so raid6
(which tolerates two missing devices) needs the 3-copy replication for
its log. As it was easy to extend to higher replication, I've also added
4-copy replication, which would match a triple-parity raid (a scheme
that does not have a standardized name).

The number of copies is fixed, so it's not N-copy for an arbitrary N.
This would complicate the implementation too much, though I'd be willing
to add a 5-copy replication for a small bribe.

The new raid profiles are covered by an incompatibility bit, called
extended_raid; the (idealistic) plan is to put as many new raid-related
features as possible under it. Patch 4/4 mentions the 3- and 4-copy
raid1, configurable stripe length, the write hole log and triple parity.
If the plan turns out to be too ambitious, the features that are ready
and implemented will be split out and merged.

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs&m=136286324417767

Switching to this naming would be a good addition to the extended raid.

Regarding the missing raid56 features, I'll continue working on them as
time permits in the following weeks/months, as I'm not aware of anybody
else actively working on them.

Anyway, git branches with the patches:

kernel: git://github.com/kdave/btrfs-devel dev/extended-raid-ncopies
progs:  git://github.com/kdave/btrfs-progs dev/extended-raid-ncopies

David Sterba (4):
  btrfs: refactor block group replication factor calculation to a helper
  btrfs: add support for 3-copy replication (raid1c3)
  btrfs: add support for 4-copy replication (raid1c4)
  btrfs: add incompatibility bit for extended raid features

 fs/btrfs/ctree.h                |  1 +
 fs/btrfs/extent-tree.c          | 45 +++++++-----------
 fs/btrfs/relocation.c           |  1 +
 fs/btrfs/scrub.c                |  4 +-
 fs/btrfs/super.c                | 17 +++----
 fs/btrfs/sysfs.c                |  2 +
 fs/btrfs/volumes.c              | 84 ++++++++++++++++++++++++++++++---
 fs/btrfs/volumes.h              |  6 +++
 include/uapi/linux/btrfs.h      | 12 ++++-
 include/uapi/linux/btrfs_tree.h |  6 +++
 10 files changed, 134 insertions(+), 44 deletions(-)

-- 
2.18.0



* [PATCH] btrfs-progs: add support for raid1c3 and raid1c4
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
@ 2018-07-13 18:46 ` David Sterba
  2018-07-13 18:46 ` [PATCH 1/4] btrfs: refactor block group replication factor calculation to a helper David Sterba
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

$ ./mkfs.btrfs -m raid1c4 -d raid1c3 /dev/sd[abcd]

Label:              (null)
UUID:               f1f988ab-6750-4bc2-957b-98a4ebe98631
Node size:          16384
Sector size:        4096
Filesystem size:    8.00GiB
Block group profiles:
  Data:             RAID1C3         273.06MiB
  Metadata:         RAID1C4         204.75MiB
  System:           RAID1C4           8.00MiB
SSD detected:       no
Incompat features:  extref, skinny-metadata, extraid
Number of devices:  4
Devices:
   ID        SIZE  PATH
    1     2.00GiB  /dev/sda
    2     2.00GiB  /dev/sdb
    3     2.00GiB  /dev/sdc
    4     2.00GiB  /dev/sdd

Signed-off-by: David Sterba <dsterba@suse.com>
---
 chunk-recover.c           |  4 ++++
 cmds-balance.c            |  4 ++++
 cmds-fi-usage.c           |  8 +++++++
 cmds-inspect-dump-super.c |  3 ++-
 ctree.h                   |  8 +++++++
 extent-tree.c             |  4 ++++
 fsfeatures.c              |  6 +++++
 ioctl.h                   |  3 ++-
 mkfs/main.c               | 11 ++++++++-
 print-tree.c              |  6 +++++
 utils.c                   | 13 +++++++++--
 volumes.c                 | 48 +++++++++++++++++++++++++++++++++++++--
 volumes.h                 |  4 ++++
 13 files changed, 115 insertions(+), 7 deletions(-)

diff --git a/chunk-recover.c b/chunk-recover.c
index 1d30db51d8ed..661a3bcb4f92 100644
--- a/chunk-recover.c
+++ b/chunk-recover.c
@@ -1569,6 +1569,10 @@ static int calc_num_stripes(u64 type)
 	else if (type & (BTRFS_BLOCK_GROUP_RAID1 |
 			 BTRFS_BLOCK_GROUP_DUP))
 		return 2;
+	else if (type & (BTRFS_BLOCK_GROUP_RAID1C3))
+		return 3;
+	else if (type & (BTRFS_BLOCK_GROUP_RAID1C4))
+		return 4;
 	else
 		return 1;
 }
diff --git a/cmds-balance.c b/cmds-balance.c
index 6cc26c358f95..dab8cec5d105 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -46,6 +46,10 @@ static int parse_one_profile(const char *profile, u64 *flags)
 		*flags |= BTRFS_BLOCK_GROUP_RAID0;
 	} else if (!strcmp(profile, "raid1")) {
 		*flags |= BTRFS_BLOCK_GROUP_RAID1;
+	} else if (!strcmp(profile, "raid1c3")) {
+		*flags |= BTRFS_BLOCK_GROUP_RAID1C3;
+	} else if (!strcmp(profile, "raid1c4")) {
+		*flags |= BTRFS_BLOCK_GROUP_RAID1C4;
 	} else if (!strcmp(profile, "raid10")) {
 		*flags |= BTRFS_BLOCK_GROUP_RAID10;
 	} else if (!strcmp(profile, "raid5")) {
diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index dca2e8d0365f..4e4a415f0d7c 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -373,6 +373,10 @@ static int print_filesystem_usage_overall(int fd, struct chunk_info *chunkinfo,
 			ratio = 1;
 		else if (flags & BTRFS_BLOCK_GROUP_RAID1)
 			ratio = 2;
+		else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+			ratio = 3;
+		else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+			ratio = 4;
 		else if (flags & BTRFS_BLOCK_GROUP_RAID5)
 			ratio = 0;
 		else if (flags & BTRFS_BLOCK_GROUP_RAID6)
@@ -653,6 +657,10 @@ static u64 calc_chunk_size(struct chunk_info *ci)
 		return ci->size / ci->num_stripes;
 	else if (ci->type & BTRFS_BLOCK_GROUP_RAID1)
 		return ci->size ;
+	else if (ci->type & BTRFS_BLOCK_GROUP_RAID1C3)
+		return ci->size;
+	else if (ci->type & BTRFS_BLOCK_GROUP_RAID1C4)
+		return ci->size;
 	else if (ci->type & BTRFS_BLOCK_GROUP_DUP)
 		return ci->size ;
 	else if (ci->type & BTRFS_BLOCK_GROUP_RAID5)
diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index e965267c5d96..6984386dbec4 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -228,7 +228,8 @@ static struct readable_flag_entry incompat_flags_array[] = {
 	DEF_INCOMPAT_FLAG_ENTRY(EXTENDED_IREF),
 	DEF_INCOMPAT_FLAG_ENTRY(RAID56),
 	DEF_INCOMPAT_FLAG_ENTRY(SKINNY_METADATA),
-	DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES)
+	DEF_INCOMPAT_FLAG_ENTRY(NO_HOLES),
+	DEF_INCOMPAT_FLAG_ENTRY(EXTENDED_RAID),
 };
 static const int incompat_flags_num = sizeof(incompat_flags_array) /
 				      sizeof(struct readable_flag_entry);
diff --git a/ctree.h b/ctree.h
index 04a77550c715..f49d11e3d178 100644
--- a/ctree.h
+++ b/ctree.h
@@ -489,6 +489,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_EXTENDED_RAID	(1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 
@@ -509,6 +510,7 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_RAID56 |		\
 	 BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
+	 BTRFS_FEATURE_INCOMPAT_EXTENDED_RAID |		\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
 
 /*
@@ -958,6 +960,8 @@ struct btrfs_csum_item {
 #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5    	(1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6    	(1ULL << 8)
+#define BTRFS_BLOCK_GROUP_RAID1C3    	(1ULL << 9)
+#define BTRFS_BLOCK_GROUP_RAID1C4    	(1ULL << 10)
 #define BTRFS_BLOCK_GROUP_RESERVED	BTRFS_AVAIL_ALLOC_BIT_SINGLE
 
 enum btrfs_raid_types {
@@ -968,6 +972,8 @@ enum btrfs_raid_types {
 	BTRFS_RAID_SINGLE,
 	BTRFS_RAID_RAID5,
 	BTRFS_RAID_RAID6,
+	BTRFS_RAID_RAID1C3,
+	BTRFS_RAID_RAID1C4,
 	BTRFS_NR_RAID_TYPES
 };
 
@@ -979,6 +985,8 @@ enum btrfs_raid_types {
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
+					 BTRFS_BLOCK_GROUP_RAID1C3 | \
+					 BTRFS_BLOCK_GROUP_RAID1C4 | \
 					 BTRFS_BLOCK_GROUP_DUP |     \
 					 BTRFS_BLOCK_GROUP_RAID10)
 
diff --git a/extent-tree.c b/extent-tree.c
index 0643815bd41c..836cd4e9c088 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1848,6 +1848,8 @@ static void set_avail_alloc_bits(struct btrfs_fs_info *fs_info, u64 flags)
 {
 	u64 extra_flags = flags & (BTRFS_BLOCK_GROUP_RAID0 |
 				   BTRFS_BLOCK_GROUP_RAID1 |
+				   BTRFS_BLOCK_GROUP_RAID1C3 |
+				   BTRFS_BLOCK_GROUP_RAID1C4 |
 				   BTRFS_BLOCK_GROUP_RAID10 |
 				   BTRFS_BLOCK_GROUP_RAID5 |
 				   BTRFS_BLOCK_GROUP_RAID6 |
@@ -3629,6 +3631,8 @@ static u64 get_dev_extent_len(struct map_lookup *map)
 	case 0: /* Single */
 	case BTRFS_BLOCK_GROUP_DUP:
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 		div = 1;
 		break;
 	case BTRFS_BLOCK_GROUP_RAID5:
diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f1277..50547bad8db2 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -86,6 +86,12 @@ static const struct btrfs_fs_feature {
 		VERSION_TO_STRING2(4,0),
 		NULL, 0,
 		"no explicit hole extents for files" },
+	{ "extraid", BTRFS_FEATURE_INCOMPAT_EXTENDED_RAID,
+		"extended_raid",
+		VERSION_TO_STRING2(4,17),
+		NULL, 0,
+		NULL, 0,
+		"extended raid features: raid1c3, raid1c4" },
 	/* Keep this one last */
 	{ "list-all", BTRFS_FEATURE_LIST_ALL, NULL }
 };
diff --git a/ioctl.h b/ioctl.h
index 709e996f401c..ae8f60515533 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -682,7 +682,8 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_TGT_REPLACE,
 	BTRFS_ERROR_DEV_MISSING_NOT_FOUND,
 	BTRFS_ERROR_DEV_ONLY_WRITABLE,
-	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
+	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
+	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
 };
 
 /* An error code to error string mapping for the kernel
diff --git a/mkfs/main.c b/mkfs/main.c
index b76462a735cf..099b38bbc80c 100644
--- a/mkfs/main.c
+++ b/mkfs/main.c
@@ -346,7 +346,7 @@ static void print_usage(int ret)
 	printf("Usage: mkfs.btrfs [options] dev [ dev ... ]\n");
 	printf("Options:\n");
 	printf("  allocation profiles:\n");
-	printf("\t-d|--data PROFILE       data profile, raid0, raid1, raid5, raid6, raid10, dup or single\n");
+	printf("\t-d|--data PROFILE       data profile, raid0, raid1, raid1c3, raid1c4, raid5, raid6, raid10, dup or single\n");
 	printf("\t-m|--metadata PROFILE   metadata profile, values like for data profile\n");
 	printf("\t-M|--mixed              mix metadata and data together\n");
 	printf("  features:\n");
@@ -377,6 +377,10 @@ static u64 parse_profile(const char *s)
 		return BTRFS_BLOCK_GROUP_RAID0;
 	} else if (strcasecmp(s, "raid1") == 0) {
 		return BTRFS_BLOCK_GROUP_RAID1;
+	} else if (strcasecmp(s, "raid1c3") == 0) {
+		return BTRFS_BLOCK_GROUP_RAID1C3;
+	} else if (strcasecmp(s, "raid1c4") == 0) {
+		return BTRFS_BLOCK_GROUP_RAID1C4;
 	} else if (strcasecmp(s, "raid5") == 0) {
 		return BTRFS_BLOCK_GROUP_RAID5;
 	} else if (strcasecmp(s, "raid6") == 0) {
@@ -958,6 +962,11 @@ int main(int argc, char **argv)
 		features |= BTRFS_FEATURE_INCOMPAT_RAID56;
 	}
 
+	if ((data_profile | metadata_profile) &
+	    (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)) {
+		features |= BTRFS_FEATURE_INCOMPAT_EXTENDED_RAID;
+	}
+
 	if (btrfs_check_nodesize(nodesize, sectorsize,
 				 features))
 		goto error;
diff --git a/print-tree.c b/print-tree.c
index a09ecfbb28f0..f816a851ea65 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -163,6 +163,12 @@ static void bg_flags_to_str(u64 flags, char *ret)
 	case BTRFS_BLOCK_GROUP_RAID1:
 		strcat(ret, "|RAID1");
 		break;
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+		strcat(ret, "|RAID1C3");
+		break;
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		strcat(ret, "|RAID1C4");
+		break;
 	case BTRFS_BLOCK_GROUP_DUP:
 		strcat(ret, "|DUP");
 		break;
diff --git a/utils.c b/utils.c
index d4395b1f32f8..4e942cff40d0 100644
--- a/utils.c
+++ b/utils.c
@@ -1884,8 +1884,10 @@ static int group_profile_devs_min(u64 flag)
 	case BTRFS_BLOCK_GROUP_RAID5:
 		return 2;
 	case BTRFS_BLOCK_GROUP_RAID6:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
 		return 3;
 	case BTRFS_BLOCK_GROUP_RAID10:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 		return 4;
 	default:
 		return -1;
@@ -1901,9 +1903,9 @@ int test_num_disk_vs_raid(u64 metadata_profile, u64 data_profile,
 	switch (dev_cnt) {
 	default:
 	case 4:
-		allowed |= BTRFS_BLOCK_GROUP_RAID10;
+		allowed |= BTRFS_BLOCK_GROUP_RAID10 | BTRFS_BLOCK_GROUP_RAID1C4;
 	case 3:
-		allowed |= BTRFS_BLOCK_GROUP_RAID6;
+		allowed |= BTRFS_BLOCK_GROUP_RAID6 | BTRFS_BLOCK_GROUP_RAID1C3;
 	case 2:
 		allowed |= BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
 			BTRFS_BLOCK_GROUP_RAID5;
@@ -1955,7 +1957,10 @@ int group_profile_max_safe_loss(u64 flags)
 	case BTRFS_BLOCK_GROUP_RAID10:
 		return 1;
 	case BTRFS_BLOCK_GROUP_RAID6:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
 		return 2;
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		return 3;
 	default:
 		return -1;
 	}
@@ -2170,6 +2175,10 @@ const char* btrfs_group_profile_str(u64 flag)
 		return "RAID0";
 	case BTRFS_BLOCK_GROUP_RAID1:
 		return "RAID1";
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+		return "RAID1C3";
+	case BTRFS_BLOCK_GROUP_RAID1C4:
+		return "RAID1C4";
 	case BTRFS_BLOCK_GROUP_RAID5:
 		return "RAID5";
 	case BTRFS_BLOCK_GROUP_RAID6:
diff --git a/volumes.c b/volumes.c
index 24eb3e8b2578..ae571da95094 100644
--- a/volumes.c
+++ b/volumes.c
@@ -94,6 +94,24 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.devs_increment	= 1,
 		.ncopies	= 3,
 	},
+	[BTRFS_RAID_RAID1C3] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 3,
+		.tolerated_failures = 2,
+		.devs_increment	= 3,
+		.ncopies	= 3,
+	},
+	[BTRFS_RAID_RAID1C4] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 4,
+		.tolerated_failures = 3,
+		.devs_increment	= 4,
+		.ncopies	= 4,
+	},
 };
 
 struct stripe {
@@ -795,6 +813,8 @@ static u64 chunk_bytes_by_type(u64 type, u64 calc_size, int num_stripes,
 {
 	if (type & (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_DUP))
 		return calc_size;
+	else if (type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4))
+		return calc_size;
 	else if (type & BTRFS_BLOCK_GROUP_RAID10)
 		return calc_size * (num_stripes / sub_stripes);
 	else if (type & BTRFS_BLOCK_GROUP_RAID5)
@@ -971,6 +991,20 @@ int btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 			return -ENOSPC;
 		min_stripes = 2;
 	}
+	if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
+		num_stripes = min_t(u64, 3,
+				  btrfs_super_num_devices(info->super_copy));
+		if (num_stripes < 3)
+			return -ENOSPC;
+		min_stripes = 3;
+	}
+	if (type & BTRFS_BLOCK_GROUP_RAID1C4) {
+		num_stripes = min_t(u64, 4,
+				  btrfs_super_num_devices(info->super_copy));
+		if (num_stripes < 4)
+			return -ENOSPC;
+		min_stripes = 4;
+	}
 	if (type & BTRFS_BLOCK_GROUP_DUP) {
 		num_stripes = 2;
 		min_stripes = 2;
@@ -1315,7 +1349,8 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 	}
 	map = container_of(ce, struct map_lookup, ce);
 
-	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
+	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4))
 		ret = map->num_stripes;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		ret = map->sub_stripes;
@@ -1511,6 +1546,8 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 	if (rw == WRITE) {
 		if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+				 BTRFS_BLOCK_GROUP_RAID1C3 |
+				 BTRFS_BLOCK_GROUP_RAID1C4 |
 				 BTRFS_BLOCK_GROUP_DUP)) {
 			stripes_required = map->num_stripes;
 		} else if (map->type & BTRFS_BLOCK_GROUP_RAID10) {
@@ -1553,6 +1590,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 	stripe_offset = offset - stripe_offset;
 
 	if (map->type & (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4 |
 			 BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
 			 BTRFS_BLOCK_GROUP_RAID10 |
 			 BTRFS_BLOCK_GROUP_DUP)) {
@@ -1568,7 +1606,9 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, int rw,
 
 	multi->num_stripes = 1;
 	stripe_index = 0;
-	if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+	if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3 |
+			 BTRFS_BLOCK_GROUP_RAID1C4)) {
 		if (rw == WRITE)
 			multi->num_stripes = map->num_stripes;
 		else if (mirror_num)
@@ -1838,6 +1878,8 @@ int btrfs_check_chunk_valid(struct btrfs_fs_info *fs_info,
 	if ((type & BTRFS_BLOCK_GROUP_RAID10 && (sub_stripes != 2 ||
 		  !IS_ALIGNED(num_stripes, sub_stripes))) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
+	    (type & BTRFS_BLOCK_GROUP_RAID1C3 && num_stripes < 3) ||
+	    (type & BTRFS_BLOCK_GROUP_RAID1C4 && num_stripes < 4) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
 	    (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
@@ -2391,6 +2433,8 @@ u64 btrfs_stripe_length(struct btrfs_fs_info *fs_info,
 	switch (profile) {
 	case 0: /* Single profile */
 	case BTRFS_BLOCK_GROUP_RAID1:
+	case BTRFS_BLOCK_GROUP_RAID1C3:
+	case BTRFS_BLOCK_GROUP_RAID1C4:
 	case BTRFS_BLOCK_GROUP_DUP:
 		stripe_len = chunk_len;
 		break;
diff --git a/volumes.h b/volumes.h
index b4ea93f0bec3..6f74aee998e3 100644
--- a/volumes.h
+++ b/volumes.h
@@ -126,6 +126,10 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 		return BTRFS_RAID_RAID10;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
 		return BTRFS_RAID_RAID1;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+		return BTRFS_RAID_RAID1C3;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+		return BTRFS_RAID_RAID1C4;
 	else if (flags & BTRFS_BLOCK_GROUP_DUP)
 		return BTRFS_RAID_DUP;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
-- 
2.18.0



* [PATCH 1/4] btrfs: refactor block group replication factor calculation to a helper
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
  2018-07-13 18:46 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
@ 2018-07-13 18:46 ` David Sterba
  2018-07-13 18:46 ` [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3) David Sterba
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

There are many places that open code the duplication factor of the
block group profiles; create a common helper for it. This can be easily
extended for more copies.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c | 36 ++++++++----------------------------
 fs/btrfs/super.c       | 11 +++--------
 fs/btrfs/volumes.c     | 11 +++++++++++
 fs/btrfs/volumes.h     |  2 ++
 4 files changed, 24 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 3d9fe58c0080..4ffa64e288da 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4060,11 +4060,7 @@ static void update_space_info(struct btrfs_fs_info *info, u64 flags,
 	struct btrfs_space_info *found;
 	int factor;
 
-	if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
-		     BTRFS_BLOCK_GROUP_RAID10))
-		factor = 2;
-	else
-		factor = 1;
+	factor = btrfs_bg_type_to_factor(flags);
 
 	found = __find_space_info(info, flags);
 	ASSERT(found);
@@ -4703,6 +4699,7 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 	u64 space_size;
 	u64 avail;
 	u64 used;
+	int factor;
 
 	/* Don't overcommit when in mixed mode. */
 	if (space_info->flags & BTRFS_BLOCK_GROUP_DATA)
@@ -4737,10 +4734,8 @@ static int can_overcommit(struct btrfs_fs_info *fs_info,
 	 * doesn't include the parity drive, so we don't have to
 	 * change the math
 	 */
-	if (profile & (BTRFS_BLOCK_GROUP_DUP |
-		       BTRFS_BLOCK_GROUP_RAID1 |
-		       BTRFS_BLOCK_GROUP_RAID10))
-		avail >>= 1;
+	factor = btrfs_bg_type_to_factor(profile);
+	avail /= factor;
 
 	/*
 	 * If we aren't flushing all things, let us overcommit up to
@@ -6219,12 +6214,8 @@ static int update_block_group(struct btrfs_trans_handle *trans,
 		cache = btrfs_lookup_block_group(info, bytenr);
 		if (!cache)
 			return -ENOENT;
-		if (cache->flags & (BTRFS_BLOCK_GROUP_DUP |
-				    BTRFS_BLOCK_GROUP_RAID1 |
-				    BTRFS_BLOCK_GROUP_RAID10))
-			factor = 2;
-		else
-			factor = 1;
+		factor = btrfs_bg_type_to_factor(cache->flags);
+
 		/*
 		 * If this block group has free space cache written out, we
 		 * need to make sure to load it if we are removing space.  This
@@ -9520,13 +9511,7 @@ u64 btrfs_account_ro_block_groups_free_space(struct btrfs_space_info *sinfo)
 			continue;
 		}
 
-		if (block_group->flags & (BTRFS_BLOCK_GROUP_RAID1 |
-					  BTRFS_BLOCK_GROUP_RAID10 |
-					  BTRFS_BLOCK_GROUP_DUP))
-			factor = 2;
-		else
-			factor = 1;
-
+		factor = btrfs_bg_type_to_factor(block_group->flags);
 		free_bytes += (block_group->key.offset -
 			       btrfs_block_group_used(&block_group->item)) *
 			       factor;
@@ -10343,12 +10328,7 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans,
 
 	memcpy(&key, &block_group->key, sizeof(key));
 	index = btrfs_bg_flags_to_raid_index(block_group->flags);
-	if (block_group->flags & (BTRFS_BLOCK_GROUP_DUP |
-				  BTRFS_BLOCK_GROUP_RAID1 |
-				  BTRFS_BLOCK_GROUP_RAID10))
-		factor = 2;
-	else
-		factor = 1;
+	factor = btrfs_bg_type_to_factor(block_group->flags);
 
 	/* make sure this block group isn't part of an allocation cluster */
 	cluster = &fs_info->data_alloc_cluster;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 81107ad49f3a..4f646b66cc06 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -2098,14 +2098,9 @@ static int btrfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 				btrfs_account_ro_block_groups_free_space(found);
 
 			for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
-				if (!list_empty(&found->block_groups[i])) {
-					switch (i) {
-					case BTRFS_RAID_DUP:
-					case BTRFS_RAID_RAID1:
-					case BTRFS_RAID_RAID10:
-						factor = 2;
-					}
-				}
+				if (!list_empty(&found->block_groups[i]))
+					factor = btrfs_bg_type_to_factor(
+						btrfs_raid_array[i].bg_flag);
 			}
 		}
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e034ad9e23b4..45635f4d78c8 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7380,3 +7380,14 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info)
 		fs_devices = fs_devices->seed;
 	}
 }
+
+/*
+ * Multiplicity factor for simple profiles: DUP, RAID1-like and RAID10.
+ */
+int btrfs_bg_type_to_factor(u64 flags)
+{
+	if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+		     BTRFS_BLOCK_GROUP_RAID10))
+		return 2;
+	return 1;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5139ec8daf4c..c7b9ad9733ea 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -560,4 +560,6 @@ void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
 bool btrfs_check_rw_degradable(struct btrfs_fs_info *fs_info,
 					struct btrfs_device *failing_dev);
 
+int btrfs_bg_type_to_factor(u64 flags);
+
 #endif
-- 
2.18.0



* [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3)
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
  2018-07-13 18:46 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
  2018-07-13 18:46 ` [PATCH 1/4] btrfs: refactor block group replication factor calculation to a helper David Sterba
@ 2018-07-13 18:46 ` David Sterba
  2018-07-13 21:02   ` Goffredo Baroncelli
  2018-07-13 18:46 ` [PATCH 3/4] btrfs: add support for 4-copy replication (raid1c4) David Sterba
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 36+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Add a new block group profile to store 3 copies in a similar way to
what the current RAID1 does. The profile name is temporary and may
change in the future.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c          |  6 +++++
 fs/btrfs/relocation.c           |  1 +
 fs/btrfs/scrub.c                |  3 ++-
 fs/btrfs/super.c                |  3 +++
 fs/btrfs/volumes.c              | 40 ++++++++++++++++++++++++++++-----
 fs/btrfs/volumes.h              |  2 ++
 include/uapi/linux/btrfs.h      |  3 ++-
 include/uapi/linux/btrfs_tree.h |  3 +++
 8 files changed, 53 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4ffa64e288da..47f929dcc3d4 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7527,6 +7527,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		if (!block_group_bits(block_group, flags)) {
 		    u64 extra = BTRFS_BLOCK_GROUP_DUP |
 				BTRFS_BLOCK_GROUP_RAID1 |
+				BTRFS_BLOCK_GROUP_RAID1C3 |
 				BTRFS_BLOCK_GROUP_RAID5 |
 				BTRFS_BLOCK_GROUP_RAID6 |
 				BTRFS_BLOCK_GROUP_RAID10;
@@ -9330,6 +9331,8 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 
 	num_devices = fs_info->fs_devices->rw_devices;
 
+	ASSERT(!(flags & BTRFS_BLOCK_GROUP_RAID1C3));
+
 	stripped = BTRFS_BLOCK_GROUP_RAID0 |
 		BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
 		BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;
@@ -9647,6 +9650,8 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
 		min_free >>= 1;
 	} else if (index == BTRFS_RAID_RAID1) {
 		dev_min = 2;
+	} else if (index == BTRFS_RAID_RAID1C3) {
+		dev_min = 3;
 	} else if (index == BTRFS_RAID_DUP) {
 		/* Multiply by 2 */
 		min_free <<= 1;
@@ -10141,6 +10146,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		if (!(get_alloc_profile(info, space_info->flags) &
 		      (BTRFS_BLOCK_GROUP_RAID10 |
 		       BTRFS_BLOCK_GROUP_RAID1 |
+		       BTRFS_BLOCK_GROUP_RAID1C3 |
 		       BTRFS_BLOCK_GROUP_RAID5 |
 		       BTRFS_BLOCK_GROUP_RAID6 |
 		       BTRFS_BLOCK_GROUP_DUP)))
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 879b76fa881a..fea9e7e96b87 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -4339,6 +4339,7 @@ static void describe_relocation(struct btrfs_fs_info *fs_info,
 		DESCRIBE_FLAG(METADATA, "metadata");
 		DESCRIBE_FLAG(RAID0,    "raid0");
 		DESCRIBE_FLAG(RAID1,    "raid1");
+		DESCRIBE_FLAG(RAID1C3,  "raid1c3");
 		DESCRIBE_FLAG(DUP,      "dup");
 		DESCRIBE_FLAG(RAID10,   "raid10");
 		DESCRIBE_FLAG(RAID5,    "raid5");
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 572306036477..e9355759f2ec 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3388,7 +3388,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		offset = map->stripe_len * (num / map->sub_stripes);
 		increment = map->stripe_len * factor;
 		mirror_num = num % map->sub_stripes + 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+				BTRFS_BLOCK_GROUP_RAID1C3)) {
 		increment = map->stripe_len;
 		mirror_num = num % map->num_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 4f646b66cc06..86e6aa5ef788 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1977,6 +1977,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
 		min_stripes = 2;
 		num_stripes = 2;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
+		min_stripes = 3;
+		num_stripes = 3;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 45635f4d78c8..0920b31e999d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -116,6 +116,18 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
 	},
+	[BTRFS_RAID_RAID1C3] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 3,
+		.tolerated_failures = 2,
+		.devs_increment	= 3,
+		.ncopies	= 3,
+		.raid_name	= "raid1c3",
+		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C3,
+		.mindev_error	= BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
+	},
 };
 
 const char *get_raid_name(enum btrfs_raid_types type)
@@ -3336,6 +3348,8 @@ static int chunk_drange_filter(struct extent_buffer *leaf,
 	if (btrfs_chunk_type(leaf, chunk) & (BTRFS_BLOCK_GROUP_DUP |
 	     BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)) {
 		factor = num_stripes / 2;
+	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID1C3) {
+		factor = num_stripes / 3;
 	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID5) {
 		factor = num_stripes - 1;
 	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID6) {
@@ -3822,7 +3836,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
-		allowed |= BTRFS_BLOCK_GROUP_RAID5;
+		allowed |= BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID1C3;
 	if (num_devices > 3)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID10 |
 			    BTRFS_BLOCK_GROUP_RAID6);
@@ -3856,6 +3870,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 
 	/* allow to reduce meta or sys integrity only if force set */
 	allowed = BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+			BTRFS_BLOCK_GROUP_RAID1C3 |
 			BTRFS_BLOCK_GROUP_RAID10 |
 			BTRFS_BLOCK_GROUP_RAID5 |
 			BTRFS_BLOCK_GROUP_RAID6;
@@ -4787,8 +4802,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
 	     btrfs_cmp_device_info, NULL);
 
-	/* round down to number of usable stripes */
-	ndevs = round_down(ndevs, devs_increment);
+	/*
+	 * Round down to number of usable stripes, devs_increment can be any
+	 * number so we can't use round_down()
+	 */
+	ndevs -= ndevs % devs_increment;
 
 	if (ndevs < devs_min) {
 		ret = -ENOSPC;
@@ -5075,6 +5093,8 @@ static inline int btrfs_chunk_max_errors(struct map_lookup *map)
 			 BTRFS_BLOCK_GROUP_RAID5 |
 			 BTRFS_BLOCK_GROUP_DUP)) {
 		max_errors = 1;
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1C3) {
+		max_errors = 2;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
 		max_errors = 2;
 	} else {
@@ -5163,7 +5183,8 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		return 1;
 
 	map = em->map_lookup;
-	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
+	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
+			 BTRFS_BLOCK_GROUP_RAID1C3))
 		ret = map->num_stripes;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		ret = map->sub_stripes;
@@ -5237,7 +5258,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	struct btrfs_device *srcdev;
 
 	ASSERT((map->type &
-		 (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
+		 (BTRFS_BLOCK_GROUP_RAID1 |
+		  BTRFS_BLOCK_GROUP_RAID1C3 |
+		  BTRFS_BLOCK_GROUP_RAID10)));
 
 	if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		num_stripes = map->sub_stripes;
@@ -5427,6 +5450,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 		div_u64_rem(stripe_nr_end - 1, factor, &last_stripe);
 		last_stripe *= sub_stripes;
 	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+				BTRFS_BLOCK_GROUP_RAID1C3 |
 				BTRFS_BLOCK_GROUP_DUP)) {
 		num_stripes = map->num_stripes;
 	} else {
@@ -5792,7 +5816,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 				&stripe_index);
 		if (!need_full_stripe(op))
 			mirror_num = 1;
-	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
+	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
+				BTRFS_BLOCK_GROUP_RAID1C3)) {
 		if (need_full_stripe(op))
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
@@ -6441,6 +6466,7 @@ static int btrfs_check_chunk_valid(struct btrfs_fs_info *fs_info,
 	}
 	if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
+	    (type & BTRFS_BLOCK_GROUP_RAID1C3 && num_stripes < 3) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
 	    (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
@@ -7389,5 +7415,7 @@ int btrfs_bg_type_to_factor(u64 flags)
 	if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
 		     BTRFS_BLOCK_GROUP_RAID10))
 		return 2;
+	if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+		return 3;
 	return 1;
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index c7b9ad9733ea..5be624896dad 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -537,6 +537,8 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 		return BTRFS_RAID_RAID10;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
 		return BTRFS_RAID_RAID1;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
+		return BTRFS_RAID_RAID1C3;
 	else if (flags & BTRFS_BLOCK_GROUP_DUP)
 		return BTRFS_RAID_DUP;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5ca1d21fc4a7..137952d3375d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -825,7 +825,8 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_TGT_REPLACE,
 	BTRFS_ERROR_DEV_MISSING_NOT_FOUND,
 	BTRFS_ERROR_DEV_ONLY_WRITABLE,
-	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
+	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
+	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
 };
 
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index aff1356c2bb8..fa75b63dd928 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -836,6 +836,7 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
 #define BTRFS_BLOCK_GROUP_RAID5         (1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
+#define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -847,6 +848,7 @@ enum btrfs_raid_types {
 	BTRFS_RAID_SINGLE,
 	BTRFS_RAID_RAID5,
 	BTRFS_RAID_RAID6,
+	BTRFS_RAID_RAID1C3,
 	BTRFS_NR_RAID_TYPES
 };
 
@@ -856,6 +858,7 @@ enum btrfs_raid_types {
 
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
+					 BTRFS_BLOCK_GROUP_RAID1C3 | \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
 					 BTRFS_BLOCK_GROUP_DUP |     \
-- 
2.18.0



* [PATCH 3/4] btrfs: add support for 4-copy replication (raid1c4)
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
                   ` (2 preceding siblings ...)
  2018-07-13 18:46 ` [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3) David Sterba
@ 2018-07-13 18:46 ` David Sterba
  2018-07-13 18:46 ` [PATCH 4/4] btrfs: add incompatibility bit for extended raid features David Sterba
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

Add a new block group profile to store 4 copies in a similar way to
what the current RAID1 does. The profile name is temporary and may
change in the future.

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/extent-tree.c          |  3 +++
 fs/btrfs/scrub.c                |  3 ++-
 fs/btrfs/super.c                |  3 +++
 fs/btrfs/volumes.c              | 28 ++++++++++++++++++++++++++--
 fs/btrfs/volumes.h              |  2 ++
 include/uapi/linux/btrfs.h      |  1 +
 include/uapi/linux/btrfs_tree.h |  3 +++
 7 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 47f929dcc3d4..ac7ec7a274d9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7528,6 +7528,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
 		    u64 extra = BTRFS_BLOCK_GROUP_DUP |
 				BTRFS_BLOCK_GROUP_RAID1 |
 				BTRFS_BLOCK_GROUP_RAID1C3 |
+				BTRFS_BLOCK_GROUP_RAID1C4 |
 				BTRFS_BLOCK_GROUP_RAID5 |
 				BTRFS_BLOCK_GROUP_RAID6 |
 				BTRFS_BLOCK_GROUP_RAID10;
@@ -9332,6 +9333,7 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
 	num_devices = fs_info->fs_devices->rw_devices;
 
 	ASSERT(!(flags & BTRFS_BLOCK_GROUP_RAID1C3));
+	ASSERT(!(flags & BTRFS_BLOCK_GROUP_RAID1C4));
 
 	stripped = BTRFS_BLOCK_GROUP_RAID0 |
 		BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
@@ -10147,6 +10149,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 		      (BTRFS_BLOCK_GROUP_RAID10 |
 		       BTRFS_BLOCK_GROUP_RAID1 |
 		       BTRFS_BLOCK_GROUP_RAID1C3 |
+		       BTRFS_BLOCK_GROUP_RAID1C4 |
 		       BTRFS_BLOCK_GROUP_RAID5 |
 		       BTRFS_BLOCK_GROUP_RAID6 |
 		       BTRFS_BLOCK_GROUP_DUP)))
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e9355759f2ec..2cb97a895dfb 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3389,7 +3389,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		increment = map->stripe_len * factor;
 		mirror_num = num % map->sub_stripes + 1;
 	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
-				BTRFS_BLOCK_GROUP_RAID1C3)) {
+				BTRFS_BLOCK_GROUP_RAID1C3 |
+				BTRFS_BLOCK_GROUP_RAID1C4)) {
 		increment = map->stripe_len;
 		mirror_num = num % map->num_stripes + 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 86e6aa5ef788..997d19c669ab 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1980,6 +1980,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
 	} else if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
 		min_stripes = 3;
 		num_stripes = 3;
+	} else if (type & BTRFS_BLOCK_GROUP_RAID1C4) {
+		min_stripes = 4;
+		num_stripes = 4;
 	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
 		min_stripes = 4;
 		num_stripes = 4;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0920b31e999d..62a8d8844dd4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -128,6 +128,18 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
 		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C3,
 		.mindev_error	= BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
 	},
+	[BTRFS_RAID_RAID1C4] = {
+		.sub_stripes	= 1,
+		.dev_stripes	= 1,
+		.devs_max	= 0,
+		.devs_min	= 4,
+		.tolerated_failures = 3,
+		.devs_increment	= 4,
+		.ncopies	= 4,
+		.raid_name	= "raid1c4",
+		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C4,
+		.mindev_error	= BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
+	},
 };
 
 const char *get_raid_name(enum btrfs_raid_types type)
@@ -3350,6 +3362,8 @@ static int chunk_drange_filter(struct extent_buffer *leaf,
 		factor = num_stripes / 2;
 	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID1C3) {
 		factor = num_stripes / 3;
+	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID1C4) {
+		factor = num_stripes / 4;
 	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID5) {
 		factor = num_stripes - 1;
 	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID6) {
@@ -3839,6 +3853,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 		allowed |= BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID1C3;
 	if (num_devices > 3)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID10 |
+			    BTRFS_BLOCK_GROUP_RAID1C4 |
 			    BTRFS_BLOCK_GROUP_RAID6);
 	if (validate_convert_profile(&bctl->data, allowed)) {
 		int index = btrfs_bg_flags_to_raid_index(bctl->data.target);
@@ -3871,6 +3886,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 	/* allow to reduce meta or sys integrity only if force set */
 	allowed = BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
 			BTRFS_BLOCK_GROUP_RAID1C3 |
+			BTRFS_BLOCK_GROUP_RAID1C4 |
 			BTRFS_BLOCK_GROUP_RAID10 |
 			BTRFS_BLOCK_GROUP_RAID5 |
 			BTRFS_BLOCK_GROUP_RAID6;
@@ -5095,6 +5111,8 @@ static inline int btrfs_chunk_max_errors(struct map_lookup *map)
 		max_errors = 1;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1C3) {
 		max_errors = 2;
+	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1C4) {
+		max_errors = 3;
 	} else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
 		max_errors = 2;
 	} else {
@@ -5184,7 +5202,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 
 	map = em->map_lookup;
 	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
-			 BTRFS_BLOCK_GROUP_RAID1C3))
+			 BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4))
 		ret = map->num_stripes;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
 		ret = map->sub_stripes;
@@ -5260,6 +5278,7 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
 	ASSERT((map->type &
 		 (BTRFS_BLOCK_GROUP_RAID1 |
 		  BTRFS_BLOCK_GROUP_RAID1C3 |
+		  BTRFS_BLOCK_GROUP_RAID1C4 |
 		  BTRFS_BLOCK_GROUP_RAID10)));
 
 	if (map->type & BTRFS_BLOCK_GROUP_RAID10)
@@ -5451,6 +5470,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
 		last_stripe *= sub_stripes;
 	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
 				BTRFS_BLOCK_GROUP_RAID1C3 |
+				BTRFS_BLOCK_GROUP_RAID1C4 |
 				BTRFS_BLOCK_GROUP_DUP)) {
 		num_stripes = map->num_stripes;
 	} else {
@@ -5817,7 +5837,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 		if (!need_full_stripe(op))
 			mirror_num = 1;
 	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
-				BTRFS_BLOCK_GROUP_RAID1C3)) {
+				BTRFS_BLOCK_GROUP_RAID1C3 |
+				BTRFS_BLOCK_GROUP_RAID1C4)) {
 		if (need_full_stripe(op))
 			num_stripes = map->num_stripes;
 		else if (mirror_num)
@@ -6467,6 +6488,7 @@ static int btrfs_check_chunk_valid(struct btrfs_fs_info *fs_info,
 	if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID1C3 && num_stripes < 3) ||
+	    (type & BTRFS_BLOCK_GROUP_RAID1C4 && num_stripes < 4) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
 	    (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
 	    (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
@@ -7417,5 +7439,7 @@ int btrfs_bg_type_to_factor(u64 flags)
 		return 2;
 	if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
 		return 3;
+	if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+		return 4;
 	return 1;
 }
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 5be624896dad..9c71d6ef7791 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -539,6 +539,8 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
 		return BTRFS_RAID_RAID1;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
 		return BTRFS_RAID_RAID1C3;
+	else if (flags & BTRFS_BLOCK_GROUP_RAID1C4)
+		return BTRFS_RAID_RAID1C4;
 	else if (flags & BTRFS_BLOCK_GROUP_DUP)
 		return BTRFS_RAID_DUP;
 	else if (flags & BTRFS_BLOCK_GROUP_RAID0)
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 137952d3375d..229ef2e135ac 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -827,6 +827,7 @@ enum btrfs_err_code {
 	BTRFS_ERROR_DEV_ONLY_WRITABLE,
 	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
 	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
+	BTRFS_ERROR_DEV_RAID1C4_MIN_NOT_MET,
 };
 
 #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
index fa75b63dd928..ce0443115982 100644
--- a/include/uapi/linux/btrfs_tree.h
+++ b/include/uapi/linux/btrfs_tree.h
@@ -837,6 +837,7 @@ struct btrfs_dev_replace_item {
 #define BTRFS_BLOCK_GROUP_RAID5         (1ULL << 7)
 #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
 #define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
+#define BTRFS_BLOCK_GROUP_RAID1C4       (1ULL << 10)
 #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
 					 BTRFS_SPACE_INFO_GLOBAL_RSV)
 
@@ -849,6 +850,7 @@ enum btrfs_raid_types {
 	BTRFS_RAID_RAID5,
 	BTRFS_RAID_RAID6,
 	BTRFS_RAID_RAID1C3,
+	BTRFS_RAID_RAID1C4,
 	BTRFS_NR_RAID_TYPES
 };
 
@@ -859,6 +861,7 @@ enum btrfs_raid_types {
 #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
 					 BTRFS_BLOCK_GROUP_RAID1 |   \
 					 BTRFS_BLOCK_GROUP_RAID1C3 | \
+					 BTRFS_BLOCK_GROUP_RAID1C4 | \
 					 BTRFS_BLOCK_GROUP_RAID5 |   \
 					 BTRFS_BLOCK_GROUP_RAID6 |   \
 					 BTRFS_BLOCK_GROUP_DUP |     \
-- 
2.18.0



* [PATCH 4/4] btrfs: add incompatibility bit for extended raid features
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
                   ` (3 preceding siblings ...)
  2018-07-13 18:46 ` [PATCH 3/4] btrfs: add support for 4-copy replication (raid1c4) David Sterba
@ 2018-07-13 18:46 ` David Sterba
  2018-07-15 14:37 ` [PATCH 0/4] 3- and 4- copy RAID1 waxhead
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-13 18:46 UTC (permalink / raw)
  To: linux-btrfs; +Cc: David Sterba

This bit will cover all newly added RAID features:

- 3 copy replication
- 4 copy replication
- configurable stripe length
- triple parity
- raid56 log

Signed-off-by: David Sterba <dsterba@suse.com>
---
 fs/btrfs/ctree.h           | 1 +
 fs/btrfs/sysfs.c           | 2 ++
 fs/btrfs/volumes.c         | 9 +++++++++
 include/uapi/linux/btrfs.h | 8 ++++++++
 4 files changed, 20 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 118346aceea9..d734a0d7a3a9 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -265,6 +265,7 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_RAID56 |		\
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
+	 BTRFS_FEATURE_INCOMPAT_EXTENDED_RAID |		\
 	 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4a4e960c7c66..f67824fbbab4 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -194,6 +194,7 @@ BTRFS_FEAT_ATTR_INCOMPAT(raid56, RAID56);
 BTRFS_FEAT_ATTR_INCOMPAT(skinny_metadata, SKINNY_METADATA);
 BTRFS_FEAT_ATTR_INCOMPAT(no_holes, NO_HOLES);
 BTRFS_FEAT_ATTR_COMPAT_RO(free_space_tree, FREE_SPACE_TREE);
+BTRFS_FEAT_ATTR_INCOMPAT(extended_raid, EXTENDED_RAID);
 
 static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(mixed_backref),
@@ -207,6 +208,7 @@ static struct attribute *btrfs_supported_feature_attrs[] = {
 	BTRFS_FEAT_ATTR_PTR(skinny_metadata),
 	BTRFS_FEAT_ATTR_PTR(no_holes),
 	BTRFS_FEAT_ATTR_PTR(free_space_tree),
+	BTRFS_FEAT_ATTR_PTR(extended_raid),
 	NULL
 };
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 62a8d8844dd4..bf52b2b4c6a4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4663,6 +4663,14 @@ static void check_raid56_incompat_flag(struct btrfs_fs_info *info, u64 type)
 	btrfs_set_fs_incompat(info, RAID56);
 }
 
+static void check_extended_raid_incompat_flag(struct btrfs_fs_info *info, u64 type)
+{
+	if (!(type & (BTRFS_BLOCK_GROUP_RAID1C3 | BTRFS_BLOCK_GROUP_RAID1C4)))
+		return;
+
+	btrfs_set_fs_incompat(info, EXTENDED_RAID);
+}
+
 #define BTRFS_MAX_DEVS(info) ((BTRFS_MAX_ITEM_SIZE(info)	\
 			- sizeof(struct btrfs_chunk))		\
 			/ sizeof(struct btrfs_stripe) + 1)
@@ -4945,6 +4953,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
 
 	free_extent_map(em);
 	check_raid56_incompat_flag(info, type);
+	check_extended_raid_incompat_flag(info, type);
 
 	kfree(devices_info);
 	return 0;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 229ef2e135ac..490cc3e66b94 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -270,6 +270,14 @@ struct btrfs_ioctl_fs_info_args {
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
 
+/*
+ * More RAID features:
+ * - RAID1C3 - 3-copy mirroring
+ * - RAID1C4 - 4-copy mirroring
+ * - ...
+ */
+#define BTRFS_FEATURE_INCOMPAT_EXTENDED_RAID	(1ULL << 10)
+
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
 	__u64 compat_ro_flags;
-- 
2.18.0



* Re: [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3)
  2018-07-13 18:46 ` [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3) David Sterba
@ 2018-07-13 21:02   ` Goffredo Baroncelli
  2018-07-17 16:00     ` David Sterba
  0 siblings, 1 reply; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-13 21:02 UTC (permalink / raw)
  To: David Sterba, linux-btrfs

As a general comment, it is good to hear that something is moving around
raid5/6 + write hole and multiple mirroring. However, I am wondering
whether this is the time to simplify the RAID code. There are a lot of
"if"s which could be avoided by using the values stored in the array
"btrfs_raid_array[]".
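For instance, many of the per-profile branches boil down to a lookup
like this (an untested sketch; "flags" stands for whatever block group
type is at hand, e.g. map->type, and it relies on the existing
btrfs_bg_flags_to_raid_index() helper and the ncopies/devs_min fields
that are already in the array):

	enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(flags);
	int ncopies = btrfs_raid_array[index].ncopies;
	int devs_min = btrfs_raid_array[index].devs_min;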

Below some comments:

On 07/13/2018 08:46 PM, David Sterba wrote:
> Add a new block group profile to store 3 copies in a similar way to
> what the current RAID1 does. The profile name is temporary and may
> change in the future.
> 
> Signed-off-by: David Sterba <dsterba@suse.com>
> ---
>  fs/btrfs/extent-tree.c          |  6 +++++
>  fs/btrfs/relocation.c           |  1 +
>  fs/btrfs/scrub.c                |  3 ++-
>  fs/btrfs/super.c                |  3 +++
>  fs/btrfs/volumes.c              | 40 ++++++++++++++++++++++++++++-----
>  fs/btrfs/volumes.h              |  2 ++
>  include/uapi/linux/btrfs.h      |  3 ++-
>  include/uapi/linux/btrfs_tree.h |  3 +++
>  8 files changed, 53 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4ffa64e288da..47f929dcc3d4 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7527,6 +7527,7 @@ static noinline int find_free_extent(struct btrfs_fs_info *fs_info,
>  		if (!block_group_bits(block_group, flags)) {
>  		    u64 extra = BTRFS_BLOCK_GROUP_DUP |
>  				BTRFS_BLOCK_GROUP_RAID1 |
> +				BTRFS_BLOCK_GROUP_RAID1C3 |
>  				BTRFS_BLOCK_GROUP_RAID5 |
>  				BTRFS_BLOCK_GROUP_RAID6 |
>  				BTRFS_BLOCK_GROUP_RAID10;

"extra" could be created iterating on btrfs_raid_array[] and considering only the item with ncopies > 1; or we could add 

#define BTRFS_BLOCK_GROUP_REDUNDANCY (BTRFS_BLOCK_GROUP_DUP| .....)

This constant could also be used below.
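
Spelled out following the "extra" mask in the hunk above, the define
could look something like this (only a sketch of the idea; the exact
member list, e.g. whether RAID1C4 from the next patch belongs here too,
is up for discussion):

#define BTRFS_BLOCK_GROUP_REDUNDANCY	(BTRFS_BLOCK_GROUP_DUP |	\
					 BTRFS_BLOCK_GROUP_RAID1 |	\
					 BTRFS_BLOCK_GROUP_RAID1C3 |	\
					 BTRFS_BLOCK_GROUP_RAID5 |	\
					 BTRFS_BLOCK_GROUP_RAID6 |	\
					 BTRFS_BLOCK_GROUP_RAID10)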

> @@ -9330,6 +9331,8 @@ static u64 update_block_group_flags(struct btrfs_fs_info *fs_info, u64 flags)
>  
>  	num_devices = fs_info->fs_devices->rw_devices;
>  
> +	ASSERT(!(flags & BTRFS_BLOCK_GROUP_RAID1C3));
> +
>  	stripped = BTRFS_BLOCK_GROUP_RAID0 |
>  		BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6 |
>  		BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10;
> @@ -9647,6 +9650,8 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr)
>  		min_free >>= 1;
>  	} else if (index == BTRFS_RAID_RAID1) {
>  		dev_min = 2;
> +	} else if (index == BTRFS_RAID_RAID1C3) {
> +		dev_min = 3;
>  	} else if (index == BTRFS_RAID_DUP) {
>  		/* Multiply by 2 */
>  		min_free <<= 1;

The "if"s above could be simplified as:

	dev_min = btrfs_raid_array[index].devs_min;
	if (index == BTRFS_RAID_DUP)
		min_free <<= 1;
	else if (index == BTRFS_RAID_RAID0)
		min_free = div64_u64(min_free, dev_min);
	else if (index == BTRFS_RAID_RAID10)
		min_free >>= 1;


> @@ -10141,6 +10146,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
>  		if (!(get_alloc_profile(info, space_info->flags) &
>  		      (BTRFS_BLOCK_GROUP_RAID10 |
>  		       BTRFS_BLOCK_GROUP_RAID1 |
> +		       BTRFS_BLOCK_GROUP_RAID1C3 |
>  		       BTRFS_BLOCK_GROUP_RAID5 |
>  		       BTRFS_BLOCK_GROUP_RAID6 |
>  		       BTRFS_BLOCK_GROUP_DUP)))


See above about BTRFS_BLOCK_GROUP_REDUNDANCY

> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 879b76fa881a..fea9e7e96b87 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -4339,6 +4339,7 @@ static void describe_relocation(struct btrfs_fs_info *fs_info,
>  		DESCRIBE_FLAG(METADATA, "metadata");

The code below could be replaced by a search in the array
btrfs_raid_array[] instead of checking each possibility.
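
Something along these lines could do it (an untested sketch; it assumes
the raid_name/bg_flag members of btrfs_raid_array[] are visible here,
or get_raid_name() could be used instead; "buf" stands for the
destination buffer and separator handling is left out):

	int i;

	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++) {
		if (flags & btrfs_raid_array[i].bg_flag)
			strcat(buf, btrfs_raid_array[i].raid_name);
	}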

>  		DESCRIBE_FLAG(RAID0,    "raid0");
>  		DESCRIBE_FLAG(RAID1,    "raid1");
> +		DESCRIBE_FLAG(RAID1C3,  "raid1c3");
>  		DESCRIBE_FLAG(DUP,      "dup");
>  		DESCRIBE_FLAG(RAID10,   "raid10");
>  		DESCRIBE_FLAG(RAID5,    "raid5");



> diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
> index 572306036477..e9355759f2ec 100644
> --- a/fs/btrfs/scrub.c
> +++ b/fs/btrfs/scrub.c
> @@ -3388,7 +3388,8 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
>  		offset = map->stripe_len * (num / map->sub_stripes);
>  		increment = map->stripe_len * factor;
>  		mirror_num = num % map->sub_stripes + 1;
> -	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
> +	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
> +				BTRFS_BLOCK_GROUP_RAID1C3)) {
>  		increment = map->stripe_len;
>  		mirror_num = num % map->num_stripes + 1;
>  	} else if (map->type & BTRFS_BLOCK_GROUP_DUP) {
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 4f646b66cc06..86e6aa5ef788 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1977,6 +1977,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info,
>  	} else if (type & BTRFS_BLOCK_GROUP_RAID1) {
>  		min_stripes = 2;
>  		num_stripes = 2;
> +	} else if (type & BTRFS_BLOCK_GROUP_RAID1C3) {
> +		min_stripes = 3;
> +		num_stripes = 3;
>  	} else if (type & BTRFS_BLOCK_GROUP_RAID10) {
>  		min_stripes = 4;
>  		num_stripes = 4;

> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 45635f4d78c8..0920b31e999d 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -116,6 +116,18 @@ const struct btrfs_raid_attr btrfs_raid_array[BTRFS_NR_RAID_TYPES] = {
>  		.bg_flag	= BTRFS_BLOCK_GROUP_RAID6,
>  		.mindev_error	= BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
>  	},
> +	[BTRFS_RAID_RAID1C3] = {
> +		.sub_stripes	= 1,
> +		.dev_stripes	= 1,
> +		.devs_max	= 0,
> +		.devs_min	= 3,
> +		.tolerated_failures = 2,
> +		.devs_increment	= 3,
> +		.ncopies	= 3,
> +		.raid_name	= "raid1c3",
> +		.bg_flag	= BTRFS_BLOCK_GROUP_RAID1C3,
> +		.mindev_error	= BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
> +	},
>  };
>  
>  const char *get_raid_name(enum btrfs_raid_types type)
> @@ -3336,6 +3348,8 @@ static int chunk_drange_filter(struct extent_buffer *leaf,
>  	if (btrfs_chunk_type(leaf, chunk) & (BTRFS_BLOCK_GROUP_DUP |
>  	     BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)) {
>  		factor = num_stripes / 2;
> +	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID1C3) {
> +		factor = num_stripes / 3;
>  	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID5) {
>  		factor = num_stripes - 1;
>  	} else if (btrfs_chunk_type(leaf, chunk) & BTRFS_BLOCK_GROUP_RAID6) {

Maybe it is time to add two other fields to the btrfs_raid_array[] array? I.e.:
	factor_div and factor_sub

	so factor could be computed as

	index = btrfs_bg_flags_to_raid_index(btrfs_chunk_type(leaf, chunk));
	factor = num_stripes / btrfs_raid_array[index].factor_div - btrfs_raid_array[index].factor_sub;

> @@ -3822,7 +3836,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>  	if (num_devices > 1)
>  		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
>  	if (num_devices > 2)
> -		allowed |= BTRFS_BLOCK_GROUP_RAID5;
> +		allowed |= BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID1C3;
>  	if (num_devices > 3)
>  		allowed |= (BTRFS_BLOCK_GROUP_RAID10 |
>  			    BTRFS_BLOCK_GROUP_RAID6);

The "if"s on the number of devices above could be replaced by a search
in the btrfs_raid_array[].
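
A rough, untested sketch of the idea (keeping the initial single/dup
initialization of "allowed" as it is; note that the current thresholds
for raid5/raid6 are one device stricter than their devs_min values, so
those two would need a second look):

	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
		if (num_devices >= btrfs_raid_array[i].devs_min)
			allowed |= btrfs_raid_array[i].bg_flag;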

> @@ -3856,6 +3870,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
>  
>  	/* allow to reduce meta or sys integrity only if force set */
>  	allowed = BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
> +			BTRFS_BLOCK_GROUP_RAID1C3 |
>  			BTRFS_BLOCK_GROUP_RAID10 |
>  			BTRFS_BLOCK_GROUP_RAID5 |
>  			BTRFS_BLOCK_GROUP_RAID6;

See above about BTRFS_BLOCK_GROUP_REDUNDANCY

> @@ -4787,8 +4802,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>  	     btrfs_cmp_device_info, NULL);
>  
> -	/* round down to number of usable stripes */
> -	ndevs = round_down(ndevs, devs_increment);
> +	/*
> +	 * Round down to number of usable stripes, devs_increment can be any
> +	 * number so we can't use round_down()
> +	 */
> +	ndevs -= ndevs % devs_increment;
>  
>  	if (ndevs < devs_min) {
>  		ret = -ENOSPC;
> @@ -5075,6 +5093,8 @@ static inline int btrfs_chunk_max_errors(struct map_lookup *map)
>  			 BTRFS_BLOCK_GROUP_RAID5 |
>  			 BTRFS_BLOCK_GROUP_DUP)) {
>  		max_errors = 1;
> +	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1C3) {
> +		max_errors = 2;
>  	} else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
>  		max_errors = 2;
>  	} else {

Even in this case the ifs above could be replaced with something like:

	index = btrfs_bg_flags_to_raid_index(map->type)
	max_errors = btrfs_raid_array[index].ncopies-1;

> @@ -5163,7 +5183,8 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
>  		return 1;
>  
>  	map = em->map_lookup;
> -	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1))
> +	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
> +			 BTRFS_BLOCK_GROUP_RAID1C3))
>  		ret = map->num_stripes;
>  	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
>  		ret = map->sub_stripes;


With the exception of the RAID6 case (which I don't understand), the ifs above could be replaced with

	index = btrfs_bg_flags_to_raid_index(map->type)
	ret = btrfs_raid_array[index].ncopies

> @@ -5237,7 +5258,9 @@ static int find_live_mirror(struct btrfs_fs_info *fs_info,
>  	struct btrfs_device *srcdev;
>  
>  	ASSERT((map->type &
> -		 (BTRFS_BLOCK_GROUP_RAID1 | BTRFS_BLOCK_GROUP_RAID10)));
> +		 (BTRFS_BLOCK_GROUP_RAID1 |
> +		  BTRFS_BLOCK_GROUP_RAID1C3 |
> +		  BTRFS_BLOCK_GROUP_RAID10)));
>  
>  	if (map->type & BTRFS_BLOCK_GROUP_RAID10)
>  		num_stripes = map->sub_stripes;
> @@ -5427,6 +5450,7 @@ static int __btrfs_map_block_for_discard(struct btrfs_fs_info *fs_info,
>  		div_u64_rem(stripe_nr_end - 1, factor, &last_stripe);
>  		last_stripe *= sub_stripes;
>  	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
> +				BTRFS_BLOCK_GROUP_RAID1C3 |
>  				BTRFS_BLOCK_GROUP_DUP)) {
>  		num_stripes = map->num_stripes;
>  	} else {
> @@ -5792,7 +5816,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
>  				&stripe_index);
>  		if (!need_full_stripe(op))
>  			mirror_num = 1;
> -	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1) {
> +	} else if (map->type & (BTRFS_BLOCK_GROUP_RAID1 |
> +				BTRFS_BLOCK_GROUP_RAID1C3)) {
>  		if (need_full_stripe(op))
>  			num_stripes = map->num_stripes;
>  		else if (mirror_num)
> @@ -6441,6 +6466,7 @@ static int btrfs_check_chunk_valid(struct btrfs_fs_info *fs_info,
>  	}
>  	if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
>  	    (type & BTRFS_BLOCK_GROUP_RAID1 && num_stripes < 1) ||
> +	    (type & BTRFS_BLOCK_GROUP_RAID1C3 && num_stripes < 3) ||
>  	    (type & BTRFS_BLOCK_GROUP_RAID5 && num_stripes < 2) ||
>  	    (type & BTRFS_BLOCK_GROUP_RAID6 && num_stripes < 3) ||
>  	    (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||


The check above could be translated into specific checks for BTRFS_BLOCK_GROUP_DUP and BTRFS_BLOCK_GROUP_RAID10; for the other cases we could simply check that num_stripes >= devs_min, e.g.:
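
	A sketch only, keeping the DUP and RAID10 special cases and letting
	devs_min cover the rest (assuming the table values match the
	open-coded constants above):

	index = btrfs_bg_flags_to_raid_index(type);
	if ((type & BTRFS_BLOCK_GROUP_RAID10 && sub_stripes != 2) ||
	    (type & BTRFS_BLOCK_GROUP_DUP && num_stripes > 2) ||
	    num_stripes < btrfs_raid_array[index].devs_min) {
		/* same error path as the existing check */
	}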

> @@ -7389,5 +7415,7 @@ int btrfs_bg_type_to_factor(u64 flags)
>  	if (flags & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1 |
>  		     BTRFS_BLOCK_GROUP_RAID10))
>  		return 2;
> +	if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
> +		return 3;
>  	return 1;
>  }

Even the function above could be replaced with something more general

int btrfs_bg_type_to_factor(u64 flags)
{
	if (flags & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
		return 1;
	return btrfs_raid_array[btrfs_bg_flags_to_raid_index(flags)].ncopies;
}


> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index c7b9ad9733ea..5be624896dad 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -537,6 +537,8 @@ static inline enum btrfs_raid_types btrfs_bg_flags_to_raid_index(u64 flags)
>  		return BTRFS_RAID_RAID10;
>  	else if (flags & BTRFS_BLOCK_GROUP_RAID1)
>  		return BTRFS_RAID_RAID1;
> +	else if (flags & BTRFS_BLOCK_GROUP_RAID1C3)
> +		return BTRFS_RAID_RAID1C3;
>  	else if (flags & BTRFS_BLOCK_GROUP_DUP)
>  		return BTRFS_RAID_DUP;
>  	else if (flags & BTRFS_BLOCK_GROUP_RAID0)

What about iterating over btrfs_raid_array[] in the above function? Something like:
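
	A sketch (untested), assuming btrfs_raid_array stays visible from
	this header; SINGLE has no profile bit, so it remains the fallback:

	for (i = 0; i < BTRFS_NR_RAID_TYPES; i++)
		if (btrfs_raid_array[i].bg_flag &&
		    (flags & btrfs_raid_array[i].bg_flag))
			return i;
	return BTRFS_RAID_SINGLE;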

> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> index 5ca1d21fc4a7..137952d3375d 100644
> --- a/include/uapi/linux/btrfs.h
> +++ b/include/uapi/linux/btrfs.h
> @@ -825,7 +825,8 @@ enum btrfs_err_code {
>  	BTRFS_ERROR_DEV_TGT_REPLACE,
>  	BTRFS_ERROR_DEV_MISSING_NOT_FOUND,
>  	BTRFS_ERROR_DEV_ONLY_WRITABLE,
> -	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS
> +	BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS,
> +	BTRFS_ERROR_DEV_RAID1C3_MIN_NOT_MET,
>  };
>  
>  #define BTRFS_IOC_SNAP_CREATE _IOW(BTRFS_IOCTL_MAGIC, 1, \
> diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h
> index aff1356c2bb8..fa75b63dd928 100644
> --- a/include/uapi/linux/btrfs_tree.h
> +++ b/include/uapi/linux/btrfs_tree.h
> @@ -836,6 +836,7 @@ struct btrfs_dev_replace_item {
>  #define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
>  #define BTRFS_BLOCK_GROUP_RAID5         (1ULL << 7)
>  #define BTRFS_BLOCK_GROUP_RAID6         (1ULL << 8)
> +#define BTRFS_BLOCK_GROUP_RAID1C3       (1ULL << 9)
>  #define BTRFS_BLOCK_GROUP_RESERVED	(BTRFS_AVAIL_ALLOC_BIT_SINGLE | \
>  					 BTRFS_SPACE_INFO_GLOBAL_RSV)
>  
> @@ -847,6 +848,7 @@ enum btrfs_raid_types {
>  	BTRFS_RAID_SINGLE,
>  	BTRFS_RAID_RAID5,
>  	BTRFS_RAID_RAID6,
> +	BTRFS_RAID_RAID1C3,
>  	BTRFS_NR_RAID_TYPES
>  };
>  
> @@ -856,6 +858,7 @@ enum btrfs_raid_types {
>  
>  #define BTRFS_BLOCK_GROUP_PROFILE_MASK	(BTRFS_BLOCK_GROUP_RAID0 |   \
>  					 BTRFS_BLOCK_GROUP_RAID1 |   \
> +					 BTRFS_BLOCK_GROUP_RAID1C3 | \
>  					 BTRFS_BLOCK_GROUP_RAID5 |   \
>  					 BTRFS_BLOCK_GROUP_RAID6 |   \
>  					 BTRFS_BLOCK_GROUP_DUP |     \
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
                   ` (4 preceding siblings ...)
  2018-07-13 18:46 ` [PATCH 4/4] btrfs: add incompatibility bit for extended raid features David Sterba
@ 2018-07-15 14:37 ` waxhead
  2018-07-16 18:29   ` Goffredo Baroncelli
  2018-07-16 21:51   ` waxhead
  2018-07-15 14:46 ` Hugo Mills
  2018-07-19  7:27 ` Qu Wenruo
  7 siblings, 2 replies; 36+ messages in thread
From: waxhead @ 2018-07-15 14:37 UTC (permalink / raw)
  To: David Sterba, linux-btrfs

David Sterba wrote:
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs&m=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.
> 
As just a humble BTRFS user I agree and really think it is about time to 
move far away from the RAID terminology. However adding some more 
descriptive profile names (or at least some aliases) would be much 
better for the commoners (such as myself).

For example:

Old format / New Format / My suggested alias
SINGLE  / 1C     / SINGLE
DUP     / 2CD    / DUP (or even MIRRORLOCAL1)
RAID0   / 1CmS   / STRIPE
RAID1   / 2C     / MIRROR1
RAID1c3 / 3C     / MIRROR2
RAID1c4 / 4C     / MIRROR3
RAID10  / 2CmS   / STRIPE.MIRROR1
RAID5   / 1CmS1P / STRIPE.PARITY1
RAID6   / 1CmS2P / STRIPE.PARITY2

I find that writing something like "btrfs balance start 
-dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less 
error prone than writing "-dconvert=1C5S2P".

While Hugo's suggestion is compact and to the point I would call for 
expanding that so it is a bit more descriptive and human readable.

So, for example: STRIPE<num>, where <num> is the same as Hugo proposed 
(the number of storage devices in the stripe), and omitting <num> would 
mean 'use max devices'.
For PARITY, <num> is obviously required.

Keep in mind that most people (...and I am willing to bet even Duncan, 
who probably HAS backups ;) ) get a bit stressed when their storage 
system is degraded. With that in mind I hope for more elaborate, 
descriptive and human readable profile names to be used to avoid making 
mistakes using the "compact" layout.

...and yes, of course this could go both ways. A more compact (and dare 
I say cryptic) variant can cause people to stop and think before doing 
something and thus avoid errors.

Now that I made my point I can't help being a bit extra harsh, obnoxious 
and possibly difficult, so I would also suggest that Hugo's format could 
have been changed (dare I say improved?) from....

numCOPIESnumSTRIPESnumPARITY

to.....

REPLICASnum.STRIPESnum.PARITYnum

Which would make the above table look like so:

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R1.Sm.P1 / STRIPE.PARITY1
RAID6   / R1.Sm.P2 / STRIPE.PARITY2

And I think this is much more readable, but others may disagree. And as 
a side note... from a (hobby) coder's perspective this is probably 
simpler to parse as well.
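
To back that up, a toy parser for the proposed notation could be as small 
as the following (plain C, purely illustrative, not part of any btrfs tool):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse "R<n>.S<n|m>.P<n>"; *stripes == -1 means 'm' (use max devices). */
int parse_profile(const char *s, int *replicas, int *stripes, int *parity)
{
	char st[8];

	if (sscanf(s, "R%d.S%7[0-9m].P%d", replicas, st, parity) != 3)
		return -1;
	*stripes = strcmp(st, "m") ? atoi(st) : -1;
	return 0;
}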

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
                   ` (5 preceding siblings ...)
  2018-07-15 14:37 ` [PATCH 0/4] 3- and 4- copy RAID1 waxhead
@ 2018-07-15 14:46 ` Hugo Mills
  2018-07-19  7:27 ` Qu Wenruo
  7 siblings, 0 replies; 36+ messages in thread
From: Hugo Mills @ 2018-07-15 14:46 UTC (permalink / raw)
  To: David Sterba; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 952 bytes --]

On Fri, Jul 13, 2018 at 08:46:28PM +0200, David Sterba wrote:
[snip]
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs&m=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.

   I'd suggest using lower-case letter for the c, s, p, rather than
upper, as it makes it much easier to read. The upper-case version
tends to make the letters and numbers merge into each other. With
lower-case c, s, p, the taller digits (or M) stand out:

  1c
  1cMs2p
  2c3s8p (OK, just kidding about this one)

   Hugo.

-- 
Hugo Mills             | The English language has the mot juste for every
hugo@... carfax.org.uk | occasion.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-15 14:37 ` [PATCH 0/4] 3- and 4- copy RAID1 waxhead
@ 2018-07-16 18:29   ` Goffredo Baroncelli
  2018-07-16 18:49     ` Austin S. Hemmelgarn
  2018-07-17 21:12     ` Duncan
  2018-07-16 21:51   ` waxhead
  1 sibling, 2 replies; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-16 18:29 UTC (permalink / raw)
  To: waxhead, David Sterba, linux-btrfs

On 07/15/2018 04:37 PM, waxhead wrote:
> David Sterba wrote:
>> An interesting question is the naming of the extended profiles. I picked
>> something that can be easily understood but it's not a final proposal.
>> Years ago, Hugo proposed a naming scheme that described the
>> non-standard raid varieties of the btrfs flavor:
>>
>> https://marc.info/?l=linux-btrfs&m=136286324417767
>>
>> Switching to this naming would be a good addition to the extended raid.
>>
> As just a humble BTRFS user I agree and really think it is about time to move far away from the RAID terminology. However adding some more descriptive profile names (or at least some aliases) would be much better for the commoners (such as myself).
> 
> For example:
> 
> Old format / New Format / My suggested alias
> SINGLE  / 1C     / SINGLE
> DUP     / 2CD    / DUP (or even MIRRORLOCAL1)
> RAID0   / 1CmS   / STRIPE


> RAID1   / 2C     / MIRROR1
> RAID1c3 / 3C     / MIRROR2
> RAID1c4 / 4C     / MIRROR3
> RAID10  / 2CmS   / STRIPE.MIRROR1

Striping and mirroring/pairing are orthogonal properties; mirror and parity are mutually exclusive. What about

RAID1 -> MIRROR1
RAID10 -> MIRROR1S
RAID1c3 -> MIRROR2
RAID1c3+striping -> MIRROR2S

and so on...

> RAID5   / 1CmS1P / STRIPE.PARITY1
> RAID6   / 1CmS2P / STRIPE.PARITY2

To me these should be called something like

RAID5 -> PARITY1S
RAID6 -> PARITY2S

The final S is due to the fact that RAID5/6 usually spread the data over all available disks

Question #1: for "parity" profiles, does it make sense to limit the maximum number of disks the data may be spread over? If the answer is no, we could omit the last S. IMHO it should. 
Question #2: historically RAID10 requires 4 disks. However I am wondering whether the stripe could be done on a different number of disks: what about RAID1+striping on 3 (or 5) disks? The key of striping is that every 64k the data are stored on a different disk....







> 
> I find that writing something like "btrfs balance start -dconvert=stripe5.parity2 /mnt" is far less confusing and therefore less error prone than writing "-dconvert=1C5S2P".
> 
> While Hugo's suggestion is compact and to the point I would call for expanding that so it is a bit more descriptive and human readable.
> 
> So for example : STRIPE<num> where <num> obviously is the same as Hugo proposed - the number of storage devices for the stripe and no <num> would be best to mean 'use max devices'.
> For PARITY then <num> is obviously required
> 
> Keep in mind that most people (...and I am willing to bet even Duncan which probably HAS backups ;) ) get a bit stressed when their storage system is degraded. With that in mind I hope for more elaborate, descriptive and human readable profile names to be used to avoid making mistakes using the "compact" layout.
> 
> ...and yes, of course this could go both ways. A more compact (and dare I say cryptic) variant can cause people to stop and think before doing something and thus avoid errors,
> 
> Now that I made my point I can't help being a bit extra hash, obnoxious and possibly difficult so I would also suggest that Hugo's format could have been changed (dare I say improved?) from....
> 
> numCOPIESnumSTRIPESnumPARITY
> 
> to.....
> 
> REPLICASnum.STRIPESnum.PARITYnum
> 
> Which would make the above table look like so:
> 
> Old format / My Format / My suggested alias
> SINGLE  / R0.S0.P0 / SINGLE
> DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
> RAID0   / R0.Sm.P0 / STRIPE
> RAID1   / R1.S0.P0 / MIRROR1
> RAID1c3 / R2.S0.P0 / MIRROR2
> RAID1c4 / R3.S0.P0 / MIRROR3
> RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
> RAID5   / R1.Sm.P1 / STRIPE.PARITY1
> RAID6   / R1.Sm.P2 / STRIPE.PARITY2
> 
> And i think this is much more readable, but others may disagree. And as a side note... from a (hobby) coders perspective this is probably simpler to parse as well.
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-16 18:29   ` Goffredo Baroncelli
@ 2018-07-16 18:49     ` Austin S. Hemmelgarn
  2018-07-17 21:12     ` Duncan
  1 sibling, 0 replies; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-16 18:49 UTC (permalink / raw)
  To: kreijack, waxhead, David Sterba, linux-btrfs

On 2018-07-16 14:29, Goffredo Baroncelli wrote:
> On 07/15/2018 04:37 PM, waxhead wrote:
>> David Sterba wrote:
>>> An interesting question is the naming of the extended profiles. I picked
>>> something that can be easily understood but it's not a final proposal.
>>> Years ago, Hugo proposed a naming scheme that described the
>>> non-standard raid varieties of the btrfs flavor:
>>>
>>> https://marc.info/?l=linux-btrfs&m=136286324417767
>>>
>>> Switching to this naming would be a good addition to the extended raid.
>>>
>> As just a humble BTRFS user I agree and really think it is about time to move far away from the RAID terminology. However adding some more descriptive profile names (or at least some aliases) would be much better for the commoners (such as myself).
>>
>> For example:
>>
>> Old format / New Format / My suggested alias
>> SINGLE  / 1C     / SINGLE
>> DUP     / 2CD    / DUP (or even MIRRORLOCAL1)
>> RAID0   / 1CmS   / STRIPE
> 
> 
>> RAID1   / 2C     / MIRROR1
>> RAID1c3 / 3C     / MIRROR2
>> RAID1c4 / 4C     / MIRROR3
>> RAID10  / 2CmS   / STRIPE.MIRROR1
> 
> Striping and mirroring/pairing are orthogonal properties; mirror and parity are mutually exclusive. What about
> 
> RAID1 -> MIRROR1
> RAID10 -> MIRROR1S
> RAID1c3 -> MIRROR2
> RAID1c3+striping -> MIRROR2S
> 
> and so on...
> 
>> RAID5   / 1CmS1P / STRIPE.PARITY1
>> RAID6   / 1CmS2P / STRIPE.PARITY2
> 
> To me these should be called something like
> 
> RAID5 -> PARITY1S
> RAID6 -> PARITY2S
> 
> The S final is due to the fact that usually RAID5/6 spread the data on all available disks
> 
> Question #1: for "parity" profiles, does make sense to limit the maximum disks number where the data may be spread ? If the answer is not, we could omit the last S. IMHO it should.
Currently, there is no ability to cap the number of disks that striping 
can happen across.  Ideally, that will change in the future, in which 
case not only the S will be needed, but also a number indicating how 
wide the stripe is.

> Question #2: historically RAID10 is requires 4 disks. However I am guessing if the stripe could be done on a different number of disks: What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is that every 64k, the data are stored on a different disk....
This is what MD and LVM RAID10 do.  They work somewhat differently from 
what BTRFS calls raid10 (actually, what we currently call raid1 works 
almost identically to MD and LVM RAID10 when more than 3 disks are 
involved, except that the chunk size is 1G or larger).  Short of drastic 
internal changes to how that profile works, this isn't likely to happen.

In spite of both of these, there is practical need for indicating the 
stripe width.  Depending on the configuration of the underlying storage, 
it's fully possible (and sometimes even certain) that you will see 
chunks with differing stripe widths, so properly reporting the stripe 
width (in devices, not bytes) is useful for monitoring purposes.

Consider for example a 6-device array using what's currently called a 
raid10 profile where 2 of the disks are smaller than the other four.  On 
such an array, chunks will span all six disks (resulting in 2 copies 
striped across 3 disks each) until those two smaller disks are full, at 
which point new chunks will span only the remaining four disks 
(resulting in 2 copies striped across 2 disks each).

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-15 14:37 ` [PATCH 0/4] 3- and 4- copy RAID1 waxhead
  2018-07-16 18:29   ` Goffredo Baroncelli
@ 2018-07-16 21:51   ` waxhead
  1 sibling, 0 replies; 36+ messages in thread
From: waxhead @ 2018-07-16 21:51 UTC (permalink / raw)
  To: David Sterba, linux-btrfs

waxhead wrote:
> David Sterba wrote:
>> An interesting question is the naming of the extended profiles. I picked
>> something that can be easily understood but it's not a final proposal.
>> Years ago, Hugo proposed a naming scheme that described the
>> non-standard raid varieties of the btrfs flavor:
>>
>> https://marc.info/?l=linux-btrfs&m=136286324417767
>>
>> Switching to this naming would be a good addition to the extended raid.
>>
> As just a humble BTRFS user I agree and really think it is about time to 
> move far away from the RAID terminology. However adding some more 
> descriptive profile names (or at least some aliases) would be much 
> better for the commoners (such as myself). 
>...snip... > Which would make the above table look like so:
> 
> Old format / My Format / My suggested alias
> SINGLE  / R0.S0.P0 / SINGLE
> DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
> RAID0   / R0.Sm.P0 / STRIPE
> RAID1   / R1.S0.P0 / MIRROR1
> RAID1c3 / R2.S0.P0 / MIRROR2
> RAID1c4 / R3.S0.P0 / MIRROR3
> RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
> RAID5   / R1.Sm.P1 / STRIPE.PARITY1
> RAID6   / R1.Sm.P2 / STRIPE.PARITY2
> 
> And i think this is much more readable, but others may disagree. And as 
> a side note... from a (hobby) coders perspective this is probably 
> simpler to parse as well. 
>...snap...

...and before someone else points out that my suggestion has an ugly 
flaw, I got a bit copy/paste happy and messed up the RAID5- and 
RAID6-like profiles. The table below is corrected, and hopefully it 
makes the point why using the word 'replicas' is easier to understand 
than 'copies', even if I messed it up :)

Old format / My Format / My suggested alias
SINGLE  / R0.S0.P0 / SINGLE
DUP     / R1.S1.P0 / DUP (or even MIRRORLOCAL1)
RAID0   / R0.Sm.P0 / STRIPE
RAID1   / R1.S0.P0 / MIRROR1
RAID1c3 / R2.S0.P0 / MIRROR2
RAID1c4 / R3.S0.P0 / MIRROR3
RAID10  / R1.Sm.P0 / STRIPE.MIRROR1
RAID5   / R0.Sm.P1 / STRIPE.PARITY1
RAID6   / R0.Sm.P2 / STRIPE.PARITY2

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3)
  2018-07-13 21:02   ` Goffredo Baroncelli
@ 2018-07-17 16:00     ` David Sterba
  0 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-17 16:00 UTC (permalink / raw)
  To: kreijack; +Cc: David Sterba, linux-btrfs

On Fri, Jul 13, 2018 at 11:02:03PM +0200, Goffredo Baroncelli wrote:
> As general comment, good to hear that something is moving around raid5/6 + write hole and multiple mirroring.
> However I am guessing if this is time to simplify the RAID code. There are a lot of "if" which could be avoided using 
> the values stored in the array "btrfs_raid_array[]".

I absolutely agree and had the same impression while implementing the
feature. For this patchset I did only minimal prep work; the
suggestions you give below make sense to me.

Enhancing the table would make a lot of code go away and just use one
formula to calculate the results that are now opencoded. I'll be going
through the raid code so I'll get to the cleanups eventually.

> Below some comments:

> > @@ -5075,6 +5093,8 @@ static inline int btrfs_chunk_max_errors(struct map_lookup *map)
> >  			 BTRFS_BLOCK_GROUP_RAID5 |
> >  			 BTRFS_BLOCK_GROUP_DUP)) {
> >  		max_errors = 1;
> > +	} else if (map->type & BTRFS_BLOCK_GROUP_RAID1C3) {
> > +		max_errors = 2;
> >  	} else if (map->type & BTRFS_BLOCK_GROUP_RAID6) {
> >  		max_errors = 2;
> >  	} else {
> 
> Even in this case the ifs above could be replaced with something like:
> 
> 	index = btrfs_bg_flags_to_raid_index(map->type)
> 	max_errors = btrfs_raid_array[index].ncopies-1;

There's .tolerated_failures that should equal ncopies - 1 in general,
but does not for DUP, so the semantics of the function and its callers
need to be verified.
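
If the semantics work out, the end result could be as small as this
(just a sketch):

	index = btrfs_bg_flags_to_raid_index(map->type);
	/* Note: gives 0 for DUP, while the current code allows 1 error. */
	max_errors = btrfs_raid_array[index].tolerated_failures;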

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-16 18:29   ` Goffredo Baroncelli
  2018-07-16 18:49     ` Austin S. Hemmelgarn
@ 2018-07-17 21:12     ` Duncan
  2018-07-18  5:59       ` Goffredo Baroncelli
  1 sibling, 1 reply; 36+ messages in thread
From: Duncan @ 2018-07-17 21:12 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

I can't agree.  I don't know whether you meant that in the global sense, 
or purely in the btrfs context (which I suspect), but either way I can't 
agree.

In the pure btrfs context, while striping and mirroring/pairing are 
orthogonal today, Hugo's whole point was that btrfs is theoretically 
flexible enough to allow both together and the feature may at some point 
be added, so it makes sense to have a layout notation format flexible 
enough to allow it as well.

In the global context, just to complete things and mostly for others 
reading as I feel a bit like a simpleton explaining to the expert here, 
just as raid10 is shorthand for raid1+0, aka raid0 layered on top of 
raid1 (normally preferred to raid01 due to rebuild characteristics, and 
as opposed to raid01, aka raid0+1, aka raid1 on top of raid0, sometimes 
recommended as btrfs raid1 on top of whatever raid0 here due to btrfs' 
data integrity characteristics and less optimized performance), so 
there's also raid51 and raid15, raid61 and raid16, etc, with or without 
the + symbols, involving mirroring and parity conceptually at two 
different levels altho they can be combined in a single implementation 
just as raid10 and raid01 commonly are.  These additional layered-raid 
levels can be used for higher reliability, with differing rebuild and 
performance characteristics between the two forms depending on which is 
the top layer.

> Question #1: for "parity" profiles, does make sense to limit the maximum
> disks number where the data may be spread ? If the answer is not, we
> could omit the last S. IMHO it should.

As someone else already replied, btrfs doesn't currently have the ability 
to specify spread limit, but the idea if we're going to change the 
notation is to allow for the flexibility in the new notation so the 
feature can be added later without further notation changes.

Why might it make sense to specify spread?  At least two possible reasons:

a) (stealing an already posted example) Consider a multi-device layout 
with two or more device sizes.  Someone may want to limit the spread in 
ordered to keep performance and risk consistent as the smaller devices 
fill up, limiting further usage to a lower number of devices.  If that 
lower number is specified as the spread originally it'll make things more 
consistent between the room on all devices case and the room on only some 
devices case.

b) Limiting spread can change the risk and rebuild performance profiles.  
Stripes of full width mean all stripes have a strip on each device, so 
knock a device out and (assuming parity or mirroring) replace it, and all 
stripes are degraded and must be rebuilt.  With less than maximum spread, 
some stripes won't be striped onto the replaced device, and won't be 
degraded or need rebuilt, tho assuming the same overall fill, a larger 
percentage of stripes that /do/ need rebuilt will be on the replaced 
device.  So the risk profile is more "objects" (stripes/chunks/files) 
affected but less of each object, or less of the total affected, but more 
of each affected object.

> Question #2: historically RAID10 is requires 4 disks. However I am
> guessing if the stripe could be done on a different number of disks:
> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
> that every 64k, the data are stored on a different disk....

As someone else pointed out, md/lvm-raid10 already work like this.  What 
btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
works this way except with huge (gig size) chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-17 21:12     ` Duncan
@ 2018-07-18  5:59       ` Goffredo Baroncelli
  2018-07-18  7:20         ` Duncan
  0 siblings, 1 reply; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-18  5:59 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 07/17/2018 11:12 PM, Duncan wrote:
> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
> excerpted:
> 
>> On 07/15/2018 04:37 PM, waxhead wrote:
> 
>> Striping and mirroring/pairing are orthogonal properties; mirror and
>> parity are mutually exclusive.
> 
> I can't agree.  I don't know whether you meant that in the global sense, 
> or purely in the btrfs context (which I suspect), but either way I can't 
> agree.
> 
> In the pure btrfs context, while striping and mirroring/pairing are 
> orthogonal today, Hugo's whole point was that btrfs is theoretically 
> flexible enough to allow both together and the feature may at some point 
> be added, so it makes sense to have a layout notation format flexible 
> enough to allow it as well.

When I say orthogonal, it means that these can be combined: i.e. you can have
- striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation carrying both 'C' (= number of copies) and 'P' (= number of parities) is too verbose.

[...]
> 
>> Question #2: historically RAID10 is requires 4 disks. However I am
>> guessing if the stripe could be done on a different number of disks:
>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>> that every 64k, the data are stored on a different disk....
> 
> As someone else pointed out, md/lvm-raid10 already work like this.  What 
> btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
> works this way except with huge (gig size) chunks.

As implemented in BTRFS, raid1 doesn't have striping.

 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18  5:59       ` Goffredo Baroncelli
@ 2018-07-18  7:20         ` Duncan
  2018-07-18  8:39           ` Duncan
                             ` (2 more replies)
  0 siblings, 3 replies; 36+ messages in thread
From: Duncan @ 2018-07-18  7:20 UTC (permalink / raw)
  To: linux-btrfs

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>> 
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>> 
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>> 
>> I can't agree.  I don't know whether you meant that in the global
>> sense,
>> or purely in the btrfs context (which I suspect), but either way I
>> can't agree.
>> 
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
> 
> When I say orthogonal, It means that these can be combined: i.e. you can
> have - striping (RAID0)
> - parity  (?)
> - striping + parity  (e.g. RAID5/6)
> - mirroring  (RAID1)
> - mirroring + striping  (RAID10)
> 
> However you can't have mirroring+parity; this means that a notation
> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
> too verbose.

Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
top of mirroring or mirroring on top of raid5/6, much as raid10 is 
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
on top of raid0.  

While it's not possible today on (pure) btrfs (it's possible today with 
md/dm-raid or hardware-raid handling one layer), it's theoretically 
possible both for btrfs and in general, and it could be added to btrfs in 
the future, so a notation with the flexibility to allow parity and 
mirroring together does make sense, and having just that sort of 
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I 
can see a case being made for it if one layer is hardware/firmware raid, 
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't 
arguably be at least as good a match to the use-case.  Perhaps one of 
the other experts in such things here might help with that.

>>> Question #2: historically RAID10 is requires 4 disks. However I am
>>> guessing if the stripe could be done on a different number of disks:
>>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>>> that every 64k, the data are stored on a different disk....
>> 
>> As someone else pointed out, md/lvm-raid10 already work like this. 
>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>> much works this way except with huge (gig size) chunks.
> 
> As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there are only two copies, on multi-device 
btrfs raid1 with 4+ devices of equal size chunk allocations tend to 
alternate device pairs, so it's effectively striped at the macro level, 
with the 1 GiB device-level chunks effectively being huge individual 
device strips.

At 1 GiB strip size it doesn't have the typical performance advantage of 
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
strips/chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18  7:20         ` Duncan
@ 2018-07-18  8:39           ` Duncan
  2018-07-18 12:45             ` Austin S. Hemmelgarn
  2018-07-18 12:50             ` Hugo Mills
  2018-07-18 12:50           ` Austin S. Hemmelgarn
  2018-07-18 19:42           ` Goffredo Baroncelli
  2 siblings, 2 replies; 36+ messages in thread
From: Duncan @ 2018-07-18  8:39 UTC (permalink / raw)
  To: linux-btrfs

Duncan posted on Wed, 18 Jul 2018 07:20:09 +0000 as excerpted:

>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level,
> with the 1 GiB device-level chunks effectively being huge individual
> device strips of 1 GiB.
> 
> At 1 GiB strip size it doesn't have the typical performance advantage of
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> strips/chunks.

I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually 
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing 
phrases.  I've seen the argument made on-list, but while I understand the 
argument and agree with it to some extent, I'm still a bit uncomfortable 
with it and don't normally make it myself, this thread being a noted 
exception tho originally I simply repeated what someone else already said 
in-thread, because I too agree it's stretching things a bit.  But it does 
appear to be a useful conceptual equivalency for some, and I do see the 
similarity.

Perhaps it's a case of coder's view (no code doing it that way, it's just 
a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
(code or not, accidental or not, it's a reasonably accurate high-level 
description of how it ends up working most of the time with equivalent 
sized devices).)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18  8:39           ` Duncan
@ 2018-07-18 12:45             ` Austin S. Hemmelgarn
  2018-07-18 12:50             ` Hugo Mills
  1 sibling, 0 replies; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-18 12:45 UTC (permalink / raw)
  To: linux-btrfs

On 2018-07-18 04:39, Duncan wrote:
> Duncan posted on Wed, 18 Jul 2018 07:20:09 +0000 as excerpted:
> 
>>> As implemented in BTRFS, raid1 doesn't have striping.
>>
>> The argument is that because there's only two copies, on multi-device
>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>> alternate device pairs, it's effectively striped at the macro level,
>> with the 1 GiB device-level chunks effectively being huge individual
>> device strips of 1 GiB.
>>
>> At 1 GiB strip size it doesn't have the typical performance advantage of
>> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
>> strips/chunks.
> 
> I forgot this bit...
> 
> Similarly, multi-device single is regarded by some to be conceptually
> equivalent to raid0 with really huge GiB strips/chunks.
> 
> (As you may note, "the argument is" and "regarded by some" are distancing
> phrases.  I've seen the argument made on-list, but while I understand the
> argument and agree with it to some extent, I'm still a bit uncomfortable
> with it and don't normally make it myself, this thread being a noted
> exception tho originally I simply repeated what someone else already said
> in-thread, because I too agree it's stretching things a bit.  But it does
> appear to be a useful conceptual equivalency for some, and I do see the
> similarity.
If the file is larger than the data chunk size, it _is_ striped, because 
it spans multiple chunks which are on separate devices.  Otherwise, it's 
more similar to what in GlusterFS is called a 'distributed volume'.  In 
such a Gluster volume, each file is entirely stored on one node (or you 
have a complete copy on N nodes where N is the number of replicas), with 
the selection of what node is used for the next file created being based 
on which node has the most free space.

That said, the main reason I explain single and raid1 the way I do is 
that I've found it's a much simpler way to explain generically how they 
work to people who already have storage background but may not care 
about the specifics.
> 
> Perhaps it's a case of coder's view (no code doing it that way, it's just
> a coincidental oddity conditional on equal sizes), vs. sysadmin's view
> (code or not, accidental or not, it's a reasonably accurate high-level
> description of how it ends up working most of the time with equivalent
> sized devices).)
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18  8:39           ` Duncan
  2018-07-18 12:45             ` Austin S. Hemmelgarn
@ 2018-07-18 12:50             ` Hugo Mills
  2018-07-19 21:22               ` waxhead
  1 sibling, 1 reply; 36+ messages in thread
From: Hugo Mills @ 2018-07-18 12:50 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]

On Wed, Jul 18, 2018 at 08:39:48AM +0000, Duncan wrote:
> Duncan posted on Wed, 18 Jul 2018 07:20:09 +0000 as excerpted:
> 
> >> As implemented in BTRFS, raid1 doesn't have striping.
> > 
> > The argument is that because there's only two copies, on multi-device
> > btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> > alternate device pairs, it's effectively striped at the macro level,
> > with the 1 GiB device-level chunks effectively being huge individual
> > device strips of 1 GiB.
> > 
> > At 1 GiB strip size it doesn't have the typical performance advantage of
> > striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> > strips/chunks.
> 
> I forgot this bit...
> 
> Similarly, multi-device single is regarded by some to be conceptually 
> equivalent to raid0 with really huge GiB strips/chunks.
> 
> (As you may note, "the argument is" and "regarded by some" are distancing 
> phrases.  I've seen the argument made on-list, but while I understand the 
> argument and agree with it to some extent, I'm still a bit uncomfortable 
> with it and don't normally make it myself, this thread being a noted 
> exception tho originally I simply repeated what someone else already said 
> in-thread, because I too agree it's stretching things a bit.  But it does 
> appear to be a useful conceptual equivalency for some, and I do see the 
> similarity.
> 
> Perhaps it's a case of coder's view (no code doing it that way, it's just 
> a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
> (code or not, accidental or not, it's a reasonably accurate high-level 
> description of how it ends up working most of the time with equivalent 
> sized devices).)

   Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

   Hugo.

-- 
Hugo Mills             | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18  7:20         ` Duncan
  2018-07-18  8:39           ` Duncan
@ 2018-07-18 12:50           ` Austin S. Hemmelgarn
  2018-07-18 19:42           ` Goffredo Baroncelli
  2 siblings, 0 replies; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-18 12:50 UTC (permalink / raw)
  To: linux-btrfs

On 2018-07-18 03:20, Duncan wrote:
> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
> excerpted:
> 
>> On 07/17/2018 11:12 PM, Duncan wrote:
>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>> excerpted:
>>>
>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>
>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>> parity are mutually exclusive.
>>>
>>> I can't agree.  I don't know whether you meant that in the global
>>> sense,
>>> or purely in the btrfs context (which I suspect), but either way I
>>> can't agree.
>>>
>>> In the pure btrfs context, while striping and mirroring/pairing are
>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>> flexible enough to allow both together and the feature may at some
>>> point be added, so it makes sense to have a layout notation format
>>> flexible enough to allow it as well.
>>
>> When I say orthogonal, It means that these can be combined: i.e. you can
>> have - striping (RAID0)
>> - parity  (?)
>> - striping + parity  (e.g. RAID5/6)
>> - mirroring  (RAID1)
>> - mirroring + striping  (RAID10)
>>
>> However you can't have mirroring+parity; this means that a notation
>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>> too verbose.
> 
> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
> top of mirroring or mirroring on top of raid5/6, much as raid10 is
> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
> on top of raid0.
> 
> While it's not possible today on (pure) btrfs (it's possible today with
> md/dm-raid or hardware-raid handling one layer), it's theoretically
> possible both for btrfs and in general, and it could be added to btrfs in
> the future, so a notation with the flexibility to allow parity and
> mirroring together does make sense, and having just that sort of
> flexibility is exactly why Hugo made the notation proposal he did.
> 
> Tho a sensible use-case for mirroring+parity is a different question.  I
> can see a case being made for it if one layer is hardware/firmware raid,
> but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61
> (or 15 or 51) might be, where pure mirroring or pure parity wouldn't
> arguably be a at least as good a match to the use-case.  Perhaps one of
> the other experts in such things here might help with that.
> 
>>>> Question #2: historically RAID10 is requires 4 disks. However I am
>>>> guessing if the stripe could be done on a different number of disks:
>>>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>>>> that every 64k, the data are stored on a different disk....
>>>
>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>> much works this way except with huge (gig size) chunks.
>>
>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level, with
> the 1 GiB device-level chunks effectively being huge individual device
> strips of 1 GiB.
Actually, it also behaves like LVM and MD RAID10 for any number of 
devices greater than 2, though the exact placement may diverge because 
of BTRFS's concept of different chunk types.  In LVM and MD RAID10, each 
block is stored as two copies, and what disks it ends up on is dependent 
on the block number modulo the number of disks (so, for 3 disks A, B, 
and C, block 0 is on A and B, block 1 is on C and A, and block 2 is on B 
and C, with subsequent blocks following the same pattern).  In an 
idealized model of BTRFS with only one chunk type, you get exactly the 
same behavior (because BTRFS allocates chunks based on disk utilization, 
and prefers lower numbered disks to higher ones in the event of a tie).
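
For reference, the placement rule described above can be written down as a 
tiny helper (illustrative only, mirroring md's "near=2" layout, not BTRFS 
code):

/* Devices holding the two copies of logical block 'b' on an n-disk array. */
static void raid10_near2_devs(unsigned long long b, unsigned int n,
			      unsigned int *d0, unsigned int *d1)
{
	*d0 = (2 * b) % n;	/* first copy */
	*d1 = (2 * b + 1) % n;	/* second copy */
}

For 3 disks this gives (0,1), (2,0), (1,2), ... which is exactly the A/B, 
C/A, B/C pattern above.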
> 
> At 1 GiB strip size it doesn't have the typical performance advantage of
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> strips/chunks.
>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18  7:20         ` Duncan
  2018-07-18  8:39           ` Duncan
  2018-07-18 12:50           ` Austin S. Hemmelgarn
@ 2018-07-18 19:42           ` Goffredo Baroncelli
  2018-07-19 11:43             ` Austin S. Hemmelgarn
  2018-07-20  5:17             ` Andrei Borzenkov
  2 siblings, 2 replies; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-18 19:42 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 07/18/2018 09:20 AM, Duncan wrote:
> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
> excerpted:
> 
>> On 07/17/2018 11:12 PM, Duncan wrote:
>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>> excerpted:
>>>
>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>
>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>> parity are mutually exclusive.
>>>
>>> I can't agree.  I don't know whether you meant that in the global
>>> sense,
>>> or purely in the btrfs context (which I suspect), but either way I
>>> can't agree.
>>>
>>> In the pure btrfs context, while striping and mirroring/pairing are
>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>> flexible enough to allow both together and the feature may at some
>>> point be added, so it makes sense to have a layout notation format
>>> flexible enough to allow it as well.
>>
>> When I say orthogonal, It means that these can be combined: i.e. you can
>> have - striping (RAID0)
>> - parity  (?)
>> - striping + parity  (e.g. RAID5/6)
>> - mirroring  (RAID1)
>> - mirroring + striping  (RAID10)
>>
>> However you can't have mirroring+parity; this means that a notation
>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>> too verbose.
> 
> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
> on top of raid0.  
And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???

Seriously, of course you can combine a lot of different profiles; however the only ones that make sense are the ones above.

The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below). 
[....]
>>>
>>> As someone else pointed out, md/lvm-raid10 already work like this. 
>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>> much works this way except with huge (gig size) chunks.
>>
>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device 
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
> alternate device pairs, it's effectively striped at the macro level, with 
> the 1 GiB device-level chunks effectively being huge individual device 
> strips of 1 GiB.

The striping concept is based on the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
With a "stripe size" of 1GB, it is very unlikely that this would happen.

 
> At 1 GiB strip size it doesn't have the typical performance advantage of 
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
> strips/chunks.



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
                   ` (6 preceding siblings ...)
  2018-07-15 14:46 ` Hugo Mills
@ 2018-07-19  7:27 ` Qu Wenruo
  2018-07-19 11:47   ` Austin S. Hemmelgarn
  2018-07-20 16:35   ` David Sterba
  7 siblings, 2 replies; 36+ messages in thread
From: Qu Wenruo @ 2018-07-19  7:27 UTC (permalink / raw)
  To: David Sterba, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 4060 bytes --]



On 2018-07-14 02:46, David Sterba wrote:
> Hi,
> 
> I have some goodies that go into the RAID56 problem, although not
> implementing all the remaining features, it can be useful independently.
> 
> This time my hackweek project
> 
> https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56
> 
> aimed to implement the fix for the write hole problem but I spent more
> time with analysis and design of the solution and don't have a working
> prototype for that yet.
> 
> This patchset brings a feature that will be used by the raid56 log, the
> log has to be on the same redundancy level and thus we need a 3-copy
> replication for raid6. As it was easy to extend to higher replication,
> I've added a 4-copy replication, that would allow triple copy raid (that
> does not have a standardized name).

So this special level will be used for RAID56 for now?
Or will it also be possible to use it for metadata just like current RAID1?

If the latter, the metadata scrub problem will need to be considered more.

For RAID1 with more copies, there is a higher possibility of one or two
devices going missing and then being scrubbed.
For metadata scrub, the inlined csum can't tell whether a copy is the
latest one.

So for such a RAID1 scrub, we need to read out all copies and compare
their generations to find out the correct copy.
At least from the changeset, it doesn't look like this is addressed yet.
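
Roughly, the scrub side would then need something like the following
(pure pseudo-code, the two helpers are invented names):

	/*
	 * With multiple copies we cannot trust any single mirror, so read
	 * them all and keep the one with the highest generation that also
	 * passes the csum check.
	 */
	best = NULL;
	for (mirror = 1; mirror <= num_copies; mirror++) {
		eb = read_metadata_copy(fs_info, logical, mirror);
		if (IS_ERR_OR_NULL(eb) || !csum_matches(eb))
			continue;
		if (!best || btrfs_header_generation(eb) >
			     btrfs_header_generation(best))
			best = eb;
	}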

And this also reminds me that the current scrub is not as flexible as
balance. I would really like to be able to filter block groups to scrub,
just like balance, and do scrub on a block group basis rather than a
devid basis.
That is to say, for a block group scrub we don't really care which
device we're scrubbing, we just need to ensure that all devices in this
block group are storing correct data.

Thanks,
Qu

> 
> The number of copies is fixed, so it's not N-copy for an arbitrary N.
> This would complicate the implementation too much, though I'd be willing
> to add a 5-copy replication for a small bribe.
> 
> The new raid profiles and covered by an incompatibility bit, called
> extended_raid, the (idealistic) plan is to stuff as many new
> raid-related features as possible. The patch 4/4 mentions the 3- 4- copy
> raid1, configurable stripe length, write hole log and triple parity.
> If the plan turns out to be too ambitious, the ready and implemented
> features will be split and merged.
> 
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs&m=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.
> 
> Regarding the missing raid56 features, I'll continue working on them as
> time permits in the following weeks/months, as I'm not aware of anybody
> working on that actively enough so to speak.
> 
> Anyway, git branches with the patches:
> 
> kernel: git://github.com/kdave/btrfs-devel dev/extended-raid-ncopies
> progs:  git://github.com/kdave/btrfs-progs dev/extended-raid-ncopies
> 
> David Sterba (4):
>   btrfs: refactor block group replication factor calculation to a helper
>   btrfs: add support for 3-copy replication (raid1c3)
>   btrfs: add support for 4-copy replication (raid1c4)
>   btrfs: add incompatibility bit for extended raid features
> 
>  fs/btrfs/ctree.h                |  1 +
>  fs/btrfs/extent-tree.c          | 45 +++++++-----------
>  fs/btrfs/relocation.c           |  1 +
>  fs/btrfs/scrub.c                |  4 +-
>  fs/btrfs/super.c                | 17 +++----
>  fs/btrfs/sysfs.c                |  2 +
>  fs/btrfs/volumes.c              | 84 ++++++++++++++++++++++++++++++---
>  fs/btrfs/volumes.h              |  6 +++
>  include/uapi/linux/btrfs.h      | 12 ++++-
>  include/uapi/linux/btrfs_tree.h |  6 +++
>  10 files changed, 134 insertions(+), 44 deletions(-)
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18 19:42           ` Goffredo Baroncelli
@ 2018-07-19 11:43             ` Austin S. Hemmelgarn
  2018-07-19 17:29               ` Goffredo Baroncelli
  2018-07-20  5:17             ` Andrei Borzenkov
  1 sibling, 1 reply; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-19 11:43 UTC (permalink / raw)
  To: kreijack, Duncan, linux-btrfs

On 2018-07-18 15:42, Goffredo Baroncelli wrote:
> On 07/18/2018 09:20 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>> excerpted:
>>
>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>> excerpted:
>>>>
>>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>>
>>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>>> parity are mutually exclusive.
>>>>
>>>> I can't agree.  I don't know whether you meant that in the global
>>>> sense,
>>>> or purely in the btrfs context (which I suspect), but either way I
>>>> can't agree.
>>>>
>>>> In the pure btrfs context, while striping and mirroring/pairing are
>>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>>> flexible enough to allow both together and the feature may at some
>>>> point be added, so it makes sense to have a layout notation format
>>>> flexible enough to allow it as well.
>>>
>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>> have - striping (RAID0)
>>> - parity  (?)
>>> - striping + parity  (e.g. RAID5/6)
>>> - mirroring  (RAID1)
>>> - mirroring + striping  (RAID10)
>>>
>>> However you can't have mirroring+parity; this means that a notation
>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>> too verbose.
>>
>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
>> top of mirroring or mirroring on top of raid5/6, much as raid10 is
>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
>> on top of raid0.
> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
> 
> Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above.
No, there are cases where other configurations make sense.

RAID05 and RAID06 are very widely used, especially on NAS systems where 
you have lots of disks.  The RAID5/6 lower layer mitigates the data loss 
risk of RAID0, and the RAID0 upper-layer mitigates the rebuild 
scalability issues of RAID5/6.  In fact, this is pretty much the 
standard recommended configuration for large ZFS arrays that want to use 
parity RAID.  This could be reasonably easily supported to a rudimentary 
degree in BTRFS by providing the ability to limit the stripe width for 
the parity profiles.
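
To put rough numbers on the rebuild argument (the per-disk failure
probability during a rebuild window is invented; only the shape of the
calculation matters), a quick sketch:

  # Probability that a second disk in the *same* RAID5 group fails while
  # the first one is being rebuilt.  All inputs are made up.
  def p_second_failure(disks_in_group, p_fail_during_rebuild=0.01):
      survivors_at_risk = disks_in_group - 1
      return 1 - (1 - p_fail_during_rebuild) ** survivors_at_risk

  print(p_second_failure(24))  # one wide 24-disk RAID5            -> ~0.21
  print(p_second_failure(6))   # 6-disk RAID5 groups inside RAID05 -> ~0.05

Smaller groups striped together keep the at-risk set small, which is the
rebuild-scalability point above.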

Some people use RAID50 or RAID60, although they are strictly speaking 
inferior in almost all respects to RAID05 and RAID06.

RAID01 is also used on occasion; it ends up having the same storage 
capacity as RAID10, but for some RAID implementations it has a different 
performance envelope and different rebuild characteristics.  Usually, 
when it is used, though, it's software RAID0 on top of hardware RAID1.

RAID51 and RAID61 used to be used, but aren't much now.  They provided 
an easy way to have proper data verification without always having the 
rebuild overhead of RAID5/6 and without needing to do checksumming. 
They are pretty much useless for BTRFS, as it can already tell which 
copy is correct.

RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they 
might actually make sense in BTRFS to provide a backup means of 
rebuilding blocks that fail checksum validation if both copies fail.
> 
> The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below).
> [....]
>>>>
>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>> much works this way except with huge (gig size) chunks.
>>>
>>> As implemented in BTRFS, raid1 doesn't have striping.
>>
>> The argument is that because there's only two copies, on multi-device
>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>> alternate device pairs, it's effectively striped at the macro level, with
>> the 1 GiB device-level chunks effectively being huge individual device
>> strips of 1 GiB.
> 
> The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
That's not the only benefit of striping though.  The other big one is 
that you now have one volume that's the combined size of both of the 
original devices.  Striping is arguably better for this even if you're 
using a large stripe size because it better balances the wear across the 
devices than simple concatenation.

> With a "stripe size" of 1GB, it is very unlikely that this would happens.
That's a pretty big assumption.  There are all kinds of access patterns 
that will still distribute the load reasonably evenly across the 
constituent devices, even if they don't parallelize things.

If, for example, all your files are 64k or less, and you only read whole 
files, there's no functional difference between RAID0 with 1GB blocks 
and RAID0 with 64k blocks.  Such a workload is not unusual on a very 
busy mail-server.
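
As a minimal sketch of that point (a simple round-robin mapping of
logical offsets to devices, not the real btrfs chunk allocator), small
whole-file reads scattered over the volume land on the devices about
equally regardless of the stripe size:

  import random
  random.seed(0)

  def device_for_offset(offset, stripe_size, num_devices=2):
      # Illustrative round-robin striping model, not the real btrfs layout.
      return (offset // stripe_size) % num_devices

  # 10,000 whole-file reads of <=64 KiB files spread over a 100 GiB volume.
  offsets = [random.randrange(100 * 1024 ** 3) for _ in range(10000)]
  for unit in (64 * 1024, 1024 ** 3):          # 64 KiB vs 1 GiB stripes
      hits = [device_for_offset(off, unit) for off in offsets]
      print(unit, hits.count(0), hits.count(1))
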
> 
>   
>> At 1 GiB strip size it doesn't have the typical performance advantage of
>> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
>> strips/chunks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-19  7:27 ` Qu Wenruo
@ 2018-07-19 11:47   ` Austin S. Hemmelgarn
  2018-07-20 16:42     ` David Sterba
  2018-07-20 16:35   ` David Sterba
  1 sibling, 1 reply; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-19 11:47 UTC (permalink / raw)
  To: Qu Wenruo, David Sterba, linux-btrfs

On 2018-07-19 03:27, Qu Wenruo wrote:
> 
> 
> On 2018年07月14日 02:46, David Sterba wrote:
>> Hi,
>>
>> I have some goodies that go into the RAID56 problem, although not
>> implementing all the remaining features, it can be useful independently.
>>
>> This time my hackweek project
>>
>> https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56
>>
>> aimed to implement the fix for the write hole problem but I spent more
>> time with analysis and design of the solution and don't have a working
>> prototype for that yet.
>>
>> This patchset brings a feature that will be used by the raid56 log, the
>> log has to be on the same redundancy level and thus we need a 3-copy
>> replication for raid6. As it was easy to extend to higher replication,
>> I've added a 4-copy replication, that would allow triple copy raid (that
>> does not have a standardized name).
> 
> So this special level will be used for RAID56 for now?
> Or it will also be possible for metadata usage just like current RAID1?
> 
> If the latter, the metadata scrub problem will need to be considered more.
> 
> For more copies RAID1, it's will have higher possibility one or two
> devices missing, and then being scrubbed.
> For metadata scrub, inlined csum can't ensure it's the latest one.
> 
> So for such RAID1 scrub, we need to read out all copies and compare
> their generation to find out the correct copy.
> At least from the changeset, it doesn't look like it's addressed yet.
> 
> And this also reminds me that current scrub is not as flex as balance, I
> really like we could filter block groups to scrub just like balance, and
> do scrub in a block group basis, other than devid basis.
> That's to say, for a block group scrub, we don't really care which
> device we're scrubbing, we just need to ensure all device in this block
> is storing correct data.
> 
This would actually be rather useful for non-parity cases too.  Being 
able to scrub only metadata when the data chunks are using a profile 
that provides no rebuild support would be great for performance.

On the same note, it would be _really_ nice to be able to scrub a subset 
of the volume's directory tree, even if it were only per-subvolume.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-19 11:43             ` Austin S. Hemmelgarn
@ 2018-07-19 17:29               ` Goffredo Baroncelli
  2018-07-19 19:10                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-19 17:29 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Duncan, linux-btrfs

On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote:
> On 2018-07-18 15:42, Goffredo Baroncelli wrote:
>> On 07/18/2018 09:20 AM, Duncan wrote:
>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>> excerpted:
>>>
>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>> excerpted:
>>>>>
[...]
>>>>
>>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>>> have - striping (RAID0)
>>>> - parity  (?)
>>>> - striping + parity  (e.g. RAID5/6)
>>>> - mirroring  (RAID1)
>>>> - mirroring + striping  (RAID10)
>>>>
>>>> However you can't have mirroring+parity; this means that a notation
>>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>>> too verbose.
>>>
>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is
>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
>>> on top of raid0.
>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
>>
>> Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above.
> No, there are cases where other configurations make sense.
> 
> RAID05 and RAID06 are very widely used, especially on NAS systems where you have lots of disks.  The RAID5/6 lower layer mitigates the data loss risk of RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of RAID5/6.  In fact, this is pretty much the standard recommended configuration for large ZFS arrays that want to use parity RAID.  This could be reasonably easily supported to a rudimentary degree in BTRFS by providing the ability to limit the stripe width for the parity profiles.
> 
> Some people use RAID50 or RAID60, although they are strictly speaking inferior in almost all respects to RAID05 and RAID06.
> 
> RAID01 is also used on occasion, it ends up having the same storage capacity as RAID10, but for some RAID implementations it has a different performance envelope and different rebuild characteristics.  Usually, when it is used though, it's software RAID0 on top of hardware RAID1.
> 
> RAID51 and RAID61 used to be used, but aren't much now.  They provided an easy way to have proper data verification without always having the rebuild overhead of RAID5/6 and without needing to do checksumming. They are pretty much useless for BTRFS, as it can already tell which copy is correct.

So far you are repeating what I said: the only useful raid profiles are
- striping
- mirroring
- striping+parity (even limiting the number of disks involved)
- striping+mirroring

> 
> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail.
If you need further redundancy, it is easy to implement parity3 and parity4 raid profiles instead of stacking raid6+raid1.

>>
>> The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below).
>> [....]
>>>>>
>>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>>> much works this way except with huge (gig size) chunks.
>>>>
>>>> As implemented in BTRFS, raid1 doesn't have striping.
>>>
>>> The argument is that because there's only two copies, on multi-device
>>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>>> alternate device pairs, it's effectively striped at the macro level, with
>>> the 1 GiB device-level chunks effectively being huge individual device
>>> strips of 1 GiB.
>>
>> The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
> That's not the only benefit of striping though.  The other big one is that you now have one volume that's the combined size of both of the original devices.  Striping is arguably better for this even if you're using a large stripe size because it better balances the wear across the devices than simple concatenation.

Striping means that the data is interleaved between the disks with a reasonable "block unit". Otherwise, what would be the difference between btrfs-raid0 and btrfs-single?

> 
>> With a "stripe size" of 1GB, it is very unlikely that this would happens.
> That's a pretty big assumption.  There are all kinds of access patterns that will still distribute the load reasonably evenly across the constituent devices, even if they don't parallelize things.
> 
> If, for example, all your files are 64k or less, and you only read whole files, there's no functional difference between RAID0 with 1GB blocks and RAID0 with 64k blocks.  Such a workload is not unusual on a very busy mail-server.

I fully agree that 64K may be too much for some workloads; however, I have to point out that I still find it difficult to imagine that you can take advantage of parallel reads from multiple disks with a 1GB stripe unit for a *common workload*. Note that btrfs inlines small files in the metadata, so even if the file is smaller than 64k, a 64k read (or more) will be required in order to access it.


>>
>>  
>>> At 1 GiB strip size it doesn't have the typical performance advantage of
>>> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
>>> strips/chunks.
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-19 17:29               ` Goffredo Baroncelli
@ 2018-07-19 19:10                 ` Austin S. Hemmelgarn
  2018-07-20 17:13                   ` Goffredo Baroncelli
  0 siblings, 1 reply; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-19 19:10 UTC (permalink / raw)
  To: kreijack, linux-btrfs

On 2018-07-19 13:29, Goffredo Baroncelli wrote:
> On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote:
>> On 2018-07-18 15:42, Goffredo Baroncelli wrote:
>>> On 07/18/2018 09:20 AM, Duncan wrote:
>>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>>> excerpted:
>>>>
>>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>>> excerpted:
>>>>>>
> [...]
>>>>>
>>>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>>>> have - striping (RAID0)
>>>>> - parity  (?)
>>>>> - striping + parity  (e.g. RAID5/6)
>>>>> - mirroring  (RAID1)
>>>>> - mirroring + striping  (RAID10)
>>>>>
>>>>> However you can't have mirroring+parity; this means that a notation
>>>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>>>> too verbose.
>>>>
>>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
>>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is
>>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
>>>> on top of raid0.
>>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
>>>
>>> Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above.
>> No, there are cases where other configurations make sense.
>>
>> RAID05 and RAID06 are very widely used, especially on NAS systems where you have lots of disks.  The RAID5/6 lower layer mitigates the data loss risk of RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of RAID5/6.  In fact, this is pretty much the standard recommended configuration for large ZFS arrays that want to use parity RAID.  This could be reasonably easily supported to a rudimentary degree in BTRFS by providing the ability to limit the stripe width for the parity profiles.
>>
>> Some people use RAID50 or RAID60, although they are strictly speaking inferior in almost all respects to RAID05 and RAID06.
>>
>> RAID01 is also used on occasion, it ends up having the same storage capacity as RAID10, but for some RAID implementations it has a different performance envelope and different rebuild characteristics.  Usually, when it is used though, it's software RAID0 on top of hardware RAID1.
>>
>> RAID51 and RAID61 used to be used, but aren't much now.  They provided an easy way to have proper data verification without always having the rebuild overhead of RAID5/6 and without needing to do checksumming. They are pretty much useless for BTRFS, as it can already tell which copy is correct.
> 
> So until now you are repeating what I told: the only useful raid profile are
> - striping
> - mirroring
> - striping+paring (even limiting the number of disk involved)
> - striping+mirroring

No, not quite.  At least, not in the combinations you're saying make 
sense if you are using standard terminology.  RAID05 and RAID06 are not 
the same thing as 'striping+parity' as BTRFS implements that case, and 
can be significantly more optimized than the trivial implementation of 
just limiting the number of disks involved in each chunk (by, you know, 
actually striping just like what we currently call raid10 mode in BTRFS 
does).
> 
>>
>> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail.
> If you need further redundancy, it is easy to implement a parity3 and parity4 raid profile instead of stacking a raid6+raid1
I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in 
the upper layer is correct when they don't agree.  It doesn't really 
provide significantly better redundancy: such an array can technically 
sustain more disk failures without failing completely than simple 
two-copy RAID1 can, but, just like BTRFS raid10, it can't reliably 
survive more than one disk failure (or two if you're using RAID6 as the 
lower layer).  So it does not do the same thing that higher-order parity 
does.
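
To make that concrete, here is a toy sketch with plain XOR parity and
made-up block contents (nothing to do with the actual btrfs raid56 code)
of how the lower layer can arbitrate between two disagreeing copies:

  from functools import reduce

  def xor_blocks(blocks):
      # XOR all blocks column-wise; this is the RAID5-style parity.
      return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

  def copy_is_consistent(data_blocks, parity):
      return xor_blocks(data_blocks) == parity

  good  = b"mail spool data."              # the block as originally written
  bad   = b"mail spool dat!."              # the same block, corrupted later
  other = b"other block 0001"              # another block in the same stripe

  # Each mirror copy sits on its own lower-layer stripe with its own parity;
  # copy B's data block got corrupted after its parity was written.
  stripe_a = ([good, other], xor_blocks([good, other]))
  stripe_b = ([bad,  other], xor_blocks([good, other]))

  print(copy_is_consistent(*stripe_a))     # True  -> trust copy A
  print(copy_is_consistent(*stripe_b))     # False -> copy B is the bad one
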
> 
>>>
>>> The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below).
>>> [....]
>>>>>>
>>>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>>>> much works this way except with huge (gig size) chunks.
>>>>>
>>>>> As implemented in BTRFS, raid1 doesn't have striping.
>>>>
>>>> The argument is that because there's only two copies, on multi-device
>>>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>>>> alternate device pairs, it's effectively striped at the macro level, with
>>>> the 1 GiB device-level chunks effectively being huge individual device
>>>> strips of 1 GiB.
>>>
>>> The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
>> That's not the only benefit of striping though.  The other big one is that you now have one volume that's the combined size of both of the original devices.  Striping is arguably better for this even if you're using a large stripe size because it better balances the wear across the devices than simple concatenation.
> 
> Striping means that the data is interleaved between the disks with a reasonable "block unit". Otherwise which would be the difference between btrfs-raid0 and btrfs-single ?
Single mode guarantees that any file less than the chunk size in length 
will either be completely present or completely absent if one of the 
devices fails.  BTRFS raid0 mode does not provide any such guarantee, 
and in fact guarantees that all files larger than the stripe unit size 
(however much gets put on one disk before moving to the next) will lose 
data if a device fails.

Stupid as it sounds, this matters for some people.
> 
>>
>>> With a "stripe size" of 1GB, it is very unlikely that this would happens.
>> That's a pretty big assumption.  There are all kinds of access patterns that will still distribute the load reasonably evenly across the constituent devices, even if they don't parallelize things.
>>
>> If, for example, all your files are 64k or less, and you only read whole files, there's no functional difference between RAID0 with 1GB blocks and RAID0 with 64k blocks.  Such a workload is not unusual on a very busy mail-server.
> 
> I fully agree that 64K may be too much for some workload, however I have to point out that I still find difficult to imagine that you can take advantage of parallel read from multiple disks with a 1GB stripe unit for a *common workload*. Pay attention that btrfs inline in the metadata the small files, so even if the file is smaller than 64k, a 64k read (or more) will be required in order to access it.
Again, mail servers. Each file should be written out as a single extent, 
which means it's all in one chunk.  Delivery and processing need to 
access _LOTS_ of files on a busy mail server, and the good ones do this 
with userspace parallelization.  BTRFS doesn't parallelize disk accesses 
from the same userspace execution context (thread if threads are being 
used, process if not), but it does parallelize access for separate 
contexts, so if userspace is doing things from multiple threads, so will 
BTRFS.

FWIW, I actually tested this back when the company I work for still ran 
their own internal mail server.  BTRFS was significantly less optimized 
back then, but there was no measurable performance difference from 
userspace between using single profile for data or raid0 profile for data.
> 
>>>   
>>>> At 1 GiB strip size it doesn't have the typical performance advantage of
>>>> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
>>>> strips/chunks.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18 12:50             ` Hugo Mills
@ 2018-07-19 21:22               ` waxhead
  0 siblings, 0 replies; 36+ messages in thread
From: waxhead @ 2018-07-19 21:22 UTC (permalink / raw)
  To: Hugo Mills, Duncan, linux-btrfs



Hugo Mills wrote:
> On Wed, Jul 18, 2018 at 08:39:48AM +0000, Duncan wrote:
>> Duncan posted on Wed, 18 Jul 2018 07:20:09 +0000 as excerpted:
>>
>> Perhaps it's a case of coder's view (no code doing it that way, it's just
>> a coincidental oddity conditional on equal sizes), vs. sysadmin's view
>> (code or not, accidental or not, it's a reasonably accurate high-level
>> description of how it ends up working most of the time with equivalent
>> sized devices).)
> 
>     Well, it's an *accurate* observation. It's just not a particularly
> *useful* one. :)
> 
>     Hugo.
> 
A bit off topic perhaps, but I've got to give it a go:
Pretty please with sugar, nuts, a cherry and chocolate sprinkles dipped 
in syrup and coated with ice cream on top, would it not be about time to 
update your online btrfs-usage calculator (which is insanely useful in 
so many ways) to support the new modes!?

In fact it would have been great as (or even better as) a CLI tool.
And yes, a while ago I toyed with porting it to C, mostly for my own 
use, but never got that far.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-18 19:42           ` Goffredo Baroncelli
  2018-07-19 11:43             ` Austin S. Hemmelgarn
@ 2018-07-20  5:17             ` Andrei Borzenkov
  2018-07-20 17:16               ` Goffredo Baroncelli
  1 sibling, 1 reply; 36+ messages in thread
From: Andrei Borzenkov @ 2018-07-20  5:17 UTC (permalink / raw)
  To: kreijack, Duncan, linux-btrfs

18.07.2018 22:42, Goffredo Baroncelli пишет:
> On 07/18/2018 09:20 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>> excerpted:
>>
>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>> excerpted:
>>>>
>>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>>
>>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>>> parity are mutually exclusive.
>>>>
>>>> I can't agree.  I don't know whether you meant that in the global
>>>> sense,
>>>> or purely in the btrfs context (which I suspect), but either way I
>>>> can't agree.
>>>>
>>>> In the pure btrfs context, while striping and mirroring/pairing are
>>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>>> flexible enough to allow both together and the feature may at some
>>>> point be added, so it makes sense to have a layout notation format
>>>> flexible enough to allow it as well.
>>>
>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>> have - striping (RAID0)
>>> - parity  (?)
>>> - striping + parity  (e.g. RAID5/6)
>>> - mirroring  (RAID1)
>>> - mirroring + striping  (RAID10)
>>>
>>> However you can't have mirroring+parity; this means that a notation
>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>> too verbose.
>>
>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>> on top of raid0.  
> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
> 
> Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above.

RAID50 (striping across RAID5) is common.


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-19  7:27 ` Qu Wenruo
  2018-07-19 11:47   ` Austin S. Hemmelgarn
@ 2018-07-20 16:35   ` David Sterba
  1 sibling, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-20 16:35 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: David Sterba, linux-btrfs

On Thu, Jul 19, 2018 at 03:27:17PM +0800, Qu Wenruo wrote:
> On 2018年07月14日 02:46, David Sterba wrote:
> > Hi,
> > 
> > I have some goodies that go into the RAID56 problem, although not
> > implementing all the remaining features, it can be useful independently.
> > 
> > This time my hackweek project
> > 
> > https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56
> > 
> > aimed to implement the fix for the write hole problem but I spent more
> > time with analysis and design of the solution and don't have a working
> > prototype for that yet.
> > 
> > This patchset brings a feature that will be used by the raid56 log, the
> > log has to be on the same redundancy level and thus we need a 3-copy
> > replication for raid6. As it was easy to extend to higher replication,
> > I've added a 4-copy replication, that would allow triple copy raid (that
> > does not have a standardized name).
> 
> So this special level will be used for RAID56 for now?
> Or it will also be possible for metadata usage just like current RAID1?

It's a new profile usable in the same way as raid1, i.e. for data or
metadata. The patch that adds support to btrfs-progs has an mkfs
example.
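Roughly along the lines of "mkfs.btrfs -m raid1c3 -d raid1c3 <devices>";
see the progs patch for the exact syntax.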

The raid56 will use that to store the log, essentially data forcibly
stored on the n-copy raid1 chunk and used only for logging.

> If the latter, the metadata scrub problem will need to be considered more.
> 
> For more copies RAID1, it's will have higher possibility one or two
> devices missing, and then being scrubbed.
> For metadata scrub, inlined csum can't ensure it's the latest one.
> 
> So for such RAID1 scrub, we need to read out all copies and compare
> their generation to find out the correct copy.
> At least from the changeset, it doesn't look like it's addressed yet.

Nothing like this is implemented in the patches, but I don't understand
how this differs from the current raid1 with one missing device. Sure,
we can't have 2 missing devices, so the existing copy is automatically
considered correct and up to date.

There are more corner-case recovery scenarios where there could be 3
copies slightly out of date due to device loss and a scrub attempt, so
yes, this would need to be addressed.

> And this also reminds me that current scrub is not as flex as balance, I
> really like we could filter block groups to scrub just like balance, and
> do scrub in a block group basis, other than devid basis.
> That's to say, for a block group scrub, we don't really care which
> device we're scrubbing, we just need to ensure all device in this block
> is storing correct data.

Right, a subset of the balance filters would be nice.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-19 11:47   ` Austin S. Hemmelgarn
@ 2018-07-20 16:42     ` David Sterba
  0 siblings, 0 replies; 36+ messages in thread
From: David Sterba @ 2018-07-20 16:42 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Qu Wenruo, David Sterba, linux-btrfs

On Thu, Jul 19, 2018 at 07:47:23AM -0400, Austin S. Hemmelgarn wrote:
> > So this special level will be used for RAID56 for now?
> > Or it will also be possible for metadata usage just like current RAID1?
> > 
> > If the latter, the metadata scrub problem will need to be considered more.
> > 
> > For more copies RAID1, it's will have higher possibility one or two
> > devices missing, and then being scrubbed.
> > For metadata scrub, inlined csum can't ensure it's the latest one.
> > 
> > So for such RAID1 scrub, we need to read out all copies and compare
> > their generation to find out the correct copy.
> > At least from the changeset, it doesn't look like it's addressed yet.
> > 
> > And this also reminds me that current scrub is not as flex as balance, I
> > really like we could filter block groups to scrub just like balance, and
> > do scrub in a block group basis, other than devid basis.
> > That's to say, for a block group scrub, we don't really care which
> > device we're scrubbing, we just need to ensure all device in this block
> > is storing correct data.
> > 
> This would actually be rather useful for non-parity cases too.  Being 
> able to scrub only metadata when the data chunks are using a profile 
> that provides no rebuild support would be great for performance.
> 
> On the same note, it would be _really_ nice to be able to scrub a subset 
> of the volume's directory tree, even if it were only per-subvolume.

https://github.com/kdave/drafts/blob/master/btrfs/scrub-subvolume.txt
https://github.com/kdave/drafts/blob/master/btrfs/scrub-custom.txt

The idea is to build an in-memory tree of block ranges that span the given
subvolume or files and run scrub only there.  A selective scrub on the
block groups of a given type would be a special case of the above.
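
A very rough userspace-level sketch of the idea (hypothetical helper
names, nothing resembling the real scrub code):

  # Collect the extents referenced by a subvolume, merge them into disjoint
  # block ranges, and scrub only those ranges.
  def merge_ranges(ranges):
      merged = []
      for start, length in sorted(ranges):
          if merged and start <= merged[-1][0] + merged[-1][1]:
              prev_start, prev_len = merged[-1]
              merged[-1] = (prev_start, max(prev_len, start + length - prev_start))
          else:
              merged.append((start, length))
      return merged

  def scrub_subvolume(extents, scrub_range):
      # 'extents' would come from walking the subvolume's file extents;
      # 'scrub_range' stands in for whatever actually verifies the blocks.
      for start, length in merge_ranges(extents):
          scrub_range(start, length)

  # Example with made-up extent addresses (in bytes):
  extents = [(0, 4096), (4096, 8192), (1 << 20, 4096)]
  scrub_subvolume(extents, lambda s, l: print("scrub", s, s + l))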

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-19 19:10                 ` Austin S. Hemmelgarn
@ 2018-07-20 17:13                   ` Goffredo Baroncelli
  2018-07-20 18:33                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-20 17:13 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:
> On 2018-07-19 13:29, Goffredo Baroncelli wrote:
[...]
>>
>> So until now you are repeating what I told: the only useful raid profile are
>> - striping
>> - mirroring
>> - striping+paring (even limiting the number of disk involved)
>> - striping+mirroring
> 
> No, not quite.  At least, not in the combinations you're saying make sense if you are using standard terminology.  RAID05 and RAID06 are not the same thing as 'striping+parity' as BTRFS implements that case, and can be significantly more optimized than the trivial implementation of just limiting the number of disks involved in each chunk (by, you know, actually striping just like what we currently call raid10 mode in BTRFS does).

Could you provide more information?

>>
>>>
>>> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail.
>> If you need further redundancy, it is easy to implement a parity3 and parity4 raid profile instead of stacking a raid6+raid1
> I think you're misunderstanding what I mean here.
> 
> RAID15/16 consist of two layers:
> * The top layer is regular RAID1, usually limited to two copies.
> * The lower layer is RAID5 or RAID6.
> 
> This means that the lower layer can validate which of the two copies in the upper layer is correct when they don't agree.  

This happens only because there is redundancy greater than 1. Anyway, BTRFS has checksums, which help a lot in this area.

> It doesn't really provide significantly better redundancy (they can technically sustain more disk failures without failing completely than simple two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive more than one (or two if you're using RAID6 as the lower layer) disk failure), so it does not do the same thing that higher-order parity does.
>>
>>>>
>>>> The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below).
>>>> [....]
>>>>>>>
>>>>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>>>>> much works this way except with huge (gig size) chunks.
>>>>>>
>>>>>> As implemented in BTRFS, raid1 doesn't have striping.
>>>>>
>>>>> The argument is that because there's only two copies, on multi-device
>>>>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>>>>> alternate device pairs, it's effectively striped at the macro level, with
>>>>> the 1 GiB device-level chunks effectively being huge individual device
>>>>> strips of 1 GiB.
>>>>
>>>> The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
>>> That's not the only benefit of striping though.  The other big one is that you now have one volume that's the combined size of both of the original devices.  Striping is arguably better for this even if you're using a large stripe size because it better balances the wear across the devices than simple concatenation.
>>
>> Striping means that the data is interleaved between the disks with a reasonable "block unit". Otherwise which would be the difference between btrfs-raid0 and btrfs-single ?
> Single mode guarantees that any file less than the chunk size in length will either be completely present or completely absent if one of the devices fails.  BTRFS raid0 mode does not provide any such guarantee, and in fact guarantees that all files that are larger than the stripe unit size (however much gets put on one disk before moving to the next) will all lose data if a device fails.
> 
> Stupid as it sounds, this matters for some people.

I think that having different filesystems would be even better.

>>
>>>
>>>> With a "stripe size" of 1GB, it is very unlikely that this would happens.
>>> That's a pretty big assumption.  There are all kinds of access patterns that will still distribute the load reasonably evenly across the constituent devices, even if they don't parallelize things.
>>>
>>> If, for example, all your files are 64k or less, and you only read whole files, there's no functional difference between RAID0 with 1GB blocks and RAID0 with 64k blocks.  Such a workload is not unusual on a very busy mail-server.
>>
>> I fully agree that 64K may be too much for some workload, however I have to point out that I still find difficult to imagine that you can take advantage of parallel read from multiple disks with a 1GB stripe unit for a *common workload*. Pay attention that btrfs inline in the metadata the small files, so even if the file is smaller than 64k, a 64k read (or more) will be required in order to access it.

> Again, mail servers. Each file should be written out as a single extent, which means it's all in one chunk.  Delivery and processing need to access _LOTS_ of files on a busy mail server, and the good ones do this with userspace parallelization.  BTRFS doesn't parallelize disk accesses from the same userspace execution context (thread if threads are being used, process if not), but it does parallelize access for separate contexts, so if userspace is doing things from multiple threads, so will BTRFS.

Parallelization matters only if the I/O is distributed across different disks; the more disks are involved, the more parallelization is possible. As an extreme example, with a stripe unit of 1GB, until the filesystem is smaller than 1GB, no parallelization is possible[*] because all the data is on the same disk. And as the filesystem grows, the data must be "distant" by more than 1GB to be parallelized.

[*] Of course it is possible to perform parallel reads on the same disk, but the throughput would decrease; it may be that the average latency would be better.

> 
> FWIW, I actually tested this back when the company I work for still ran their own internal mail server.  BTRFS was significantly less optimized back then, but there was no measurable performance difference from userspace between using single profile for data or raid0 profile for data.

Despite the btrfs optimizations, having a stripe unit of 1GB reduces the likelihood of parallelizing reads. This is because reads can only be parallelized when the data involved is more than a "stripe unit" apart: a smaller stripe unit increases the likelihood of parallel reads.

Of course this alone is not sufficient; in any case BTRFS should improve its I/O scheduler.


>>
>>>>  
>>>>> At 1 GiB strip size it doesn't have the typical performance advantage of
>>>>> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
>>>>> strips/chunks.
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-20  5:17             ` Andrei Borzenkov
@ 2018-07-20 17:16               ` Goffredo Baroncelli
  2018-07-20 18:38                 ` Andrei Borzenkov
  0 siblings, 1 reply; 36+ messages in thread
From: Goffredo Baroncelli @ 2018-07-20 17:16 UTC (permalink / raw)
  To: Andrei Borzenkov, Duncan, linux-btrfs

On 07/20/2018 07:17 AM, Andrei Borzenkov wrote:
> 18.07.2018 22:42, Goffredo Baroncelli пишет:
>> On 07/18/2018 09:20 AM, Duncan wrote:
>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>> excerpted:
>>>
>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>> excerpted:
>>>>>
>>>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>>>
>>>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>>>> parity are mutually exclusive.
>>>>>
>>>>> I can't agree.  I don't know whether you meant that in the global
>>>>> sense,
>>>>> or purely in the btrfs context (which I suspect), but either way I
>>>>> can't agree.
>>>>>
>>>>> In the pure btrfs context, while striping and mirroring/pairing are
>>>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>>>> flexible enough to allow both together and the feature may at some
>>>>> point be added, so it makes sense to have a layout notation format
>>>>> flexible enough to allow it as well.
>>>>
>>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>>> have - striping (RAID0)
>>>> - parity  (?)
>>>> - striping + parity  (e.g. RAID5/6)
>>>> - mirroring  (RAID1)
>>>> - mirroring + striping  (RAID10)
>>>>
>>>> However you can't have mirroring+parity; this means that a notation
>>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>>> too verbose.
>>>
>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>>> on top of raid0.  
>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
>>
>> Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above.
> 
> RAID50 (striping across RAID5) is common.

Yeah, someone else reported that. But other than reducing the number of disks per raid5 (which increases the ratio of parity disks to data disks), what other advantages does it have?
Limiting the number of disks per raid would, in BTRFS, be quite simple to implement in the "chunk allocator".

> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-20 17:13                   ` Goffredo Baroncelli
@ 2018-07-20 18:33                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-20 18:33 UTC (permalink / raw)
  To: kreijack, linux-btrfs

On 2018-07-20 13:13, Goffredo Baroncelli wrote:
> On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:
>> On 2018-07-19 13:29, Goffredo Baroncelli wrote:
> [...]
>>>
>>> So until now you are repeating what I told: the only useful raid profile are
>>> - striping
>>> - mirroring
>>> - striping+paring (even limiting the number of disk involved)
>>> - striping+mirroring
>>
>> No, not quite.  At least, not in the combinations you're saying make sense if you are using standard terminology.  RAID05 and RAID06 are not the same thing as 'striping+parity' as BTRFS implements that case, and can be significantly more optimized than the trivial implementation of just limiting the number of disks involved in each chunk (by, you know, actually striping just like what we currently call raid10 mode in BTRFS does).
> 
> Could you provide more information ?
Just parity by itself is functionally equivalent to a really stupid 
implementation of 2 or more copies of the data.  Setups with only one 
disk more than the number of parities in RAID5 and RAID6 are called 
degenerate for this very reason.  All sane RAID5/6 implementations do 
striping across multiple devices internally, and that's almost always 
what people mean when talking about striping plus parity.

What I'm referring to is different though.  Just like RAID10 used to be 
implemented as RAID1 on top of RAID0, RAID05 is RAID0 on top of RAID5. 
That is, you're striping your data across multiple RAID5 arrays instead 
of using one big RAID5 array to store it all.  As I mentioned, this 
mitigates the scaling issues inherent in RAID5 when it comes to rebuilds 
(namely, the fact that device failure rates go up faster for larger 
arrays than rebuild times do).

Functionally, such a setup can be implemented in BTRFS by limiting 
RAID5/6 stripe width, but that will have all kinds of performance 
limitations compared to actually striping across all of the underlying 
RAID5 chunks.  In fact, it will have the exact same performance 
limitations you're calling out BTRFS single mode for below.
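
A toy model of that difference (all the numbers are invented) for a
single large sequential read:

  # 4 RAID5 groups of 6 disks each, 64 KiB internal stripes.
  GROUPS, DISKS, STRIPE = 4, 6, 64 * 1024
  READ = 16 * 1024 * 1024

  def limited_width(offset=0):
      # Stripe width limited per chunk: the whole extent sits in one group.
      return {(0, (b // STRIPE) % DISKS)
              for b in range(offset, offset + READ, STRIPE)}

  def raid0_over_raid5(offset=0, upper=1024 * 1024):
      # A RAID0 layer spreads consecutive 1 MiB units across the groups,
      # and each group then stripes internally across its disks.
      return {((b // upper) % GROUPS, ((b % upper) // STRIPE) % DISKS)
              for b in range(offset, offset + READ, STRIPE)}

  print(len(limited_width()))     # 6  devices touched (one group only)
  print(len(raid0_over_raid5()))  # 24 devices touched (all groups)
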
> 
>>>
>>>>
>>>> RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might actually make sense in BTRFS to provide a backup means of rebuilding blocks that fail checksum validation if both copies fail.
>>> If you need further redundancy, it is easy to implement a parity3 and parity4 raid profile instead of stacking a raid6+raid1
>> I think you're misunderstanding what I mean here.
>>
>> RAID15/16 consist of two layers:
>> * The top layer is regular RAID1, usually limited to two copies.
>> * The lower layer is RAID5 or RAID6.
>>
>> This means that the lower layer can validate which of the two copies in the upper layer is correct when they don't agree.
> 
> This happens only because there is a redundancy greater than 1. Anyway BTRFS has the checksum, which helps a lot in this area
The checksum helps, but what do you do when all copies fail the 
checksum?  Or, worse yet, what do you do when both copies have the 
'right' checksum, but different data?  Yes, you could have one more 
copy, but that just reduces the chances of those cases happening; it 
doesn't eliminate them.

Note that I'm not necessarily saying it makes sense to have support for 
this in BTRFS, just that it's a real-world counter-example to your 
statement that only those combinations make sense.  In the case of 
BTRFS, these would make more sense than RAID51 and RAID61, but they 
still aren't particularly practical.  For classic RAID though, they're 
really important, because you don't have checksumming (unless you have 
T10 DIF capable hardware and a RAID implementation that understands how 
to work with it, but that's rare and expensive) and it makes it easier 
to resize an array than having three copies (you only need 2 new disks 
for RAID15 or RAID16 to increase the size of the array, but you need 3 
for 3-copy RAID1 or RAID10).
> 
>> It doesn't really provide significantly better redundancy (they can technically sustain more disk failures without failing completely than simple two-copy RAID1 can, but just like BTRFS raid10, they can't reliably survive more than one (or two if you're using RAID6 as the lower layer) disk failure), so it does not do the same thing that higher-order parity does.
>>>
>>>>>
>>>>> The fact that you can combine striping and mirroring (or pairing) makes sense because you could have a speed gain (see below).
>>>>> [....]
>>>>>>>>
>>>>>>>> As someone else pointed out, md/lvm-raid10 already work like this.
>>>>>>>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>>>>>>>> much works this way except with huge (gig size) chunks.
>>>>>>>
>>>>>>> As implemented in BTRFS, raid1 doesn't have striping.
>>>>>>
>>>>>> The argument is that because there's only two copies, on multi-device
>>>>>> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
>>>>>> alternate device pairs, it's effectively striped at the macro level, with
>>>>>> the 1 GiB device-level chunks effectively being huge individual device
>>>>>> strips of 1 GiB.
>>>>>
>>>>> The striping concept is based to the fact that if the "stripe size" is small enough you have a speed benefit because the reads may be performed in parallel from different disks.
>>>> That's not the only benefit of striping though.  The other big one is that you now have one volume that's the combined size of both of the original devices.  Striping is arguably better for this even if you're using a large stripe size because it better balances the wear across the devices than simple concatenation.
>>>
>>> Striping means that the data is interleaved between the disks with a reasonable "block unit". Otherwise which would be the difference between btrfs-raid0 and btrfs-single ?
>> Single mode guarantees that any file less than the chunk size in length will either be completely present or completely absent if one of the devices fails.  BTRFS raid0 mode does not provide any such guarantee, and in fact guarantees that all files that are larger than the stripe unit size (however much gets put on one disk before moving to the next) will all lose data if a device fails.
>>
>> Stupid as it sounds, this matters for some people.
> 
> I think that even better would be having different filesystems.
Not necessarily.  In fact, quite the opposite in most cases, because 
having separate filesystems pushes the requirement to sort the files 
onto devices to userspace, which should not have to worry about that.

Put in cluster computing terms (where this kind of file layout is the 
norm), why exactly should the application software be the component 
responsible for figuring out what node a given file from a particular 
dataset is on?  Why shouldn't the filesystem itself handle this?
> 
>>>
>>>>
>>>>> With a "stripe size" of 1GB, it is very unlikely that this would happens.
>>>> That's a pretty big assumption.  There are all kinds of access patterns that will still distribute the load reasonably evenly across the constituent devices, even if they don't parallelize things.
>>>>
>>>> If, for example, all your files are 64k or less, and you only read whole files, there's no functional difference between RAID0 with 1GB blocks and RAID0 with 64k blocks.  Such a workload is not unusual on a very busy mail-server.
>>>
>>> I fully agree that 64K may be too much for some workload, however I have to point out that I still find difficult to imagine that you can take advantage of parallel read from multiple disks with a 1GB stripe unit for a *common workload*. Pay attention that btrfs inline in the metadata the small files, so even if the file is smaller than 64k, a 64k read (or more) will be required in order to access it.
> 
>> Again, mail servers. Each file should be written out as a single extent, which means it's all in one chunk.  Delivery and processing need to access _LOTS_ of files on a busy mail server, and the good ones do this with userspace parallelization.  BTRFS doesn't parallelize disk accesses from the same userspace execution context (thread if threads are being used, process if not), but it does parallelize access for separate contexts, so if userspace is doing things from multiple threads, so will BTRFS.
> 
> The parallelization matters only if it is distributed across different disks. So more disks are involved more parallelization is possible. As extreme example, whit a stripe unit of 1GB, until the filesystem is smaller than 1GB, no parallelizzation is possible[*] because all data is in the same disk. And when the filesystem increases its size, the data must be "distant" more than 1GB to be parallelized.

First, I think you have things slightly backwards, it should be 'until 
the filesystem is _larger_ than 1GB' here.

That aside, the whole issue of data locality is not as big as you might 
think.  For 64k files, that's 16384 files per 1 GiB chunk.  That's a 
minuscule number for a really active mail server (no, seriously, a 
single subsidiary mail server in a big company may be handling queuing 
and delivery of more than twice that per minute).
> 
> [*] Of course it is possible to perform parallel read on the same disk, but the throughput would decrease; may be that the average latency would perform better.
Raw throughput, measured simply as how many bytes you can read or write 
per second, would decrease.  Actual effective throughput will not 
necessarily decrease if you've got a storage device with very low seek 
times, because being able to load and process files in parallel may 
allow for much faster overall processing of the data than simple serial 
processing.  Latency would be dependent on
> 
>>
>> FWIW, I actually tested this back when the company I work for still ran their own internal mail server.  BTRFS was significantly less optimized back then, but there was no measurable performance difference from userspace between using single profile for data or raid0 profile for data.
> 
> Despite the btrfs optimization, having a stripe unit of 1GB reduces the likelihood of parallelizing the reads. This because the data to read to be parallelized must be "distant" more than the "stripe unit": having a stripe unit smaller increase the likelihood of a parallel reads.
> 
> Of course this is not sufficient. In any case BTRFS should improve its I/O scheduler
Agreed, we need actual parallel access to devices in BTRFS.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-20 17:16               ` Goffredo Baroncelli
@ 2018-07-20 18:38                 ` Andrei Borzenkov
  2018-07-20 18:41                   ` Hugo Mills
  0 siblings, 1 reply; 36+ messages in thread
From: Andrei Borzenkov @ 2018-07-20 18:38 UTC (permalink / raw)
  To: kreijack, Duncan, linux-btrfs

20.07.2018 20:16, Goffredo Baroncelli пишет:
> On 07/20/2018 07:17 AM, Andrei Borzenkov wrote:
>> 18.07.2018 22:42, Goffredo Baroncelli пишет:
>>> On 07/18/2018 09:20 AM, Duncan wrote:
>>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>>> excerpted:
>>>>
>>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>>> excerpted:
>>>>>>
>>>>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>>>>
>>>>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>>>>> parity are mutually exclusive.
>>>>>>
>>>>>> I can't agree.  I don't know whether you meant that in the global
>>>>>> sense,
>>>>>> or purely in the btrfs context (which I suspect), but either way I
>>>>>> can't agree.
>>>>>>
>>>>>> In the pure btrfs context, while striping and mirroring/pairing are
>>>>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>>>>> flexible enough to allow both together and the feature may at some
>>>>>> point be added, so it makes sense to have a layout notation format
>>>>>> flexible enough to allow it as well.
>>>>>
>>>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>>>> have - striping (RAID0)
>>>>> - parity  (?)
>>>>> - striping + parity  (e.g. RAID5/6)
>>>>> - mirroring  (RAID1)
>>>>> - mirroring + striping  (RAID10)
>>>>>
>>>>> However you can't have mirroring+parity; this means that a notation
>>>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>>>> too verbose.
>>>>
>>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>>>> on top of raid0.  
>>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top of....) ???
>>>
>>> Seriously, of course you can combine a lot of different profile; however the only ones that make sense are the ones above.
>>
>> RAID50 (striping across RAID5) is common.
> 
> Yeah someone else report that. But other than reducing the number of disk per raid5 (increasing the ration number of disks/number of parity disks), which other advantages has ? 

It allows distributing I/O across a virtually unlimited number of disks
while confining the failure domain to a manageable size.

> Limiting the number of disk per raid, in BTRFS would be quite simple to implement in the "chunk allocator"
> 

You mean that currently the RAID5 stripe width is equal to the number of
disks? Well, I suppose nobody is using btrfs with disk pools two or
three digits in size.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-20 18:38                 ` Andrei Borzenkov
@ 2018-07-20 18:41                   ` Hugo Mills
  2018-07-20 18:46                     ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 36+ messages in thread
From: Hugo Mills @ 2018-07-20 18:41 UTC (permalink / raw)
  To: Andrei Borzenkov; +Cc: kreijack, Duncan, linux-btrfs

On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:
> 20.07.2018 20:16, Goffredo Baroncelli пишет:
[snip]
> > Limiting the number of disk per raid, in BTRFS would be quite simple to implement in the "chunk allocator"
> > 
> 
> You mean that currently RAID5 stripe size is equal to number of disks?
> Well, I suppose nobody is using btrfs with disk pools of two or three
> digits size.

   But they are (even if not very many of them) -- we've seen at least
one person with something like 40 or 50 devices in the array. They'd
definitely got into /dev/sdac territory. I don't recall what RAID level
they were using. I think it was either RAID-1 or -10.

   That's the largest I can recall seeing mention of, though.

   Hugo.

-- 
Hugo Mills             | Have found Lost City of Atlantis. High Priest is
hugo@... carfax.org.uk | winning at quoits.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                       Terry Pratchett

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/4] 3- and 4- copy RAID1
  2018-07-20 18:41                   ` Hugo Mills
@ 2018-07-20 18:46                     ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 36+ messages in thread
From: Austin S. Hemmelgarn @ 2018-07-20 18:46 UTC (permalink / raw)
  To: Hugo Mills, Andrei Borzenkov, kreijack, Duncan, linux-btrfs

On 2018-07-20 14:41, Hugo Mills wrote:
> On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:
>> 20.07.2018 20:16, Goffredo Baroncelli пишет:
> [snip]
>>> Limiting the number of disk per raid, in BTRFS would be quite simple to implement in the "chunk allocator"
>>>
>>
>> You mean that currently RAID5 stripe size is equal to number of disks?
>> Well, I suppose nobody is using btrfs with disk pools of two or three
>> digits size.
> 
>     But they are (even if not very many of them) -- we've seen at least
> one person with something like 40 or 50 devices in the array. They'd
> definitely got into /dev/sdac territory. I don't recall what RAID level
> they were using. I think it was either RAID-1 or -10.
> 
>     That's the largest I can recall seeing mention of, though.
I've talked to at least two people using it on 100+ disks in a SAN 
situation.  In both cases however, BTRFS itself was only seeing about 20 
devices and running in raid0 mode on them, with each of those being a 
RAID6 volume configured on the SAN node holding the disks for it.  From 
what I understood when talking to them, they actually got rather good 
performance in this setup, though maintenance was a bit of a pain.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-07-20 19:36 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-13 18:46 [PATCH 0/4] 3- and 4- copy RAID1 David Sterba
2018-07-13 18:46 ` [PATCH] btrfs-progs: add support for raid1c3 and raid1c4 David Sterba
2018-07-13 18:46 ` [PATCH 1/4] btrfs: refactor block group replication factor calculation to a helper David Sterba
2018-07-13 18:46 ` [PATCH 2/4] btrfs: add support for 3-copy replication (raid1c3) David Sterba
2018-07-13 21:02   ` Goffredo Baroncelli
2018-07-17 16:00     ` David Sterba
2018-07-13 18:46 ` [PATCH 3/4] btrfs: add support for 4-copy replication (raid1c4) David Sterba
2018-07-13 18:46 ` [PATCH 4/4] btrfs: add incompatibility bit for extended raid features David Sterba
2018-07-15 14:37 ` [PATCH 0/4] 3- and 4- copy RAID1 waxhead
2018-07-16 18:29   ` Goffredo Baroncelli
2018-07-16 18:49     ` Austin S. Hemmelgarn
2018-07-17 21:12     ` Duncan
2018-07-18  5:59       ` Goffredo Baroncelli
2018-07-18  7:20         ` Duncan
2018-07-18  8:39           ` Duncan
2018-07-18 12:45             ` Austin S. Hemmelgarn
2018-07-18 12:50             ` Hugo Mills
2018-07-19 21:22               ` waxhead
2018-07-18 12:50           ` Austin S. Hemmelgarn
2018-07-18 19:42           ` Goffredo Baroncelli
2018-07-19 11:43             ` Austin S. Hemmelgarn
2018-07-19 17:29               ` Goffredo Baroncelli
2018-07-19 19:10                 ` Austin S. Hemmelgarn
2018-07-20 17:13                   ` Goffredo Baroncelli
2018-07-20 18:33                     ` Austin S. Hemmelgarn
2018-07-20  5:17             ` Andrei Borzenkov
2018-07-20 17:16               ` Goffredo Baroncelli
2018-07-20 18:38                 ` Andrei Borzenkov
2018-07-20 18:41                   ` Hugo Mills
2018-07-20 18:46                     ` Austin S. Hemmelgarn
2018-07-16 21:51   ` waxhead
2018-07-15 14:46 ` Hugo Mills
2018-07-19  7:27 ` Qu Wenruo
2018-07-19 11:47   ` Austin S. Hemmelgarn
2018-07-20 16:42     ` David Sterba
2018-07-20 16:35   ` David Sterba
