* [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
@ 2022-05-13  8:34 Qu Wenruo
From: Qu Wenruo @ 2022-05-13  8:34 UTC (permalink / raw)
  To: linux-btrfs

Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
(J for journal), relying on ad-hoc if () else if () branches to
calculate things like the number of p/q stripes will cause a lot of
problems.

In fact, during my development, I have hit tons of bugs related to this.

One example is alloc_rbio(), which assigns rbio->nr_data.  If we forget
to update the check for the RAID5 and RAID6 profiles, we get a bad
nr_data == num_stripes and screw up the later writeback.

90% of my suffering comes from such ad-hoc manual checks on RAID56.

Another example is scrub_stripe(): due to the extra per-device
reservation, @dev_extent_len is no longer the same as the data stripe
length calculated from the extent_map.

So this patchset does the following cleanups to prepare for the
incoming RAID56J (the coding is already finished, the functionality and
on-disk format are fine, although there is no journal yet):

- Calculate the device stripe length in-house inside scrub_stripe()
  This removes one of the nastier and less obvious mismatches.

- Use btrfs_raid_array[] based calculation instead of ad-hoc checks
  (a short recap follows this list)
  The only exception is scrub_nr_raid_mirrors(), which has several
  differences from btrfs_num_copies():

  * No iteration over all RAID6 combinations
    Not sure if it's planned or not.

  * Use bioc->num_stripes directly
    In that context, bioc already covers all the mirrors for the same
    stripe, thus there is no need to look it up via btrfs_raid_array[].
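
To give a quick idea of the direction (this just condenses what patches
3 and 4 below actually do), an open-coded per-profile branch such as:

	if (map->type & BTRFS_BLOCK_GROUP_RAID5)
		return 1;
	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
		return 2;
	else
		return 0;

becomes a single btrfs_raid_array[] lookup:

	return btrfs_raid_array[btrfs_bg_flags_to_raid_index(type)].nparity;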

With all these cleanups, RAID56J will be much easier to implement.

Qu Wenruo (4):
  btrfs: remove @dev_extent_len argument from scrub_stripe() function
  btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
    calculation
  btrfs: use btrfs_raid_array[] to calculate the number of parity
    stripes
  btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

 fs/btrfs/raid56.c  | 10 ++--------
 fs/btrfs/raid56.h  | 12 ++----------
 fs/btrfs/scrub.c   | 13 +++++++------
 fs/btrfs/volumes.c | 35 +++++++++++++++++++++--------------
 fs/btrfs/volumes.h |  2 ++
 5 files changed, 34 insertions(+), 38 deletions(-)

-- 
2.36.1



* [PATCH 1/4] btrfs: remove @dev_extent_len argument from scrub_stripe() function
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-05-13  8:34 ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13  8:34 UTC (permalink / raw)
  To: linux-btrfs

For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.

Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from the extent_map structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/scrub.c   | 13 +++++++------
 fs/btrfs/volumes.c | 12 ++++++------
 fs/btrfs/volumes.h |  1 +
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e7b0323e6efd..84346faa4ff1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3429,20 +3429,22 @@ static int scrub_simple_stripe(struct scrub_ctx *sctx,
 
 static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 					   struct btrfs_block_group *bg,
-					   struct map_lookup *map,
+					   struct extent_map *em,
 					   struct btrfs_device *scrub_dev,
-					   int stripe_index, u64 dev_extent_len)
+					   int stripe_index)
 {
 	struct btrfs_path *path;
 	struct btrfs_fs_info *fs_info = sctx->fs_info;
 	struct btrfs_root *root;
 	struct btrfs_root *csum_root;
 	struct blk_plug plug;
+	struct map_lookup *map = em->map_lookup;
 	const u64 profile = map->type & BTRFS_BLOCK_GROUP_PROFILE_MASK;
 	const u64 chunk_logical = bg->start;
 	int ret;
 	u64 physical = map->stripes[stripe_index].physical;
-	const u64 physical_end = physical + dev_extent_len;
+	const u64 dev_stripe_len = btrfs_calc_stripe_length(em);
+	const u64 physical_end = physical + dev_stripe_len;
 	u64 logical;
 	u64 logic_end;
 	/* The logical increment after finishing one stripe */
@@ -3570,7 +3572,7 @@ static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx,
 		spin_lock(&sctx->stat_lock);
 		if (stop_loop)
 			sctx->stat.last_physical = map->stripes[stripe_index].physical +
-						   dev_extent_len;
+						   dev_stripe_len;
 		else
 			sctx->stat.last_physical = physical;
 		spin_unlock(&sctx->stat_lock);
@@ -3639,8 +3641,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx,
 	for (i = 0; i < map->num_stripes; ++i) {
 		if (map->stripes[i].dev->bdev == scrub_dev->bdev &&
 		    map->stripes[i].physical == dev_offset) {
-			ret = scrub_stripe(sctx, bg, map, scrub_dev, i,
-					   dev_extent_len);
+			ret = scrub_stripe(sctx, bg, em, scrub_dev, i);
 			if (ret)
 				goto out;
 		}
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 748614b00ffa..a49d72d845b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6977,11 +6977,12 @@ static void btrfs_report_missing_device(struct btrfs_fs_info *fs_info,
 			      devid, uuid);
 }
 
-static u64 calc_stripe_length(u64 type, u64 chunk_len, int num_stripes)
+u64 btrfs_calc_stripe_length(const struct extent_map *em)
 {
-	const int data_stripes = calc_data_stripes(type, num_stripes);
+	const struct map_lookup *map = em->map_lookup;
+	const int data_stripes = calc_data_stripes(map->type, map->num_stripes);
 
-	return div_u64(chunk_len, data_stripes);
+	return div_u64(em->len, data_stripes);
 }
 
 #if BITS_PER_LONG == 32
@@ -7120,8 +7121,7 @@ static int read_one_chunk(struct btrfs_key *key, struct extent_buffer *leaf,
 	map->type = type;
 	map->sub_stripes = btrfs_chunk_sub_stripes(leaf, chunk);
 	map->verified_stripes = 0;
-	em->orig_block_len = calc_stripe_length(type, em->len,
-						map->num_stripes);
+	em->orig_block_len = btrfs_calc_stripe_length(em);
 	for (i = 0; i < num_stripes; i++) {
 		map->stripes[i].physical =
 			btrfs_stripe_offset_nr(leaf, chunk, i);
@@ -8022,7 +8022,7 @@ static int verify_one_dev_extent(struct btrfs_fs_info *fs_info,
 	}
 
 	map = em->map_lookup;
-	stripe_len = calc_stripe_length(map->type, em->len, map->num_stripes);
+	stripe_len = btrfs_calc_stripe_length(em);
 	if (physical_len != stripe_len) {
 		btrfs_err(fs_info,
 "dev extent physical offset %llu on devid %llu length doesn't match chunk %llu, have %llu expect %llu",
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 12b2af9260e9..1338bc251ff6 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -601,6 +601,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info,
 			   u64 logical, u64 len);
 unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 				    u64 logical);
+u64 btrfs_calc_stripe_length(const struct extent_map *em);
 int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 				     struct btrfs_block_group *bg);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
-- 
2.36.1



* [PATCH 2/4] btrfs: use btrfs_chunk_max_errors() to replace weird tolerance calculation
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-05-13  8:34 ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13  8:34 UTC (permalink / raw)
  To: linux-btrfs

In __btrfs_map_block() we have a very weird assignment to @max_errors
using nr_parity_stripes().

Although it works for RAID56 without problems, it's not reader-friendly.

Just replace it with btrfs_chunk_max_errors().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a49d72d845b1..c6c55ef17000 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6480,7 +6480,7 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info,
 
 			/* RAID[56] write or recovery. Return all stripes */
 			num_stripes = map->num_stripes;
-			max_errors = nr_parity_stripes(map);
+			max_errors = btrfs_chunk_max_errors(map);
 
 			*length = map->stripe_len;
 			stripe_index = 0;
-- 
2.36.1



* [PATCH 3/4] btrfs: use btrfs_raid_array[] to calculate the number of parity stripes
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-05-13  8:34 ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13  8:34 UTC (permalink / raw)
  To: linux-btrfs

This makes later expansion of the RAID56 based profiles easier.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c  | 10 ++--------
 fs/btrfs/raid56.h  | 12 ++----------
 fs/btrfs/volumes.c |  7 +++++++
 fs/btrfs/volumes.h |  1 +
 4 files changed, 12 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index a5b623ee6fac..068576768925 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1032,7 +1032,6 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	const unsigned int stripe_nsectors = stripe_len >> fs_info->sectorsize_bits;
 	const unsigned int num_sectors = stripe_nsectors * real_stripes;
 	struct btrfs_raid_bio *rbio;
-	int nr_data = 0;
 	void *p;
 
 	ASSERT(IS_ALIGNED(stripe_len, PAGE_SIZE));
@@ -1085,14 +1084,9 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	CONSUME_ALLOC(rbio->finish_pbitmap, BITS_TO_LONGS(stripe_nsectors));
 #undef  CONSUME_ALLOC
 
-	if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID5)
-		nr_data = real_stripes - 1;
-	else if (bioc->map_type & BTRFS_BLOCK_GROUP_RAID6)
-		nr_data = real_stripes - 2;
-	else
-		BUG();
+	ASSERT(btrfs_nr_parity_stripes(bioc->map_type));
+	rbio->nr_data = real_stripes - btrfs_nr_parity_stripes(bioc->map_type);
 
-	rbio->nr_data = nr_data;
 	return rbio;
 }
 
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index aaad08aefd7d..83a766bcfb17 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -7,19 +7,11 @@
 #ifndef BTRFS_RAID56_H
 #define BTRFS_RAID56_H
 
-static inline int nr_parity_stripes(const struct map_lookup *map)
-{
-	if (map->type & BTRFS_BLOCK_GROUP_RAID5)
-		return 1;
-	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
-		return 2;
-	else
-		return 0;
-}
+#include "volumes.h"
 
 static inline int nr_data_stripes(const struct map_lookup *map)
 {
-	return map->num_stripes - nr_parity_stripes(map);
+	return map->num_stripes - btrfs_nr_parity_stripes(map->type);
 }
 #define RAID5_P_STRIPE ((u64)-2)
 #define RAID6_Q_STRIPE ((u64)-1)
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c6c55ef17000..2ce9140fb955 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6985,6 +6985,13 @@ u64 btrfs_calc_stripe_length(const struct extent_map *em)
 	return div_u64(em->len, data_stripes);
 }
 
+int btrfs_nr_parity_stripes(u64 type)
+{
+	enum btrfs_raid_types index = btrfs_bg_flags_to_raid_index(type);
+
+	return btrfs_raid_array[index].nparity;
+}
+
 #if BITS_PER_LONG == 32
 /*
  * Due to page cache limit, metadata beyond BTRFS_32BIT_MAX_FILE_SIZE
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1338bc251ff6..2bfe14d75a15 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -602,6 +602,7 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info *fs_info,
 unsigned long btrfs_full_stripe_len(struct btrfs_fs_info *fs_info,
 				    u64 logical);
 u64 btrfs_calc_stripe_length(const struct extent_map *em);
+int btrfs_nr_parity_stripes(u64 type);
 int btrfs_chunk_alloc_add_chunk_item(struct btrfs_trans_handle *trans,
 				     struct btrfs_block_group *bg);
 int btrfs_remove_chunk(struct btrfs_trans_handle *trans, u64 chunk_offset);
-- 
2.36.1



* [PATCH 4/4] btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-05-13  8:34 ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13  8:34 UTC (permalink / raw)
  To: linux-btrfs

For all non-raid56 profiles, we can just use btrfs_raid_array[].ncopies
directly.

Only for RAID5 and RAID6 do we need some extra handling.

This will reduce several branches.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2ce9140fb955..cddbbd8eb310 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5721,7 +5721,8 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 {
 	struct extent_map *em;
 	struct map_lookup *map;
-	int ret;
+	enum btrfs_raid_types index;
+	int ret = 1;
 
 	em = btrfs_get_chunk_map(fs_info, logical, len);
 	if (IS_ERR(em))
@@ -5734,10 +5735,11 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		return 1;
 
 	map = em->map_lookup;
-	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK))
-		ret = map->num_stripes;
-	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
-		ret = map->sub_stripes;
+	index = btrfs_bg_flags_to_raid_index(map->type);
+
+	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK))
+		/* Non-raid56, use their ncopies from btrfs_raid_array[]. */
+		ret = btrfs_raid_array[index].ncopies;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
 		ret = 2;
 	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
@@ -5749,8 +5751,6 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
 		 * stripe under reconstruction.
 		 */
 		ret = map->num_stripes;
-	else
-		ret = 1;
 	free_extent_map(em);
 
 	down_read(&fs_info->dev_replace.rwsem);
-- 
2.36.1



* Re: [PATCH 2/4] btrfs: use btrfs_chunk_max_errors() to replace weird tolerance calculation
  2022-05-13  8:34 ` [PATCH 2/4] btrfs: use btrfs_chunk_max_errors() to replace weird tolerance calculation Qu Wenruo
@ 2022-05-13  8:45   ` Johannes Thumshirn
From: Johannes Thumshirn @ 2022-05-13  8:45 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 1/4] btrfs: remove @dev_extent_len argument from scrub_stripe() function
  2022-05-13  8:34 ` [PATCH 1/4] btrfs: remove @dev_extent_len argument from scrub_stripe() function Qu Wenruo
@ 2022-05-13  8:47   ` Johannes Thumshirn
From: Johannes Thumshirn @ 2022-05-13  8:47 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 3/4] btrfs: use btrfs_raid_array[] to calculate the number of parity stripes
  2022-05-13  8:34 ` [PATCH 3/4] btrfs: use btrfs_raid_array[] to calculate the number of parity stripes Qu Wenruo
@ 2022-05-13  8:56   ` Johannes Thumshirn
From: Johannes Thumshirn @ 2022-05-13  8:56 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 4/4] btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
  2022-05-13  8:34 ` [PATCH 4/4] btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies() Qu Wenruo
@ 2022-05-13  9:15   ` Johannes Thumshirn
From: Johannes Thumshirn @ 2022-05-13  9:15 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 13/05/2022 10:34, Qu Wenruo wrote:
>  	map = em->map_lookup;
> -	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK))
> -		ret = map->num_stripes;
> -	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
> -		ret = map->sub_stripes;
> +	index = btrfs_bg_flags_to_raid_index(map->type);
> +
> +	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK))
> +		/* Non-raid56, use their ncopies from btrfs_raid_array[]. */
> +		ret = btrfs_raid_array[index].ncopies;
>  	else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
>  		ret = 2;
>  	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)

Here I'm not 100% sure. RAID10 used sub_stripes and now it's using ncopies.
The code still produces the same result, as for RAID10 ncopies == sub_stripes,
but semantically it's different (I think).


* Re: [PATCH 4/4] btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
  2022-05-13  9:15   ` Johannes Thumshirn
@ 2022-05-13  9:22     ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13  9:22 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2022/5/13 17:15, Johannes Thumshirn wrote:
> On 13/05/2022 10:34, Qu Wenruo wrote:
>>   	map = em->map_lookup;
>> -	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK))
>> -		ret = map->num_stripes;
>> -	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
>> -		ret = map->sub_stripes;
>> +	index = btrfs_bg_flags_to_raid_index(map->type);
>> +
>> +	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK))
>> +		/* Non-raid56, use their ncopies from btrfs_raid_array[]. */
>> +		ret = btrfs_raid_array[index].ncopies;
>>   	else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
>>   		ret = 2;
>>   	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
>
> Here I'm not 100% sure. RAID10 used sub_stripes and now it's using ncopies.
> The code still produces the same result, as for RAID10 ncopies == sub_stripes,
> but semantically it's different (I think).

Since the objective of the code is to get the number of mirrors for the
chunk, sub_stripes is definitely not correct here.

Just imagine if we have RAID1C30 (sounds ugly I know), which is RAID1C3
first, then RAID0.

We should still have sub_stripes == 2, but ncopies == 3.

And the number of mirrors is definitely 3, not 2.

So it's the old code that is not semantically correct.

Thanks,
Qu


* Re: [PATCH 4/4] btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
  2022-05-13  9:22     ` Qu Wenruo
@ 2022-05-13  9:24       ` Johannes Thumshirn
From: Johannes Thumshirn @ 2022-05-13  9:24 UTC (permalink / raw)
  To: Qu Wenruo, Qu Wenruo, linux-btrfs

On 13/05/2022 11:22, Qu Wenruo wrote:
> 
> 
> On 2022/5/13 17:15, Johannes Thumshirn wrote:
>> On 13/05/2022 10:34, Qu Wenruo wrote:
>>>   	map = em->map_lookup;
>>> -	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK))
>>> -		ret = map->num_stripes;
>>> -	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
>>> -		ret = map->sub_stripes;
>>> +	index = btrfs_bg_flags_to_raid_index(map->type);
>>> +
>>> +	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK))
>>> +		/* Non-raid56, use their ncopies from btrfs_raid_array[]. */
>>> +		ret = btrfs_raid_array[index].ncopies;
>>>   	else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
>>>   		ret = 2;
>>>   	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
>>
>> Here I'm not 100% sure. RAID10 used sub_stripes and now it's using ncopies.
>> The code still produces the same result, as for RAID10 ncopies == sub_stripes,
>> but semantically it's different (I think).
> 
> Since the objective of the code is the get the number of mirrors for the
> chunk, sub_stripes is definitely not correct here.
> 
> Just imagine if we have RAID1C30 (sounds ugly I know), which is RAID1C3
> first, then RAID0.
> 
> We should still have sub_stripes == 2, but ncopies == 3.
> 
> And the number of mirrors is definitely 3, not 2.
> 
> So it's the old code not correctly semantically.

Ah ok then. Thanks for the explanation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 4/4] btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
  2022-05-13  9:24       ` Johannes Thumshirn
@ 2022-05-13  9:33         ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13  9:33 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2022/5/13 17:24, Johannes Thumshirn wrote:
> On 13/05/2022 11:22, Qu Wenruo wrote:
>>
>>
>> On 2022/5/13 17:15, Johannes Thumshirn wrote:
>>> On 13/05/2022 10:34, Qu Wenruo wrote:
>>>>    	map = em->map_lookup;
>>>> -	if (map->type & (BTRFS_BLOCK_GROUP_DUP | BTRFS_BLOCK_GROUP_RAID1_MASK))
>>>> -		ret = map->num_stripes;
>>>> -	else if (map->type & BTRFS_BLOCK_GROUP_RAID10)
>>>> -		ret = map->sub_stripes;
>>>> +	index = btrfs_bg_flags_to_raid_index(map->type);
>>>> +
>>>> +	if (!(map->type & BTRFS_BLOCK_GROUP_RAID56_MASK))
>>>> +		/* Non-raid56, use their ncopies from btrfs_raid_array[]. */
>>>> +		ret = btrfs_raid_array[index].ncopies;
>>>>    	else if (map->type & BTRFS_BLOCK_GROUP_RAID5)
>>>>    		ret = 2;
>>>>    	else if (map->type & BTRFS_BLOCK_GROUP_RAID6)
>>>
>>> Here I'm not 100% sure. RAID10 used sub_stripes and now it's using ncopies.
>>> The code still produces the same result, as for RAID10 ncopies == sub_stripes,
>>> but semantically it's different (I think).
>>
>> Since the objective of the code is the get the number of mirrors for the
>> chunk, sub_stripes is definitely not correct here.
>>
>> Just imagine if we have RAID1C30 (sounds ugly I know), which is RAID1C3
>> first, then RAID0.
>>
>> We should still have sub_stripes == 2, but ncopies == 3.

My bad, unfortunately if we really had RAID1C30, sub_stripes would still
be 3, and ncopies would also be 3...

But you get the point: ncopies is the number of copies for mirror-based
profiles, so semantically we should use ncopies.

While for sub_stripes, any RAID1-then-RAID0 profile will always have
ncopies == sub_stripes.
So the old code always produces the correct result.

If we ever have things like RAID01, we may get a different result, but
then I guess we will have a special semantic for RAID01 anyway.

Thanks,
Qu

>>
>> And the number of mirrors is definitely 3, not 2.
>>
>> So it's the old code not correctly semantically.
>
> Ah ok then. Thanks for the explanation.
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-05-13 11:38 ` David Sterba
From: David Sterba @ 2022-05-13 11:38 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal),

Do you need to introduce completely new profiles? IIRC in my drafts I
was more inclined to reuse a dedicated raid1c3 block group, of an
arbitrary size and not used by anything else. A new profile would
technically work too but it brings other issues.

As a related feature to the raid56, I was working on the striped
raid10c34 profiles but was not able to make it work. In a sense this is
easier as it reuses existing code, but if you make the journal work we
won't need that.

> if we're relying on ad-hoc if () else if () branches to
> calculate thing like number of p/q stripes, it will cause a lot of
> problems.
> 
> In fact, during my development, I have hit tons of bugs related to this.
> 
> One example is alloc_rbio(), it will assign rbio->nr_data, if we forgot
> to update the check for RAID5 and RAID6 profiles, we will got a bad
> nr_data == num_stripes, and screw up later writeback.
> 
> 90% of my suffering comes from such ad-hoc usage doing manual checks on
> RAID56.
> 
> Another example is, scrub_stripe() which due to the extra per-device
> reservation, @dev_extent_len is no longer the same the data stripe
> length calculated from extent_map.
> 
> So this patchset will do the following cleanups preparing for the
> incoming RAID56J (already finished coding, functionality and on-disk
> format are fine, although no journal yet):
> 
> - Calculate device stripe length in-house inside scrub_stripe()
>   This removes one of the nasty mismatch which is less obvious.
> 
> - Use btrfs_raid_array[] based calculation instead of ad-hoc check
>   The only exception is scrub_nr_raid_mirrors(), which has several
>   difference against btrfs_num_copies():
> 
>   * No iteration on all RAID6 combinations
>     No sure if it's planned or not.
> 
>   * Use bioc->num_stripes directly
>     In that context, bioc is already all the mirrors for the same
>     stripe, thus no need to lookup using btrfs_raid_array[].
> 
> With all these cleanups, the RAID56J will be much easier to implement.
> 
> Qu Wenruo (4):
>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>     calculation
>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>     stripes
>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

The preparatory patches look good to me.


* Re: [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
  2022-05-13 11:38 ` [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features David Sterba
@ 2022-05-13 12:21   ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13 12:21 UTC (permalink / raw)
  To: dsterba, Qu Wenruo, linux-btrfs



On 2022/5/13 19:38, David Sterba wrote:
> On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Do you need to introduce completely new profiles? IIRC in my drafts I
> was more inclined to reuse a dedicated raid1c3 block group, of an
> arbitrary size and not used by anything else. A new profile would
> technically work too but it brings other issues.

The point here is that I want to reserve some space for each data stripe
of a RAID56 chunk.

Unfortunately, we need some on-disk format change anyway, as we reuse
btrfs_chunk::io_align to represent how many bytes are reserved at the
start of each dev extent, before where the data really starts.

So new RAID56 types are needed anyway.

Sure, this will make the dev extent length mismatch the stripe length
calculated purely from a chunk map.

But all the involved changes are not that huge, and still reviewable.

In my latest version, which is already going through fstests (with the
new RAID56J profiles added to the related test cases) but has no real
journal code yet, the patch looks like this:

  fs/btrfs/block-group.c          |  23 ++++++-
  fs/btrfs/ctree.h                |   7 +-
  fs/btrfs/scrub.c                |  15 ++--
  fs/btrfs/tree-checker.c         |   4 ++
  fs/btrfs/volumes.c              | 118 +++++++++++++++++++++++++++++---
  fs/btrfs/volumes.h              |   7 +-
  fs/btrfs/zoned.c                |   2 +
  include/uapi/linux/btrfs.h      |   1 +
  include/uapi/linux/btrfs_tree.h |  30 +++++++-
  9 files changed, 185 insertions(+), 22 deletions(-)

Most changes are in the dev extent handling, and some call sites cannot
use btrfs_raid_array[] directly.

I guess you'll see that kernel patch, along with the needed progs patch,
next week.

Thanks,
Qu
>
> As a related feature to the raid56, I was working on the striped
> raid10c34 profiles but was not able to make it work. In a sense this is
> easier as it reuses existing code, but if you make the journal work we
> won't need that.
>
>> if we're relying on ad-hoc if () else if () branches to
>> calculate thing like number of p/q stripes, it will cause a lot of
>> problems.
>>
>> In fact, during my development, I have hit tons of bugs related to this.
>>
>> One example is alloc_rbio(), it will assign rbio->nr_data, if we forgot
>> to update the check for RAID5 and RAID6 profiles, we will got a bad
>> nr_data == num_stripes, and screw up later writeback.
>>
>> 90% of my suffering comes from such ad-hoc usage doing manual checks on
>> RAID56.
>>
>> Another example is, scrub_stripe() which due to the extra per-device
>> reservation, @dev_extent_len is no longer the same the data stripe
>> length calculated from extent_map.
>>
>> So this patchset will do the following cleanups preparing for the
>> incoming RAID56J (already finished coding, functionality and on-disk
>> format are fine, although no journal yet):
>>
>> - Calculate device stripe length in-house inside scrub_stripe()
>>    This removes one of the nasty mismatch which is less obvious.
>>
>> - Use btrfs_raid_array[] based calculation instead of ad-hoc check
>>    The only exception is scrub_nr_raid_mirrors(), which has several
>>    difference against btrfs_num_copies():
>>
>>    * No iteration on all RAID6 combinations
>>      No sure if it's planned or not.
>>
>>    * Use bioc->num_stripes directly
>>      In that context, bioc is already all the mirrors for the same
>>      stripe, thus no need to lookup using btrfs_raid_array[].
>>
>> With all these cleanups, the RAID56J will be much easier to implement.
>>
>> Qu Wenruo (4):
>>    btrfs: remove @dev_extent_len argument from scrub_stripe() function
>>    btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>>      calculation
>>    btrfs: use btrfs_raid_array[] to calculate the number of parity
>>      stripes
>>    btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()
>
> The preparatory patches look good to me.


* Re: [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-05-13 15:14 ` Forza
From: Forza @ 2022-05-13 15:14 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi, 

---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----

> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal), 

Great to see work being done on the RAID56 parts of Btrfs. :) 

I am just a user of btrfs and don't have a full understanding of the
internals, but it makes me a little curious that we chose to use
journals and RMW instead of a CoW solution to solve the write hole.


Since we need on-disk changes to implement it, could it not be better
to rethink the raid56 modes and implement a solution with full CoW,
such as variable stripe extents etc.? It is likely much more work, but
could have better performance because it avoids double writes and RMW
cycles too.

Thanks 
 
Forza



* Re: [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
  2022-05-13 15:14 ` Forza
@ 2022-05-13 22:58   ` Qu Wenruo
From: Qu Wenruo @ 2022-05-13 22:58 UTC (permalink / raw)
  To: Forza, Qu Wenruo, linux-btrfs



On 2022/5/13 23:14, Forza wrote:
> Hi,
>
> ---- From: Qu Wenruo <wqu@suse.com> -- Sent: 2022-05-13 - 10:34 ----
>
>> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
>> (J for journal),
>
> Great to see work being done on the RAID56 parts of Btrfs. :)
>
> I am just a user of btrfs and don't have the full understanding of the internals, but it makes me a little curious that we choose to use journals and RMW instead of a CoW solution to solve the write hole.

In fact, Johannes from WDC is already working on a pure CoW-based
solution, called the stripe tree.

The idea there is to introduce a new layer of mapping.

With that stripe tree, inside one chunk, the logical bytenr is no longer
directly mapped to a physical location, but can be dynamically mapped to
any physical location inside the chunk range.

So previously, if we have a RAID56 chunk with 3 disks, it looks like this:

     Logical bytenr X		X + 64K		X + 128K
		   |            |		|

Then we have the following on-disk mapping:

   [X, X + 64K):		Devid 1		Physical Y1
   [X + 64K, X + 128K)	Devid 2		Physical Y2
   Parity for above	Devid 3		Physical Y3

So if we just write 64K into logical bytenr X, we need to read out the
data at logical bytenr [X + 64K, X + 128K), then calculate the parity
and write it to devid 3 at physical offset Y3.

But with the new stripe tree, we can map [X, X + 64K) to any location
in devid 1.
The same goes for [X + 64K, X + 128K) and the parity.

Then when we write data into logical bytenr [X, X + 64K), we just
find a free 64K range in the stripe tree of devid 1, and check if we
have [X + 64K, X + 128K) mapped in the stripe tree.

a) Mapped

If we have [X + 64K, X + 128K) mapped, then we read that range out,
update our parity stripe, write the parity stripe to a new location
(CoW), then free up the old stripe.

b) Not mapped

This means we don't have any data written into that range, thus it is
all zeros. We calculate the parity against all zeros, then find a new
location for the parity in devid 3, write the newly calculated parity,
and insert a mapping for the new parity location.


With this, we in fact decouple the 1:1 mapping for RAID56 and gain much
more flexibility.
This idea no longer follows the strict rotation of RAID5 though, so it's
a middle ground between RAID4 and RAID5.


This brilliant idea was introduced mostly to support the different chunk
profiles on zoned devices, but Johannes is working on enabling it for
non-zoned devices too.
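
To make that indirection a bit more concrete, here is a purely
illustrative userspace sketch (hypothetical names and layout, not the
actual stripe tree format Johannes is working on):

#include <stdint.h>

/* Hypothetical in-memory entry, one per mapped device stripe. */
struct stripe_map_entry {
	uint64_t logical;	/* logical bytenr inside the chunk */
	uint64_t devid;		/* device currently holding this stripe */
	uint64_t physical;	/* physical location, free to move on CoW */
};

/*
 * With the classic 1:1 mapping, physical is a fixed function of logical.
 * With a stripe tree, a (sub-)stripe write just points the entry at
 * freshly allocated space and frees the old range later, so live data
 * and parity are never overwritten in place.
 */
void stripe_map_cow(struct stripe_map_entry *entry, uint64_t new_physical)
{
	entry->physical = new_physical;
}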



Then you may ask why I'm still pushing this much more traditional
RAID56J solution. The reasons are:

- Complexity
   The stripe tree is flexible, thus more complex.
   And AFAIK it will affect all chunk types, not only RAID56.
   Thus it can be more challenging.

- Currently relies on the zone unit to split extents/stripes
   Though I believe Johannes can solve that without any problems.

- I just want a valid way to steal code from dm/md guys :)

Don't get me wrong, I totally believe the stripe tree can be the silver
bullet, but that doesn't prevent us from exploring some different (and
more traditional) ways.

>
>
> Since we need on-disk changes to implement it, could it not be better to rethink the raid56 modes and implement a solution with full CoW, such as variable stripe extents etc? It is likely much more work, but could have better performance because it avoids double writes and RMW cycles too.

Well, the journal will have optimizations, e.g. a full stripe write
doesn't need to journal its data.

I'll learn from (steal code from) dm/md to implement it.


But there are problems inherent to RAID56, affecting dm/md raid56 too.

Like bitrot in one data stripe while we're writing data into the other
data stripe.
Then RMW will read out the bad data stripe, calculate the parity from
it, and make the bitrot permanent.

Such a destructive RMW will not be detected by traditional raid56 with
a traditional fs, but can be detected by btrfs.

Thus, after the RAID56J project, I'll spend more time on that
destructive RMW problem.
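
To illustrate that destructive RMW with a tiny self-contained example
(plain userspace C using one-byte "stripes" and XOR parity, nothing
btrfs specific, all values made up):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t d0 = 0xaa, d1 = 0x55;	/* data stripes */
	uint8_t p = d0 ^ d1;		/* parity = 0xff */

	d1 ^= 0x01;			/* silent bitrot in d1 */

	/* Sub-stripe RMW of d0: read the (rotted) d1, recompute parity */
	uint8_t new_d0 = 0x11;
	p = new_d0 ^ d1;		/* parity now matches the rotted d1 */
	d0 = new_d0;

	/* Any later rebuild of d1 from d0 and parity returns the rot */
	uint8_t rebuilt_d1 = d0 ^ p;
	printf("rebuilt d1 = 0x%02x (was 0x55, bitrot is now permanent)\n",
	       rebuilt_d1);
	return 0;
}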

Thanks,
Qu

>
> Thanks
>
> Forza
>


* Re: [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features
  2022-05-13  8:34 [PATCH 0/4] btrfs: cleanups and preparation for the incoming RAID56J features Qu Wenruo
@ 2022-06-15 11:45 ` David Sterba
From: David Sterba @ 2022-06-15 11:45 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Fri, May 13, 2022 at 04:34:27PM +0800, Qu Wenruo wrote:
> Since I'm going to introduce two new chunk profiles, RAID5J and RAID6J
> (J for journal), if we're relying on ad-hoc if () else if () branches to
> calculate thing like number of p/q stripes, it will cause a lot of
> problems.
> 
> In fact, during my development, I have hit tons of bugs related to this.
> 
> One example is alloc_rbio(), it will assign rbio->nr_data, if we forgot
> to update the check for RAID5 and RAID6 profiles, we will got a bad
> nr_data == num_stripes, and screw up later writeback.
> 
> 90% of my suffering comes from such ad-hoc usage doing manual checks on
> RAID56.
> 
> Another example is, scrub_stripe() which due to the extra per-device
> reservation, @dev_extent_len is no longer the same the data stripe
> length calculated from extent_map.
> 
> So this patchset will do the following cleanups preparing for the
> incoming RAID56J (already finished coding, functionality and on-disk
> format are fine, although no journal yet):
> 
> - Calculate device stripe length in-house inside scrub_stripe()
>   This removes one of the nasty mismatch which is less obvious.
> 
> - Use btrfs_raid_array[] based calculation instead of ad-hoc check
>   The only exception is scrub_nr_raid_mirrors(), which has several
>   difference against btrfs_num_copies():
> 
>   * No iteration on all RAID6 combinations
>     No sure if it's planned or not.
> 
>   * Use bioc->num_stripes directly
>     In that context, bioc is already all the mirrors for the same
>     stripe, thus no need to lookup using btrfs_raid_array[].
> 
> With all these cleanups, the RAID56J will be much easier to implement.
> 
> Qu Wenruo (4):
>   btrfs: remove @dev_extent_len argument from scrub_stripe() function
>   btrfs: use btrfs_chunk_max_errors() to replace weird tolerance
>     calculation
>   btrfs: use btrfs_raid_array[] to calculate the number of parity
>     stripes
>   btrfs: use btrfs_raid_array[].ncopies in btrfs_num_copies()

Added to misc-next, thanks.

