* [PATCH 0/3] btrfs: reduce the memory usage for replace in btrfs_io_context.
@ 2023-01-28  8:23 Qu Wenruo
  2023-01-28  8:23 ` [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-01-28  8:23 UTC (permalink / raw)
  To: linux-btrfs

In btrfs_io_context, we have two members dedicated to dev-replace:

- num_tgtdevs
  This one is straightforward: just the number of extra stripes
  duplicated for replace.

- tgtdev_map[]
  This one is a little more complex: it records the mapping between the
  original stripes and the dev-replace stripes.

  This is mostly for RAID56, as only in RAID56 do the stripes contain
  different contents, thus it's important to know the mapping.

  It goes like this:

    num_stripes = 4 (3 + 1 for replace)
    stripes[0]:		dev = devid 1, physical = X
    stripes[1]:		dev = devid 2, physical = Y
    stripes[2]:		dev = devid 3, physical = Z
    stripes[3]:		dev = devid 0, physical = Y

    num_tgtdevs = 1
    tgtdev_map[0] = 0	<- Means stripes[0] is not involved in replace.
    tgtdev_map[1] = 3	<- Means stripes[1] is involved in replace,
			   and it's duplicated to stripes[3].
    tgtdev_map[2] = 0	<- Means stripes[2] is not involved in replace.

  Thus most of the slots are wasted, and the more devices in the
  array, the more space is wasted.


The current tgtdev_map[] design wastes quite some space.  E.g. in the
above case, we only need one slot to record the source stripe number;
the other two slots are just wasted.

The existing tgtdev_map[] would make more sense if we supported
multiple running dev-replaces, but that's not the case.

So this series changes it to a new, more space-efficient scheme,
which for the same example goes like this:

  replace_nr_stripes = 1
  replace_stripe_src[0] = 1	<- Means stripes[1] is involved in replace.
  replace_stripe_src[1] = -1	<- Means the second slot is not used.
				   (Only DUP can use this slot, and it
				    doesn't really matter which copy is
				    the source.)

Furthermore we reduce the width of the nr_stripes related members to
u16, the same width as the on-disk format.
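
To make the change concrete, here is a rough before/after sketch of the
replace related members (field names taken from this series; the final
layout is in patch 3):

  /* Before: one int slot per original stripe, allocated per bioc. */
  int num_tgtdevs;
  int *tgtdev_map;

  /* After: fixed size, since at most one replace can be running. */
  u16 replace_nr_stripes;
  s16 replace_stripe_src[2];	/* -1 means the slot is unused */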

This not only saves some space in the btrfs_io_context structure, but
also allows the following cleanups:

- Streamline handle_ops_on_dev_replace()
  We now go through a common path for both WRITE and GET_READ_MIRRORS,
  and only for DUP with GET_READ_MIRRORS do we shrink the bioc to keep
  the old behavior.

- Remove some unnecessary variables

Although the series still increases the number of lines, the net
increase mostly comes from comments; around 70 lines of comments are
added for the replace related members.


Qu Wenruo (3):
  btrfs: simplify the @bioc argument for handle_ops_on_dev_replace()
  btrfs: small improvement for btrfs_io_context structure
  btrfs: use a more space efficient way to represent the source of
    duplicated stripes

 fs/btrfs/raid56.c  |  44 +++++++++--
 fs/btrfs/scrub.c   |   4 +-
 fs/btrfs/volumes.c | 187 +++++++++++++++++++++------------------------
 fs/btrfs/volumes.h |  52 +++++++++++--
 4 files changed, 174 insertions(+), 113 deletions(-)

-- 
2.39.1



* [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace()
  2023-01-28  8:23 [PATCH 0/3] btrfs: reduce the memory usage for replace in btrfs_io_context Qu Wenruo
@ 2023-01-28  8:23 ` Qu Wenruo
  2023-01-30  7:17   ` Johannes Thumshirn
  2023-01-31 11:54   ` Anand Jain
  2023-01-28  8:23 ` [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
  2023-01-28  8:23 ` [PATCH 3/3] btrfs: use a more space efficient way to represent the source of duplicated stripes Qu Wenruo
  2 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-01-28  8:23 UTC (permalink / raw)
  To: linux-btrfs

There is no memory re-allocation in handle_ops_on_dev_replace(), thus
we don't need to pass a struct btrfs_io_context ** pointer.

A regular struct btrfs_io_context * pointer is enough.
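
As a generic illustration (hypothetical userspace C, not btrfs code): a
double pointer is only needed when the callee may replace the object
the caller holds, which is not the case here.

  #include <stdlib.h>

  struct ctx { int n; };

  /* May reallocate, so the caller's pointer itself must be updated. */
  static void ctx_grow(struct ctx **ctxp)
  {
  	*ctxp = realloc(*ctxp, 2 * sizeof(**ctxp));
  }

  /* Only modifies the object in place; a plain pointer is enough. */
  static void ctx_reset(struct ctx *ctx)
  {
  	ctx->n = 0;
  }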

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 883e5d2a23b6..66d44167f7b1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6168,12 +6168,11 @@ static bool is_block_group_to_copy(struct btrfs_fs_info *fs_info, u64 logical)
 }
 
 static void handle_ops_on_dev_replace(enum btrfs_map_op op,
-				      struct btrfs_io_context **bioc_ret,
+				      struct btrfs_io_context *bioc,
 				      struct btrfs_dev_replace *dev_replace,
 				      u64 logical,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
-	struct btrfs_io_context *bioc = *bioc_ret;
 	u64 srcdev_devid = dev_replace->srcdev->devid;
 	int tgtdev_indexes = 0;
 	int num_stripes = *num_stripes_ret;
@@ -6262,7 +6261,6 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 	*num_stripes_ret = num_stripes;
 	*max_errors_ret = max_errors;
 	bioc->num_tgtdevs = tgtdev_indexes;
-	*bioc_ret = bioc;
 }
 
 static bool need_full_stripe(enum btrfs_map_op op)
@@ -6604,7 +6602,7 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 
 	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
 	    need_full_stripe(op)) {
-		handle_ops_on_dev_replace(op, &bioc, dev_replace, logical,
+		handle_ops_on_dev_replace(op, bioc, dev_replace, logical,
 					  &num_stripes, &max_errors);
 	}
 
-- 
2.39.1



* [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-01-28  8:23 [PATCH 0/3] btrfs: reduce the memory usage for replace in btrfs_io_context Qu Wenruo
  2023-01-28  8:23 ` [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
@ 2023-01-28  8:23 ` Qu Wenruo
  2023-01-30  7:27   ` Johannes Thumshirn
  2023-02-02  6:28   ` Anand Jain
  2023-01-28  8:23 ` [PATCH 3/3] btrfs: use a more space efficient way to represent the source of duplicated stripes Qu Wenruo
  2 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-01-28  8:23 UTC (permalink / raw)
  To: linux-btrfs

That structure is the final output of all the __btrfs_map_block()
related functions.

It has some hard to understand members, like tgtdev_map, but without
any comments.

This patch improves the situation by:

- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
  tgtdev_map[]
  Especially for the last two members, add a dedicated (thus very long)
  comment, with an example to explain them.

- Shrink those int members to u16.
  Our on-disk format only uses u16 for num_stripes, thus there is no
  need to go int at all.

- Add an extra ASSERT() in alloc_btrfs_io_context() for the stripes
  argument

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/volumes.c | 15 ++++++++++---
 fs/btrfs/volumes.h | 53 +++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 60 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 66d44167f7b1..f00716fbb6cd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 						       int total_stripes,
 						       int real_stripes)
 {
-	struct btrfs_io_context *bioc = kzalloc(
+	struct btrfs_io_context *bioc;
+
+	/*
+	 * We should have valid number of stripes, larger than U16_MAX(65535)
+	 * indicates something totally wrong, as our on-disk format is only
+	 * at u16.
+	 */
+	ASSERT(total_stripes < U16_MAX && total_stripes > 0);
+
+	bioc = kzalloc(
 		 /* The size of btrfs_io_context */
 		sizeof(struct btrfs_io_context) +
 		/* Plus the variable array for the stripes */
 		sizeof(struct btrfs_io_stripe) * (total_stripes) +
 		/* Plus the variable array for the tgt dev */
-		sizeof(int) * (real_stripes) +
+		sizeof(u16) * (real_stripes) +
 		/*
 		 * Plus the raid_map, which includes both the tgt dev
 		 * and the stripes.
@@ -5908,7 +5917,7 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 
 	bioc->fs_info = fs_info;
-	bioc->tgtdev_map = (int *)(bioc->stripes + total_stripes);
+	bioc->tgtdev_map = (u16 *)(bioc->stripes + total_stripes);
 	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
 
 	return bioc;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 6b7a05f6cf82..fe664512191b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -419,11 +419,54 @@ struct btrfs_io_context {
 	u64 map_type; /* get from map_lookup->type */
 	struct bio *orig_bio;
 	atomic_t error;
-	int max_errors;
-	int num_stripes;
-	int mirror_num;
-	int num_tgtdevs;
-	int *tgtdev_map;
+	u16 max_errors;
+
+	/*
+	 * The total number of stripes, including the extra duplicated
+	 * stripe for replace.
+	 */
+	u16 num_stripes;
+
+	/*
+	 * The mirror_num of this bioc.
+	 *
+	 * This is for reads which uses 0 as mirror_num, thus we should
+	 * return a valid mirror_num (>0) for the reader.
+	 */
+	u16 mirror_num;
+
+	/*
+	 * The following two members are for dev-replace case only.
+	 *
+	 * @num_tgtdevs:	Number of duplicated stripes which needs to be
+	 *			written to replace target.
+	 *			Should be <= 2 (2 for DUP, otherwise <= 1).
+	 * @tgtdev_map:		The array indicates where the duplicated stripes
+	 *			are from. The size is the number of original
+	 *			stripes (num_stripes - num_tgtdevs).
+	 *
+	 * The @tgtdev_map[] array is mostly for RAID56 cases.
+	 * As non-RAID56 stripes share the same contents for the mapped range,
+	 * thus no need to bother where the duplicated ones are from.
+	 *
+	 * But for RAID56 case, all stripes contain different contents, thus
+	 * must need a way to know the mapping.
+	 *
+	 * There is an example for the two members, using a RAID5 write:
+	 * num_stripes:		4 (3 + 1 duplicated write)
+	 * stripes[0]:		dev = devid 1, physical = X
+	 * stripes[1]:		dev = devid 2, physical = Y
+	 * stripes[2]:		dev = devid 3, physical = Z
+	 * stripes[3]:		dev = devid 0, physical = Y
+	 *
+	 * num_tgtdevs = 1
+	 * tgtdev_map[0] = 0	<- Means stripes[0] is not involved in replace.
+	 * tgtdev_map[1] = 3	<- Means stripes[1] is involved in replace,
+	 *			   and it's duplicated to stripes[3].
+	 * tgtdev_map[2] = 0	<- Means stripes[2] is not involved in replace.
+	 */
+	u16 num_tgtdevs;
+	u16 *tgtdev_map;
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
-- 
2.39.1



* [PATCH 3/3] btrfs: use a more space efficient way to represent the source of duplicated stripes
  2023-01-28  8:23 [PATCH 0/3] btrfs: reduce the memory usage for replace in btrfs_io_context Qu Wenruo
  2023-01-28  8:23 ` [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
  2023-01-28  8:23 ` [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
@ 2023-01-28  8:23 ` Qu Wenruo
  2023-01-30  7:40   ` Johannes Thumshirn
  2 siblings, 1 reply; 15+ messages in thread
From: Qu Wenruo @ 2023-01-28  8:23 UTC (permalink / raw)
  To: linux-btrfs

For btrfs dev-replace, we have to duplicate writes to the source
device onto the target device.

For non-RAID56, all writes into the same mapped range share the same
content, thus we don't really need to care which stripe is which.
(E.g. in btrfs_submit_bio(), for a non-RAID56 range we just submit the
 same write to all involved devices.)

But for RAID56, each stripe contains different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.

Currently we use a complex scheme, the tgtdev_map[] array, e.g.:

 num_tgtdevs = 1
 tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
 tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
			 and it's duplicated to stripes[3].
 tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.

But this wastes some space, and ignores one important fact about
dev-replace: there is at most ONE running replace.

Thus we can change it to a fixed array to represent the mapping:

 replace_nr_stripes = 1
 replace_stripe_src[0] = 1    <- Means stripes[1] is involved in replace.
 replace_stripe_src[1] = -1   <- Means the second slot is not used.
                                 (Only DUP can use this slot, and it
                                  doesn't really matter which copy is
                                  the source.)

This saves some space for large RAID56 chunks.
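
As a rough illustration (assuming the int-sized slots used before this
series): for a 16-device RAID6 chunk, the old scheme allocates
16 * sizeof(int) = 64 bytes of tgtdev_map[] per bioc, while the new
scheme needs a fixed sizeof(u16) + 2 * sizeof(s16) = 6 bytes, no matter
how many devices the chunk spans.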

Thus the patch involves the following changes:

- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
  and @replace_stripe_src[2]

  @num_tgtdevs is just renamed to @replace_nr_stripes, while the
  mapping itself is completely changed.

- Add extra ASSERT()s for RAID56 code

- Only add at most two extra stripes for dev-replace cases,
  as we have an upper limit on how many dev-replace stripes we can have.

- Unify the behavior of handle_ops_on_dev_replace()
  Previously handle_ops_on_dev_replace() went down two different paths
  for WRITE and GET_READ_MIRRORS.
  Now unify them by always going down the WRITE path first (with at most
  2 replace stripes), then, if we're doing GET_READ_MIRRORS and have 2
  extra stripes, dropping one stripe.

- Remove the @real_stripes argument from alloc_btrfs_io_context()
  As we no longer need the old variable length array.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c  |  44 +++++++++---
 fs/btrfs/scrub.c   |   4 +-
 fs/btrfs/volumes.c | 170 ++++++++++++++++++++-------------------------
 fs/btrfs/volumes.h |  23 +++---
 4 files changed, 126 insertions(+), 115 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index d095c07a152d..093c8db5720b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -912,7 +912,7 @@ static struct sector_ptr *sector_in_rbio(struct btrfs_raid_bio *rbio,
 static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 					 struct btrfs_io_context *bioc)
 {
-	const unsigned int real_stripes = bioc->num_stripes - bioc->num_tgtdevs;
+	const unsigned int real_stripes = bioc->num_stripes - bioc->replace_nr_stripes;
 	const unsigned int stripe_npages = BTRFS_STRIPE_LEN >> PAGE_SHIFT;
 	const unsigned int num_pages = stripe_npages * real_stripes;
 	const unsigned int stripe_nsectors =
@@ -1275,10 +1275,20 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 			goto error;
 	}
 
-	if (likely(!rbio->bioc->num_tgtdevs))
+	if (likely(!rbio->bioc->replace_nr_stripes))
 		return 0;
 
-	/* Make a copy for the replace target device. */
+	/*
+	 * Make a copy for the replace target device.
+	 *
+	 * Thus the source stripe number (in replace_stripe_src[0])
+	 * should be valid.
+	 * And just in case, make sure the second slot is uninitialized
+	 * (-1), as there can only be one device being replace in RAID56.
+	 */
+	ASSERT(rbio->bioc->replace_stripe_src[0] >= 0);
+	ASSERT(rbio->bioc->replace_stripe_src[1] < 0);
+
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
 		struct sector_ptr *sector;
@@ -1286,7 +1296,12 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 		stripe = total_sector_nr / rbio->stripe_nsectors;
 		sectornr = total_sector_nr % rbio->stripe_nsectors;
 
-		if (!rbio->bioc->tgtdev_map[stripe]) {
+		/*
+		 * For RAID56, there is only one device can be replaced,
+		 * and replace_stripe_src[0] indicates the stripe number
+		 * we need to copy from.
+		 */
+		if (stripe != rbio->bioc->replace_stripe_src[0]) {
 			/*
 			 * We can skip the whole stripe completely, note
 			 * total_sector_nr will be increased by one anyway.
@@ -1309,7 +1324,7 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 		}
 
 		ret = rbio_add_io_sector(rbio, bio_list, sector,
-					 rbio->bioc->tgtdev_map[stripe],
+					 rbio->real_stripes,
 					 sectornr, REQ_OP_WRITE);
 		if (ret)
 			goto error;
@@ -2518,7 +2533,12 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
 	else
 		BUG();
 
-	if (bioc->num_tgtdevs && bioc->tgtdev_map[rbio->scrubp]) {
+	/*
+	 * Replace is running and our P/Q stripe is being replace, then
+	 * we need to duplicated the final write to reaplce target.
+	 */
+	if (bioc->replace_nr_stripes &&
+	    bioc->replace_stripe_src[0] == rbio->scrubp) {
 		is_replace = 1;
 		bitmap_copy(pbitmap, &rbio->dbitmap, rbio->stripe_nsectors);
 	}
@@ -2620,13 +2640,21 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
 	if (!is_replace)
 		goto submit_write;
 
+	/*
+	 * Replace is running and our parity stripe needs to be duplicated
+	 * to target device.
+	 * Checks we have valid source stripe number in the first slot,
+	 * and the second slot is unused.
+	 */
+	ASSERT(rbio->bioc->replace_stripe_src[0] >= 0);
+	ASSERT(rbio->bioc->replace_stripe_src[1] < 0);
 	for_each_set_bit(sectornr, pbitmap, rbio->stripe_nsectors) {
 		struct sector_ptr *sector;
 
 		sector = rbio_stripe_sector(rbio, rbio->scrubp, sectornr);
 		ret = rbio_add_io_sector(rbio, &bio_list, sector,
-				       bioc->tgtdev_map[rbio->scrubp],
-				       sectornr, REQ_OP_WRITE);
+					 rbio->real_stripes,
+					 sectornr, REQ_OP_WRITE);
 		if (ret)
 			goto cleanup;
 	}
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 69c93ae333f6..4ef1611f46f1 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1230,7 +1230,7 @@ static int scrub_handle_errored_block(struct scrub_block *sblock_to_check)
 			sblock_other = sblocks_for_recheck[mirror_index];
 		} else {
 			struct scrub_recover *r = sblock_bad->sectors[0]->recover;
-			int max_allowed = r->bioc->num_stripes - r->bioc->num_tgtdevs;
+			int max_allowed = r->bioc->num_stripes - r->bioc->replace_nr_stripes;
 
 			if (mirror_index >= max_allowed)
 				break;
@@ -1540,7 +1540,7 @@ static int scrub_setup_recheck_block(struct scrub_block *original_sblock,
 						      bioc->map_type,
 						      bioc->raid_map,
 						      bioc->num_stripes -
-						      bioc->num_tgtdevs,
+						      bioc->replace_nr_stripes,
 						      mirror_index,
 						      &stripe_index,
 						      &stripe_offset);
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f00716fbb6cd..790511d2917e 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5885,8 +5885,7 @@ static void sort_parity_stripes(struct btrfs_io_context *bioc, int num_stripes)
 }
 
 static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_info,
-						       int total_stripes,
-						       int real_stripes)
+						       int total_stripes)
 {
 	struct btrfs_io_context *bioc;
 
@@ -5902,8 +5901,6 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 		sizeof(struct btrfs_io_context) +
 		/* Plus the variable array for the stripes */
 		sizeof(struct btrfs_io_stripe) * (total_stripes) +
-		/* Plus the variable array for the tgt dev */
-		sizeof(u16) * (real_stripes) +
 		/*
 		 * Plus the raid_map, which includes both the tgt dev
 		 * and the stripes.
@@ -5917,8 +5914,9 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
 	refcount_set(&bioc->refs, 1);
 
 	bioc->fs_info = fs_info;
-	bioc->tgtdev_map = (u16 *)(bioc->stripes + total_stripes);
-	bioc->raid_map = (u64 *)(bioc->tgtdev_map + real_stripes);
+	bioc->raid_map = (u64 *)(bioc->stripes + total_stripes);
+	bioc->replace_stripe_src[0] = -1;
+	bioc->replace_stripe_src[1] = -1;
 
 	return bioc;
 }
@@ -6183,93 +6181,77 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
 				      int *num_stripes_ret, int *max_errors_ret)
 {
 	u64 srcdev_devid = dev_replace->srcdev->devid;
-	int tgtdev_indexes = 0;
+	/*
+	 * At this stage, num_stripes is still the real number of stripes,
+	 * excluding the duplicated stripes.
+	 */
 	int num_stripes = *num_stripes_ret;
+	int nr_extra_stripes = 0;
 	int max_errors = *max_errors_ret;
 	int i;
 
-	if (op == BTRFS_MAP_WRITE) {
-		int index_where_to_add;
+	/*
+	 * A block group which have "to_copy" set will eventually
+	 * copied by dev-replace process. We can avoid cloning IO here.
+	 */
+	if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
+		return;
+
+	/*
+	 * duplicate the write operations while the dev replace
+	 * procedure is running. Since the copying of the old disk to
+	 * the new disk takes place at run time while the filesystem is
+	 * mounted writable, the regular write operations to the old
+	 * disk have to be duplicated to go to the new disk as well.
+	 *
+	 * Note that device->missing is handled by the caller, and that
+	 * the write to the old disk is already set up in the stripes
+	 * array.
+	 */
+	for (i = 0; i < num_stripes; i++) {
+		struct btrfs_io_stripe *old = &bioc->stripes[i];
+		struct btrfs_io_stripe *new = &bioc->stripes[
+						num_stripes + nr_extra_stripes];
+
+		if (old->dev->devid != srcdev_devid)
+			continue;
+
+		new->physical = old->physical;
+		new->dev = dev_replace->tgtdev;
+		bioc->replace_stripe_src[nr_extra_stripes] = i;
+		nr_extra_stripes++;
+	}
+
+	/* We can only have at most 2 extra nr_stripes (for DUP). */
+	ASSERT(nr_extra_stripes <= 2);
+	/*
+	 * For GET_READ_MIRRORS, we can only return at most 1 extra stripes for
+	 * replace.
+	 * If we have 2 extra stripes, only choose the one with smaller physical.
+	 */
+	if (op == BTRFS_MAP_GET_READ_MIRRORS && nr_extra_stripes == 2) {
+		struct btrfs_io_stripe *first = &bioc->stripes[num_stripes];
+		struct btrfs_io_stripe *second = &bioc->stripes[num_stripes + 1];
+
+		/* Only DUP can have two extra stripes. */
+		ASSERT(bioc->map_type & BTRFS_BLOCK_GROUP_DUP);
 
 		/*
-		 * A block group which have "to_copy" set will eventually
-		 * copied by dev-replace process. We can avoid cloning IO here.
+		 * Swap the last stripe stripes and just reduce
+		 * @nr_extra_stripes.
+		 * The extra stripe would still be there, but won't be
+		 * accessed.
 		 */
-		if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
-			return;
-
-		/*
-		 * duplicate the write operations while the dev replace
-		 * procedure is running. Since the copying of the old disk to
-		 * the new disk takes place at run time while the filesystem is
-		 * mounted writable, the regular write operations to the old
-		 * disk have to be duplicated to go to the new disk as well.
-		 *
-		 * Note that device->missing is handled by the caller, and that
-		 * the write to the old disk is already set up in the stripes
-		 * array.
-		 */
-		index_where_to_add = num_stripes;
-		for (i = 0; i < num_stripes; i++) {
-			if (bioc->stripes[i].dev->devid == srcdev_devid) {
-				/* write to new disk, too */
-				struct btrfs_io_stripe *new =
-					bioc->stripes + index_where_to_add;
-				struct btrfs_io_stripe *old =
-					bioc->stripes + i;
-
-				new->physical = old->physical;
-				new->dev = dev_replace->tgtdev;
-				bioc->tgtdev_map[i] = index_where_to_add;
-				index_where_to_add++;
-				max_errors++;
-				tgtdev_indexes++;
-			}
-		}
-		num_stripes = index_where_to_add;
-	} else if (op == BTRFS_MAP_GET_READ_MIRRORS) {
-		int index_srcdev = 0;
-		int found = 0;
-		u64 physical_of_found = 0;
-
-		/*
-		 * During the dev-replace procedure, the target drive can also
-		 * be used to read data in case it is needed to repair a corrupt
-		 * block elsewhere. This is possible if the requested area is
-		 * left of the left cursor. In this area, the target drive is a
-		 * full copy of the source drive.
-		 */
-		for (i = 0; i < num_stripes; i++) {
-			if (bioc->stripes[i].dev->devid == srcdev_devid) {
-				/*
-				 * In case of DUP, in order to keep it simple,
-				 * only add the mirror with the lowest physical
-				 * address
-				 */
-				if (found &&
-				    physical_of_found <= bioc->stripes[i].physical)
-					continue;
-				index_srcdev = i;
-				found = 1;
-				physical_of_found = bioc->stripes[i].physical;
-			}
-		}
-		if (found) {
-			struct btrfs_io_stripe *tgtdev_stripe =
-				bioc->stripes + num_stripes;
-
-			tgtdev_stripe->physical = physical_of_found;
-			tgtdev_stripe->dev = dev_replace->tgtdev;
-			bioc->tgtdev_map[index_srcdev] = num_stripes;
-
-			tgtdev_indexes++;
-			num_stripes++;
+		if (first->physical > second->physical) {
+			swap(second->physical, first->physical);
+			swap(second->dev, first->dev);
+			nr_extra_stripes--;
 		}
 	}
 
-	*num_stripes_ret = num_stripes;
-	*max_errors_ret = max_errors;
-	bioc->num_tgtdevs = tgtdev_indexes;
+	*num_stripes_ret = num_stripes + nr_extra_stripes;
+	*max_errors_ret = max_errors + nr_extra_stripes;
+	bioc->replace_nr_stripes = nr_extra_stripes;
 }
 
 static bool need_full_stripe(enum btrfs_map_op op)
@@ -6390,7 +6372,6 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	int mirror_num = (mirror_num_ret ? *mirror_num_ret : 0);
 	int num_stripes;
 	int max_errors = 0;
-	int tgtdev_indexes = 0;
 	struct btrfs_io_context *bioc = NULL;
 	struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace;
 	int dev_replace_is_ongoing = 0;
@@ -6540,13 +6521,16 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	}
 
 	num_alloc_stripes = num_stripes;
-	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) {
-		if (op == BTRFS_MAP_WRITE)
-			num_alloc_stripes <<= 1;
-		if (op == BTRFS_MAP_GET_READ_MIRRORS)
-			num_alloc_stripes++;
-		tgtdev_indexes = num_stripes;
-	}
+	if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL &&
+	    op != BTRFS_MAP_READ)
+		/*
+		 * For replace case, we need to add extra stripes for extra
+		 * duplicated stripes.
+		 *
+		 * For both WRITE and GET_READ_MIRRORS, we may have
+		 * at most 2 more stripes (DUP types, otherwise 1).
+		 */
+		num_alloc_stripes += 2;
 
 	/*
 	 * If this I/O maps to a single device, try to return the device and
@@ -6571,11 +6555,12 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 		goto out;
 	}
 
-	bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes, tgtdev_indexes);
+	bioc = alloc_btrfs_io_context(fs_info, num_alloc_stripes);
 	if (!bioc) {
 		ret = -ENOMEM;
 		goto out;
 	}
+	bioc->map_type = map->type;
 
 	for (i = 0; i < num_stripes; i++) {
 		set_io_stripe(&bioc->stripes[i], map, stripe_index, stripe_offset,
@@ -6616,7 +6601,6 @@ int __btrfs_map_block(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 	}
 
 	*bioc_ret = bioc;
-	bioc->map_type = map->type;
 	bioc->num_stripes = num_stripes;
 	bioc->max_errors = max_errors;
 	bioc->mirror_num = mirror_num;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index fe664512191b..7288fd476e40 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -438,14 +438,13 @@ struct btrfs_io_context {
 	/*
 	 * The following two members are for dev-replace case only.
 	 *
-	 * @num_tgtdevs:	Number of duplicated stripes which needs to be
+	 * @replace_nr_stripes:	Number of duplicated stripes which needs to be
 	 *			written to replace target.
 	 *			Should be <= 2 (2 for DUP, otherwise <= 1).
-	 * @tgtdev_map:		The array indicates where the duplicated stripes
-	 *			are from. The size is the number of original
-	 *			stripes (num_stripes - num_tgtdevs).
+	 * @replace_stripe_src:	The array indicates where the duplicated stripes
+	 *			are from.
 	 *
-	 * The @tgtdev_map[] array is mostly for RAID56 cases.
+	 * The @replace_stripe_src[] array is mostly for RAID56 cases.
 	 * As non-RAID56 stripes share the same contents for the mapped range,
 	 * thus no need to bother where the duplicated ones are from.
 	 *
@@ -459,14 +458,14 @@ struct btrfs_io_context {
 	 * stripes[2]:		dev = devid 3, physical = Z
 	 * stripes[3]:		dev = devid 0, physical = Y
 	 *
-	 * num_tgtdevs = 1
-	 * tgtdev_map[0] = 0	<- Means stripes[0] is not involved in replace.
-	 * tgtdev_map[1] = 3	<- Means stripes[1] is involved in replace,
-	 *			   and it's duplicated to stripes[3].
-	 * tgtdev_map[2] = 0	<- Means stripes[2] is not involved in replace.
+	 * replace_nr_stripes = 1
+	 * tgtdev_map[0] = 1	<- Means stripes[1] is involved in replace.
+	 * tgtdev_map[1] = -1	<- Means the second slot is not used.
+	 *			   (Only DUP can use this slot, but they
+	 *			    don't really care)
 	 */
-	u16 num_tgtdevs;
-	u16 *tgtdev_map;
+	u16 replace_nr_stripes;
+	s16 replace_stripe_src[2];
 	/*
 	 * logical block numbers for the start of each stripe
 	 * The last one or two are p/q.  These are sorted,
-- 
2.39.1



* Re: [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace()
  2023-01-28  8:23 ` [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
@ 2023-01-30  7:17   ` Johannes Thumshirn
  2023-01-31 11:54   ` Anand Jain
  1 sibling, 0 replies; 15+ messages in thread
From: Johannes Thumshirn @ 2023-01-30  7:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-01-28  8:23 ` [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
@ 2023-01-30  7:27   ` Johannes Thumshirn
  2023-01-30  7:31     ` Qu Wenruo
  2023-02-02  6:28   ` Anand Jain
  1 sibling, 1 reply; 15+ messages in thread
From: Johannes Thumshirn @ 2023-01-30  7:27 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 28.01.23 09:23, Qu Wenruo wrote:
> @@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>  						       int total_stripes,
>  						       int real_stripes)
>  {
> -	struct btrfs_io_context *bioc = kzalloc(
> +	struct btrfs_io_context *bioc;
> +
> +	/*
> +	 * We should have valid number of stripes, larger than U16_MAX(65535)
> +	 * indicates something totally wrong, as our on-disk format is only
> +	 * at u16.
> +	 */
> +	ASSERT(total_stripes < U16_MAX && total_stripes > 0);

Shouldn't we better change 'int total_stripes' and 'int real_stripes' to u16?
That'll imply above ASSERT()


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-01-30  7:27   ` Johannes Thumshirn
@ 2023-01-30  7:31     ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-01-30  7:31 UTC (permalink / raw)
  To: Johannes Thumshirn, Qu Wenruo, linux-btrfs



On 2023/1/30 15:27, Johannes Thumshirn wrote:
> On 28.01.23 09:23, Qu Wenruo wrote:
>> @@ -5888,13 +5888,22 @@ static struct btrfs_io_context *alloc_btrfs_io_context(struct btrfs_fs_info *fs_
>>   						       int total_stripes,
>>   						       int real_stripes)
>>   {
>> -	struct btrfs_io_context *bioc = kzalloc(
>> +	struct btrfs_io_context *bioc;
>> +
>> +	/*
>> +	 * We should have valid number of stripes, larger than U16_MAX(65535)
>> +	 * indicates something totally wrong, as our on-disk format is only
>> +	 * at u16.
>> +	 */
>> +	ASSERT(total_stripes < U16_MAX && total_stripes > 0);
> 
> Shouldn't we better change 'int total_stripes' and 'int real_stripes' to u16?
> That'll imply above ASSERT()

Right, that is way better.

Thanks,
Qu


* Re: [PATCH 3/3] btrfs: use a more space efficient way to represent the source of duplicated stripes
  2023-01-28  8:23 ` [PATCH 3/3] btrfs: use a more space efficient way to represent the source of duplicated stripes Qu Wenruo
@ 2023-01-30  7:40   ` Johannes Thumshirn
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Thumshirn @ 2023-01-30  7:40 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 28.01.23 09:23, Qu Wenruo wrote:
>  
> -		if (!rbio->bioc->tgtdev_map[stripe]) {
> +		/*
> +		 * For RAID56, there is only one device can be replaced,

Nit: one device _that_ can be 

>  		ret = rbio_add_io_sector(rbio, bio_list, sector,
> -					 rbio->bioc->tgtdev_map[stripe],
> +					 rbio->real_stripes,
>  					 sectornr, REQ_OP_WRITE);
>  		if (ret)
>  			goto error;
> @@ -2518,7 +2533,12 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
>  	else
>  		BUG();
>  
> -	if (bioc->num_tgtdevs && bioc->tgtdev_map[rbio->scrubp]) {
> +	/*
> +	 * Replace is running and our P/Q stripe is being replace, then

being replaced.

> +	 * we need to duplicated the final write to reaplce target.

s/reaplce/replace/

> @@ -2620,13 +2640,21 @@ static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
>  	if (!is_replace)
>  		goto submit_write;
>  
> +	/*
> +	 * Replace is running and our parity stripe needs to be duplicated
> +	 * to target device.

to _the_ target device.

> +	 * Checks we have valid source stripe number in the first slot,
> +	 * and the second slot is unused.
> +	 */
> +	ASSERT(rbio->bioc->replace_stripe_src[0] >= 0);

> @@ -6183,93 +6181,77 @@ static void handle_ops_on_dev_replace(enum btrfs_map_op op,
>  				      int *num_stripes_ret, int *max_errors_ret)
>  {
>  	u64 srcdev_devid = dev_replace->srcdev->devid;
> -	int tgtdev_indexes = 0;
> +	/*
> +	 * At this stage, num_stripes is still the real number of stripes,
> +	 * excluding the duplicated stripes.
> +	 */
>  	int num_stripes = *num_stripes_ret;
> +	int nr_extra_stripes = 0;
>  	int max_errors = *max_errors_ret;
>  	int i;
>  
> -	if (op == BTRFS_MAP_WRITE) {
> -		int index_where_to_add;
> +	/*
> +	 * A block group which have "to_copy" set will eventually
s/have/has/

> +	 * copied by dev-replace process. We can avoid cloning IO here.

be copied by the dev-replace process.

> +	 */
> +	if (is_block_group_to_copy(dev_replace->srcdev->fs_info, logical))
> +		return;
> +
> +	/*
> +	 * duplicate the write operations while the dev replace

Duplicate

> +	 * procedure is running. Since the copying of the old disk to



* Re: [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace()
  2023-01-28  8:23 ` [PATCH 1/3] btrfs: simplify the @bioc argument for handle_ops_on_dev_replace() Qu Wenruo
  2023-01-30  7:17   ` Johannes Thumshirn
@ 2023-01-31 11:54   ` Anand Jain
  1 sibling, 0 replies; 15+ messages in thread
From: Anand Jain @ 2023-01-31 11:54 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

LGTM.

Reviewed-by: Anand Jain <anand.jain@oracle.com>



* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-01-28  8:23 ` [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure Qu Wenruo
  2023-01-30  7:27   ` Johannes Thumshirn
@ 2023-02-02  6:28   ` Anand Jain
  2023-02-02  6:47     ` Qu Wenruo
  1 sibling, 1 reply; 15+ messages in thread
From: Anand Jain @ 2023-02-02  6:28 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



> +	bioc = kzalloc(
>   		 /* The size of btrfs_io_context */
>   		sizeof(struct btrfs_io_context) +
>   		/* Plus the variable array for the stripes */
>   		sizeof(struct btrfs_io_stripe) * (total_stripes) +
>   		/* Plus the variable array for the tgt dev */
> -		sizeof(int) * (real_stripes) +
> +		sizeof(u16) * (real_stripes) +
>   		/*
>   		 * Plus the raid_map, which includes both the tgt dev
>   		 * and the stripes.

Why not order the various sizeof() in the same manner as in the struct 
btrfs_io_context?

Thx.


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-02-02  6:28   ` Anand Jain
@ 2023-02-02  6:47     ` Qu Wenruo
  2023-02-02  9:12       ` Anand Jain
  2023-02-03  6:19       ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-02  6:47 UTC (permalink / raw)
  To: Anand Jain, linux-btrfs



On 2023/2/2 14:28, Anand Jain wrote:
> 
> 
>> +    bioc = kzalloc(
>>            /* The size of btrfs_io_context */
>>           sizeof(struct btrfs_io_context) +
>>           /* Plus the variable array for the stripes */
>>           sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>           /* Plus the variable array for the tgt dev */
>> -        sizeof(int) * (real_stripes) +
>> +        sizeof(u16) * (real_stripes) +
>>           /*
>>            * Plus the raid_map, which includes both the tgt dev
>>            * and the stripes.
> 
> Why not order the various sizeof() in the same manner as in the struct 
> btrfs_io_context?

Because the tgtdev_map would soon get completely removed in the next patch.

Just to mention, I don't like the current way to allocate memory at all.

If there is more feedback, I can convert the allocation to the same way 
as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for 
those arrays.

Thanks,
Qu

> 
> Thx.


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-02-02  6:47     ` Qu Wenruo
@ 2023-02-02  9:12       ` Anand Jain
  2023-02-07  2:19         ` Qu Wenruo
  2023-02-03  6:19       ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Anand Jain @ 2023-02-02  9:12 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs



On 2/2/23 14:47, Qu Wenruo wrote:
> 
> 
> On 2023/2/2 14:28, Anand Jain wrote:
>>
>>
>>> +    bioc = kzalloc(
>>>            /* The size of btrfs_io_context */
>>>           sizeof(struct btrfs_io_context) +
>>>           /* Plus the variable array for the stripes */
>>>           sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>>           /* Plus the variable array for the tgt dev */
>>> -        sizeof(int) * (real_stripes) +
>>> +        sizeof(u16) * (real_stripes) +
>>>           /*
>>>            * Plus the raid_map, which includes both the tgt dev
>>>            * and the stripes.
>>
>> Why not order the various sizeof() in the same manner as in the struct 
>> btrfs_io_context?
> 
> Because the tgtdev_map would soon get completely removed in the next patch.
> 

Ah. Ok.

> Just to mention, I don't like the current way to allocate memory at all.
> 
> If there is more feedback, I can convert the allocation to the same way 
> as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for 
> those arrays.

The alloc_rbio() method allocates each component independently, leading
to frequent memory allocation and freeing, which could impact 
performance in an IO context.

It may be a good idea to conduct a small experiment to assess any
performance penalties. If none are detected, then it should be OK.

Thx,


> 
> Thanks,
> Qu
> 
>>
>> Thx.


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-02-02  6:47     ` Qu Wenruo
  2023-02-02  9:12       ` Anand Jain
@ 2023-02-03  6:19       ` Christoph Hellwig
  2023-02-03  6:44         ` Qu Wenruo
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2023-02-03  6:19 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Anand Jain, linux-btrfs

On Thu, Feb 02, 2023 at 02:47:13PM +0800, Qu Wenruo wrote:
> Because the tgtdev_map would soon get completely removed in the next patch.
> 
> Just to mention, I don't like the current way to allocate memory at all.
> 
> If there is more feedback, I can convert the allocation to the same way as
> alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for those
> arrays.

The real elephant in the room is that the io_context needs to use a
mempool anyway to be deadlock safe.  Which means we'll need to size
for the worst case (per file system?) here.  Fortunately the read
fast path now doesn't use the io_context at all which helps with
the different sizing optimizations.


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-02-03  6:19       ` Christoph Hellwig
@ 2023-02-03  6:44         ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-03  6:44 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Anand Jain, linux-btrfs



On 2023/2/3 14:19, Christoph Hellwig wrote:
> On Thu, Feb 02, 2023 at 02:47:13PM +0800, Qu Wenruo wrote:
>> Because the tgtdev_map would soon get completely removed in the next patch.
>>
>> Just to mention, I don't like the current way to allocate memory at all.
>>
>> If there is more feedback, I can convert the allocation to the same way as
>> alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls for those
>> arrays.
> 
> The real elephant in the room is that the io_context needs to use a
> mempool anyway to be deadlock safe.

In fact, that makes it more sensible to migrate to the raid56 method.

We use a mempool for the io_context structure itself, then still do
regular memory allocation for the stripes (and the raid56 map).

Furthermore we can skip the raid56_map when we don't need it;
currently we're wasting memory on it for non-raid56 profiles.

And with my current work to make that tgtdev_map static (currently two
u16, but in the next version it would be just one u16), for non-raid56
cases we would just do one mempool alloc (io_context) + one regular
alloc (stripes), as sketched below.

Which I guess won't be that bad compared to the existing one.
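
A minimal sketch of that split (hypothetical names, not the actual
btrfs code; it assumes a 'stripes' pointer member rather than the
flexible array):

  #include <linux/mempool.h>
  #include <linux/slab.h>

  static mempool_t *bioc_pool;

  /* Size the pool for the fixed part only; the sketch assumes two
   * reserved elements are enough to guarantee forward progress. */
  int bioc_pool_init(void)
  {
  	bioc_pool = mempool_create_kmalloc_pool(2,
  			sizeof(struct btrfs_io_context));
  	return bioc_pool ? 0 : -ENOMEM;
  }

  struct btrfs_io_context *bioc_alloc(u16 num_stripes)
  {
  	struct btrfs_io_context *bioc;

  	/* mempool_alloc() with a sleeping gfp mask never fails. */
  	bioc = mempool_alloc(bioc_pool, GFP_NOFS);
  	memset(bioc, 0, sizeof(*bioc));

  	/* The variable part is still a regular allocation. */
  	bioc->stripes = kcalloc(num_stripes,
  				sizeof(struct btrfs_io_stripe), GFP_NOFS);
  	if (!bioc->stripes) {
  		mempool_free(bioc, bioc_pool);
  		return NULL;
  	}
  	bioc->num_stripes = num_stripes;
  	return bioc;
  }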

Thanks,
Qu
>  Which means we'll need to size
> for the worst case (per file system?) here.  Fortunately the read
> fast path now doesn't use the io_context at all which helps with
> the different sizing optimizations.


* Re: [PATCH 2/3] btrfs: small improvement for btrfs_io_context structure
  2023-02-02  9:12       ` Anand Jain
@ 2023-02-07  2:19         ` Qu Wenruo
  0 siblings, 0 replies; 15+ messages in thread
From: Qu Wenruo @ 2023-02-07  2:19 UTC (permalink / raw)
  To: Anand Jain, Qu Wenruo, linux-btrfs



On 2023/2/2 17:12, Anand Jain wrote:
> 
> 
> On 2/2/23 14:47, Qu Wenruo wrote:
>>
>>
>> On 2023/2/2 14:28, Anand Jain wrote:
>>>
>>>
>>>> +    bioc = kzalloc(
>>>>            /* The size of btrfs_io_context */
>>>>           sizeof(struct btrfs_io_context) +
>>>>           /* Plus the variable array for the stripes */
>>>>           sizeof(struct btrfs_io_stripe) * (total_stripes) +
>>>>           /* Plus the variable array for the tgt dev */
>>>> -        sizeof(int) * (real_stripes) +
>>>> +        sizeof(u16) * (real_stripes) +
>>>>           /*
>>>>            * Plus the raid_map, which includes both the tgt dev
>>>>            * and the stripes.
>>>
>>> Why not order the various sizeof() in the same manner as in the 
>>> struct btrfs_io_context?
>>
>> Because the tgtdev_map would soon get completely removed in the next 
>> patch.
>>
> 
> Ah. Ok.
> 
>> Just to mention, I don't like the current way to allocate memory at all.
>>
>> If there is more feedback, I can convert the allocation to the same 
>> way as alloc_rbio() of raid56.c, AKA, use dedicated kcalloc() calls 
>> for those arrays.
> 
> The alloc_rbio() method allocates each component independently, leading
> to frequent memory allocation and freeing, which could impact 
> performance in an IO context.

That method is fine if we only have one variable length array (stripes).

But it won't be a problem anymore; patch 3/3 makes dev-replace take
only one s16 to record the replace source.

Then I have a new patch for this series, which converts *raid_map into
a single u64.

With all that work, the btrfs_io_context structure would only have one
variable length array (stripes), and then the existing method would be
acceptable.

The final result would make btrfs_io_context look like this: (comments 
skipped, as the comments would be larger than the structure itself)

struct btrfs_io_context {
         refcount_t refs;
         struct btrfs_fs_info *fs_info;
         u64 map_type; /* get from map_lookup->type */
         struct bio *orig_bio;
         atomic_t error;
         u16 max_errors;
         u16 num_stripes;
         u16 mirror_num;

         u16 replace_nr_stripes;
         s16 replace_stripe_src;

         u64 full_stripe_logical;
         struct btrfs_io_stripe stripes[];
};

> 
> It may be a good idea to conduct a small experiment to assess any
> performance penalties. If none are detected, then it should be OK.

If HCH or someone else is going to make btrfs_io_context mempool based,
then I guess we have to go the other way: mempool for btrfs_io_context,
but still manually allocate the stripes.

Anyway the updated series would benefit everyone (at least I hope so).

Thanks,
Qu

> 
> Thx,
> 
> 
>>
>> Thanks,
>> Qu
>>
>>>
>>> Thx.

