linux-btrfs.vger.kernel.org archive mirror
* [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload
@ 2022-11-01 11:16 Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 01/12] btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() Qu Wenruo
                   ` (12 more replies)
  0 siblings, 13 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

[CHANGELOG]
v2:
- Add coverage for the raid56 recover and scrub paths
  In fact, with full coverage we can do better cleanups, which is
  reflected in the changed line counts.

- Better naming schemes
  We now have 3 main entrances:
  * recover_rbio()
  * rmw_rbio()
  * scrub_rbio()

  And their work-related functions are called:
  * {recover|rmw|scrub}_rbio_work()
  * {recover|rmw|scrub}_rbio_work_locked()

  The latter is for unlock_stripe() to call, where we already hold the
  full stripe lock.

- More extracted helpers
  Now we have the following helpers:
  * {recover|rmw|scrub}_assemble_{read|write}_bios()
  * submit_read_bios()
  * submit_write_bios()

Currently btrfs raid56 makes very chaotic jumps between functions, driven
by endio callbacks:

E.g. for raid56_parity_write(), if we take the sub-stripe path, the
function jumps are:

  raid56_parity_write()
   |
   v
  __raid56_parity_write()
   |
   v
  partial_stripe_write()
   |
   v
  start_async_work(rmw_work) <<< Delayed work
   |
   v
  raid56_rmw_stripe()
   |
   v
  raid56_rmw_end_io_work() <<< End io jump
   |
   v
  validate_rbio_for_rmw()
   |
   v
  finish_rmw()
   |
   v
  raid_write_end_io() <<< End io jump
   |
   v
  rbio_orig_end_io()

During a simple sub-stripe write, we have to go through over 10
functions, two end_io jumps, and at least one delayed work.

With this patchset, we only go like this:

  raid56_parity_write()
   |
   v
  rmw_rbio_work() <<< Delayed work
   |
   v
  rmw_rbio()
   |
   v
  rbio_orig_end_io()

And inside rmw_rbio(), there are no more end_io jumps or extra delayed
work.
Everything except the IO is single threaded, and the IO is just
submit-and-wait.
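
The core idiom, condensed here from the diffs in patch 5 and patch 8,
is an atomic in-flight counter plus a waitqueue.  The end_io handler
only accounts the finished bio and wakes the submitter, it no longer
drives the next stage (abridged sketch, not the complete code):

  /* Read end_io: account the bio and wake the waiter, nothing else. */
  static void raid_wait_read_end_io(struct bio *bio)
  {
          struct btrfs_raid_bio *rbio = bio->bi_private;

          if (bio->bi_status)
                  fail_bio_stripe(rbio, bio);
          else
                  set_bio_pages_uptodate(rbio, bio);

          bio_put(bio);
          atomic_dec(&rbio->stripes_pending);
          wake_up(&rbio->io_wait);
  }

  /* The submitter stays in one function and simply waits: */
  submit_read_bios(rbio, &bio_list);
  wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);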

This patchset will rework the raid56 write, recover and scrub paths
by:

- Introduce a single entrance for recover/rmw/scrub.
  The main function will be called {recover|rmw|scrub}_rbio(),
  executed in the rmw_workers workqueue (see the rmw_rbio() outline
  after this list).

- Unified handling for all writes (full/sub-stripe, cached/non-cached,
  degraded or not), recover (dedicated, and as part of the write/scrub
  paths) and scrub.

- No special handling for cases where we can skip some work
  E.g. for a sub-stripe write, if we have a cached rbio, we can skip
  the read.

  Now we just assemble the read bio list, submit all bios (none in
  this case) and wait for the IO to finish.

  Since we submitted zero bios, rbio::stripes_pending is 0 and the
  wait returns immediately, with no extra handling needed.

- No more need for end_io_work or raid56_endio_workers, since we no
  longer rely on endio functions to jump to the next step.
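
As promised above, after this series the whole write path reads top to
bottom inside a single function.  The following outline is abridged
from rmw_rbio() in patch 8 (error handling and the stripe locking
around the write are trimmed for brevity):

  int rmw_rbio(struct btrfs_raid_bio *rbio)
  {
          /* P/Q pages are needed for both full and sub-stripe writes. */
          ret = alloc_rbio_parity_pages(rbio);

          /* Full stripe write, no need to read anything back. */
          if (rbio_is_full(rbio))
                  goto write;

          /* Sub-stripe write: read the missing data, recover if needed. */
          ret = alloc_rbio_data_pages(rbio);
          ret = rmw_read_and_wait(rbio);
          if (rbio->faila >= 0 || rbio->failb >= 0)
                  ret = recover_rbio(rbio);
  write:
          for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
                  generate_pq_vertical(rbio, sectornr);

          ret = rmw_assemble_write_bios(rbio, &bio_list);
          submit_write_bios(rbio, &bio_list);
          wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
          return ret;
  }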

With this, we have a unified entrance for all raid56 writes, and no
extra jumping/workqueue mess to interrupt the workflow.

This should make the later destructive RMW fix much easier to add, as
the timing of each step in the RMW cycle becomes easy to grasp.

Thus I hope this series can be merged before the earlier RFC series
for the destructive RMW fix.

Qu Wenruo (12):
  btrfs: raid56: extract the vertical stripe recovery code into
    recover_vertical()
  btrfs: raid56: extract the pq generation code into a helper
  btrfs: raid56: extract the recovery bio list build code into a helper
  btrfs: raid56: extract sector recovery code into a helper
  btrfs: raid56: switch recovery path to a single function
  btrfs: raid56: extract the rmw bio list build code into a helper
  btrfs: raid56: extract rmw write bios assembly into a helper
  btrfs: raid56: introduce a main entrance for the rmw path
  btrfs: raid56: switch write path to rmw_rbio()
  btrfs: raid56: extract scrub read bio list assembly code into a helper
  btrfs: raid56: switch scrub path to use a single function
  btrfs: remove the unused btrfs_fs_info::endio_raid56_workers and
    btrfs_raid_bio::end_io_work

 fs/btrfs/disk-io.c |    6 +-
 fs/btrfs/fs.h      |    1 -
 fs/btrfs/raid56.c  | 1422 ++++++++++++++++++++------------------------
 fs/btrfs/raid56.h  |    2 +-
 4 files changed, 631 insertions(+), 800 deletions(-)

-- 
2.38.1



* [PATCH v2 01/12] btrfs: raid56: extract the vertical stripe recovery code into recover_vertical()
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 02/12] btrfs: raid56: extract the pq generation code into a helper Qu Wenruo
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

This refactor starts with the following behavior change:

- Don't error out if only P/Q is corrupted

  The old code will directly error out if only P/Q is corrupted.
  Although it is a logical error if we go into the rebuild path with
  only P/Q corrupted, there is no need to error out.

  Just skip the rebuild and return the already good data.
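
  In recover_vertical() the P/Q-only corruption case now simply skips
  the vertical stripe instead of returning an error (abridged from the
  diff below; faila/failb are already sorted at this point):

    /* Only P and Q are corrupted, the data stripes are already good. */
    if (rbio->bioc->raid_map[faila] == RAID5_P_STRIPE &&
        rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE)
            goto cleanup;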

Then comes the following refactor which shouldn't cause behavior
changes:

- Introduce a helper to do vertical stripe recovery

  This not only reduces one indent level, but also paves the road for
  later data checksum verification in RMW cycles.

- Sort rbio->faila/b before recovery

  So we don't need to do the same swap for every vertical stripe.

- Replace a BUG_ON() with ASSERT()

  Or checkpatch won't let me pass.

- Mark the repaired sectors uptodate at the end of recover_vertical()

- Do the cleanup for pointers unconditionally

  We only need to initialize @pointers and @unmap_array to NULL, so
  we can safely free them unconditionally.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 285 ++++++++++++++++++++++++----------------------
 1 file changed, 149 insertions(+), 136 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 349270537c2e..85b883a21baa 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1886,6 +1886,144 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 	bio_endio(bio);
 }
 
+/*
+ * Recover a vertical stripe specified by @sector_nr.
+ * @*pointers are the pre-allocated pointers by the caller, so we don't
+ * need to allocate/free the pointers again and again.
+ */
+static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
+			     void **pointers, void **unmap_array)
+{
+	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
+	struct sector_ptr *sector;
+	const u32 sectorsize = fs_info->sectorsize;
+	const int faila = rbio->faila;
+	const int failb = rbio->failb;
+	int stripe_nr;
+
+	/*
+	 * Now we just use bitmap to mark the horizontal stripes in
+	 * which we have data when doing parity scrub.
+	 */
+	if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
+	    !test_bit(sector_nr, &rbio->dbitmap))
+		return;
+
+	/*
+	 * Setup our array of pointers with sectors from each stripe
+	 *
+	 * NOTE: store a duplicate array of pointers to preserve the
+	 * pointer order.
+	 */
+	for (stripe_nr = 0; stripe_nr < rbio->real_stripes; stripe_nr++) {
+		/*
+		 * If we're rebuilding a read, we have to use
+		 * pages from the bio list
+		 */
+		if ((rbio->operation == BTRFS_RBIO_READ_REBUILD ||
+		     rbio->operation == BTRFS_RBIO_REBUILD_MISSING) &&
+		    (stripe_nr == faila || stripe_nr == failb)) {
+			sector = sector_in_rbio(rbio, stripe_nr, sector_nr, 0);
+		} else {
+			sector = rbio_stripe_sector(rbio, stripe_nr, sector_nr);
+		}
+		ASSERT(sector->page);
+		pointers[stripe_nr] = kmap_local_page(sector->page) +
+				   sector->pgoff;
+		unmap_array[stripe_nr] = pointers[stripe_nr];
+	}
+
+	/* All raid6 handling here */
+	if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6) {
+		/* Single failure, rebuild from parity raid5 style */
+		if (failb < 0) {
+			if (faila == rbio->nr_data)
+				/*
+				 * Just the P stripe has failed, without
+				 * a bad data or Q stripe.
+				 * We have nothing to do, just skip the
+				 * recovery for this stripe.
+				 */
+				goto cleanup;
+			/*
+			 * a single failure in raid6 is rebuilt
+			 * in the pstripe code below
+			 */
+			goto pstripe;
+		}
+
+		/*
+		 * If the q stripe is failed, do a pstripe reconstruction from
+		 * the xors.
+		 * If both the q stripe and the P stripe are failed, we're
+		 * here due to a crc mismatch and we can't give them the
+		 * data they want.
+		 */
+		if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) {
+			if (rbio->bioc->raid_map[faila] ==
+			    RAID5_P_STRIPE)
+				/*
+				 * Only P and Q are corrupted.
+				 * We only care about data stripes recovery,
+				 * can skip this vertical stripe.
+				 */
+				goto cleanup;
+			/*
+			 * Otherwise we have one bad data stripe and
+			 * a good P stripe.  raid5!
+			 */
+			goto pstripe;
+		}
+
+		if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) {
+			raid6_datap_recov(rbio->real_stripes, sectorsize,
+					  faila, pointers);
+		} else {
+			raid6_2data_recov(rbio->real_stripes, sectorsize,
+					  faila, failb, pointers);
+		}
+	} else {
+		void *p;
+
+		/* Rebuild from P stripe here (raid5 or raid6). */
+		ASSERT(failb == -1);
+pstripe:
+		/* Copy parity block into failed block to start with */
+		memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize);
+
+		/* Rearrange the pointer array */
+		p = pointers[faila];
+		for (stripe_nr = faila; stripe_nr < rbio->nr_data - 1;
+		     stripe_nr++)
+			pointers[stripe_nr] = pointers[stripe_nr + 1];
+		pointers[rbio->nr_data - 1] = p;
+
+		/* Xor in the rest */
+		run_xor(pointers, rbio->nr_data - 1, sectorsize);
+
+	}
+
+	/*
+	 * No matter if this is a RMW or recovery, we should have all
+	 * failed sectors repaired in the vertical stripe, thus they are now
+	 * uptodate.
+	 * Especially if we determine to cache the rbio, we need to
+	 * have at least all data sectors uptodate.
+	 */
+	if (rbio->faila >= 0) {
+		sector = rbio_stripe_sector(rbio, rbio->faila, sector_nr);
+		sector->uptodate = 1;
+	}
+	if (rbio->failb >= 0) {
+		sector = rbio_stripe_sector(rbio, rbio->failb, sector_nr);
+		sector->uptodate = 1;
+	}
+
+cleanup:
+	for (stripe_nr = rbio->real_stripes - 1; stripe_nr >= 0; stripe_nr--)
+		kunmap_local(unmap_array[stripe_nr]);
+}
+
 /*
  * all parity reconstruction happens here.  We've read in everything
  * we can find from the drives and this does the heavy lifting of
@@ -1893,13 +2031,10 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
  */
 static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 {
-	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	int sectornr, stripe;
-	void **pointers;
-	void **unmap_array;
-	int faila = -1, failb = -1;
+	int sectornr;
+	void **pointers = NULL;
+	void **unmap_array = NULL;
 	blk_status_t err;
-	int i;
 
 	/*
 	 * This array stores the pointer for each sector, thus it has the extra
@@ -1908,7 +2043,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
 	if (!pointers) {
 		err = BLK_STS_RESOURCE;
-		goto cleanup_io;
+		goto cleanup;
 	}
 
 	/*
@@ -1918,11 +2053,12 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
 	if (!unmap_array) {
 		err = BLK_STS_RESOURCE;
-		goto cleanup_pointers;
+		goto cleanup;
 	}
 
-	faila = rbio->faila;
-	failb = rbio->failb;
+	/* Make sure faila and failb are in order. */
+	if (rbio->faila >= 0 && rbio->failb >= 0 && rbio->faila > rbio->failb)
+		swap(rbio->faila, rbio->failb);
 
 	if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
 	    rbio->operation == BTRFS_RBIO_REBUILD_MISSING) {
@@ -1933,138 +2069,15 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 
 	index_rbio_pages(rbio);
 
-	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
-		struct sector_ptr *sector;
-
-		/*
-		 * Now we just use bitmap to mark the horizontal stripes in
-		 * which we have data when doing parity scrub.
-		 */
-		if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB &&
-		    !test_bit(sectornr, &rbio->dbitmap))
-			continue;
-
-		/*
-		 * Setup our array of pointers with sectors from each stripe
-		 *
-		 * NOTE: store a duplicate array of pointers to preserve the
-		 * pointer order
-		 */
-		for (stripe = 0; stripe < rbio->real_stripes; stripe++) {
-			/*
-			 * If we're rebuilding a read, we have to use
-			 * pages from the bio list
-			 */
-			if ((rbio->operation == BTRFS_RBIO_READ_REBUILD ||
-			     rbio->operation == BTRFS_RBIO_REBUILD_MISSING) &&
-			    (stripe == faila || stripe == failb)) {
-				sector = sector_in_rbio(rbio, stripe, sectornr, 0);
-			} else {
-				sector = rbio_stripe_sector(rbio, stripe, sectornr);
-			}
-			ASSERT(sector->page);
-			pointers[stripe] = kmap_local_page(sector->page) +
-					   sector->pgoff;
-			unmap_array[stripe] = pointers[stripe];
-		}
-
-		/* All raid6 handling here */
-		if (rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6) {
-			/* Single failure, rebuild from parity raid5 style */
-			if (failb < 0) {
-				if (faila == rbio->nr_data) {
-					/*
-					 * Just the P stripe has failed, without
-					 * a bad data or Q stripe.
-					 * TODO, we should redo the xor here.
-					 */
-					err = BLK_STS_IOERR;
-					goto cleanup;
-				}
-				/*
-				 * a single failure in raid6 is rebuilt
-				 * in the pstripe code below
-				 */
-				goto pstripe;
-			}
-
-			/* make sure our ps and qs are in order */
-			if (faila > failb)
-				swap(faila, failb);
-
-			/* if the q stripe is failed, do a pstripe reconstruction
-			 * from the xors.
-			 * If both the q stripe and the P stripe are failed, we're
-			 * here due to a crc mismatch and we can't give them the
-			 * data they want
-			 */
-			if (rbio->bioc->raid_map[failb] == RAID6_Q_STRIPE) {
-				if (rbio->bioc->raid_map[faila] ==
-				    RAID5_P_STRIPE) {
-					err = BLK_STS_IOERR;
-					goto cleanup;
-				}
-				/*
-				 * otherwise we have one bad data stripe and
-				 * a good P stripe.  raid5!
-				 */
-				goto pstripe;
-			}
-
-			if (rbio->bioc->raid_map[failb] == RAID5_P_STRIPE) {
-				raid6_datap_recov(rbio->real_stripes,
-						  sectorsize, faila, pointers);
-			} else {
-				raid6_2data_recov(rbio->real_stripes,
-						  sectorsize, faila, failb,
-						  pointers);
-			}
-		} else {
-			void *p;
-
-			/* rebuild from P stripe here (raid5 or raid6) */
-			BUG_ON(failb != -1);
-pstripe:
-			/* Copy parity block into failed block to start with */
-			memcpy(pointers[faila], pointers[rbio->nr_data], sectorsize);
-
-			/* rearrange the pointer array */
-			p = pointers[faila];
-			for (stripe = faila; stripe < rbio->nr_data - 1; stripe++)
-				pointers[stripe] = pointers[stripe + 1];
-			pointers[rbio->nr_data - 1] = p;
-
-			/* xor in the rest */
-			run_xor(pointers, rbio->nr_data - 1, sectorsize);
-		}
-
-		/*
-		 * No matter if this is a RMW or recovery, we should have all
-		 * failed sectors repaired, thus they are now uptodate.
-		 * Especially if we determine to cache the rbio, we need to
-		 * have at least all data sectors uptodate.
-		 */
-		for (i = 0;  i < rbio->stripe_nsectors; i++) {
-			if (faila != -1) {
-				sector = rbio_stripe_sector(rbio, faila, i);
-				sector->uptodate = 1;
-			}
-			if (failb != -1) {
-				sector = rbio_stripe_sector(rbio, failb, i);
-				sector->uptodate = 1;
-			}
-		}
-		for (stripe = rbio->real_stripes - 1; stripe >= 0; stripe--)
-			kunmap_local(unmap_array[stripe]);
-	}
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
+		recover_vertical(rbio, sectornr, pointers, unmap_array);
 
 	err = BLK_STS_OK;
+
 cleanup:
 	kfree(unmap_array);
-cleanup_pointers:
 	kfree(pointers);
 
-cleanup_io:
 	/*
 	 * Similar to READ_REBUILD, REBUILD_MISSING at this point also has a
 	 * valid rbio which is consistent with ondisk content, thus such a
-- 
2.38.1



* [PATCH v2 02/12] btrfs: raid56: extract the pq generation code into a helper
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 01/12] btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 03/12] btrfs: raid56: extract the recovery bio list build " Qu Wenruo
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

Currently finish_rmw() will update the P/Q stripes before submitting
the writes.

It's done inside a big loop and is a little congested indent-wise, so
extract the code into a helper called generate_pq_vertical().

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 90 +++++++++++++++++++++++------------------------
 1 file changed, 44 insertions(+), 46 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 85b883a21baa..fe525fe9338f 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1193,6 +1193,48 @@ static void bio_get_trace_info(struct btrfs_raid_bio *rbio, struct bio *bio,
 	trace_info->stripe_nr = -1;
 }
 
+/* Generate PQ for one vertical stripe. */
+static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
+{
+	void **pointers = rbio->finish_pointers;
+	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
+	struct sector_ptr *sector;
+	int stripe;
+	const bool has_qstripe = rbio->bioc->map_type & BTRFS_BLOCK_GROUP_RAID6;
+
+	/* First collect one sector from each data stripe */
+	for (stripe = 0; stripe < rbio->nr_data; stripe++) {
+		sector = sector_in_rbio(rbio, stripe, sectornr, 0);
+		pointers[stripe] = kmap_local_page(sector->page) +
+				   sector->pgoff;
+	}
+
+	/* Then add the parity stripe */
+	sector = rbio_pstripe_sector(rbio, sectornr);
+	sector->uptodate = 1;
+	pointers[stripe++] = kmap_local_page(sector->page) + sector->pgoff;
+
+	if (has_qstripe) {
+		/*
+		 * RAID6, add the qstripe and call the library function
+		 * to fill in our p/q
+		 */
+		sector = rbio_qstripe_sector(rbio, sectornr);
+		sector->uptodate = 1;
+		pointers[stripe++] = kmap_local_page(sector->page) +
+				     sector->pgoff;
+
+		raid6_call.gen_syndrome(rbio->real_stripes, sectorsize,
+					pointers);
+	} else {
+		/* raid5 */
+		memcpy(pointers[rbio->nr_data], pointers[0], sectorsize);
+		run_xor(pointers + 1, rbio->nr_data - 1, sectorsize);
+	}
+	for (stripe = stripe - 1; stripe >= 0; stripe--)
+		kunmap_local(pointers[stripe]);
+}
+
 /*
  * this is called from one of two situations.  We either
  * have a full stripe from the higher layers, or we've read all
@@ -1204,28 +1246,17 @@ static void bio_get_trace_info(struct btrfs_raid_bio *rbio, struct bio *bio,
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
 	struct btrfs_io_context *bioc = rbio->bioc;
-	const u32 sectorsize = bioc->fs_info->sectorsize;
-	void **pointers = rbio->finish_pointers;
-	int nr_data = rbio->nr_data;
 	/* The total sector number inside the full stripe. */
 	int total_sector_nr;
 	int stripe;
 	/* Sector number inside a stripe. */
 	int sectornr;
-	bool has_qstripe;
 	struct bio_list bio_list;
 	struct bio *bio;
 	int ret;
 
 	bio_list_init(&bio_list);
 
-	if (rbio->real_stripes - rbio->nr_data == 1)
-		has_qstripe = false;
-	else if (rbio->real_stripes - rbio->nr_data == 2)
-		has_qstripe = true;
-	else
-		BUG();
-
 	/* We should have at least one data sector. */
 	ASSERT(bitmap_weight(&rbio->dbitmap, rbio->stripe_nsectors));
 
@@ -1258,41 +1289,8 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	else
 		clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
 
-	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++) {
-		struct sector_ptr *sector;
-
-		/* First collect one sector from each data stripe */
-		for (stripe = 0; stripe < nr_data; stripe++) {
-			sector = sector_in_rbio(rbio, stripe, sectornr, 0);
-			pointers[stripe] = kmap_local_page(sector->page) +
-					   sector->pgoff;
-		}
-
-		/* Then add the parity stripe */
-		sector = rbio_pstripe_sector(rbio, sectornr);
-		sector->uptodate = 1;
-		pointers[stripe++] = kmap_local_page(sector->page) + sector->pgoff;
-
-		if (has_qstripe) {
-			/*
-			 * RAID6, add the qstripe and call the library function
-			 * to fill in our p/q
-			 */
-			sector = rbio_qstripe_sector(rbio, sectornr);
-			sector->uptodate = 1;
-			pointers[stripe++] = kmap_local_page(sector->page) +
-					     sector->pgoff;
-
-			raid6_call.gen_syndrome(rbio->real_stripes, sectorsize,
-						pointers);
-		} else {
-			/* raid5 */
-			memcpy(pointers[nr_data], pointers[0], sectorsize);
-			run_xor(pointers + 1, nr_data - 1, sectorsize);
-		}
-		for (stripe = stripe - 1; stripe >= 0; stripe--)
-			kunmap_local(pointers[stripe]);
-	}
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
+		generate_pq_vertical(rbio, sectornr);
 
 	/*
 	 * Start writing.  Make bios for everything from the higher layers (the
-- 
2.38.1



* [PATCH v2 03/12] btrfs: raid56: extract the recovery bio list build code into a helper
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 01/12] btrfs: raid56: extract the vertical stripe recovery code into recover_vertical() Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 02/12] btrfs: raid56: extract the pq generation code into a helper Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 04/12] btrfs: raid56: extract sector recovery " Qu Wenruo
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

This new helper will also be utilized in the incoming refactor of the
recovery path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 64 ++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index fe525fe9338f..e3c172ba3c80 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2134,30 +2134,14 @@ static void raid_recover_end_io_work(struct work_struct *work)
 		__raid_recover_end_io(rbio);
 }
 
-/*
- * reads everything we need off the disk to reconstruct
- * the parity. endio handlers trigger final reconstruction
- * when the IO is done.
- *
- * This is used both for reads from the higher layers and for
- * parity construction required to finish a rmw cycle.
- */
-static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
+static int recover_assemble_read_bios(struct btrfs_raid_bio *rbio,
+				      struct bio_list *bio_list)
 {
-	int bios_to_read = 0;
-	struct bio_list bio_list;
-	int ret;
-	int total_sector_nr;
 	struct bio *bio;
+	int total_sector_nr;
+	int ret = 0;
 
-	bio_list_init(&bio_list);
-
-	ret = alloc_rbio_pages(rbio);
-	if (ret)
-		goto cleanup;
-
-	atomic_set(&rbio->error, 0);
-
+	ASSERT(bio_list_size(bio_list) == 0);
 	/*
 	 * Read everything that hasn't failed. However this time we will
 	 * not trust any cached sector.
@@ -2180,11 +2164,45 @@ static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
 			continue;
 		}
 		sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
+		ret = rbio_add_io_sector(rbio, bio_list, sector, stripe,
 					 sectornr, REQ_OP_READ);
 		if (ret < 0)
-			goto cleanup;
+			goto error;
 	}
+	return 0;
+error:
+	while ((bio = bio_list_pop(bio_list)))
+		bio_put(bio);
+
+	return -EIO;
+}
+
+/*
+ * reads everything we need off the disk to reconstruct
+ * the parity. endio handlers trigger final reconstruction
+ * when the IO is done.
+ *
+ * This is used both for reads from the higher layers and for
+ * parity construction required to finish a rmw cycle.
+ */
+static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
+{
+	int bios_to_read = 0;
+	struct bio_list bio_list;
+	int ret;
+	struct bio *bio;
+
+	bio_list_init(&bio_list);
+
+	ret = alloc_rbio_pages(rbio);
+	if (ret)
+		goto cleanup;
+
+	atomic_set(&rbio->error, 0);
+
+	ret = recover_assemble_read_bios(rbio, &bio_list);
+	if (ret < 0)
+		goto cleanup;
 
 	bios_to_read = bio_list_size(&bio_list);
 	if (!bios_to_read) {
-- 
2.38.1



* [PATCH v2 04/12] btrfs: raid56: extract sector recovery code into a helper
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
                   ` (2 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 03/12] btrfs: raid56: extract the recovery bio list build " Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 05/12] btrfs: raid56: switch recovery path to a single function Qu Wenruo
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

Along with the extraction, this includes the following extra changes:

- The allocation for unmap_array[] and pointers[]
  Now we allocate them in one go, and free them together.

- Remove @err
  Use errno_to_blk_status(ret) instead.
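
Allocation and cleanup then pair up nicely, since kfree(NULL) is a
no-op (abridged from the diff below):

  pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
  unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
  if (!pointers || !unmap_array) {
          ret = -ENOMEM;
          goto out;
  }
  /* ... recovery work ... */
out:
  kfree(pointers);
  kfree(unmap_array);
  return ret;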

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 59 +++++++++++++++++++++++------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index e3c172ba3c80..0f6c3358bb10 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2022,36 +2022,24 @@ static void recover_vertical(struct btrfs_raid_bio *rbio, int sector_nr,
 		kunmap_local(unmap_array[stripe_nr]);
 }
 
-/*
- * all parity reconstruction happens here.  We've read in everything
- * we can find from the drives and this does the heavy lifting of
- * sorting the good from the bad.
- */
-static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
+static int recover_sectors(struct btrfs_raid_bio *rbio)
 {
-	int sectornr;
 	void **pointers = NULL;
 	void **unmap_array = NULL;
-	blk_status_t err;
+	int sectornr;
+	int ret = 0;
 
 	/*
-	 * This array stores the pointer for each sector, thus it has the extra
-	 * pgoff value added from each sector
+	 * @pointers array stores the pointer for each sector.
+	 *
+	 * @unmap_array stores copy of pointers that does not get reordered
+	 * during reconstruction so that kunmap_local works.
 	 */
 	pointers = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
-	if (!pointers) {
-		err = BLK_STS_RESOURCE;
-		goto cleanup;
-	}
-
-	/*
-	 * Store copy of pointers that does not get reordered during
-	 * reconstruction so that kunmap_local works.
-	 */
 	unmap_array = kcalloc(rbio->real_stripes, sizeof(void *), GFP_NOFS);
-	if (!unmap_array) {
-		err = BLK_STS_RESOURCE;
-		goto cleanup;
+	if (!pointers || !unmap_array) {
+		ret = -ENOMEM;
+		goto out;
 	}
 
 	/* Make sure faila and failb are in order. */
@@ -2070,11 +2058,22 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
 		recover_vertical(rbio, sectornr, pointers, unmap_array);
 
-	err = BLK_STS_OK;
-
-cleanup:
-	kfree(unmap_array);
+out:
 	kfree(pointers);
+	kfree(unmap_array);
+	return ret;
+}
+
+/*
+ * all parity reconstruction happens here.  We've read in everything
+ * we can find from the drives and this does the heavy lifting of
+ * sorting the good from the bad.
+ */
+static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
+{
+	int ret;
+
+	ret = recover_sectors(rbio);
 
 	/*
 	 * Similar to READ_REBUILD, REBUILD_MISSING at this point also has a
@@ -2098,13 +2097,13 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		 *   Cache this rbio iff the above read reconstruction is
 		 *   executed without problems.
 		 */
-		if (err == BLK_STS_OK && rbio->failb < 0)
+		if (!ret && rbio->failb < 0)
 			cache_rbio_pages(rbio);
 		else
 			clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
 
-		rbio_orig_end_io(rbio, err);
-	} else if (err == BLK_STS_OK) {
+		rbio_orig_end_io(rbio, errno_to_blk_status(ret));
+	} else if (!ret) {
 		rbio->faila = -1;
 		rbio->failb = -1;
 
@@ -2115,7 +2114,7 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 		else
 			BUG();
 	} else {
-		rbio_orig_end_io(rbio, err);
+		rbio_orig_end_io(rbio, errno_to_blk_status(ret));
 	}
 }
 
-- 
2.38.1



* [PATCH v2 05/12] btrfs: raid56: switch recovery path to a single function
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
                   ` (3 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 04/12] btrfs: raid56: extract sector recovery " Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 06/12] btrfs: raid56: extract the rmw bio list build code into a helper Qu Wenruo
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

Currently btrfs uses end_io functions to jump between different stages
of recovery.

For example, we go through the following functions:

- raid56_bio_end_io()
  This handles the read for all the sectors (except the missing device).

- __raid_recover_end_io()
  This does the real work; it's called inside the delayed work
  function raid_recover_end_io_work().

This single recovery path involves at least 3 different functions,
which is a big burden for readers.

This patch will change the behavior by:

- Introduce a unified recovery entrance, recover_rbio()

- Use the submit-and-wait method
  So the workflow is not interrupted by endio function jumps.
  This doesn't bring any performance change, but reduces the burden on
  reviewers.

- Run the main function in the rmw_workers workqueue
  Now raid56_parity_recover() only needs to set up the work and
  queue it using start_async_work().

Now readers only need to follow one function jump (start_async_work())
to find the main entrance of the recovery path.

Furthermore, the recover_rbio() function can easily be reused by other paths.

The old recovery path is still utilized by the degraded write path.
It will be cleaned up once the write path is migrated.
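
For reference, the whole recovery now reads top to bottom in
recover_rbio().  The following is abridged from the diff below (error
handling and bio cleanup trimmed):

  static int recover_rbio(struct btrfs_raid_bio *rbio)
  {
          /* For recovery, we need to read all sectors including P/Q. */
          ret = alloc_rbio_pages(rbio);
          index_rbio_pages(rbio);

          ret = recover_assemble_read_bios(rbio, &bio_list);
          submit_read_bios(rbio, &bio_list);
          wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);

          /* More errors than our tolerance during the read. */
          if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
                  return -EIO;

          return recover_sectors(rbio);
  }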

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 145 +++++++++++++++++++++++++++++++++++++---------
 fs/btrfs/raid56.h |   2 +
 2 files changed, 120 insertions(+), 27 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 0f6c3358bb10..4bef7e32acbe 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -67,7 +67,6 @@ struct sector_ptr {
 static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
 static void rmw_work(struct work_struct *work);
-static void read_rebuild_work(struct work_struct *work);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
@@ -752,6 +751,8 @@ static noinline int lock_stripe_add(struct btrfs_raid_bio *rbio)
 	return ret;
 }
 
+static void recover_rbio_work_locked(struct work_struct *work);
+
 /*
  * called as rmw or parity rebuild is completed.  If the plug list has more
  * rbios waiting for this stripe, the next one on the list will be started
@@ -809,10 +810,10 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 			spin_unlock_irqrestore(&h->lock, flags);
 
 			if (next->operation == BTRFS_RBIO_READ_REBUILD)
-				start_async_work(next, read_rebuild_work);
+				start_async_work(next, recover_rbio_work_locked);
 			else if (next->operation == BTRFS_RBIO_REBUILD_MISSING) {
 				steal_rbio(rbio, next);
-				start_async_work(next, read_rebuild_work);
+				start_async_work(next, recover_rbio_work_locked);
 			} else if (next->operation == BTRFS_RBIO_WRITE) {
 				steal_rbio(rbio, next);
 				start_async_work(next, rmw_work);
@@ -989,6 +990,7 @@ static struct btrfs_raid_bio *alloc_rbio(struct btrfs_fs_info *fs_info,
 	}
 
 	bio_list_init(&rbio->bio_list);
+	init_waitqueue_head(&rbio->io_wait);
 	INIT_LIST_HEAD(&rbio->plug_list);
 	spin_lock_init(&rbio->bio_list_lock);
 	INIT_LIST_HEAD(&rbio->stripe_cache);
@@ -1519,6 +1521,40 @@ static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 	}
 }
 
+static void raid_wait_read_end_io(struct bio *bio)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+
+	if (bio->bi_status)
+		fail_bio_stripe(rbio, bio);
+	else
+		set_bio_pages_uptodate(rbio, bio);
+
+	bio_put(bio);
+	atomic_dec(&rbio->stripes_pending);
+	wake_up(&rbio->io_wait);
+}
+
+static void submit_read_bios(struct btrfs_raid_bio *rbio,
+			     struct bio_list *bio_list)
+{
+	struct bio *bio;
+
+	atomic_set(&rbio->stripes_pending, bio_list_size(bio_list));
+	while ((bio = bio_list_pop(bio_list))) {
+		bio->bi_end_io = raid_wait_read_end_io;
+
+		if (trace_raid56_scrub_read_recover_enabled()) {
+			struct raid56_bio_trace_info trace_info = { 0 };
+
+			bio_get_trace_info(rbio, bio, &trace_info);
+			trace_raid56_scrub_read_recover(rbio, bio, &trace_info);
+		}
+		submit_bio(bio);
+	}
+}
+
 static void raid56_bio_end_io(struct bio *bio)
 {
 	struct btrfs_raid_bio *rbio = bio->bi_private;
@@ -2176,6 +2212,79 @@ static int recover_assemble_read_bios(struct btrfs_raid_bio *rbio,
 	return -EIO;
 }
 
+static int recover_rbio(struct btrfs_raid_bio *rbio)
+{
+	struct bio_list bio_list;
+	struct bio *bio;
+	int ret;
+
+	/*
+	 * Either we're doing recover for a read failure or degraded write,
+	 * caller should have set faila/b correctly.
+	 */
+	ASSERT(rbio->faila >= 0 || rbio->failb >= 0);
+	bio_list_init(&bio_list);
+
+	/*
+	 * Reset error to 0, as we will later increase error for missing
+	 * devices.
+	 */
+	atomic_set(&rbio->error, 0);
+
+	/* For recovery, we need to read all sectors including P/Q. */
+	ret = alloc_rbio_pages(rbio);
+	if (ret < 0)
+		goto out;
+
+	index_rbio_pages(rbio);
+
+	ret = recover_assemble_read_bios(rbio, &bio_list);
+	if (ret < 0)
+		goto out;
+
+	submit_read_bios(rbio, &bio_list);
+	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+
+	/* We have more errors than our tolerance during the read. */
+	if (atomic_read(&rbio->error) > rbio->bioc->max_errors) {
+		ret = -EIO;
+		goto out;
+	}
+
+	ret = recover_sectors(rbio);
+
+out:
+	while ((bio = bio_list_pop(&bio_list)))
+		bio_put(bio);
+
+	return ret;
+}
+
+static void recover_rbio_work(struct work_struct *work)
+{
+	struct btrfs_raid_bio *rbio;
+	int ret;
+
+	rbio = container_of(work, struct btrfs_raid_bio, work);
+
+	ret = lock_stripe_add(rbio);
+	if (ret == 0) {
+		ret = recover_rbio(rbio);
+		rbio_orig_end_io(rbio, errno_to_blk_status(ret));
+	}
+}
+
+static void recover_rbio_work_locked(struct work_struct *work)
+{
+	struct btrfs_raid_bio *rbio;
+	int ret;
+
+	rbio = container_of(work, struct btrfs_raid_bio, work);
+
+	ret = recover_rbio(rbio);
+	rbio_orig_end_io(rbio, errno_to_blk_status(ret));
+}
+
 /*
  * reads everything we need off the disk to reconstruct
  * the parity. endio handlers trigger final reconstruction
@@ -2264,7 +2373,8 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 	rbio = alloc_rbio(fs_info, bioc);
 	if (IS_ERR(rbio)) {
 		bio->bi_status = errno_to_blk_status(PTR_ERR(rbio));
-		goto out_end_bio;
+		bio_endio(bio);
+		return;
 	}
 
 	rbio->operation = BTRFS_RBIO_READ_REBUILD;
@@ -2278,7 +2388,8 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 			   (u64)bio->bi_iter.bi_size, bioc->map_type);
 		free_raid_bio(rbio);
 		bio->bi_status = BLK_STS_IOERR;
-		goto out_end_bio;
+		bio_endio(bio);
+		return;
 	}
 
 	/*
@@ -2298,18 +2409,7 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 			rbio->failb--;
 	}
 
-	if (lock_stripe_add(rbio))
-		return;
-
-	/*
-	 * This adds our rbio to the list of rbios that will be handled after
-	 * the current lock owner is done.
-	 */
-	__raid56_parity_recover(rbio);
-	return;
-
-out_end_bio:
-	bio_endio(bio);
+	start_async_work(rbio, recover_rbio_work);
 }
 
 static void rmw_work(struct work_struct *work)
@@ -2320,14 +2420,6 @@ static void rmw_work(struct work_struct *work)
 	raid56_rmw_stripe(rbio);
 }
 
-static void read_rebuild_work(struct work_struct *work)
-{
-	struct btrfs_raid_bio *rbio;
-
-	rbio = container_of(work, struct btrfs_raid_bio, work);
-	__raid56_parity_recover(rbio);
-}
-
 /*
  * The following code is used to scrub/replace the parity stripe
  *
@@ -2818,6 +2910,5 @@ raid56_alloc_missing_rbio(struct bio *bio, struct btrfs_io_context *bioc)
 
 void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio)
 {
-	if (!lock_stripe_add(rbio))
-		start_async_work(rbio, read_rebuild_work);
+	start_async_work(rbio, recover_rbio_work);
 }
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 91d5c0adad15..445e833fcfcf 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -95,6 +95,8 @@ struct btrfs_raid_bio {
 
 	atomic_t error;
 
+	wait_queue_head_t io_wait;
+
 	struct work_struct end_io_work;
 
 	/* Bitmap to record which horizontal stripe has data */
-- 
2.38.1



* [PATCH v2 06/12] btrfs: raid56: extract the rmw bio list build code into a helper
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
                   ` (4 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 05/12] btrfs: raid56: switch recovery path to a single function Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 07/12] btrfs: raid56: extract rmw write bios assembly " Qu Wenruo
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

The helper will later be used to refactor the whole RMW path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 56 ++++++++++++++++++++++++++++++++---------------
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 4bef7e32acbe..3133c9706da4 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1596,28 +1596,16 @@ static void raid56_rmw_end_io_work(struct work_struct *work)
 	validate_rbio_for_rmw(rbio);
 }
 
-/*
- * the stripe must be locked by the caller.  It will
- * unlock after all the writes are done
- */
-static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
+static int rmw_assemble_read_bios(struct btrfs_raid_bio *rbio,
+				  struct bio_list *bio_list)
 {
-	int bios_to_read = 0;
-	struct bio_list bio_list;
 	const int nr_data_sectors = rbio->stripe_nsectors * rbio->nr_data;
-	int ret;
-	int total_sector_nr;
 	struct bio *bio;
+	int total_sector_nr;
+	int ret = 0;
 
-	bio_list_init(&bio_list);
+	ASSERT(bio_list_size(bio_list) == 0);
 
-	ret = alloc_rbio_pages(rbio);
-	if (ret)
-		goto cleanup;
-
-	index_rbio_pages(rbio);
-
-	atomic_set(&rbio->error, 0);
 	/* Build a list of bios to read all the missing data sectors. */
 	for (total_sector_nr = 0; total_sector_nr < nr_data_sectors;
 	     total_sector_nr++) {
@@ -1642,11 +1630,43 @@ static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
 		if (sector->uptodate)
 			continue;
 
-		ret = rbio_add_io_sector(rbio, &bio_list, sector,
+		ret = rbio_add_io_sector(rbio, bio_list, sector,
 			       stripe, sectornr, REQ_OP_READ);
 		if (ret)
 			goto cleanup;
 	}
+	return 0;
+
+cleanup:
+	while ((bio = bio_list_pop(bio_list)))
+		bio_put(bio);
+	return ret;
+}
+
+/*
+ * the stripe must be locked by the caller.  It will
+ * unlock after all the writes are done
+ */
+static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
+{
+	int bios_to_read = 0;
+	struct bio_list bio_list;
+	int ret;
+	struct bio *bio;
+
+	bio_list_init(&bio_list);
+
+	ret = alloc_rbio_pages(rbio);
+	if (ret)
+		goto cleanup;
+
+	index_rbio_pages(rbio);
+
+	atomic_set(&rbio->error, 0);
+
+	ret = rmw_assemble_read_bios(rbio, &bio_list);
+	if (ret < 0)
+		goto cleanup;
 
 	bios_to_read = bio_list_size(&bio_list);
 	if (!bios_to_read) {
-- 
2.38.1



* [PATCH v2 07/12] btrfs: raid56: extract rmw write bios assembly into a helper
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
                   ` (5 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 06/12] btrfs: raid56: extract the rmw bio list build code into a helper Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 08/12] btrfs: raid56: introduce a main entrance for the rmw path Qu Wenruo
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

The helper will later be used to refactor the rmw write path.

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 165 ++++++++++++++++++++++++++--------------------
 1 file changed, 94 insertions(+), 71 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 3133c9706da4..de986bbe43fc 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1237,6 +1237,97 @@ static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
 		kunmap_local(pointers[stripe]);
 }
 
+static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
+				   struct bio_list *bio_list)
+{
+	struct bio *bio;
+	/* The total sector number inside the full stripe. */
+	int total_sector_nr;
+	int sectornr;
+	int stripe;
+	int ret;
+
+	ASSERT(bio_list_size(bio_list) == 0);
+
+	/* We should have at least one data sector. */
+	ASSERT(bitmap_weight(&rbio->dbitmap, rbio->stripe_nsectors));
+
+	/*
+	 * Start assembly.  Make bios for everything from the higher layers (the
+	 * bio_list in our rbio) and our P/Q.  Ignore everything else.
+	 */
+	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
+	     total_sector_nr++) {
+		struct sector_ptr *sector;
+
+		stripe = total_sector_nr / rbio->stripe_nsectors;
+		sectornr = total_sector_nr % rbio->stripe_nsectors;
+
+		/* This vertical stripe has no data, skip it. */
+		if (!test_bit(sectornr, &rbio->dbitmap))
+			continue;
+
+		if (stripe < rbio->nr_data) {
+			sector = sector_in_rbio(rbio, stripe, sectornr, 1);
+			if (!sector)
+				continue;
+		} else {
+			sector = rbio_stripe_sector(rbio, stripe, sectornr);
+		}
+
+		ret = rbio_add_io_sector(rbio, bio_list, sector, stripe,
+					 sectornr, REQ_OP_WRITE);
+		if (ret)
+			goto error;
+	}
+
+	if (likely(!rbio->bioc->num_tgtdevs))
+		return 0;
+
+	/* Make a copy for the replace target device. */
+	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
+	     total_sector_nr++) {
+		struct sector_ptr *sector;
+
+		stripe = total_sector_nr / rbio->stripe_nsectors;
+		sectornr = total_sector_nr % rbio->stripe_nsectors;
+
+		if (!rbio->bioc->tgtdev_map[stripe]) {
+			/*
+			 * We can skip the whole stripe completely, note
+			 * total_sector_nr will be increased by one anyway.
+			 */
+			ASSERT(sectornr == 0);
+			total_sector_nr += rbio->stripe_nsectors - 1;
+			continue;
+		}
+
+		/* This vertical stripe has no data, skip it. */
+		if (!test_bit(sectornr, &rbio->dbitmap))
+			continue;
+
+		if (stripe < rbio->nr_data) {
+			sector = sector_in_rbio(rbio, stripe, sectornr, 1);
+			if (!sector)
+				continue;
+		} else {
+			sector = rbio_stripe_sector(rbio, stripe, sectornr);
+		}
+
+		ret = rbio_add_io_sector(rbio, bio_list, sector,
+					 rbio->bioc->tgtdev_map[stripe],
+					 sectornr, REQ_OP_WRITE);
+		if (ret)
+			goto error;
+	}
+
+	return 0;
+error:
+	while ((bio = bio_list_pop(bio_list)))
+		bio_put(bio);
+	return -EIO;
+}
+
 /*
  * this is called from one of two situations.  We either
  * have a full stripe from the higher layers, or we've read all
@@ -1247,10 +1338,7 @@ static void generate_pq_vertical(struct btrfs_raid_bio *rbio, int sectornr)
  */
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 {
-	struct btrfs_io_context *bioc = rbio->bioc;
 	/* The total sector number inside the full stripe. */
-	int total_sector_nr;
-	int stripe;
 	/* Sector number inside a stripe. */
 	int sectornr;
 	struct bio_list bio_list;
@@ -1294,75 +1382,10 @@ static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
 	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
 		generate_pq_vertical(rbio, sectornr);
 
-	/*
-	 * Start writing.  Make bios for everything from the higher layers (the
-	 * bio_list in our rbio) and our P/Q.  Ignore everything else.
-	 */
-	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
-	     total_sector_nr++) {
-		struct sector_ptr *sector;
+	ret = rmw_assemble_write_bios(rbio, &bio_list);
+	if (ret < 0)
+		goto cleanup;
 
-		stripe = total_sector_nr / rbio->stripe_nsectors;
-		sectornr = total_sector_nr % rbio->stripe_nsectors;
-
-		/* This vertical stripe has no data, skip it. */
-		if (!test_bit(sectornr, &rbio->dbitmap))
-			continue;
-
-		if (stripe < rbio->nr_data) {
-			sector = sector_in_rbio(rbio, stripe, sectornr, 1);
-			if (!sector)
-				continue;
-		} else {
-			sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		}
-
-		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
-					 sectornr, REQ_OP_WRITE);
-		if (ret)
-			goto cleanup;
-	}
-
-	if (likely(!bioc->num_tgtdevs))
-		goto write_data;
-
-	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
-	     total_sector_nr++) {
-		struct sector_ptr *sector;
-
-		stripe = total_sector_nr / rbio->stripe_nsectors;
-		sectornr = total_sector_nr % rbio->stripe_nsectors;
-
-		if (!bioc->tgtdev_map[stripe]) {
-			/*
-			 * We can skip the whole stripe completely, note
-			 * total_sector_nr will be increased by one anyway.
-			 */
-			ASSERT(sectornr == 0);
-			total_sector_nr += rbio->stripe_nsectors - 1;
-			continue;
-		}
-
-		/* This vertical stripe has no data, skip it. */
-		if (!test_bit(sectornr, &rbio->dbitmap))
-			continue;
-
-		if (stripe < rbio->nr_data) {
-			sector = sector_in_rbio(rbio, stripe, sectornr, 1);
-			if (!sector)
-				continue;
-		} else {
-			sector = rbio_stripe_sector(rbio, stripe, sectornr);
-		}
-
-		ret = rbio_add_io_sector(rbio, &bio_list, sector,
-					 rbio->bioc->tgtdev_map[stripe],
-					 sectornr, REQ_OP_WRITE);
-		if (ret)
-			goto cleanup;
-	}
-
-write_data:
 	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
 	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
 
-- 
2.38.1



* [PATCH v2 08/12] btrfs: raid56: introduce a main entrance for the rmw path
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
                   ` (6 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 07/12] btrfs: raid56: extract rmw write bios assembly " Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 09/12] btrfs: raid56: switch write path to rmw_rbio() Qu Wenruo
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

The new entrance will be called rmw_rbio().  It will have a streamlined
workflow using the submit-and-wait method.

Thus there will be no weird jumps between tons of functions, making it
way more reader friendly and later expansion easier, as it's now a
straight workflow and the timing of each step is much clearer.

Unfortunately we cannot yet migrate the rmw path to this new entrance,
as we still need extra work to address the plug and the
unlock_stripe() function.
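
For reference, the work wrappers around rmw_rbio() will mirror the
recover path's wrappers from patch 5.  A sketch of the locked variant
follows; the exact body only lands in patch 9, so take this as an
illustration rather than the final code:

  static void rmw_rbio_work_locked(struct work_struct *work)
  {
          struct btrfs_raid_bio *rbio =
                  container_of(work, struct btrfs_raid_bio, work);

          /* The caller (unlock_stripe()) holds the full stripe lock. */
          rbio_orig_end_io(rbio, errno_to_blk_status(rmw_rbio(rbio)));
  }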

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 161 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/btrfs/raid56.h |   5 ++
 2 files changed, 166 insertions(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index de986bbe43fc..436c9b1d51f5 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1252,6 +1252,14 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 	/* We should have at least one data sector. */
 	ASSERT(bitmap_weight(&rbio->dbitmap, rbio->stripe_nsectors));
 
+	/*
+	 * Reset errors, as we may have errors inherited from the degraded
+	 * write.
+	 */
+	atomic_set(&rbio->error, 0);
+	rbio->faila = -1;
+	rbio->failb = -1;
+
 	/*
 	 * Start assembly.  Make bios for everything from the higher layers (the
 	 * bio_list in our rbio) and our P/Q.  Ignore everything else.
@@ -1666,6 +1674,19 @@ static int rmw_assemble_read_bios(struct btrfs_raid_bio *rbio,
 	return ret;
 }
 
+static int alloc_rbio_data_pages(struct btrfs_raid_bio *rbio)
+{
+	const int data_pages = rbio->nr_data * rbio->stripe_npages;
+	int ret;
+
+	ret = btrfs_alloc_page_array(data_pages, rbio->stripe_pages);
+	if (ret < 0)
+		return ret;
+
+	index_stripe_sectors(rbio);
+	return 0;
+}
+
 /*
  * the stripe must be locked by the caller.  It will
  * unlock after all the writes are done
@@ -2455,6 +2476,146 @@ void raid56_parity_recover(struct bio *bio, struct btrfs_io_context *bioc,
 	start_async_work(rbio, recover_rbio_work);
 }
 
+static int rmw_read_and_wait(struct btrfs_raid_bio *rbio)
+{
+	struct bio_list bio_list;
+	struct bio *bio;
+	int ret;
+
+	bio_list_init(&bio_list);
+	atomic_set(&rbio->error, 0);
+
+	ret = rmw_assemble_read_bios(rbio, &bio_list);
+	if (ret < 0)
+		goto out;
+
+	submit_read_bios(rbio, &bio_list);
+	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+	return ret;
+out:
+	while ((bio = bio_list_pop(&bio_list)))
+		bio_put(bio);
+
+	return ret;
+}
+
+static void raid_wait_write_end_io(struct bio *bio)
+{
+	struct btrfs_raid_bio *rbio = bio->bi_private;
+	blk_status_t err = bio->bi_status;
+
+	if (err)
+		fail_bio_stripe(rbio, bio);
+	bio_put(bio);
+	atomic_dec(&rbio->stripes_pending);
+	wake_up(&rbio->io_wait);
+}
+
+static void submit_write_bios(struct btrfs_raid_bio *rbio,
+			      struct bio_list *bio_list)
+{
+	struct bio *bio;
+
+	atomic_set(&rbio->stripes_pending, bio_list_size(bio_list));
+	while ((bio = bio_list_pop(bio_list))) {
+		bio->bi_end_io = raid_wait_write_end_io;
+
+		if (trace_raid56_write_stripe_enabled()) {
+			struct raid56_bio_trace_info trace_info = { 0 };
+
+			bio_get_trace_info(rbio, bio, &trace_info);
+			trace_raid56_write_stripe(rbio, bio, &trace_info);
+		}
+		submit_bio(bio);
+	}
+}
+
+int rmw_rbio(struct btrfs_raid_bio *rbio)
+{
+	struct bio_list bio_list;
+	int sectornr;
+	int ret = 0;
+
+	/*
+	 * Allocate the pages for parity first, as P/Q pages will always be
+	 * needed for both full-stripe and sub-stripe writes.
+	 */
+	ret = alloc_rbio_parity_pages(rbio);
+	if (ret < 0)
+		return ret;
+
+	/* Full stripe write, can write the full stripe right now. */
+	if (rbio_is_full(rbio))
+		goto write;
+	/*
+	 * Now we're doing sub-stripe write, also need all data stripes to do
+	 * the full RMW.
+	 */
+	ret = alloc_rbio_data_pages(rbio);
+	if (ret < 0)
+		return ret;
+
+	atomic_set(&rbio->error, 0);
+	index_rbio_pages(rbio);
+
+	ret = rmw_read_and_wait(rbio);
+	if (ret < 0)
+		return ret;
+
+	/* Too many read errors, beyond our tolerance. */
+	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
+		return ret;
+
+	/* Have read failures but under tolerance, needs recovery. */
+	if (rbio->faila >= 0 || rbio->failb >= 0) {
+		ret = recover_rbio(rbio);
+		if (ret < 0)
+			return ret;
+	}
+write:
+	/*
+	 * At this stage we're not allowed to add any new bios to the
+	 * bio list any more, anyone else that wants to change this stripe
+	 * needs to do their own rmw.
+	 */
+	spin_lock_irq(&rbio->bio_list_lock);
+	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
+	spin_unlock_irq(&rbio->bio_list_lock);
+
+	atomic_set(&rbio->error, 0);
+
+	index_rbio_pages(rbio);
+
+	/*
+	 * We don't cache full rbios because we're assuming
+	 * the higher layers are unlikely to use this area of
+	 * the disk again soon.  If they do use it again,
+	 * hopefully they will send another full bio.
+	 */
+	if (!rbio_is_full(rbio))
+		cache_rbio_pages(rbio);
+	else
+		clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
+
+	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
+		generate_pq_vertical(rbio, sectornr);
+
+	bio_list_init(&bio_list);
+	ret = rmw_assemble_write_bios(rbio, &bio_list);
+	if (ret < 0)
+		return ret;
+
+	/* We should have at least one bio assembled. */
+	ASSERT(bio_list_size(&bio_list));
+	submit_write_bios(rbio, &bio_list);
+	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+
+	/* We have more errors than our tolerance during the read. */
+	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
+		ret = -EIO;
+	return ret;
+}
+
 static void rmw_work(struct work_struct *work)
 {
 	struct btrfs_raid_bio *rbio;
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 445e833fcfcf..0e77c77c5dba 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -185,4 +185,9 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio);
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
 
+/*
+ * Placeholder definition to avoid warning, will be removed when
+ * the full write path is migrated.
+ */
+int rmw_rbio(struct btrfs_raid_bio *rbio);
 #endif
-- 
2.38.1



* [PATCH v2 09/12] btrfs: raid56: switch write path to rmw_rbio()
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 workload Qu Wenruo
                   ` (7 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 08/12] btrfs: raid56: introduce a main entrance for the rmw path Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 10/12] btrfs: raid56: extract scrub read bio list assembly code into a helper Qu Wenruo
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

This includes the following changes:

- Implement a new raid_unplug() function
  Now we don't need a workqueue to run the plug, as all our work is
  just queuing the rmw_rbio_work() call, which can be executed without
  sleeping.

- Implement an rmw_rbio_work_locked() helper
  This is for unlock_stripe(), which is already holding the full stripe
  lock.

- Remove all the old functions
  This already shows how complex the old code was, as we ended up
  removing the following functions:

  * rmw_work()
  * validate_rbio_for_rmw()
  * raid56_rmw_end_io_work()
  * raid56_rmw_stripe()
  * full_stripe_write()
  * partial_stripe_write()
  * __raid56_parity_write()
  * run_plug()
  * unplug_work()
  * btrfs_raid_unplug()
  * __raid56_parity_recover()
  * raid_recover_end_io_work()

- Unexport rmw_rbio()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 350 ++++++----------------------------------------
 fs/btrfs/raid56.h |   5 -
 2 files changed, 42 insertions(+), 313 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 436c9b1d51f5..94df689713ea 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -64,9 +64,9 @@ struct sector_ptr {
 	unsigned int uptodate:8;
 };
 
-static int __raid56_parity_recover(struct btrfs_raid_bio *rbio);
 static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
-static void rmw_work(struct work_struct *work);
+static void rmw_rbio_work(struct work_struct *work);
+static void rmw_rbio_work_locked(struct work_struct *work);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
 static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
@@ -816,7 +816,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 				start_async_work(next, recover_rbio_work_locked);
 			} else if (next->operation == BTRFS_RBIO_WRITE) {
 				steal_rbio(rbio, next);
-				start_async_work(next, rmw_work);
+				start_async_work(next, rmw_rbio_work_locked);
 			} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
 				steal_rbio(rbio, next);
 				start_async_work(next, scrub_parity_work);
@@ -1108,23 +1108,6 @@ static int rbio_add_io_sector(struct btrfs_raid_bio *rbio,
 	return 0;
 }
 
-/*
- * while we're doing the read/modify/write cycle, we could
- * have errors in reading pages off the disk.  This checks
- * for errors and if we're not able to read the page it'll
- * trigger parity reconstruction.  The rmw will be finished
- * after we've reconstructed the failed stripes
- */
-static void validate_rbio_for_rmw(struct btrfs_raid_bio *rbio)
-{
-	if (rbio->faila >= 0 || rbio->failb >= 0) {
-		BUG_ON(rbio->faila == rbio->real_stripes - 1);
-		__raid56_parity_recover(rbio);
-	} else {
-		finish_rmw(rbio);
-	}
-}
-
 static void index_one_bio(struct btrfs_raid_bio *rbio, struct bio *bio)
 {
 	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
@@ -1602,31 +1585,6 @@ static void raid56_bio_end_io(struct bio *bio)
 			   &rbio->end_io_work);
 }
 
-/*
- * End io handler for the read phase of the RMW cycle.  All the bios here are
- * physical stripe bios we've read from the disk so we can recalculate the
- * parity of the stripe.
- *
- * This will usually kick off finish_rmw once all the bios are read in, but it
- * may trigger parity reconstruction if we had any errors along the way
- */
-static void raid56_rmw_end_io_work(struct work_struct *work)
-{
-	struct btrfs_raid_bio *rbio =
-		container_of(work, struct btrfs_raid_bio, end_io_work);
-
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors) {
-		rbio_orig_end_io(rbio, BLK_STS_IOERR);
-		return;
-	}
-
-	/*
-	 * This will normally call finish_rmw to start our write but if there
-	 * are any failed stripes we'll reconstruct from parity first.
-	 */
-	validate_rbio_for_rmw(rbio);
-}
-
 static int rmw_assemble_read_bios(struct btrfs_raid_bio *rbio,
 				  struct bio_list *bio_list)
 {
@@ -1687,122 +1645,6 @@ static int alloc_rbio_data_pages(struct btrfs_raid_bio *rbio)
 	return 0;
 }
 
-/*
- * the stripe must be locked by the caller.  It will
- * unlock after all the writes are done
- */
-static int raid56_rmw_stripe(struct btrfs_raid_bio *rbio)
-{
-	int bios_to_read = 0;
-	struct bio_list bio_list;
-	int ret;
-	struct bio *bio;
-
-	bio_list_init(&bio_list);
-
-	ret = alloc_rbio_pages(rbio);
-	if (ret)
-		goto cleanup;
-
-	index_rbio_pages(rbio);
-
-	atomic_set(&rbio->error, 0);
-
-	ret = rmw_assemble_read_bios(rbio, &bio_list);
-	if (ret < 0)
-		goto cleanup;
-
-	bios_to_read = bio_list_size(&bio_list);
-	if (!bios_to_read) {
-		/*
-		 * this can happen if others have merged with
-		 * us, it means there is nothing left to read.
-		 * But if there are missing devices it may not be
-		 * safe to do the full stripe write yet.
-		 */
-		goto finish;
-	}
-
-	/*
-	 * The bioc may be freed once we submit the last bio. Make sure not to
-	 * touch it after that.
-	 */
-	atomic_set(&rbio->stripes_pending, bios_to_read);
-	INIT_WORK(&rbio->end_io_work, raid56_rmw_end_io_work);
-	while ((bio = bio_list_pop(&bio_list))) {
-		bio->bi_end_io = raid56_bio_end_io;
-
-		if (trace_raid56_read_partial_enabled()) {
-			struct raid56_bio_trace_info trace_info = { 0 };
-
-			bio_get_trace_info(rbio, bio, &trace_info);
-			trace_raid56_read_partial(rbio, bio, &trace_info);
-		}
-		submit_bio(bio);
-	}
-	/* the actual write will happen once the reads are done */
-	return 0;
-
-cleanup:
-	rbio_orig_end_io(rbio, BLK_STS_IOERR);
-
-	while ((bio = bio_list_pop(&bio_list)))
-		bio_put(bio);
-
-	return -EIO;
-
-finish:
-	validate_rbio_for_rmw(rbio);
-	return 0;
-}
-
-/*
- * if the upper layers pass in a full stripe, we thank them by only allocating
- * enough pages to hold the parity, and sending it all down quickly.
- */
-static int full_stripe_write(struct btrfs_raid_bio *rbio)
-{
-	int ret;
-
-	ret = alloc_rbio_parity_pages(rbio);
-	if (ret)
-		return ret;
-
-	ret = lock_stripe_add(rbio);
-	if (ret == 0)
-		finish_rmw(rbio);
-	return 0;
-}
-
-/*
- * partial stripe writes get handed over to async helpers.
- * We're really hoping to merge a few more writes into this
- * rbio before calculating new parity
- */
-static int partial_stripe_write(struct btrfs_raid_bio *rbio)
-{
-	int ret;
-
-	ret = lock_stripe_add(rbio);
-	if (ret == 0)
-		start_async_work(rbio, rmw_work);
-	return 0;
-}
-
-/*
- * sometimes while we were reading from the drive to
- * recalculate parity, enough new bios come into create
- * a full stripe.  So we do a check here to see if we can
- * go directly to finish_rmw
- */
-static int __raid56_parity_write(struct btrfs_raid_bio *rbio)
-{
-	/* head off into rmw land if we don't have a full stripe */
-	if (!rbio_is_full(rbio))
-		return partial_stripe_write(rbio);
-	return full_stripe_write(rbio);
-}
-
 /*
  * We use plugging call backs to collect full stripes.
  * Any time we get a partial stripe write while plugged
@@ -1837,28 +1679,22 @@ static int plug_cmp(void *priv, const struct list_head *a,
 	return 0;
 }
 
-static void run_plug(struct btrfs_plug_cb *plug)
+static void raid_unplug(struct blk_plug_cb *cb, bool from_schedule)
 {
+	struct btrfs_plug_cb *plug = container_of(cb, struct btrfs_plug_cb, cb);
 	struct btrfs_raid_bio *cur;
 	struct btrfs_raid_bio *last = NULL;
 
-	/*
-	 * sort our plug list then try to merge
-	 * everything we can in hopes of creating full
-	 * stripes.
-	 */
 	list_sort(NULL, &plug->rbio_list, plug_cmp);
+
 	while (!list_empty(&plug->rbio_list)) {
 		cur = list_entry(plug->rbio_list.next,
 				 struct btrfs_raid_bio, plug_list);
 		list_del_init(&cur->plug_list);
 
 		if (rbio_is_full(cur)) {
-			int ret;
-
-			/* we have a full stripe, send it down */
-			ret = full_stripe_write(cur);
-			BUG_ON(ret);
+			/* We have a full stripe, queue it down. */
+			start_async_work(cur, rmw_rbio_work);
 			continue;
 		}
 		if (last) {
@@ -1866,42 +1702,16 @@ static void run_plug(struct btrfs_plug_cb *plug)
 				merge_rbio(last, cur);
 				free_raid_bio(cur);
 				continue;
-
 			}
-			__raid56_parity_write(last);
+			start_async_work(last, rmw_rbio_work);
 		}
 		last = cur;
 	}
-	if (last) {
-		__raid56_parity_write(last);
-	}
+	if (last)
+		start_async_work(last, rmw_rbio_work);
 	kfree(plug);
 }
 
-/*
- * if the unplug comes from schedule, we have to push the
- * work off to a helper thread
- */
-static void unplug_work(struct work_struct *work)
-{
-	struct btrfs_plug_cb *plug;
-	plug = container_of(work, struct btrfs_plug_cb, work);
-	run_plug(plug);
-}
-
-static void btrfs_raid_unplug(struct blk_plug_cb *cb, bool from_schedule)
-{
-	struct btrfs_plug_cb *plug;
-	plug = container_of(cb, struct btrfs_plug_cb, cb);
-
-	if (from_schedule) {
-		INIT_WORK(&plug->work, unplug_work);
-		queue_work(plug->info->rmw_workers, &plug->work);
-		return;
-	}
-	run_plug(plug);
-}
-
 /* Add the original bio into rbio->bio_list, and update rbio::dbitmap. */
 static void rbio_add_bio(struct btrfs_raid_bio *rbio, struct bio *orig_bio)
 {
@@ -1949,19 +1759,13 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 	rbio_add_bio(rbio, bio);
 
 	/*
-	 * don't plug on full rbios, just get them out the door
+	 * Don't plug on full rbios, just get them out the door
 	 * as quickly as we can
 	 */
-	if (rbio_is_full(rbio)) {
-		ret = full_stripe_write(rbio);
-		if (ret) {
-			free_raid_bio(rbio);
-			goto fail;
-		}
-		return;
-	}
+	if (rbio_is_full(rbio))
+		goto queue_rbio;
 
-	cb = blk_check_plugged(btrfs_raid_unplug, fs_info, sizeof(*plug));
+	cb = blk_check_plugged(raid_unplug, fs_info, sizeof(*plug));
 	if (cb) {
 		plug = container_of(cb, struct btrfs_plug_cb, cb);
 		if (!plug->info) {
@@ -1969,13 +1773,14 @@ void raid56_parity_write(struct bio *bio, struct btrfs_io_context *bioc)
 			INIT_LIST_HEAD(&plug->rbio_list);
 		}
 		list_add_tail(&rbio->plug_list, &plug->rbio_list);
-	} else {
-		ret = __raid56_parity_write(rbio);
-		if (ret) {
-			free_raid_bio(rbio);
-			goto fail;
-		}
+		return;
 	}
+queue_rbio:
+	/*
+	 * Either we don't have an existing plug, or we're doing a full
+	 * stripe, so we can queue the rmw work now.
+	 */
+	start_async_work(rbio, rmw_rbio_work);
 
 	return;
 
@@ -2218,21 +2023,6 @@ static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
 	}
 }
 
-/*
- * This is called only for stripes we've read from disk to reconstruct the
- * parity.
- */
-static void raid_recover_end_io_work(struct work_struct *work)
-{
-	struct btrfs_raid_bio *rbio =
-		container_of(work, struct btrfs_raid_bio, end_io_work);
-
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
-		rbio_orig_end_io(rbio, BLK_STS_IOERR);
-	else
-		__raid_recover_end_io(rbio);
-}
-
 static int recover_assemble_read_bios(struct btrfs_raid_bio *rbio,
 				      struct bio_list *bio_list)
 {
@@ -2349,79 +2139,6 @@ static void recover_rbio_work_locked(struct work_struct *work)
 	rbio_orig_end_io(rbio, errno_to_blk_status(ret));
 }
 
-/*
- * reads everything we need off the disk to reconstruct
- * the parity. endio handlers trigger final reconstruction
- * when the IO is done.
- *
- * This is used both for reads from the higher layers and for
- * parity construction required to finish a rmw cycle.
- */
-static int __raid56_parity_recover(struct btrfs_raid_bio *rbio)
-{
-	int bios_to_read = 0;
-	struct bio_list bio_list;
-	int ret;
-	struct bio *bio;
-
-	bio_list_init(&bio_list);
-
-	ret = alloc_rbio_pages(rbio);
-	if (ret)
-		goto cleanup;
-
-	atomic_set(&rbio->error, 0);
-
-	ret = recover_assemble_read_bios(rbio, &bio_list);
-	if (ret < 0)
-		goto cleanup;
-
-	bios_to_read = bio_list_size(&bio_list);
-	if (!bios_to_read) {
-		/*
-		 * we might have no bios to read just because the pages
-		 * were up to date, or we might have no bios to read because
-		 * the devices were gone.
-		 */
-		if (atomic_read(&rbio->error) <= rbio->bioc->max_errors) {
-			__raid_recover_end_io(rbio);
-			return 0;
-		} else {
-			goto cleanup;
-		}
-	}
-
-	/*
-	 * The bioc may be freed once we submit the last bio. Make sure not to
-	 * touch it after that.
-	 */
-	atomic_set(&rbio->stripes_pending, bios_to_read);
-	INIT_WORK(&rbio->end_io_work, raid_recover_end_io_work);
-	while ((bio = bio_list_pop(&bio_list))) {
-		bio->bi_end_io = raid56_bio_end_io;
-
-		if (trace_raid56_scrub_read_recover_enabled()) {
-			struct raid56_bio_trace_info trace_info = { 0 };
-
-			bio_get_trace_info(rbio, bio, &trace_info);
-			trace_raid56_scrub_read_recover(rbio, bio, &trace_info);
-		}
-		submit_bio(bio);
-	}
-
-	return 0;
-
-cleanup:
-	if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
-	    rbio->operation == BTRFS_RBIO_REBUILD_MISSING)
-		rbio_orig_end_io(rbio, BLK_STS_IOERR);
-
-	while ((bio = bio_list_pop(&bio_list)))
-		bio_put(bio);
-
-	return -EIO;
-}
-
 /*
  * the main entry point for reads from the higher layers.  This
  * is really only called when the normal read path had a failure,
@@ -2530,7 +2247,7 @@ static void submit_write_bios(struct btrfs_raid_bio *rbio,
 	}
 }
 
-int rmw_rbio(struct btrfs_raid_bio *rbio)
+static int rmw_rbio(struct btrfs_raid_bio *rbio)
 {
 	struct bio_list bio_list;
 	int sectornr;
@@ -2616,12 +2333,29 @@ int rmw_rbio(struct btrfs_raid_bio *rbio)
 	return ret;
 }
 
-static void rmw_work(struct work_struct *work)
+static void rmw_rbio_work(struct work_struct *work)
 {
 	struct btrfs_raid_bio *rbio;
+	int ret;
 
 	rbio = container_of(work, struct btrfs_raid_bio, work);
-	raid56_rmw_stripe(rbio);
+
+	ret = lock_stripe_add(rbio);
+	if (ret == 0) {
+		ret = rmw_rbio(rbio);
+		rbio_orig_end_io(rbio, errno_to_blk_status(ret));
+	}
+}
+
+static void rmw_rbio_work_locked(struct work_struct *work)
+{
+	struct btrfs_raid_bio *rbio;
+	int ret;
+
+	rbio = container_of(work, struct btrfs_raid_bio, work);
+
+	ret = rmw_rbio(rbio);
+	rbio_orig_end_io(rbio, errno_to_blk_status(ret));
 }
 
 /*
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 0e77c77c5dba..445e833fcfcf 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -185,9 +185,4 @@ void raid56_submit_missing_rbio(struct btrfs_raid_bio *rbio);
 int btrfs_alloc_stripe_hash_table(struct btrfs_fs_info *info);
 void btrfs_free_stripe_hash_table(struct btrfs_fs_info *info);
 
-/*
- * Placeholder definition to avoid warning, will be removed when
- * the full write path is migrated.
- */
-int rmw_rbio(struct btrfs_raid_bio *rbio);
 #endif
-- 
2.38.1



* [PATCH v2 10/12] btrfs: raid56: extract scrub read bio list assembly code into a helper
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload Qu Wenruo
                   ` (8 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 09/12] btrfs: raid56: switch write path to rmw_rbio() Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 11/12] btrfs: raid56: switch scrub path to use a single function Qu Wenruo
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

Just like we did for the write and recovery paths, extract the read
bio assembly code into a helper for the scrub path as well.

The differences between the three are (see the sketch after this list):

- rmw_assemble_read_bios() only submits reads for missing sectors
  Thus it will skip cached sectors, but will also read sectors which
  are not covered by any full stripe (for cache usage).

- recover_assemble_read_bios() reads every sector which has not failed

- scrub_assemble_read_bios() has an extra check for vertical stripes
  It's mostly the same as rmw_assemble_read_bios(), but will skip
  sectors which are not covered by a vertical stripe.
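
As a rough illustration, the per-sector "does this need a read?" checks
differ roughly as below.  This is a hand-written sketch, not the actual
btrfs code; the struct and field names are made up for the example:

  #include <stdbool.h>

  /* Stand-in for the per-sector state tracked in struct btrfs_raid_bio. */
  struct sector_state {
          bool uptodate;    /* content is already cached in memory */
          bool failed;      /* the stripe owning this sector failed */
          bool in_vertical; /* covered by a vertical stripe under scrub */
  };

  /* rmw: read everything that is not cached yet. */
  static bool rmw_needs_read(const struct sector_state *s)
  {
          return !s->uptodate;
  }

  /* recover: read every sector which has not failed. */
  static bool recover_needs_read(const struct sector_state *s)
  {
          return !s->failed;
  }

  /* scrub: like rmw, but skip sectors outside any vertical stripe. */
  static bool scrub_needs_read(const struct sector_state *s)
  {
          return s->in_vertical && !s->uptodate;
  }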

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 44 +++++++++++++++++++++++++++++++-------------
 1 file changed, 31 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 94df689713ea..4a82bffbcb4b 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2708,21 +2708,15 @@ static void raid56_parity_scrub_end_io_work(struct work_struct *work)
 	validate_rbio_for_parity_scrub(rbio);
 }
 
-static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
+static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio,
+				    struct bio_list *bio_list)
 {
-	int bios_to_read = 0;
-	struct bio_list bio_list;
-	int ret;
-	int total_sector_nr;
 	struct bio *bio;
+	int total_sector_nr;
+	int ret = 0;
 
-	bio_list_init(&bio_list);
+	ASSERT(bio_list_size(bio_list) == 0);
 
-	ret = alloc_rbio_essential_pages(rbio);
-	if (ret)
-		goto cleanup;
-
-	atomic_set(&rbio->error, 0);
 	/* Build a list of bios to read all the missing parts. */
 	for (total_sector_nr = 0; total_sector_nr < rbio->nr_sectors;
 	     total_sector_nr++) {
@@ -2751,11 +2745,35 @@ static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
 		if (sector->uptodate)
 			continue;
 
-		ret = rbio_add_io_sector(rbio, &bio_list, sector, stripe,
+		ret = rbio_add_io_sector(rbio, bio_list, sector, stripe,
 					 sectornr, REQ_OP_READ);
 		if (ret)
-			goto cleanup;
+			goto error;
 	}
+	return 0;
+error:
+	while ((bio = bio_list_pop(bio_list)))
+		bio_put(bio);
+	return ret;
+}
+
+static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
+{
+	int bios_to_read = 0;
+	struct bio_list bio_list;
+	int ret;
+	struct bio *bio;
+
+	bio_list_init(&bio_list);
+
+	ret = alloc_rbio_essential_pages(rbio);
+	if (ret)
+		goto cleanup;
+
+	atomic_set(&rbio->error, 0);
+	ret = scrub_assemble_read_bios(rbio, &bio_list);
+	if (ret < 0)
+		goto cleanup;
 
 	bios_to_read = bio_list_size(&bio_list);
 	if (!bios_to_read) {
-- 
2.38.1



* [PATCH v2 11/12] btrfs: raid56: switch scrub path to use a single function
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload Qu Wenruo
                   ` (9 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 10/12] btrfs: raid56: extract scrub read bio list assembly code into a helper Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-01 11:16 ` [PATCH v2 12/12] btrfs: remove the unused btrfs_fs_info::endio_raid56_workers and btrfs_raid_bio::end_io_work Qu Wenruo
  2022-11-02 14:59 ` [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload David Sterba
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

This switch involves the following changes:

- Make finish_parity_scrub() only submit the write bios
  It will no longer call rbio_orig_end_io(), and will now
  return an error code instead.

- Add a new helper, recover_scrub_rbio(), to handle recovery
  It just does the extra scrub related checks, and then calls
  recover_sectors().

- Rename raid56_parity_scrub_stripe() to scrub_rbio()
- Rename scrub_parity_work() to scrub_rbio_work_locked()
  To follow the existing naming scheme.

- Delete unused functions
  Including:
  * finish_rmw()
  * raid_write_end_io()
  * raid56_bio_end_io()
  * __raid_recover_end_io()

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/raid56.c | 401 ++++++++++------------------------------------
 1 file changed, 81 insertions(+), 320 deletions(-)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 4a82bffbcb4b..74e870941340 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -64,7 +64,6 @@ struct sector_ptr {
 	unsigned int uptodate:8;
 };
 
-static noinline void finish_rmw(struct btrfs_raid_bio *rbio);
 static void rmw_rbio_work(struct work_struct *work);
 static void rmw_rbio_work_locked(struct work_struct *work);
 static int fail_bio_stripe(struct btrfs_raid_bio *rbio, struct bio *bio);
@@ -72,9 +71,8 @@ static int fail_rbio_index(struct btrfs_raid_bio *rbio, int failed);
 static void index_rbio_pages(struct btrfs_raid_bio *rbio);
 static int alloc_rbio_pages(struct btrfs_raid_bio *rbio);
 
-static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
-					 int need_check);
-static void scrub_parity_work(struct work_struct *work);
+static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check);
+static void scrub_rbio_work_locked(struct work_struct *work);
 
 static void free_raid_bio_pointers(struct btrfs_raid_bio *rbio)
 {
@@ -819,7 +817,7 @@ static noinline void unlock_stripe(struct btrfs_raid_bio *rbio)
 				start_async_work(next, rmw_rbio_work_locked);
 			} else if (next->operation == BTRFS_RBIO_PARITY_SCRUB) {
 				steal_rbio(rbio, next);
-				start_async_work(next, scrub_parity_work);
+				start_async_work(next, scrub_rbio_work_locked);
 			}
 
 			goto done_nolock;
@@ -880,35 +878,6 @@ static void rbio_orig_end_io(struct btrfs_raid_bio *rbio, blk_status_t err)
 		rbio_endio_bio_list(extra, err);
 }
 
-/*
- * end io function used by finish_rmw.  When we finally
- * get here, we've written a full stripe
- */
-static void raid_write_end_io(struct bio *bio)
-{
-	struct btrfs_raid_bio *rbio = bio->bi_private;
-	blk_status_t err = bio->bi_status;
-	int max_errors;
-
-	if (err)
-		fail_bio_stripe(rbio, bio);
-
-	bio_put(bio);
-
-	if (!atomic_dec_and_test(&rbio->stripes_pending))
-		return;
-
-	err = BLK_STS_OK;
-
-	/* OK, we have read all the stripes we need to. */
-	max_errors = (rbio->operation == BTRFS_RBIO_PARITY_SCRUB) ?
-		     0 : rbio->bioc->max_errors;
-	if (atomic_read(&rbio->error) > max_errors)
-		err = BLK_STS_IOERR;
-
-	rbio_orig_end_io(rbio, err);
-}
-
 /**
  * Get a sector pointer specified by its @stripe_nr and @sector_nr
  *
@@ -1319,87 +1288,6 @@ static int rmw_assemble_write_bios(struct btrfs_raid_bio *rbio,
 	return -EIO;
 }
 
-/*
- * this is called from one of two situations.  We either
- * have a full stripe from the higher layers, or we've read all
- * the missing bits off disk.
- *
- * This will calculate the parity and then send down any
- * changed blocks.
- */
-static noinline void finish_rmw(struct btrfs_raid_bio *rbio)
-{
-	/* The total sector number inside the full stripe. */
-	/* Sector number inside a stripe. */
-	int sectornr;
-	struct bio_list bio_list;
-	struct bio *bio;
-	int ret;
-
-	bio_list_init(&bio_list);
-
-	/* We should have at least one data sector. */
-	ASSERT(bitmap_weight(&rbio->dbitmap, rbio->stripe_nsectors));
-
-	/* at this point we either have a full stripe,
-	 * or we've read the full stripe from the drive.
-	 * recalculate the parity and write the new results.
-	 *
-	 * We're not allowed to add any new bios to the
-	 * bio list here, anyone else that wants to
-	 * change this stripe needs to do their own rmw.
-	 */
-	spin_lock_irq(&rbio->bio_list_lock);
-	set_bit(RBIO_RMW_LOCKED_BIT, &rbio->flags);
-	spin_unlock_irq(&rbio->bio_list_lock);
-
-	atomic_set(&rbio->error, 0);
-
-	/*
-	 * now that we've set rmw_locked, run through the
-	 * bio list one last time and map the page pointers
-	 *
-	 * We don't cache full rbios because we're assuming
-	 * the higher layers are unlikely to use this area of
-	 * the disk again soon.  If they do use it again,
-	 * hopefully they will send another full bio.
-	 */
-	index_rbio_pages(rbio);
-	if (!rbio_is_full(rbio))
-		cache_rbio_pages(rbio);
-	else
-		clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
-
-	for (sectornr = 0; sectornr < rbio->stripe_nsectors; sectornr++)
-		generate_pq_vertical(rbio, sectornr);
-
-	ret = rmw_assemble_write_bios(rbio, &bio_list);
-	if (ret < 0)
-		goto cleanup;
-
-	atomic_set(&rbio->stripes_pending, bio_list_size(&bio_list));
-	BUG_ON(atomic_read(&rbio->stripes_pending) == 0);
-
-	while ((bio = bio_list_pop(&bio_list))) {
-		bio->bi_end_io = raid_write_end_io;
-
-		if (trace_raid56_write_stripe_enabled()) {
-			struct raid56_bio_trace_info trace_info = { 0 };
-
-			bio_get_trace_info(rbio, bio, &trace_info);
-			trace_raid56_write_stripe(rbio, bio, &trace_info);
-		}
-		submit_bio(bio);
-	}
-	return;
-
-cleanup:
-	rbio_orig_end_io(rbio, BLK_STS_IOERR);
-
-	while ((bio = bio_list_pop(&bio_list)))
-		bio_put(bio);
-}
-
 /*
  * helper to find the stripe number for a given bio.  Used to figure out which
  * stripe has failed.  This expects the bio to correspond to a physical disk,
@@ -1569,22 +1457,6 @@ static void submit_read_bios(struct btrfs_raid_bio *rbio,
 	}
 }
 
-static void raid56_bio_end_io(struct bio *bio)
-{
-	struct btrfs_raid_bio *rbio = bio->bi_private;
-
-	if (bio->bi_status)
-		fail_bio_stripe(rbio, bio);
-	else
-		set_bio_pages_uptodate(rbio, bio);
-
-	bio_put(bio);
-
-	if (atomic_dec_and_test(&rbio->stripes_pending))
-		queue_work(rbio->bioc->fs_info->endio_raid56_workers,
-			   &rbio->end_io_work);
-}
-
 static int rmw_assemble_read_bios(struct btrfs_raid_bio *rbio,
 				  struct bio_list *bio_list)
 {
@@ -1969,60 +1841,6 @@ static int recover_sectors(struct btrfs_raid_bio *rbio)
 	return ret;
 }
 
-/*
- * all parity reconstruction happens here.  We've read in everything
- * we can find from the drives and this does the heavy lifting of
- * sorting the good from the bad.
- */
-static void __raid_recover_end_io(struct btrfs_raid_bio *rbio)
-{
-	int ret;
-
-	ret = recover_sectors(rbio);
-
-	/*
-	 * Similar to READ_REBUILD, REBUILD_MISSING at this point also has a
-	 * valid rbio which is consistent with ondisk content, thus such a
-	 * valid rbio can be cached to avoid further disk reads.
-	 */
-	if (rbio->operation == BTRFS_RBIO_READ_REBUILD ||
-	    rbio->operation == BTRFS_RBIO_REBUILD_MISSING) {
-		/*
-		 * - In case of two failures, where rbio->failb != -1:
-		 *
-		 *   Do not cache this rbio since the above read reconstruction
-		 *   (raid6_datap_recov() or raid6_2data_recov()) may have
-		 *   changed some content of stripes which are not identical to
-		 *   on-disk content any more, otherwise, a later write/recover
-		 *   may steal stripe_pages from this rbio and end up with
-		 *   corruptions or rebuild failures.
-		 *
-		 * - In case of single failure, where rbio->failb == -1:
-		 *
-		 *   Cache this rbio iff the above read reconstruction is
-		 *   executed without problems.
-		 */
-		if (!ret && rbio->failb < 0)
-			cache_rbio_pages(rbio);
-		else
-			clear_bit(RBIO_CACHE_READY_BIT, &rbio->flags);
-
-		rbio_orig_end_io(rbio, errno_to_blk_status(ret));
-	} else if (!ret) {
-		rbio->faila = -1;
-		rbio->failb = -1;
-
-		if (rbio->operation == BTRFS_RBIO_WRITE)
-			finish_rmw(rbio);
-		else if (rbio->operation == BTRFS_RBIO_PARITY_SCRUB)
-			finish_parity_scrub(rbio, 0);
-		else
-			BUG();
-	} else {
-		rbio_orig_end_io(rbio, errno_to_blk_status(ret));
-	}
-}
-
 static int recover_assemble_read_bios(struct btrfs_raid_bio *rbio,
 				      struct bio_list *bio_list)
 {
@@ -2450,8 +2268,7 @@ static int alloc_rbio_essential_pages(struct btrfs_raid_bio *rbio)
 	return 0;
 }
 
-static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
-					 int need_check)
+static int finish_parity_scrub(struct btrfs_raid_bio *rbio, int need_check)
 {
 	struct btrfs_io_context *bioc = rbio->bioc;
 	const u32 sectorsize = bioc->fs_info->sectorsize;
@@ -2494,7 +2311,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 
 	p_sector.page = alloc_page(GFP_NOFS);
 	if (!p_sector.page)
-		goto cleanup;
+		return -ENOMEM;
 	p_sector.pgoff = 0;
 	p_sector.uptodate = 1;
 
@@ -2504,7 +2321,7 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 		if (!q_sector.page) {
 			__free_page(p_sector.page);
 			p_sector.page = NULL;
-			goto cleanup;
+			return -ENOMEM;
 		}
 		q_sector.pgoff = 0;
 		q_sector.uptodate = 1;
@@ -2591,33 +2408,13 @@ static noinline void finish_parity_scrub(struct btrfs_raid_bio *rbio,
 	}
 
 submit_write:
-	nr_data = bio_list_size(&bio_list);
-	if (!nr_data) {
-		/* Every parity is right */
-		rbio_orig_end_io(rbio, BLK_STS_OK);
-		return;
-	}
-
-	atomic_set(&rbio->stripes_pending, nr_data);
-
-	while ((bio = bio_list_pop(&bio_list))) {
-		bio->bi_end_io = raid_write_end_io;
-
-		if (trace_raid56_scrub_write_stripe_enabled()) {
-			struct raid56_bio_trace_info trace_info = { 0 };
-
-			bio_get_trace_info(rbio, bio, &trace_info);
-			trace_raid56_scrub_write_stripe(rbio, bio, &trace_info);
-		}
-		submit_bio(bio);
-	}
-	return;
+	submit_write_bios(rbio, &bio_list);
+	return 0;
 
 cleanup:
-	rbio_orig_end_io(rbio, BLK_STS_IOERR);
-
 	while ((bio = bio_list_pop(&bio_list)))
 		bio_put(bio);
+	return ret;
 }
 
 static inline int is_data_stripe(struct btrfs_raid_bio *rbio, int stripe)
@@ -2627,85 +2424,51 @@ static inline int is_data_stripe(struct btrfs_raid_bio *rbio, int stripe)
 	return 0;
 }
 
-/*
- * While we're doing the parity check and repair, we could have errors
- * in reading pages off the disk.  This checks for errors and if we're
- * not able to read the page it'll trigger parity reconstruction.  The
- * parity scrub will be finished after we've reconstructed the failed
- * stripes
- */
-static void validate_rbio_for_parity_scrub(struct btrfs_raid_bio *rbio)
+static int recover_scrub_rbio(struct btrfs_raid_bio *rbio)
 {
-	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
-		goto cleanup;
+	int dfail = 0, failp = -1;
+	int ret;
 
-	if (rbio->faila >= 0 || rbio->failb >= 0) {
-		int dfail = 0, failp = -1;
+	/* The no-error case should have already been handled by the caller. */
+	ASSERT(rbio->faila >= 0 || rbio->failb >= 0);
 
-		if (is_data_stripe(rbio, rbio->faila))
-			dfail++;
-		else if (is_parity_stripe(rbio->faila))
-			failp = rbio->faila;
+	if (is_data_stripe(rbio, rbio->faila))
+		dfail++;
+	else if (is_parity_stripe(rbio->faila))
+		failp = rbio->faila;
 
-		if (is_data_stripe(rbio, rbio->failb))
-			dfail++;
-		else if (is_parity_stripe(rbio->failb))
-			failp = rbio->failb;
-
-		/*
-		 * Because we can not use a scrubbing parity to repair
-		 * the data, so the capability of the repair is declined.
-		 * (In the case of RAID5, we can not repair anything)
-		 */
-		if (dfail > rbio->bioc->max_errors - 1)
-			goto cleanup;
-
-		/*
-		 * If all data is good, only parity is correctly, just
-		 * repair the parity.
-		 */
-		if (dfail == 0) {
-			finish_parity_scrub(rbio, 0);
-			return;
-		}
-
-		/*
-		 * Here means we got one corrupted data stripe and one
-		 * corrupted parity on RAID6, if the corrupted parity
-		 * is scrubbing parity, luckily, use the other one to repair
-		 * the data, or we can not repair the data stripe.
-		 */
-		if (failp != rbio->scrubp)
-			goto cleanup;
-
-		__raid_recover_end_io(rbio);
-	} else {
-		finish_parity_scrub(rbio, 1);
-	}
-	return;
-
-cleanup:
-	rbio_orig_end_io(rbio, BLK_STS_IOERR);
-}
-
-/*
- * end io for the read phase of the rmw cycle.  All the bios here are physical
- * stripe bios we've read from the disk so we can recalculate the parity of the
- * stripe.
- *
- * This will usually kick off finish_rmw once all the bios are read in, but it
- * may trigger parity reconstruction if we had any errors along the way
- */
-static void raid56_parity_scrub_end_io_work(struct work_struct *work)
-{
-	struct btrfs_raid_bio *rbio =
-		container_of(work, struct btrfs_raid_bio, end_io_work);
+	if (is_data_stripe(rbio, rbio->failb))
+		dfail++;
+	else if (is_parity_stripe(rbio->failb))
+		failp = rbio->failb;
 
 	/*
-	 * This will normally call finish_rmw to start our write, but if there
-	 * are any failed stripes we'll reconstruct from parity first
+	 * Because we can not use the parity being scrubbed to repair
+	 * the data, our repair capability is reduced.
+	 * (In the case of RAID5, we can not repair anything.)
 	 */
-	validate_rbio_for_parity_scrub(rbio);
+	if (dfail > rbio->bioc->max_errors - 1)
+		return -EIO;
+
+	/*
+	 * If all the data is good and only the parity is corrupted, just
+	 * repair the parity.
+	 */
+	if (dfail == 0)
+		return 0;
+
+	/*
+	 * Here we have one corrupted data stripe and one corrupted
+	 * parity on RAID6. If the corrupted parity is the one being
+	 * scrubbed, we can luckily use the other parity to repair the
+	 * data; otherwise we can not repair the data stripe.
+	 */
+	if (failp != rbio->scrubp)
+		return -EIO;
+
+	/* We have some corrupted sectors, need to repair them. */
+	ret = recover_sectors(rbio);
+	return ret;
 }
 
 static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio,
@@ -2757,9 +2520,9 @@ static int scrub_assemble_read_bios(struct btrfs_raid_bio *rbio,
 	return ret;
 }
 
-static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
+static int scrub_rbio(struct btrfs_raid_bio *rbio)
 {
-	int bios_to_read = 0;
+	bool need_check = false;
 	struct bio_list bio_list;
 	int ret;
 	struct bio *bio;
@@ -2775,61 +2538,59 @@ static void raid56_parity_scrub_stripe(struct btrfs_raid_bio *rbio)
 	if (ret < 0)
 		goto cleanup;
 
-	bios_to_read = bio_list_size(&bio_list);
-	if (!bios_to_read) {
-		/*
-		 * this can happen if others have merged with
-		 * us, it means there is nothing left to read.
-		 * But if there are missing devices it may not be
-		 * safe to do the full stripe write yet.
-		 */
+	submit_read_bios(rbio, &bio_list);
+	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+
+	if (atomic_read(&rbio->error) > rbio->bioc->max_errors) {
+		ret = -EIO;
+		goto cleanup;
+	}
+	/*
+	 * No error during the read, we can finish the scrub but still
+	 * need to verify the P/Q sectors.
+	 */
+	if (atomic_read(&rbio->error) == 0) {
+		need_check = true;
 		goto finish;
 	}
 
+	/* We have some failures, need to recover the failed sectors first. */
+	ret = recover_scrub_rbio(rbio);
+	if (ret < 0)
+		goto cleanup;
+
+finish:
 	/*
-	 * The bioc may be freed once we submit the last bio. Make sure not to
-	 * touch it after that.
+	 * We have every sector properly prepared. We can finish the
+	 * scrub and write back the good content now.
 	 */
-	atomic_set(&rbio->stripes_pending, bios_to_read);
-	INIT_WORK(&rbio->end_io_work, raid56_parity_scrub_end_io_work);
-	while ((bio = bio_list_pop(&bio_list))) {
-		bio->bi_end_io = raid56_bio_end_io;
-
-		if (trace_raid56_scrub_read_enabled()) {
-			struct raid56_bio_trace_info trace_info = { 0 };
-
-			bio_get_trace_info(rbio, bio, &trace_info);
-			trace_raid56_scrub_read(rbio, bio, &trace_info);
-		}
-		submit_bio(bio);
-	}
-	/* the actual write will happen once the reads are done */
-	return;
+	ret = finish_parity_scrub(rbio, need_check);
+	wait_event(rbio->io_wait, atomic_read(&rbio->stripes_pending) == 0);
+	if (atomic_read(&rbio->error) > rbio->bioc->max_errors)
+		ret = -EIO;
+	return ret;
 
 cleanup:
-	rbio_orig_end_io(rbio, BLK_STS_IOERR);
-
 	while ((bio = bio_list_pop(&bio_list)))
 		bio_put(bio);
 
-	return;
-
-finish:
-	validate_rbio_for_parity_scrub(rbio);
+	return ret;
 }
 
-static void scrub_parity_work(struct work_struct *work)
+static void scrub_rbio_work_locked(struct work_struct *work)
 {
 	struct btrfs_raid_bio *rbio;
+	int ret;
 
 	rbio = container_of(work, struct btrfs_raid_bio, work);
-	raid56_parity_scrub_stripe(rbio);
+	ret = scrub_rbio(rbio);
+	rbio_orig_end_io(rbio, errno_to_blk_status(ret));
 }
 
 void raid56_parity_submit_scrub_rbio(struct btrfs_raid_bio *rbio)
 {
 	if (!lock_stripe_add(rbio))
-		start_async_work(rbio, scrub_parity_work);
+		start_async_work(rbio, scrub_rbio_work_locked);
 }
 
 /* The following code is used for dev replace of a missing RAID 5/6 device. */
-- 
2.38.1



* [PATCH v2 12/12] btrfs: remove the unused btrfs_fs_info::endio_raid56_workers and btrfs_raid_bio::end_io_work
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload Qu Wenruo
                   ` (10 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 11/12] btrfs: raid56: switch scrub path to use a single function Qu Wenruo
@ 2022-11-01 11:16 ` Qu Wenruo
  2022-11-02 14:59 ` [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload David Sterba
  12 siblings, 0 replies; 16+ messages in thread
From: Qu Wenruo @ 2022-11-01 11:16 UTC (permalink / raw)
  To: linux-btrfs

Since we have switched all raid56 workloads to the submit-and-wait
method, there is no more use for the btrfs_fs_info::endio_raid56_workers
workqueue nor for btrfs_raid_bio::end_io_work.

Remove them to save some memory.
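
For reference, below is a toy model of the submit-and-wait accounting
that makes the endio workqueue unnecessary.  It is illustrative plain C
only; the kernel code sleeps via wait_event()/wake_up() on
rbio->io_wait instead of spinning:

  #include <stdatomic.h>

  struct rbio_model {
          atomic_int stripes_pending; /* models rbio->stripes_pending */
  };

  /* Submitter: account the bios, submit them, then just wait. */
  static void submit_and_wait(struct rbio_model *r, int nr_bios)
  {
          atomic_store(&r->stripes_pending, nr_bios);
          /* ... submit nr_bios bios here ... */
          while (atomic_load(&r->stripes_pending) > 0)
                  ; /* the kernel sleeps via wait_event() instead */
  }

  /* Completion: account the bio and wake the waiter, nothing more. */
  static void bio_done(struct rbio_model *r)
  {
          /* record the bio's success/failure first, then: */
          atomic_fetch_sub(&r->stripes_pending, 1);
          /* the kernel calls wake_up(&rbio->io_wait) here */
  }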

Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/disk-io.c | 6 +-----
 fs/btrfs/fs.h      | 1 -
 fs/btrfs/raid56.h  | 2 --
 3 files changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7fefd5baa86b..05385ecd8314 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2092,8 +2092,6 @@ static void btrfs_stop_all_workers(struct btrfs_fs_info *fs_info)
 	btrfs_destroy_workqueue(fs_info->workers);
 	if (fs_info->endio_workers)
 		destroy_workqueue(fs_info->endio_workers);
-	if (fs_info->endio_raid56_workers)
-		destroy_workqueue(fs_info->endio_raid56_workers);
 	if (fs_info->rmw_workers)
 		destroy_workqueue(fs_info->rmw_workers);
 	if (fs_info->compressed_write_workers)
@@ -2299,8 +2297,6 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 		alloc_workqueue("btrfs-endio", flags, max_active);
 	fs_info->endio_meta_workers =
 		alloc_workqueue("btrfs-endio-meta", flags, max_active);
-	fs_info->endio_raid56_workers =
-		alloc_workqueue("btrfs-endio-raid56", flags, max_active);
 	fs_info->rmw_workers = alloc_workqueue("btrfs-rmw", flags, max_active);
 	fs_info->endio_write_workers =
 		btrfs_alloc_workqueue(fs_info, "endio-write", flags,
@@ -2322,7 +2318,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info *fs_info)
 	      fs_info->delalloc_workers && fs_info->flush_workers &&
 	      fs_info->endio_workers && fs_info->endio_meta_workers &&
 	      fs_info->compressed_write_workers &&
-	      fs_info->endio_write_workers && fs_info->endio_raid56_workers &&
+	      fs_info->endio_write_workers &&
 	      fs_info->endio_freespace_worker && fs_info->rmw_workers &&
 	      fs_info->caching_workers && fs_info->fixup_workers &&
 	      fs_info->delayed_workers && fs_info->qgroup_rescan_workers &&
diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h
index becb164d8a5d..c771b284eea7 100644
--- a/fs/btrfs/fs.h
+++ b/fs/btrfs/fs.h
@@ -524,7 +524,6 @@ struct btrfs_fs_info {
 	struct btrfs_workqueue *flush_workers;
 	struct workqueue_struct *endio_workers;
 	struct workqueue_struct *endio_meta_workers;
-	struct workqueue_struct *endio_raid56_workers;
 	struct workqueue_struct *rmw_workers;
 	struct workqueue_struct *compressed_write_workers;
 	struct btrfs_workqueue *endio_write_workers;
diff --git a/fs/btrfs/raid56.h b/fs/btrfs/raid56.h
index 445e833fcfcf..9fae97b7a2a5 100644
--- a/fs/btrfs/raid56.h
+++ b/fs/btrfs/raid56.h
@@ -97,8 +97,6 @@ struct btrfs_raid_bio {
 
 	wait_queue_head_t io_wait;
 
-	struct work_struct end_io_work;
-
 	/* Bitmap to record which horizontal stripe has data */
 	unsigned long dbitmap;
 
-- 
2.38.1



* Re: [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload
  2022-11-01 11:16 [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload Qu Wenruo
                   ` (11 preceding siblings ...)
  2022-11-01 11:16 ` [PATCH v2 12/12] btrfs: remove the unused btrfs_fs_info::endio_raid56_workers and btrfs_raid_bio::end_io_work Qu Wenruo
@ 2022-11-02 14:59 ` David Sterba
  2022-11-02 22:32   ` Qu Wenruo
  12 siblings, 1 reply; 16+ messages in thread
From: David Sterba @ 2022-11-02 14:59 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: linux-btrfs

On Tue, Nov 01, 2022 at 07:16:00PM +0800, Qu Wenruo wrote:
> [...]

Thanks for untangling it. You may want to add some recognizable prefix
like raid56_ to functions that could block for IO; simple helpers do not
need it and they'll probably be inlined anyway.

Added to misc-next. Please proceed with the RMW fix series.


* Re: [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload
  2022-11-02 14:59 ` [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload David Sterba
@ 2022-11-02 22:32   ` Qu Wenruo
  2022-11-02 23:34     ` David Sterba
  0 siblings, 1 reply; 16+ messages in thread
From: Qu Wenruo @ 2022-11-02 22:32 UTC (permalink / raw)
  To: dsterba; +Cc: linux-btrfs



On 2022/11/2 22:59, David Sterba wrote:
> On Tue, Nov 01, 2022 at 07:16:00PM +0800, Qu Wenruo wrote:
>> [...]
> 
> Thanks for untangling it. You may want to add some recognizable prefix
> like raid56_ to functions that could block for IO;

One thing I am never going to handle consistently is the naming...

To me, since it's inside raid56.c, a prefix like "raid56_" looks a 
little redundant, unless it's exported.

But sure, with far fewer functions it's much easier to add/remove
prefixes in one go.

> simple helpers do not
> need it and they'll probably be inlined anyway.
> 
> Added to misc-next. Please proceed with the RMW fix series.

That's awesome.

Although I'm afraid I would just come up with extra refactors...

The latest thing that has some impact on the RMW fix is the faila/b
pair used in the current code.

a) if just one sector somehow failed to be read, we mark the whole
stripe as failed.

b) since RMW will now handle data csum verification, we have an extra
source of failed sectors.

I'll check if I can migrate the faila/b to an error bitmap, and then 
re-base my RMW fixes.
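
A minimal sketch of what I have in mind (all names hypothetical, just
to illustrate the layout: one bit per sector, indexed like
total_sector_nr, i.e. stripe_nr * stripe_nsectors + sectornr):

  #include <stdbool.h>
  #include <stdint.h>

  struct rbio_sketch {
          uint64_t error_bitmap;  /* big enough for small stripes here */
          int real_stripes;       /* data stripes + P (+ Q) */
          int stripe_nsectors;    /* sectors per stripe */
          int max_errors;         /* tolerated failed stripes, 1 or 2 */
  };

  /* Mark one sector failed, from a read error or a csum mismatch. */
  static void mark_sector_failed(struct rbio_sketch *r, int stripe_nr,
                                 int sectornr)
  {
          r->error_bitmap |=
                  1ULL << (stripe_nr * r->stripe_nsectors + sectornr);
  }

  /* Count the failed stripes inside one vertical stripe. */
  static int vertical_errors(const struct rbio_sketch *r, int sectornr)
  {
          int stripe_nr, errors = 0;

          for (stripe_nr = 0; stripe_nr < r->real_stripes; stripe_nr++)
                  if (r->error_bitmap & (1ULL <<
                      (stripe_nr * r->stripe_nsectors + sectornr)))
                          errors++;
          return errors;
  }

  /* One vertical stripe is recoverable iff errors <= max_errors. */
  static bool vertical_recoverable(const struct rbio_sketch *r,
                                   int sectornr)
  {
          return vertical_errors(r, sectornr) <= r->max_errors;
  }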

Thanks,
Qu


* Re: [PATCH v2 00/12] btrfs: raid56: use submit-and-wait method to handle raid56 worload
  2022-11-02 22:32   ` Qu Wenruo
@ 2022-11-02 23:34     ` David Sterba
  0 siblings, 0 replies; 16+ messages in thread
From: David Sterba @ 2022-11-02 23:34 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: dsterba, linux-btrfs

On Thu, Nov 03, 2022 at 06:32:46AM +0800, Qu Wenruo wrote:
> On 2022/11/2 22:59, David Sterba wrote:
> > On Tue, Nov 01, 2022 at 07:16:00PM +0800, Qu Wenruo wrote:
> >> [...]
> > 
> > Thanks for untangling it. You may want to add some recognizable prefix
> > like raid56_ to functions that could block for IO;
> 
> One thing I am never going to handle consistently is the naming...
> 
> To me, since it's inside raid56.c, a prefix like "raid56_" looks a 
> little redundant, unless it's exported.

Yes, this is one of the reasons: source code clarity. I suggested adding
some prefix because there is (or was) an occasional error that ended up
in finish_rmw. Before your patches it was declared as noinline, doing
all the xor/parity calculations and then the bio submission. The only
hint that it's in our code is the 'rmw'; a full btrfs_ prefix, or at
least some extent_, extent_io_ etc., could help to recognize where the
error happened. The generic functions tend to use a very generic naming
scheme, so anything that looks btrfs specific helps; in this case it
would be sufficient to add raid56_.

The renames can be added as needed, no big deal if it's not in the first
version, but in this case I had prior knowledge of where I'd really like
to see it. As the code got reworked significantly I'm fine with doing it
in the followup RMW fix series.

> But sure, with far fewer functions it's much easier to add/remove
> prefixes in one go.
> 
> > simple helpers do not
> > need it and they'll probably be inlined anyway.
> > 
> > Added to misc-next. Please proceed with the RMW fix series.
> 
> That's awesome.
> 
> Although I'm afraid I would just come up with extra refactors...
> 
> The latest thing that has some impact on the RMW fix is the faila/b
> pair used in the current code.
> 
> a) if just one sector somehow failed to be read, we mark the whole
> stripe as failed.
> 
> b) since RMW will now handle data csum verification, we have an extra
> source of failed sectors.
> 
> I'll check if I can migrate the faila/b to an error bitmap, and then 
> re-base my RMW fixes.

I think it's fine; reducing the jumps between the functions is a good
intermediate step. In this case it helps to reduce the complexity and
makes it easier to find further cleanups or optimizations.

