* [PATCH v3 00/15] Improve Raid5 Lock Contention
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Hi,

Now that I've done some cleanup of the mdadm testing infrastructure,
as well as a lot of long-run testing and bug fixes, I'm much more
confident in the correctness of this series. The previous posting is
at [1].

The pivot patch (patch 13 in this series) has been completely reworked
from the previous series (where much of the feedback was focused). Rare
bugs were found with the original method: if the array changed shape at
just the right time while first_wrap was true, the algorithm would fail
and blocks would not be written correctly. To fix this, the new version
uses a bitmap to track which pages have been added to the stripe_head.
This requires limiting the size of the request, but to a size greater
than the current limit (which is based on the number of segments). I've
also included another patch to remove the limit on the number of
segments (seeing it is not needed); the limit on the number of sectors
is higher and ends up with less bio splitting and fewer bios that are
unaligned with the chunk size.

--

I've been doing some work trying to improve the bulk write performance
of raid5 on large systems with fast NVMe drives. The bottleneck appears
largely to be lock contention on the hash_lock and device_lock. This
series improves the situation slightly by addressing a couple of
low-hanging-fruit ways to take the locks fewer times in the request path.

Patch 10 adjusts how batching works by keeping a reference to the
previous stripe_head in raid5_make_request(). Under most situations,
this removes the need to take the hash_lock in stripe_add_to_batch_list(),
which should reduce the number of times the lock is taken by a factor of
about 2.
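
In rough outline, the batching path ends up looking like this (a
condensed sketch of the hunk in patch 10 below):

  /* Reuse the cached stripe_head reference when it matches the head
   * sector, avoiding another hash_lock acquisition in the common
   * sequential-write case.
   */
  if (last_sh && head_sector == last_sh->sector) {
          head = last_sh;
          atomic_inc(&head->count);
  } else {
          hash = stripe_hash_locks_hash(conf, head_sector);
          spin_lock_irq(conf->hash_locks + hash);
          head = find_get_stripe(conf, head_sector, conf->generation,
                                 hash);
          spin_unlock_irq(conf->hash_locks + hash);
  }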

Patch 13 pivots the way raid5_make_request() works. Before the patch, the
code must find the stripe_head for every 4KB page in the request, so each
stripe_head must be found once for every data disk. The patch changes this
so that all the data disks can be added to a stripe_head at once and the
number of times the stripe_head must be found (and thus the number of
times the hash_lock is taken) should be reduced by a factor roughly equal
to the number of data disks.
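
As a rough illustration: with 8 data disks, a full-stripe write
previously looked up each stripe_head 8 times (once per data disk
page); after the pivot each stripe_head should only need to be found
once.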

Patch 15 increases the restriction on block layer IO size to reduce the
amount of bio splitting, which decreases the number of broken batches
that occur with large IOs due to the unnecessary splitting.

I've also included patch 14, which changes some debug prints to make
debugging a bit easier.

The remaining patches are just cleanup and prep patches for the two
main patches.

Doing apples-to-apples testing of this series on a small VM with 5 ram
disks, I saw a bandwidth increase of roughly 14%, and lock contention
on the hash_lock (as reported by lock_stat) was reduced by more than a
factor of 5 (though it is still significantly contended).

Testing on larger systems with NVMe drives saw similar bandwidth
increases, from 3% to 20% depending on the parameters. Oddly, small
arrays had larger gains, likely due to them having lower starting
bandwidths; I would have expected larger gains with larger arrays
(seeing there should have been even fewer locks taken in
raid5_make_request()).

This series is based on the current md/md-next (facef3b96c5b9565). A git
branch is available here:

  https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v3

Logan

[1] https://lkml.kernel.org/r/20220420195425.34911-1-logang@deltatee.com

--

Changes since v2:
  - Rebased on current md-next branch (facef3b96c5b9565)
  - Reworked Pivot patch with bitmap due to unfixable bug
  - Changed to a ternary operator in ahead_of_reshape() helper (per Paul)
  - Separated out the functional change from the non-functional change in
    the first patch (per Paul)
  - Dropped an unnecessary hash argument in __find_stripe() (per
    Christoph)
  - Fixed some minor commit message and comment errors
  - Collected tags from Christoph and Guoqing

Changes since v1:
  - Rebased on current md-next branch (190a901246c69d79)
  - Added patch to create a helper for checking if a sector
    is ahead of the reshape (per Christoph)
  - Reworked the __find_stripe() patch to create a find_get_stripe()
    helper (per Christoph)
  - Added more patches to further refactor raid5_make_request() and
    pull most of the loop body into a helper function (per Christoph)
  - A few other minor cleanups (boolean return, dropping casting when
    printing sectors, commit message grammar) as suggested by Christoph.
  - Fixed two uncommon but bad data corruption bugs that were found.

--

Logan Gunthorpe (15):
  md/raid5: Make logic blocking check consistent with logic that blocks
  md/raid5: Factor out ahead_of_reshape() function
  md/raid5: Refactor raid5_make_request loop
  md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
  md/raid5: Move common stripe get code into new find_get_stripe()
    helper
  md/raid5: Factor out helper from raid5_make_request() loop
  md/raid5: Drop the do_prepare flag in raid5_make_request()
  md/raid5: Move read_seqcount_begin() into make_stripe_request()
  md/raid5: Refactor for loop in raid5_make_request() into while loop
  md/raid5: Keep a reference to last stripe_head for batch
  md/raid5: Refactor add_stripe_bio()
  md/raid5: Check all disks in a stripe_head for reshape progress
  md/raid5: Pivot raid5_make_request()
  md/raid5: Improve debug prints
  md/raid5: Increase restriction on max segments per request

 drivers/md/raid5.c | 641 +++++++++++++++++++++++++++++----------------
 1 file changed, 418 insertions(+), 223 deletions(-)


base-commit: facef3b96c5b9565fa0416d7701ef990ef96e5a6
--
2.30.2


* [PATCH v3 01/15] md/raid5: Make logic blocking check consistent with logic that blocks
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

The check in raid5_make_request() differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug
as the check is fuzzy anyway (as reshape may move on between the first
check and the subsequent check). However, make it consistent so it can
be cleaned up in a subsequent patch.

The condition which causes the schedule is:

 !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
   logical_sector >= conf->reshape_progress) &&
  (mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
   logical_sector >= conf->reshape_safe)

The condition that causes the early bailout is made to match this.
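
Expanded, that condition is equivalent to (a short derivation to
connect it to the diff below):

 mddev->reshape_backwards
 ? (logical_sector >= conf->reshape_progress &&
    logical_sector < conf->reshape_safe)
 : (logical_sector >= conf->reshape_safe &&
    logical_sector < conf->reshape_progress)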

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5d84bad8b854..b3d1f894f154 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5841,7 +5841,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	if ((bi->bi_opf & REQ_NOWAIT) &&
 	    (conf->reshape_progress != MaxSector) &&
 	    (mddev->reshape_backwards
-	    ? (logical_sector > conf->reshape_progress && logical_sector <= conf->reshape_safe)
+	    ? (logical_sector >= conf->reshape_progress && logical_sector < conf->reshape_safe)
 	    : (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
 		bio_wouldblock_error(bi);
 		if (rw == WRITE)
-- 
2.30.2



* [PATCH v3 02/15] md/raid5: Factor out ahead_of_reshape() function
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

There are a few uses of an ugly ternary operator in raid5_make_request()
to check if a sector is ahead of the reshape sector.

Factor this out into a simple helper called ahead_of_reshape().

No functional changes intended.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/raid5.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b3d1f894f154..6e53b8490fff 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5784,6 +5784,13 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	bio_endio(bi);
 }
 
+static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
+			     sector_t reshape_sector)
+{
+	return mddev->reshape_backwards ? sector < reshape_sector :
+					  sector >= reshape_sector;
+}
+
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
@@ -5840,9 +5847,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
 	if ((bi->bi_opf & REQ_NOWAIT) &&
 	    (conf->reshape_progress != MaxSector) &&
-	    (mddev->reshape_backwards
-	    ? (logical_sector >= conf->reshape_progress && logical_sector < conf->reshape_safe)
-	    : (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
+	    !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
+	    ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {
 		bio_wouldblock_error(bi);
 		if (rw == WRITE)
 			md_write_end(mddev);
@@ -5871,14 +5877,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			 * to check again.
 			 */
 			spin_lock_irq(&conf->device_lock);
-			if (mddev->reshape_backwards
-			    ? logical_sector < conf->reshape_progress
-			    : logical_sector >= conf->reshape_progress) {
+			if (ahead_of_reshape(mddev, logical_sector,
+					     conf->reshape_progress)) {
 				previous = 1;
 			} else {
-				if (mddev->reshape_backwards
-				    ? logical_sector < conf->reshape_safe
-				    : logical_sector >= conf->reshape_safe) {
+				if (ahead_of_reshape(mddev, logical_sector,
+						     conf->reshape_safe)) {
 					spin_unlock_irq(&conf->device_lock);
 					schedule();
 					do_prepare = true;
@@ -5909,9 +5913,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 				 */
 				int must_retry = 0;
 				spin_lock_irq(&conf->device_lock);
-				if (mddev->reshape_backwards
-				    ? logical_sector >= conf->reshape_progress
-				    : logical_sector < conf->reshape_progress)
+				if (!ahead_of_reshape(mddev, logical_sector,
+						      conf->reshape_progress))
 					/* mismatch, need to try again */
 					must_retry = 1;
 				spin_unlock_irq(&conf->device_lock);
-- 
2.30.2



* [PATCH v3 03/15] md/raid5: Refactor raid5_make_request loop
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().

This makes the code easier to read and reduces the indentation level.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 109 +++++++++++++++++++++++----------------------
 1 file changed, 55 insertions(+), 54 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 6e53b8490fff..25db747c5856 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5901,68 +5901,69 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 		sh = raid5_get_active_stripe(conf, new_sector, previous,
 				       (bi->bi_opf & REQ_RAHEAD), 0);
-		if (sh) {
-			if (unlikely(previous)) {
-				/* expansion might have moved on while waiting for a
-				 * stripe, so we must do the range check again.
-				 * Expansion could still move past after this
-				 * test, but as we are holding a reference to
-				 * 'sh', we know that if that happens,
-				 *  STRIPE_EXPANDING will get set and the expansion
-				 * won't proceed until we finish with the stripe.
-				 */
-				int must_retry = 0;
-				spin_lock_irq(&conf->device_lock);
-				if (!ahead_of_reshape(mddev, logical_sector,
-						      conf->reshape_progress))
-					/* mismatch, need to try again */
-					must_retry = 1;
-				spin_unlock_irq(&conf->device_lock);
-				if (must_retry) {
-					raid5_release_stripe(sh);
-					schedule();
-					do_prepare = true;
-					goto retry;
-				}
-			}
-			if (read_seqcount_retry(&conf->gen_lock, seq)) {
-				/* Might have got the wrong stripe_head
-				 * by accident
-				 */
-				raid5_release_stripe(sh);
-				goto retry;
-			}
+		if (unlikely(!sh)) {
+			/* cannot get stripe, just give-up */
+			bi->bi_status = BLK_STS_IOERR;
+			break;
+		}
 
-			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-			    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
-				/* Stripe is busy expanding or
-				 * add failed due to overlap.  Flush everything
-				 * and wait a while
-				 */
-				md_wakeup_thread(mddev->thread);
+		if (unlikely(previous)) {
+			/* expansion might have moved on while waiting for a
+			 * stripe, so we must do the range check again.
+			 * Expansion could still move past after this
+			 * test, but as we are holding a reference to
+			 * 'sh', we know that if that happens,
+			 *  STRIPE_EXPANDING will get set and the expansion
+			 * won't proceed until we finish with the stripe.
+			 */
+			int must_retry = 0;
+			spin_lock_irq(&conf->device_lock);
+			if (!ahead_of_reshape(mddev, logical_sector,
+					      conf->reshape_progress))
+				/* mismatch, need to try again */
+				must_retry = 1;
+			spin_unlock_irq(&conf->device_lock);
+			if (must_retry) {
 				raid5_release_stripe(sh);
 				schedule();
 				do_prepare = true;
 				goto retry;
 			}
-			if (do_flush) {
-				set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
-				/* we only need flush for one stripe */
-				do_flush = false;
-			}
+		}
 
-			set_bit(STRIPE_HANDLE, &sh->state);
-			clear_bit(STRIPE_DELAYED, &sh->state);
-			if ((!sh->batch_head || sh == sh->batch_head) &&
-			    (bi->bi_opf & REQ_SYNC) &&
-			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				atomic_inc(&conf->preread_active_stripes);
-			release_stripe_plug(mddev, sh);
-		} else {
-			/* cannot get stripe for read-ahead, just give-up */
-			bi->bi_status = BLK_STS_IOERR;
-			break;
+		if (read_seqcount_retry(&conf->gen_lock, seq)) {
+			/* Might have got the wrong stripe_head by accident */
+			raid5_release_stripe(sh);
+			goto retry;
+		}
+
+		if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+		    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+			/*
+			 * Stripe is busy expanding or add failed due to
+			 * overlap. Flush everything and wait a while.
+			 */
+			md_wakeup_thread(mddev->thread);
+			raid5_release_stripe(sh);
+			schedule();
+			do_prepare = true;
+			goto retry;
 		}
+
+		if (do_flush) {
+			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+			/* we only need flush for one stripe */
+			do_flush = false;
+		}
+
+		set_bit(STRIPE_HANDLE, &sh->state);
+		clear_bit(STRIPE_DELAYED, &sh->state);
+		if ((!sh->batch_head || sh == sh->batch_head) &&
+		    (bi->bi_opf & REQ_SYNC) &&
+		    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			atomic_inc(&conf->preread_active_stripes);
+
+		release_stripe_plug(mddev, sh);
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
-- 
2.30.2



* [PATCH v3 04/15] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

stripe_add_to_batch_list() is better done in the loop in
raid5_make_request() instead of inside add_stripe_bio(). This is
clearer and allows for storing the batch_head state outside the loop
in a subsequent patch.

The call to add_stripe_bio() in retry_aligned_read() is for reads,
and batching only applies to writes, so it's impossible for batching
to happen at that call site.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 25db747c5856..969609b7114b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3531,8 +3531,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	}
 	spin_unlock_irq(&sh->stripe_lock);
 
-	if (stripe_can_batch(sh))
-		stripe_add_to_batch_list(conf, sh);
 	return 1;
 
  overlap:
@@ -5950,6 +5948,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			goto retry;
 		}
 
+		if (stripe_can_batch(sh))
+			stripe_add_to_batch_list(conf, sh);
+
 		if (do_flush) {
 			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
 			/* we only need flush for one stripe */
-- 
2.30.2



* [PATCH v3 05/15] md/raid5: Move common stripe get code into new find_get_stripe() helper
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

Both uses of __find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 131 ++++++++++++++++++++++-----------------------
 1 file changed, 64 insertions(+), 67 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 969609b7114b..1bbf87d15bc8 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -624,6 +624,49 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
 	return NULL;
 }
 
+static struct stripe_head *find_get_stripe(struct r5conf *conf,
+		sector_t sector, short generation, int hash)
+{
+	int inc_empty_inactive_list_flag;
+	struct stripe_head *sh;
+
+	sh = __find_stripe(conf, sector, generation);
+	if (!sh)
+		return NULL;
+
+	if (atomic_inc_not_zero(&sh->count))
+		return sh;
+
+	/*
+	 * Slow path. The reference count is zero which means the stripe must
+	 * be on a list (sh->lru). Must remove the stripe from the list that
+	 * references it with the device_lock held.
+	 */
+
+	spin_lock(&conf->device_lock);
+	if (!atomic_read(&sh->count)) {
+		if (!test_bit(STRIPE_HANDLE, &sh->state))
+			atomic_inc(&conf->active_stripes);
+		BUG_ON(list_empty(&sh->lru) &&
+		       !test_bit(STRIPE_EXPANDING, &sh->state));
+		inc_empty_inactive_list_flag = 0;
+		if (!list_empty(conf->inactive_list + hash))
+			inc_empty_inactive_list_flag = 1;
+		list_del_init(&sh->lru);
+		if (list_empty(conf->inactive_list + hash) &&
+		    inc_empty_inactive_list_flag)
+			atomic_inc(&conf->empty_inactive_list_nr);
+		if (sh->group) {
+			sh->group->stripes_cnt--;
+			sh->group = NULL;
+		}
+	}
+	atomic_inc(&sh->count);
+	spin_unlock(&conf->device_lock);
+
+	return sh;
+}
+
 /*
  * Need to check if array has failed when deciding whether to:
  *  - start an array
@@ -716,7 +759,6 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 {
 	struct stripe_head *sh;
 	int hash = stripe_hash_locks_hash(conf, sector);
-	int inc_empty_inactive_list_flag;
 
 	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
@@ -726,57 +768,34 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 		wait_event_lock_irq(conf->wait_for_quiescent,
 				    conf->quiesce == 0 || noquiesce,
 				    *(conf->hash_locks + hash));
-		sh = __find_stripe(conf, sector, conf->generation - previous);
-		if (!sh) {
-			if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
-				sh = get_free_stripe(conf, hash);
-				if (!sh && !test_bit(R5_DID_ALLOC,
-						     &conf->cache_state))
-					set_bit(R5_ALLOC_MORE,
-						&conf->cache_state);
-			}
-			if (noblock && sh == NULL)
-				break;
+		sh = find_get_stripe(conf, sector, conf->generation - previous,
+				     hash);
+		if (sh)
+			break;
 
-			r5c_check_stripe_cache_usage(conf);
-			if (!sh) {
-				set_bit(R5_INACTIVE_BLOCKED,
-					&conf->cache_state);
-				r5l_wake_reclaim(conf->log, 0);
-				wait_event_lock_irq(
-					conf->wait_for_stripe,
+		if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
+			sh = get_free_stripe(conf, hash);
+			if (!sh && !test_bit(R5_DID_ALLOC, &conf->cache_state))
+				set_bit(R5_ALLOC_MORE, &conf->cache_state);
+		}
+		if (noblock && !sh)
+			break;
+
+		r5c_check_stripe_cache_usage(conf);
+		if (!sh) {
+			set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+			r5l_wake_reclaim(conf->log, 0);
+			wait_event_lock_irq(conf->wait_for_stripe,
 					!list_empty(conf->inactive_list + hash) &&
 					(atomic_read(&conf->active_stripes)
 					 < (conf->max_nr_stripes * 3 / 4)
 					 || !test_bit(R5_INACTIVE_BLOCKED,
 						      &conf->cache_state)),
 					*(conf->hash_locks + hash));
-				clear_bit(R5_INACTIVE_BLOCKED,
-					  &conf->cache_state);
-			} else {
-				init_stripe(sh, sector, previous);
-				atomic_inc(&sh->count);
-			}
-		} else if (!atomic_inc_not_zero(&sh->count)) {
-			spin_lock(&conf->device_lock);
-			if (!atomic_read(&sh->count)) {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
-					atomic_inc(&conf->active_stripes);
-				BUG_ON(list_empty(&sh->lru) &&
-				       !test_bit(STRIPE_EXPANDING, &sh->state));
-				inc_empty_inactive_list_flag = 0;
-				if (!list_empty(conf->inactive_list + hash))
-					inc_empty_inactive_list_flag = 1;
-				list_del_init(&sh->lru);
-				if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
-					atomic_inc(&conf->empty_inactive_list_nr);
-				if (sh->group) {
-					sh->group->stripes_cnt--;
-					sh->group = NULL;
-				}
-			}
+			clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+		} else {
+			init_stripe(sh, sector, previous);
 			atomic_inc(&sh->count);
-			spin_unlock(&conf->device_lock);
 		}
 	} while (sh == NULL);
 
@@ -830,7 +849,6 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
 	sector_t head_sector, tmp_sec;
 	int hash;
 	int dd_idx;
-	int inc_empty_inactive_list_flag;
 
 	/* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
 	tmp_sec = sh->sector;
@@ -840,28 +858,7 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
 
 	hash = stripe_hash_locks_hash(conf, head_sector);
 	spin_lock_irq(conf->hash_locks + hash);
-	head = __find_stripe(conf, head_sector, conf->generation);
-	if (head && !atomic_inc_not_zero(&head->count)) {
-		spin_lock(&conf->device_lock);
-		if (!atomic_read(&head->count)) {
-			if (!test_bit(STRIPE_HANDLE, &head->state))
-				atomic_inc(&conf->active_stripes);
-			BUG_ON(list_empty(&head->lru) &&
-			       !test_bit(STRIPE_EXPANDING, &head->state));
-			inc_empty_inactive_list_flag = 0;
-			if (!list_empty(conf->inactive_list + hash))
-				inc_empty_inactive_list_flag = 1;
-			list_del_init(&head->lru);
-			if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
-				atomic_inc(&conf->empty_inactive_list_nr);
-			if (head->group) {
-				head->group->stripes_cnt--;
-				head->group = NULL;
-			}
-		}
-		atomic_inc(&head->count);
-		spin_unlock(&conf->device_lock);
-	}
+	head = find_get_stripe(conf, head_sector, conf->generation, hash);
 	spin_unlock_irq(conf->hash_locks + hash);
 
 	if (!head)
-- 
2.30.2



* [PATCH v3 06/15] md/raid5: Factor out helper from raid5_make_request() loop
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Factor out the inner loop of raid5_make_request() into its own helper
called make_stripe_request().

The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.

A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.

No functional changes intended. This will be cleaned up in subsequent
patches to further untangle the gen_lock and do_prepare logic.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 231 ++++++++++++++++++++++++++-------------------
 1 file changed, 133 insertions(+), 98 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1bbf87d15bc8..26ef292842de 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5786,17 +5786,139 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
 					  sector >= reshape_sector;
 }
 
+enum stripe_result {
+	STRIPE_SUCCESS = 0,
+	STRIPE_RETRY,
+	STRIPE_SCHEDULE_AND_RETRY,
+	STRIPE_FAIL,
+};
+
+struct stripe_request_ctx {
+	/* the request had REQ_PREFLUSH, cleared after the first stripe_head */
+	bool do_flush;
+};
+
+static enum stripe_result make_stripe_request(struct mddev *mddev,
+		struct r5conf *conf, struct stripe_request_ctx *ctx,
+		sector_t logical_sector, struct bio *bi, int seq)
+{
+	const int rw = bio_data_dir(bi);
+	enum stripe_result ret;
+	struct stripe_head *sh;
+	sector_t new_sector;
+	int previous = 0;
+	int dd_idx;
+
+	if (unlikely(conf->reshape_progress != MaxSector)) {
+		/*
+		 * Spinlock is needed as reshape_progress may be
+		 * 64bit on a 32bit platform, and so it might be
+		 * possible to see a half-updated value
+		 * Of course reshape_progress could change after
+		 * the lock is dropped, so once we get a reference
+		 * to the stripe that we think it is, we will have
+		 * to check again.
+		 */
+		spin_lock_irq(&conf->device_lock);
+		if (ahead_of_reshape(mddev, logical_sector,
+				     conf->reshape_progress)) {
+			previous = 1;
+		} else {
+			if (ahead_of_reshape(mddev, logical_sector,
+					     conf->reshape_safe)) {
+				spin_unlock_irq(&conf->device_lock);
+				return STRIPE_SCHEDULE_AND_RETRY;
+			}
+		}
+		spin_unlock_irq(&conf->device_lock);
+	}
+
+	new_sector = raid5_compute_sector(conf, logical_sector, previous,
+					  &dd_idx, NULL);
+	pr_debug("raid456: %s, sector %llu logical %llu\n", __func__,
+		 new_sector, logical_sector);
+
+	sh = raid5_get_active_stripe(conf, new_sector, previous,
+				     (bi->bi_opf & REQ_RAHEAD), 0);
+	if (unlikely(!sh)) {
+		/* cannot get stripe, just give-up */
+		bi->bi_status = BLK_STS_IOERR;
+		return STRIPE_FAIL;
+	}
+
+	if (unlikely(previous)) {
+		/*
+		 * Expansion might have moved on while waiting for a
+		 * stripe, so we must do the range check again.
+		 * Expansion could still move past after this
+		 * test, but as we are holding a reference to
+		 * 'sh', we know that if that happens,
+		 *  STRIPE_EXPANDING will get set and the expansion
+		 * won't proceed until we finish with the stripe.
+		 */
+		int must_retry = 0;
+		spin_lock_irq(&conf->device_lock);
+		if (!ahead_of_reshape(mddev, logical_sector,
+				      conf->reshape_progress))
+			/* mismatch, need to try again */
+			must_retry = 1;
+		spin_unlock_irq(&conf->device_lock);
+		if (must_retry) {
+			ret = STRIPE_SCHEDULE_AND_RETRY;
+			goto out_release;
+		}
+	}
+
+	if (read_seqcount_retry(&conf->gen_lock, seq)) {
+		/* Might have got the wrong stripe_head by accident */
+		ret = STRIPE_RETRY;
+		goto out_release;
+	}
+
+	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+		/*
+		 * Stripe is busy expanding or add failed due to
+		 * overlap. Flush everything and wait a while.
+		 */
+		md_wakeup_thread(mddev->thread);
+		ret = STRIPE_SCHEDULE_AND_RETRY;
+		goto out_release;
+	}
+
+	if (stripe_can_batch(sh))
+		stripe_add_to_batch_list(conf, sh);
+
+	if (ctx->do_flush) {
+		set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+		/* we only need flush for one stripe */
+		ctx->do_flush = false;
+	}
+
+	set_bit(STRIPE_HANDLE, &sh->state);
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	if ((!sh->batch_head || sh == sh->batch_head) &&
+	    (bi->bi_opf & REQ_SYNC) &&
+	    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		atomic_inc(&conf->preread_active_stripes);
+
+	release_stripe_plug(mddev, sh);
+	return STRIPE_SUCCESS;
+
+out_release:
+	raid5_release_stripe(sh);
+	return ret;
+}
+
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
-	int dd_idx;
-	sector_t new_sector;
 	sector_t logical_sector, last_sector;
-	struct stripe_head *sh;
+	struct stripe_request_ctx ctx = {};
 	const int rw = bio_data_dir(bi);
+	enum stripe_result res;
 	DEFINE_WAIT(w);
 	bool do_prepare;
-	bool do_flush = false;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = log_handle_flush_request(conf, bi);
@@ -5812,7 +5934,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 		 * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
 		 * we need to flush journal device
 		 */
-		do_flush = bi->bi_opf & REQ_PREFLUSH;
+		ctx.do_flush = bi->bi_opf & REQ_PREFLUSH;
 	}
 
 	if (!md_write_start(mddev, bi))
@@ -5852,117 +5974,30 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
-		int previous;
 		int seq;
 
 		do_prepare = false;
 	retry:
 		seq = read_seqcount_begin(&conf->gen_lock);
-		previous = 0;
 		if (do_prepare)
 			prepare_to_wait(&conf->wait_for_overlap, &w,
 				TASK_UNINTERRUPTIBLE);
-		if (unlikely(conf->reshape_progress != MaxSector)) {
-			/* spinlock is needed as reshape_progress may be
-			 * 64bit on a 32bit platform, and so it might be
-			 * possible to see a half-updated value
-			 * Of course reshape_progress could change after
-			 * the lock is dropped, so once we get a reference
-			 * to the stripe that we think it is, we will have
-			 * to check again.
-			 */
-			spin_lock_irq(&conf->device_lock);
-			if (ahead_of_reshape(mddev, logical_sector,
-					     conf->reshape_progress)) {
-				previous = 1;
-			} else {
-				if (ahead_of_reshape(mddev, logical_sector,
-						     conf->reshape_safe)) {
-					spin_unlock_irq(&conf->device_lock);
-					schedule();
-					do_prepare = true;
-					goto retry;
-				}
-			}
-			spin_unlock_irq(&conf->device_lock);
-		}
-
-		new_sector = raid5_compute_sector(conf, logical_sector,
-						  previous,
-						  &dd_idx, NULL);
-		pr_debug("raid456: raid5_make_request, sector %llu logical %llu\n",
-			(unsigned long long)new_sector,
-			(unsigned long long)logical_sector);
 
-		sh = raid5_get_active_stripe(conf, new_sector, previous,
-				       (bi->bi_opf & REQ_RAHEAD), 0);
-		if (unlikely(!sh)) {
-			/* cannot get stripe, just give-up */
-			bi->bi_status = BLK_STS_IOERR;
+		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
+					  bi, seq);
+		if (res == STRIPE_FAIL)
 			break;
-		}
-
-		if (unlikely(previous)) {
-			/* expansion might have moved on while waiting for a
-			 * stripe, so we must do the range check again.
-			 * Expansion could still move past after this
-			 * test, but as we are holding a reference to
-			 * 'sh', we know that if that happens,
-			 *  STRIPE_EXPANDING will get set and the expansion
-			 * won't proceed until we finish with the stripe.
-			 */
-			int must_retry = 0;
-			spin_lock_irq(&conf->device_lock);
-			if (!ahead_of_reshape(mddev, logical_sector,
-					      conf->reshape_progress))
-				/* mismatch, need to try again */
-				must_retry = 1;
-			spin_unlock_irq(&conf->device_lock);
-			if (must_retry) {
-				raid5_release_stripe(sh);
-				schedule();
-				do_prepare = true;
-				goto retry;
-			}
-		}
 
-		if (read_seqcount_retry(&conf->gen_lock, seq)) {
-			/* Might have got the wrong stripe_head by accident */
-			raid5_release_stripe(sh);
+		if (res == STRIPE_RETRY)
 			goto retry;
-		}
 
-		if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-		    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
-			/*
-			 * Stripe is busy expanding or add failed due to
-			 * overlap. Flush everything and wait a while.
-			 */
-			md_wakeup_thread(mddev->thread);
-			raid5_release_stripe(sh);
+		if (res == STRIPE_SCHEDULE_AND_RETRY) {
 			schedule();
 			do_prepare = true;
 			goto retry;
 		}
-
-		if (stripe_can_batch(sh))
-			stripe_add_to_batch_list(conf, sh);
-
-		if (do_flush) {
-			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
-			/* we only need flush for one stripe */
-			do_flush = false;
-		}
-
-		set_bit(STRIPE_HANDLE, &sh->state);
-		clear_bit(STRIPE_DELAYED, &sh->state);
-		if ((!sh->batch_head || sh == sh->batch_head) &&
-		    (bi->bi_opf & REQ_SYNC) &&
-		    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-			atomic_inc(&conf->preread_active_stripes);
-
-		release_stripe_plug(mddev, sh);
 	}
+
 	finish_wait(&conf->wait_for_overlap, &w);
 
 	if (rw == WRITE)
-- 
2.30.2



* [PATCH v3 07/15] md/raid5: Drop the do_prepare flag in raid5_make_request()
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

prepare_to_wait() can reasonably be called after schedule() instead of
setting a flag and preparing in the next loop iteration.

This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that the order
matters here. On the first iteration of the loop prepare_to_wait() is
already called first.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 26ef292842de..c58e70db204a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5918,7 +5918,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	const int rw = bio_data_dir(bi);
 	enum stripe_result res;
 	DEFINE_WAIT(w);
-	bool do_prepare;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = log_handle_flush_request(conf, bi);
@@ -5976,12 +5975,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
 		int seq;
 
-		do_prepare = false;
 	retry:
 		seq = read_seqcount_begin(&conf->gen_lock);
-		if (do_prepare)
-			prepare_to_wait(&conf->wait_for_overlap, &w,
-				TASK_UNINTERRUPTIBLE);
 
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
 					  bi, seq);
@@ -5993,7 +5988,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 		if (res == STRIPE_SCHEDULE_AND_RETRY) {
 			schedule();
-			do_prepare = true;
+			prepare_to_wait(&conf->wait_for_overlap, &w,
+					TASK_UNINTERRUPTIBLE);
 			goto retry;
 		}
 	}
-- 
2.30.2



* [PATCH v3 08/15] md/raid5: Move read_seqcount_begin() into make_stripe_request()
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
into make_stripe_request().

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c58e70db204a..345350d34623 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5800,14 +5800,16 @@ struct stripe_request_ctx {
 
 static enum stripe_result make_stripe_request(struct mddev *mddev,
 		struct r5conf *conf, struct stripe_request_ctx *ctx,
-		sector_t logical_sector, struct bio *bi, int seq)
+		sector_t logical_sector, struct bio *bi)
 {
 	const int rw = bio_data_dir(bi);
 	enum stripe_result ret;
 	struct stripe_head *sh;
 	sector_t new_sector;
 	int previous = 0;
-	int dd_idx;
+	int seq, dd_idx;
+
+	seq = read_seqcount_begin(&conf->gen_lock);
 
 	if (unlikely(conf->reshape_progress != MaxSector)) {
 		/*
@@ -5973,13 +5975,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
-		int seq;
-
 	retry:
-		seq = read_seqcount_begin(&conf->gen_lock);
-
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
-					  bi, seq);
+					  bi);
 		if (res == STRIPE_FAIL)
 			break;
 
-- 
2.30.2



* [PATCH v3 09/15] md/raid5: Refactor for loop in raid5_make_request() into while loop
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

The for loop with the retry label can be more cleanly expressed as a
while loop by moving the logical_sector increment into the success path.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/raid5.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 345350d34623..17ddaa41147c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5974,22 +5974,23 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
-	retry:
+	while (logical_sector < last_sector) {
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
 					  bi);
 		if (res == STRIPE_FAIL)
 			break;
 
 		if (res == STRIPE_RETRY)
-			goto retry;
+			continue;
 
 		if (res == STRIPE_SCHEDULE_AND_RETRY) {
 			schedule();
 			prepare_to_wait(&conf->wait_for_overlap, &w,
 					TASK_UNINTERRUPTIBLE);
-			goto retry;
+			continue;
 		}
+
+		logical_sector += RAID5_STRIPE_SECTORS(conf);
 	}
 
 	finish_wait(&conf->wait_for_overlap, &w);
-- 
2.30.2



* [PATCH v3 10/15] md/raid5: Keep a reference to last stripe_head for batch
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.

Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.

The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
---
 drivers/md/raid5.c | 52 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 40 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 17ddaa41147c..34f8d6c18bd3 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -843,7 +843,8 @@ static bool stripe_can_batch(struct stripe_head *sh)
 }
 
 /* we only do back search */
-static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
+static void stripe_add_to_batch_list(struct r5conf *conf,
+		struct stripe_head *sh, struct stripe_head *last_sh)
 {
 	struct stripe_head *head;
 	sector_t head_sector, tmp_sec;
@@ -856,15 +857,20 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
 		return;
 	head_sector = sh->sector - RAID5_STRIPE_SECTORS(conf);
 
-	hash = stripe_hash_locks_hash(conf, head_sector);
-	spin_lock_irq(conf->hash_locks + hash);
-	head = find_get_stripe(conf, head_sector, conf->generation, hash);
-	spin_unlock_irq(conf->hash_locks + hash);
-
-	if (!head)
-		return;
-	if (!stripe_can_batch(head))
-		goto out;
+	if (last_sh && head_sector == last_sh->sector) {
+		head = last_sh;
+		atomic_inc(&head->count);
+	} else {
+		hash = stripe_hash_locks_hash(conf, head_sector);
+		spin_lock_irq(conf->hash_locks + hash);
+		head = find_get_stripe(conf, head_sector, conf->generation,
+				       hash);
+		spin_unlock_irq(conf->hash_locks + hash);
+		if (!head)
+			return;
+		if (!stripe_can_batch(head))
+			goto out;
+	}
 
 	lock_two_stripes(head, sh);
 	/* clear_batch_ready clear the flag */
@@ -5794,6 +5800,8 @@ enum stripe_result {
 };
 
 struct stripe_request_ctx {
+	/* a reference to the last stripe_head for batching */
+	struct stripe_head *batch_last;
 	/* the request had REQ_PREFLUSH, cleared after the first stripe_head */
 	bool do_flush;
 };
@@ -5888,8 +5896,13 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 		goto out_release;
 	}
 
-	if (stripe_can_batch(sh))
-		stripe_add_to_batch_list(conf, sh);
+	if (stripe_can_batch(sh)) {
+		stripe_add_to_batch_list(conf, sh, ctx->batch_last);
+		if (ctx->batch_last)
+			raid5_release_stripe(ctx->batch_last);
+		atomic_inc(&sh->count);
+		ctx->batch_last = sh;
+	}
 
 	if (ctx->do_flush) {
 		set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
@@ -5984,6 +5997,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			continue;
 
 		if (res == STRIPE_SCHEDULE_AND_RETRY) {
+			/*
+			 * Must release the reference to batch_last before
+			 * scheduling and waiting for work to be done,
+			 * otherwise the batch_last stripe head could prevent
+			 * raid5_activate_delayed() from making progress
+			 * and thus deadlocking.
+			 */
+			if (ctx.batch_last) {
+				raid5_release_stripe(ctx.batch_last);
+				ctx.batch_last = NULL;
+			}
+
 			schedule();
 			prepare_to_wait(&conf->wait_for_overlap, &w,
 					TASK_UNINTERRUPTIBLE);
@@ -5995,6 +6020,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (ctx.batch_last)
+		raid5_release_stripe(ctx.batch_last);
+
 	if (rw == WRITE)
 		md_write_end(mddev);
 	bio_endio(bi);
-- 
2.30.2



* [PATCH v3 11/15] md/raid5: Refactor add_stripe_bio()
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.

This will be useful in the next patch so that overlap can be checked
for multiple disks before adding any bios.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/raid5.c | 86 ++++++++++++++++++++++++++++++----------------
 1 file changed, 56 insertions(+), 30 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 34f8d6c18bd3..f12773c2387a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3415,39 +3415,32 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		s->locked, s->ops_request);
 }
 
-/*
- * Each stripe/dev can have one or more bion attached.
- * toread/towrite point to the first in a chain.
- * The bi_next chain must be in order.
- */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
-			  int forwrite, int previous)
+static bool stripe_bio_overlaps(struct stripe_head *sh, struct bio *bi,
+				int dd_idx, int forwrite)
 {
-	struct bio **bip;
 	struct r5conf *conf = sh->raid_conf;
-	int firstwrite=0;
+	struct bio **bip;
 
-	pr_debug("adding bi b#%llu to stripe s#%llu\n",
-		(unsigned long long)bi->bi_iter.bi_sector,
-		(unsigned long long)sh->sector);
+	pr_debug("checking bi b#%llu to stripe s#%llu\n",
+		 bi->bi_iter.bi_sector, sh->sector);
 
-	spin_lock_irq(&sh->stripe_lock);
 	/* Don't allow new IO added to stripes in batch list */
 	if (sh->batch_head)
-		goto overlap;
-	if (forwrite) {
+		return true;
+
+	if (forwrite)
 		bip = &sh->dev[dd_idx].towrite;
-		if (*bip == NULL)
-			firstwrite = 1;
-	} else
+	else
 		bip = &sh->dev[dd_idx].toread;
+
 	while (*bip && (*bip)->bi_iter.bi_sector < bi->bi_iter.bi_sector) {
 		if (bio_end_sector(*bip) > bi->bi_iter.bi_sector)
-			goto overlap;
-		bip = & (*bip)->bi_next;
+			return true;
+		bip = &(*bip)->bi_next;
 	}
+
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
-		goto overlap;
+		return true;
 
 	if (forwrite && raid5_has_ppl(conf)) {
 		/*
@@ -3476,9 +3469,30 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		}
 
 		if (first + conf->chunk_sectors * (count - 1) != last)
-			goto overlap;
+			return true;
 	}
 
+	return false;
+}
+
+static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
+			     int dd_idx, int forwrite, int previous)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct bio **bip;
+	int firstwrite = 0;
+
+	if (forwrite) {
+		bip = &sh->dev[dd_idx].towrite;
+		if (!*bip)
+			firstwrite = 1;
+	} else {
+		bip = &sh->dev[dd_idx].toread;
+	}
+
+	while (*bip && (*bip)->bi_iter.bi_sector < bi->bi_iter.bi_sector)
+		bip = &(*bip)->bi_next;
+
 	if (!forwrite || previous)
 		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
@@ -3505,8 +3519,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	}
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
-		(unsigned long long)(*bip)->bi_iter.bi_sector,
-		(unsigned long long)sh->sector, dd_idx);
+		 (*bip)->bi_iter.bi_sector, sh->sector, dd_idx);
 
 	if (conf->mddev->bitmap && firstwrite) {
 		/* Cannot hold spinlock over bitmap_startwrite,
@@ -3514,7 +3527,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		 * we have added to the bitmap and set bm_seq.
 		 * So set STRIPE_BITMAP_PENDING to prevent
 		 * batching.
-		 * If multiple add_stripe_bio() calls race here they
+		 * If multiple __add_stripe_bio() calls race here they
 		 * much all set STRIPE_BITMAP_PENDING.  So only the first one
 		 * to complete "bitmap_startwrite" gets to set
 		 * STRIPE_BIT_DELAY.  This is important as once a stripe
@@ -3532,14 +3545,27 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 			set_bit(STRIPE_BIT_DELAY, &sh->state);
 		}
 	}
-	spin_unlock_irq(&sh->stripe_lock);
+}
 
-	return 1;
+/*
+ * Each stripe/dev can have one or more bios attached.
+ * toread/towrite point to the first in a chain.
+ * The bi_next chain must be in order.
+ */
+static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
+			   int dd_idx, int forwrite, int previous)
+{
+	spin_lock_irq(&sh->stripe_lock);
 
- overlap:
-	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+		set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+		spin_unlock_irq(&sh->stripe_lock);
+		return false;
+	}
+
+	__add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
 	spin_unlock_irq(&sh->stripe_lock);
-	return 0;
+	return true;
 }
 
 static void end_reshape(struct r5conf *conf);
-- 
2.30.2



* [PATCH v3 12/15] md/raid5: Check all disks in a stripe_head for reshape progress
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subsequent
patch.

To do this more cleanly, refactor the check into a helper function
called stripe_ahead_of_reshape().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/raid5.c | 53 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 39 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index f12773c2387a..b27b754ee18c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5818,6 +5818,40 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
 					  sector >= reshape_sector;
 }
 
+static bool range_ahead_of_reshape(struct mddev *mddev, sector_t min,
+				   sector_t max, sector_t reshape_sector)
+{
+	return mddev->reshape_backwards ? max < reshape_sector :
+					  min >= reshape_sector;
+}
+
+static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf,
+				    struct stripe_head *sh)
+{
+	sector_t max_sector = 0, min_sector = MaxSector;
+	bool ret = false;
+	int dd_idx;
+
+	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+		if (dd_idx == sh->pd_idx)
+			continue;
+
+		min_sector = min(min_sector, sh->dev[dd_idx].sector);
+		max_sector = max(max_sector, sh->dev[dd_idx].sector);
+	}
+
+	spin_lock_irq(&conf->device_lock);
+
+	if (!range_ahead_of_reshape(mddev, min_sector, max_sector,
+				     conf->reshape_progress))
+		/* mismatch, need to try again */
+		ret = true;
+
+	spin_unlock_irq(&conf->device_lock);
+
+	return ret;
+}
+
 enum stripe_result {
 	STRIPE_SUCCESS = 0,
 	STRIPE_RETRY,
@@ -5882,27 +5916,18 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 		return STRIPE_FAIL;
 	}
 
-	if (unlikely(previous)) {
+	if (unlikely(previous) &&
+	    stripe_ahead_of_reshape(mddev, conf, sh)) {
 		/*
-		 * Expansion might have moved on while waiting for a
-		 * stripe, so we must do the range check again.
+		 * Expansion moved on while waiting for a stripe.
 		 * Expansion could still move past after this
 		 * test, but as we are holding a reference to
 		 * 'sh', we know that if that happens,
 		 *  STRIPE_EXPANDING will get set and the expansion
 		 * won't proceed until we finish with the stripe.
 		 */
-		int must_retry = 0;
-		spin_lock_irq(&conf->device_lock);
-		if (!ahead_of_reshape(mddev, logical_sector,
-				      conf->reshape_progress))
-			/* mismatch, need to try again */
-			must_retry = 1;
-		spin_unlock_irq(&conf->device_lock);
-		if (must_retry) {
-			ret = STRIPE_SCHEDULE_AND_RETRY;
-			goto out_release;
-		}
+		ret = STRIPE_SCHEDULE_AND_RETRY;
+		goto out_release;
 	}
 
 	if (read_seqcount_retry(&conf->gen_lock, seq)) {
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 13/15] md/raid5: Pivot raid5_make_request()
  2022-06-16 19:19 [PATCH v3 00/15] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (11 preceding siblings ...)
  2022-06-16 19:19 ` [PATCH v3 12/15] md/raid5: Check all disks in a stripe_head for reshape progress Logan Gunthorpe
@ 2022-06-16 19:19 ` Logan Gunthorpe
  2022-06-16 19:19 ` [PATCH v3 14/15] md/raid5: Improve debug prints Logan Gunthorpe
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 21+ messages in thread
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

raid5_make_request() loops through every page in the request, finds
the appropriate stripe and adds the bio for that page to the
corresponding disk in that stripe.

This causes a great deal of contention on the hash_lock, and extra
work, since each stripe must be found once for every data disk.

The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.

To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.

Finding the next sector to process is then just a call to
find_first_bit(), so sectors that have already been handled can simply
be skipped.
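
In outline, the tracking scheme looks like this (a simplified sketch of
the patch below; nr_stripes stands in for the DIV_ROUND_UP() count and
bit_for() for the (sector - first_sector) >> RAID5_STRIPE_SHIFT(conf)
arithmetic in the patch):

	DECLARE_BITMAP(sectors_to_do, RAID5_MAX_REQ_STRIPES);

	bitmap_set(sectors_to_do, 0, nr_stripes);	/* all pending */
	...
	clear_bit(bit_for(dev->sector), sectors_to_do);	/* added to a stripe */
	...
	s = find_first_bit(sectors_to_do, RAID5_MAX_REQ_STRIPES);
	if (s == RAID5_MAX_REQ_STRIPES)
		break;					/* request complete */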

One minor downside is that the maximum number of sectors for a request
must be limited so that the bitmap can be appropriately sized on the
stack. This limit is arbitrarily chosen to be 256 stripe pages, which
works out to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't
actually restrict the maximum request size any further, given that the
default block queue settings limit the number of segments to 128 (which
results in request sizes of approximately 512KB).

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 89 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 83 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b27b754ee18c..9e18dc9ae8b4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -61,6 +61,8 @@
 #define cpu_to_group(cpu) cpu_to_node(cpu)
 #define ANY_GROUP NUMA_NO_NODE
 
+#define RAID5_MAX_REQ_STRIPES 256
+
 static bool devices_handle_discard_safely = false;
 module_param(devices_handle_discard_safely, bool, 0644);
 MODULE_PARM_DESC(devices_handle_discard_safely,
@@ -5862,10 +5864,69 @@ enum stripe_result {
 struct stripe_request_ctx {
 	/* a reference to the last stripe_head for batching */
 	struct stripe_head *batch_last;
+
+	/* first sector in the request */
+	sector_t first_sector;
+
+	/* last sector in the request */
+	sector_t last_sector;
+
+	/* bitmap to track stripe sectors that have been added to stripes */
+	DECLARE_BITMAP(sectors_to_do, RAID5_MAX_REQ_STRIPES);
+
 	/* the request had REQ_PREFLUSH, cleared after the first stripe_head */
 	bool do_flush;
 };
 
+static int add_all_stripe_bios(struct r5conf *conf,
+		struct stripe_request_ctx *ctx, struct stripe_head *sh,
+		struct bio *bi, int forwrite, int previous)
+{
+	int dd_idx;
+	int ret = 1;
+
+	spin_lock_irq(&sh->stripe_lock);
+
+	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+		struct r5dev *dev = &sh->dev[dd_idx];
+
+		if (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
+			continue;
+
+		if (dev->sector < ctx->first_sector ||
+		    dev->sector >= ctx->last_sector)
+			continue;
+
+		if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+			set_bit(R5_Overlap, &dev->flags);
+			ret = 0;
+			continue;
+		}
+	}
+
+	if (!ret)
+		goto out;
+
+	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+		struct r5dev *dev = &sh->dev[dd_idx];
+
+		if (dd_idx == sh->pd_idx || dd_idx == sh->qd_idx)
+			continue;
+
+		if (dev->sector < ctx->first_sector ||
+		    dev->sector >= ctx->last_sector)
+			continue;
+
+		__add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
+		clear_bit((dev->sector - ctx->first_sector) >>
+			  RAID5_STRIPE_SHIFT(conf), ctx->sectors_to_do);
+	}
+
+out:
+	spin_unlock_irq(&sh->stripe_lock);
+	return ret;
+}
+
 static enum stripe_result make_stripe_request(struct mddev *mddev,
 		struct r5conf *conf, struct stripe_request_ctx *ctx,
 		sector_t logical_sector, struct bio *bi)
@@ -5937,7 +5998,7 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 	}
 
 	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+	    !add_all_stripe_bios(conf, ctx, sh, bi, rw, previous)) {
 		/*
 		 * Stripe is busy expanding or add failed due to
 		 * overlap. Flush everything and wait a while.
@@ -5979,11 +6040,12 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
-	sector_t logical_sector, last_sector;
+	sector_t logical_sector;
 	struct stripe_request_ctx ctx = {};
 	const int rw = bio_data_dir(bi);
 	enum stripe_result res;
 	DEFINE_WAIT(w);
+	int s;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = log_handle_flush_request(conf, bi);
@@ -6023,9 +6085,14 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
-	last_sector = bio_end_sector(bi);
+	ctx.first_sector = logical_sector;
+	ctx.last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 
+	bitmap_set(ctx.sectors_to_do, 0,
+		   DIV_ROUND_UP(ctx.last_sector - logical_sector,
+				RAID5_STRIPE_SECTORS(conf)));
+
 	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
 	if ((bi->bi_opf & REQ_NOWAIT) &&
 	    (conf->reshape_progress != MaxSector) &&
@@ -6038,7 +6105,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-	while (logical_sector < last_sector) {
+	while (1) {
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
 					  bi);
 		if (res == STRIPE_FAIL)
@@ -6066,7 +6133,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			continue;
 		}
 
-		logical_sector += RAID5_STRIPE_SECTORS(conf);
+		s = find_first_bit(ctx.sectors_to_do, RAID5_MAX_REQ_STRIPES);
+		if (s == RAID5_MAX_REQ_STRIPES)
+			break;
+
+		logical_sector = ctx.first_sector +
+			(s << RAID5_STRIPE_SHIFT(conf));
 	}
 
 	finish_wait(&conf->wait_for_overlap, &w);
@@ -7923,7 +7995,12 @@ static int raid5_run(struct mddev *mddev)
 		    mddev->queue->limits.discard_granularity < stripe)
 			blk_queue_max_discard_sectors(mddev->queue, 0);
 
-		blk_queue_max_hw_sectors(mddev->queue, UINT_MAX);
+		/*
+		 * Requests require having a bitmap for each stripe.
+		 * Limit the max sectors based on this.
+		 */
+		blk_queue_max_hw_sectors(mddev->queue,
+			RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf));
 	}
 
 	if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 14/15] md/raid5: Improve debug prints
  2022-06-16 19:19 [PATCH v3 00/15] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (12 preceding siblings ...)
  2022-06-16 19:19 ` [PATCH v3 13/15] md/raid5: Pivot raid5_make_request() Logan Gunthorpe
@ 2022-06-16 19:19 ` Logan Gunthorpe
  2022-06-16 19:19 ` [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request Logan Gunthorpe
  2022-06-23  4:48 ` [PATCH v3 00/15] Improve Raid5 Lock Contention Song Liu
  15 siblings, 0 replies; 21+ messages in thread
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Add a debug print to raid5_make_request() so that each request is
printed, and add the logical sector number to the debug print in
__add_stripe_bio().
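
(These are pr_debug() calls, so they are silent by default; with
CONFIG_DYNAMIC_DEBUG they can be enabled at runtime, e.g. via
echo 'file raid5.c +p' > /sys/kernel/debug/dynamic_debug/control.)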

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9e18dc9ae8b4..e48c16bfe657 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3520,8 +3520,9 @@ static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
 				sh->overwrite_disks++;
 	}
 
-	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
-		 (*bip)->bi_iter.bi_sector, sh->sector, dd_idx);
+	pr_debug("added bi b#%llu to stripe s#%llu, disk %d, logical %llu\n",
+		 (*bip)->bi_iter.bi_sector, sh->sector, dd_idx,
+		 sh->dev[dd_idx].sector);
 
 	if (conf->mddev->bitmap && firstwrite) {
 		/* Cannot hold spinlock over bitmap_startwrite,
@@ -6093,6 +6094,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 		   DIV_ROUND_UP(ctx.last_sector - logical_sector,
 				RAID5_STRIPE_SECTORS(conf)));
 
+	pr_debug("raid456: %s, logical %llu to %llu\n", __func__,
+		 bi->bi_iter.bi_sector, ctx.last_sector);
+
 	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
 	if ((bi->bi_opf & REQ_NOWAIT) &&
 	    (conf->reshape_progress != MaxSector) &&
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request
  2022-06-16 19:19 [PATCH v3 00/15] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (13 preceding siblings ...)
  2022-06-16 19:19 ` [PATCH v3 14/15] md/raid5: Improve debug prints Logan Gunthorpe
@ 2022-06-16 19:19 ` Logan Gunthorpe
  2022-06-16 21:32   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
  2022-06-23  4:48 ` [PATCH v3 00/15] Improve Raid5 Lock Contention Song Liu
  15 siblings, 1 reply; 21+ messages in thread
From: Logan Gunthorpe @ 2022-06-16 19:19 UTC (permalink / raw)
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

The block layer defaults the maximum number of segments to 128, which
means requests tend to get split around the 512KB mark, depending on
how many pages can be merged. There's no such restriction in the raid5
code, so increase the limit to USHRT_MAX so that larger requests can be
sent as one.
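
(For reference, assuming 4KB pages and no merging: 128 segments of one
page each works out to 128 * 4KB = 512KB, hence the typical split
point.)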

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index e48c16bfe657..5723a497108a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8005,6 +8005,9 @@ static int raid5_run(struct mddev *mddev)
 		 */
 		blk_queue_max_hw_sectors(mddev->queue,
 			RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf));
+
+		/* No restrictions on the number of segments in the request */
+		blk_queue_max_segments(mddev->queue, USHRT_MAX);
 	}
 
 	if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 21+ messages in thread

* RE: [Non-DoD Source] [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request
  2022-06-16 19:19 ` [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request Logan Gunthorpe
@ 2022-06-16 21:32   ` Finlayson, James M CIV (USA)
  2022-06-16 21:35     ` Logan Gunthorpe
  0 siblings, 1 reply; 21+ messages in thread
From: Finlayson, James M CIV (USA) @ 2022-06-16 21:32 UTC (permalink / raw)
  To: 'Logan Gunthorpe', linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Finlayson, James M CIV (USA)

Innocent question from a position of ignorance... I see these last 15
check-ins all as performance improvements. I tend to push hard on
mdraid performance, but have RAID6 needs... are these optimizations
also available for RAID6, are they in process, or did I just ask a
silly question?

Regards,
Jim Finlayson
US Department of Defense

-----Original Message-----
From: Logan Gunthorpe <logang@deltatee.com> 
Sent: Thursday, June 16, 2022 3:20 PM
To: linux-kernel@vger.kernel.org; linux-raid@vger.kernel.org; Song Liu <song@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>; Guoqing Jiang <guoqing.jiang@linux.dev>; Stephen Bates <sbates@raithlin.com>; Martin Oliveira <Martin.Oliveira@eideticom.com>; David Sloan <David.Sloan@eideticom.com>; Logan Gunthorpe <logang@deltatee.com>
Subject: [Non-DoD Source] [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request

The block layer defaults the maximum number of segments to 128, which means requests tend to get split around the 512KB mark, depending on how many pages can be merged. There's no such restriction in the raid5 code, so increase the limit to USHRT_MAX so that larger requests can be sent as one.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e48c16bfe657..5723a497108a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -8005,6 +8005,9 @@ static int raid5_run(struct mddev *mddev)
 		 */
 		blk_queue_max_hw_sectors(mddev->queue,
 			RAID5_MAX_REQ_STRIPES << RAID5_STRIPE_SHIFT(conf));
+
+		/* No restrictions on the number of segments in the request */
+		blk_queue_max_segments(mddev->queue, USHRT_MAX);
 	}
 
 	if (log_init(conf, journal_dev, raid5_has_ppl(conf)))
--
2.30.2


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [Non-DoD Source] [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request
  2022-06-16 21:32   ` [Non-DoD Source] " Finlayson, James M CIV (USA)
@ 2022-06-16 21:35     ` Logan Gunthorpe
  0 siblings, 0 replies; 21+ messages in thread
From: Logan Gunthorpe @ 2022-06-16 21:35 UTC (permalink / raw)
  To: Finlayson, James M CIV (USA), linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan



On 2022-06-16 15:32, Finlayson, James M CIV (USA) wrote:
> Innocent question from a position of ignorance... I see these last 15 check-ins all as performance improvements. I tend to push hard on mdraid performance, but have RAID6 needs... are these optimizations also available for RAID6, are they in process, or did I just ask a silly question?

Yes, the optimizations in this series should improve raid6 performance
in the exact same way.

Raid6 and Raid5 share the same implementation. They are both
(confusingly) in the raid5.c file.

Logan

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/15] Improve Raid5 Lock Contention
  2022-06-16 19:19 [PATCH v3 00/15] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (14 preceding siblings ...)
  2022-06-16 19:19 ` [PATCH v3 15/15] md/raid5: Increase restriction on max segments per request Logan Gunthorpe
@ 2022-06-23  4:48 ` Song Liu
  2022-07-03  7:51   ` Christoph Hellwig
  15 siblings, 1 reply; 21+ messages in thread
From: Song Liu @ 2022-06-23  4:48 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: open list, linux-raid, Christoph Hellwig, Guoqing Jiang,
	Stephen Bates, Martin Oliveira, David Sloan

On Thu, Jun 16, 2022 at 12:20 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hi,
>
> Now that I've done some cleanup of the mdadm testing infrastructure
> as well as a lot of long run testing and bug fixes I'm much more
> confident in the correctness of this series. The previous posting is
> at [1].

Applied to md-next. Thanks for the great work!

Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/15] Improve Raid5 Lock Contention
  2022-06-23  4:48 ` [PATCH v3 00/15] Improve Raid5 Lock Contention Song Liu
@ 2022-07-03  7:51   ` Christoph Hellwig
  2022-07-03 14:54     ` Song Liu
  0 siblings, 1 reply; 21+ messages in thread
From: Christoph Hellwig @ 2022-07-03  7:51 UTC (permalink / raw)
  To: Song Liu
  Cc: Logan Gunthorpe, open list, linux-raid, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Jun 22, 2022 at 09:48:02PM -0700, Song Liu wrote:
> > Now that I've done some cleanup of the mdadm testing infrastructure
> > as well as a lot of long run testing and bug fixes I'm much more
> > confident in the correctness of this series. The previous posting is
> > at [1].
> 
> Applied to md-next. Thanks for the great work!

They still don't seem to have made it to linux-next.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v3 00/15] Improve Raid5 Lock Contention
  2022-07-03  7:51   ` Christoph Hellwig
@ 2022-07-03 14:54     ` Song Liu
  0 siblings, 0 replies; 21+ messages in thread
From: Song Liu @ 2022-07-03 14:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Logan Gunthorpe, open list, linux-raid, Guoqing Jiang,
	Stephen Bates, Martin Oliveira, David Sloan

On Sun, Jul 3, 2022 at 12:51 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Wed, Jun 22, 2022 at 09:48:02PM -0700, Song Liu wrote:
> > > Now that I've done some cleanup of the mdadm testing infrastructure
> > > as well as a lot of long run testing and bug fixes I'm much more
> > > confident in the correctness of this series. The previous posting is
> > > at [1].
> >
> > Applied to md-next. Thanks for the great work!
>
> They still don't seem to have made it to linux-next.

I will send pull request to Jens soon. Thanks for the reminder.

Song

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2022-07-03 14:54 UTC | newest]
