* [PATCH v2 00/12] Improve Raid5 Lock Contention
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Hi,

This is v2 of this series which addresses Christoph's feedback and
fixes some bugs. The first posting is at [1]. A git branch is
available at [2].

--

I've been doing some work trying to improve the bulk write performance
of raid5 on large systems with fast NVMe drives. The bottleneck appears
largely to be lock contention on the hash_lock and device_lock. This
series improves the situation slightly by addressing a couple of
low-hanging-fruit opportunities to take the locks fewer times in the
request path.

Patch 9 adjusts how batching works by keeping a reference to the
previous stripe_head in raid5_make_request(). In most situations,
this removes the need to take the hash_lock in stripe_add_to_batch_list(),
which should reduce the number of times the lock is taken by roughly
a factor of 2.
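
In rough outline (the real change is in patch 9), the request context keeps
a reference to the last batched stripe_head so the next stripe can be linked
to the batch without another lookup under the hash_lock:

	if (stripe_can_batch(sh)) {
		stripe_add_to_batch_list(conf, sh, ctx->batch_last);
		if (ctx->batch_last)
			raid5_release_stripe(ctx->batch_last);
		atomic_inc(&sh->count);
		ctx->batch_last = sh;
	}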

Patch 12 pivots the way raid5_make_request() works. Before the patch, the
code must find the stripe_head for every 4KB page in the request, so each
stripe head must be found once for every data disk. The patch changes this
so that all the data disks can be added to a stripe_head at once and the
number of times the stripe_head must be found (and thus the number of
times the hash_lock is taken) should be reduced by a factor roughly equal
to the number of data disks.
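
Sketched at a high level (this is not the literal code; error, reshape and
retry handling are omitted), the loop becomes one stripe_head lookup per
stripe instead of one per page:

	/* sketch only -- see patch 12 for the actual change */
	while (logical_sector < last_sector) {
		sh = raid5_get_active_stripe(conf, new_sector, ...);
		add_all_stripe_bios(sh, bi, logical_sector, last_sector,
				    rw, previous);
		logical_sector += RAID5_STRIPE_SECTORS(conf);
	}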

The remaining patches are just cleanup and prep patches for those two
patches.

In apples-to-apples testing of this series on a small VM with 5 ram
disks, I saw a bandwidth increase of roughly 14%, and lock contention
on the hash_lock (as reported by lock_stat) dropped by more than a factor
of 5 (though it is still significantly contended).

Testing on larger systems with NVMe drives showed similar small bandwidth
increases of 3% to 20%, depending on the parameters. Oddly, smaller arrays
had larger gains, likely because they have lower starting bandwidths; I
would have expected larger gains with larger arrays, given that even fewer
locks should be taken in raid5_make_request().

Logan

[1] https://lkml.kernel.org/r/20220407164511.8472-1-logang@deltatee.com
[2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2

--

Changes since v1:
  - Rebased on current md-next branch (190a901246c69d79)
  - Added patch to create a helper for checking if a sector
    is ahead of the reshape (per Christoph)
  - Reworked the __find_stripe() patch to create a find_get_stripe()
    helper (per Christoph)
  - Added more patches to further refactor raid5_make_request() and
    pull most of the loop body into a helper function (per Christoph)
  - A few other minor cleanups (boolean return, dropping casting when
    printing sectors, commit message grammar) as suggested by Christoph.
  - Fixed two uncommon but serious data corruption bugs that were found.

--

Logan Gunthorpe (12):
  md/raid5: Factor out ahead_of_reshape() function
  md/raid5: Refactor raid5_make_request loop
  md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
  md/raid5: Move common stripe count increment code into __find_stripe()
  md/raid5: Factor out helper from raid5_make_request() loop
  md/raid5: Drop the do_prepare flag in raid5_make_request()
  md/raid5: Move read_seqcount_begin() into make_stripe_request()
  md/raid5: Refactor for loop in raid5_make_request() into while loop
  md/raid5: Keep a reference to last stripe_head for batch
  md/raid5: Refactor add_stripe_bio()
  md/raid5: Check all disks in a stripe_head for reshape progress
  md/raid5: Pivot raid5_make_request()

 drivers/md/raid5.c | 632 +++++++++++++++++++++++++++++----------------
 drivers/md/raid5.h |   1 +
 2 files changed, 410 insertions(+), 223 deletions(-)


base-commit: 190a901246c69d79dadd1ab574022548da612724
--
2.30.2


* [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

There are a few uses of an ugly ternary operator in raid5_make_request()
to check whether a sector is ahead of the reshape sector.

Factor this out into a simple helper called ahead_of_reshape().

This also appears to fix the first bio_wouldblock_error() check, whose
comparison operators did not match the check below that causes a schedule.
Besides this, no functional changes are intended.
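
For the reshape_backwards case, for instance, the old REQ_NOWAIT check in
the diff below tested:

	logical_sector > conf->reshape_progress &&
	logical_sector <= conf->reshape_safe

while the scheduling check further down used < and >=. With the helper,
both spots reduce to the same expression:

	!ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
	ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)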

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7f7d1546b9ba..97b23c18402b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5787,6 +5787,15 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
 	bio_endio(bi);
 }
 
+static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
+			     sector_t reshape_sector)
+{
+	if (mddev->reshape_backwards)
+		return sector < reshape_sector;
+	else
+		return sector >= reshape_sector;
+}
+
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
@@ -5843,9 +5852,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
 	if ((bi->bi_opf & REQ_NOWAIT) &&
 	    (conf->reshape_progress != MaxSector) &&
-	    (mddev->reshape_backwards
-	    ? (logical_sector > conf->reshape_progress && logical_sector <= conf->reshape_safe)
-	    : (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
+	    !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
+	    ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {
 		bio_wouldblock_error(bi);
 		if (rw == WRITE)
 			md_write_end(mddev);
@@ -5874,14 +5882,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			 * to check again.
 			 */
 			spin_lock_irq(&conf->device_lock);
-			if (mddev->reshape_backwards
-			    ? logical_sector < conf->reshape_progress
-			    : logical_sector >= conf->reshape_progress) {
+			if (ahead_of_reshape(mddev, logical_sector,
+					     conf->reshape_progress)) {
 				previous = 1;
 			} else {
-				if (mddev->reshape_backwards
-				    ? logical_sector < conf->reshape_safe
-				    : logical_sector >= conf->reshape_safe) {
+				if (ahead_of_reshape(mddev, logical_sector,
+						     conf->reshape_safe)) {
 					spin_unlock_irq(&conf->device_lock);
 					schedule();
 					do_prepare = true;
@@ -5912,9 +5918,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 				 */
 				int must_retry = 0;
 				spin_lock_irq(&conf->device_lock);
-				if (mddev->reshape_backwards
-				    ? logical_sector >= conf->reshape_progress
-				    : logical_sector < conf->reshape_progress)
+				if (!ahead_of_reshape(mddev, logical_sector,
+						      conf->reshape_progress))
 					/* mismatch, need to try again */
 					must_retry = 1;
 				spin_unlock_irq(&conf->device_lock);
-- 
2.30.2



* [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().

This makes the code easier to read and reduces the indentation level.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 109 +++++++++++++++++++++++----------------------
 1 file changed, 55 insertions(+), 54 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 97b23c18402b..cda6857e6207 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5906,68 +5906,69 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 		sh = raid5_get_active_stripe(conf, new_sector, previous,
 				       (bi->bi_opf & REQ_RAHEAD), 0);
-		if (sh) {
-			if (unlikely(previous)) {
-				/* expansion might have moved on while waiting for a
-				 * stripe, so we must do the range check again.
-				 * Expansion could still move past after this
-				 * test, but as we are holding a reference to
-				 * 'sh', we know that if that happens,
-				 *  STRIPE_EXPANDING will get set and the expansion
-				 * won't proceed until we finish with the stripe.
-				 */
-				int must_retry = 0;
-				spin_lock_irq(&conf->device_lock);
-				if (!ahead_of_reshape(mddev, logical_sector,
-						      conf->reshape_progress))
-					/* mismatch, need to try again */
-					must_retry = 1;
-				spin_unlock_irq(&conf->device_lock);
-				if (must_retry) {
-					raid5_release_stripe(sh);
-					schedule();
-					do_prepare = true;
-					goto retry;
-				}
-			}
-			if (read_seqcount_retry(&conf->gen_lock, seq)) {
-				/* Might have got the wrong stripe_head
-				 * by accident
-				 */
-				raid5_release_stripe(sh);
-				goto retry;
-			}
+		if (unlikely(!sh)) {
+			/* cannot get stripe, just give-up */
+			bi->bi_status = BLK_STS_IOERR;
+			break;
+		}
 
-			if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-			    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
-				/* Stripe is busy expanding or
-				 * add failed due to overlap.  Flush everything
-				 * and wait a while
-				 */
-				md_wakeup_thread(mddev->thread);
+		if (unlikely(previous)) {
+			/* expansion might have moved on while waiting for a
+			 * stripe, so we must do the range check again.
+			 * Expansion could still move past after this
+			 * test, but as we are holding a reference to
+			 * 'sh', we know that if that happens,
+			 *  STRIPE_EXPANDING will get set and the expansion
+			 * won't proceed until we finish with the stripe.
+			 */
+			int must_retry = 0;
+			spin_lock_irq(&conf->device_lock);
+			if (!ahead_of_reshape(mddev, logical_sector,
+					      conf->reshape_progress))
+				/* mismatch, need to try again */
+				must_retry = 1;
+			spin_unlock_irq(&conf->device_lock);
+			if (must_retry) {
 				raid5_release_stripe(sh);
 				schedule();
 				do_prepare = true;
 				goto retry;
 			}
-			if (do_flush) {
-				set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
-				/* we only need flush for one stripe */
-				do_flush = false;
-			}
+		}
 
-			set_bit(STRIPE_HANDLE, &sh->state);
-			clear_bit(STRIPE_DELAYED, &sh->state);
-			if ((!sh->batch_head || sh == sh->batch_head) &&
-			    (bi->bi_opf & REQ_SYNC) &&
-			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				atomic_inc(&conf->preread_active_stripes);
-			release_stripe_plug(mddev, sh);
-		} else {
-			/* cannot get stripe for read-ahead, just give-up */
-			bi->bi_status = BLK_STS_IOERR;
-			break;
+		if (read_seqcount_retry(&conf->gen_lock, seq)) {
+			/* Might have got the wrong stripe_head by accident */
+			raid5_release_stripe(sh);
+			goto retry;
+		}
+
+		if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+		    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+			/*
+			 * Stripe is busy expanding or add failed due to
+			 * overlap. Flush everything and wait a while.
+			 */
+			md_wakeup_thread(mddev->thread);
+			raid5_release_stripe(sh);
+			schedule();
+			do_prepare = true;
+			goto retry;
 		}
+
+		if (do_flush) {
+			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+			/* we only need flush for one stripe */
+			do_flush = false;
+		}
+
+		set_bit(STRIPE_HANDLE, &sh->state);
+		clear_bit(STRIPE_DELAYED, &sh->state);
+		if ((!sh->batch_head || sh == sh->batch_head) &&
+		    (bi->bi_opf & REQ_SYNC) &&
+		    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			atomic_inc(&conf->preread_active_stripes);
+
+		release_stripe_plug(mddev, sh);
 	}
 	finish_wait(&conf->wait_for_overlap, &w);
 
-- 
2.30.2



* [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe, Christoph Hellwig

stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.

The call to add_stripe_bio() in retry_aligned_read() is for reads, and
batching only applies to writes, so batching cannot happen at that call
site.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 drivers/md/raid5.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index cda6857e6207..8e1ece5ce984 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3534,8 +3534,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	}
 	spin_unlock_irq(&sh->stripe_lock);
 
-	if (stripe_can_batch(sh))
-		stripe_add_to_batch_list(conf, sh);
 	return 1;
 
  overlap:
@@ -5955,6 +5953,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			goto retry;
 		}
 
+		if (stripe_can_batch(sh))
+			stripe_add_to_batch_list(conf, sh);
+
 		if (do_flush) {
 			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
 			/* we only need flush for one stripe */
-- 
2.30.2



* [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe()
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Both uses of find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 133 ++++++++++++++++++++++-----------------------
 1 file changed, 65 insertions(+), 68 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 8e1ece5ce984..a0946af5b1ac 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -612,7 +612,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
 }
 
 static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
-					 short generation)
+					 short generation, int hash)
 {
 	struct stripe_head *sh;
 
@@ -624,6 +624,49 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
 	return NULL;
 }
 
+static struct stripe_head *find_get_stripe(struct r5conf *conf,
+		sector_t sector, short generation, int hash)
+{
+	int inc_empty_inactive_list_flag;
+	struct stripe_head *sh;
+
+	sh = __find_stripe(conf, sector, generation, hash);
+	if (!sh)
+		return NULL;
+
+	if (atomic_inc_not_zero(&sh->count))
+		return sh;
+
+	/*
+	 * Slow path. The reference count is zero which means the stripe must
+	 * be on a list (sh->lru). Must remove the stripe from the list that
+	 * references it with the device_lock held.
+	 */
+
+	spin_lock(&conf->device_lock);
+	if (!atomic_read(&sh->count)) {
+		if (!test_bit(STRIPE_HANDLE, &sh->state))
+			atomic_inc(&conf->active_stripes);
+		BUG_ON(list_empty(&sh->lru) &&
+		       !test_bit(STRIPE_EXPANDING, &sh->state));
+		inc_empty_inactive_list_flag = 0;
+		if (!list_empty(conf->inactive_list + hash))
+			inc_empty_inactive_list_flag = 1;
+		list_del_init(&sh->lru);
+		if (list_empty(conf->inactive_list + hash) &&
+		    inc_empty_inactive_list_flag)
+			atomic_inc(&conf->empty_inactive_list_nr);
+		if (sh->group) {
+			sh->group->stripes_cnt--;
+			sh->group = NULL;
+		}
+	}
+	atomic_inc(&sh->count);
+	spin_unlock(&conf->device_lock);
+
+	return sh;
+}
+
 /*
  * Need to check if array has failed when deciding whether to:
  *  - start an array
@@ -716,7 +759,6 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 {
 	struct stripe_head *sh;
 	int hash = stripe_hash_locks_hash(conf, sector);
-	int inc_empty_inactive_list_flag;
 
 	pr_debug("get_stripe, sector %llu\n", (unsigned long long)sector);
 
@@ -726,57 +768,34 @@ raid5_get_active_stripe(struct r5conf *conf, sector_t sector,
 		wait_event_lock_irq(conf->wait_for_quiescent,
 				    conf->quiesce == 0 || noquiesce,
 				    *(conf->hash_locks + hash));
-		sh = __find_stripe(conf, sector, conf->generation - previous);
-		if (!sh) {
-			if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
-				sh = get_free_stripe(conf, hash);
-				if (!sh && !test_bit(R5_DID_ALLOC,
-						     &conf->cache_state))
-					set_bit(R5_ALLOC_MORE,
-						&conf->cache_state);
-			}
-			if (noblock && sh == NULL)
-				break;
+		sh = find_get_stripe(conf, sector, conf->generation - previous,
+				     hash);
+		if (sh)
+			break;
 
-			r5c_check_stripe_cache_usage(conf);
-			if (!sh) {
-				set_bit(R5_INACTIVE_BLOCKED,
-					&conf->cache_state);
-				r5l_wake_reclaim(conf->log, 0);
-				wait_event_lock_irq(
-					conf->wait_for_stripe,
+		if (!test_bit(R5_INACTIVE_BLOCKED, &conf->cache_state)) {
+			sh = get_free_stripe(conf, hash);
+			if (!sh && !test_bit(R5_DID_ALLOC, &conf->cache_state))
+				set_bit(R5_ALLOC_MORE, &conf->cache_state);
+		}
+		if (noblock && !sh)
+			break;
+
+		r5c_check_stripe_cache_usage(conf);
+		if (!sh) {
+			set_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+			r5l_wake_reclaim(conf->log, 0);
+			wait_event_lock_irq(conf->wait_for_stripe,
 					!list_empty(conf->inactive_list + hash) &&
 					(atomic_read(&conf->active_stripes)
 					 < (conf->max_nr_stripes * 3 / 4)
 					 || !test_bit(R5_INACTIVE_BLOCKED,
 						      &conf->cache_state)),
 					*(conf->hash_locks + hash));
-				clear_bit(R5_INACTIVE_BLOCKED,
-					  &conf->cache_state);
-			} else {
-				init_stripe(sh, sector, previous);
-				atomic_inc(&sh->count);
-			}
-		} else if (!atomic_inc_not_zero(&sh->count)) {
-			spin_lock(&conf->device_lock);
-			if (!atomic_read(&sh->count)) {
-				if (!test_bit(STRIPE_HANDLE, &sh->state))
-					atomic_inc(&conf->active_stripes);
-				BUG_ON(list_empty(&sh->lru) &&
-				       !test_bit(STRIPE_EXPANDING, &sh->state));
-				inc_empty_inactive_list_flag = 0;
-				if (!list_empty(conf->inactive_list + hash))
-					inc_empty_inactive_list_flag = 1;
-				list_del_init(&sh->lru);
-				if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
-					atomic_inc(&conf->empty_inactive_list_nr);
-				if (sh->group) {
-					sh->group->stripes_cnt--;
-					sh->group = NULL;
-				}
-			}
+			clear_bit(R5_INACTIVE_BLOCKED, &conf->cache_state);
+		} else {
+			init_stripe(sh, sector, previous);
 			atomic_inc(&sh->count);
-			spin_unlock(&conf->device_lock);
 		}
 	} while (sh == NULL);
 
@@ -830,7 +849,6 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
 	sector_t head_sector, tmp_sec;
 	int hash;
 	int dd_idx;
-	int inc_empty_inactive_list_flag;
 
 	/* Don't cross chunks, so stripe pd_idx/qd_idx is the same */
 	tmp_sec = sh->sector;
@@ -840,28 +858,7 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
 
 	hash = stripe_hash_locks_hash(conf, head_sector);
 	spin_lock_irq(conf->hash_locks + hash);
-	head = __find_stripe(conf, head_sector, conf->generation);
-	if (head && !atomic_inc_not_zero(&head->count)) {
-		spin_lock(&conf->device_lock);
-		if (!atomic_read(&head->count)) {
-			if (!test_bit(STRIPE_HANDLE, &head->state))
-				atomic_inc(&conf->active_stripes);
-			BUG_ON(list_empty(&head->lru) &&
-			       !test_bit(STRIPE_EXPANDING, &head->state));
-			inc_empty_inactive_list_flag = 0;
-			if (!list_empty(conf->inactive_list + hash))
-				inc_empty_inactive_list_flag = 1;
-			list_del_init(&head->lru);
-			if (list_empty(conf->inactive_list + hash) && inc_empty_inactive_list_flag)
-				atomic_inc(&conf->empty_inactive_list_nr);
-			if (head->group) {
-				head->group->stripes_cnt--;
-				head->group = NULL;
-			}
-		}
-		atomic_inc(&head->count);
-		spin_unlock(&conf->device_lock);
-	}
+	head = find_get_stripe(conf, head_sector, conf->generation, hash);
 	spin_unlock_irq(conf->hash_locks + hash);
 
 	if (!head)
-- 
2.30.2



* [PATCH v2 05/12] md/raid5: Factor out helper from raid5_make_request() loop
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Factor out the inner loop of raid5_make_request() into its own helper
called make_stripe_request().

The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.

A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.

No functional changes intended. Subsequent patches will clean this up
further, untangling the gen_lock and do_prepare logic.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 225 +++++++++++++++++++++++++--------------------
 1 file changed, 125 insertions(+), 100 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index a0946af5b1ac..5a7334ba0997 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5791,17 +5791,131 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
 		return sector >= reshape_sector;
 }
 
+enum stripe_result {
+	STRIPE_SUCCESS = 0,
+	STRIPE_RETRY,
+	STRIPE_SCHEDULE_AND_RETRY,
+	STRIPE_FAIL,
+};
+
+struct stripe_request_ctx {
+	bool do_flush;
+};
+
+static enum stripe_result make_stripe_request(struct mddev *mddev,
+		struct r5conf *conf, struct stripe_request_ctx *ctx,
+		sector_t logical_sector, struct bio *bi, int seq)
+{
+	const int rw = bio_data_dir(bi);
+	struct stripe_head *sh;
+	sector_t new_sector;
+	int previous = 0;
+	int dd_idx;
+
+	if (unlikely(conf->reshape_progress != MaxSector)) {
+		/* spinlock is needed as reshape_progress may be
+		 * 64bit on a 32bit platform, and so it might be
+		 * possible to see a half-updated value
+		 * Of course reshape_progress could change after
+		 * the lock is dropped, so once we get a reference
+		 * to the stripe that we think it is, we will have
+		 * to check again.
+		 */
+		spin_lock_irq(&conf->device_lock);
+		if (ahead_of_reshape(mddev, logical_sector,
+				     conf->reshape_progress)) {
+			previous = 1;
+		} else {
+			if (ahead_of_reshape(mddev, logical_sector,
+					     conf->reshape_safe)) {
+				spin_unlock_irq(&conf->device_lock);
+				return STRIPE_SCHEDULE_AND_RETRY;
+			}
+		}
+		spin_unlock_irq(&conf->device_lock);
+	}
+
+	new_sector = raid5_compute_sector(conf, logical_sector, previous,
+					  &dd_idx, NULL);
+	pr_debug("raid456: %s, sector %llu logical %llu\n", __func__,
+		 new_sector, logical_sector);
+
+	sh = raid5_get_active_stripe(conf, new_sector, previous,
+				     (bi->bi_opf & REQ_RAHEAD), 0);
+	if (unlikely(!sh)) {
+		/* cannot get stripe, just give-up */
+		bi->bi_status = BLK_STS_IOERR;
+		return STRIPE_FAIL;
+	}
+
+	if (unlikely(previous)) {
+		/* expansion might have moved on while waiting for a
+		 * stripe, so we must do the range check again.
+		 * Expansion could still move past after this
+		 * test, but as we are holding a reference to
+		 * 'sh', we know that if that happens,
+		 *  STRIPE_EXPANDING will get set and the expansion
+		 * won't proceed until we finish with the stripe.
+		 */
+		int must_retry = 0;
+		spin_lock_irq(&conf->device_lock);
+		if (!ahead_of_reshape(mddev, logical_sector,
+				      conf->reshape_progress))
+			/* mismatch, need to try again */
+			must_retry = 1;
+		spin_unlock_irq(&conf->device_lock);
+		if (must_retry) {
+			raid5_release_stripe(sh);
+			return STRIPE_SCHEDULE_AND_RETRY;
+		}
+	}
+
+	if (read_seqcount_retry(&conf->gen_lock, seq)) {
+		/* Might have got the wrong stripe_head by accident */
+		raid5_release_stripe(sh);
+		return STRIPE_RETRY;
+	}
+
+	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
+	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+		/*
+		 * Stripe is busy expanding or add failed due to
+		 * overlap. Flush everything and wait a while.
+		 */
+		md_wakeup_thread(mddev->thread);
+		raid5_release_stripe(sh);
+		return STRIPE_SCHEDULE_AND_RETRY;
+	}
+
+	if (stripe_can_batch(sh))
+		stripe_add_to_batch_list(conf, sh);
+
+	if (ctx->do_flush) {
+		set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
+		/* we only need flush for one stripe */
+		ctx->do_flush = false;
+	}
+
+	set_bit(STRIPE_HANDLE, &sh->state);
+	clear_bit(STRIPE_DELAYED, &sh->state);
+	if ((!sh->batch_head || sh == sh->batch_head) &&
+	    (bi->bi_opf & REQ_SYNC) &&
+	    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+		atomic_inc(&conf->preread_active_stripes);
+
+	release_stripe_plug(mddev, sh);
+	return STRIPE_SUCCESS;
+}
+
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
-	int dd_idx;
-	sector_t new_sector;
 	sector_t logical_sector, last_sector;
-	struct stripe_head *sh;
+	struct stripe_request_ctx ctx = {};
 	const int rw = bio_data_dir(bi);
+	enum stripe_result res;
 	DEFINE_WAIT(w);
 	bool do_prepare;
-	bool do_flush = false;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = log_handle_flush_request(conf, bi);
@@ -5817,7 +5931,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 		 * if r5l_handle_flush_request() didn't clear REQ_PREFLUSH,
 		 * we need to flush journal device
 		 */
-		do_flush = bi->bi_opf & REQ_PREFLUSH;
+		ctx.do_flush = bi->bi_opf & REQ_PREFLUSH;
 	}
 
 	if (!md_write_start(mddev, bi))
@@ -5857,117 +5971,28 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
-		int previous;
 		int seq;
 
 		do_prepare = false;
 	retry:
 		seq = read_seqcount_begin(&conf->gen_lock);
-		previous = 0;
 		if (do_prepare)
 			prepare_to_wait(&conf->wait_for_overlap, &w,
 				TASK_UNINTERRUPTIBLE);
-		if (unlikely(conf->reshape_progress != MaxSector)) {
-			/* spinlock is needed as reshape_progress may be
-			 * 64bit on a 32bit platform, and so it might be
-			 * possible to see a half-updated value
-			 * Of course reshape_progress could change after
-			 * the lock is dropped, so once we get a reference
-			 * to the stripe that we think it is, we will have
-			 * to check again.
-			 */
-			spin_lock_irq(&conf->device_lock);
-			if (ahead_of_reshape(mddev, logical_sector,
-					     conf->reshape_progress)) {
-				previous = 1;
-			} else {
-				if (ahead_of_reshape(mddev, logical_sector,
-						     conf->reshape_safe)) {
-					spin_unlock_irq(&conf->device_lock);
-					schedule();
-					do_prepare = true;
-					goto retry;
-				}
-			}
-			spin_unlock_irq(&conf->device_lock);
-		}
-
-		new_sector = raid5_compute_sector(conf, logical_sector,
-						  previous,
-						  &dd_idx, NULL);
-		pr_debug("raid456: raid5_make_request, sector %llu logical %llu\n",
-			(unsigned long long)new_sector,
-			(unsigned long long)logical_sector);
 
-		sh = raid5_get_active_stripe(conf, new_sector, previous,
-				       (bi->bi_opf & REQ_RAHEAD), 0);
-		if (unlikely(!sh)) {
-			/* cannot get stripe, just give-up */
-			bi->bi_status = BLK_STS_IOERR;
+		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
+					  bi, seq);
+		if (res == STRIPE_FAIL) {
 			break;
-		}
-
-		if (unlikely(previous)) {
-			/* expansion might have moved on while waiting for a
-			 * stripe, so we must do the range check again.
-			 * Expansion could still move past after this
-			 * test, but as we are holding a reference to
-			 * 'sh', we know that if that happens,
-			 *  STRIPE_EXPANDING will get set and the expansion
-			 * won't proceed until we finish with the stripe.
-			 */
-			int must_retry = 0;
-			spin_lock_irq(&conf->device_lock);
-			if (!ahead_of_reshape(mddev, logical_sector,
-					      conf->reshape_progress))
-				/* mismatch, need to try again */
-				must_retry = 1;
-			spin_unlock_irq(&conf->device_lock);
-			if (must_retry) {
-				raid5_release_stripe(sh);
-				schedule();
-				do_prepare = true;
-				goto retry;
-			}
-		}
-
-		if (read_seqcount_retry(&conf->gen_lock, seq)) {
-			/* Might have got the wrong stripe_head by accident */
-			raid5_release_stripe(sh);
+		} else if (res == STRIPE_RETRY) {
 			goto retry;
-		}
-
-		if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-		    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
-			/*
-			 * Stripe is busy expanding or add failed due to
-			 * overlap. Flush everything and wait a while.
-			 */
-			md_wakeup_thread(mddev->thread);
-			raid5_release_stripe(sh);
+		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
 			schedule();
 			do_prepare = true;
 			goto retry;
 		}
-
-		if (stripe_can_batch(sh))
-			stripe_add_to_batch_list(conf, sh);
-
-		if (do_flush) {
-			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
-			/* we only need flush for one stripe */
-			do_flush = false;
-		}
-
-		set_bit(STRIPE_HANDLE, &sh->state);
-		clear_bit(STRIPE_DELAYED, &sh->state);
-		if ((!sh->batch_head || sh == sh->batch_head) &&
-		    (bi->bi_opf & REQ_SYNC) &&
-		    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-			atomic_inc(&conf->preread_active_stripes);
-
-		release_stripe_plug(mddev, sh);
 	}
+
 	finish_wait(&conf->wait_for_overlap, &w);
 
 	if (rw == WRITE)
-- 
2.30.2



* [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request()
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

prepare_to_wait() can reasonably be called after schedule() instead of
setting a flag and preparing in the next loop iteration.

This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that
the order matters here. On the first iteration of the loop
prepare_to_wait() is already called first.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5a7334ba0997..b9f618356446 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5915,7 +5915,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	const int rw = bio_data_dir(bi);
 	enum stripe_result res;
 	DEFINE_WAIT(w);
-	bool do_prepare;
 
 	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
 		int ret = log_handle_flush_request(conf, bi);
@@ -5973,12 +5972,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
 		int seq;
 
-		do_prepare = false;
 	retry:
 		seq = read_seqcount_begin(&conf->gen_lock);
-		if (do_prepare)
-			prepare_to_wait(&conf->wait_for_overlap, &w,
-				TASK_UNINTERRUPTIBLE);
 
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
 					  bi, seq);
@@ -5988,7 +5983,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 			goto retry;
 		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
 			schedule();
-			do_prepare = true;
+			prepare_to_wait(&conf->wait_for_overlap, &w,
+					TASK_UNINTERRUPTIBLE);
 			goto retry;
 		}
 	}
-- 
2.30.2



* [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request()
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
into make_stripe_request().

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index b9f618356446..1bce9075e165 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5804,13 +5804,15 @@ struct stripe_request_ctx {
 
 static enum stripe_result make_stripe_request(struct mddev *mddev,
 		struct r5conf *conf, struct stripe_request_ctx *ctx,
-		sector_t logical_sector, struct bio *bi, int seq)
+		sector_t logical_sector, struct bio *bi)
 {
 	const int rw = bio_data_dir(bi);
 	struct stripe_head *sh;
 	sector_t new_sector;
 	int previous = 0;
-	int dd_idx;
+	int seq, dd_idx;
+
+	seq = read_seqcount_begin(&conf->gen_lock);
 
 	if (unlikely(conf->reshape_progress != MaxSector)) {
 		/* spinlock is needed as reshape_progress may be
@@ -5970,13 +5972,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
 	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
-		int seq;
-
 	retry:
-		seq = read_seqcount_begin(&conf->gen_lock);
-
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
-					  bi, seq);
+					  bi);
 		if (res == STRIPE_FAIL) {
 			break;
 		} else if (res == STRIPE_RETRY) {
-- 
2.30.2



* [PATCH v2 08/12] md/raid5: Refactor for loop in raid5_make_request() into while loop
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

The for loop with retry label can be more cleanly expressed as a while
loop by moving the logical_sector increment into the success path.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1bce9075e165..0c250cc3bfff 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5971,20 +5971,21 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
-	retry:
+	while (logical_sector < last_sector) {
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
 					  bi);
 		if (res == STRIPE_FAIL) {
 			break;
 		} else if (res == STRIPE_RETRY) {
-			goto retry;
+			continue;
 		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
 			schedule();
 			prepare_to_wait(&conf->wait_for_overlap, &w,
 					TASK_UNINTERRUPTIBLE);
-			goto retry;
+			continue;
 		}
+
+		logical_sector += RAID5_STRIPE_SECTORS(conf);
 	}
 
 	finish_wait(&conf->wait_for_overlap, &w);
-- 
2.30.2



* [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash_lock, which is
highly contended during IO.

Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.

The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 51 +++++++++++++++++++++++++++++++++++-----------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0c250cc3bfff..28ea7b9b6ab6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -843,7 +843,8 @@ static bool stripe_can_batch(struct stripe_head *sh)
 }
 
 /* we only do back search */
-static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
+static void stripe_add_to_batch_list(struct r5conf *conf,
+		struct stripe_head *sh, struct stripe_head *last_sh)
 {
 	struct stripe_head *head;
 	sector_t head_sector, tmp_sec;
@@ -856,15 +857,20 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
 		return;
 	head_sector = sh->sector - RAID5_STRIPE_SECTORS(conf);
 
-	hash = stripe_hash_locks_hash(conf, head_sector);
-	spin_lock_irq(conf->hash_locks + hash);
-	head = find_get_stripe(conf, head_sector, conf->generation, hash);
-	spin_unlock_irq(conf->hash_locks + hash);
-
-	if (!head)
-		return;
-	if (!stripe_can_batch(head))
-		goto out;
+	if (last_sh && head_sector == last_sh->sector) {
+		head = last_sh;
+		atomic_inc(&head->count);
+	} else {
+		hash = stripe_hash_locks_hash(conf, head_sector);
+		spin_lock_irq(conf->hash_locks + hash);
+		head = find_get_stripe(conf, head_sector, conf->generation,
+				       hash);
+		spin_unlock_irq(conf->hash_locks + hash);
+		if (!head)
+			return;
+		if (!stripe_can_batch(head))
+			goto out;
+	}
 
 	lock_two_stripes(head, sh);
 	/* clear_batch_ready clear the flag */
@@ -5800,6 +5806,7 @@ enum stripe_result {
 
 struct stripe_request_ctx {
 	bool do_flush;
+	struct stripe_head *batch_last;
 };
 
 static enum stripe_result make_stripe_request(struct mddev *mddev,
@@ -5889,8 +5896,13 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 		return STRIPE_SCHEDULE_AND_RETRY;
 	}
 
-	if (stripe_can_batch(sh))
-		stripe_add_to_batch_list(conf, sh);
+	if (stripe_can_batch(sh)) {
+		stripe_add_to_batch_list(conf, sh, ctx->batch_last);
+		if (ctx->batch_last)
+			raid5_release_stripe(ctx->batch_last);
+		atomic_inc(&sh->count);
+		ctx->batch_last = sh;
+	}
 
 	if (ctx->do_flush) {
 		set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
@@ -5979,6 +5991,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 		} else if (res == STRIPE_RETRY) {
 			continue;
 		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
+			/*
+			 * Must release the reference to batch_last before
+			 * scheduling and waiting for work to be done,
+			 * otherwise the batch_last stripe head could prevent
+			 * raid5_activate_delayed() from making progress
+			 * and thus deadlocking.
+			 */
+			if (ctx.batch_last) {
+				raid5_release_stripe(ctx.batch_last);
+				ctx.batch_last = NULL;
+			}
+
 			schedule();
 			prepare_to_wait(&conf->wait_for_overlap, &w,
 					TASK_UNINTERRUPTIBLE);
@@ -5990,6 +6014,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 
 	finish_wait(&conf->wait_for_overlap, &w);
 
+	if (ctx.batch_last)
+		raid5_release_stripe(ctx.batch_last);
+
 	if (rw == WRITE)
 		md_write_end(mddev);
 	bio_endio(bi);
-- 
2.30.2



* [PATCH v2 10/12] md/raid5: Refactor add_stripe_bio()
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.

This will be useful in the next patch so that overlap can be checked for
multiple disks before adding any of the bios.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 86 ++++++++++++++++++++++++++++++----------------
 1 file changed, 56 insertions(+), 30 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 28ea7b9b6ab6..1fa82d8fa89e 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3418,39 +3418,32 @@ schedule_reconstruction(struct stripe_head *sh, struct stripe_head_state *s,
 		s->locked, s->ops_request);
 }
 
-/*
- * Each stripe/dev can have one or more bion attached.
- * toread/towrite point to the first in a chain.
- * The bi_next chain must be in order.
- */
-static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
-			  int forwrite, int previous)
+static bool stripe_bio_overlaps(struct stripe_head *sh, struct bio *bi,
+				int dd_idx, int forwrite)
 {
-	struct bio **bip;
 	struct r5conf *conf = sh->raid_conf;
-	int firstwrite=0;
+	struct bio **bip;
 
-	pr_debug("adding bi b#%llu to stripe s#%llu\n",
-		(unsigned long long)bi->bi_iter.bi_sector,
-		(unsigned long long)sh->sector);
+	pr_debug("checking bi b#%llu to stripe s#%llu\n",
+		 bi->bi_iter.bi_sector, sh->sector);
 
-	spin_lock_irq(&sh->stripe_lock);
 	/* Don't allow new IO added to stripes in batch list */
 	if (sh->batch_head)
-		goto overlap;
-	if (forwrite) {
+		return true;
+
+	if (forwrite)
 		bip = &sh->dev[dd_idx].towrite;
-		if (*bip == NULL)
-			firstwrite = 1;
-	} else
+	else
 		bip = &sh->dev[dd_idx].toread;
+
 	while (*bip && (*bip)->bi_iter.bi_sector < bi->bi_iter.bi_sector) {
 		if (bio_end_sector(*bip) > bi->bi_iter.bi_sector)
-			goto overlap;
-		bip = & (*bip)->bi_next;
+			return true;
+		bip = &(*bip)->bi_next;
 	}
+
 	if (*bip && (*bip)->bi_iter.bi_sector < bio_end_sector(bi))
-		goto overlap;
+		return true;
 
 	if (forwrite && raid5_has_ppl(conf)) {
 		/*
@@ -3479,9 +3472,30 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		}
 
 		if (first + conf->chunk_sectors * (count - 1) != last)
-			goto overlap;
+			return true;
 	}
 
+	return false;
+}
+
+static void __add_stripe_bio(struct stripe_head *sh, struct bio *bi,
+			     int dd_idx, int forwrite, int previous)
+{
+	struct r5conf *conf = sh->raid_conf;
+	struct bio **bip;
+	int firstwrite = 0;
+
+	if (forwrite) {
+		bip = &sh->dev[dd_idx].towrite;
+		if (!*bip)
+			firstwrite = 1;
+	} else {
+		bip = &sh->dev[dd_idx].toread;
+	}
+
+	while (*bip && (*bip)->bi_iter.bi_sector < bi->bi_iter.bi_sector)
+		bip = &(*bip)->bi_next;
+
 	if (!forwrite || previous)
 		clear_bit(STRIPE_BATCH_READY, &sh->state);
 
@@ -3508,8 +3522,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 	}
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
-		(unsigned long long)(*bip)->bi_iter.bi_sector,
-		(unsigned long long)sh->sector, dd_idx);
+		 (*bip)->bi_iter.bi_sector, sh->sector, dd_idx);
 
 	if (conf->mddev->bitmap && firstwrite) {
 		/* Cannot hold spinlock over bitmap_startwrite,
@@ -3517,7 +3530,7 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 		 * we have added to the bitmap and set bm_seq.
 		 * So set STRIPE_BITMAP_PENDING to prevent
 		 * batching.
-		 * If multiple add_stripe_bio() calls race here they
+		 * If multiple __add_stripe_bio() calls race here they
 		 * much all set STRIPE_BITMAP_PENDING.  So only the first one
 		 * to complete "bitmap_startwrite" gets to set
 		 * STRIPE_BIT_DELAY.  This is important as once a stripe
@@ -3535,14 +3548,27 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
 			set_bit(STRIPE_BIT_DELAY, &sh->state);
 		}
 	}
-	spin_unlock_irq(&sh->stripe_lock);
+}
 
-	return 1;
+/*
+ * Each stripe/dev can have one or more bios attached.
+ * toread/towrite point to the first in a chain.
+ * The bi_next chain must be in order.
+ */
+static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
+			   int dd_idx, int forwrite, int previous)
+{
+	spin_lock_irq(&sh->stripe_lock);
 
- overlap:
-	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+		set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+		spin_unlock_irq(&sh->stripe_lock);
+		return false;
+	}
+
+	__add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
 	spin_unlock_irq(&sh->stripe_lock);
-	return 0;
+	return true;
 }
 
 static void end_reshape(struct r5conf *conf);
-- 
2.30.2



* [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

When testing whether a previous stripe has had the reshape expand past it,
use the earliest or latest logical sector of all the disks for that stripe
head. This will allow adding multiple disks at a time in a subsequent
patch.

To make this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 55 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 1fa82d8fa89e..40a25c4b80bd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5823,6 +5823,42 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
 		return sector >= reshape_sector;
 }
 
+static bool range_ahead_of_reshape(struct mddev *mddev, sector_t min,
+				   sector_t max, sector_t reshape_sector)
+{
+	if (mddev->reshape_backwards)
+		return max < reshape_sector;
+	else
+		return min >= reshape_sector;
+}
+
+static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf,
+				    struct stripe_head *sh)
+{
+	sector_t max_sector = 0, min_sector = MaxSector;
+	bool ret = false;
+	int dd_idx;
+
+	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+		if (dd_idx == sh->pd_idx)
+			continue;
+
+		min_sector = min(min_sector, sh->dev[dd_idx].sector);
+		max_sector = max(max_sector, sh->dev[dd_idx].sector);
+	}
+
+	spin_lock_irq(&conf->device_lock);
+
+	if (!range_ahead_of_reshape(mddev, min_sector, max_sector,
+				     conf->reshape_progress))
+		/* mismatch, need to try again */
+		ret = true;
+
+	spin_unlock_irq(&conf->device_lock);
+
+	return ret;
+}
+
 enum stripe_result {
 	STRIPE_SUCCESS = 0,
 	STRIPE_RETRY,
@@ -5883,26 +5919,17 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 		return STRIPE_FAIL;
 	}
 
-	if (unlikely(previous)) {
-		/* expansion might have moved on while waiting for a
-		 * stripe, so we must do the range check again.
+	if (unlikely(previous) &&
+	    stripe_ahead_of_reshape(mddev, conf, sh)) {
+		/* Expansion moved on while waiting for a stripe.
 		 * Expansion could still move past after this
 		 * test, but as we are holding a reference to
 		 * 'sh', we know that if that happens,
 		 *  STRIPE_EXPANDING will get set and the expansion
 		 * won't proceed until we finish with the stripe.
 		 */
-		int must_retry = 0;
-		spin_lock_irq(&conf->device_lock);
-		if (!ahead_of_reshape(mddev, logical_sector,
-				      conf->reshape_progress))
-			/* mismatch, need to try again */
-			must_retry = 1;
-		spin_unlock_irq(&conf->device_lock);
-		if (must_retry) {
-			raid5_release_stripe(sh);
-			return STRIPE_SCHEDULE_AND_RETRY;
-		}
+		raid5_release_stripe(sh);
+		return STRIPE_SCHEDULE_AND_RETRY;
 	}
 
 	if (read_seqcount_retry(&conf->gen_lock, seq)) {
-- 
2.30.2



* [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()
From: Logan Gunthorpe @ 2022-04-20 19:54 UTC
  To: linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Guoqing Jiang, Stephen Bates, Martin Oliveira,
	David Sloan, Logan Gunthorpe

raid5_make_request() loops through every page in the request, finds the
appropriate stripe and adds the bio for that page to the corresponding
disk in that stripe.

This causes a great deal of contention on the hash_lock because the
lock for that hash must be taken for every single page.

The number of times the lock is taken can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the lock must be taken by
a factor equal to the number of data disks.

To accomplish this, store the minimum and maximum disk sector that
has already been finished and continue to the next logical sector if
it is found that the disk sector has already been done. Then add an
add_all_stripe_bios() helper to check all the bios for overlap and add
them all if none of them overlap.
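
As a hypothetical walkthrough (invented geometry: two data disks,
32-sector chunks, 8-sector stripes), a request starting in the middle of
a chunk might visit per-stripe disk sectors in the order:

	16, 24, 0, 8, 16, 24

The first iteration records start_disk_sector = 16 and disk_sector_done
advances to 24. When the loop wraps to sector 0 (below the starting
sector), first_wrap is set so the stripes missed at the start of the
chunk are still processed. Once sector 16 is seen a second time, every
stripe_head has been visited and the remaining iterations return early.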

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
---
 drivers/md/raid5.c | 92 +++++++++++++++++++++++++++++++++++++++++++---
 drivers/md/raid5.h |  1 +
 2 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 40a25c4b80bd..f86866cb15be 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3571,6 +3571,48 @@ static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
 	return true;
 }
 
+static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
+		sector_t first_logical_sector, sector_t last_sector,
+		int forwrite, int previous)
+{
+	int dd_idx;
+	int ret = 1;
+
+	spin_lock_irq(&sh->stripe_lock);
+
+	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
+		struct r5dev *dev = &sh->dev[dd_idx];
+
+		clear_bit(R5_BioReady, &dev->flags);
+
+		if (dd_idx == sh->pd_idx)
+			continue;
+
+		if (dev->sector < first_logical_sector ||
+		    dev->sector >= last_sector)
+			continue;
+
+		if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
+			set_bit(R5_Overlap, &dev->flags);
+			ret = 0;
+			continue;
+		}
+
+		set_bit(R5_BioReady, &dev->flags);
+	}
+
+	if (!ret)
+		goto out;
+
+	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++)
+		if (test_bit(R5_BioReady, &sh->dev[dd_idx].flags))
+			__add_stripe_bio(sh, bi, dd_idx, forwrite, previous);
+
+out:
+	spin_unlock_irq(&sh->stripe_lock);
+	return ret;
+}
+
 static void end_reshape(struct r5conf *conf);
 
 static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
@@ -5869,6 +5911,10 @@ enum stripe_result {
 struct stripe_request_ctx {
 	bool do_flush;
 	struct stripe_head *batch_last;
+	sector_t disk_sector_done;
+	sector_t start_disk_sector;
+	bool first_wrap;
+	sector_t last_sector;
 };
 
 static enum stripe_result make_stripe_request(struct mddev *mddev,
@@ -5908,6 +5954,36 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 
 	new_sector = raid5_compute_sector(conf, logical_sector, previous,
 					  &dd_idx, NULL);
+
+	/*
+	 * This is a tricky algorithm to track which stripe_heads have
+	 * already been visited and exit early if the stripe_head has
+	 * already been done (seeing all disks are added to a stripe_head
+	 * at once in add_all_stripe_bios()).
+	 *
+	 * To start with, the disk sector of the last stripe that has been
+	 * completed is stored in ctx->disk_sector_done. If the new_sector is
+	 * less than this value, the stripe_head has already been done.
+	 *
+	 * There's one issue with this: if the request starts in the middle of
+	 * a chunk, all the stripe heads before the starting offset will be
+	 * missed. To account for this, set the first_wrap boolean to true
+	 * if new_sector is less than the starting sector. Clear the
+	 * boolean once the start sector is hit for the second time.
+	 * When first_wrap is set, ignore disk_sector_done.
+	 */
+	if (ctx->start_disk_sector == MaxSector) {
+		ctx->start_disk_sector = new_sector;
+	} else if (new_sector < ctx->start_disk_sector) {
+		ctx->first_wrap = true;
+	} else if (new_sector == ctx->start_disk_sector) {
+		ctx->first_wrap = false;
+		ctx->start_disk_sector = 0;
+		return STRIPE_SUCCESS;
+	} else if (!ctx->first_wrap && new_sector <= ctx->disk_sector_done) {
+		return STRIPE_SUCCESS;
+	}
+
 	pr_debug("raid456: %s, sector %llu logical %llu\n", __func__,
 		 new_sector, logical_sector);
 
@@ -5939,7 +6015,8 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 	}
 
 	if (test_bit(STRIPE_EXPANDING, &sh->state) ||
-	    !add_stripe_bio(sh, bi, dd_idx, rw, previous)) {
+	    !add_all_stripe_bios(sh, bi, logical_sector, ctx->last_sector, rw,
+				 previous)) {
 		/*
 		 * Stripe is busy expanding or add failed due to
 		 * overlap. Flush everything and wait a while.
@@ -5949,6 +6026,9 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 		return STRIPE_SCHEDULE_AND_RETRY;
 	}
 
+	if (new_sector > ctx->disk_sector_done)
+		ctx->disk_sector_done = new_sector;
+
 	if (stripe_can_batch(sh)) {
 		stripe_add_to_batch_list(conf, sh, ctx->batch_last);
 		if (ctx->batch_last)
@@ -5977,8 +6057,10 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
 static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
-	sector_t logical_sector, last_sector;
-	struct stripe_request_ctx ctx = {};
+	sector_t logical_sector;
+	struct stripe_request_ctx ctx = {
+		.start_disk_sector = MaxSector,
+	};
 	const int rw = bio_data_dir(bi);
 	enum stripe_result res;
 	DEFINE_WAIT(w);
@@ -6021,7 +6103,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 
 	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
-	last_sector = bio_end_sector(bi);
+	ctx.last_sector = bio_end_sector(bi);
 	bi->bi_next = NULL;
 
 	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
@@ -6036,7 +6118,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
 	}
 	md_account_bio(mddev, &bi);
 	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
-	while (logical_sector < last_sector) {
+	while (logical_sector < ctx.last_sector) {
 		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
 					  bi);
 		if (res == STRIPE_FAIL) {
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 638d29863503..e73b58844f83 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -308,6 +308,7 @@ enum r5dev_flags {
 	R5_Wantwrite,
 	R5_Overlap,	/* There is a pending overlapping request
 			 * on this block */
+	R5_BioReady,    /* The current bio can be added to this disk */
 	R5_ReadNoMerge, /* prevent bio from merging in block-layer */
 	R5_ReadError,	/* seen a read error here recently */
 	R5_ReWrite,	/* have tried to over-write the readerror */
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-20 19:54 ` [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function Logan Gunthorpe
@ 2022-04-21  6:07   ` Christoph Hellwig
  2022-04-21  9:17   ` Paul Menzel
  2022-04-27  1:28   ` Guoqing Jiang
  2 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:07 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig

On Wed, Apr 20, 2022 at 01:54:14PM -0600, Logan Gunthorpe wrote:
> +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
> +			     sector_t reshape_sector)
> +{
> +	if (mddev->reshape_backwards)
> +		return sector < reshape_sector;
> +	else
> +		return sector >= reshape_sector;
> +}

No real need for the else here; for example (an illustrative sketch):
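
	static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
				     sector_t reshape_sector)
	{
		if (mddev->reshape_backwards)
			return sector < reshape_sector;
		return sector >= reshape_sector;
	}

Otherwise looks good: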

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop
  2022-04-20 19:54 ` [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop Logan Gunthorpe
@ 2022-04-21  6:08   ` Christoph Hellwig
  2022-04-27  1:32   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:08 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:15PM -0600, Logan Gunthorpe wrote:
> Break immediately if raid5_get_active_stripe() returns NULL and deindent
> the rest of the loop. Annotate this check with an unlikely().
> 
> This makes the code easier to read and reduces the indentation level.
> 
> No functional changes intended.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe()
  2022-04-20 19:54 ` [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe() Logan Gunthorpe
@ 2022-04-21  6:10   ` Christoph Hellwig
  2022-04-27  1:33   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:10 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:17PM -0600, Logan Gunthorpe wrote:
> Both uses of find_stripe() require a fairly complicated dance to
> increment the reference count. Move this into a common find_get_stripe()
> helper.
> 
> No functional changes intended.

The subject is wrong now.

>  static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
> -					 short generation)
> +					 short generation, int hash)
>  {
>  	struct stripe_head *sh;
>  
> @@ -624,6 +624,49 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
>  	return NULL;
>  }

And the new hash argument here is not needed.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 05/12] md/raid5: Factor out helper from raid5_make_request() loop
  2022-04-20 19:54 ` [PATCH v2 05/12] md/raid5: Factor out helper from raid5_make_request() loop Logan Gunthorpe
@ 2022-04-21  6:14   ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:14 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

> +	if (unlikely(conf->reshape_progress != MaxSector)) {
> +		/* spinlock is needed as reshape_progress may be

Normal kernel style is to start with a

		/*

line and to capitalize the first word.  (This also applies to a few
other comments.)

> +		if (must_retry) {
> +			raid5_release_stripe(sh);
> +			return STRIPE_SCHEDULE_AND_RETRY;

The raid5_release_stripe cleanup is duplicated a few times.  I think
using goto based unwinding would be helpful in this function, roughly
like the sketch below (label name assumed):
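
	if (must_retry)
		goto out_release;

	...

	return STRIPE_SUCCESS;

out_release:
	raid5_release_stripe(sh);
	return STRIPE_SCHEDULE_AND_RETRY;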

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request()
  2022-04-20 19:54 ` [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request() Logan Gunthorpe
@ 2022-04-21  6:15   ` Christoph Hellwig
  2022-04-27  2:11   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:19PM -0600, Logan Gunthorpe wrote:
> prepare_to_wait() can be reasonably called after schedule instead of
> setting a flag and preparing in the next loop iteration.
> 
> This means that prepare_to_wait() will be called before
> read_seqcount_begin(), but there shouldn't be any reason that
> the order matters here. On the first iteration of the loop
> prepare_to_wait() is already called first.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request()
  2022-04-20 19:54 ` [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request() Logan Gunthorpe
@ 2022-04-21  6:15   ` Christoph Hellwig
  2022-04-27  2:13   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:15 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:20PM -0600, Logan Gunthorpe wrote:
> Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
> into make_stripe_request().
> 
> No functional changes intended.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 08/12] md/raid5: Refactor for loop in raid5_make_request() into while loop
  2022-04-20 19:54 ` [PATCH v2 08/12] md/raid5: Refactor for loop in raid5_make_request() into while loop Logan Gunthorpe
@ 2022-04-21  6:16   ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:16 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:21PM -0600, Logan Gunthorpe wrote:
>  	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
> -	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
> -	retry:
> +	while (logical_sector < last_sector) {
>  		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
>  					  bi);
>  		if (res == STRIPE_FAIL) {
>  			break;
>  		} else if (res == STRIPE_RETRY) {
> -			goto retry;
> +			continue;
>  		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
>  			schedule();
>  			prepare_to_wait(&conf->wait_for_overlap, &w,
>  					TASK_UNINTERRUPTIBLE);
> -			goto retry;
> +			continue;
>  		}

All the else statements here aren't needed (this is really a
comment for the earlier patch adding them).

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch
  2022-04-20 19:54 ` [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch Logan Gunthorpe
@ 2022-04-21  6:17   ` Christoph Hellwig
  2022-04-27  1:36   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:22PM -0600, Logan Gunthorpe wrote:
> When batching, every stripe head has to find the previous stripe head to
> add to the batch list. This involves taking the hash lock which is
> highly contended during IO.
> 
> Instead of finding the previous stripe_head each time, store a
> reference to the previous stripe_head in a pointer so that it doesn't
> require taking the contended lock another time.
> 
> The reference to the previous stripe must be released before scheduling
> and waiting for work to get done. Otherwise, it can hold up
> raid5_activate_delayed() and deadlock.

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 10/12] md/raid5: Refactor add_stripe_bio()
  2022-04-20 19:54 ` [PATCH v2 10/12] md/raid5: Refactor add_stripe_bio() Logan Gunthorpe
@ 2022-04-21  6:18   ` Christoph Hellwig
  0 siblings, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:18 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress
  2022-04-20 19:54 ` [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress Logan Gunthorpe
@ 2022-04-21  6:18   ` Christoph Hellwig
  2022-04-27  1:53   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:18 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:24PM -0600, Logan Gunthorpe wrote:
> +static bool range_ahead_of_reshape(struct mddev *mddev, sector_t min,
> +				   sector_t max, sector_t reshape_sector)
> +{
> +	if (mddev->reshape_backwards)
> +		return max < reshape_sector;
> +	else
> +		return min >= reshape_sector;
> +}

Nit: no need for the return.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()
  2022-04-20 19:54 ` [PATCH v2 12/12] md/raid5: Pivot raid5_make_request() Logan Gunthorpe
@ 2022-04-21  6:43   ` Christoph Hellwig
  2022-04-21 15:54     ` Logan Gunthorpe
  2022-04-27  2:06   ` Guoqing Jiang
  1 sibling, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2022-04-21  6:43 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 01:54:25PM -0600, Logan Gunthorpe wrote:
>  struct stripe_request_ctx {
>  	bool do_flush;
>  	struct stripe_head *batch_last;
> +	sector_t disk_sector_done;
> +	sector_t start_disk_sector;

Very nitpicky, but why use two different naming styles for the sectors
here?

> +	bool first_wrap;
> +	sector_t last_sector;

And especially with the last_sector here, a few comments explaining
what each of the sector values means would be useful.

I'd also keep the two bool variables together for a better structure
layout, e.g. something like the sketch below (the comments are my
reading of the semantics):
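
	struct stripe_request_ctx {
		/* the request carries a preflush the first stripe handles */
		bool do_flush;
		/* wrapped around past the request's starting disk sector */
		bool first_wrap;
		/* last stripe_head in the batch, if any */
		struct stripe_head *batch_last;
		/* first disk sector visited by this request */
		sector_t start_disk_sector;
		/* highest disk sector that has already been handled */
		sector_t disk_sector_done;
		/* end (exclusive) of the bio's logical sectors */
		sector_t last_sector;
	};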

> +	 * if new_sector is less than the starting sector. Clear the
> +	 * boolean once the start sector is hit for the second time.
> +	 * When first_wrap is set, ignore the disk_sector_done.
> +	 */
> +	if (ctx->start_disk_sector == MaxSector) {
> +		ctx->start_disk_sector = new_sector;
> +	} else if (new_sector < ctx->start_disk_sector) {
> +		ctx->first_wrap = true;
> +	} else if (new_sector == ctx->start_disk_sector) {
> +		ctx->first_wrap = false;
> +		ctx->start_disk_sector = 0;
> +		return STRIPE_SUCCESS;
> +	} else if (!ctx->first_wrap && new_sector <= ctx->disk_sector_done) {
> +		return STRIPE_SUCCESS;
> +	}
> +

I find this a bit confusing to read.  While trying to mentally untangle
it I came up with this version instead, but it could really use some
good comments explaining each of the checks as I found your big comment
to not quite match the logic easily.

	if (ctx->start_disk_sector == MaxSector) {
		/*
		 * First loop iteration, start our state machine.
		 */
		ctx->start_disk_sector = new_sector;
	} else {
		/*
		 * We are done if we wrapped around to the same sector.
		 * (???)
		 */
		if (new_sector == ctx->start_disk_sector) {
			ctx->first_wrap = false;
			ctx->start_disk_sector = 0;
			return STRIPE_SUCCESS;
		}

		/*
		 * Sector before the start sector?  Keep going and wrap
		 * around.
		 */
		if (new_sector < ctx->start_disk_sector) {
			ctx->first_wrap = true;
		} else {
			// ???
			if (new_sector <= ctx->disk_sector_done &&
			   !ctx->first_wrap)
				return STRIPE_SUCCESS;
		}
	}

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-20 19:54 [PATCH v2 00/12] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (11 preceding siblings ...)
  2022-04-20 19:54 ` [PATCH v2 12/12] md/raid5: Pivot raid5_make_request() Logan Gunthorpe
@ 2022-04-21  8:45 ` Xiao Ni
  2022-04-21 16:02   ` Logan Gunthorpe
  2022-04-24  7:53 ` Guoqing Jiang
  2022-04-25 23:07 ` Song Liu
  14 siblings, 1 reply; 59+ messages in thread
From: Xiao Ni @ 2022-04-21  8:45 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan

On Thu, Apr 21, 2022 at 3:55 AM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of low
> hanging fruit ways to take the lock fewer times in the request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situtations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list()
> which should reduce the number of times the lock is taken by a factor of
> about 2.
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> Doing apples to apples testing this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14% and lock contentions
> on the hash_lock (as reported by lock stat) reduced by more than a factor
> of 5 (though it is still significantly contended).
>
> Testing on larger systems with NVMe drives saw similar small bandwidth
> increases from 3% to 20% depending on the parameters. Oddly small arrays
> had larger gains, likely due to them having lower starting bandwidths; I
> would have expected larger gains with larger arrays (seeing there
> should have been even fewer locks taken in raid5_make_request()).


Hi Logan

Could you share the commands to get the test result (lock contention
and performance)?

Regards
Xiao


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-20 19:54 ` [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function Logan Gunthorpe
  2022-04-21  6:07   ` Christoph Hellwig
@ 2022-04-21  9:17   ` Paul Menzel
  2022-04-21 16:05     ` Logan Gunthorpe
  2022-04-27  1:28   ` Guoqing Jiang
  2 siblings, 1 reply; 59+ messages in thread
From: Paul Menzel @ 2022-04-21  9:17 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig

Dear Logan,


Thank you for these patches.


Am 20.04.22 um 21:54 schrieb Logan Gunthorpe:
> There are a few uses of an ugly ternary operator in raid5_make_request()
> to check if a sector is a head of a reshape sector.
> 
> Factor this out into a simple helper called ahead_of_reshape().
> 
> This appears to fix the first bio_wouldblock_error() check which appears
> to have comparison operators that didn't match the check below which
> causes a schedule. Besides this, no functional changes intended.

If there is an error, could that be fixed in a separate commit, which 
could be applied to the stable series?

> Suggested-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 29 +++++++++++++++++------------
>   1 file changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7f7d1546b9ba..97b23c18402b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5787,6 +5787,15 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>   	bio_endio(bi);
>   }
>   
> +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
> +			     sector_t reshape_sector)
> +{
> +	if (mddev->reshape_backwards)
> +		return sector < reshape_sector;
> +	else
> +		return sector >= reshape_sector;

I like the ternary operator. ;-)

     return mddev->reshape_backwards ? (return sector < reshape_sector) 
: (sector >= reshape_sector)

Sorry, does not matter.

Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>



Kind regards,

Paul


> +}
> +
>   static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   {
>   	struct r5conf *conf = mddev->private;
> @@ -5843,9 +5852,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
>   	if ((bi->bi_opf & REQ_NOWAIT) &&
>   	    (conf->reshape_progress != MaxSector) &&
> -	    (mddev->reshape_backwards
> -	    ? (logical_sector > conf->reshape_progress && logical_sector <= conf->reshape_safe)
> -	    : (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
> +	    !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
> +	    ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {
>   		bio_wouldblock_error(bi);
>   		if (rw == WRITE)
>   			md_write_end(mddev);
> @@ -5874,14 +5882,12 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   			 * to check again.
>   			 */
>   			spin_lock_irq(&conf->device_lock);
> -			if (mddev->reshape_backwards
> -			    ? logical_sector < conf->reshape_progress
> -			    : logical_sector >= conf->reshape_progress) {
> +			if (ahead_of_reshape(mddev, logical_sector,
> +					     conf->reshape_progress)) {
>   				previous = 1;
>   			} else {
> -				if (mddev->reshape_backwards
> -				    ? logical_sector < conf->reshape_safe
> -				    : logical_sector >= conf->reshape_safe) {
> +				if (ahead_of_reshape(mddev, logical_sector,
> +						     conf->reshape_safe)) {
>   					spin_unlock_irq(&conf->device_lock);
>   					schedule();
>   					do_prepare = true;
> @@ -5912,9 +5918,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   				 */
>   				int must_retry = 0;
>   				spin_lock_irq(&conf->device_lock);
> -				if (mddev->reshape_backwards
> -				    ? logical_sector >= conf->reshape_progress
> -				    : logical_sector < conf->reshape_progress)
> +				if (!ahead_of_reshape(mddev, logical_sector,
> +						      conf->reshape_progress))
>   					/* mismatch, need to try again */
>   					must_retry = 1;
>   				spin_unlock_irq(&conf->device_lock);

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()
  2022-04-21  6:43   ` Christoph Hellwig
@ 2022-04-21 15:54     ` Logan Gunthorpe
  0 siblings, 0 replies; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-21 15:54 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-kernel, linux-raid, Song Liu, Guoqing Jiang, Stephen Bates,
	Martin Oliveira, David Sloan



On 2022-04-21 00:43, Christoph Hellwig wrote:
> On Wed, Apr 20, 2022 at 01:54:25PM -0600, Logan Gunthorpe wrote:
>>  struct stripe_request_ctx {
>>  	bool do_flush;
>>  	struct stripe_head *batch_last;
>> +	sector_t disk_sector_done;
>> +	sector_t start_disk_sector;
> 
> Very nitpicky, but why use two different naming styles for the sectors
> here?
> 
>> +	bool first_wrap;
>> +	sector_t last_sector;
> 
> And especially with the last_sector here a few comments explaining
> what each of the sector values mean might be useful.
> 
> I'd also keep the two bool variables together for a better structure
> layout.

Yeah, this logic has been very tricky to figure out, and thus quite
difficult to explain. It was a consistent source of bugs for me for a
while until I figured out this little state machine. I'll follow your
rough template, rework the comments, and try to make a large example
comment or something to explain this, and maybe factor it into a
helper function, roughly like the sketch below.
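
	/*
	 * Hypothetical shape of such a helper (illustrative only, not
	 * from the posted series): returns true if this disk sector was
	 * already handled on a previous iteration of the loop.
	 */
	static bool stripe_sector_done(struct stripe_request_ctx *ctx,
				       sector_t new_sector)
	{
		if (ctx->start_disk_sector == MaxSector) {
			/* first iteration: record where we started */
			ctx->start_disk_sector = new_sector;
			return false;
		}
		if (new_sector == ctx->start_disk_sector) {
			/* wrapped all the way around: everything is done */
			ctx->first_wrap = false;
			ctx->start_disk_sector = 0;
			return true;
		}
		if (new_sector < ctx->start_disk_sector) {
			/* sectors before the start get handled on the wrap */
			ctx->first_wrap = true;
			return false;
		}
		return !ctx->first_wrap && new_sector <= ctx->disk_sector_done;
	}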

>> +	 * if new_sector is less than the starting sector. Clear the
>> +	 * boolean once the start sector is hit for the second time.
>> +	 * When first_wrap is set, ignore the disk_sector_done.
>> +	 */
>> +	if (ctx->start_disk_sector == MaxSector) {
>> +		ctx->start_disk_sector = new_sector;
>> +	} else if (new_sector < ctx->start_disk_sector) {
>> +		ctx->first_wrap = true;
>> +	} else if (new_sector == ctx->start_disk_sector) {
>> +		ctx->first_wrap = false;
>> +		ctx->start_disk_sector = 0;
>> +		return STRIPE_SUCCESS;
>> +	} else if (!ctx->first_wrap && new_sector <= ctx->disk_sector_done) {
>> +		return STRIPE_SUCCESS;
>> +	}
>> +
> 
> I find this a bit confusing to read.  While trying to mentally untangle
> it I came up with this version instead, but it could really use some
> good comments explaining each of the checks as I found your big comment
> to not quite match the logic easily.
> 
> 	if (ctx->start_disk_sector == MaxSector) {
> 		/*
> 		 * First loop iteration, start our state machine.
> 		 * 
> 		ctx->start_disk_sector = new_sector;
> 	} else {
> 		/*
> 		 * We are done if we wrapped around to the same sector.
> 		 * (???)
> 		 */
> 		if (new_sector == ctx->start_disk_sector) {
> 			ctx->first_wrap = false;
> 			ctx->start_disk_sector = 0;
> 			return STRIPE_SUCCESS;
> 		}
> 
> 		/*
> 		 * Sector before the start sector?  Keep going and wrap
> 		 * around.
> 		 */
> 		if (new_sector < ctx->start_disk_sector) {
> 			ctx->first_wrap = true;
> 		} else {
> 			// ???

This is actually the most common and important branch: it says that if
we've already done this disk sector, stop and move on. So I'll probably
rework it some so that it is not so deeply nested.

> 			if (new_sector <= ctx->disk_sector_done &&
> 			   !ctx->first_wrap)
> 				return STRIPE_SUCCESS;
> 		}
> 	}

No problem cleaning up your other nits. I'll send a v3 in due course.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-21  8:45 ` [PATCH v2 00/12] Improve Raid5 Lock Contention Xiao Ni
@ 2022-04-21 16:02   ` Logan Gunthorpe
  2022-04-24  8:00     ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-21 16:02 UTC (permalink / raw)
  To: Xiao Ni
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-21 02:45, Xiao Ni wrote:
> Could you share the commands to get the test result (lock contention
> and performance)?

Sure. The performance we were focused on was large block writes. So we
set up raid5 instances with varying numbers of disks and ran the
following fio script directly on the md device.

[simple]
filename=/dev/md0
ioengine=libaio
rw=write
direct=1
size=8G
blocksize=2m
iodepth=16
runtime=30s
time_based=1
offset_increment=8G
numjobs=12

(We also played around with tuning this but didn't find substantial
changes once the bottleneck was hit)
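
(For reference, the raid5 instances can be created with something like
the following; the device names and chunk size are illustrative, not
the exact commands we used:)

mdadm --create /dev/md0 --level=5 --chunk=64 --raid-devices=5 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1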

We tuned md with parameters like:

echo 4 > /sys/block/md0/md/group_thread_cnt
echo 8192 > /sys/block/md0/md/stripe_cache_size

For lock contention stats, we just used lockstat[1]; roughly like:

echo 1 > /proc/sys/kernel/lock_stat
fio test.fio
echo 0 > /proc/sys/kernel/lock_stat
cat /proc/lock_stat

And compared the before and after.

Logan

[1] https://www.kernel.org/doc/html/latest/locking/lockstat.html

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-21  9:17   ` Paul Menzel
@ 2022-04-21 16:05     ` Logan Gunthorpe
  2022-04-21 23:33       ` Wol
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-21 16:05 UTC (permalink / raw)
  To: Paul Menzel
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig



On 2022-04-21 03:17, Paul Menzel wrote:
> Dear Logan,
> 
> 
> Thank you for these patches.
> 
> 
> Am 20.04.22 um 21:54 schrieb Logan Gunthorpe:
>> There are a few uses of an ugly ternary operator in raid5_make_request()
>> to check if a sector is a head of a reshape sector.
>>
>> Factor this out into a simple helper called ahead_of_reshape().
>>
>> This appears to fix the first bio_wouldblock_error() check which appears
>> to have comparison operators that didn't match the check below which
>> causes a schedule. Besides this, no functional changes intended.
> 
> If there is an error, could that be fixed in a separate commit, which
> could be applied to the stable series?

Yes, sure. Though, I'm not 100% sure there's an error or noticeable bug.
It's just that the logic didn't match and it made cleaning up the code
overly complicated.


> 
>> Suggested-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>   drivers/md/raid5.c | 29 +++++++++++++++++------------
>>   1 file changed, 17 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 7f7d1546b9ba..97b23c18402b 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5787,6 +5787,15 @@ static void make_discard_request(struct mddev
>> *mddev, struct bio *bi)
>>       bio_endio(bi);
>>   }
>>   +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
>> +                 sector_t reshape_sector)
>> +{
>> +    if (mddev->reshape_backwards)
>> +        return sector < reshape_sector;
>> +    else
>> +        return sector >= reshape_sector;
> 
> I like the ternary operator. ;-)
> 
>     return mddev->reshape_backwards ? (return sector < reshape_sector) :
> (sector >= reshape_sector)
> 
> Sorry, does not matter.

Yeah, I think plenty of people do not, though; it's harder to read with
the long line and awkward to wrap.

> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>

Thanks,

Logan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-21 16:05     ` Logan Gunthorpe
@ 2022-04-21 23:33       ` Wol
  0 siblings, 0 replies; 59+ messages in thread
From: Wol @ 2022-04-21 23:33 UTC (permalink / raw)
  To: Logan Gunthorpe, Paul Menzel
  Cc: linux-kernel, linux-raid, Song Liu, Christoph Hellwig,
	Guoqing Jiang, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig

On 21/04/2022 17:05, Logan Gunthorpe wrote:
>> I like the ternary operator.;-)
>>
>>      return mddev->reshape_backwards ? (return sector < reshape_sector) :
>> (sector >= reshape_sector)
>>
>> Sorry, does not matter.
> Yeah, I think plenty of people do not, though; it's harder to read with
> the long line and awkward to wrap.
> 
I like the ternary too, but is there a superfluous return in there? That 
would shorten the line. Or break it on the question mark - condition, 
true, false all on their own lines.

Cheers,
Wol

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-20 19:54 [PATCH v2 00/12] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (12 preceding siblings ...)
  2022-04-21  8:45 ` [PATCH v2 00/12] Improve Raid5 Lock Contention Xiao Ni
@ 2022-04-24  7:53 ` Guoqing Jiang
  2022-04-25 15:37   ` Logan Gunthorpe
  2022-04-25 23:07 ` Song Liu
  14 siblings, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-24  7:53 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of low
> hanging fruit ways to take the lock fewer times in the request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situtations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list()
> which should reduce the number of times the lock is taken by a factor of
> about 2.
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> Doing apples to apples testing this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14% and lock contentions
> on the hash_lock (as reported by lock stat) reduced by more than a factor
> of 5 (though it is still significantly contended).
>
> Testing on larger systems with NVMe drives saw similar small bandwidth
> increases from 3% to 20% depending on the parameters. Oddly small arrays
> had larger gains, likely due to them having lower starting bandwidths; I
> would have expected larger gains with larger arrays (seeing there
> should have been even fewer locks taken in raid5_make_request()).
>
> Logan
>
> [1] https://lkml.kernel.org/r/20220407164511.8472-1-logang@deltatee.com
> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>
> --
>
> Changes since v1:
>    - Rebased on current md-next branch (190a901246c69d79)
>    - Added patch to create a helper for checking if a sector
>      is ahead of the reshape (per Christoph)
>    - Reworked the __find_stripe() patch to create a find_get_stripe()
>      helper (per Christoph)
>    - Added more patches to further refactor raid5_make_request() and
>      pull most of the loop body into a helper function (per Christoph)
>    - A few other minor cleanups (boolean return, droping casting when
>      printing sectors, commit message grammar) as suggested by Christoph.
>    - Fixed two uncommon but bad data corruption bugs in that were found.
>
> --
>
> Logan Gunthorpe (12):
>    md/raid5: Factor out ahead_of_reshape() function
>    md/raid5: Refactor raid5_make_request loop
>    md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
>    md/raid5: Move common stripe count increment code into __find_stripe()
>    md/raid5: Factor out helper from raid5_make_request() loop
>    md/raid5: Drop the do_prepare flag in raid5_make_request()
>    md/raid5: Move read_seqcount_begin() into make_stripe_request()
>    md/raid5: Refactor for loop in raid5_make_request() into while loop
>    md/raid5: Keep a reference to last stripe_head for batch
>    md/raid5: Refactor add_stripe_bio()
>    md/raid5: Check all disks in a stripe_head for reshape progress
>    md/raid5: Pivot raid5_make_request()

Generally, I don't object to the cleanup patches since the code looks
cleaner. But my concern is that since some additional function calls
are added to the hot path (raid5_make_request), could the performance
be affected?

And I think patch 9 and patch 12 are the ones helpful for the
performance improvement; did you measure the performance without those
cleanup patches?

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-21 16:02   ` Logan Gunthorpe
@ 2022-04-24  8:00     ` Guoqing Jiang
  2022-04-25 15:39       ` Logan Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-24  8:00 UTC (permalink / raw)
  To: Logan Gunthorpe, Xiao Ni
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Stephen Bates, Martin Oliveira, David Sloan



On 4/22/22 12:02 AM, Logan Gunthorpe wrote:
>
> On 2022-04-21 02:45, Xiao Ni wrote:
>> Could you share the commands to get the test result (lock contention
>> and performance)?
> Sure. The performance we were focused on was large block writes. So we
> setup raid5 instances with varying number of disks and ran the following
> fio script directly on the drive.
>
> [simple]
> filename=/dev/md0
> ioengine=libaio
> rw=write
> direct=1
> size=8G
> blocksize=2m
> iodepth=16
> runtime=30s
> time_based=1
> offset_increment=8G
> numjobs=12
> 
> (We also played around with tuning this but didn't find substantial
> changes once the bottleneck was hit)

Nice, I suppose other IO patterns keep the same performance as before.

> We tuned md with parameters like:
>
> echo 4 > /sys/block/md0/md/group_thread_cnt
> echo 8192 > /sys/block/md0/md/stripe_cache_size
>
> For lock contention stats, we just used lockstat[1]; roughly like:
>
> echo 1 > /proc/sys/kernel/lock_stat
> fio test.fio
> echo 0 > /proc/sys/kernel/lock_stat
> cat /proc/lock_stat
>
> And compared the before and after.

Thanks for your effort. Besides the performance test, please try to run
the mdadm test suite to avoid regressions.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-24  7:53 ` Guoqing Jiang
@ 2022-04-25 15:37   ` Logan Gunthorpe
  0 siblings, 0 replies; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-25 15:37 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-24 01:53, Guoqing Jiang wrote:
> 
> 
> On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
>> Hi,
>>
>> This is v2 of this series which addresses Christoph's feedback and
>> fixes some bugs. The first posting is at [1]. A git branch is
>> available at [2].
>>
>> --
>>
>> I've been doing some work trying to improve the bulk write performance
>> of raid5 on large systems with fast NVMe drives. The bottleneck appears
>> largely to be lock contention on the hash_lock and device_lock. This
>> series improves the situation slightly by addressing a couple of low
>> hanging fruit ways to take the lock fewer times in the request path.
>>
>> Patch 9 adjusts how batching works by keeping a reference to the
>> previous stripe_head in raid5_make_request(). Under most situtations,
>> this removes the need to take the hash_lock in stripe_add_to_batch_list()
>> which should reduce the number of times the lock is taken by a factor of
>> about 2.
>>
>> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
>> code must find the stripe_head for every 4KB page in the request, so each
>> stripe head must be found once for every data disk. The patch changes this
>> so that all the data disks can be added to a stripe_head at once and the
>> number of times the stripe_head must be found (and thus the number of
>> times the hash_lock is taken) should be reduced by a factor roughly equal
>> to the number of data disks.
>>
>> The remaining patches are just cleanup and prep patches for those two
>> patches.
>>
>> Doing apples to apples testing this series on a small VM with 5 ram
>> disks, I saw a bandwidth increase of roughly 14% and lock contentions
>> on the hash_lock (as reported by lock stat) reduced by more than a factor
>> of 5 (though it is still significantly contended).
>>
>> Testing on larger systems with NVMe drives saw similar small bandwidth
>> increases from 3% to 20% depending on the parameters. Oddly small arrays
>> had larger gains, likely due to them having lower starting bandwidths; I
>> would have expected larger gains with larger arrays (seeing there
>> should have been even fewer locks taken in raid5_make_request()).
>>
>> Logan
>>
>> [1] https://lkml.kernel.org/r/20220407164511.8472-1-logang@deltatee.com
>> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>>
>> --
>>
>> Changes since v1:
>>    - Rebased on current md-next branch (190a901246c69d79)
>>    - Added patch to create a helper for checking if a sector
>>      is ahead of the reshape (per Christoph)
>>    - Reworked the __find_stripe() patch to create a find_get_stripe()
>>      helper (per Christoph)
>>    - Added more patches to further refactor raid5_make_request() and
>>      pull most of the loop body into a helper function (per Christoph)
>>    - A few other minor cleanups (boolean return, droping casting when
>>      printing sectors, commit message grammar) as suggested by Christoph.
>>    - Fixed two uncommon but bad data corruption bugs in that were found.
>>
>> --
>>
>> Logan Gunthorpe (12):
>>    md/raid5: Factor out ahead_of_reshape() function
>>    md/raid5: Refactor raid5_make_request loop
>>    md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
>>    md/raid5: Move common stripe count increment code into __find_stripe()
>>    md/raid5: Factor out helper from raid5_make_request() loop
>>    md/raid5: Drop the do_prepare flag in raid5_make_request()
>>    md/raid5: Move read_seqcount_begin() into make_stripe_request()
>>    md/raid5: Refactor for loop in raid5_make_request() into while loop
>>    md/raid5: Keep a reference to last stripe_head for batch
>>    md/raid5: Refactor add_stripe_bio()
>>    md/raid5: Check all disks in a stripe_head for reshape progress
>>    md/raid5: Pivot raid5_make_request()
> 
> Generally, I don't object the cleanup patches since the code looks more 
> cleaner.
> But my concern is that since some additional function calls are added to 
> hot path
> (raid5_make_request), could the performance be affected?

There's a bit of logic added to raid5_make_request() but it is all
local and should be fast, and it reduces the number of times the slow,
contended locks are taken.

> And I think patch 9 and patch 12 are helpful for performance 
> improvement,  did
> you measure the performance without those cleanup patches?

Yes, I compared performance with and without this entire series.

Logan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-24  8:00     ` Guoqing Jiang
@ 2022-04-25 15:39       ` Logan Gunthorpe
  2022-04-25 16:12         ` Xiao Ni
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-25 15:39 UTC (permalink / raw)
  To: Guoqing Jiang, Xiao Ni
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-24 02:00, Guoqing Jiang wrote:
> 
> 
> On 4/22/22 12:02 AM, Logan Gunthorpe wrote:
>>
>> On 2022-04-21 02:45, Xiao Ni wrote:
>>> Could you share the commands to get the test result (lock contention
>>> and performance)?
>> Sure. The performance we were focused on was large block writes. So we
>> setup raid5 instances with varying number of disks and ran the following
>> fio script directly on the drive.
>>
>> [simple]
>> filename=/dev/md0
>> ioengine=libaio
>> rw=write
>> direct=1
>> size=8G
>> blocksize=2m
>> iodepth=16
>> runtime=30s
>> time_based=1
>> offset_increment=8G
>> numjobs=12
>> 
>> (We also played around with tuning this but didn't find substantial
>> changes once the bottleneck was hit)
> 
> Nice, I suppose other IO patterns keep the same performance as before.
> 
>> We tuned md with parameters like:
>>
>> echo 4 > /sys/block/md0/md/group_thread_cnt
>> echo 8192 > /sys/block/md0/md/stripe_cache_size
>>
>> For lock contention stats, we just used lockstat[1]; roughly like:
>>
>> echo 1 > /proc/sys/kernel/lock_stat
>> fio test.fio
>> echo 0 > /proc/sys/kernel/lock_stat
>> cat /proc/lock_stat
>>
>> And compared the before and after.
> 
> Thanks for your effort, besides the performance test, please try to run
> mdadm test suites to avoid regression.

Yeah, is there any documentation for that? I tried to look into it but
couldn't figure out how it's run.

I do know that lkp-tests has run it on this series as I did get an error
from it. But while I'm pretty sure that error has been resolved, I was
never able to figure out how to run them locally.

Thanks,

Logan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-25 15:39       ` Logan Gunthorpe
@ 2022-04-25 16:12         ` Xiao Ni
  2022-04-28 21:22           ` Logan Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Xiao Ni @ 2022-04-25 16:12 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: Guoqing Jiang, open list, linux-raid, Song Liu,
	Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan

On Mon, Apr 25, 2022 at 11:39 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2022-04-24 02:00, Guoqing Jiang wrote:
> >
> >
> > On 4/22/22 12:02 AM, Logan Gunthorpe wrote:
> >>
> >> On 2022-04-21 02:45, Xiao Ni wrote:
> >>> Could you share the commands to get the test result (lock contention
> >>> and performance)?
> >> Sure. The performance we were focused on was large block writes. So we
> >> setup raid5 instances with varying number of disks and ran the following
> >> fio script directly on the drive.
> >>
> >> [simple]
> >> filename=/dev/md0
> >> ioengine=libaio
> >> rw=write
> >> direct=1
> >> size=8G
> >> blocksize=2m
> >> iodepth=16
> >> runtime=30s
> >> time_based=1
> >> offset_increment=8G
> >> numjobs=12
> >> 
> >> (We also played around with tuning this but didn't find substantial
> >> changes once the bottleneck was hit)
> >
> > Nice, I suppose other IO patterns keep the same performance as before.
> >
> >> We tuned md with parameters like:
> >>
> >> echo 4 > /sys/block/md0/md/group_thread_cnt
> >> echo 8192 > /sys/block/md0/md/stripe_cache_size
> >>
> >> For lock contention stats, we just used lockstat[1]; roughly like:
> >>
> >> echo 1 > /proc/sys/kernel/lock_stat
> >> fio test.fio
> >> echo 0 > /proc/sys/kernel/lock_stat
> >> cat /proc/lock_stat
> >>
> >> And compared the before and after.
> >
> > Thanks for your effort, besides the performance test, please try to run
> > mdadm test suites to avoid regression.
>
> Yeah, is there any documentation for that? I tried to look into it but
> couldn't figure out how it's run.
>
> I do know that lkp-tests has run it on this series as I did get an error
> from it. But while I'm pretty sure that error has been resolved, I was
> never able to figure out how to run them locally.
>

Hi Logan

You can clone the mdadm repo at
git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
Then you'll find a script named "test" in the top-level directory (it
is not under the tests directory). The test cases themselves are under
the tests directory.
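
Roughly, something like this (assuming root and a build environment;
exact flags may vary):

git clone git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
cd mdadm
make
sudo ./test        # runs the cases under the tests directory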

Regards
Xiao


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-20 19:54 [PATCH v2 00/12] Improve Raid5 Lock Contention Logan Gunthorpe
                   ` (13 preceding siblings ...)
  2022-04-24  7:53 ` Guoqing Jiang
@ 2022-04-25 23:07 ` Song Liu
  14 siblings, 0 replies; 59+ messages in thread
From: Song Liu @ 2022-04-25 23:07 UTC (permalink / raw)
  To: Logan Gunthorpe
  Cc: open list, linux-raid, Christoph Hellwig, Guoqing Jiang,
	Stephen Bates, Martin Oliveira, David Sloan

On Wed, Apr 20, 2022 at 12:55 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
> Hi,
>
> This is v2 of this series which addresses Christoph's feedback and
> fixes some bugs. The first posting is at [1]. A git branch is
> available at [2].
>
> --
>
> I've been doing some work trying to improve the bulk write performance
> of raid5 on large systems with fast NVMe drives. The bottleneck appears
> largely to be lock contention on the hash_lock and device_lock. This
> series improves the situation slightly by addressing a couple of low
> hanging fruit ways to take the lock fewer times in the request path.
>
> Patch 9 adjusts how batching works by keeping a reference to the
> previous stripe_head in raid5_make_request(). Under most situtations,
> this removes the need to take the hash_lock in stripe_add_to_batch_list()
> which should reduce the number of times the lock is taken by a factor of
> about 2.
>
> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
> code must find the stripe_head for every 4KB page in the request, so each
> stripe head must be found once for every data disk. The patch changes this
> so that all the data disks can be added to a stripe_head at once and the
> number of times the stripe_head must be found (and thus the number of
> times the hash_lock is taken) should be reduced by a factor roughly equal
> to the number of data disks.
>
> The remaining patches are just cleanup and prep patches for those two
> patches.
>
> Doing apples to apples testing this series on a small VM with 5 ram
> disks, I saw a bandwidth increase of roughly 14% and lock contentions
> on the hash_lock (as reported by lock stat) reduced by more than a factor
> of 5 (though it is still significantly contended).
>
> Testing on larger systems with NVMe drives saw similar small bandwidth
> increases from 3% to 20% depending on the parameters. Oddly small arrays
> had larger gains, likely due to them having lower starting bandwidths; I
> would have expected larger gains with larger arrays (seeing there
> should have been even fewer locks taken in raid5_make_request()).
>
> Logan
>
> [1] https://lkml.kernel.org/r/20220407164511.8472-1-logang@deltatee.com
> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>

The set looks good to me overall. Thanks everyone for the review and feedback.

Logan, please incorporate feedback and send v3.

Thanks,
Song

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-20 19:54 ` [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function Logan Gunthorpe
  2022-04-21  6:07   ` Christoph Hellwig
  2022-04-21  9:17   ` Paul Menzel
@ 2022-04-27  1:28   ` Guoqing Jiang
  2022-04-27 16:07     ` Logan Gunthorpe
  2 siblings, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  1:28 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> There are a few uses of an ugly ternary operator in raid5_make_request()
> to check if a sector is a head of a reshape sector.
>
> Factor this out into a simple helper called ahead_of_reshape().
>
> This appears to fix the first bio_wouldblock_error() check which appears
> to have comparison operators that didn't match the check below which
> causes a schedule. Besides this, no functional changes intended.
>
> Suggested-by: Christoph Hellwig<hch@lst.de>
> Signed-off-by: Logan Gunthorpe<logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 29 +++++++++++++++++------------
>   1 file changed, 17 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7f7d1546b9ba..97b23c18402b 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5787,6 +5787,15 @@ static void make_discard_request(struct mddev *mddev, struct bio *bi)
>   	bio_endio(bi);
>   }
>   
> +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
> +			     sector_t reshape_sector)
> +{
> +	if (mddev->reshape_backwards)
> +		return sector < reshape_sector;
> +	else
> +		return sector >= reshape_sector;
> +}

I think it can be an inline function.

> +
>   static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   {
>   	struct r5conf *conf = mddev->private;
> @@ -5843,9 +5852,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
>   	if ((bi->bi_opf & REQ_NOWAIT) &&
>   	    (conf->reshape_progress != MaxSector) &&
> -	    (mddev->reshape_backwards
> -	    ? (logical_sector > conf->reshape_progress && logical_sector <= conf->reshape_safe)
> -	    : (logical_sector >= conf->reshape_safe && logical_sector < conf->reshape_progress))) {
> +	    !ahead_of_reshape(mddev, logical_sector, conf->reshape_progress) &&
> +	    ahead_of_reshape(mddev, logical_sector, conf->reshape_safe)) {

TBH, the previous code is more readable to me though I can live with the 
change.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop
  2022-04-20 19:54 ` [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop Logan Gunthorpe
  2022-04-21  6:08   ` Christoph Hellwig
@ 2022-04-27  1:32   ` Guoqing Jiang
  2022-04-27 16:08     ` Logan Gunthorpe
  1 sibling, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  1:32 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Break immediately if raid5_get_active_stripe() returns NULL and deindent
> the rest of the loop. Annotate this check with an unlikely().
>
> This makes the code easier to read and reduces the indentation level.
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe<logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 109 +++++++++++++++++++++++----------------------
>   1 file changed, 55 insertions(+), 54 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 97b23c18402b..cda6857e6207 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5906,68 +5906,69 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)

...

> +		if (unlikely(!sh)) {
> +			/* cannot get stripe, just give-up */
> +			bi->bi_status = BLK_STS_IOERR;
> +			break;
> +		}


Nit: I would suggest keeping the original comment below:

> -			/* cannot get stripe for read-ahead, just give-up */
> -			bi->bi_status = BLK_STS_IOERR;
> -			break;

Anyway. Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>

Thanks,
Guoqing


* Re: [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
  2022-04-20 19:54 ` [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio() Logan Gunthorpe
@ 2022-04-27  1:33   ` Guoqing Jiang
  0 siblings, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  1:33 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> stripe_add_to_batch_list() is better done in the loop in make_request
> instead of inside add_stripe_bio(). This is clearer and allows for
> storing the batch_head state outside the loop in a subsequent patch.
>
> The call to add_stripe_bio() in retry_aligned_read() is for reads,
> and batching only applies to writes, so it's impossible for batching
> to happen at that call site.
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>   drivers/md/raid5.c | 5 +++--
>   1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index cda6857e6207..8e1ece5ce984 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3534,8 +3534,6 @@ static int add_stripe_bio(struct stripe_head *sh, struct bio *bi, int dd_idx,
>   	}
>   	spin_unlock_irq(&sh->stripe_lock);
>   
> -	if (stripe_can_batch(sh))
> -		stripe_add_to_batch_list(conf, sh);
>   	return 1;
>   
>    overlap:
> @@ -5955,6 +5953,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   			goto retry;
>   		}
>   
> +		if (stripe_can_batch(sh))
> +			stripe_add_to_batch_list(conf, sh);
> +
>   		if (do_flush) {
>   			set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
>   			/* we only need flush for one stripe */


Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>

Thanks,
Guoqing


* Re: [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe()
  2022-04-20 19:54 ` [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe() Logan Gunthorpe
  2022-04-21  6:10   ` Christoph Hellwig
@ 2022-04-27  1:33   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  1:33 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Both uses of __find_stripe() require a fairly complicated dance to
> increment the reference count. Move this into a common find_get_stripe()
> helper.
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 133 ++++++++++++++++++++++-----------------------
>   1 file changed, 65 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 8e1ece5ce984..a0946af5b1ac 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -612,7 +612,7 @@ static void init_stripe(struct stripe_head *sh, sector_t sector, int previous)
>   }

...

>   @@ -624,6 +624,49 @@ static struct stripe_head *__find_stripe(struct r5conf *conf, sector_t sector,
>   	return NULL;
>   }
>   
> +static struct stripe_head *find_get_stripe(struct r5conf *conf,
> +		sector_t sector, short generation, int hash)
> +{

The subject doesn't match the change; maybe something like "md/raid5:
Factor out common stripe count increment code into find_get_stripe()",
just FYI. Otherwise, it generally looks good.

Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>


Thanks,
Guoqing


* Re: [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch
  2022-04-20 19:54 ` [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch Logan Gunthorpe
  2022-04-21  6:17   ` Christoph Hellwig
@ 2022-04-27  1:36   ` Guoqing Jiang
  2022-04-27 23:27     ` Logan Gunthorpe
  1 sibling, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  1:36 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> When batching, every stripe head has to find the previous stripe head to
> add to the batch list. This involves taking the hash lock which is
> highly contended during IO.
>
> Instead of finding the previous stripe_head each time, store a
> reference to the previous stripe_head in a pointer so that it doesn't
> require taking the contended lock another time.
>
> The reference to the previous stripe must be released before scheduling
> and waiting for work to get done. Otherwise, it can hold up
> raid5_activate_delayed() and deadlock.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 51 +++++++++++++++++++++++++++++++++++-----------
>   1 file changed, 39 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 0c250cc3bfff..28ea7b9b6ab6 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -843,7 +843,8 @@ static bool stripe_can_batch(struct stripe_head *sh)
>   }
>   
>   /* we only do back search */
> -static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh)
> +static void stripe_add_to_batch_list(struct r5conf *conf,
> +		struct stripe_head *sh, struct stripe_head *last_sh)

Nit: from stripe_add_to_batch_list()'s view, I think "head_sh" makes more
sense than "last_sh".

>   {
>   	struct stripe_head *head;
>   	sector_t head_sector, tmp_sec;
> @@ -856,15 +857,20 @@ static void stripe_add_to_batch_list(struct r5conf *conf, struct stripe_head *sh
>   		return;
>   	head_sector = sh->sector - RAID5_STRIPE_SECTORS(conf);
>   
> -	hash = stripe_hash_locks_hash(conf, head_sector);
> -	spin_lock_irq(conf->hash_locks + hash);
> -	head = find_get_stripe(conf, head_sector, conf->generation, hash);
> -	spin_unlock_irq(conf->hash_locks + hash);
> -
> -	if (!head)
> -		return;
> -	if (!stripe_can_batch(head))
> -		goto out;
> +	if (last_sh && head_sector == last_sh->sector) {
> +		head = last_sh;
> +		atomic_inc(&head->count);
> +	} else {
> +		hash = stripe_hash_locks_hash(conf, head_sector);
> +		spin_lock_irq(conf->hash_locks + hash);
> +		head = find_get_stripe(conf, head_sector, conf->generation,
> +				       hash);
> +		spin_unlock_irq(conf->hash_locks + hash);
> +		if (!head)
> +			return;
> +		if (!stripe_can_batch(head))
> +			goto out;
> +	}
>   
>   	lock_two_stripes(head, sh);
>   	/* clear_batch_ready clear the flag */
> @@ -5800,6 +5806,7 @@ enum stripe_result {
>   
>   struct stripe_request_ctx {
>   	bool do_flush;
> +	struct stripe_head *batch_last;
>   };
>   
>   static enum stripe_result make_stripe_request(struct mddev *mddev,
> @@ -5889,8 +5896,13 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
>   		return STRIPE_SCHEDULE_AND_RETRY;
>   	}
>   
> -	if (stripe_can_batch(sh))
> -		stripe_add_to_batch_list(conf, sh);
> +	if (stripe_can_batch(sh)) {
> +		stripe_add_to_batch_list(conf, sh, ctx->batch_last);
> +		if (ctx->batch_last)
> +			raid5_release_stripe(ctx->batch_last);
> +		atomic_inc(&sh->count);
> +		ctx->batch_last = sh;
> +	}
>   
>   	if (ctx->do_flush) {
>   		set_bit(STRIPE_R5C_PREFLUSH, &sh->state);
> @@ -5979,6 +5991,18 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   		} else if (res == STRIPE_RETRY) {
>   			continue;
>   		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
> +			/*
> +			 * Must release the reference to batch_last before
> +			 * scheduling and waiting for work to be done,
> +			 * otherwise the batch_last stripe head could prevent
> +			 * raid5_activate_delayed() from making progress
> +			 * and thus deadlocking.
> +			 */
> +			if (ctx.batch_last) {
> +				raid5_release_stripe(ctx.batch_last);
> +				ctx.batch_last = NULL;
> +			}
> +
>   			schedule();
>   			prepare_to_wait(&conf->wait_for_overlap, &w,
>   					TASK_UNINTERRUPTIBLE);
> @@ -5990,6 +6014,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   
>   	finish_wait(&conf->wait_for_overlap, &w);
>   
> +	if (ctx.batch_last)
> +		raid5_release_stripe(ctx.batch_last);
> +
>   	if (rw == WRITE)
>   		md_write_end(mddev);
>   	bio_endio(bi);

Otherwise looks good, Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>

Thanks,
Guoqing


* Re: [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress
  2022-04-20 19:54 ` [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress Logan Gunthorpe
  2022-04-21  6:18   ` Christoph Hellwig
@ 2022-04-27  1:53   ` Guoqing Jiang
  2022-04-27 16:11     ` Logan Gunthorpe
  1 sibling, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  1:53 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> When testing if a previous stripe has had reshape expand past it, use
> the earliest or latest logical sector in all the disks for that stripe
> head. This will allow adding multiple disks at a time in a subsequent
> patch.
>
> To do this more cleanly, refactor the check into a helper function called
> stripe_ahead_of_reshape().
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 55 ++++++++++++++++++++++++++++++++++------------
>   1 file changed, 41 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 1fa82d8fa89e..40a25c4b80bd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5823,6 +5823,42 @@ static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
>   		return sector >= reshape_sector;
>   }
>   
> +static bool range_ahead_of_reshape(struct mddev *mddev, sector_t min,
> +				   sector_t max, sector_t reshape_sector)
> +{
> +	if (mddev->reshape_backwards)
> +		return max < reshape_sector;
> +	else
> +		return min >= reshape_sector;
> +}
> +
> +static bool stripe_ahead_of_reshape(struct mddev *mddev, struct r5conf *conf,
> +				    struct stripe_head *sh)
> +{
> +	sector_t max_sector = 0, min_sector = MaxSector;
> +	bool ret = false;
> +	int dd_idx;
> +
> +	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
> +		if (dd_idx == sh->pd_idx)
> +			continue;
> +
> +		min_sector = min(min_sector, sh->dev[dd_idx].sector);
> +		max_sector = max(max_sector, sh->dev[dd_idx].sector);
> +	}
> +
> +	spin_lock_irq(&conf->device_lock);
> +
> +	if (!range_ahead_of_reshape(mddev, min_sector, max_sector,
> +				     conf->reshape_progress))
> +		/* mismatch, need to try again */
> +		ret = true;

I think we can just open-code range_ahead_of_reshape.

And it seems the above is not the same as the original check below, which
compares logical_sector with reshape_progress. Is it intentional, or am I
missing something?

...

> -		int must_retry = 0;
> -		spin_lock_irq(&conf->device_lock);
> -		if (!ahead_of_reshape(mddev, logical_sector,
> -				      conf->reshape_progress))
> -			/* mismatch, need to try again */
> -			must_retry = 1;
> -		spin_unlock_irq(&conf->device_lock);
> -		if (must_retry) {
> -			raid5_release_stripe(sh);
> -			return STRIPE_SCHEDULE_AND_RETRY;
> -		}
> +		raid5_release_stripe(sh);
> +		return STRIPE_SCHEDULE_AND_RETRY;
>   	}
>   
>   	if (read_seqcount_retry(&conf->gen_lock, seq)) {

Thanks,
Guoqing


* Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()
  2022-04-20 19:54 ` [PATCH v2 12/12] md/raid5: Pivot raid5_make_request() Logan Gunthorpe
  2022-04-21  6:43   ` Christoph Hellwig
@ 2022-04-27  2:06   ` Guoqing Jiang
  2022-04-27 16:18     ` Logan Gunthorpe
  1 sibling, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  2:06 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:

> The number of times the lock is taken can be reduced by pivoting
> raid5_make_request() so that it loops through every stripe and then
> loops through every disk in that stripe to see if the bio must be
> added. This reduces the number of times the lock must be taken by
> a factor equal to the number of data disks.
>
> To accomplish this, store the minimum and maximum disk sector that
> has already been finished and continue to the next logical sector if
> it is found that the disk sector has already been done. Then add an
> add_all_stripe_bios() helper to check all the bios for overlap and add
> them all if none of them overlap.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 92 +++++++++++++++++++++++++++++++++++++++++++---
>   drivers/md/raid5.h |  1 +
>   2 files changed, 88 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 40a25c4b80bd..f86866cb15be 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -3571,6 +3571,48 @@ static bool add_stripe_bio(struct stripe_head *sh, struct bio *bi,
>   	return true;
>   }
>   
> +static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
> +		sector_t first_logical_sector, sector_t last_sector,
> +		int forwrite, int previous)
> +{
> +	int dd_idx;
> +	int ret = 1;
> +
> +	spin_lock_irq(&sh->stripe_lock);
> +
> +	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
> +		struct r5dev *dev = &sh->dev[dd_idx];
> +
> +		clear_bit(R5_BioReady, &dev->flags);
> +
> +		if (dd_idx == sh->pd_idx)
> +			continue;
> +
> +		if (dev->sector < first_logical_sector ||
> +		    dev->sector >= last_sector)
> +			continue;
> +
> +		if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
> +			set_bit(R5_Overlap, &dev->flags);
> +			ret = 0;
> +			continue;
> +		}
> +
> +		set_bit(R5_BioReady, &dev->flags);

Is it possible to just call __add_stripe_bio() here and change the above
"continue" to "return",

> +	}
> +
> +	if (!ret)
> +		goto out;
> +
> +	for (dd_idx = 0; dd_idx < sh->disks; dd_idx++)
> +		if (test_bit(R5_BioReady, &sh->dev[dd_idx].flags))
> +			__add_stripe_bio(sh, bi, dd_idx, forwrite, previous);

then we don't need another loop here, and there's no need to introduce
another flag. But I am not sure it is feasible, so just FYI.

> +
> +out:
> +	spin_unlock_irq(&sh->stripe_lock);
> +	return ret;
> +}
> +
>   static void end_reshape(struct r5conf *conf);
>   
>   static void stripe_set_idx(sector_t stripe, struct r5conf *conf, int previous,
> @@ -5869,6 +5911,10 @@ enum stripe_result {
>   struct stripe_request_ctx {
>   	bool do_flush;
>   	struct stripe_head *batch_last;
> +	sector_t disk_sector_done;
> +	sector_t start_disk_sector;
> +	bool first_wrap;
> +	sector_t last_sector;
>   };

Could you add some comments to the above members if possible?

>   static enum stripe_result make_stripe_request(struct mddev *mddev,
> @@ -5908,6 +5954,36 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
>   
>   	new_sector = raid5_compute_sector(conf, logical_sector, previous,
>   					  &dd_idx, NULL);
> +
> +	/*
> +	 * This is a tricky algorithm to figure out which stripe_heads that
> +	 * have already been visited and exit early if the stripe_head has
> +	 * already been done. (Seeing as all disks are added to a stripe_head
> +	 * at once in add_all_stripe_bios().)
> +	 *
> +	 * To start with, the disk sector of the last stripe that has been
> +	 * completed is stored in ctx->disk_sector_done. If the new_sector is
> +	 * less than this value, the stripe_head has already been done.
> +	 *
> +	 * There's one issue with this: if the request starts in the middle of
> +	 * a chunk, all the stripe heads before the starting offset will be
> +	 * missed. To account for this, set the first_wrap boolean to true
> +	 * if new_sector is less than the starting sector. Clear the
> +	 * boolean once the start sector is hit for the second time.
> +	 * When first_wrap is set, ignore the disk_sector_done.
> +	 */
> +	if (ctx->start_disk_sector == MaxSector) {
> +		ctx->start_disk_sector = new_sector;
> +	} else if (new_sector < ctx->start_disk_sector) {
> +		ctx->first_wrap = true;
> +	} else if (new_sector == ctx->start_disk_sector) {
> +		ctx->first_wrap = false;
> +		ctx->start_disk_sector = 0;
> +		return STRIPE_SUCCESS;
> +	} else if (!ctx->first_wrap && new_sector <= ctx->disk_sector_done) {
> +		return STRIPE_SUCCESS;
> +	}
> +

Hmm, with the above tricky algorithm, I guess the point is that we can
avoid calling stripe_add_to_batch_list() below, which has hash_lock
contention. If so, maybe we can change stripe_can_batch() for that purpose.

>   	if (stripe_can_batch(sh)) {
>   		stripe_add_to_batch_list(conf, sh, ctx->batch_last);
>   		if (ctx->batch_last)
> @@ -5977,8 +6057,10 @@ static enum stripe_result make_stripe_request(struct mddev *mddev,
>   static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   {
>   	struct r5conf *conf = mddev->private;
> -	sector_t logical_sector, last_sector;
> -	struct stripe_request_ctx ctx = {};
> +	sector_t logical_sector;
> +	struct stripe_request_ctx ctx = {
> +		.start_disk_sector = MaxSector,
> +	};
>   	const int rw = bio_data_dir(bi);
>   	enum stripe_result res;
>   	DEFINE_WAIT(w);
> @@ -6021,7 +6103,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	}
>   
>   	logical_sector = bi->bi_iter.bi_sector & ~((sector_t)RAID5_STRIPE_SECTORS(conf)-1);
> -	last_sector = bio_end_sector(bi);
> +	ctx.last_sector = bio_end_sector(bi);
>   	bi->bi_next = NULL;
>   
>   	/* Bail out if conflicts with reshape and REQ_NOWAIT is set */
> @@ -6036,7 +6118,7 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	}
>   	md_account_bio(mddev, &bi);
>   	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
> -	while (logical_sector < last_sector) {
> +	while (logical_sector < ctx.last_sector) {
>   		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
>   					  bi);
>   		if (res == STRIPE_FAIL) {
> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
> index 638d29863503..e73b58844f83 100644
> --- a/drivers/md/raid5.h
> +++ b/drivers/md/raid5.h
> @@ -308,6 +308,7 @@ enum r5dev_flags {
>   	R5_Wantwrite,
>   	R5_Overlap,	/* There is a pending overlapping request
>   			 * on this block */
> +	R5_BioReady,    /* The current bio can be added to this disk */

This doesn't seem right to me, since the comment describes the bio's status
while the others are probably about the r5dev.

Thanks,
Guoqing


* Re: [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request()
  2022-04-20 19:54 ` [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request() Logan Gunthorpe
  2022-04-21  6:15   ` Christoph Hellwig
@ 2022-04-27  2:11   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  2:11 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> prepare_to_wait() can be reasonably called after schedule instead of
> setting a flag and preparing in the next loop iteration.
>
> This means that prepare_to_wait() will be called before
> read_seqcount_begin(), but there shouldn't be any reason that
> the order matters here. On the first iteration of the loop
> prepare_to_wait() is already called first.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 8 ++------
>   1 file changed, 2 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 5a7334ba0997..b9f618356446 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5915,7 +5915,6 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	const int rw = bio_data_dir(bi);
>   	enum stripe_result res;
>   	DEFINE_WAIT(w);
> -	bool do_prepare;
>   
>   	if (unlikely(bi->bi_opf & REQ_PREFLUSH)) {
>   		int ret = log_handle_flush_request(conf, bi);
> @@ -5973,12 +5972,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
>   		int seq;
>   
> -		do_prepare = false;
>   	retry:
>   		seq = read_seqcount_begin(&conf->gen_lock);
> -		if (do_prepare)
> -			prepare_to_wait(&conf->wait_for_overlap, &w,
> -				TASK_UNINTERRUPTIBLE);
>   
>   		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
>   					  bi, seq);
> @@ -5988,7 +5983,8 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   			goto retry;
>   		} else if (res == STRIPE_SCHEDULE_AND_RETRY) {
>   			schedule();
> -			do_prepare = true;
> +			prepare_to_wait(&conf->wait_for_overlap, &w,
> +					TASK_UNINTERRUPTIBLE);
>   			goto retry;
>   		}
>   	}

Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>

Thanks,
Guoqing


* Re: [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request()
  2022-04-20 19:54 ` [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request() Logan Gunthorpe
  2022-04-21  6:15   ` Christoph Hellwig
@ 2022-04-27  2:13   ` Guoqing Jiang
  1 sibling, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-27  2:13 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
> Now that prepare_to_wait() isn't in the way, move read_seqcount_begin()
> into make_stripe_request().
>
> No functional changes intended.
>
> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
> ---
>   drivers/md/raid5.c | 12 +++++-------
>   1 file changed, 5 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index b9f618356446..1bce9075e165 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5804,13 +5804,15 @@ struct stripe_request_ctx {
>   
>   static enum stripe_result make_stripe_request(struct mddev *mddev,
>   		struct r5conf *conf, struct stripe_request_ctx *ctx,
> -		sector_t logical_sector, struct bio *bi, int seq)
> +		sector_t logical_sector, struct bio *bi)
>   {
>   	const int rw = bio_data_dir(bi);
>   	struct stripe_head *sh;
>   	sector_t new_sector;
>   	int previous = 0;
> -	int dd_idx;
> +	int seq, dd_idx;
> +
> +	seq = read_seqcount_begin(&conf->gen_lock);
>   
>   	if (unlikely(conf->reshape_progress != MaxSector)) {
>   		/* spinlock is needed as reshape_progress may be
> @@ -5970,13 +5972,9 @@ static bool raid5_make_request(struct mddev *mddev, struct bio * bi)
>   	md_account_bio(mddev, &bi);
>   	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
>   	for (; logical_sector < last_sector; logical_sector += RAID5_STRIPE_SECTORS(conf)) {
> -		int seq;
> -
>   	retry:
> -		seq = read_seqcount_begin(&conf->gen_lock);
> -
>   		res = make_stripe_request(mddev, conf, &ctx, logical_sector,
> -					  bi, seq);
> +					  bi);
>   		if (res == STRIPE_FAIL) {
>   			break;
>   		} else if (res == STRIPE_RETRY) {

Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>

Thanks,
Guoqing


* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-27  1:28   ` Guoqing Jiang
@ 2022-04-27 16:07     ` Logan Gunthorpe
  2022-04-28  1:49       ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-27 16:07 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig



On 2022-04-26 19:28, Guoqing Jiang wrote:
>>   +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
>> +                 sector_t reshape_sector)
>> +{
>> +    if (mddev->reshape_backwards)
>> +        return sector < reshape_sector;
>> +    else
>> +        return sector >= reshape_sector;
>> +}
> 
> I think it can be an inline function.

Marking static functions in C files as inline is not recommended. GCC
will inline it, if it is appropriate.

https://yarchive.net/comp/linux/inline.html
https://www.kernel.org/doc/local/inline.html

Logan


* Re: [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop
  2022-04-27  1:32   ` Guoqing Jiang
@ 2022-04-27 16:08     ` Logan Gunthorpe
  2022-04-28  1:16       ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-27 16:08 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-26 19:32, Guoqing Jiang wrote:
> 
> 
> On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
>> Break immediately if raid5_get_active_stripe() returns NULL and deindent
>> the rest of the loop. Annotate this check with an unlikely().
>>
>> This makes the code easier to read and reduces the indentation level.
>>
>> No functional changes intended.
>>
>> Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
>> ---
>>   drivers/md/raid5.c | 109 +++++++++++++++++++++++----------------------
>>   1 file changed, 55 insertions(+), 54 deletions(-)
>>
>> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
>> index 97b23c18402b..cda6857e6207 100644
>> --- a/drivers/md/raid5.c
>> +++ b/drivers/md/raid5.c
>> @@ -5906,68 +5906,69 @@ static bool raid5_make_request(struct mddev
>> *mddev, struct bio * bi)
> 
> ...
> 
>> +        if (unlikely(!sh)) {
>> +            /* cannot get stripe, just give-up */
>> +            bi->bi_status = BLK_STS_IOERR;
>> +            break;
>> +        }
> 
> 
> Nit: I would suggest keeping the original comment below.

But the original comment was plainly wrong...

Logan


* Re: [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress
  2022-04-27  1:53   ` Guoqing Jiang
@ 2022-04-27 16:11     ` Logan Gunthorpe
  0 siblings, 0 replies; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-27 16:11 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-26 19:53, Guoqing Jiang wrote:
>> +static bool stripe_ahead_of_reshape(struct mddev *mddev, struct
>> r5conf *conf,
>> +                    struct stripe_head *sh)
>> +{
>> +    sector_t max_sector = 0, min_sector = MaxSector;
>> +    bool ret = false;
>> +    int dd_idx;
>> +
>> +    for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
>> +        if (dd_idx == sh->pd_idx)
>> +            continue;
>> +
>> +        min_sector = min(min_sector, sh->dev[dd_idx].sector);
>> +        max_sector = max(max_sector, sh->dev[dd_idx].sector);
>> +    }
>> +
>> +    spin_lock_irq(&conf->device_lock);
>> +
>> +    if (!range_ahead_of_reshape(mddev, min_sector, max_sector,
>> +                     conf->reshape_progress))
>> +        /* mismatch, need to try again */
>> +        ret = true;
> 
> I think we can just open-code range_ahead_of_reshape.

I think this way is much easier to read. GCC is more than capable of
optimizing it either way.

> And it seems the above is not the same as the original check below, which
> compares logical_sector with reshape_progress. Is it intentional, or am I
> missing something?

Yes, this was intentional; the change is necessary for the next patch to
ensure that all pages in a stripe are ahead of the reshape. This was not
intended as just a cleanup patch.
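
To illustrate why the whole range matters, here is a tiny self-contained
model (the sector values and helper name are invented for the example;
this is not the kernel code). With a forward reshape, a single-sector
check can pass even though another disk in the same stripe head has not
been reshaped yet, while the min/max range check catches it:

#include <stdio.h>
#include <stdbool.h>

typedef unsigned long long sector_t;

/* Toy version of the range check: the stripe only counts as ahead of
 * the reshape if every data disk's sector is, in either direction. */
static bool range_ahead(bool backwards, sector_t min_s, sector_t max_s,
			sector_t reshape)
{
	return backwards ? max_s < reshape : min_s >= reshape;
}

int main(void)
{
	sector_t sectors[] = { 96, 104, 112 };	/* one stripe's data disks */
	sector_t min_s = sectors[0], max_s = sectors[0];
	int i;

	for (i = 1; i < 3; i++) {
		if (sectors[i] < min_s)
			min_s = sectors[i];
		if (sectors[i] > max_s)
			max_s = sectors[i];
	}

	/* With reshape progress at 100, the single sector 104 looks
	 * fine, but the disk at sector 96 has not been reshaped yet. */
	printf("single sector 104: %d\n", 104 >= 100);		/* 1 */
	printf("whole stripe:      %d\n",
	       range_ahead(false, min_s, max_s, 100));		/* 0 */
	return 0;
}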

Logan


* Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()
  2022-04-27  2:06   ` Guoqing Jiang
@ 2022-04-27 16:18     ` Logan Gunthorpe
  2022-04-28  1:32       ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-27 16:18 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-26 20:06, Guoqing Jiang wrote:
>>   +static int add_all_stripe_bios(struct stripe_head *sh, struct bio *bi,
>> +        sector_t first_logical_sector, sector_t last_sector,
>> +        int forwrite, int previous)
>> +{
>> +    int dd_idx;
>> +    int ret = 1;
>> +
>> +    spin_lock_irq(&sh->stripe_lock);
>> +
>> +    for (dd_idx = 0; dd_idx < sh->disks; dd_idx++) {
>> +        struct r5dev *dev = &sh->dev[dd_idx];
>> +
>> +        clear_bit(R5_BioReady, &dev->flags);
>> +
>> +        if (dd_idx == sh->pd_idx)
>> +            continue;
>> +
>> +        if (dev->sector < first_logical_sector ||
>> +            dev->sector >= last_sector)
>> +            continue;
>> +
>> +        if (stripe_bio_overlaps(sh, bi, dd_idx, forwrite)) {
>> +            set_bit(R5_Overlap, &dev->flags);
>> +            ret = 0;
>> +            continue;
>> +        }
>> +
>> +        set_bit(R5_BioReady, &dev->flags);
> 
> Is it possible to just call __add_stripe_bio() here and change the above
> "continue" to "return",

No. The reason it was done this way is that we either have to add the
bio to all the disks or to none of them. Otherwise, if one fails and we
have to retry, we can't know which stripes were already added and which
were not.
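
To make that constraint concrete, here is a stripped-down, self-contained
sketch of the all-or-nothing pattern (NSLOTS, slot_conflicts() and the
conflict mask are invented for the example; they are not kernel APIs).
Like add_all_stripe_bios(), it does a first pass that only checks and a
second pass that commits, so a failed attempt leaves nothing to unwind
before the retry:

#include <stdbool.h>
#include <stdio.h>

#define NSLOTS 4	/* stands in for the data disks of one stripe */

/* A slot "conflicts" if its bit is set in the mask (toy stand-in for
 * stripe_bio_overlaps()). */
static bool slot_conflicts(int slot, unsigned int conflict_mask)
{
	return conflict_mask & (1u << slot);
}

/* Check every slot first; commit to all of them only if none conflict.
 * Returns 1 on success, 0 if the caller must retry the whole request. */
static int add_all_or_none(unsigned int conflict_mask)
{
	bool ready[NSLOTS] = { false };
	int i, ret = 1;

	for (i = 0; i < NSLOTS; i++) {		/* pass 1: check only */
		if (slot_conflicts(i, conflict_mask)) {
			ret = 0;		/* note it, keep scanning */
			continue;
		}
		ready[i] = true;
	}

	if (!ret)
		return 0;	/* nothing was modified; a retry is safe */

	for (i = 0; i < NSLOTS; i++)		/* pass 2: commit */
		if (ready[i])
			printf("added bio to slot %d\n", i);

	return 1;
}

int main(void)
{
	printf("no conflicts -> %d\n", add_all_or_none(0x0));
	printf("slot 1 busy  -> %d\n", add_all_or_none(0x2));
	return 0;
}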

>> @@ -5869,6 +5911,10 @@ enum stripe_result {
>>   struct stripe_request_ctx {
>>       bool do_flush;
>>       struct stripe_head *batch_last;
>> +    sector_t disk_sector_done;
>> +    sector_t start_disk_sector;
>> +    bool first_wrap;
>> +    sector_t last_sector;
>>   };
> 
> Could you add some comments to the above members if possible?

Yes, Christoph already asked for this and I've reworked this patch to
make it much clearer for v3.

>>   static enum stripe_result make_stripe_request(struct mddev *mddev,
>> @@ -5908,6 +5954,36 @@ static enum stripe_result
>> make_stripe_request(struct mddev *mddev,
>>         new_sector = raid5_compute_sector(conf, logical_sector, previous,
>>                         &dd_idx, NULL);
>> +
>> +    /*
>> +     * This is a tricky algorithm to figure out which stripe_heads that
>> +     * have already been visited and exit early if the stripe_head has
>> +     * already been done. (Seeing as all disks are added to a stripe_head
>> +     * at once in add_all_stripe_bios().)
>> +     *
>> +     * To start with, the disk sector of the last stripe that has been
>> +     * completed is stored in ctx->disk_sector_done. If the
>> new_sector is
>> +     * less than this value, the stripe_head has already been done.
>> +     *
>> +     * There's one issue with this: if the request starts in the
>> middle of
>> +     * a chunk, all the stripe heads before the starting offset will be
>> +     * missed. To account for this, set the first_wrap boolean to true
>> +     * if new_sector is less than the starting sector. Clear the
>> +     * boolean once the start sector is hit for the second time.
>> +     * When first_wrap is set, ignore the disk_sector_done.
>> +     */
>> +    if (ctx->start_disk_sector == MaxSector) {
>> +        ctx->start_disk_sector = new_sector;
>> +    } else if (new_sector < ctx->start_disk_sector) {
>> +        ctx->first_wrap = true;
>> +    } else if (new_sector == ctx->start_disk_sector) {
>> +        ctx->first_wrap = false;
>> +        ctx->start_disk_sector = 0;
>> +        return STRIPE_SUCCESS;
>> +    } else if (!ctx->first_wrap && new_sector <=
>> ctx->disk_sector_done) {
>> +        return STRIPE_SUCCESS;
>> +    }
>> +
> 
> Hmm, with the above tricky algorithm, I guess the point is that we can
> avoid calling stripe_add_to_batch_list() below, which has hash_lock
> contention. If so, maybe we can change stripe_can_batch() for that purpose.

No, that's not the purpose. The purpose is to add the bio to the stripe
for every disk, so that we can avoid calling find_get_stripe() for every
single page and only call it once per stripe head.
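
As a back-of-the-envelope model of that effect (the counts and helper
names below are made up for the example; this is not the kernel code),
compare how often the contended lookup runs in the old per-page shape
versus the pivoted per-stripe shape:

#include <stdio.h>

#define STRIPES    8	/* stripes spanned by one large request */
#define DATA_DISKS 4	/* data disks per stripe */

static int lookups;	/* models how often the hash_lock is taken */
static int pages_added;

static void find_get_stripe_model(void)
{
	lookups++;
}

static void add_page_model(void)
{
	pages_added++;
}

int main(void)
{
	int s, d;

	/* Old shape: look the stripe_head up once per 4KB page. */
	lookups = pages_added = 0;
	for (s = 0; s < STRIPES; s++)
		for (d = 0; d < DATA_DISKS; d++) {
			find_get_stripe_model();
			add_page_model();
		}
	printf("old:     %d pages, %d lookups\n", pages_added, lookups);

	/* Pivoted shape: one lookup per stripe_head, then every disk of
	 * the stripe is added at once, as add_all_stripe_bios() does. */
	lookups = pages_added = 0;
	for (s = 0; s < STRIPES; s++) {
		find_get_stripe_model();
		for (d = 0; d < DATA_DISKS; d++)
			add_page_model();
	}
	printf("pivoted: %d pages, %d lookups\n", pages_added, lookups);

	return 0;
}

With 8 stripes and 4 data disks this prints 32 lookups for the old shape
and 8 for the pivoted one: a reduction by a factor equal to the number of
data disks.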


>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>> index 638d29863503..e73b58844f83 100644
>> --- a/drivers/md/raid5.h
>> +++ b/drivers/md/raid5.h
>> @@ -308,6 +308,7 @@ enum r5dev_flags {
>>       R5_Wantwrite,
>>       R5_Overlap,    /* There is a pending overlapping request
>>                * on this block */
>> +    R5_BioReady,    /* The current bio can be added to this disk */
> 
>> This doesn't seem right to me, since the comment describes the bio's status
>> while the others are probably about the r5dev.

I'm not sure I understand the objection. If you have a better option
please let me know.

I'm still working on this patch. Caught a couple more rare bugs that I'm
working to fix. The next version should also hopefully be clearer.

Thanks,

Logan




* Re: [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch
  2022-04-27  1:36   ` Guoqing Jiang
@ 2022-04-27 23:27     ` Logan Gunthorpe
  0 siblings, 0 replies; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-27 23:27 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-26 19:36, Guoqing Jiang wrote:
> On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
>>     /* we only do back search */
>> -static void stripe_add_to_batch_list(struct r5conf *conf, struct
>> stripe_head *sh)
>> +static void stripe_add_to_batch_list(struct r5conf *conf,
>> +        struct stripe_head *sh, struct stripe_head *last_sh)
> 
> Nit: from stripe_add_to_batch_list()'s view, I think "head_sh" makes more
> sense than "last_sh".

That made sense to me, but upon a closer look while making the change, I
think it's not a good idea:

stripe_add_to_batch_list() already has a stripe_head variable called
"head". If it now has an argument called "head_sh", it becomes very
confusing. This statement wouldn't make any sense:
+    if (last_sh && head_sector == last_sh->sector) {
+        head = last_sh;
+        atomic_inc(&head->count);
+    } else {

If it were changed to "head = head_sh", what would that even mean?

From stripe_add_to_batch_list()'s perspective, it is the "last" stripe
head, and it then decides whether it is the correct stripe to use as
the head of the list to add to.

So I decline to make this change.

Thanks,

Logan





* Re: [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop
  2022-04-27 16:08     ` Logan Gunthorpe
@ 2022-04-28  1:16       ` Guoqing Jiang
  0 siblings, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-28  1:16 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/28/22 12:08 AM, Logan Gunthorpe wrote:
>
>>> +        if (unlikely(!sh)) {
>>> +            /* cannot get stripe, just give-up */
>>> +            bi->bi_status = BLK_STS_IOERR;
>>> +            break;
>>> +        }
>>
>> Nit: I would suggest keeping the original comment below.
> But the original comment was plainly wrong...

I think it was about getting the stripe for read-ahead.

https://elixir.bootlin.com/linux/v5.18-rc4/source/drivers/md/raid5.c#L5869

And it deserves a separate change if you think the comment is wrong, but
just my $0.02.

Thanks,
Guoqing


* Re: [PATCH v2 12/12] md/raid5: Pivot raid5_make_request()
  2022-04-27 16:18     ` Logan Gunthorpe
@ 2022-04-28  1:32       ` Guoqing Jiang
  0 siblings, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-28  1:32 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 4/28/22 12:18 AM, Logan Gunthorpe wrote:
>>> diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
>>> index 638d29863503..e73b58844f83 100644
>>> --- a/drivers/md/raid5.h
>>> +++ b/drivers/md/raid5.h
>>> @@ -308,6 +308,7 @@ enum r5dev_flags {
>>>        R5_Wantwrite,
>>>        R5_Overlap,    /* There is a pending overlapping request
>>>                 * on this block */
>>> +    R5_BioReady,    /* The current bio can be added to this disk */
>> This doesn't seem right to me, since the comment describes the bio's status
>> while the others are probably about the r5dev.
> I'm not sure I understand the objection. If you have a better option
> please let me know.

As said, the flag is for the r5dev, not for the bio; please don't make
people confused about it.

My personal suggestion would be to change it to R5_Ready_For_BIO or
something else, then update the comment accordingly.

> I'm still working on this patch. Caught a couple more rare bugs that I'm
> working to fix. The next version should also hopefully be clearer.

Thanks,
Guoqing


* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-27 16:07     ` Logan Gunthorpe
@ 2022-04-28  1:49       ` Guoqing Jiang
  2022-04-28 15:44         ` Logan Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-28  1:49 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig



On 4/28/22 12:07 AM, Logan Gunthorpe wrote:
>
> On 2022-04-26 19:28, Guoqing Jiang wrote:
>>>    +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
>>> +                 sector_t reshape_sector)
>>> +{
>>> +    if (mddev->reshape_backwards)
>>> +        return sector < reshape_sector;
>>> +    else
>>> +        return sector >= reshape_sector;
>>> +}
>> I think it can be an inline function.
> Marking static functions in C files as inline is not recommended. GCC
> will inline it, if it is appropriate.
>
> https://yarchive.net/comp/linux/inline.html
> https://www.kernel.org/doc/local/inline.html

Thanks for the link; then I suppose those can be deleted:

linux> grep "static inline" drivers/md/md.h -r
static inline int is_badblock(struct md_rdev *rdev, sector_t s, int sectors,
static inline int __must_check mddev_lock(struct mddev *mddev)
static inline void mddev_lock_nointr(struct mddev *mddev)
static inline int mddev_trylock(struct mddev *mddev)
static inline int mddev_is_locked(struct mddev *mddev)
static inline void md_sync_acct(struct block_device *bdev, unsigned long nr_sectors)
static inline void md_sync_acct_bio(struct bio *bio, unsigned long nr_sectors)
static inline struct kernfs_node *sysfs_get_dirent_safe(struct kernfs_node *sd, char *name)
static inline void sysfs_notify_dirent_safe(struct kernfs_node *sd)
static inline char * mdname (struct mddev * mddev)
static inline int sysfs_link_rdev(struct mddev *mddev, struct md_rdev *rdev)
static inline void sysfs_unlink_rdev(struct mddev *mddev, struct md_rdev *rdev)
static inline void safe_put_page(struct page *p)
static inline bool is_mddev_broken(struct md_rdev *rdev, const char *md_type)
static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
static inline int mddev_is_clustered(struct mddev *mddev)
static inline void mddev_clear_unsupported_flags(struct mddev *mddev,
static inline void mddev_check_write_zeroes(struct mddev *mddev, struct bio *bio)

Thanks,
Guoqing


* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-28  1:49       ` Guoqing Jiang
@ 2022-04-28 15:44         ` Logan Gunthorpe
  2022-04-29  0:24           ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-28 15:44 UTC (permalink / raw)
  To: Guoqing Jiang, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig




On 2022-04-27 19:49, Guoqing Jiang wrote:
> 
> 
> On 4/28/22 12:07 AM, Logan Gunthorpe wrote:
>>
>> On 2022-04-26 19:28, Guoqing Jiang wrote:
>>>>    +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
>>>> +                 sector_t reshape_sector)
>>>> +{
>>>> +    if (mddev->reshape_backwards)
>>>> +        return sector < reshape_sector;
>>>> +    else
>>>> +        return sector >= reshape_sector;
>>>> +}
>>> I think it can be an inline function.
>> Marking static functions in C files as inline is not recommended. GCC
>> will inline it, if it is appropriate.
>>
>> https://yarchive.net/comp/linux/inline.html
>> https://www.kernel.org/doc/local/inline.html
> 
> Thanks for the link; then I suppose those can be deleted:

> linux> grep "static inline" drivers/md/md.h -r

It's standard practice to annotate small functions in headers with
"static inline". Without the inline annotation, any C file that includes
the header and doesn't use the function will emit a "defined but not
used" warning.

Functions in headers also should, by definition, be small and
specifically inline-able (i.e., they are used as a type-safe macro).

Static functions in C files (not headers) should not have the inline
keyword, as GCC can optimize and inline them as it sees fit and the
inline keyword doesn't actually do anything.
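
A minimal, self-contained sketch of the convention (the helper names are
made up for the example); the comments mark which form belongs in a
shared header and which in a .c file:

#include <stdio.h>

/*
 * What would live in a shared header: marked "static inline" so a C
 * file that includes the header but never calls it doesn't warn about
 * a defined-but-unused static function.
 */
static inline int double_it(int x)
{
	return 2 * x;
}

/*
 * What lives in a .c file: plain "static" is enough, and the compiler
 * decides on its own whether inlining is profitable.
 */
static int triple_it(int x)
{
	return 3 * x;
}

int main(void)
{
	printf("%d %d\n", double_it(21), triple_it(14));
	return 0;
}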

Logan


* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-25 16:12         ` Xiao Ni
@ 2022-04-28 21:22           ` Logan Gunthorpe
  2022-04-29  0:49             ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-28 21:22 UTC (permalink / raw)
  To: Xiao Ni
  Cc: Guoqing Jiang, open list, linux-raid, Song Liu,
	Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-25 10:12, Xiao Ni wrote:
>> I do know that lkp-tests has run it on this series as I did get an error
>> from it. But while I'm pretty sure that error has been resolved, I was
>> never able to figure out how to run them locally.
>>
> 
> Hi Logan
> 
> You can clone the mdadm repo at
> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>> Then you can find a script named "test" in the top-level directory; it's
>> not under the tests directory.
>> The test cases are under the tests directory.

So I've been fighting with this and it seems there are just a ton of
failures in these tests without my changes. Running on the latest master
(52c67fcdd6dad) with stock v5.17.5, I see major brokenness. About 17 out
of 44 tests that run failed. I had to run with --disable-integrity
because those tests seem to hang in an infinite loop waiting for the md
array to go into the U state (even though it appears idle).

Even though I ran the tests with '--keep-going', the testing stopped
after the 07revert-grow reported errors in dmesg -- even though the only
errors printed to dmesg were that of mdadm segfaulting.

Running on md/md-next seems to get a bit further (to
10ddf-create-fail-rebuild) and stops with the same segfaulting issue (or
perhaps the 07 test only randomly fails first -- I haven't run it that
many times). Though most of the tests between these points fail anyway.

My upcoming v3 patches cause no failures that are different from the
md/md-next branch. But it seems these tests have rotted to the point
that they aren't all that useful; or maybe there are a ton of
regressions in the kernel already and nobody was paying much attention.
I have also tried to test certain cases that appear broken in recent
kernels anyway (for example, reducing the number of disks in a raid5
array hangs on the first stripe to reshape).

In any case, I have a very rough ad-hoc test suite I've been expanding
that is targeted at testing my specific changes. Testing these changes
has definitely been challenging. For now, I've published my tests here:

https://github.com/Eideticom/raid5-tests

Logan


* Re: [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function
  2022-04-28 15:44         ` Logan Gunthorpe
@ 2022-04-29  0:24           ` Guoqing Jiang
  0 siblings, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-29  0:24 UTC (permalink / raw)
  To: Logan Gunthorpe, linux-kernel, linux-raid, Song Liu
  Cc: Christoph Hellwig, Stephen Bates, Martin Oliveira, David Sloan,
	Christoph Hellwig



On 4/28/22 11:44 PM, Logan Gunthorpe wrote:
>
>
> On 2022-04-27 19:49, Guoqing Jiang wrote:
>>
>> On 4/28/22 12:07 AM, Logan Gunthorpe wrote:
>>> On 2022-04-26 19:28, Guoqing Jiang wrote:
>>>>>     +static bool ahead_of_reshape(struct mddev *mddev, sector_t sector,
>>>>> +                 sector_t reshape_sector)
>>>>> +{
>>>>> +    if (mddev->reshape_backwards)
>>>>> +        return sector < reshape_sector;
>>>>> +    else
>>>>> +        return sector >= reshape_sector;
>>>>> +}
>>>> I think it can be an inline function.
>>> Marking static functions in C files as inline is not recommended. GCC
>>> will inline it, if it is appropriate.
>>>
>>> https://yarchive.net/comp/linux/inline.html
>>> https://www.kernel.org/doc/local/inline.html
>> Thanks for the link; then I suppose those can be deleted:
>> linux> grep "static inline" drivers/md/md.h -r
> It's standard practice to annotate small functions in headers with
> "static inline". Without the inline annotation, any C file that includes
> the header and doesn't use the function will emit a "defined but not
> used" warning.
>
> Functions in headers also should, by definition, be small and
> specifically inline-able (i.e., they are used as a type-safe macro).
>
> Static functions in C files (not headers) should not have the inline
> keyword, as GCC can optimize and inline them as it sees fit and the
> inline keyword doesn't actually do anything.

I am happy to be taught, but still I can see lots of static function in 
C files
as well, at least

linux> grep "static inline" drivers/md/*.c -r|wc
      98     661    8630

Anyway, I don't want to argue about it anymore.

Thanks,
Guoqing


* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-28 21:22           ` Logan Gunthorpe
@ 2022-04-29  0:49             ` Guoqing Jiang
  2022-04-29 16:01               ` Logan Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-29  0:49 UTC (permalink / raw)
  To: Logan Gunthorpe, Xiao Ni
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Stephen Bates, Martin Oliveira, David Sloan



On 4/29/22 5:22 AM, Logan Gunthorpe wrote:
>
> On 2022-04-25 10:12, Xiao Ni wrote:
>>> I do know that lkp-tests has run it on this series as I did get an error
>>> from it. But while I'm pretty sure that error has been resolved, I was
>>> never able to figure out how to run them locally.
>>>
>> Hi Logan
>>
>> You can clone the mdadm repo at
>> git://git.kernel.org/pub/scm/utils/mdadm/mdadm.git
>> Then you can find a script named "test" in the top-level directory; it's
>> not under the tests directory.
>> The test cases are under the tests directory.
> So I've been fighting with this and it seems there are just a ton of
> failures in these tests without my changes. Running on the latest master
> (52c67fcdd6dad) with stock v5.17.5, I see major brokenness. About 17 out
> of 44 tests that run failed. I had to run with --disable-integrity
> because those tests seem to hang in an infinite loop waiting for the md
> array to go into the U state (even though it appears idle).
>
> Even though I ran the tests with '--keep-going', the testing stopped
> after the 07revert-grow reported errors in dmesg -- even though the only
> errors printed to dmesg were that of mdadm segfaulting.
>
> Running on md/md-next seems to get a bit further (to
> 10ddf-create-fail-rebuild) and stops with the same segfaulting issue (or
> perhaps the 07 test only randomly fails first -- I haven't run it that
> many times). Though most of the tests between these points fail anyway.
>
> My upcoming v3 patches cause no failures that are different from the
> md/md-next branch. But it seems these tests have rotted to the point
> that they aren't all that useful; or maybe there are a ton of
> regressions in the kernel already and nobody was paying much attention.

I can't agree with you anymore. I would say some patches were submitted
without running enough tests; then, kernel release after kernel release,
things become worse.

This is also the reason that I recommend running the mdadm tests: since
md raid is a complex subsystem, even a simple change could cause a
regression. And considering there are really limited developers and
reviewers in the community, the chance of causing a regression gets bigger.

> I have also tried to test certain cases that appear broken in recent
> kernels anyway (for example, reducing the number of disks in a raid5
> array hangs on the first stripe to reshape).
>
> In any case, I have a very rough ad-hoc test suite I've been expanding
> that is targeted at testing my specific changes. Testing these changes
> has definitely been challenging. For now, I've published my tests here:
>
> https://github.com/Eideticom/raid5-tests

If I may, is it possible to submit your tests to mdadm as well? So we can
have one common place that contains enough tests.

Thanks,
Guoqing


* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-29  0:49             ` Guoqing Jiang
@ 2022-04-29 16:01               ` Logan Gunthorpe
  2022-04-30  1:44                 ` Guoqing Jiang
  0 siblings, 1 reply; 59+ messages in thread
From: Logan Gunthorpe @ 2022-04-29 16:01 UTC (permalink / raw)
  To: Guoqing Jiang, Xiao Ni
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Stephen Bates, Martin Oliveira, David Sloan



On 2022-04-28 18:49, Guoqing Jiang wrote:
> I can't agree with you anymore. I would say some patches were submitted
> without running enough tests; then, kernel release after kernel release,
> things become worse.

I'm not sure where we disagree here. I certainly don't want to introduce
regressions myself. I haven't submitted v3 yet because I've become less
certain that there are no regressions in it. The point of my last email
was to try to explain that I am taking testing seriously.

> This is also the reason that I recommend running the mdadm tests: since
> md raid is a complex subsystem, even a simple change could cause a
> regression. And considering there are really limited developers and
> reviewers in the community, the chance of causing a regression gets bigger.

While I'd certainly like to run the mdadm tests, they appear to be very
broken to me. Too broken for me to fix all of them -- I don't have time
to fix that many issues. It seems I'm not the only one to run into this
problem recently:

https://lore.kernel.org/linux-raid/20220111130635.00001478@linux.intel.com/T/#t

And it's a shame nobody could even bother to remove the unsupported 0.9
metadata tests from the repo as a result of this conversation.

> If I may, is it possible to submit your tests to mdadm as well? So we can
> have one common place that contains enough tests.

I'd certainly consider that if I could run the test suite. Though one
hitch is that I've found I need to run my tests repeatedly, for hours,
before hitting some rare bugs. Running the tests only once is much
easier to pass. It's hard to fully test things like this with so many
rare retry paths in a simple regression test.

Logan


* Re: [PATCH v2 00/12] Improve Raid5 Lock Contention
  2022-04-29 16:01               ` Logan Gunthorpe
@ 2022-04-30  1:44                 ` Guoqing Jiang
  0 siblings, 0 replies; 59+ messages in thread
From: Guoqing Jiang @ 2022-04-30  1:44 UTC (permalink / raw)
  To: Logan Gunthorpe, Xiao Ni
  Cc: open list, linux-raid, Song Liu, Christoph Hellwig,
	Stephen Bates, Martin Oliveira, David Sloan



On 4/30/22 12:01 AM, Logan Gunthorpe wrote:
>
> On 2022-04-28 18:49, Guoqing Jiang wrote:
>> I can't agree with you anymore. I would say some patches were submitted
>> without running enough tests; then, kernel release after kernel release,
>> things become worse.
> I'm not sure where we disagree here. I certainly don't want to introduce
> regressions myself. I haven't submitted v3 yet because I've become less
> certain that there are no regressions in it. The point of my last email
> was to try to explain that I am taking testing seriously.

That is my intention too: no more new regressions.

>> This is also the reason that I recommend running the mdadm tests: since
>> md raid is a complex subsystem, even a simple change could cause a
>> regression. And considering there are really limited developers and
>> reviewers in the community, the chance of causing a regression gets bigger.
> While I'd certainly like to run the mdadm tests, they appear to be very
> broken to me. Too broken for me to fix all of them -- I don't have time
> to fix that many issues.

I do agree it is not reasonable to ask you to fix them; just compare the
test results with and without your set, so that at least there are no new
failures, as said.

> Seems I'm not the only one to run into this problem recently:
>
> https://lore.kernel.org/linux-raid/20220111130635.00001478@linux.intel.com/T/#t
>
> And it's a shame nobody could even bother to remove the unsupported 0.9
> metadata tests from the repo as a result of this conversation.
>
>> If I may, is it possible to submit your tests to mdadm as well? So we can
>> have one common place that contains enough tests.
> I'd certainly consider that if I could run the test suite. Though one
> hitch is that I've found I need to run my tests repeatedly, for hours,
> before hitting some rare bugs. Running the tests only once is much
> easier to pass. It's hard to fully test things like this with so many
> rare retry paths in a simple regression test.

Let's focus on the raid456 tests, given the code only touches raid5; you
can pass arguments like this, FYI:

mdadm> ./test --raidtype=raid456 --dev=loop

Thanks,
Guoqing


end of thread

Thread overview: 59+ messages
2022-04-20 19:54 [PATCH v2 00/12] Improve Raid5 Lock Contention Logan Gunthorpe
2022-04-20 19:54 ` [PATCH v2 01/12] md/raid5: Factor out ahead_of_reshape() function Logan Gunthorpe
2022-04-21  6:07   ` Christoph Hellwig
2022-04-21  9:17   ` Paul Menzel
2022-04-21 16:05     ` Logan Gunthorpe
2022-04-21 23:33       ` Wol
2022-04-27  1:28   ` Guoqing Jiang
2022-04-27 16:07     ` Logan Gunthorpe
2022-04-28  1:49       ` Guoqing Jiang
2022-04-28 15:44         ` Logan Gunthorpe
2022-04-29  0:24           ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 02/12] md/raid5: Refactor raid5_make_request loop Logan Gunthorpe
2022-04-21  6:08   ` Christoph Hellwig
2022-04-27  1:32   ` Guoqing Jiang
2022-04-27 16:08     ` Logan Gunthorpe
2022-04-28  1:16       ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 03/12] md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio() Logan Gunthorpe
2022-04-27  1:33   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 04/12] md/raid5: Move common stripe count increment code into __find_stripe() Logan Gunthorpe
2022-04-21  6:10   ` Christoph Hellwig
2022-04-27  1:33   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 05/12] md/raid5: Factor out helper from raid5_make_request() loop Logan Gunthorpe
2022-04-21  6:14   ` Christoph Hellwig
2022-04-20 19:54 ` [PATCH v2 06/12] md/raid5: Drop the do_prepare flag in raid5_make_request() Logan Gunthorpe
2022-04-21  6:15   ` Christoph Hellwig
2022-04-27  2:11   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 07/12] md/raid5: Move read_seqcount_begin() into make_stripe_request() Logan Gunthorpe
2022-04-21  6:15   ` Christoph Hellwig
2022-04-27  2:13   ` Guoqing Jiang
2022-04-20 19:54 ` [PATCH v2 08/12] md/raid5: Refactor for loop in raid5_make_request() into while loop Logan Gunthorpe
2022-04-21  6:16   ` Christoph Hellwig
2022-04-20 19:54 ` [PATCH v2 09/12] md/raid5: Keep a reference to last stripe_head for batch Logan Gunthorpe
2022-04-21  6:17   ` Christoph Hellwig
2022-04-27  1:36   ` Guoqing Jiang
2022-04-27 23:27     ` Logan Gunthorpe
2022-04-20 19:54 ` [PATCH v2 10/12] md/raid5: Refactor add_stripe_bio() Logan Gunthorpe
2022-04-21  6:18   ` Christoph Hellwig
2022-04-20 19:54 ` [PATCH v2 11/12] md/raid5: Check all disks in a stripe_head for reshape progress Logan Gunthorpe
2022-04-21  6:18   ` Christoph Hellwig
2022-04-27  1:53   ` Guoqing Jiang
2022-04-27 16:11     ` Logan Gunthorpe
2022-04-20 19:54 ` [PATCH v2 12/12] md/raid5: Pivot raid5_make_request() Logan Gunthorpe
2022-04-21  6:43   ` Christoph Hellwig
2022-04-21 15:54     ` Logan Gunthorpe
2022-04-27  2:06   ` Guoqing Jiang
2022-04-27 16:18     ` Logan Gunthorpe
2022-04-28  1:32       ` Guoqing Jiang
2022-04-21  8:45 ` [PATCH v2 00/12] Improve Raid5 Lock Contention Xiao Ni
2022-04-21 16:02   ` Logan Gunthorpe
2022-04-24  8:00     ` Guoqing Jiang
2022-04-25 15:39       ` Logan Gunthorpe
2022-04-25 16:12         ` Xiao Ni
2022-04-28 21:22           ` Logan Gunthorpe
2022-04-29  0:49             ` Guoqing Jiang
2022-04-29 16:01               ` Logan Gunthorpe
2022-04-30  1:44                 ` Guoqing Jiang
2022-04-24  7:53 ` Guoqing Jiang
2022-04-25 15:37   ` Logan Gunthorpe
2022-04-25 23:07 ` Song Liu
