* [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
       [not found] <162302508816.16225.936948442459930625@noble.neil.brown.name>
@ 2021-06-07 11:07 ` gal.ofri
  2021-06-07 11:22   ` NeilBrown
  0 siblings, 1 reply; 3+ messages in thread
From: gal.ofri @ 2021-06-07 11:07 UTC (permalink / raw)
  To: linux-raid; +Cc: Gal Ofri, Song Liu, Neil Brown

From: Gal Ofri <gal.ofri@storing.io>

There is lock contention on device_lock in read_one_chunk().
device_lock is taken to synchronize conf->active_aligned_reads with
conf->quiesce.
read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
before incrementing active_aligned_reads.
raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
for active_aligned_reads to reach zero before setting quiesce=1
(suspended).
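
For reference, a condensed sketch of the two sides of this protocol
before the patch (paraphrased from the driver, not a literal excerpt):

	/* read_one_chunk(), old behavior: every aligned read serializes
	 * on device_lock just to check quiesce and bump the counter.
	 */
	spin_lock_irq(&conf->device_lock);
	wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
			    conf->device_lock);
	atomic_inc(&conf->active_aligned_reads);
	spin_unlock_irq(&conf->device_lock);

	/* raid5_quiesce(mddev, 1), condensed: mark quiesce in progress,
	 * wait for stripes and aligned reads to drain (dropping the hash
	 * locks while waiting), then mark the array quiesced.
	 */
	conf->quiesce = 2;
	wait_event_cmd(conf->wait_for_quiescent,
		       atomic_read(&conf->active_stripes) == 0 &&
		       atomic_read(&conf->active_aligned_reads) == 0,
		       unlock_all_device_hash_locks_irq(conf),
		       lock_all_device_hash_locks_irq(conf));
	conf->quiesce = 1;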

Introduce a fast (lockless) path in read_one_chunk(): activate the
aligned read without taking device_lock.  If a quiesce starts while the
aligned read is being activated on the fast path, deactivate it and fall
back to the old behavior (take device_lock and wait for the quiesce to
finish).

Add an smp_store_release()/smp_load_acquire() pair in
raid5_quiesce()/read_one_chunk() respectively to guarantee that
read_one_chunk() does not miss an ongoing quiesce.
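
Sketch of the resulting fast path (the real change is in the diff
below; the comments spell out the intended guarantee):

	/* raid5_read_one_chunk(), fast path (sketch) */
	did_inc = false;
	if (conf->quiesce == 0) {
		/* optimistic activation without device_lock */
		atomic_inc(&conf->active_aligned_reads);
		did_inc = true;
	}
	/* Pairs with smp_store_release(&conf->quiesce, 2) in
	 * raid5_quiesce().  Intended guarantee: either this re-check
	 * observes the quiesce that just started, or the increment above
	 * is observed by raid5_quiesce()'s drain wait, so an ongoing
	 * quiesce is never missed.
	 */
	if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) {
		/* slow path: undo the activation (waking a possible
		 * waiter), then take device_lock and wait for the
		 * quiesce to finish, exactly as before.
		 */
	}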

My setups:
1. 8 local nvme drives (each up to 250k iops).
2. 8 ram disks (brd).

Each setup runs raid6 (6+2) with 1024 io threads on a 96-core (48 per
socket) system. Iops and the cpu time spent on this contention were
recorded with random 4k reads; bw was recorded with sequential 128k
reads.  Note: in most cases the cpu is still busy, but on "new"
bottlenecks.

nvme:
              | iops           | cpu  | bw
-----------------------------------------------
without patch | 1.6M           | ~50% | 5.5GB/s
with patch    | 2M (throttled) | 0%   | 16GB/s (throttled)

ram (brd):
              | iops           | cpu  | bw
-----------------------------------------------
without patch | 2M             | ~80% | 24GB/s
with patch    | 4M             | 0%   | 55GB/s

CC: Song Liu <song@kernel.org>
CC: Neil Brown <neilb@suse.de>
Signed-off-by: Gal Ofri <gal.ofri@storing.io>
---
* tested with kcsan & lockdep; no new errors.
* tested mixed io (70% read) with data verification (vdbench -v)
while repeatedly changing group_thread_cnt (enter/exit quiesce).
* thank you Neil for guiding me through this patch.
---
 drivers/md/raid5.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7d4ff8a5c55e..f64259f044fd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5396,6 +5396,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 	struct md_rdev *rdev;
 	sector_t sector, end_sector, first_bad;
 	int bad_sectors, dd_idx;
+	bool did_inc;
 
 	if (!in_chunk_boundary(mddev, raid_bio)) {
 		pr_debug("%s: non aligned\n", __func__);
@@ -5443,11 +5444,24 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
 	/* No reshape active, so we can trust rdev->data_offset */
 	align_bio->bi_iter.bi_sector += rdev->data_offset;
 
-	spin_lock_irq(&conf->device_lock);
-	wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
-			    conf->device_lock);
-	atomic_inc(&conf->active_aligned_reads);
-	spin_unlock_irq(&conf->device_lock);
+	did_inc = false;
+	if (conf->quiesce == 0) {
+		atomic_inc(&conf->active_aligned_reads);
+		did_inc = true;
+	}
+	/* need a memory barrier to detect the race with raid5_quiesce() */
+	if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) {
+		/* quiesce is in progress, so we need to undo io activation and wait
+		 * for it to finish
+		 */
+		if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads))
+			wake_up(&conf->wait_for_quiescent);
+		spin_lock_irq(&conf->device_lock);
+		wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
+				    conf->device_lock);
+		atomic_inc(&conf->active_aligned_reads);
+		spin_unlock_irq(&conf->device_lock);
+	}
 
 	if (mddev->gendisk)
 		trace_block_bio_remap(align_bio, disk_devt(mddev->gendisk),
@@ -8334,7 +8348,10 @@ static void raid5_quiesce(struct mddev *mddev, int quiesce)
 		 * active stripes can drain
 		 */
 		r5c_flush_cache(conf, INT_MAX);
-		conf->quiesce = 2;
+		/* need a memory barrier to make sure read_one_chunk() sees
+		 * quiesce started and reverts to slow (locked) path.
+		 */
+		smp_store_release(&conf->quiesce, 2);
 		wait_event_cmd(conf->wait_for_quiescent,
 				    atomic_read(&conf->active_stripes) == 0 &&
 				    atomic_read(&conf->active_aligned_reads) == 0,
-- 
2.25.1


* Re: [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
  2021-06-07 11:07 ` [PATCH v2] md/raid5: avoid device_lock in read_one_chunk() gal.ofri
@ 2021-06-07 11:22   ` NeilBrown
  2021-06-09  5:47     ` Song Liu
  0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2021-06-07 11:22 UTC (permalink / raw)
  To: gal.ofri; +Cc: linux-raid, Gal Ofri, Song Liu

On Mon, 07 Jun 2021, gal.ofri@storing.io wrote:
> From: Gal Ofri <gal.ofri@storing.io>
> 
[...]
> CC: Song Liu <song@kernel.org>
> CC: Neil Brown <neilb@suse.de>
> Signed-off-by: Gal Ofri <gal.ofri@storing.io>

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks for this!

NeilBrown


* Re: [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
  2021-06-07 11:22   ` NeilBrown
@ 2021-06-09  5:47     ` Song Liu
  0 siblings, 0 replies; 3+ messages in thread
From: Song Liu @ 2021-06-09  5:47 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, Gal Ofri

On Mon, Jun 7, 2021 at 4:22 AM NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 07 Jun 2021, gal.ofri@storing.io wrote:
[...]
> > -----------------------------------------------
> > without patch | 2M             | ~80% | 24GB/s
> > with patch    | 4M             | 0%   | 55GB/s
> >
> > CC: Song Liu <song@kernel.org>
> > CC: Neil Brown <neilb@suse.de>
> > Signed-off-by: Gal Ofri <gal.ofri@storing.io>
>
> Reviewed-by: NeilBrown <neilb@suse.de>

Applied to md-next. Thanks!

Song
