From: "NeilBrown" <neilb@suse.de>
To: gal.ofri@storing.io
Cc: linux-raid@vger.kernel.org, "Gal Ofri" <gal.ofri@storing.io>,
	"Song Liu" <song@kernel.org>
Subject: Re: [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
Date: Mon, 07 Jun 2021 21:22:52 +1000	[thread overview]
Message-ID: <162306497207.32569.4821556528932781303@noble.neil.brown.name> (raw)
In-Reply-To: <20210607110702.660443-1-gal.ofri@storing.io>

On Mon, 07 Jun 2021, gal.ofri@storing.io wrote:
> From: Gal Ofri <gal.ofri@storing.io>
> 
> There is a lock contention on device_lock in read_one_chunk().
> device_lock is taken to sync conf->active_aligned_reads and
> conf->quiesce.
> read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
> before incrementing active_aligned_reads.
> raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
> for active_aligned_reads to be zero before setting quiesce=1
> (suspended).
> 
> Introduce a fast (lockless) path in read_one_chunk(): activate aligned
> read without taking device_lock.  In case quiesce starts while
> activating the aligned-read in fast path, deactivate it and revert to
> old behavior (take device_lock and wait for quiesce to finish).
> 
> Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to
> guarantee that read_one_chunk() does not miss an ongoing quiesce.
> 
> My setups:
> 1. 8 local nvme drives (each up to 250k iops).
> 2. 8 ram disks (brd).
> 
> Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per
> socket) system. Record both iops and cpu spent on this contention with
> rand-read-4k. Record bw with sequential-read-128k.  Note: in most cases
> cpu is still busy but due to "new" bottlenecks.
> 
> nvme:
>               | iops           | cpu  | bw
> -----------------------------------------------
> without patch | 1.6M           | ~50% | 5.5GB/s
> with patch    | 2M (throttled) | 0%   | 16GB/s (throttled)
> 
> ram (brd):
>               | iops           | cpu  | bw
> -----------------------------------------------
> without patch | 2M             | ~80% | 24GB/s
> with patch    | 4M             | 0%   | 55GB/s
> 
> CC: Song Liu <song@kernel.org>
> CC: Neil Brown <neilb@suse.de>
> Signed-off-by: Gal Ofri <gal.ofri@storing.io>

Reviewed-by: NeilBrown <neilb@suse.de>

Thanks for this!

NeilBrown


> ---
> * tested with kcsan & lockdep; no new errors.
> * tested mixed io (70% read) with data verification (vdbench -v)
> while repeatedly changing group_thread_cnt (enter/exit quiesce).
> * thank you Neil for guiding me through this patch.
> ---
>  drivers/md/raid5.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7d4ff8a5c55e..f64259f044fd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5396,6 +5396,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>  	struct md_rdev *rdev;
>  	sector_t sector, end_sector, first_bad;
>  	int bad_sectors, dd_idx;
> +	bool did_inc;
>  
>  	if (!in_chunk_boundary(mddev, raid_bio)) {
>  		pr_debug("%s: non aligned\n", __func__);
> @@ -5443,11 +5444,24 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
>  	/* No reshape active, so we can trust rdev->data_offset */
>  	align_bio->bi_iter.bi_sector += rdev->data_offset;
>  
> -	spin_lock_irq(&conf->device_lock);
> -	wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
> -			    conf->device_lock);
> -	atomic_inc(&conf->active_aligned_reads);
> -	spin_unlock_irq(&conf->device_lock);
> +	did_inc = false;
> +	if (conf->quiesce == 0) {
> +		atomic_inc(&conf->active_aligned_reads);
> +		did_inc = true;
> +	}
> +	/* need a memory barrier to detect the race with raid5_quiesce() */
> +	if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) {
> +		/* quiesce is in progress, so we need to undo io activation and wait
> +		 * for it to finish
> +		 */
> +		if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads))
> +			wake_up(&conf->wait_for_quiescent);
> +		spin_lock_irq(&conf->device_lock);
> +		wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
> +				    conf->device_lock);
> +		atomic_inc(&conf->active_aligned_reads);
> +		spin_unlock_irq(&conf->device_lock);
> +	}
>  
>  	if (mddev->gendisk)
>  		trace_block_bio_remap(align_bio, disk_devt(mddev->gendisk),
> @@ -8334,7 +8348,10 @@ static void raid5_quiesce(struct mddev *mddev, int quiesce)
>  		 * active stripes can drain
>  		 */
>  		r5c_flush_cache(conf, INT_MAX);
> -		conf->quiesce = 2;
> +		/* need a memory barrier to make sure read_one_chunk() sees
> +		 * quiesce started and reverts to slow (locked) path.
> +		 */
> +		smp_store_release(&conf->quiesce, 2);
>  		wait_event_cmd(conf->wait_for_quiescent,
>  				    atomic_read(&conf->active_stripes) == 0 &&
>  				    atomic_read(&conf->active_aligned_reads) == 0,
> -- 
> 2.25.1
> 
> 

Thread overview: 3+ messages
     [not found] <162302508816.16225.936948442459930625@noble.neil.brown.name>
2021-06-07 11:07 ` [PATCH v2] md/raid5: avoid device_lock in read_one_chunk() gal.ofri
2021-06-07 11:22   ` NeilBrown [this message]
2021-06-09  5:47     ` Song Liu
