* [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
[not found] <162302508816.16225.936948442459930625@noble.neil.brown.name>
@ 2021-06-07 11:07 ` gal.ofri
2021-06-07 11:22 ` NeilBrown
0 siblings, 1 reply; 3+ messages in thread
From: gal.ofri @ 2021-06-07 11:07 UTC (permalink / raw)
To: linux-raid; +Cc: Gal Ofri, Song Liu, Neil Brown
From: Gal Ofri <gal.ofri@storing.io>
There is a lock contention on device_lock in read_one_chunk().
device_lock is taken to sync conf->active_aligned_reads and
conf->quiesce.
read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
before incrementing active_aligned_reads.
raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
for active_aligned_reads to be zero before setting quiesce=1
(suspended).
Introduce a fast (lockless) path in read_one_chunk(): activate aligned
read without taking device_lock. In case quiesce starts while
activating the aligned-read in fast path, deactivate it and revert to
old behavior (take device_lock and wait for quiesce to finish).
Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to
gaurantee that read_one_chunk() does not miss an ongoing quiesce.
My setups:
1. 8 local nvme drives (each up to 250k iops).
2. 8 ram disks (brd).
Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per
socket) system. Record both iops and cpu spent on this contention with
rand-read-4k. Record bw with sequential-read-128k. Note: in most cases
cpu is still busy but due to "new" bottlenecks.
nvme:
| iops | cpu | bw
-----------------------------------------------
without patch | 1.6M | ~50% | 5.5GB/s
with patch | 2M (throttled) | 0% | 16GB/s (throttled)
ram (brd):
| iops | cpu | bw
-----------------------------------------------
without patch | 2M | ~80% | 24GB/s
with patch | 4M | 0% | 55GB/s
CC: Song Liu <song@kernel.org>
CC: Neil Brown <neilb@suse.de>
Signed-off-by: Gal Ofri <gal.ofri@storing.io>
---
* tested with kcsan & lockdep; no new errors.
* tested mixed io (70% read) with data verification (vdbench -v)
while repeatedly changing group_thread_cnt (enter/exit quiesce).
* thank you Neil for guiding me through this patch.
---
drivers/md/raid5.c | 29 +++++++++++++++++++++++------
1 file changed, 23 insertions(+), 6 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 7d4ff8a5c55e..f64259f044fd 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -5396,6 +5396,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
struct md_rdev *rdev;
sector_t sector, end_sector, first_bad;
int bad_sectors, dd_idx;
+ bool did_inc;
if (!in_chunk_boundary(mddev, raid_bio)) {
pr_debug("%s: non aligned\n", __func__);
@@ -5443,11 +5444,24 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
/* No reshape active, so we can trust rdev->data_offset */
align_bio->bi_iter.bi_sector += rdev->data_offset;
- spin_lock_irq(&conf->device_lock);
- wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
- conf->device_lock);
- atomic_inc(&conf->active_aligned_reads);
- spin_unlock_irq(&conf->device_lock);
+ did_inc = false;
+ if (conf->quiesce == 0) {
+ atomic_inc(&conf->active_aligned_reads);
+ did_inc = true;
+ }
+ /* need a memory barrier to detect the race with raid5_quiesce() */
+ if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) {
+ /* quiesce is in progress, so we need to undo io activation and wait
+ * for it to finish
+ */
+ if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads))
+ wake_up(&conf->wait_for_quiescent);
+ spin_lock_irq(&conf->device_lock);
+ wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
+ conf->device_lock);
+ atomic_inc(&conf->active_aligned_reads);
+ spin_unlock_irq(&conf->device_lock);
+ }
if (mddev->gendisk)
trace_block_bio_remap(align_bio, disk_devt(mddev->gendisk),
@@ -8334,7 +8348,10 @@ static void raid5_quiesce(struct mddev *mddev, int quiesce)
* active stripes can drain
*/
r5c_flush_cache(conf, INT_MAX);
- conf->quiesce = 2;
+ /* need a memory barrier to make sure read_one_chunk() sees
+ * quiesce started and reverts to slow (locked) path.
+ */
+ smp_store_release(&conf->quiesce, 2);
wait_event_cmd(conf->wait_for_quiescent,
atomic_read(&conf->active_stripes) == 0 &&
atomic_read(&conf->active_aligned_reads) == 0,
--
2.25.1
^ permalink raw reply related [flat|nested] 3+ messages in thread
* Re: [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
2021-06-07 11:07 ` [PATCH v2] md/raid5: avoid device_lock in read_one_chunk() gal.ofri
@ 2021-06-07 11:22 ` NeilBrown
2021-06-09 5:47 ` Song Liu
0 siblings, 1 reply; 3+ messages in thread
From: NeilBrown @ 2021-06-07 11:22 UTC (permalink / raw)
To: gal.ofri; +Cc: linux-raid, Gal Ofri, Song Liu
On Mon, 07 Jun 2021, gal.ofri@storing.io wrote:
> From: Gal Ofri <gal.ofri@storing.io>
>
> There is a lock contention on device_lock in read_one_chunk().
> device_lock is taken to sync conf->active_aligned_reads and
> conf->quiesce.
> read_one_chunk() takes the lock, then waits for quiesce=0 (resumed)
> before incrementing active_aligned_reads.
> raid5_quiesce() takes the lock, sets quiesce=2 (in-progress), then waits
> for active_aligned_reads to be zero before setting quiesce=1
> (suspended).
>
> Introduce a fast (lockless) path in read_one_chunk(): activate aligned
> read without taking device_lock. In case quiesce starts while
> activating the aligned-read in fast path, deactivate it and revert to
> old behavior (take device_lock and wait for quiesce to finish).
>
> Add smp store/load in raid5_quiesce()/read_one_chunk() respectively to
> gaurantee that read_one_chunk() does not miss an ongoing quiesce.
>
> My setups:
> 1. 8 local nvme drives (each up to 250k iops).
> 2. 8 ram disks (brd).
>
> Each setup with raid6 (6+2), 1024 io threads on a 96 cpu-cores (48 per
> socket) system. Record both iops and cpu spent on this contention with
> rand-read-4k. Record bw with sequential-read-128k. Note: in most cases
> cpu is still busy but due to "new" bottlenecks.
>
> nvme:
> | iops | cpu | bw
> -----------------------------------------------
> without patch | 1.6M | ~50% | 5.5GB/s
> with patch | 2M (throttled) | 0% | 16GB/s (throttled)
>
> ram (brd):
> | iops | cpu | bw
> -----------------------------------------------
> without patch | 2M | ~80% | 24GB/s
> with patch | 4M | 0% | 55GB/s
>
> CC: Song Liu <song@kernel.org>
> CC: Neil Brown <neilb@suse.de>
> Signed-off-by: Gal Ofri <gal.ofri@storing.io>
Reviewed-by: NeilBrown <neilb@suse.de>
Thanks for this!
NeilBrown
> ---
> * tested with kcsan & lockdep; no new errors.
> * tested mixed io (70% read) with data verification (vdbench -v)
> while repeatedly changing group_thread_cnt (enter/exit quiesce).
> * thank you Neil for guiding me through this patch.
> ---
> drivers/md/raid5.c | 29 +++++++++++++++++++++++------
> 1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 7d4ff8a5c55e..f64259f044fd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -5396,6 +5396,7 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
> struct md_rdev *rdev;
> sector_t sector, end_sector, first_bad;
> int bad_sectors, dd_idx;
> + bool did_inc;
>
> if (!in_chunk_boundary(mddev, raid_bio)) {
> pr_debug("%s: non aligned\n", __func__);
> @@ -5443,11 +5444,24 @@ static int raid5_read_one_chunk(struct mddev *mddev, struct bio *raid_bio)
> /* No reshape active, so we can trust rdev->data_offset */
> align_bio->bi_iter.bi_sector += rdev->data_offset;
>
> - spin_lock_irq(&conf->device_lock);
> - wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
> - conf->device_lock);
> - atomic_inc(&conf->active_aligned_reads);
> - spin_unlock_irq(&conf->device_lock);
> + did_inc = false;
> + if (conf->quiesce == 0) {
> + atomic_inc(&conf->active_aligned_reads);
> + did_inc = true;
> + }
> + /* need a memory barrier to detect the race with raid5_quiesce() */
> + if (!did_inc || smp_load_acquire(&conf->quiesce) != 0) {
> + /* quiesce is in progress, so we need to undo io activation and wait
> + * for it to finish
> + */
> + if (did_inc && atomic_dec_and_test(&conf->active_aligned_reads))
> + wake_up(&conf->wait_for_quiescent);
> + spin_lock_irq(&conf->device_lock);
> + wait_event_lock_irq(conf->wait_for_quiescent, conf->quiesce == 0,
> + conf->device_lock);
> + atomic_inc(&conf->active_aligned_reads);
> + spin_unlock_irq(&conf->device_lock);
> + }
>
> if (mddev->gendisk)
> trace_block_bio_remap(align_bio, disk_devt(mddev->gendisk),
> @@ -8334,7 +8348,10 @@ static void raid5_quiesce(struct mddev *mddev, int quiesce)
> * active stripes can drain
> */
> r5c_flush_cache(conf, INT_MAX);
> - conf->quiesce = 2;
> + /* need a memory barrier to make sure read_one_chunk() sees
> + * quiesce started and reverts to slow (locked) path.
> + */
> + smp_store_release(&conf->quiesce, 2);
> wait_event_cmd(conf->wait_for_quiescent,
> atomic_read(&conf->active_stripes) == 0 &&
> atomic_read(&conf->active_aligned_reads) == 0,
> --
> 2.25.1
>
>
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [PATCH v2] md/raid5: avoid device_lock in read_one_chunk()
2021-06-07 11:22 ` NeilBrown
@ 2021-06-09 5:47 ` Song Liu
0 siblings, 0 replies; 3+ messages in thread
From: Song Liu @ 2021-06-09 5:47 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, Gal Ofri
On Mon, Jun 7, 2021 at 4:22 AM NeilBrown <neilb@suse.de> wrote:
>
> On Mon, 07 Jun 2021, gal.ofri@storing.io wrote:
[...]
> > -----------------------------------------------
> > without patch | 2M | ~80% | 24GB/s
> > with patch | 4M | 0% | 55GB/s
> >
> > CC: Song Liu <song@kernel.org>
> > CC: Neil Brown <neilb@suse.de>
> > Signed-off-by: Gal Ofri <gal.ofri@storing.io>
>
> Reviewed-by: NeilBrown <neilb@suse.de>
Applied to md-next. Thanks!
Song
^ permalink raw reply [flat|nested] 3+ messages in thread
end of thread, other threads:[~2021-06-09 5:48 UTC | newest]
Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <162302508816.16225.936948442459930625@noble.neil.brown.name>
2021-06-07 11:07 ` [PATCH v2] md/raid5: avoid device_lock in read_one_chunk() gal.ofri
2021-06-07 11:22 ` NeilBrown
2021-06-09 5:47 ` Song Liu
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.