All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH -next 0/3] md/raid10: reduce lock contention for io
@ 2022-08-29 13:14 Yu Kuai
  2022-08-29 13:15 ` [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier() Yu Kuai
                   ` (5 more replies)
  0 siblings, 6 replies; 28+ messages in thread
From: Yu Kuai @ 2022-08-29 13:14 UTC (permalink / raw)
  To: song; +Cc: linux-raid, linux-kernel, yukuai3, yukuai1, yi.zhang

From: Yu Kuai <yukuai3@huawei.com>

patch 1 is a small problem found by code review.
patch 2 avoid holding resync_lock in fast path.
patch 3 avoid holding lock in wake_up() in fast path.

Test environment:

Architecture: aarch64
Cpu: Huawei KUNPENG 920, there are four numa nodes

Raid10 initialize:
mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

Test cmd:
fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread

Test result:
before this patchset:	2.9 GiB/s
after this patchset:	6.6 Gib/s

Please noted that in kunpeng-920, memory access latency is very bad
accross nodes compare to local node, and in other architecture
performance improvement might not be significant.

Yu Kuai (3):
  md/raid10: fix improper BUG_ON() in raise_barrier()
  md/raid10: convert resync_lock to use seqlock
  md/raid10: prevent unnecessary calls to wake_up() in fast path

 drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
 drivers/md/raid10.h |  2 +-
 2 files changed, 59 insertions(+), 31 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier()
  2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
@ 2022-08-29 13:15 ` Yu Kuai
  2022-08-29 19:53   ` John Stoffel
  2022-08-29 13:15 ` [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock Yu Kuai
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-08-29 13:15 UTC (permalink / raw)
  To: song; +Cc: linux-raid, linux-kernel, yukuai3, yukuai1, yi.zhang

From: Yu Kuai <yukuai3@huawei.com>

'conf->barrier' is protected by 'conf->resync_lock', reading
'conf->barrier' without holding the lock is wrong.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 9117fcdee1be..b70c207f7932 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -930,8 +930,8 @@ static void flush_pending_writes(struct r10conf *conf)
 
 static void raise_barrier(struct r10conf *conf, int force)
 {
-	BUG_ON(force && !conf->barrier);
 	spin_lock_irq(&conf->resync_lock);
+	BUG_ON(force && !conf->barrier);
 
 	/* Wait until no block IO is waiting (unless 'force') */
 	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
  2022-08-29 13:15 ` [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier() Yu Kuai
@ 2022-08-29 13:15 ` Yu Kuai
  2022-09-01 18:41   ` Logan Gunthorpe
  2022-09-02  9:42   ` Guoqing Jiang
  2022-08-29 13:15 ` [PATCH -next 3/3] md/raid10: prevent unnecessary calls to wake_up() in fast path Yu Kuai
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 28+ messages in thread
From: Yu Kuai @ 2022-08-29 13:15 UTC (permalink / raw)
  To: song; +Cc: linux-raid, linux-kernel, yukuai3, yukuai1, yi.zhang

From: Yu Kuai <yukuai3@huawei.com>

Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
and io can't be dispatched until 'barrier' is dropped.

Since holding the 'barrier' is not common, convert 'resync_lock' to use
seqlock so that holding lock can be avoided in fast path.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 62 ++++++++++++++++++++++++++++++---------------
 drivers/md/raid10.h |  2 +-
 2 files changed, 43 insertions(+), 21 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index b70c207f7932..086216b051f5 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -930,38 +930,60 @@ static void flush_pending_writes(struct r10conf *conf)
 
 static void raise_barrier(struct r10conf *conf, int force)
 {
-	spin_lock_irq(&conf->resync_lock);
+	write_seqlock_irq(&conf->resync_lock);
 	BUG_ON(force && !conf->barrier);
 
 	/* Wait until no block IO is waiting (unless 'force') */
 	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
-			    conf->resync_lock);
+			    conf->resync_lock.lock);
 
 	/* block any new IO from starting */
-	conf->barrier++;
+	WRITE_ONCE(conf->barrier, conf->barrier + 1);
 
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !atomic_read(&conf->nr_pending) && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock);
+			    conf->resync_lock.lock);
 
-	spin_unlock_irq(&conf->resync_lock);
+	write_sequnlock_irq(&conf->resync_lock);
 }
 
 static void lower_barrier(struct r10conf *conf)
 {
 	unsigned long flags;
-	spin_lock_irqsave(&conf->resync_lock, flags);
-	conf->barrier--;
-	spin_unlock_irqrestore(&conf->resync_lock, flags);
+
+	write_seqlock_irqsave(&conf->resync_lock, flags);
+	WRITE_ONCE(conf->barrier, conf->barrier - 1);
+	write_sequnlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
 
+static bool wait_barrier_nolock(struct r10conf *conf)
+{
+	unsigned int seq = raw_read_seqcount(&conf->resync_lock.seqcount);
+
+	if (seq & 1)
+		return false;
+
+	if (READ_ONCE(conf->barrier))
+		return false;
+
+	atomic_inc(&conf->nr_pending);
+	if (!read_seqcount_retry(&conf->resync_lock.seqcount, seq))
+		return true;
+
+	atomic_dec(&conf->nr_pending);
+	return false;
+}
+
 static bool wait_barrier(struct r10conf *conf, bool nowait)
 {
 	bool ret = true;
 
-	spin_lock_irq(&conf->resync_lock);
+	if (wait_barrier_nolock(conf))
+		return true;
+
+	write_seqlock_irq(&conf->resync_lock);
 	if (conf->barrier) {
 		struct bio_list *bio_list = current->bio_list;
 		conf->nr_waiting++;
@@ -992,7 +1014,7 @@ static bool wait_barrier(struct r10conf *conf, bool nowait)
 					      test_bit(MD_RECOVERY_RUNNING,
 						       &conf->mddev->recovery) &&
 					      conf->nr_queued > 0),
-					    conf->resync_lock);
+					    conf->resync_lock.lock);
 		}
 		conf->nr_waiting--;
 		if (!conf->nr_waiting)
@@ -1001,7 +1023,7 @@ static bool wait_barrier(struct r10conf *conf, bool nowait)
 	/* Only increment nr_pending when we wait */
 	if (ret)
 		atomic_inc(&conf->nr_pending);
-	spin_unlock_irq(&conf->resync_lock);
+	write_sequnlock_irq(&conf->resync_lock);
 	return ret;
 }
 
@@ -1026,27 +1048,27 @@ static void freeze_array(struct r10conf *conf, int extra)
 	 * must match the number of pending IOs (nr_pending) before
 	 * we continue.
 	 */
-	spin_lock_irq(&conf->resync_lock);
+	write_seqlock_irq(&conf->resync_lock);
 	conf->array_freeze_pending++;
-	conf->barrier++;
+	WRITE_ONCE(conf->barrier, conf->barrier + 1);
 	conf->nr_waiting++;
 	wait_event_lock_irq_cmd(conf->wait_barrier,
 				atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
-				conf->resync_lock,
+				conf->resync_lock.lock,
 				flush_pending_writes(conf));
 
 	conf->array_freeze_pending--;
-	spin_unlock_irq(&conf->resync_lock);
+	write_sequnlock_irq(&conf->resync_lock);
 }
 
 static void unfreeze_array(struct r10conf *conf)
 {
 	/* reverse the effect of the freeze */
-	spin_lock_irq(&conf->resync_lock);
-	conf->barrier--;
+	write_seqlock_irq(&conf->resync_lock);
+	WRITE_ONCE(conf->barrier, conf->barrier - 1);
 	conf->nr_waiting--;
 	wake_up(&conf->wait_barrier);
-	spin_unlock_irq(&conf->resync_lock);
+	write_sequnlock_irq(&conf->resync_lock);
 }
 
 static sector_t choose_data_offset(struct r10bio *r10_bio,
@@ -4033,7 +4055,7 @@ static struct r10conf *setup_conf(struct mddev *mddev)
 	INIT_LIST_HEAD(&conf->retry_list);
 	INIT_LIST_HEAD(&conf->bio_end_io_list);
 
-	spin_lock_init(&conf->resync_lock);
+	seqlock_init(&conf->resync_lock);
 	init_waitqueue_head(&conf->wait_barrier);
 	atomic_set(&conf->nr_pending, 0);
 
@@ -4352,7 +4374,7 @@ static void *raid10_takeover_raid0(struct mddev *mddev, sector_t size, int devs)
 				rdev->new_raid_disk = rdev->raid_disk * 2;
 				rdev->sectors = size;
 			}
-		conf->barrier = 1;
+		WRITE_ONCE(conf->barrier, 1);
 	}
 
 	return conf;
diff --git a/drivers/md/raid10.h b/drivers/md/raid10.h
index 5c0804d8bb1f..8c072ce0bc54 100644
--- a/drivers/md/raid10.h
+++ b/drivers/md/raid10.h
@@ -76,7 +76,7 @@ struct r10conf {
 	/* queue pending writes and submit them on unplug */
 	struct bio_list		pending_bio_list;
 
-	spinlock_t		resync_lock;
+	seqlock_t		resync_lock;
 	atomic_t		nr_pending;
 	int			nr_waiting;
 	int			nr_queued;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH -next 3/3] md/raid10: prevent unnecessary calls to wake_up() in fast path
  2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
  2022-08-29 13:15 ` [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier() Yu Kuai
  2022-08-29 13:15 ` [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock Yu Kuai
@ 2022-08-29 13:15 ` Yu Kuai
  2022-08-29 13:40 ` [PATCH -next 0/3] md/raid10: reduce lock contention for io Guoqing Jiang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-08-29 13:15 UTC (permalink / raw)
  To: song; +Cc: linux-raid, linux-kernel, yukuai3, yukuai1, yi.zhang

From: Yu Kuai <yukuai3@huawei.com>

Currently, wake_up() is called unconditionally in fast path such as
raid10_make_request(), which will cause lock contention under high
concurrency:

raid10_make_request
 wake_up
  __wake_up_common_lock
   spin_lock_irqsave

Improve performance by only call wake_up() if waitqueue is not empty.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
---
 drivers/md/raid10.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 086216b051f5..2f7c8bef6dc2 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -274,6 +274,12 @@ static void put_buf(struct r10bio *r10_bio)
 	lower_barrier(conf);
 }
 
+static void wake_up_barrier(struct r10conf *conf)
+{
+	if (wq_has_sleeper(&conf->wait_barrier))
+		wake_up(&conf->wait_barrier);
+}
+
 static void reschedule_retry(struct r10bio *r10_bio)
 {
 	unsigned long flags;
@@ -286,7 +292,7 @@ static void reschedule_retry(struct r10bio *r10_bio)
 	spin_unlock_irqrestore(&conf->device_lock, flags);
 
 	/* wake up frozen array... */
-	wake_up(&conf->wait_barrier);
+	wake_up_barrier(conf);
 
 	md_wakeup_thread(mddev->thread);
 }
@@ -884,7 +890,7 @@ static void flush_pending_writes(struct r10conf *conf)
 		/* flush any pending bitmap writes to disk
 		 * before proceeding w/ I/O */
 		md_bitmap_unplug(conf->mddev->bitmap);
-		wake_up(&conf->wait_barrier);
+		wake_up_barrier(conf);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -955,7 +961,7 @@ static void lower_barrier(struct r10conf *conf)
 	write_seqlock_irqsave(&conf->resync_lock, flags);
 	WRITE_ONCE(conf->barrier, conf->barrier - 1);
 	write_sequnlock_irqrestore(&conf->resync_lock, flags);
-	wake_up(&conf->wait_barrier);
+	wake_up_barrier(conf);
 }
 
 static bool wait_barrier_nolock(struct r10conf *conf)
@@ -1018,7 +1024,7 @@ static bool wait_barrier(struct r10conf *conf, bool nowait)
 		}
 		conf->nr_waiting--;
 		if (!conf->nr_waiting)
-			wake_up(&conf->wait_barrier);
+			wake_up_barrier(conf);
 	}
 	/* Only increment nr_pending when we wait */
 	if (ret)
@@ -1031,7 +1037,7 @@ static void allow_barrier(struct r10conf *conf)
 {
 	if ((atomic_dec_and_test(&conf->nr_pending)) ||
 			(conf->array_freeze_pending))
-		wake_up(&conf->wait_barrier);
+		wake_up_barrier(conf);
 }
 
 static void freeze_array(struct r10conf *conf, int extra)
@@ -1067,7 +1073,7 @@ static void unfreeze_array(struct r10conf *conf)
 	write_seqlock_irq(&conf->resync_lock);
 	WRITE_ONCE(conf->barrier, conf->barrier - 1);
 	conf->nr_waiting--;
-	wake_up(&conf->wait_barrier);
+	wake_up_barrier(conf);
 	write_sequnlock_irq(&conf->resync_lock);
 }
 
@@ -1092,7 +1098,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 		spin_lock_irq(&conf->device_lock);
 		bio_list_merge(&conf->pending_bio_list, &plug->pending);
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_barrier);
+		wake_up_barrier(conf);
 		md_wakeup_thread(mddev->thread);
 		kfree(plug);
 		return;
@@ -1101,7 +1107,7 @@ static void raid10_unplug(struct blk_plug_cb *cb, bool from_schedule)
 	/* we aren't scheduling, so we can do the write-out directly. */
 	bio = bio_list_get(&plug->pending);
 	md_bitmap_unplug(mddev->bitmap);
-	wake_up(&conf->wait_barrier);
+	wake_up_barrier(conf);
 
 	while (bio) { /* submit pending writes */
 		struct bio *next = bio->bi_next;
@@ -1907,7 +1913,7 @@ static bool raid10_make_request(struct mddev *mddev, struct bio *bio)
 	__make_request(mddev, bio, sectors);
 
 	/* In case raid10d snuck in to freeze_array */
-	wake_up(&conf->wait_barrier);
+	wake_up_barrier(conf);
 	return true;
 }
 
@@ -3055,7 +3061,7 @@ static void handle_write_completed(struct r10conf *conf, struct r10bio *r10_bio)
 			 * In case freeze_array() is waiting for condition
 			 * nr_pending == nr_queued + extra to be true.
 			 */
-			wake_up(&conf->wait_barrier);
+			wake_up_barrier(conf);
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
 			if (test_bit(R10BIO_WriteError,
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
                   ` (2 preceding siblings ...)
  2022-08-29 13:15 ` [PATCH -next 3/3] md/raid10: prevent unnecessary calls to wake_up() in fast path Yu Kuai
@ 2022-08-29 13:40 ` Guoqing Jiang
  2022-08-31 11:55   ` Yu Kuai
  2022-08-29 13:58 ` Paul Menzel
  2022-08-31 18:00 ` Song Liu
  5 siblings, 1 reply; 28+ messages in thread
From: Guoqing Jiang @ 2022-08-29 13:40 UTC (permalink / raw)
  To: Yu Kuai, song; +Cc: linux-raid, linux-kernel, yukuai3, yi.zhang



On 8/29/22 9:14 PM, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
>
> patch 1 is a small problem found by code review.
> patch 2 avoid holding resync_lock in fast path.
> patch 3 avoid holding lock in wake_up() in fast path.
>
> Test environment:
>
> Architecture: aarch64
> Cpu: Huawei KUNPENG 920, there are four numa nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset:	2.9 GiB/s
> after this patchset:	6.6 Gib/s

Impressive! Pls try mdadm test suites too to avoid regression.

> Please noted that in kunpeng-920, memory access latency is very bad
> accross nodes compare to local node, and in other architecture
> performance improvement might not be significant.

By any chance can someone try with x64?

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
                   ` (3 preceding siblings ...)
  2022-08-29 13:40 ` [PATCH -next 0/3] md/raid10: reduce lock contention for io Guoqing Jiang
@ 2022-08-29 13:58 ` Paul Menzel
  2022-08-30  1:09   ` Yu Kuai
  2022-08-31 18:00 ` Song Liu
  5 siblings, 1 reply; 28+ messages in thread
From: Paul Menzel @ 2022-08-29 13:58 UTC (permalink / raw)
  To: Yu Kuai; +Cc: song, linux-raid, linux-kernel, yukuai3, yi.zhang

Dear Yu,


Thank you for your patches.

Am 29.08.22 um 15:14 schrieb Yu Kuai:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> patch 1 is a small problem found by code review.
> patch 2 avoid holding resync_lock in fast path.
> patch 3 avoid holding lock in wake_up() in fast path.
> 
> Test environment:
> 
> Architecture: aarch64
> Cpu: Huawei KUNPENG 920, there are four numa nodes
> 
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
> 
> Test result:
> before this patchset:	2.9 GiB/s
> after this patchset:	6.6 Gib/s

Could you please give more details about the test setup, like the drives 
used?

Did you use some tools like ftrace to figure out the bottleneck?

> Please noted that in kunpeng-920, memory access latency is very bad
> accross nodes compare to local node, and in other architecture
> performance improvement might not be significant.
> 
> Yu Kuai (3):
>    md/raid10: fix improper BUG_ON() in raise_barrier()
>    md/raid10: convert resync_lock to use seqlock
>    md/raid10: prevent unnecessary calls to wake_up() in fast path
> 
>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>   drivers/md/raid10.h |  2 +-
>   2 files changed, 59 insertions(+), 31 deletions(-)


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier()
  2022-08-29 13:15 ` [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier() Yu Kuai
@ 2022-08-29 19:53   ` John Stoffel
  2022-08-30  1:01     ` Yu Kuai
  2022-08-30  6:32     ` Paul Menzel
  0 siblings, 2 replies; 28+ messages in thread
From: John Stoffel @ 2022-08-29 19:53 UTC (permalink / raw)
  To: Yu Kuai; +Cc: song, linux-raid, linux-kernel, yukuai3, yi.zhang

>>>>> "Yu" == Yu Kuai <yukuai1@huaweicloud.com> writes:

Yu> From: Yu Kuai <yukuai3@huawei.com>
Yu> 'conf->barrier' is protected by 'conf->resync_lock', reading
Yu> 'conf->barrier' without holding the lock is wrong.

Yu> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Yu> ---
Yu>  drivers/md/raid10.c | 2 +-
Yu>  1 file changed, 1 insertion(+), 1 deletion(-)

Yu> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
Yu> index 9117fcdee1be..b70c207f7932 100644
Yu> --- a/drivers/md/raid10.c
Yu> +++ b/drivers/md/raid10.c
Yu> @@ -930,8 +930,8 @@ static void flush_pending_writes(struct r10conf *conf)
 
Yu>  static void raise_barrier(struct r10conf *conf, int force)
Yu>  {
Yu> -	BUG_ON(force && !conf->barrier);
Yu>  	spin_lock_irq(&conf->resync_lock);
Yu> +	BUG_ON(force && !conf->barrier);

I don't like this BUG_ON() at all, why are you crashing the system
here instead of just doing a simple WARN_ONCE() instead?  Is there
anything the user can do to get into this situation on their own, or
does it really signify a logic error in the code?  If so, why are you
killing the system?


 
Yu>  	/* Wait until no block IO is waiting (unless 'force') */
Yu>  	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
Yu> -- 
Yu> 2.31.1


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier()
  2022-08-29 19:53   ` John Stoffel
@ 2022-08-30  1:01     ` Yu Kuai
  2022-08-30  6:32     ` Paul Menzel
  1 sibling, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-08-30  1:01 UTC (permalink / raw)
  To: John Stoffel, Yu Kuai
  Cc: song, linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi, John

在 2022/08/30 3:53, John Stoffel 写道:
>>>>>> "Yu" == Yu Kuai <yukuai1@huaweicloud.com> writes:
> 
> Yu> From: Yu Kuai <yukuai3@huawei.com>
> Yu> 'conf->barrier' is protected by 'conf->resync_lock', reading
> Yu> 'conf->barrier' without holding the lock is wrong.
> 
> Yu> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Yu> ---
> Yu>  drivers/md/raid10.c | 2 +-
> Yu>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Yu> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> Yu> index 9117fcdee1be..b70c207f7932 100644
> Yu> --- a/drivers/md/raid10.c
> Yu> +++ b/drivers/md/raid10.c
> Yu> @@ -930,8 +930,8 @@ static void flush_pending_writes(struct r10conf *conf)
>   
> Yu>  static void raise_barrier(struct r10conf *conf, int force)
> Yu>  {
> Yu> -	BUG_ON(force && !conf->barrier);
> Yu>  	spin_lock_irq(&conf->resync_lock);
> Yu> +	BUG_ON(force && !conf->barrier);
> 
> I don't like this BUG_ON() at all, why are you crashing the system
> here instead of just doing a simple WARN_ONCE() instead?  Is there
> anything the user can do to get into this situation on their own, or
> does it really signify a logic error in the code?  If so, why are you
> killing the system?

I'm not sure why to use the BUG_ON() here. I just noticed that
'conf->barrier' is read without holding 'resync_lock', and BUG_ON() can
be triggered false positive.

Thanks,
Kuai
> 
> 
>   
> Yu>  	/* Wait until no block IO is waiting (unless 'force') */
> Yu>  	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
> Yu> --
> Yu> 2.31.1
> 
> 
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-29 13:58 ` Paul Menzel
@ 2022-08-30  1:09   ` Yu Kuai
  2022-08-31 11:59     ` Paul Menzel
  0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-08-30  1:09 UTC (permalink / raw)
  To: Paul Menzel, Yu Kuai; +Cc: song, linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi, Paul!

在 2022/08/29 21:58, Paul Menzel 写道:
> Dear Yu,
> 
> 
> Thank you for your patches.
> 
> Am 29.08.22 um 15:14 schrieb Yu Kuai:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> patch 1 is a small problem found by code review.
>> patch 2 avoid holding resync_lock in fast path.
>> patch 3 avoid holding lock in wake_up() in fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> Cpu: Huawei KUNPENG 920, there are four numa nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset:    2.9 GiB/s
>> after this patchset:    6.6 Gib/s
> 
> Could you please give more details about the test setup, like the drives 
> used?

test setup is described above, four nvme disks is used.
> 
> Did you use some tools like ftrace to figure out the bottleneck?

Yes, I'm sure the bottleneck is spin_lock(), specifically threads from
multiple nodes try to grab the same lock. By the way, if I bind the
threads to the same node, performance can also improve to 6.6 Gib/s
without this patchset.

Thanks,
Kuai
> 
>> Please noted that in kunpeng-920, memory access latency is very bad
>> accross nodes compare to local node, and in other architecture
>> performance improvement might not be significant.
>>
>> Yu Kuai (3):
>>    md/raid10: fix improper BUG_ON() in raise_barrier()
>>    md/raid10: convert resync_lock to use seqlock
>>    md/raid10: prevent unnecessary calls to wake_up() in fast path
>>
>>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>   drivers/md/raid10.h |  2 +-
>>   2 files changed, 59 insertions(+), 31 deletions(-)
> 
> 
> Kind regards,
> 
> Paul
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier()
  2022-08-29 19:53   ` John Stoffel
  2022-08-30  1:01     ` Yu Kuai
@ 2022-08-30  6:32     ` Paul Menzel
  1 sibling, 0 replies; 28+ messages in thread
From: Paul Menzel @ 2022-08-30  6:32 UTC (permalink / raw)
  To: John Stoffel; +Cc: Yu Kuai, song, linux-raid, linux-kernel, yukuai3, yi.zhang

Dear John,


Am 29.08.22 um 21:53 schrieb John Stoffel:
>>>>>> "Yu" == Yu Kuai <yukuai1@huaweicloud.com> writes:
> 
> Yu> From: Yu Kuai <yukuai3@huawei.com>

The quoting style is really confusing, as it does not seem to be the 
standard, and a lot of MUAs won’t mark up the citation.

[…]

> Yu> 'conf->barrier' is protected by 'conf->resync_lock', reading
> Yu> 'conf->barrier' without holding the lock is wrong.
> 
> Yu> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> Yu> ---
> Yu>  drivers/md/raid10.c | 2 +-
> Yu>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Yu> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> Yu> index 9117fcdee1be..b70c207f7932 100644
> Yu> --- a/drivers/md/raid10.c
> Yu> +++ b/drivers/md/raid10.c
> Yu> @@ -930,8 +930,8 @@ static void flush_pending_writes(struct r10conf *conf)
>   
> Yu>  static void raise_barrier(struct r10conf *conf, int force)
> Yu>  {
> Yu> -	BUG_ON(force && !conf->barrier);
> Yu>  	spin_lock_irq(&conf->resync_lock);
> Yu> +	BUG_ON(force && !conf->barrier);
> 
> I don't like this BUG_ON() at all, why are you crashing the system
> here instead of just doing a simple WARN_ONCE() instead?  Is there
> anything the user can do to get into this situation on their own, or
> does it really signify a logic error in the code?  If so, why are you
> killing the system?

As you can see, the BUG_ON() was there before, so it’s unrelated to this 
patch and Yun is not killing anything.

[…]


> Yu>  	/* Wait until no block IO is waiting (unless 'force') */
> Yu>  	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
> Yu> --
> Yu> 2.31.1


Kind regards,

Paul

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-29 13:40 ` [PATCH -next 0/3] md/raid10: reduce lock contention for io Guoqing Jiang
@ 2022-08-31 11:55   ` Yu Kuai
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-08-31 11:55 UTC (permalink / raw)
  To: Guoqing Jiang, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi!

在 2022/08/29 21:40, Guoqing Jiang 写道:
> 
> 
> On 8/29/22 9:14 PM, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> patch 1 is a small problem found by code review.
>> patch 2 avoid holding resync_lock in fast path.
>> patch 3 avoid holding lock in wake_up() in fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> Cpu: Huawei KUNPENG 920, there are four numa nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset:    2.9 GiB/s
>> after this patchset:    6.6 Gib/s
> 
> Impressive! Pls try mdadm test suites too to avoid regression.
> 
>> Please noted that in kunpeng-920, memory access latency is very bad
>> accross nodes compare to local node, and in other architecture
>> performance improvement might not be significant.
> 
> By any chance can someone try with x64?

I tried to test with Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz,

before this patchset: 7.0 GiB/s
after this patchset : 9.3 GiB/s

Thanks,
Kuai
> 
> Thanks,
> Guoqing
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-30  1:09   ` Yu Kuai
@ 2022-08-31 11:59     ` Paul Menzel
  2022-08-31 12:07       ` Yu Kuai
  0 siblings, 1 reply; 28+ messages in thread
From: Paul Menzel @ 2022-08-31 11:59 UTC (permalink / raw)
  To: Yu Kuai; +Cc: song, linux-raid, linux-kernel, yi.zhang, yukuai3

Dear Yu,


Am 30.08.22 um 03:09 schrieb Yu Kuai:

> 在 2022/08/29 21:58, Paul Menzel 写道:

>> Am 29.08.22 um 15:14 schrieb Yu Kuai:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> patch 1 is a small problem found by code review.
>>> patch 2 avoid holding resync_lock in fast path.
>>> patch 3 avoid holding lock in wake_up() in fast path.
>>>
>>> Test environment:
>>>
>>> Architecture: aarch64
>>> Cpu: Huawei KUNPENG 920, there are four numa nodes
>>>
>>> Raid10 initialize:
>>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>>
>>> Test cmd:
>>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>>
>>> Test result:
>>> before this patchset:    2.9 GiB/s
>>> after this patchset:    6.6 Gib/s
>>
>> Could you please give more details about the test setup, like the 
>> drives used?
> 
> test setup is described above, four nvme disks is used.

I was wondering about the model to be able to reproduce it.

>> Did you use some tools like ftrace to figure out the bottleneck?
> 
> Yes, I'm sure the bottleneck is spin_lock(), specifically threads from
> multiple nodes try to grab the same lock. By the way, if I bind the
> threads to the same node, performance can also improve to 6.6 Gib/s
> without this patchset.

Interesting. Maybe you could add all that to the commit message of the 
second patch.


Kind regards,

Paul


>>> Please noted that in kunpeng-920, memory access latency is very bad
>>> accross nodes compare to local node, and in other architecture
>>> performance improvement might not be significant.
>>>
>>> Yu Kuai (3):
>>>    md/raid10: fix improper BUG_ON() in raise_barrier()
>>>    md/raid10: convert resync_lock to use seqlock
>>>    md/raid10: prevent unnecessary calls to wake_up() in fast path
>>>
>>>   drivers/md/raid10.c | 88 +++++++++++++++++++++++++++++----------------
>>>   drivers/md/raid10.h |  2 +-
>>>   2 files changed, 59 insertions(+), 31 deletions(-)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-31 11:59     ` Paul Menzel
@ 2022-08-31 12:07       ` Yu Kuai
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-08-31 12:07 UTC (permalink / raw)
  To: Paul Menzel, Yu Kuai; +Cc: song, linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi, Paul!

在 2022/08/31 19:59, Paul Menzel 写道:
> Dear Yu,
> 
> 
> Am 30.08.22 um 03:09 schrieb Yu Kuai:
> 
>> 在 2022/08/29 21:58, Paul Menzel 写道:
> 
>>> Am 29.08.22 um 15:14 schrieb Yu Kuai:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> patch 1 is a small problem found by code review.
>>>> patch 2 avoid holding resync_lock in fast path.
>>>> patch 3 avoid holding lock in wake_up() in fast path.
>>>>
>>>> Test environment:
>>>>
>>>> Architecture: aarch64
>>>> Cpu: Huawei KUNPENG 920, there are four numa nodes
>>>>
>>>> Raid10 initialize:
>>>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 
>>>> /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>>>
>>>> Test cmd:
>>>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 
>>>> -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 
>>>> -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>>>
>>>> Test result:
>>>> before this patchset:    2.9 GiB/s
>>>> after this patchset:    6.6 Gib/s
>>>
>>> Could you please give more details about the test setup, like the 
>>> drives used?
>>
>> test setup is described above, four nvme disks is used.
> 
> I was wondering about the model to be able to reproduce it.
> 
>>> Did you use some tools like ftrace to figure out the bottleneck?
>>
>> Yes, I'm sure the bottleneck is spin_lock(), specifically threads from
>> multiple nodes try to grab the same lock. By the way, if I bind the
>> threads to the same node, performance can also improve to 6.6 Gib/s
>> without this patchset.
> 
> Interesting. Maybe you could add all that to the commit message of the 
> second patch.

Of course, I will do that in next version.

Thanks,
Kuai
> 
> 
> Kind regards,
> 
> Paul
> 
> 
>>>> Please noted that in kunpeng-920, memory access latency is very bad
>>>> accross nodes compare to local node, and in other architecture
>>>> performance improvement might not be significant.
>>>>
>>>> Yu Kuai (3):
>>>>    md/raid10: fix improper BUG_ON() in raise_barrier()
>>>>    md/raid10: convert resync_lock to use seqlock
>>>>    md/raid10: prevent unnecessary calls to wake_up() in fast path
>>>>
>>>>   drivers/md/raid10.c | 88 
>>>> +++++++++++++++++++++++++++++----------------
>>>>   drivers/md/raid10.h |  2 +-
>>>>   2 files changed, 59 insertions(+), 31 deletions(-)
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
                   ` (4 preceding siblings ...)
  2022-08-29 13:58 ` Paul Menzel
@ 2022-08-31 18:00 ` Song Liu
  2022-09-03  6:08   ` Yu Kuai
  5 siblings, 1 reply; 28+ messages in thread
From: Song Liu @ 2022-08-31 18:00 UTC (permalink / raw)
  To: Yu Kuai; +Cc: linux-raid, open list, yukuai3, yi.zhang

On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> From: Yu Kuai <yukuai3@huawei.com>
>
> patch 1 is a small problem found by code review.
> patch 2 avoid holding resync_lock in fast path.
> patch 3 avoid holding lock in wake_up() in fast path.
>
> Test environment:
>
> Architecture: aarch64
> Cpu: Huawei KUNPENG 920, there are four numa nodes
>
> Raid10 initialize:
> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Test cmd:
> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>
> Test result:
> before this patchset:   2.9 GiB/s
> after this patchset:    6.6 Gib/s

Nice work! Applied to md-next.

Thanks,
Song

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-08-29 13:15 ` [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock Yu Kuai
@ 2022-09-01 18:41   ` Logan Gunthorpe
  2022-09-02  0:49     ` Guoqing Jiang
  2022-09-02  1:21     ` Yu Kuai
  2022-09-02  9:42   ` Guoqing Jiang
  1 sibling, 2 replies; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-01 18:41 UTC (permalink / raw)
  To: Yu Kuai, song; +Cc: linux-raid, linux-kernel, yukuai3, yi.zhang

Hi,

On 2022-08-29 07:15, Yu Kuai wrote:
> From: Yu Kuai <yukuai3@huawei.com>
> 
> Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
> and io can't be dispatched until 'barrier' is dropped.
> 
> Since holding the 'barrier' is not common, convert 'resync_lock' to use
> seqlock so that holding lock can be avoided in fast path.
> 
> Signed-off-by: Yu Kuai <yukuai3@huawei.com>

I've found some lockdep issues starting with this patch in md-next while
running mdadm tests (specifically 00raid10 when run about 10 times in a
row).

I've seen a couple different lock dep errors. The first seems to be
reproducible on this patch, then it possibly changes to the second on
subsequent patches. Not sure exactly.

I haven't dug into it too deeply, but hopefully it can be fixed easily.

Logan

--


    ================================
    WARNING: inconsistent lock state
    6.0.0-rc2-eid-vmlocalyes-dbg-00023-gfd68041d2fd2 #2604 Not tainted
    --------------------------------
    inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
    fsck.ext3/1695 [HC0[0]:SC0[0]:HE0:SE1] takes:
    ffff8881049b0120 (&____s->seqcount#10){+.?.}-{0:0}, at:
raid10_read_request+0x21f/0x760
		(raid10.c:1134)

    {IN-SOFTIRQ-W} state was registered at:
      lock_acquire+0x183/0x440
      lower_barrier+0x5e/0xd0
      end_sync_request+0x178/0x180
      end_sync_write+0x193/0x380
      bio_endio+0x346/0x3a0
      blk_update_request+0x1eb/0x7c0
      blk_mq_end_request+0x30/0x50
      lo_complete_rq+0xb7/0x100
      blk_complete_reqs+0x77/0x90
      blk_done_softirq+0x38/0x40
      __do_softirq+0x10c/0x650
      run_ksoftirqd+0x48/0x80
      smpboot_thread_fn+0x302/0x400
      kthread+0x18c/0x1c0
      ret_from_fork+0x1f/0x30

    irq event stamp: 8930
    hardirqs last  enabled at (8929): [<ffffffff96df8351>]
_raw_spin_unlock_irqrestore+0x31/0x60
    hardirqs last disabled at (8930): [<ffffffff96df7fc5>]
_raw_spin_lock_irq+0x75/0x90
    softirqs last  enabled at (6768): [<ffffffff9554970e>]
__irq_exit_rcu+0xfe/0x150
    softirqs last disabled at (6757): [<ffffffff9554970e>]
__irq_exit_rcu+0xfe/0x150

    other info that might help us debug this:
     Possible unsafe locking scenario:

           CPU0
           ----
      lock(&____s->seqcount#10);
      <Interrupt>
        lock(&____s->seqcount#10);

     *** DEADLOCK ***

    2 locks held by fsck.ext3/1695:
     #0: ffff8881007d0930 (mapping.invalidate_lock#2){++++}-{3:3}, at:
page_cache_ra_unbounded+0xaf/0x250
     #1: ffff8881049b0120 (&____s->seqcount#10){+.?.}-{0:0}, at:
raid10_read_request+0x21f/0x760

    stack backtrace:
    CPU: 0 PID: 1695 Comm: fsck.ext3 Not tainted
6.0.0-rc2-eid-vmlocalyes-dbg-00023-gfd68041d2fd2 #2604
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x5a/0x74
     dump_stack+0x10/0x12
     print_usage_bug.part.0+0x233/0x246
     mark_lock.part.0.cold+0x73/0x14f
     mark_held_locks+0x71/0xa0
     lockdep_hardirqs_on_prepare+0x158/0x230
     trace_hardirqs_on+0x34/0x100
     _raw_spin_unlock_irq+0x28/0x60
     wait_barrier+0x4a6/0x720
         raid10.c:1004
     raid10_read_request+0x21f/0x760
     raid10_make_request+0x2d6/0x2160
     md_handle_request+0x3f3/0x5b0
     md_submit_bio+0xd9/0x120
     __submit_bio+0x9d/0x100
     submit_bio_noacct_nocheck+0x1fd/0x470
     submit_bio_noacct+0x4c2/0xbb0
     submit_bio+0x3f/0xf0
     mpage_readahead+0x323/0x3b0
     blkdev_readahead+0x15/0x20
     read_pages+0x136/0x7a0
     page_cache_ra_unbounded+0x18d/0x250
     page_cache_ra_order+0x2c9/0x400
     ondemand_readahead+0x320/0x730
     page_cache_sync_ra+0xa6/0xb0
     filemap_get_pages+0x1eb/0xc00
     filemap_read+0x1f1/0x770
     blkdev_read_iter+0x164/0x310
     vfs_read+0x467/0x5a0
     __x64_sys_pread64+0x122/0x160
     do_syscall_64+0x35/0x80
     entry_SYSCALL_64_after_hwframe+0x46/0xb0

--

    ======================================================
    WARNING: possible circular locking dependency detected
    6.0.0-rc2-eid-vmlocalyes-dbg-00027-gcd6aa5181bbb #2600 Not tainted
    ------------------------------------------------------
    systemd-udevd/292 is trying to acquire lock:
    ffff88817b644170 (&(&conf->resync_lock)->lock){....}-{2:2}, at:
wait_barrier+0x4fe/0x770

    but task is already holding lock:
    ffff88817b644120 (&____s->seqcount#11){+.+.}-{0:0}, at:
raid10_read_request+0x21f/0x760
			raid10.c:1140  wait_barrier()
			raid10.c:1204  regular_request_wait()



    which lock already depends on the new lock.


    the existing dependency chain (in reverse order) is:

    -> #1 (&____s->seqcount#11){+.+.}-{0:0}:
           raise_barrier+0xe0/0x300
		raid10.c:940 write_seqlock_irq()
           raid10_sync_request+0x629/0x4750
		raid10.c:3689 raise_barrire()
           md_do_sync.cold+0x8ec/0x1491
           md_thread+0x19d/0x2d0
           kthread+0x18c/0x1c0
           ret_from_fork+0x1f/0x30

    -> #0 (&(&conf->resync_lock)->lock){....}-{2:2}:
           __lock_acquire+0x1cb4/0x3170
           lock_acquire+0x183/0x440
           _raw_spin_lock_irq+0x4d/0x90
           wait_barrier+0x4fe/0x770
           raid10_read_request+0x21f/0x760
		raid10.c:1140  wait_barrier()
		raid10.c:1204  regular_request_wait()
           raid10_make_request+0x2d6/0x2190
           md_handle_request+0x3f3/0x5b0
           md_submit_bio+0xd9/0x120
           __submit_bio+0x9d/0x100
           submit_bio_noacct_nocheck+0x1fd/0x470
           submit_bio_noacct+0x4c2/0xbb0
           submit_bio+0x3f/0xf0
           submit_bh_wbc+0x270/0x2a0
           block_read_full_folio+0x37c/0x580
           blkdev_read_folio+0x18/0x20
           filemap_read_folio+0x3f/0x110
           do_read_cache_folio+0x13b/0x2c0
           read_cache_folio+0x42/0x50
           read_part_sector+0x74/0x1c0
           read_lba+0x176/0x2a0
           efi_partition+0x1ce/0xdd0
           bdev_disk_changed+0x2e7/0x6a0
           blkdev_get_whole+0xd2/0x140
           blkdev_get_by_dev.part.0+0x37f/0x570
           blkdev_get_by_dev+0x51/0x60
           disk_scan_partitions+0xad/0xf0
           blkdev_common_ioctl+0x3f3/0xdf0
           blkdev_ioctl+0x1e1/0x450
           __x64_sys_ioctl+0xc0/0x100
           do_syscall_64+0x35/0x80
           entry_SYSCALL_64_after_hwframe+0x46/0xb0

    other info that might help us debug this:

     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&____s->seqcount#11);
                                   lock(&(&conf->resync_lock)->lock);
                                   lock(&____s->seqcount#11);
      lock(&(&conf->resync_lock)->lock);

     *** DEADLOCK ***

    2 locks held by systemd-udevd/292:
     #0: ffff88817a532528 (&disk->open_mutex){+.+.}-{3:3}, at:
blkdev_get_by_dev.part.0+0x180/0x570
     #1: ffff88817b644120 (&____s->seqcount#11){+.+.}-{0:0}, at:
raid10_read_request+0x21f/0x760

    stack backtrace:
    CPU: 3 PID: 292 Comm: systemd-udevd Not tainted
6.0.0-rc2-eid-vmlocalyes-dbg-00027-gcd6aa5181bbb #2600
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
04/01/2014
    Call Trace:
     <TASK>
     dump_stack_lvl+0x5a/0x74
     dump_stack+0x10/0x12
     print_circular_bug.cold+0x146/0x14b
     check_noncircular+0x1ff/0x250
     __lock_acquire+0x1cb4/0x3170
     lock_acquire+0x183/0x440
     _raw_spin_lock_irq+0x4d/0x90
     wait_barrier+0x4fe/0x770
     raid10_read_request+0x21f/0x760
     raid10_make_request+0x2d6/0x2190
     md_handle_request+0x3f3/0x5b0
     md_submit_bio+0xd9/0x120
     __submit_bio+0x9d/0x100
     submit_bio_noacct_nocheck+0x1fd/0x470
     submit_bio_noacct+0x4c2/0xbb0
     submit_bio+0x3f/0xf0
     submit_bh_wbc+0x270/0x2a0
     block_read_full_folio+0x37c/0x580
     blkdev_read_folio+0x18/0x20
     filemap_read_folio+0x3f/0x110
     do_read_cache_folio+0x13b/0x2c0
     read_cache_folio+0x42/0x50
     read_part_sector+0x74/0x1c0
     read_lba+0x176/0x2a0
     efi_partition+0x1ce/0xdd0
     bdev_disk_changed+0x2e7/0x6a0
     blkdev_get_whole+0xd2/0x140
     blkdev_get_by_dev.part.0+0x37f/0x570
     blkdev_get_by_dev+0x51/0x60
     disk_scan_partitions+0xad/0xf0
     blkdev_common_ioctl+0x3f3/0xdf0
     blkdev_ioctl+0x1e1/0x450
     __x64_sys_ioctl+0xc0/0x100
     do_syscall_64+0x35/0x80

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-01 18:41   ` Logan Gunthorpe
@ 2022-09-02  0:49     ` Guoqing Jiang
  2022-09-02  0:56       ` Logan Gunthorpe
  2022-09-02  1:21     ` Yu Kuai
  1 sibling, 1 reply; 28+ messages in thread
From: Guoqing Jiang @ 2022-09-02  0:49 UTC (permalink / raw)
  To: Logan Gunthorpe, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yukuai3, yi.zhang



On 9/2/22 2:41 AM, Logan Gunthorpe wrote:
> Hi,
>
> On 2022-08-29 07:15, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
>> and io can't be dispatched until 'barrier' is dropped.
>>
>> Since holding the 'barrier' is not common, convert 'resync_lock' to use
>> seqlock so that holding lock can be avoided in fast path.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> I've found some lockdep issues starting with this patch in md-next while
> running mdadm tests (specifically 00raid10 when run about 10 times in a
> row).
>
> I've seen a couple different lock dep errors. The first seems to be
> reproducible on this patch, then it possibly changes to the second on
> subsequent patches. Not sure exactly.

That's why I said "try mdadm test suites too to avoid regression." ...

Guoqing

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02  0:49     ` Guoqing Jiang
@ 2022-09-02  0:56       ` Logan Gunthorpe
  2022-09-02  1:00         ` Guoqing Jiang
  0 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-02  0:56 UTC (permalink / raw)
  To: Guoqing Jiang, Yu Kuai, song; +Cc: linux-raid, linux-kernel, yukuai3, yi.zhang



On 2022-09-01 18:49, Guoqing Jiang wrote:
> 
> 
> On 9/2/22 2:41 AM, Logan Gunthorpe wrote:
>> Hi,
>>
>> On 2022-08-29 07:15, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Currently, wait_barrier() will hold 'resync_lock' to read
>>> 'conf->barrier',
>>> and io can't be dispatched until 'barrier' is dropped.
>>>
>>> Since holding the 'barrier' is not common, convert 'resync_lock' to use
>>> seqlock so that holding lock can be avoided in fast path.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>> I've found some lockdep issues starting with this patch in md-next while
>> running mdadm tests (specifically 00raid10 when run about 10 times in a
>> row).
>>
>> I've seen a couple different lock dep errors. The first seems to be
>> reproducible on this patch, then it possibly changes to the second on
>> subsequent patches. Not sure exactly.
> 
> That's why I said "try mdadm test suites too to avoid regression." ...

You may have to run it multiple times, a single run tends not to catch
all errors. I had to loop the noted test 10 times to be sure I hit this
every time when I did the simple bisect.

And ensure that all the debug options are on when you run it (take a
look at the Kernel Hacking section in menuconfig). You won't hit this
bug without at least CONFIG_PROVE_LOCKING=y.

Logan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02  0:56       ` Logan Gunthorpe
@ 2022-09-02  1:00         ` Guoqing Jiang
  0 siblings, 0 replies; 28+ messages in thread
From: Guoqing Jiang @ 2022-09-02  1:00 UTC (permalink / raw)
  To: Logan Gunthorpe, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yukuai3, yi.zhang



On 9/2/22 8:56 AM, Logan Gunthorpe wrote:
>
> On 2022-09-01 18:49, Guoqing Jiang wrote:
>>
>> On 9/2/22 2:41 AM, Logan Gunthorpe wrote:
>>> Hi,
>>>
>>> On 2022-08-29 07:15, Yu Kuai wrote:
>>>> From: Yu Kuai <yukuai3@huawei.com>
>>>>
>>>> Currently, wait_barrier() will hold 'resync_lock' to read
>>>> 'conf->barrier',
>>>> and io can't be dispatched until 'barrier' is dropped.
>>>>
>>>> Since holding the 'barrier' is not common, convert 'resync_lock' to use
>>>> seqlock so that holding lock can be avoided in fast path.
>>>>
>>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>> I've found some lockdep issues starting with this patch in md-next while
>>> running mdadm tests (specifically 00raid10 when run about 10 times in a
>>> row).
>>>
>>> I've seen a couple different lock dep errors. The first seems to be
>>> reproducible on this patch, then it possibly changes to the second on
>>> subsequent patches. Not sure exactly.
>> That's why I said "try mdadm test suites too to avoid regression." ...
> You may have to run it multiple times, a single run tends not to catch
> all errors. I had to loop the noted test 10 times to be sure I hit this
> every time when I did the simple bisect.
>
> And ensure that all the debug options are on when you run it (take a
> look at the Kernel Hacking section in menuconfig). You won't hit this
> bug without at least CONFIG_PROVE_LOCKING=y.

Yes,  we definitely need to enable the option to test change for locking 
stuffs.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-01 18:41   ` Logan Gunthorpe
  2022-09-02  0:49     ` Guoqing Jiang
@ 2022-09-02  1:21     ` Yu Kuai
  2022-09-02  8:14       ` Yu Kuai
  1 sibling, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-09-02  1:21 UTC (permalink / raw)
  To: Logan Gunthorpe, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi,

在 2022/09/02 2:41, Logan Gunthorpe 写道:
> Hi,
> 
> On 2022-08-29 07:15, Yu Kuai wrote:
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> Currently, wait_barrier() will hold 'resync_lock' to read 'conf->barrier',
>> and io can't be dispatched until 'barrier' is dropped.
>>
>> Since holding the 'barrier' is not common, convert 'resync_lock' to use
>> seqlock so that holding lock can be avoided in fast path.
>>
>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
> 
> I've found some lockdep issues starting with this patch in md-next while
> running mdadm tests (specifically 00raid10 when run about 10 times in a
> row).
> 
> I've seen a couple different lock dep errors. The first seems to be
> reproducible on this patch, then it possibly changes to the second on
> subsequent patches. Not sure exactly.
> 

Thanks for the test,

I think this is false positive because of the special usage here,

for example, in raise_barrier():

write_seqlock_irq
  spin_lock_irq();
   lock_acquire
  do_write_seqcount_begin
   seqcount_acquire

  wait_event_lock_irq_cmd
   spin_unlock_irq -> lock is released while seqcount is still hold
		     if other context hold the lock again, lockdep
		     will trigger warning.
   ...
   spin_lock_irq

write_sequnlock_irq

Functionality should be ok, I'll try to find a way to prevent such
warning.

Thanks,
Kuai
> I haven't dug into it too deeply, but hopefully it can be fixed easily.
> 
> Logan
> 
> --
> 
> 
>      ================================
>      WARNING: inconsistent lock state
>      6.0.0-rc2-eid-vmlocalyes-dbg-00023-gfd68041d2fd2 #2604 Not tainted
>      --------------------------------
>      inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
>      fsck.ext3/1695 [HC0[0]:SC0[0]:HE0:SE1] takes:
>      ffff8881049b0120 (&____s->seqcount#10){+.?.}-{0:0}, at:
> raid10_read_request+0x21f/0x760
> 		(raid10.c:1134)
> 
>      {IN-SOFTIRQ-W} state was registered at:
>        lock_acquire+0x183/0x440
>        lower_barrier+0x5e/0xd0
>        end_sync_request+0x178/0x180
>        end_sync_write+0x193/0x380
>        bio_endio+0x346/0x3a0
>        blk_update_request+0x1eb/0x7c0
>        blk_mq_end_request+0x30/0x50
>        lo_complete_rq+0xb7/0x100
>        blk_complete_reqs+0x77/0x90
>        blk_done_softirq+0x38/0x40
>        __do_softirq+0x10c/0x650
>        run_ksoftirqd+0x48/0x80
>        smpboot_thread_fn+0x302/0x400
>        kthread+0x18c/0x1c0
>        ret_from_fork+0x1f/0x30
> 
>      irq event stamp: 8930
>      hardirqs last  enabled at (8929): [<ffffffff96df8351>]
> _raw_spin_unlock_irqrestore+0x31/0x60
>      hardirqs last disabled at (8930): [<ffffffff96df7fc5>]
> _raw_spin_lock_irq+0x75/0x90
>      softirqs last  enabled at (6768): [<ffffffff9554970e>]
> __irq_exit_rcu+0xfe/0x150
>      softirqs last disabled at (6757): [<ffffffff9554970e>]
> __irq_exit_rcu+0xfe/0x150
> 
>      other info that might help us debug this:
>       Possible unsafe locking scenario:
> 
>             CPU0
>             ----
>        lock(&____s->seqcount#10);
>        <Interrupt>
>          lock(&____s->seqcount#10);
> 
>       *** DEADLOCK ***
> 
>      2 locks held by fsck.ext3/1695:
>       #0: ffff8881007d0930 (mapping.invalidate_lock#2){++++}-{3:3}, at:
> page_cache_ra_unbounded+0xaf/0x250
>       #1: ffff8881049b0120 (&____s->seqcount#10){+.?.}-{0:0}, at:
> raid10_read_request+0x21f/0x760
> 
>      stack backtrace:
>      CPU: 0 PID: 1695 Comm: fsck.ext3 Not tainted
> 6.0.0-rc2-eid-vmlocalyes-dbg-00023-gfd68041d2fd2 #2604
>      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
> 04/01/2014
>      Call Trace:
>       <TASK>
>       dump_stack_lvl+0x5a/0x74
>       dump_stack+0x10/0x12
>       print_usage_bug.part.0+0x233/0x246
>       mark_lock.part.0.cold+0x73/0x14f
>       mark_held_locks+0x71/0xa0
>       lockdep_hardirqs_on_prepare+0x158/0x230
>       trace_hardirqs_on+0x34/0x100
>       _raw_spin_unlock_irq+0x28/0x60
>       wait_barrier+0x4a6/0x720
>           raid10.c:1004
>       raid10_read_request+0x21f/0x760
>       raid10_make_request+0x2d6/0x2160
>       md_handle_request+0x3f3/0x5b0
>       md_submit_bio+0xd9/0x120
>       __submit_bio+0x9d/0x100
>       submit_bio_noacct_nocheck+0x1fd/0x470
>       submit_bio_noacct+0x4c2/0xbb0
>       submit_bio+0x3f/0xf0
>       mpage_readahead+0x323/0x3b0
>       blkdev_readahead+0x15/0x20
>       read_pages+0x136/0x7a0
>       page_cache_ra_unbounded+0x18d/0x250
>       page_cache_ra_order+0x2c9/0x400
>       ondemand_readahead+0x320/0x730
>       page_cache_sync_ra+0xa6/0xb0
>       filemap_get_pages+0x1eb/0xc00
>       filemap_read+0x1f1/0x770
>       blkdev_read_iter+0x164/0x310
>       vfs_read+0x467/0x5a0
>       __x64_sys_pread64+0x122/0x160
>       do_syscall_64+0x35/0x80
>       entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> --
> 
>      ======================================================
>      WARNING: possible circular locking dependency detected
>      6.0.0-rc2-eid-vmlocalyes-dbg-00027-gcd6aa5181bbb #2600 Not tainted
>      ------------------------------------------------------
>      systemd-udevd/292 is trying to acquire lock:
>      ffff88817b644170 (&(&conf->resync_lock)->lock){....}-{2:2}, at:
> wait_barrier+0x4fe/0x770
> 
>      but task is already holding lock:
>      ffff88817b644120 (&____s->seqcount#11){+.+.}-{0:0}, at:
> raid10_read_request+0x21f/0x760
> 			raid10.c:1140  wait_barrier()
> 			raid10.c:1204  regular_request_wait()
> 
> 
> 
>      which lock already depends on the new lock.
> 
> 
>      the existing dependency chain (in reverse order) is:
> 
>      -> #1 (&____s->seqcount#11){+.+.}-{0:0}:
>             raise_barrier+0xe0/0x300
> 		raid10.c:940 write_seqlock_irq()
>             raid10_sync_request+0x629/0x4750
> 		raid10.c:3689 raise_barrire()
>             md_do_sync.cold+0x8ec/0x1491
>             md_thread+0x19d/0x2d0
>             kthread+0x18c/0x1c0
>             ret_from_fork+0x1f/0x30
> 
>      -> #0 (&(&conf->resync_lock)->lock){....}-{2:2}:
>             __lock_acquire+0x1cb4/0x3170
>             lock_acquire+0x183/0x440
>             _raw_spin_lock_irq+0x4d/0x90
>             wait_barrier+0x4fe/0x770
>             raid10_read_request+0x21f/0x760
> 		raid10.c:1140  wait_barrier()
> 		raid10.c:1204  regular_request_wait()
>             raid10_make_request+0x2d6/0x2190
>             md_handle_request+0x3f3/0x5b0
>             md_submit_bio+0xd9/0x120
>             __submit_bio+0x9d/0x100
>             submit_bio_noacct_nocheck+0x1fd/0x470
>             submit_bio_noacct+0x4c2/0xbb0
>             submit_bio+0x3f/0xf0
>             submit_bh_wbc+0x270/0x2a0
>             block_read_full_folio+0x37c/0x580
>             blkdev_read_folio+0x18/0x20
>             filemap_read_folio+0x3f/0x110
>             do_read_cache_folio+0x13b/0x2c0
>             read_cache_folio+0x42/0x50
>             read_part_sector+0x74/0x1c0
>             read_lba+0x176/0x2a0
>             efi_partition+0x1ce/0xdd0
>             bdev_disk_changed+0x2e7/0x6a0
>             blkdev_get_whole+0xd2/0x140
>             blkdev_get_by_dev.part.0+0x37f/0x570
>             blkdev_get_by_dev+0x51/0x60
>             disk_scan_partitions+0xad/0xf0
>             blkdev_common_ioctl+0x3f3/0xdf0
>             blkdev_ioctl+0x1e1/0x450
>             __x64_sys_ioctl+0xc0/0x100
>             do_syscall_64+0x35/0x80
>             entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
>      other info that might help us debug this:
> 
>       Possible unsafe locking scenario:
> 
>             CPU0                    CPU1
>             ----                    ----
>        lock(&____s->seqcount#11);
>                                     lock(&(&conf->resync_lock)->lock);
>                                     lock(&____s->seqcount#11);
>        lock(&(&conf->resync_lock)->lock);
> 
>       *** DEADLOCK ***
> 
>      2 locks held by systemd-udevd/292:
>       #0: ffff88817a532528 (&disk->open_mutex){+.+.}-{3:3}, at:
> blkdev_get_by_dev.part.0+0x180/0x570
>       #1: ffff88817b644120 (&____s->seqcount#11){+.+.}-{0:0}, at:
> raid10_read_request+0x21f/0x760
> 
>      stack backtrace:
>      CPU: 3 PID: 292 Comm: systemd-udevd Not tainted
> 6.0.0-rc2-eid-vmlocalyes-dbg-00027-gcd6aa5181bbb #2600
>      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
> 04/01/2014
>      Call Trace:
>       <TASK>
>       dump_stack_lvl+0x5a/0x74
>       dump_stack+0x10/0x12
>       print_circular_bug.cold+0x146/0x14b
>       check_noncircular+0x1ff/0x250
>       __lock_acquire+0x1cb4/0x3170
>       lock_acquire+0x183/0x440
>       _raw_spin_lock_irq+0x4d/0x90
>       wait_barrier+0x4fe/0x770
>       raid10_read_request+0x21f/0x760
>       raid10_make_request+0x2d6/0x2190
>       md_handle_request+0x3f3/0x5b0
>       md_submit_bio+0xd9/0x120
>       __submit_bio+0x9d/0x100
>       submit_bio_noacct_nocheck+0x1fd/0x470
>       submit_bio_noacct+0x4c2/0xbb0
>       submit_bio+0x3f/0xf0
>       submit_bh_wbc+0x270/0x2a0
>       block_read_full_folio+0x37c/0x580
>       blkdev_read_folio+0x18/0x20
>       filemap_read_folio+0x3f/0x110
>       do_read_cache_folio+0x13b/0x2c0
>       read_cache_folio+0x42/0x50
>       read_part_sector+0x74/0x1c0
>       read_lba+0x176/0x2a0
>       efi_partition+0x1ce/0xdd0
>       bdev_disk_changed+0x2e7/0x6a0
>       blkdev_get_whole+0xd2/0x140
>       blkdev_get_by_dev.part.0+0x37f/0x570
>       blkdev_get_by_dev+0x51/0x60
>       disk_scan_partitions+0xad/0xf0
>       blkdev_common_ioctl+0x3f3/0xdf0
>       blkdev_ioctl+0x1e1/0x450
>       __x64_sys_ioctl+0xc0/0x100
>       do_syscall_64+0x35/0x80
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02  1:21     ` Yu Kuai
@ 2022-09-02  8:14       ` Yu Kuai
  2022-09-02 17:03         ` Logan Gunthorpe
  0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-09-02  8:14 UTC (permalink / raw)
  To: Yu Kuai, Logan Gunthorpe, song
  Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi, Logan

在 2022/09/02 9:21, Yu Kuai 写道:
> Hi,
> 
> 在 2022/09/02 2:41, Logan Gunthorpe 写道:
>> Hi,
>>
>> On 2022-08-29 07:15, Yu Kuai wrote:
>>> From: Yu Kuai <yukuai3@huawei.com>
>>>
>>> Currently, wait_barrier() will hold 'resync_lock' to read 
>>> 'conf->barrier',
>>> and io can't be dispatched until 'barrier' is dropped.
>>>
>>> Since holding the 'barrier' is not common, convert 'resync_lock' to use
>>> seqlock so that holding lock can be avoided in fast path.
>>>
>>> Signed-off-by: Yu Kuai <yukuai3@huawei.com>
>>
>> I've found some lockdep issues starting with this patch in md-next while
>> running mdadm tests (specifically 00raid10 when run about 10 times in a
>> row).
>>
>> I've seen a couple different lock dep errors. The first seems to be
>> reproducible on this patch, then it possibly changes to the second on
>> subsequent patches. Not sure exactly.
>>
> 
> Thanks for the test,
> 
> I think this is false positive because of the special usage here,
> 
> for example, in raise_barrier():
> 
> write_seqlock_irq
>   spin_lock_irq();
>    lock_acquire
>   do_write_seqcount_begin
>    seqcount_acquire
> 
>   wait_event_lock_irq_cmd
>    spin_unlock_irq -> lock is released while seqcount is still hold
>               if other context hold the lock again, lockdep
>               will trigger warning.
>    ...
>    spin_lock_irq
> 
> write_sequnlock_irq
> 
> Functionality should be ok, I'll try to find a way to prevent such
> warning.

Can you try the following patch? I'm running mdadm tests myself and I
didn't see any problems yet.

diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 2f7c8bef6dc2..317bd862f40a 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -940,16 +940,16 @@ static void raise_barrier(struct r10conf *conf, 
int force)
         BUG_ON(force && !conf->barrier);

         /* Wait until no block IO is waiting (unless 'force') */
-       wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
-                           conf->resync_lock.lock);
+       wait_event_seqlock_irq(conf->wait_barrier, force || 
!conf->nr_waiting,
+                              conf->resync_lock);

         /* block any new IO from starting */
         WRITE_ONCE(conf->barrier, conf->barrier + 1);

         /* Now wait for all pending IO to complete */
-       wait_event_lock_irq(conf->wait_barrier,
-                           !atomic_read(&conf->nr_pending) && 
conf->barrier < RESYNC_DEPTH,
-                           conf->resync_lock.lock);
+       wait_event_seqlock_irq(conf->wait_barrier,
+                              !atomic_read(&conf->nr_pending) &&
+                              conf->barrier < RESYNC_DEPTH, 
conf->resync_lock);

         write_sequnlock_irq(&conf->resync_lock);
  }
@@ -1007,7 +1007,7 @@ static bool wait_barrier(struct r10conf *conf, 
bool nowait)
                         ret = false;
                 } else {
                         raid10_log(conf->mddev, "wait barrier");
-                       wait_event_lock_irq(conf->wait_barrier,
+                       wait_event_seqlock_irq(conf->wait_barrier,
                                             !conf->barrier ||
 
(atomic_read(&conf->nr_pending) &&
                                              bio_list &&
@@ -1020,7 +1020,7 @@ static bool wait_barrier(struct r10conf *conf, 
bool nowait)
                                               test_bit(MD_RECOVERY_RUNNING,
 
&conf->mddev->recovery) &&
                                               conf->nr_queued > 0),
-                                           conf->resync_lock.lock);
+                                           conf->resync_lock);
                 }
                 conf->nr_waiting--;
                 if (!conf->nr_waiting)
@@ -1058,10 +1058,9 @@ static void freeze_array(struct r10conf *conf, 
int extra)
         conf->array_freeze_pending++;
         WRITE_ONCE(conf->barrier, conf->barrier + 1);
         conf->nr_waiting++;
-       wait_event_lock_irq_cmd(conf->wait_barrier,
+       wait_event_seqlock_irq_cmd(conf->wait_barrier,
                                 atomic_read(&conf->nr_pending) == 
conf->nr_queued+extra,
-                               conf->resync_lock.lock,
-                               flush_pending_writes(conf));
+                               conf->resync_lock, 
flush_pending_writes(conf));

         conf->array_freeze_pending--;
         write_sequnlock_irq(&conf->resync_lock);
diff --git a/include/linux/wait.h b/include/linux/wait.h
index 58cfbf81447c..97d6b378e40c 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -977,6 +977,13 @@ extern int do_wait_intr_irq(wait_queue_head_t *, 
wait_queue_entry_t *);
                             schedule(); 
         \
                             spin_lock_irq(&lock))

+#define __wait_event_seqlock_irq(wq_head, condition, lock, cmd) 
                \
+       (void)___wait_event(wq_head, condition, TASK_UNINTERRUPTIBLE, 0, 
0,     \
+                           write_sequnlock_irq(&lock); 
        \
+                           cmd; 
        \
+                           schedule(); 
        \
+                           write_seqlock_irq(&lock))
+
  /**
   * wait_event_lock_irq_cmd - sleep until a condition gets true. The
   *                          condition is checked under the lock. This
@@ -1007,6 +1014,13 @@ do { 
                                \
         __wait_event_lock_irq(wq_head, condition, lock, cmd); 
         \
  } while (0)

+#define wait_event_seqlock_irq_cmd(wq_head, condition, lock, cmd) 
        \
+do { 
        \
+       if (condition) 
        \
+               break; 
        \
+       __wait_event_seqlock_irq(wq_head, condition, lock, cmd); 
        \
+} while (0)
+
  /**
   * wait_event_lock_irq - sleep until a condition gets true. The
   *                      condition is checked under the lock. This
@@ -1034,6 +1048,12 @@ do { 
                                \
         __wait_event_lock_irq(wq_head, condition, lock, ); 
         \
  } while (0)

+#define wait_event_seqlock_irq(wq_head, condition, lock) 
        \
+do { 
        \
+       if (condition) 
        \
+               break; 
        \
+       __wait_event_seqlock_irq(wq_head, condition, lock, ); 
        \
+} while (0)

  #define __wait_event_interruptible_lock_irq(wq_head, condition, lock, 
cmd)     \
         ___wait_event(wq_head, condition, TASK_INTERRUPTIBLE, 0, 0, 
         \

> 
> Thanks,
> Kuai
>> I haven't dug into it too deeply, but hopefully it can be fixed easily.
>>
>> Logan
>>
>> -- 
>>
>>
>>      ================================
>>      WARNING: inconsistent lock state
>>      6.0.0-rc2-eid-vmlocalyes-dbg-00023-gfd68041d2fd2 #2604 Not tainted
>>      --------------------------------
>>      inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
>>      fsck.ext3/1695 [HC0[0]:SC0[0]:HE0:SE1] takes:
>>      ffff8881049b0120 (&____s->seqcount#10){+.?.}-{0:0}, at:
>> raid10_read_request+0x21f/0x760
>>         (raid10.c:1134)
>>
>>      {IN-SOFTIRQ-W} state was registered at:
>>        lock_acquire+0x183/0x440
>>        lower_barrier+0x5e/0xd0
>>        end_sync_request+0x178/0x180
>>        end_sync_write+0x193/0x380
>>        bio_endio+0x346/0x3a0
>>        blk_update_request+0x1eb/0x7c0
>>        blk_mq_end_request+0x30/0x50
>>        lo_complete_rq+0xb7/0x100
>>        blk_complete_reqs+0x77/0x90
>>        blk_done_softirq+0x38/0x40
>>        __do_softirq+0x10c/0x650
>>        run_ksoftirqd+0x48/0x80
>>        smpboot_thread_fn+0x302/0x400
>>        kthread+0x18c/0x1c0
>>        ret_from_fork+0x1f/0x30
>>
>>      irq event stamp: 8930
>>      hardirqs last  enabled at (8929): [<ffffffff96df8351>]
>> _raw_spin_unlock_irqrestore+0x31/0x60
>>      hardirqs last disabled at (8930): [<ffffffff96df7fc5>]
>> _raw_spin_lock_irq+0x75/0x90
>>      softirqs last  enabled at (6768): [<ffffffff9554970e>]
>> __irq_exit_rcu+0xfe/0x150
>>      softirqs last disabled at (6757): [<ffffffff9554970e>]
>> __irq_exit_rcu+0xfe/0x150
>>
>>      other info that might help us debug this:
>>       Possible unsafe locking scenario:
>>
>>             CPU0
>>             ----
>>        lock(&____s->seqcount#10);
>>        <Interrupt>
>>          lock(&____s->seqcount#10);
>>
>>       *** DEADLOCK ***
>>
>>      2 locks held by fsck.ext3/1695:
>>       #0: ffff8881007d0930 (mapping.invalidate_lock#2){++++}-{3:3}, at:
>> page_cache_ra_unbounded+0xaf/0x250
>>       #1: ffff8881049b0120 (&____s->seqcount#10){+.?.}-{0:0}, at:
>> raid10_read_request+0x21f/0x760
>>
>>      stack backtrace:
>>      CPU: 0 PID: 1695 Comm: fsck.ext3 Not tainted
>> 6.0.0-rc2-eid-vmlocalyes-dbg-00023-gfd68041d2fd2 #2604
>>      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
>> 04/01/2014
>>      Call Trace:
>>       <TASK>
>>       dump_stack_lvl+0x5a/0x74
>>       dump_stack+0x10/0x12
>>       print_usage_bug.part.0+0x233/0x246
>>       mark_lock.part.0.cold+0x73/0x14f
>>       mark_held_locks+0x71/0xa0
>>       lockdep_hardirqs_on_prepare+0x158/0x230
>>       trace_hardirqs_on+0x34/0x100
>>       _raw_spin_unlock_irq+0x28/0x60
>>       wait_barrier+0x4a6/0x720
>>           raid10.c:1004
>>       raid10_read_request+0x21f/0x760
>>       raid10_make_request+0x2d6/0x2160
>>       md_handle_request+0x3f3/0x5b0
>>       md_submit_bio+0xd9/0x120
>>       __submit_bio+0x9d/0x100
>>       submit_bio_noacct_nocheck+0x1fd/0x470
>>       submit_bio_noacct+0x4c2/0xbb0
>>       submit_bio+0x3f/0xf0
>>       mpage_readahead+0x323/0x3b0
>>       blkdev_readahead+0x15/0x20
>>       read_pages+0x136/0x7a0
>>       page_cache_ra_unbounded+0x18d/0x250
>>       page_cache_ra_order+0x2c9/0x400
>>       ondemand_readahead+0x320/0x730
>>       page_cache_sync_ra+0xa6/0xb0
>>       filemap_get_pages+0x1eb/0xc00
>>       filemap_read+0x1f1/0x770
>>       blkdev_read_iter+0x164/0x310
>>       vfs_read+0x467/0x5a0
>>       __x64_sys_pread64+0x122/0x160
>>       do_syscall_64+0x35/0x80
>>       entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>> -- 
>>
>>      ======================================================
>>      WARNING: possible circular locking dependency detected
>>      6.0.0-rc2-eid-vmlocalyes-dbg-00027-gcd6aa5181bbb #2600 Not tainted
>>      ------------------------------------------------------
>>      systemd-udevd/292 is trying to acquire lock:
>>      ffff88817b644170 (&(&conf->resync_lock)->lock){....}-{2:2}, at:
>> wait_barrier+0x4fe/0x770
>>
>>      but task is already holding lock:
>>      ffff88817b644120 (&____s->seqcount#11){+.+.}-{0:0}, at:
>> raid10_read_request+0x21f/0x760
>>             raid10.c:1140  wait_barrier()
>>             raid10.c:1204  regular_request_wait()
>>
>>
>>
>>      which lock already depends on the new lock.
>>
>>
>>      the existing dependency chain (in reverse order) is:
>>
>>      -> #1 (&____s->seqcount#11){+.+.}-{0:0}:
>>             raise_barrier+0xe0/0x300
>>         raid10.c:940 write_seqlock_irq()
>>             raid10_sync_request+0x629/0x4750
>>         raid10.c:3689 raise_barrire()
>>             md_do_sync.cold+0x8ec/0x1491
>>             md_thread+0x19d/0x2d0
>>             kthread+0x18c/0x1c0
>>             ret_from_fork+0x1f/0x30
>>
>>      -> #0 (&(&conf->resync_lock)->lock){....}-{2:2}:
>>             __lock_acquire+0x1cb4/0x3170
>>             lock_acquire+0x183/0x440
>>             _raw_spin_lock_irq+0x4d/0x90
>>             wait_barrier+0x4fe/0x770
>>             raid10_read_request+0x21f/0x760
>>         raid10.c:1140  wait_barrier()
>>         raid10.c:1204  regular_request_wait()
>>             raid10_make_request+0x2d6/0x2190
>>             md_handle_request+0x3f3/0x5b0
>>             md_submit_bio+0xd9/0x120
>>             __submit_bio+0x9d/0x100
>>             submit_bio_noacct_nocheck+0x1fd/0x470
>>             submit_bio_noacct+0x4c2/0xbb0
>>             submit_bio+0x3f/0xf0
>>             submit_bh_wbc+0x270/0x2a0
>>             block_read_full_folio+0x37c/0x580
>>             blkdev_read_folio+0x18/0x20
>>             filemap_read_folio+0x3f/0x110
>>             do_read_cache_folio+0x13b/0x2c0
>>             read_cache_folio+0x42/0x50
>>             read_part_sector+0x74/0x1c0
>>             read_lba+0x176/0x2a0
>>             efi_partition+0x1ce/0xdd0
>>             bdev_disk_changed+0x2e7/0x6a0
>>             blkdev_get_whole+0xd2/0x140
>>             blkdev_get_by_dev.part.0+0x37f/0x570
>>             blkdev_get_by_dev+0x51/0x60
>>             disk_scan_partitions+0xad/0xf0
>>             blkdev_common_ioctl+0x3f3/0xdf0
>>             blkdev_ioctl+0x1e1/0x450
>>             __x64_sys_ioctl+0xc0/0x100
>>             do_syscall_64+0x35/0x80
>>             entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>
>>      other info that might help us debug this:
>>
>>       Possible unsafe locking scenario:
>>
>>             CPU0                    CPU1
>>             ----                    ----
>>        lock(&____s->seqcount#11);
>>                                     lock(&(&conf->resync_lock)->lock);
>>                                     lock(&____s->seqcount#11);
>>        lock(&(&conf->resync_lock)->lock);
>>
>>       *** DEADLOCK ***
>>
>>      2 locks held by systemd-udevd/292:
>>       #0: ffff88817a532528 (&disk->open_mutex){+.+.}-{3:3}, at:
>> blkdev_get_by_dev.part.0+0x180/0x570
>>       #1: ffff88817b644120 (&____s->seqcount#11){+.+.}-{0:0}, at:
>> raid10_read_request+0x21f/0x760
>>
>>      stack backtrace:
>>      CPU: 3 PID: 292 Comm: systemd-udevd Not tainted
>> 6.0.0-rc2-eid-vmlocalyes-dbg-00027-gcd6aa5181bbb #2600
>>      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2
>> 04/01/2014
>>      Call Trace:
>>       <TASK>
>>       dump_stack_lvl+0x5a/0x74
>>       dump_stack+0x10/0x12
>>       print_circular_bug.cold+0x146/0x14b
>>       check_noncircular+0x1ff/0x250
>>       __lock_acquire+0x1cb4/0x3170
>>       lock_acquire+0x183/0x440
>>       _raw_spin_lock_irq+0x4d/0x90
>>       wait_barrier+0x4fe/0x770
>>       raid10_read_request+0x21f/0x760
>>       raid10_make_request+0x2d6/0x2190
>>       md_handle_request+0x3f3/0x5b0
>>       md_submit_bio+0xd9/0x120
>>       __submit_bio+0x9d/0x100
>>       submit_bio_noacct_nocheck+0x1fd/0x470
>>       submit_bio_noacct+0x4c2/0xbb0
>>       submit_bio+0x3f/0xf0
>>       submit_bh_wbc+0x270/0x2a0
>>       block_read_full_folio+0x37c/0x580
>>       blkdev_read_folio+0x18/0x20
>>       filemap_read_folio+0x3f/0x110
>>       do_read_cache_folio+0x13b/0x2c0
>>       read_cache_folio+0x42/0x50
>>       read_part_sector+0x74/0x1c0
>>       read_lba+0x176/0x2a0
>>       efi_partition+0x1ce/0xdd0
>>       bdev_disk_changed+0x2e7/0x6a0
>>       blkdev_get_whole+0xd2/0x140
>>       blkdev_get_by_dev.part.0+0x37f/0x570
>>       blkdev_get_by_dev+0x51/0x60
>>       disk_scan_partitions+0xad/0xf0
>>       blkdev_common_ioctl+0x3f3/0xdf0
>>       blkdev_ioctl+0x1e1/0x450
>>       __x64_sys_ioctl+0xc0/0x100
>>       do_syscall_64+0x35/0x80
>> .
>>
> 
> .
> 


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-08-29 13:15 ` [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock Yu Kuai
  2022-09-01 18:41   ` Logan Gunthorpe
@ 2022-09-02  9:42   ` Guoqing Jiang
  2022-09-02 10:02     ` Yu Kuai
  1 sibling, 1 reply; 28+ messages in thread
From: Guoqing Jiang @ 2022-09-02  9:42 UTC (permalink / raw)
  To: Yu Kuai, song; +Cc: linux-raid, linux-kernel, yukuai3, yi.zhang

Hi,

On 8/29/22 9:15 PM, Yu Kuai wrote:
> +static bool wait_barrier_nolock(struct r10conf *conf)
> +{
> +	unsigned int seq = raw_read_seqcount(&conf->resync_lock.seqcount);
> +
> +	if (seq & 1)
> +		return false;
> +
> +	if (READ_ONCE(conf->barrier))
> +		return false;
> +
> +	atomic_inc(&conf->nr_pending);
> +	if (!read_seqcount_retry(&conf->resync_lock.seqcount, seq))

I think 'seq' is usually get from read_seqcount_begin.

> +		return true;
> +
> +	atomic_dec(&conf->nr_pending);
> +	return false;
> +

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02  9:42   ` Guoqing Jiang
@ 2022-09-02 10:02     ` Yu Kuai
  2022-09-02 10:16       ` Guoqing Jiang
  0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-09-02 10:02 UTC (permalink / raw)
  To: Guoqing Jiang, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi,

在 2022/09/02 17:42, Guoqing Jiang 写道:
> Hi,
> 
> On 8/29/22 9:15 PM, Yu Kuai wrote:
>> +static bool wait_barrier_nolock(struct r10conf *conf)
>> +{
>> +    unsigned int seq = raw_read_seqcount(&conf->resync_lock.seqcount);
>> +
>> +    if (seq & 1)
>> +        return false;
>> +
>> +    if (READ_ONCE(conf->barrier))
>> +        return false;
>> +
>> +    atomic_inc(&conf->nr_pending);
>> +    if (!read_seqcount_retry(&conf->resync_lock.seqcount, seq))
> 
> I think 'seq' is usually get from read_seqcount_begin.

read_seqcount_begin will loop untill "req & 1" failed, I'm afraid this
will cause high cpu usage in come cases.

What I try to do here is just try once, and fall back to hold lock and
wait if failed.

What do you think?

Thanks,
Kuai
> 
>> +        return true;
>> +
>> +    atomic_dec(&conf->nr_pending);
>> +    return false;
>> +
> 
> Thanks,
> Guoqing
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02 10:02     ` Yu Kuai
@ 2022-09-02 10:16       ` Guoqing Jiang
  2022-09-02 10:53         ` Yu Kuai
  0 siblings, 1 reply; 28+ messages in thread
From: Guoqing Jiang @ 2022-09-02 10:16 UTC (permalink / raw)
  To: Yu Kuai, song; +Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)



On 9/2/22 6:02 PM, Yu Kuai wrote:
> Hi,
>
> 在 2022/09/02 17:42, Guoqing Jiang 写道:
>> Hi,
>>
>> On 8/29/22 9:15 PM, Yu Kuai wrote:
>>> +static bool wait_barrier_nolock(struct r10conf *conf)
>>> +{
>>> +    unsigned int seq = raw_read_seqcount(&conf->resync_lock.seqcount);
>>> +
>>> +    if (seq & 1)
>>> +        return false;
>>> +
>>> +    if (READ_ONCE(conf->barrier))
>>> +        return false;
>>> +
>>> +    atomic_inc(&conf->nr_pending);
>>> +    if (!read_seqcount_retry(&conf->resync_lock.seqcount, seq))
>>
>> I think 'seq' is usually get from read_seqcount_begin.
>
> read_seqcount_begin will loop untill "req & 1" failed, I'm afraid this
> will cause high cpu usage in come cases.
>
> What I try to do here is just try once, and fall back to hold lock and
> wait if failed.

Thanks for the explanation.

I'd suggest to try with read_seqcount_begin/read_seqcount_retry pattern
because it is a common usage in kernel I think, then check whether the
performance drops or not.  Maybe it is related to lockdep issue, but I am
not sure.

Thanks,
Guoqing

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02 10:16       ` Guoqing Jiang
@ 2022-09-02 10:53         ` Yu Kuai
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-09-02 10:53 UTC (permalink / raw)
  To: Guoqing Jiang, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi,

在 2022/09/02 18:16, Guoqing Jiang 写道:
> 
> 
> On 9/2/22 6:02 PM, Yu Kuai wrote:
>> Hi,
>>
>> 在 2022/09/02 17:42, Guoqing Jiang 写道:
>>> Hi,
>>>
>>> On 8/29/22 9:15 PM, Yu Kuai wrote:
>>>> +static bool wait_barrier_nolock(struct r10conf *conf)
>>>> +{
>>>> +    unsigned int seq = raw_read_seqcount(&conf->resync_lock.seqcount);
>>>> +
>>>> +    if (seq & 1)
>>>> +        return false;
>>>> +
>>>> +    if (READ_ONCE(conf->barrier))
>>>> +        return false;
>>>> +
>>>> +    atomic_inc(&conf->nr_pending);
>>>> +    if (!read_seqcount_retry(&conf->resync_lock.seqcount, seq))
>>>
>>> I think 'seq' is usually get from read_seqcount_begin.
>>
>> read_seqcount_begin will loop untill "req & 1" failed, I'm afraid this
>> will cause high cpu usage in come cases.
>>
>> What I try to do here is just try once, and fall back to hold lock and
>> wait if failed.
> 
> Thanks for the explanation.
> 
> I'd suggest to try with read_seqcount_begin/read_seqcount_retry pattern
> because it is a common usage in kernel I think, then check whether the
> performance drops or not.  Maybe it is related to lockdep issue, but I am
> not sure.

I can try read_seqcount_begin/read_seqcount_retry.

Please take a look at another thread, lockdep issue is due to
inconsistent usage of lock and seqcount inside seqlock:

wait_event() only release lock, seqcount is not released.

Thansk,
Kuai
> 
> Thanks,
> Guoqing
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02  8:14       ` Yu Kuai
@ 2022-09-02 17:03         ` Logan Gunthorpe
  2022-09-03  6:07           ` Yu Kuai
  0 siblings, 1 reply; 28+ messages in thread
From: Logan Gunthorpe @ 2022-09-02 17:03 UTC (permalink / raw)
  To: Yu Kuai, song; +Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)




On 2022-09-02 02:14, Yu Kuai wrote:
> Can you try the following patch? I'm running mdadm tests myself and I
> didn't see any problems yet.

Yes, that patch seems to fix the issue.

However, may I suggest we do this without trying to introduce new
helpers into wait.h? I suspect that could result in a fair amount of
bike shedding and delay. wait_event_cmd() is often used in situations 
where a specific lock type doesn't have a helper.

My stab at it is in a diff below which also fixes the bug. 

I'd also recommend somebody clean up that nasty condition in 
wait_barrier(). Put it into an appropriately named function
with some comments. As is, it is pretty much unreadable.

Logan

--


diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0e3229ee1ebc..ae297bc870bd 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -934,22 +934,26 @@ static void flush_pending_writes(struct r10conf *conf)
  *    lower_barrier when the particular background IO completes.
  */
 
+#define wait_event_barrier_cmd(conf, cond, cmd) \
+	wait_event_cmd((conf)->wait_barrier, cond, \
+		       write_sequnlock_irq(&(conf)->resync_lock); cmd, \
+		       write_seqlock_irq(&(conf)->resync_lock))
+#define wait_event_barrier(conf, cond) wait_event_barrier_cmd(conf, cond, )
+
 static void raise_barrier(struct r10conf *conf, int force)
 {
 	write_seqlock_irq(&conf->resync_lock);
 	BUG_ON(force && !conf->barrier);
 
 	/* Wait until no block IO is waiting (unless 'force') */
-	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
-			    conf->resync_lock.lock);
+	wait_event_barrier(conf, force || !conf->nr_waiting);
 
 	/* block any new IO from starting */
 	WRITE_ONCE(conf->barrier, conf->barrier + 1);
 
 	/* Now wait for all pending IO to complete */
-	wait_event_lock_irq(conf->wait_barrier,
-			    !atomic_read(&conf->nr_pending) && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock.lock);
+	wait_event_barrier(conf, !atomic_read(&conf->nr_pending) &&
+			   conf->barrier < RESYNC_DEPTH);
 
 	write_sequnlock_irq(&conf->resync_lock);
 }
@@ -1007,20 +1011,19 @@ static bool wait_barrier(struct r10conf *conf, bool nowait)
 			ret = false;
 		} else {
 			raid10_log(conf->mddev, "wait barrier");
-			wait_event_lock_irq(conf->wait_barrier,
-					    !conf->barrier ||
-					    (atomic_read(&conf->nr_pending) &&
-					     bio_list &&
-					     (!bio_list_empty(&bio_list[0]) ||
-					      !bio_list_empty(&bio_list[1]))) ||
+			wait_event_barrier(conf,
+					   !conf->barrier ||
+					   (atomic_read(&conf->nr_pending) &&
+					    bio_list &&
+					    (!bio_list_empty(&bio_list[0]) ||
+					     !bio_list_empty(&bio_list[1]))) ||
 					     /* move on if recovery thread is
 					      * blocked by us
 					      */
-					     (conf->mddev->thread->tsk == current &&
-					      test_bit(MD_RECOVERY_RUNNING,
-						       &conf->mddev->recovery) &&
-					      conf->nr_queued > 0),
-					    conf->resync_lock.lock);
+					    (conf->mddev->thread->tsk == current &&
+					     test_bit(MD_RECOVERY_RUNNING,
+					       &conf->mddev->recovery) &&
+					     conf->nr_queued > 0));
 		}
 		conf->nr_waiting--;
 		if (!conf->nr_waiting)
@@ -1058,10 +1061,9 @@ static void freeze_array(struct r10conf *conf, int extra)
 	conf->array_freeze_pending++;
 	WRITE_ONCE(conf->barrier, conf->barrier + 1);
 	conf->nr_waiting++;
-	wait_event_lock_irq_cmd(conf->wait_barrier,
-				atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
-				conf->resync_lock.lock,
-				flush_pending_writes(conf));
+	wait_event_barrier_cmd(conf,
+		atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
+		flush_pending_writes(conf));
 
 	conf->array_freeze_pending--;
 	write_sequnlock_irq(&conf->resync_lock);

^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock
  2022-09-02 17:03         ` Logan Gunthorpe
@ 2022-09-03  6:07           ` Yu Kuai
  0 siblings, 0 replies; 28+ messages in thread
From: Yu Kuai @ 2022-09-03  6:07 UTC (permalink / raw)
  To: Logan Gunthorpe, Yu Kuai, song
  Cc: linux-raid, linux-kernel, yi.zhang, yukuai (C)

Hi,

在 2022/09/03 1:03, Logan Gunthorpe 写道:
> 
> 
> 
> On 2022-09-02 02:14, Yu Kuai wrote:
>> Can you try the following patch? I'm running mdadm tests myself and I
>> didn't see any problems yet.
> 
> Yes, that patch seems to fix the issue.
> 
> However, may I suggest we do this without trying to introduce new
> helpers into wait.h? I suspect that could result in a fair amount of
> bike shedding and delay. wait_event_cmd() is often used in situations
> where a specific lock type doesn't have a helper.

Yes, that sounds good.
> 
> My stab at it is in a diff below which also fixes the bug.
> 
> I'd also recommend somebody clean up that nasty condition in
> wait_barrier(). Put it into an appropriately named function
> with some comments. As is, it is pretty much unreadable.

Now we're at it, I can take a look.

Thanks,
Kuai
> 
> Logan
> 
> --
> 
> 
> diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
> index 0e3229ee1ebc..ae297bc870bd 100644
> --- a/drivers/md/raid10.c
> +++ b/drivers/md/raid10.c
> @@ -934,22 +934,26 @@ static void flush_pending_writes(struct r10conf *conf)
>    *    lower_barrier when the particular background IO completes.
>    */
>   
> +#define wait_event_barrier_cmd(conf, cond, cmd) \
> +	wait_event_cmd((conf)->wait_barrier, cond, \
> +		       write_sequnlock_irq(&(conf)->resync_lock); cmd, \
> +		       write_seqlock_irq(&(conf)->resync_lock))
> +#define wait_event_barrier(conf, cond) wait_event_barrier_cmd(conf, cond, )
> +
>   static void raise_barrier(struct r10conf *conf, int force)
>   {
>   	write_seqlock_irq(&conf->resync_lock);
>   	BUG_ON(force && !conf->barrier);
>   
>   	/* Wait until no block IO is waiting (unless 'force') */
> -	wait_event_lock_irq(conf->wait_barrier, force || !conf->nr_waiting,
> -			    conf->resync_lock.lock);
> +	wait_event_barrier(conf, force || !conf->nr_waiting);
>   
>   	/* block any new IO from starting */
>   	WRITE_ONCE(conf->barrier, conf->barrier + 1);
>   
>   	/* Now wait for all pending IO to complete */
> -	wait_event_lock_irq(conf->wait_barrier,
> -			    !atomic_read(&conf->nr_pending) && conf->barrier < RESYNC_DEPTH,
> -			    conf->resync_lock.lock);
> +	wait_event_barrier(conf, !atomic_read(&conf->nr_pending) &&
> +			   conf->barrier < RESYNC_DEPTH);
>   
>   	write_sequnlock_irq(&conf->resync_lock);
>   }
> @@ -1007,20 +1011,19 @@ static bool wait_barrier(struct r10conf *conf, bool nowait)
>   			ret = false;
>   		} else {
>   			raid10_log(conf->mddev, "wait barrier");
> -			wait_event_lock_irq(conf->wait_barrier,
> -					    !conf->barrier ||
> -					    (atomic_read(&conf->nr_pending) &&
> -					     bio_list &&
> -					     (!bio_list_empty(&bio_list[0]) ||
> -					      !bio_list_empty(&bio_list[1]))) ||
> +			wait_event_barrier(conf,
> +					   !conf->barrier ||
> +					   (atomic_read(&conf->nr_pending) &&
> +					    bio_list &&
> +					    (!bio_list_empty(&bio_list[0]) ||
> +					     !bio_list_empty(&bio_list[1]))) ||
>   					     /* move on if recovery thread is
>   					      * blocked by us
>   					      */
> -					     (conf->mddev->thread->tsk == current &&
> -					      test_bit(MD_RECOVERY_RUNNING,
> -						       &conf->mddev->recovery) &&
> -					      conf->nr_queued > 0),
> -					    conf->resync_lock.lock);
> +					    (conf->mddev->thread->tsk == current &&
> +					     test_bit(MD_RECOVERY_RUNNING,
> +					       &conf->mddev->recovery) &&
> +					     conf->nr_queued > 0));
>   		}
>   		conf->nr_waiting--;
>   		if (!conf->nr_waiting)
> @@ -1058,10 +1061,9 @@ static void freeze_array(struct r10conf *conf, int extra)
>   	conf->array_freeze_pending++;
>   	WRITE_ONCE(conf->barrier, conf->barrier + 1);
>   	conf->nr_waiting++;
> -	wait_event_lock_irq_cmd(conf->wait_barrier,
> -				atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
> -				conf->resync_lock.lock,
> -				flush_pending_writes(conf));
> +	wait_event_barrier_cmd(conf,
> +		atomic_read(&conf->nr_pending) == conf->nr_queued+extra,
> +		flush_pending_writes(conf));
>   
>   	conf->array_freeze_pending--;
>   	write_sequnlock_irq(&conf->resync_lock);
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-08-31 18:00 ` Song Liu
@ 2022-09-03  6:08   ` Yu Kuai
  2022-09-09 14:45     ` Song Liu
  0 siblings, 1 reply; 28+ messages in thread
From: Yu Kuai @ 2022-09-03  6:08 UTC (permalink / raw)
  To: Song Liu, Yu Kuai; +Cc: linux-raid, open list, yi.zhang, yukuai (C)

Hi, Song

在 2022/09/01 2:00, Song Liu 写道:
> On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>>
>> From: Yu Kuai <yukuai3@huawei.com>
>>
>> patch 1 is a small problem found by code review.
>> patch 2 avoid holding resync_lock in fast path.
>> patch 3 avoid holding lock in wake_up() in fast path.
>>
>> Test environment:
>>
>> Architecture: aarch64
>> Cpu: Huawei KUNPENG 920, there are four numa nodes
>>
>> Raid10 initialize:
>> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>>
>> Test cmd:
>> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
>>
>> Test result:
>> before this patchset:   2.9 GiB/s
>> after this patchset:    6.6 Gib/s
> 
> Nice work! Applied to md-next.

Can you drop this version? There are something to improve. I can send a
new version.

Thanks,
Kuai
> 
> Thanks,
> Song
> .
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH -next 0/3] md/raid10: reduce lock contention for io
  2022-09-03  6:08   ` Yu Kuai
@ 2022-09-09 14:45     ` Song Liu
  0 siblings, 0 replies; 28+ messages in thread
From: Song Liu @ 2022-09-09 14:45 UTC (permalink / raw)
  To: Yu Kuai; +Cc: linux-raid, open list, yi.zhang, yukuai (C)

On Fri, Sep 2, 2022 at 11:08 PM Yu Kuai <yukuai1@huaweicloud.com> wrote:
>
> Hi, Song
>
> 在 2022/09/01 2:00, Song Liu 写道:
> > On Mon, Aug 29, 2022 at 6:03 AM Yu Kuai <yukuai1@huaweicloud.com> wrote:
> >>
> >> From: Yu Kuai <yukuai3@huawei.com>
> >>
> >> patch 1 is a small problem found by code review.
> >> patch 2 avoid holding resync_lock in fast path.
> >> patch 3 avoid holding lock in wake_up() in fast path.
> >>
> >> Test environment:
> >>
> >> Architecture: aarch64
> >> Cpu: Huawei KUNPENG 920, there are four numa nodes
> >>
> >> Raid10 initialize:
> >> mdadm --create /dev/md0 --level 10 --bitmap none --raid-devices 4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> >>
> >> Test cmd:
> >> fio -name=0 -ioengine=libaio -direct=1 -group_reporting=1 -randseed=2022 -rwmixread=70 -refill_buffers -filename=/dev/md0 -numjobs=16 -runtime=60s -bs=4k -iodepth=256 -rw=randread
> >>
> >> Test result:
> >> before this patchset:   2.9 GiB/s
> >> after this patchset:    6.6 Gib/s
> >
> > Nice work! Applied to md-next.
>
> Can you drop this version? There are something to improve. I can send a
> new version.

Sure, I will drop it from md-next.

Song

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-09-09 14:46 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-29 13:14 [PATCH -next 0/3] md/raid10: reduce lock contention for io Yu Kuai
2022-08-29 13:15 ` [PATCH -next 1/3] md/raid10: fix improper BUG_ON() in raise_barrier() Yu Kuai
2022-08-29 19:53   ` John Stoffel
2022-08-30  1:01     ` Yu Kuai
2022-08-30  6:32     ` Paul Menzel
2022-08-29 13:15 ` [PATCH -next 2/3] md/raid10: convert resync_lock to use seqlock Yu Kuai
2022-09-01 18:41   ` Logan Gunthorpe
2022-09-02  0:49     ` Guoqing Jiang
2022-09-02  0:56       ` Logan Gunthorpe
2022-09-02  1:00         ` Guoqing Jiang
2022-09-02  1:21     ` Yu Kuai
2022-09-02  8:14       ` Yu Kuai
2022-09-02 17:03         ` Logan Gunthorpe
2022-09-03  6:07           ` Yu Kuai
2022-09-02  9:42   ` Guoqing Jiang
2022-09-02 10:02     ` Yu Kuai
2022-09-02 10:16       ` Guoqing Jiang
2022-09-02 10:53         ` Yu Kuai
2022-08-29 13:15 ` [PATCH -next 3/3] md/raid10: prevent unnecessary calls to wake_up() in fast path Yu Kuai
2022-08-29 13:40 ` [PATCH -next 0/3] md/raid10: reduce lock contention for io Guoqing Jiang
2022-08-31 11:55   ` Yu Kuai
2022-08-29 13:58 ` Paul Menzel
2022-08-30  1:09   ` Yu Kuai
2022-08-31 11:59     ` Paul Menzel
2022-08-31 12:07       ` Yu Kuai
2022-08-31 18:00 ` Song Liu
2022-09-03  6:08   ` Yu Kuai
2022-09-09 14:45     ` Song Liu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.