* [PATCH 1/3 v2] wait: introduce wait_event_exclusive_cmd
From: Yuanhan Liu @ 2015-04-29  2:48 UTC
  To: neilb; +Cc: linux-raid, linux-kernel, Yuanhan Liu, Ingo Molnar, Peter Zijlstra

It's just a variant of wait_event_cmd(), with the exclusive flag set.

For cases like RAID5, which puts many processes to sleep until 1/4 of
the resources are free, a wake_up() wakes all processes, but only one
of them can actually get the resource, as it is protected by a spin
lock. That ends up introducing heavy lock contention, and hurts
performance badly.

This patch introduces wait_event_exclusive_cmd() to relieve the lock
contention naturally, by letting wake_up() wake up just one process.
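
For illustration, here is a minimal usage sketch (hypothetical `wq',
`lock' and resource_available(), not part of this patch). cmd1 runs
before schedule() and cmd2 after it, so the natural use is to drop a
lock across the sleep and re-take it before the condition is rechecked:

	static DECLARE_WAIT_QUEUE_HEAD(wq);
	static DEFINE_SPINLOCK(lock);

	spin_lock_irq(&lock);
	wait_event_exclusive_cmd(wq, resource_available(),
				 /* cmd1: drop the lock before sleeping */
				 spin_unlock_irq(&lock),
				 /* cmd2: re-take it once woken */
				 spin_lock_irq(&lock));
	/* lock is held here and the condition is true */
	spin_unlock_irq(&lock);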

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
v2: it's assumed that wait*() and __wait*() have the same arguments - peterz

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 include/linux/wait.h | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 2db8334..db78c72 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -358,6 +358,19 @@ do {									\
 	__ret;								\
 })
 
+#define __wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)		\
+	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 1, 0,	\
+			    cmd1; schedule(); cmd2)
+/*
+ * Just like wait_event_cmd(), except it sets the exclusive flag
+ */
+#define wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)		\
+do {									\
+	if (condition)							\
+		break;							\
+	__wait_event_exclusive_cmd(wq, condition, cmd1, cmd2);		\
+} while (0)
+
 #define __wait_event_cmd(wq, condition, cmd1, cmd2)			\
 	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
 			    cmd1; schedule(); cmd2)
-- 
1.9.0


* [PATCH 2/3] md/raid5: split wait_for_stripe and introduce wait_for_quiescent
From: Yuanhan Liu @ 2015-04-29  2:48 UTC
  To: neilb; +Cc: linux-raid, linux-kernel, Yuanhan Liu

I noticed heavy spin lock contention at get_active_stripe(), introduced
at the wake-up stage, where a group of freshly woken processes all try
to re-acquire the spin lock.

After giving this issue some thought, I found the contention could be
relieved (and even avoided) if we turn wait_for_stripe into one
waitqueue per lock hash and make the wake-up exclusive: wake up one
process at a time, which avoids the lock contention naturally.

Before hacking on wait_for_stripe, I found it actually has two usages:
waiting for the array to enter or leave the quiescent state, and
waiting for an available stripe in each of the hash lists.

So this patch splits the first usage off into a separate waitqueue,
wait_for_quiescent; the next patch will turn the second usage into one
waitqueue per hash value and make it exclusive, to relieve the lock
contention.
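
In struct r5conf terms, the end state looks like this (a sketch only;
the real hunks are below):

	/* before: one queue serves both purposes */
	wait_queue_head_t	wait_for_stripe;

	/* after: quiescence waiters get a queue of their own */
	wait_queue_head_t	wait_for_quiescent;
	wait_queue_head_t	wait_for_stripe;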

v2: wake_up(wait_for_quiescent) when (active_stripes == 0);
    commit log refactored following a suggestion from Neil.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 drivers/md/raid5.c | 15 +++++++++------
 drivers/md/raid5.h |  1 +
 2 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 77dfd72..64d5bea 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -374,6 +374,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
 
 	if (do_wakeup) {
 		wake_up(&conf->wait_for_stripe);
+		if (atomic_read(&conf->active_stripes) == 0)
+			wake_up(&conf->wait_for_quiescent);
 		if (conf->retry_read_aligned)
 			md_wakeup_thread(conf->mddev->thread);
 	}
@@ -667,7 +669,7 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 	spin_lock_irq(conf->hash_locks + hash);
 
 	do {
-		wait_event_lock_irq(conf->wait_for_stripe,
+		wait_event_lock_irq(conf->wait_for_quiescent,
 				    conf->quiesce == 0 || noquiesce,
 				    *(conf->hash_locks + hash));
 		sh = __find_stripe(conf, sector, conf->generation - previous);
@@ -4729,7 +4731,7 @@ static void raid5_align_endio(struct bio *bi, int error)
 					 raid_bi, 0);
 		bio_endio(raid_bi, 0);
 		if (atomic_dec_and_test(&conf->active_aligned_reads))
-			wake_up(&conf->wait_for_stripe);
+			wake_up(&conf->wait_for_quiescent);
 		return;
 	}
 
@@ -4824,7 +4826,7 @@ static int chunk_aligned_read(struct mddev *mddev, struct bio * raid_bio)
 		align_bi->bi_iter.bi_sector += rdev->data_offset;
 
 		spin_lock_irq(&conf->device_lock);
-		wait_event_lock_irq(conf->wait_for_stripe,
+		wait_event_lock_irq(conf->wait_for_quiescent,
 				    conf->quiesce == 0,
 				    conf->device_lock);
 		atomic_inc(&conf->active_aligned_reads);
@@ -5668,7 +5670,7 @@ static int  retry_aligned_read(struct r5conf *conf, struct bio *raid_bio)
 		bio_endio(raid_bio, 0);
 	}
 	if (atomic_dec_and_test(&conf->active_aligned_reads))
-		wake_up(&conf->wait_for_stripe);
+		wake_up(&conf->wait_for_quiescent);
 	return handled;
 }
 
@@ -6399,6 +6401,7 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 		goto abort;
 	spin_lock_init(&conf->device_lock);
 	seqcount_init(&conf->gen_lock);
+	init_waitqueue_head(&conf->wait_for_quiescent);
 	init_waitqueue_head(&conf->wait_for_stripe);
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
@@ -7422,7 +7425,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 		 * active stripes can drain
 		 */
 		conf->quiesce = 2;
-		wait_event_cmd(conf->wait_for_stripe,
+		wait_event_cmd(conf->wait_for_quiescent,
 				    atomic_read(&conf->active_stripes) == 0 &&
 				    atomic_read(&conf->active_aligned_reads) == 0,
 				    unlock_all_device_hash_locks_irq(conf),
@@ -7436,7 +7439,7 @@ static void raid5_quiesce(struct mddev *mddev, int state)
 	case 0: /* re-enable writes */
 		lock_all_device_hash_locks_irq(conf);
 		conf->quiesce = 0;
-		wake_up(&conf->wait_for_stripe);
+		wake_up(&conf->wait_for_quiescent);
 		wake_up(&conf->wait_for_overlap);
 		unlock_all_device_hash_locks_irq(conf);
 		break;
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 7dc0dd8..4cc05ec 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -508,6 +508,7 @@ struct r5conf {
 	struct list_head	inactive_list[NR_STRIPE_HASH_LOCKS];
 	atomic_t		empty_inactive_list_nr;
 	struct llist_head	released_stripes;
+	wait_queue_head_t	wait_for_quiescent;
 	wait_queue_head_t	wait_for_stripe;
 	wait_queue_head_t	wait_for_overlap;
 	unsigned long		cache_state;
-- 
1.9.0


* [PATCH 3/3] md/raid5: per hash value and exclusive wait_for_stripe
From: Yuanhan Liu @ 2015-04-29  2:48 UTC
  To: neilb; +Cc: linux-raid, linux-kernel, Yuanhan Liu

I noticed heavy spin lock contention at get_active_stripe() with fsmark
multi-threaded write workloads.

Here is where this hot contention comes from. We have a limited number
of stripes, and it's a multi-threaded write workload. Hence, those
stripes will be taken soon, which puts later processes to sleep while
they wait for free stripes. When enough stripes (>= 1/4 of the total)
are released, all waiting processes are woken and try to grab the lock.
But for each hash lock only one of them can succeed, leaving the other
processes spinning to acquire it.

Thus, it's ineffective to wake up all processes and let them battle for
a lock that only one of them can hold at a time. Instead, we could make
it an exclusive wake-up: wake up one process only. That avoids the
heavy spin lock contention naturally.

To do the exclusive wake-up, we have to split wait_for_stripe into
multiple wait queues, one per hash value, just like the hash lock.
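
In outline, waiters move from the shared, non-exclusive queue to a
per-hash exclusive one (a sketch of the pattern with a placeholder
condition `cond'; the real change is in the get_active_stripe() hunk
below):

	/* before: all waiters share one queue; wake_up() wakes them all */
	wait_event_lock_irq(conf->wait_for_stripe, cond,
			    *(conf->hash_locks + hash));

	/* after: exclusive waiters on a per-hash queue; wake_up() wakes one */
	wait_event_exclusive_cmd(conf->wait_for_stripe[hash], cond,
				 spin_unlock_irq(conf->hash_locks + hash),
				 spin_lock_irq(conf->hash_locks + hash));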

Here are some test results I got with this patch applied (all tests
run 3 times):

`fsmark.files_per_sec'
=====================

next-20150317                 this patch
-------------------------     -------------------------
metric_value     ±stddev      metric_value     ±stddev     change      testbox/benchmark/testcase-params
-------------------------     -------------------------   --------     ------------------------------
      25.600     ±0.0              92.700     ±2.5          262.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      25.600     ±0.0              77.800     ±0.6          203.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
      32.000     ±0.0              93.800     ±1.7          193.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
      32.000     ±0.0              81.233     ±1.7          153.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
      48.800     ±14.5             99.667     ±2.0          104.2%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
       6.400     ±0.0              12.800     ±0.0          100.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
      63.133     ±8.2              82.800     ±0.7           31.2%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
     245.067     ±0.7             306.567     ±7.9           25.1%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-f2fs-4M-30G-fsyncBeforeClose
      17.533     ±0.3              21.000     ±0.8           19.8%     ivb44/fsmark/1x-1t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose
     188.167     ±1.9             215.033     ±3.1           14.3%     ivb44/fsmark/1x-1t-4BRD_12G-RAID5-btrfs-4M-30G-NoSync
     254.500     ±1.8             290.733     ±2.4           14.2%     ivb44/fsmark/1x-1t-9BRD_6G-RAID5-btrfs-4M-30G-NoSync

`time.system_time'
=====================

next-20150317                 this patch
-------------------------    -------------------------
metric_value     ±stddev     metric_value     ±stddev     change       testbox/benchmark/testcase-params
-------------------------    -------------------------    --------     ------------------------------
    7235.603     ±1.2             185.163     ±1.9          -97.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-btrfs-4M-30G-fsyncBeforeClose
    7666.883     ±2.9             202.750     ±1.0          -97.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-btrfs-4M-30G-fsyncBeforeClose
   14567.893     ±0.7             421.230     ±0.4          -97.1%     ivb44/fsmark/1x-64t-3HDD-RAID5-btrfs-4M-40G-fsyncBeforeClose
    3697.667     ±14.0            148.190     ±1.7          -96.0%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-xfs-4M-30G-fsyncBeforeClose
    5572.867     ±3.8             310.717     ±1.4          -94.4%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-ext4-4M-30G-fsyncBeforeClose
    5565.050     ±0.5             313.277     ±1.5          -94.4%     ivb44/fsmark/1x-64t-4BRD_12G-RAID5-ext4-4M-30G-fsyncBeforeClose
    2420.707     ±17.1            171.043     ±2.7          -92.9%     ivb44/fsmark/1x-64t-9BRD_6G-RAID5-xfs-4M-30G-fsyncBeforeClose
    3743.300     ±4.6             379.827     ±3.5          -89.9%     ivb44/fsmark/1x-64t-3HDD-RAID5-ext4-4M-40G-fsyncBeforeClose
    3308.687     ±6.3             363.050     ±2.0          -89.0%     ivb44/fsmark/1x-64t-3HDD-RAID5-xfs-4M-40G-fsyncBeforeClose

Where,

     1x: 'x' means iterations or loops, corresponding to the '-L' option of fsmark

     1t, 64t: 't' means the number of threads

     4M: the size of a single file, corresponding to the '-s' option of fsmark
     40G, 30G, 120G: the total test size

     4BRD_12G: BRD is the ramdisk driver; '4' means 4 ramdisks and '12G' is the
               size of each ramdisk, so 48G in total. The RAID array is built on
               those ramdisks

As you can see, though there is not much performance gain for the hard
disk workloads, system time drops heavily, by up to 97%. And as
expected, throughput increases a lot, by up to 260%, for the fast
devices (ramdisks).

v2: use a bitmask instead of an array to note down which wait queues
    need to be woken up.

Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
---
 drivers/md/raid5.c | 27 +++++++++++++++++++--------
 drivers/md/raid5.h |  2 +-
 2 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 64d5bea..697d77a 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
 					 int hash)
 {
 	int size;
-	bool do_wakeup = false;
+	unsigned long do_wakeup = 0;
+	int i = 0;
 	unsigned long flags;
 
 	if (hash == NR_STRIPE_HASH_LOCKS) {
@@ -365,15 +366,19 @@ static void release_inactive_stripe_list(struct r5conf *conf,
 			    !list_empty(list))
 				atomic_dec(&conf->empty_inactive_list_nr);
 			list_splice_tail_init(list, conf->inactive_list + hash);
-			do_wakeup = true;
+			do_wakeup |= 1 << (size - 1);
 			spin_unlock_irqrestore(conf->hash_locks + hash, flags);
 		}
 		size--;
 		hash--;
 	}
 
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
+		if (do_wakeup & (1 << i))
+			wake_up(&conf->wait_for_stripe[i]);
+	}
+
 	if (do_wakeup) {
-		wake_up(&conf->wait_for_stripe);
 		if (atomic_read(&conf->active_stripes) == 0)
 			wake_up(&conf->wait_for_quiescent);
 		if (conf->retry_read_aligned)
@@ -686,14 +691,15 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 			if (!sh) {
 				set_bit(R5_INACTIVE_BLOCKED,
 					&conf->cache_state);
-				wait_event_lock_irq(
-					conf->wait_for_stripe,
+				wait_event_exclusive_cmd(
+					conf->wait_for_stripe[hash],
 					!list_empty(conf->inactive_list + hash) &&
 					(atomic_read(&conf->active_stripes)
 					 < (conf->max_nr_stripes * 3 / 4)
 					 || !test_bit(R5_INACTIVE_BLOCKED,
 						      &conf->cache_state)),
-					*(conf->hash_locks + hash));
+					spin_unlock_irq(conf->hash_locks + hash),
+					spin_lock_irq(conf->hash_locks + hash));
 				clear_bit(R5_INACTIVE_BLOCKED,
 					  &conf->cache_state);
 			} else {
@@ -718,6 +724,9 @@ get_active_stripe(struct r5conf *conf, sector_t sector,
 		}
 	} while (sh == NULL);
 
+	if (!list_empty(conf->inactive_list + hash))
+		wake_up(&conf->wait_for_stripe[hash]);
+
 	spin_unlock_irq(conf->hash_locks + hash);
 	return sh;
 }
@@ -2138,7 +2147,7 @@ static int resize_stripes(struct r5conf *conf, int newsize)
 	cnt = 0;
 	list_for_each_entry(nsh, &newstripes, lru) {
 		lock_device_hash_lock(conf, hash);
-		wait_event_cmd(conf->wait_for_stripe,
+		wait_event_exclusive_cmd(conf->wait_for_stripe[hash],
 				    !list_empty(conf->inactive_list + hash),
 				    unlock_device_hash_lock(conf, hash),
 				    lock_device_hash_lock(conf, hash));
@@ -6402,7 +6411,9 @@ static struct r5conf *setup_conf(struct mddev *mddev)
 	spin_lock_init(&conf->device_lock);
 	seqcount_init(&conf->gen_lock);
 	init_waitqueue_head(&conf->wait_for_quiescent);
-	init_waitqueue_head(&conf->wait_for_stripe);
+	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
+		init_waitqueue_head(&conf->wait_for_stripe[i]);
+	}
 	init_waitqueue_head(&conf->wait_for_overlap);
 	INIT_LIST_HEAD(&conf->handle_list);
 	INIT_LIST_HEAD(&conf->hold_list);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 4cc05ec..6307b90 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -509,7 +509,7 @@ struct r5conf {
 	atomic_t		empty_inactive_list_nr;
 	struct llist_head	released_stripes;
 	wait_queue_head_t	wait_for_quiescent;
-	wait_queue_head_t	wait_for_stripe;
+	wait_queue_head_t	wait_for_stripe[NR_STRIPE_HASH_LOCKS];
 	wait_queue_head_t	wait_for_overlap;
 	unsigned long		cache_state;
 #define R5_INACTIVE_BLOCKED	1	/* release of inactive stripes blocked,
-- 
1.9.0


* Re: [PATCH 1/3 v2] wait: introduce wait_event_exclusive_cmd
From: Peter Zijlstra @ 2015-04-29  9:31 UTC
  To: Yuanhan Liu; +Cc: neilb, linux-raid, linux-kernel, Ingo Molnar

On Wed, Apr 29, 2015 at 10:48:53AM +0800, Yuanhan Liu wrote:
> It's just a variant of wait_event_cmd(), with the exclusive flag set.
> 
> For cases like RAID5, which puts many processes to sleep until 1/4 of
> the resources are free, a wake_up() wakes all processes, but only one
> of them can actually get the resource, as it is protected by a spin
> lock. That ends up introducing heavy lock contention, and hurts
> performance badly.
> 
> This patch introduces wait_event_exclusive_cmd() to relieve the lock
> contention naturally, by letting wake_up() wake up just one process.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>

Thanks!

Assuming you want to route this through Neil's MD tree:

Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

> ---
>  include/linux/wait.h | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/include/linux/wait.h b/include/linux/wait.h
> index 2db8334..db78c72 100644
> --- a/include/linux/wait.h
> +++ b/include/linux/wait.h
> @@ -358,6 +358,19 @@ do {									\
>  	__ret;								\
>  })
>  
> +#define __wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)		\
> +	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 1, 0,	\
> +			    cmd1; schedule(); cmd2)
> +/*
> + * Just like wait_event_cmd(), except it sets the exclusive flag
> + */
> +#define wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)		\
> +do {									\
> +	if (condition)							\
> +		break;							\
> +	__wait_event_exclusive_cmd(wq, condition, cmd1, cmd2);		\
> +} while (0)
> +
>  #define __wait_event_cmd(wq, condition, cmd1, cmd2)			\
>  	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
>  			    cmd1; schedule(); cmd2)
> -- 
> 1.9.0
> 


* Re: [PATCH 1/3 v2] wait: introduce wait_event_exclusive_cmd
From: NeilBrown @ 2015-04-29 22:17 UTC
  To: Peter Zijlstra; +Cc: Yuanhan Liu, linux-raid, linux-kernel, Ingo Molnar

On Wed, 29 Apr 2015 11:31:09 +0200 Peter Zijlstra <peterz@infradead.org>
wrote:

> On Wed, Apr 29, 2015 at 10:48:53AM +0800, Yuanhan Liu wrote:
> > It's just a variant of wait_event_cmd(), with the exclusive flag set.
> > 
> > For cases like RAID5, which puts many processes to sleep until 1/4 of
> > the resources are free, a wake_up() wakes all processes, but only one
> > of them can actually get the resource, as it is protected by a spin
> > lock. That ends up introducing heavy lock contention, and hurts
> > performance badly.
> > 
> > This patch introduces wait_event_exclusive_cmd() to relieve the lock
> > contention naturally, by letting wake_up() wake up just one process.
> > 
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Yuanhan Liu <yuanhan.liu@linux.intel.com>
> 
> Thanks!
> 
> Assuming you want to route this through Neil's MD tree:
> 
> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>

Thanks a lot Peter!

I've queued the three patches in my md tree for 4.2.

Thanks yliu!

NeilBrown


> 
> > ---
> >  include/linux/wait.h | 13 +++++++++++++
> >  1 file changed, 13 insertions(+)
> > 
> > diff --git a/include/linux/wait.h b/include/linux/wait.h
> > index 2db8334..db78c72 100644
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -358,6 +358,19 @@ do {									\
> >  	__ret;								\
> >  })
> >  
> > +#define __wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)		\
> > +	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 1, 0,	\
> > +			    cmd1; schedule(); cmd2)
> > +/*
> > + * Just like wait_event_cmd(), except it sets the exclusive flag
> > + */
> > +#define wait_event_exclusive_cmd(wq, condition, cmd1, cmd2)		\
> > +do {									\
> > +	if (condition)							\
> > +		break;							\
> > +	__wait_event_exclusive_cmd(wq, condition, cmd1, cmd2);		\
> > +} while (0)
> > +
> >  #define __wait_event_cmd(wq, condition, cmd1, cmd2)			\
> >  	(void)___wait_event(wq, condition, TASK_UNINTERRUPTIBLE, 0, 0,	\
> >  			    cmd1; schedule(); cmd2)
> > -- 
> > 1.9.0
> > 


* Re: [PATCH 3/3] md/raid5: per hash value and exclusive wait_for_stripe
From: NeilBrown @ 2015-05-14  5:45 UTC
  To: Yuanhan Liu; +Cc: linux-raid, linux-kernel

On Wed, 29 Apr 2015 10:48:55 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index 64d5bea..697d77a 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
>  					 int hash)
>  {
>  	int size;
> -	bool do_wakeup = false;
> +	unsigned long do_wakeup = 0;
> +	int i = 0;
>  	unsigned long flags;
>  
>  	if (hash == NR_STRIPE_HASH_LOCKS) {
> @@ -365,15 +366,19 @@ static void release_inactive_stripe_list(struct r5conf *conf,
>  			    !list_empty(list))
>  				atomic_dec(&conf->empty_inactive_list_nr);
>  			list_splice_tail_init(list, conf->inactive_list + hash);
> -			do_wakeup = true;
> +			do_wakeup |= 1 << (size - 1);
>  			spin_unlock_irqrestore(conf->hash_locks + hash, flags);
>  		}
>  		size--;
>  		hash--;
>  	}
>  
> +	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
> +		if (do_wakeup & (1 << i))
> +			wake_up(&conf->wait_for_stripe[i]);
> +	}
> +

hi,
 I've been doing some testing and got a lock-up in resize_stripes, waiting
 on wait_for_stripe[].

 Looking at the above code,  I think
      do_wakeup |= 1 << (size - 1);
 should be
      do_wakeup |= 1 << hash;

 do you agree?  Or am I missing something?

Thanks,
NeilBrown


* Re: [PATCH 3/3] md/raid5: per hash value and exclusive wait_for_stripe
From: Yuanhan Liu @ 2015-05-14  5:55 UTC
  To: NeilBrown; +Cc: linux-raid, linux-kernel

On Thu, May 14, 2015 at 03:45:11PM +1000, NeilBrown wrote:
> On Wed, 29 Apr 2015 10:48:55 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> wrote:
> 
> > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > index 64d5bea..697d77a 100644
> > --- a/drivers/md/raid5.c
> > +++ b/drivers/md/raid5.c
> > @@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
> >  					 int hash)
> >  {
> >  	int size;
> > -	bool do_wakeup = false;
> > +	unsigned long do_wakeup = 0;
> > +	int i = 0;
> >  	unsigned long flags;
> >  
> >  	if (hash == NR_STRIPE_HASH_LOCKS) {
> > @@ -365,15 +366,19 @@ static void release_inactive_stripe_list(struct r5conf *conf,
> >  			    !list_empty(list))
> >  				atomic_dec(&conf->empty_inactive_list_nr);
> >  			list_splice_tail_init(list, conf->inactive_list + hash);
> > -			do_wakeup = true;
> > +			do_wakeup |= 1 << (size - 1);
> >  			spin_unlock_irqrestore(conf->hash_locks + hash, flags);
> >  		}
> >  		size--;
> >  		hash--;
> >  	}
> >  
> > +	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
> > +		if (do_wakeup & (1 << i))
> > +			wake_up(&conf->wait_for_stripe[i]);
> > +	}
> > +
> 
> hi,
>  I've been doing some testing and got a lock-up in resize_stripes, waiting
>  on wait_for_stripe[].
> 
>  Looking at the above code,  I think
>       do_wakeup |= 1 << (size - 1);
>  should be
>       do_wakeup |= 1 << hash;
> 
>  do you agree?  Or am I missing something?

Right. Sorry for the careless mistake.

	--yliu


* Re: [PATCH 3/3] md/raid5: per hash value and exclusive wait_for_stripe
From: NeilBrown @ 2015-05-14  6:09 UTC
  To: Yuanhan Liu; +Cc: linux-raid, linux-kernel

On Thu, 14 May 2015 13:55:08 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
wrote:

> On Thu, May 14, 2015 at 03:45:11PM +1000, NeilBrown wrote:
> > On Wed, 29 Apr 2015 10:48:55 +0800 Yuanhan Liu <yuanhan.liu@linux.intel.com>
> > wrote:
> > 
> > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> > > index 64d5bea..697d77a 100644
> > > --- a/drivers/md/raid5.c
> > > +++ b/drivers/md/raid5.c
> > > @@ -344,7 +344,8 @@ static void release_inactive_stripe_list(struct r5conf *conf,
> > >  					 int hash)
> > >  {
> > >  	int size;
> > > -	bool do_wakeup = false;
> > > +	unsigned long do_wakeup = 0;
> > > +	int i = 0;
> > >  	unsigned long flags;
> > >  
> > >  	if (hash == NR_STRIPE_HASH_LOCKS) {
> > > @@ -365,15 +366,19 @@ static void release_inactive_stripe_list(struct r5conf *conf,
> > >  			    !list_empty(list))
> > >  				atomic_dec(&conf->empty_inactive_list_nr);
> > >  			list_splice_tail_init(list, conf->inactive_list + hash);
> > > -			do_wakeup = true;
> > > +			do_wakeup |= 1 << (size - 1);
> > >  			spin_unlock_irqrestore(conf->hash_locks + hash, flags);
> > >  		}
> > >  		size--;
> > >  		hash--;
> > >  	}
> > >  
> > > +	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
> > > +		if (do_wakeup & (1 << i))
> > > +			wake_up(&conf->wait_for_stripe[i]);
> > > +	}
> > > +
> > 
> > hi,
> >  I've been doing some testing and got a lock-up in resize_stripes, waiting
> >  on wait_for_stripe[].
> > 
> >  Looking at the above code,  I think
> >       do_wakeup |= 1 << (size - 1);
> >  should be
> >       do_wakeup |= 1 << hash;
> > 
> >  do you agree?  Or am I missing something?
> 
> Right. Sorry for the careless mistake.

Thanks for confirming.
Mistakes happen - that's why we test :-)

I've merged that fix into your patch.
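
With that folded in, the wake-up path in release_inactive_stripe_list()
ends up as below (a sketch of the merged result, not the literal
committed hunk):

	list_splice_tail_init(list, conf->inactive_list + hash);
	do_wakeup |= 1 << hash;	/* record which hash list got stripes back */
	...

	for (i = 0; i < NR_STRIPE_HASH_LOCKS; i++) {
		if (do_wakeup & (1 << i))
			wake_up(&conf->wait_for_stripe[i]);
	}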

NeilBrown
