* [patch 0/8] raid5: improve write performance for fast storage
@ 2012-06-04  8:01 Shaohua Li
  2012-06-04  8:01 ` [patch 1/8] raid5: add a per-stripe lock Shaohua Li
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

Hi,

Like raid 1/10, raid5 uses a single thread to handle stripes. On fast storage,
this thread becomes a bottleneck. raid5 can offload calculations like checksums
to async threads, but if the storage is fast, scheduling and running the async
work introduce heavy lock contention in the workqueue, which makes that
optimization useless. And calculation isn't the only bottleneck: in my test the
raid5 thread must handle > 450k requests per second, and just doing dispatch
and completion is already more than one thread can keep up with. The only way
to scale is to use several threads to handle stripes.

Simply using several threads doesn't work: conf->device_lock is a global lock
which is heavily contended. The first 7 patches in the set try to address this
problem. With them, when several threads are handling stripes, device_lock is
still contended but takes much less CPU time and is no longer the heaviest
lock. Even if the 8th patch isn't accepted, the first 7 patches look good to
merge.

With the locking issue solved (at least largely), switching stripe handling to
multiple threads is trivial.

In a 3-disk raid5 setup, 2 extra threads provide a 130% throughput improvement
(with stripe_cache_size doubled) and the throughput is pretty close to the
theoretical value. With >= 4 disks the improvement is even bigger, for example
200% for a 4-disk setup, but the throughput is far below the theoretical value.
That gap is caused by several factors, such as request queue lock contention,
cache issues, and the latency introduced by how a stripe is handled across the
different disks. Those factors need further investigation.

Comments and suggestions are welcome!

Thanks,
Shaohua


* [patch 1/8] raid5: add a per-stripe lock
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-07  0:54   ` NeilBrown
  2012-06-12 21:10   ` Dan Williams
  2012-06-04  8:01 ` [patch 2/8] raid5: lockless access raid5 overridden bi_phys_segments Shaohua Li
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-add-perstripe-lock.patch --]
[-- Type: text/plain, Size: 4922 bytes --]

Add a per-stripe lock to protect stripe-specific data, like dev->read,
dev->written, etc. The purpose is to reduce contention on conf->device_lock.
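
For illustration, the intended nesting, sketched outside the diff: the new
stripe_lock nests inside device_lock, so only the outer lock needs the _irq
variant. Note that nesting two spin_lock_irq()/spin_unlock_irq() pairs, as the
hunks below do, would re-enable interrupts while device_lock is still held;
a sketch of the safe pattern looks like:

	spin_lock_irq(&conf->device_lock);
	spin_lock(&sh->stripe_lock);	/* irqs are already disabled here */
	dev->read = rbi = dev->toread;
	dev->toread = NULL;
	spin_unlock(&sh->stripe_lock);
	spin_unlock_irq(&conf->device_lock);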

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   17 +++++++++++++++++
 drivers/md/raid5.h |    1 +
 2 files changed, 18 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 13:38:54.705210229 +0800
+++ linux/drivers/md/raid5.c	2012-06-01 13:43:05.594056130 +0800
@@ -749,6 +749,7 @@ static void ops_complete_biofill(void *s
 
 	/* clear completed biofills */
 	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
@@ -774,6 +775,7 @@ static void ops_complete_biofill(void *s
 			}
 		}
 	}
+	spin_unlock_irq(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
@@ -798,8 +800,10 @@ static void ops_run_biofill(struct strip
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
 			spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&sh->stripe_lock);
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
+			spin_unlock_irq(&sh->stripe_lock);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
@@ -1137,10 +1141,12 @@ ops_run_biodrain(struct stripe_head *sh,
 			struct bio *wbi;
 
 			spin_lock_irq(&sh->raid_conf->device_lock);
+			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
+			spin_unlock_irq(&sh->stripe_lock);
 			spin_unlock_irq(&sh->raid_conf->device_lock);
 
 			while (wbi && wbi->bi_sector <
@@ -1446,6 +1452,8 @@ static int grow_one_stripe(struct r5conf
 	init_waitqueue_head(&sh->ops.wait_for_ops);
 	#endif
 
+	spin_lock_init(&sh->stripe_lock);
+
 	if (grow_buffers(sh)) {
 		shrink_buffers(sh);
 		kmem_cache_free(conf->slab_cache, sh);
@@ -2327,6 +2335,7 @@ static int add_stripe_bio(struct stripe_
 
 
 	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
@@ -2360,6 +2369,7 @@ static int add_stripe_bio(struct stripe_
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
+	spin_unlock_irq(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
@@ -2376,6 +2386,7 @@ static int add_stripe_bio(struct stripe_
 
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	spin_unlock_irq(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	return 0;
 }
@@ -2427,6 +2438,7 @@ handle_failed_stripe(struct r5conf *conf
 			}
 		}
 		spin_lock_irq(&conf->device_lock);
+		spin_lock_irq(&sh->stripe_lock);
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
@@ -2488,6 +2500,7 @@ handle_failed_stripe(struct r5conf *conf
 				bi = nextbi;
 			}
 		}
+		spin_unlock_irq(&sh->stripe_lock);
 		spin_unlock_irq(&conf->device_lock);
 		if (bitmap_end)
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -2695,6 +2708,7 @@ static void handle_stripe_clean_event(st
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
 				spin_lock_irq(&conf->device_lock);
+				spin_lock_irq(&sh->stripe_lock);
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_sector <
@@ -2709,6 +2723,7 @@ static void handle_stripe_clean_event(st
 				}
 				if (dev->towrite == NULL)
 					bitmap_end = 1;
+				spin_unlock_irq(&sh->stripe_lock);
 				spin_unlock_irq(&conf->device_lock);
 				if (bitmap_end)
 					bitmap_endwrite(conf->mddev->bitmap,
@@ -3168,6 +3183,7 @@ static void analyse_stripe(struct stripe
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
 	spin_lock_irq(&conf->device_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	for (i=disks; i--; ) {
 		struct md_rdev *rdev;
 		sector_t first_bad;
@@ -3313,6 +3329,7 @@ static void analyse_stripe(struct stripe
 				do_recovery = 1;
 		}
 	}
+	spin_unlock_irq(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-01 13:38:54.717210079 +0800
+++ linux/drivers/md/raid5.h	2012-06-01 13:44:19.229127709 +0800
@@ -210,6 +210,7 @@ struct stripe_head {
 	int			disks;		/* disks in stripe */
 	enum check_states	check_state;
 	enum reconstruct_states reconstruct_state;
+	spinlock_t		stripe_lock;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target



* [patch 2/8] raid5: lockless access raid5 overridden bi_phys_segments
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
  2012-06-04  8:01 ` [patch 1/8] raid5: add a per-stripe lock Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-07  1:06   ` NeilBrown
  2012-06-04  8:01 ` [patch 3/8] raid5: remove some device_lock locking places Shaohua Li
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-atomic-segment-accounting.patch --]
[-- Type: text/plain, Size: 2780 bytes --]

Raid5 overrides bio->bi_phys_segments and accesses it with device_lock held,
which is unnecessary. We can actually make it lockless.
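
For illustration, the layout this relies on (a restatement of the helpers in
the diff, plus one hypothetical caller, finish_bio(), to show the last-ref
check): the low 16 bits count active stripes, the high 16 bits count processed
stripes, and both live in a single atomic word:

	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	atomic_set(segments, 1);	/* 1 active stripe, 0 processed */
	atomic_inc(segments);		/* one more active stripe */
	if ((atomic_sub_return(1, segments) & 0xffff) == 0)
		finish_bio(bio);	/* last active reference dropped */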

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 13:43:05.594056130 +0800
+++ linux/drivers/md/raid5.c	2012-06-01 13:50:39.852349690 +0800
@@ -101,32 +101,43 @@ static inline struct bio *r5_next_bio(st
  */
 static inline int raid5_bi_phys_segments(struct bio *bio)
 {
-	return bio->bi_phys_segments & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return atomic_read(segments) & 0xffff;
 }
 
 static inline int raid5_bi_hw_segments(struct bio *bio)
 {
-	return (bio->bi_phys_segments >> 16) & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
 static inline int raid5_dec_bi_phys_segments(struct bio *bio)
 {
-	--bio->bi_phys_segments;
-	return raid5_bi_phys_segments(bio);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return atomic_sub_return(1, segments) & 0xffff;
 }
 
-static inline int raid5_dec_bi_hw_segments(struct bio *bio)
+static inline void raid5_inc_bi_phys_segments(struct bio *bio)
 {
-	unsigned short val = raid5_bi_hw_segments(bio);
-
-	--val;
-	bio->bi_phys_segments = (val << 16) | raid5_bi_phys_segments(bio);
-	return val;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	atomic_inc(segments);
 }
 
 static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
 {
-	bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	int old, new;
+
+	do {
+		old = atomic_read(segments);
+		new = (old & 0xffff) | (cnt << 16);
+	} while (atomic_cmpxchg(segments, old, new) != old);
+}
+
+static inline void raid5_set_bi_segments(struct bio *bio, unsigned int cnt)
+{
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	atomic_set(segments, cnt);
 }
 
 /* Find first data disk in a raid6 stripe */
@@ -2354,7 +2365,7 @@ static int add_stripe_bio(struct stripe_
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	bi->bi_phys_segments++;
+	raid5_inc_bi_phys_segments(bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -3783,7 +3794,7 @@ static struct bio *remove_bio_from_retry
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		bi->bi_phys_segments = 1; /* biased count of active stripes */
+		raid5_set_bi_segments(bi, 1); /* biased count of active stripes */
 	}
 
 	return bi;



* [patch 3/8] raid5: remove some device_lock locking places
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
  2012-06-04  8:01 ` [patch 1/8] raid5: add a per-stripe lock Shaohua Li
  2012-06-04  8:01 ` [patch 2/8] raid5: lockless access raid5 overridden bi_phys_segments Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-04  8:01 ` [patch 4/8] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-remove-some-lock.patch --]
[-- Type: text/plain, Size: 5362 bytes --]

With the per-stripe lock in place and bi_phys_segments lockless, we can safely
remove some of the places that take device_lock.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   21 ---------------------
 1 file changed, 21 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-05-28 12:12:42.970613636 +0800
+++ linux/drivers/md/raid5.c	2012-05-28 14:29:01.535797473 +0800
@@ -752,14 +752,12 @@ static void ops_complete_biofill(void *s
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	struct r5conf *conf = sh->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
 	/* clear completed biofills */
-	spin_lock_irq(&conf->device_lock);
 	spin_lock_irq(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
@@ -787,7 +785,6 @@ static void ops_complete_biofill(void *s
 		}
 	}
 	spin_unlock_irq(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
 	return_io(return_bi);
@@ -799,7 +796,6 @@ static void ops_complete_biofill(void *s
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	struct r5conf *conf = sh->raid_conf;
 	struct async_submit_ctl submit;
 	int i;
 
@@ -810,12 +806,10 @@ static void ops_run_biofill(struct strip
 		struct r5dev *dev = &sh->dev[i];
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
-			spin_lock_irq(&conf->device_lock);
 			spin_lock_irq(&sh->stripe_lock);
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
 			spin_unlock_irq(&sh->stripe_lock);
-			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, dev->page,
@@ -1151,14 +1145,12 @@ ops_run_biodrain(struct stripe_head *sh,
 		if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
 			struct bio *wbi;
 
-			spin_lock_irq(&sh->raid_conf->device_lock);
 			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
 			spin_unlock_irq(&sh->stripe_lock);
-			spin_unlock_irq(&sh->raid_conf->device_lock);
 
 			while (wbi && wbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
@@ -2345,7 +2337,6 @@ static int add_stripe_bio(struct stripe_
 		(unsigned long long)sh->sector);
 
 
-	spin_lock_irq(&conf->device_lock);
 	spin_lock_irq(&sh->stripe_lock);
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
@@ -2381,7 +2372,6 @@ static int add_stripe_bio(struct stripe_
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
 	spin_unlock_irq(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)(*bip)->bi_sector,
@@ -2398,7 +2388,6 @@ static int add_stripe_bio(struct stripe_
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
 	spin_unlock_irq(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
 	return 0;
 }
 
@@ -2448,7 +2437,6 @@ handle_failed_stripe(struct r5conf *conf
 				rdev_dec_pending(rdev, conf->mddev);
 			}
 		}
-		spin_lock_irq(&conf->device_lock);
 		spin_lock_irq(&sh->stripe_lock);
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
@@ -2512,7 +2500,6 @@ handle_failed_stripe(struct r5conf *conf
 			}
 		}
 		spin_unlock_irq(&sh->stripe_lock);
-		spin_unlock_irq(&conf->device_lock);
 		if (bitmap_end)
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
 					STRIPE_SECTORS, 0, 0);
@@ -2718,7 +2705,6 @@ static void handle_stripe_clean_event(st
 				struct bio *wbi, *wbi2;
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
-				spin_lock_irq(&conf->device_lock);
 				spin_lock_irq(&sh->stripe_lock);
 				wbi = dev->written;
 				dev->written = NULL;
@@ -2735,7 +2721,6 @@ static void handle_stripe_clean_event(st
 				if (dev->towrite == NULL)
 					bitmap_end = 1;
 				spin_unlock_irq(&sh->stripe_lock);
-				spin_unlock_irq(&conf->device_lock);
 				if (bitmap_end)
 					bitmap_endwrite(conf->mddev->bitmap,
 							sh->sector,
@@ -3193,7 +3178,6 @@ static void analyse_stripe(struct stripe
 
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
-	spin_lock_irq(&conf->device_lock);
 	spin_lock_irq(&sh->stripe_lock);
 	for (i=disks; i--; ) {
 		struct md_rdev *rdev;
@@ -3341,7 +3325,6 @@ static void analyse_stripe(struct stripe
 		}
 	}
 	spin_unlock_irq(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
 		 *     we must be recovering.
@@ -4132,9 +4115,7 @@ static void make_request(struct mddev *m
 	if (!plugged)
 		md_wakeup_thread(mddev->thread);
 
-	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_phys_segments(bi);
-	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0) {
 
 		if ( rw == WRITE )
@@ -4514,9 +4495,7 @@ static int  retry_aligned_read(struct r5
 		release_stripe(sh);
 		handled++;
 	}
-	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_phys_segments(raid_bio);
-	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0)
 		bio_endio(raid_bio, 0);
 	if (atomic_dec_and_test(&conf->active_aligned_reads))



* [patch 4/8] raid5: reduce chance release_stripe() taking device_lock
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
                   ` (2 preceding siblings ...)
  2012-06-04  8:01 ` [patch 3/8] raid5: remove some device_lock locking places Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-07  0:50   ` NeilBrown
  2012-06-04  8:01 ` [patch 5/8] raid5: add batch stripe release Shaohua Li
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-reduce-release_stripe-lock.patch --]
[-- Type: text/plain, Size: 4095 bytes --]

release_stripe() is one place where conf->device_lock is heavily contended. We
take the lock even when the stripe count isn't 1, which isn't required. On the
other hand, decrementing the count first and taking the lock only when the
count reaches 0 exposes races:
1. Between the decrement and taking the lock, another thread hits the stripe in
the cache and increases the count; the stripe is then deleted from whatever
list it was on. In this case the stripe count isn't 0.
2. Between the decrement and taking the lock, another thread hits the stripe in
the cache and releases it. In this case the stripe is already on the
appropriate list; we do a list_move to adjust its position.
Both cases look fixable to me.
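
To illustrate race 1 (a sketch, not part of the diff):

	CPU0: atomic_dec_and_test(&sh->count)	/* count reaches 0 */
	CPU1: get_active_stripe()		/* count back to 1, and sh is
						   removed from its list */
	CPU0: takes device_lock, re-reads count	/* != 0, so just return */

Hence the re-check of sh->count under device_lock in the hunk below.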

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 13:50:56.336138112 +0800
+++ linux/drivers/md/raid5.c	2012-06-01 14:03:17.062826938 +0800
@@ -201,20 +201,39 @@ static int stripe_operations_active(stru
 	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
 }
 
-static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+static void __release_stripe(struct r5conf *conf, struct stripe_head *sh,
+	int locking)
 {
+	unsigned long uninitialized_var(flags);
+
 	if (atomic_dec_and_test(&sh->count)) {
-		BUG_ON(!list_empty(&sh->lru));
+		/*
+		 * Before we hold device_lock, other thread can hit this stripe
+		 * in cache. It could do:
+		 * 1. just get_active_stripe(). The stripe count isn't 0 then.
+		 * 2. do get_active_stripe() and follow release_stripe(). So the
+		 * stripe might be already released and already in specific
+		 * list. we do list_move to adjust its position in the list.
+		 */
+		if (locking) {
+			spin_lock_irqsave(&conf->device_lock, flags);
+			if (atomic_read(&sh->count) != 0) {
+				spin_unlock_irqrestore(&conf->device_lock,
+							flags);
+				return;
+			}
+		}
+
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
 			if (test_bit(STRIPE_DELAYED, &sh->state))
-				list_add_tail(&sh->lru, &conf->delayed_list);
+				list_move_tail(&sh->lru, &conf->delayed_list);
 			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 				   sh->bm_seq - conf->seq_write > 0)
-				list_add_tail(&sh->lru, &conf->bitmap_list);
+				list_move_tail(&sh->lru, &conf->bitmap_list);
 			else {
 				clear_bit(STRIPE_BIT_DELAY, &sh->state);
-				list_add_tail(&sh->lru, &conf->handle_list);
+				list_move_tail(&sh->lru, &conf->handle_list);
 			}
 			md_wakeup_thread(conf->mddev->thread);
 		} else {
@@ -225,23 +244,22 @@ static void __release_stripe(struct r5co
 					md_wakeup_thread(conf->mddev->thread);
 			atomic_dec(&conf->active_stripes);
 			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->inactive_list);
+				list_move_tail(&sh->lru, &conf->inactive_list);
 				wake_up(&conf->wait_for_stripe);
 				if (conf->retry_read_aligned)
 					md_wakeup_thread(conf->mddev->thread);
 			}
 		}
+		if (locking)
+			spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 }
 
 static void release_stripe(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
-	unsigned long flags;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
-	__release_stripe(conf, sh);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	__release_stripe(conf, sh, 1);
 }
 
 static inline void remove_hash(struct stripe_head *sh)
@@ -484,9 +502,6 @@ get_active_stripe(struct r5conf *conf, s
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
-				if (list_empty(&sh->lru) &&
-				    !test_bit(STRIPE_EXPANDING, &sh->state))
-					BUG();
 				list_del_init(&sh->lru);
 			}
 		}
@@ -3672,7 +3687,7 @@ static void activate_bit_delay(struct r5
 		struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
 		list_del_init(&sh->lru);
 		atomic_inc(&sh->count);
-		__release_stripe(conf, sh);
+		__release_stripe(conf, sh, 0);
 	}
 }
 



* [patch 5/8] raid5: add batch stripe release
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
                   ` (3 preceding siblings ...)
  2012-06-04  8:01 ` [patch 4/8] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-04  8:01 ` [patch 6/8] raid5: make_request use " Shaohua Li
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-stripe-batch.patch --]
[-- Type: text/plain, Size: 2244 bytes --]

Add a stripe batch structure and a corresponding batched stripe release. The
next patch will use it to reduce device_lock locking.
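
A usage sketch of the new helpers (with a hypothetical caller; the real users
arrive in the next patches):

	struct stripe_head_batch batch = { .count = 0 };

	release_stripe_add_batch(&batch, sh);	/* defers device_lock; flushes
						   itself at MAX_STRIPE_BATCH */
	...
	if (batch.count)
		release_stripe_flush_batch(&batch);	/* one device_lock
							   round-trip */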

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   35 +++++++++++++++++++++++++++++++++++
 drivers/md/raid5.h |    6 ++++++
 2 files changed, 41 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 14:03:17.062826938 +0800
+++ linux/drivers/md/raid5.c	2012-06-01 14:13:46.846909398 +0800
@@ -262,6 +262,41 @@ static void release_stripe(struct stripe
 	__release_stripe(conf, sh, 1);
 }
 
+static void __release_stripe_flush_batch(struct stripe_head_batch *batch)
+{
+	int i;
+
+	for (i = 0; i < batch->count; i++) {
+		struct stripe_head *sh = batch->stripes[i];
+		__release_stripe(sh->raid_conf, sh, 0);
+	}
+	batch->count = 0;
+}
+
+static void release_stripe_flush_batch(struct stripe_head_batch *batch)
+{
+	struct r5conf *conf = batch->stripes[0]->raid_conf;
+
+	spin_lock_irq(&conf->device_lock);
+	__release_stripe_flush_batch(batch);
+	spin_unlock_irq(&conf->device_lock);
+}
+
+static void release_stripe_add_batch(struct stripe_head_batch *batch,
+	struct stripe_head *sh)
+{
+	struct r5conf *conf = sh->raid_conf;
+
+	preempt_disable();
+	if (batch->count > 0 && batch->stripes[0]->raid_conf != conf)
+		release_stripe_flush_batch(batch);
+	batch->stripes[batch->count] = sh;
+	batch->count++;
+	if (batch->count >= MAX_STRIPE_BATCH)
+		release_stripe_flush_batch(batch);
+	preempt_enable();
+}
+
 static inline void remove_hash(struct stripe_head *sh)
 {
 	pr_debug("remove_hash(), stripe %llu\n",
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-01 13:44:19.229127709 +0800
+++ linux/drivers/md/raid5.h	2012-06-01 14:13:46.846909398 +0800
@@ -239,6 +239,12 @@ struct stripe_head {
 	} dev[1]; /* allocated with extra space depending of RAID geometry */
 };
 
+#define MAX_STRIPE_BATCH 8
+struct stripe_head_batch {
+	struct stripe_head *stripes[MAX_STRIPE_BATCH];
+	int count;
+};
+
 /* stripe_head_state - collects and tracks the dynamic state of a stripe_head
  *     for handle_stripe.
  */



* [patch 6/8] raid5: make_request use batch stripe release
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
                   ` (4 preceding siblings ...)
  2012-06-04  8:01 ` [patch 5/8] raid5: add batch stripe release Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-07  1:23   ` NeilBrown
  2012-06-04  8:01 ` [patch 7/8] raid5: raid5d handle stripe in batch way Shaohua Li
  2012-06-04  8:02 ` [patch 8/8] raid5: create multiple threads to handle stripes Shaohua Li
  7 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-make_request-relase_stripe-batch.patch --]
[-- Type: text/plain, Size: 2622 bytes --]

make_request() does a stripe release for every stripe, and the stripe usually
has count 1, which defeats the previous release_stripe() optimization. In my
test, this release_stripe() became the heaviest place taking conf->device_lock
after the previous patches were applied.

The patch below batches stripe releases. When the maximum number of stripes in
a batch is reached, the batch is flushed out. The batch is also flushed when
unplug is called.
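
In outline (a sketch of the flow, no new code): stripes released inside a
plugged section are parked in the per-cpu batch, whose callback is hooked onto
the task's plug, and the flush happens when the block layer unplugs:

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* make_request() -> release_stripe_plug(sh) batches the release
	   instead of taking device_lock once per stripe */
	blk_finish_plug(&plug);		/* -> raid5_do_plug() -> one flush */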

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   42 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 41 insertions(+), 1 deletion(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 14:13:46.846909398 +0800
+++ linux/drivers/md/raid5.c	2012-06-01 14:22:07.944611949 +0800
@@ -4023,6 +4023,38 @@ static struct stripe_head *__get_priorit
 	return sh;
 }
 
+struct raid5_plug {
+	struct blk_plug_cb cb;
+	struct stripe_head_batch batch;
+};
+static DEFINE_PER_CPU(struct raid5_plug, raid5_plugs);
+
+static void raid5_do_plug(struct blk_plug_cb *cb)
+{
+	struct raid5_plug *plug = container_of(cb, struct raid5_plug, cb);
+
+	release_stripe_flush_batch(&plug->batch);
+	INIT_LIST_HEAD(&plug->cb.list);
+}
+
+static void release_stripe_plug(struct stripe_head *sh)
+{
+	struct blk_plug *plug = current->plug;
+	struct raid5_plug *raid5_plug;
+
+	if (!plug) {
+		release_stripe(sh);
+		return;
+	}
+	preempt_disable();
+	raid5_plug = &__raw_get_cpu_var(raid5_plugs);
+	release_stripe_add_batch(&raid5_plug->batch, sh);
+
+	if (list_empty(&raid5_plug->cb.list))
+		list_add(&raid5_plug->cb.list, &plug->cb_list);
+	preempt_enable();
+}
+
 static void make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
@@ -4153,7 +4185,7 @@ static void make_request(struct mddev *m
 			if ((bi->bi_rw & REQ_SYNC) &&
 			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
-			release_stripe(sh);
+			release_stripe_plug(sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -6170,6 +6202,14 @@ static struct md_personality raid4_perso
 
 static int __init raid5_init(void)
 {
+	int i;
+
+	for_each_present_cpu(i) {
+		struct raid5_plug *plug = &per_cpu(raid5_plugs, i);
+		plug->batch.count = 0;
+		INIT_LIST_HEAD(&plug->cb.list);
+		plug->cb.callback = raid5_do_plug;
+	}
 	register_md_personality(&raid6_personality);
 	register_md_personality(&raid5_personality);
 	register_md_personality(&raid4_personality);



* [patch 7/8] raid5: raid5d handle stripe in batch way
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
                   ` (5 preceding siblings ...)
  2012-06-04  8:01 ` [patch 6/8] raid5: make_request use " Shaohua Li
@ 2012-06-04  8:01 ` Shaohua Li
  2012-06-07  1:32   ` NeilBrown
  2012-06-04  8:02 ` [patch 8/8] raid5: create multiple threads to handle stripes Shaohua Li
  7 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:01 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-raid5d-fetch-stripe-batch.patch --]
[-- Type: text/plain, Size: 1755 bytes --]

Let raid5d handle stripes in batches to reduce conf->device_lock locking.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   35 ++++++++++++++++++++++++++---------
 1 file changed, 26 insertions(+), 9 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 14:34:03.987606911 +0800
+++ linux/drivers/md/raid5.c	2012-06-01 14:49:26.388010973 +0800
@@ -4585,6 +4585,22 @@ static int  retry_aligned_read(struct r5
 	return handled;
 }
 
+static int __get_stripe_batch(struct r5conf *conf,
+		struct stripe_head_batch *batch)
+{
+	struct stripe_head *sh;
+
+	batch->count = 0;
+	do {
+		sh = __get_priority_stripe(conf);
+		if (sh) {
+			batch->stripes[batch->count] = sh;
+			batch->count++;
+		}
+	} while (sh && batch->count < MAX_STRIPE_BATCH);
+
+	return batch->count;
+}
 
 /*
  * This is our raid5 kernel thread.
@@ -4595,10 +4611,10 @@ static int  retry_aligned_read(struct r5
  */
 static void raid5d(struct mddev *mddev)
 {
-	struct stripe_head *sh;
 	struct r5conf *conf = mddev->private;
-	int handled;
+	int handled, i;
 	struct blk_plug plug;
+	struct stripe_head_batch batch;
 
 	pr_debug("+++ raid5d active\n");
 
@@ -4633,15 +4649,16 @@ static void raid5d(struct mddev *mddev)
 			handled++;
 		}
 
-		sh = __get_priority_stripe(conf);
-
-		if (!sh)
+		if (!__get_stripe_batch(conf, &batch))
 			break;
 		spin_unlock_irq(&conf->device_lock);
-		
-		handled++;
-		handle_stripe(sh);
-		release_stripe(sh);
+
+		for (i = 0; i < batch.count; i++) {
+			handled++;
+			handle_stripe(batch.stripes[i]);
+		}
+
+		release_stripe_flush_batch(&batch);
 		cond_resched();
 
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))



* [patch 8/8] raid5: create multiple threads to handle stripes
  2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
                   ` (6 preceding siblings ...)
  2012-06-04  8:01 ` [patch 7/8] raid5: raid5d handle stripe in batch way Shaohua Li
@ 2012-06-04  8:02 ` Shaohua Li
  2012-06-07  1:39   ` NeilBrown
  7 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-04  8:02 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-multiple-threads.patch --]
[-- Type: text/plain, Size: 6043 bytes --]

Like raid 1/10, raid5 uses a single thread to handle stripes. On fast storage,
this thread becomes a bottleneck. raid5 can offload calculations like checksums
to async threads, but if the storage is fast, scheduling and running the async
work introduce heavy lock contention in the workqueue, which makes that
optimization useless. And calculation isn't the only bottleneck: in my test the
raid5 thread must handle > 450k requests per second, and just doing dispatch
and completion is already more than one thread can keep up with. The only way
to scale is to use several threads to handle stripes.

With this patch, the user can create several extra threads to handle stripes.
The best thread count depends on the number of disks, so the thread number can
be changed from userspace. By default the thread number is 0, which means no
extra threads.
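
For example (assuming the array is md0), the thread count can be changed at
run time through the new sysfs file added below:

	echo 2 > /sys/block/md0/md/auxthread_number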

In a 3-disk raid5 setup, 2 extra threads can provide a 130% throughput improvement
(with stripe_cache_size doubled) and the throughput is pretty close to the
theoretical value. With >= 4 disks the improvement is even bigger, for example
200% for a 4-disk setup, but the throughput is far below the theoretical value.
That gap is caused by several factors, such as request queue lock contention,
cache issues, and the latency introduced by how a stripe is handled across the
different disks. Those factors need further investigation.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/md/raid5.h |    2 
 2 files changed, 120 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-01 15:23:56.001992200 +0800
+++ linux/drivers/md/raid5.c	2012-06-04 11:54:25.775331361 +0800
@@ -4602,6 +4602,41 @@ static int __get_stripe_batch(struct r5c
 	return batch->count;
 }
 
+static void raid5auxd(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+	struct blk_plug plug;
+	struct stripe_head_batch batch;
+	int handled, i;
+
+	pr_debug("+++ raid5auxd active\n");
+
+	blk_start_plug(&plug);
+	handled = 0;
+	spin_lock_irq(&conf->device_lock);
+	while (1) {
+		if (!__get_stripe_batch(conf, &batch))
+			break;
+		spin_unlock_irq(&conf->device_lock);
+
+		for (i = 0; i < batch.count; i++) {
+			handled++;
+			handle_stripe(batch.stripes[i]);
+		}
+
+		release_stripe_flush_batch(&batch);
+		cond_resched();
+
+		spin_lock_irq(&conf->device_lock);
+	}
+	pr_debug("%d stripes handled\n", handled);
+
+	spin_unlock_irq(&conf->device_lock);
+	blk_finish_plug(&plug);
+
+	pr_debug("--- raid5auxd inactive\n");
+}
+
 /*
  * This is our raid5 kernel thread.
  *
@@ -4651,6 +4686,8 @@ static void raid5d(struct mddev *mddev)
 
 		if (!__get_stripe_batch(conf, &batch))
 			break;
+		for (i = 0; i < conf->aux_thread_num; i++)
+			md_wakeup_thread(conf->aux_threads[i]);
 		spin_unlock_irq(&conf->device_lock);
 
 		for (i = 0; i < batch.count; i++) {
@@ -4784,10 +4821,86 @@ stripe_cache_active_show(struct mddev *m
 static struct md_sysfs_entry
 raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 
+static ssize_t
+raid5_show_auxthread_number(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	if (conf)
+		return sprintf(page, "%d\n", conf->aux_thread_num);
+	else
+		return 0;
+}
+
+static ssize_t
+raid5_store_auxthread_number(struct mddev *mddev, const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	unsigned long new;
+	int i;
+	struct md_thread **threads;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+
+	if (strict_strtoul(page, 10, &new))
+		return -EINVAL;
+
+	if (new == conf->aux_thread_num)
+		return len;
+
+	if (new > conf->aux_thread_num) {
+
+		threads = kmalloc(sizeof(struct md_thread *) * new, GFP_KERNEL);
+		if (!threads)
+			return -EFAULT;
+
+		i = conf->aux_thread_num;
+		while (i < new) {
+			char name[10];
+
+			sprintf(name, "aux%d", i);
+			threads[i] = md_register_thread(raid5auxd, mddev, name);
+			if (!threads[i])
+				goto error;
+			i++;
+		}
+		memcpy(threads, conf->aux_threads,
+			sizeof(struct md_thread *) * conf->aux_thread_num);
+		spin_lock_irq(&conf->device_lock);
+		kfree(conf->aux_threads);
+		conf->aux_threads = threads;
+		conf->aux_thread_num = new;
+		spin_unlock_irq(&conf->device_lock);
+	} else {
+		int old = conf->aux_thread_num;
+
+		spin_lock_irq(&conf->device_lock);
+		conf->aux_thread_num = new;
+		spin_unlock_irq(&conf->device_lock);
+		for (i = new; i < old; i++)
+			md_unregister_thread(&conf->aux_threads[i]);
+	}
+
+	return len;
+error:
+	while (--i >= conf->aux_thread_num)
+		md_unregister_thread(&threads[i]);
+	kfree(threads);
+	return -EFAULT;
+}
+
+static struct md_sysfs_entry
+raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
+				raid5_show_auxthread_number,
+				raid5_store_auxthread_number);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
 	&raid5_preread_bypass_threshold.attr,
+	&raid5_auxthread_number.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -4835,6 +4948,7 @@ static void raid5_free_percpu(struct r5c
 
 static void free_conf(struct r5conf *conf)
 {
+	kfree(conf->aux_threads);
 	shrink_stripes(conf);
 	raid5_free_percpu(conf);
 	kfree(conf->disks);
@@ -5391,6 +5505,10 @@ abort:
 static int stop(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
+	int i;
+
+	for (i = 0; i < conf->aux_thread_num; i++)
+		md_unregister_thread(&conf->aux_threads[i]);
 
 	md_unregister_thread(&mddev->thread);
 	if (mddev->queue)
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-01 15:23:56.017991998 +0800
+++ linux/drivers/md/raid5.h	2012-06-01 15:27:12.515521685 +0800
@@ -463,6 +463,8 @@ struct r5conf {
 	 * the new thread here until we fully activate the array.
 	 */
 	struct md_thread	*thread;
+	int			aux_thread_num;
+	struct md_thread	**aux_threads;
 };
 
 /*



* Re: [patch 4/8] raid5: reduce chance release_stripe() taking device_lock
  2012-06-04  8:01 ` [patch 4/8] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
@ 2012-06-07  0:50   ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2012-06-07  0:50 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 4933 bytes --]

On Mon, 04 Jun 2012 16:01:56 +0800 Shaohua Li <shli@kernel.org> wrote:

> release_stripe() is one place where conf->device_lock is heavily contended. We
> take the lock even when the stripe count isn't 1, which isn't required. On the
> other hand, decrementing the count first and taking the lock only when the
> count reaches 0 exposes races:
> 1. Between the decrement and taking the lock, another thread hits the stripe in
> the cache and increases the count; the stripe is then deleted from whatever
> list it was on. In this case the stripe count isn't 0.
> 2. Between the decrement and taking the lock, another thread hits the stripe in
> the cache and releases it. In this case the stripe is already on the
> appropriate list; we do a list_move to adjust its position.
> Both cases look fixable to me.

1/ Please keep this as two different entry points: one which takes the lock
   itself, and one which is called by code that already holds the lock.  i.e.
   don't add a 'locking' flag.

2/ Use "atomic_dec_and_lock" to avoid taking the lock when not needed.

So one entry point does:
  if atomic_dec_and_lock
      common code
      unlock
while the other does
  if atomic_dec_and_test
      common code
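
i.e. roughly (a sketch only; do_release_stripe() is a hypothetical name for
the common code):

	static void release_stripe(struct stripe_head *sh)
	{
		struct r5conf *conf = sh->raid_conf;
		unsigned long flags;

		local_irq_save(flags);
		if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
			do_release_stripe(conf, sh);
			spin_unlock(&conf->device_lock);
		}
		local_irq_restore(flags);
	}

	/* caller already holds device_lock */
	static void __release_stripe(struct r5conf *conf,
				     struct stripe_head *sh)
	{
		if (atomic_dec_and_test(&sh->count))
			do_release_stripe(conf, sh);
	}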

Thanks,
NeilBrown


> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   43 +++++++++++++++++++++++++++++--------------
>  1 file changed, 29 insertions(+), 14 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-01 13:50:56.336138112 +0800
> +++ linux/drivers/md/raid5.c	2012-06-01 14:03:17.062826938 +0800
> @@ -201,20 +201,39 @@ static int stripe_operations_active(stru
>  	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
>  }
>  
> -static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
> +static void __release_stripe(struct r5conf *conf, struct stripe_head *sh,
> +	int locking)
>  {
> +	unsigned long uninitialized_var(flags);
> +
>  	if (atomic_dec_and_test(&sh->count)) {
> -		BUG_ON(!list_empty(&sh->lru));
> +		/*
> +		 * Before we hold device_lock, other thread can hit this stripe
> +		 * in cache. It could do:
> +		 * 1. just get_active_stripe(). The stripe count isn't 0 then.
> +		 * 2. do get_active_stripe() and follow release_stripe(). So the
> +		 * stripe might be already released and already in specific
> +		 * list. we do list_move to adjust its position in the list.
> +		 */
> +		if (locking) {
> +			spin_lock_irqsave(&conf->device_lock, flags);
> +			if (atomic_read(&sh->count) != 0) {
> +				spin_unlock_irqrestore(&conf->device_lock,
> +							flags);
> +				return;
> +			}
> +		}
> +
>  		BUG_ON(atomic_read(&conf->active_stripes)==0);
>  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
>  			if (test_bit(STRIPE_DELAYED, &sh->state))
> -				list_add_tail(&sh->lru, &conf->delayed_list);
> +				list_move_tail(&sh->lru, &conf->delayed_list);
>  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
>  				   sh->bm_seq - conf->seq_write > 0)
> -				list_add_tail(&sh->lru, &conf->bitmap_list);
> +				list_move_tail(&sh->lru, &conf->bitmap_list);
>  			else {
>  				clear_bit(STRIPE_BIT_DELAY, &sh->state);
> -				list_add_tail(&sh->lru, &conf->handle_list);
> +				list_move_tail(&sh->lru, &conf->handle_list);
>  			}
>  			md_wakeup_thread(conf->mddev->thread);
>  		} else {
> @@ -225,23 +244,22 @@ static void __release_stripe(struct r5co
>  					md_wakeup_thread(conf->mddev->thread);
>  			atomic_dec(&conf->active_stripes);
>  			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> -				list_add_tail(&sh->lru, &conf->inactive_list);
> +				list_move_tail(&sh->lru, &conf->inactive_list);
>  				wake_up(&conf->wait_for_stripe);
>  				if (conf->retry_read_aligned)
>  					md_wakeup_thread(conf->mddev->thread);
>  			}
>  		}
> +		if (locking)
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
>  	}
>  }
>  
>  static void release_stripe(struct stripe_head *sh)
>  {
>  	struct r5conf *conf = sh->raid_conf;
> -	unsigned long flags;
>  
> -	spin_lock_irqsave(&conf->device_lock, flags);
> -	__release_stripe(conf, sh);
> -	spin_unlock_irqrestore(&conf->device_lock, flags);
> +	__release_stripe(conf, sh, 1);
>  }
>  
>  static inline void remove_hash(struct stripe_head *sh)
> @@ -484,9 +502,6 @@ get_active_stripe(struct r5conf *conf, s
>  			} else {
>  				if (!test_bit(STRIPE_HANDLE, &sh->state))
>  					atomic_inc(&conf->active_stripes);
> -				if (list_empty(&sh->lru) &&
> -				    !test_bit(STRIPE_EXPANDING, &sh->state))
> -					BUG();
>  				list_del_init(&sh->lru);
>  			}
>  		}
> @@ -3672,7 +3687,7 @@ static void activate_bit_delay(struct r5
>  		struct stripe_head *sh = list_entry(head.next, struct stripe_head, lru);
>  		list_del_init(&sh->lru);
>  		atomic_inc(&sh->count);
> -		__release_stripe(conf, sh);
> +		__release_stripe(conf, sh, 0);
>  	}
>  }
>  




* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-04  8:01 ` [patch 1/8] raid5: add a per-stripe lock Shaohua Li
@ 2012-06-07  0:54   ` NeilBrown
  2012-06-07  6:29     ` Shaohua Li
  2012-06-12 21:10   ` Dan Williams
  1 sibling, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  0:54 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 5978 bytes --]

On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:

> Add a per-stripe lock to protect stripe-specific data, like dev->read,
> dev->written, etc. The purpose is to reduce contention on conf->device_lock.

I'm not convinced that you need to add a lock.
I am convinced that if you do add one you need to explain exactly what it is
protecting.

The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
be in handle_stripe at a time.
So I don't think dev->read actually needs any protection (though I haven't
checked thoroughly).

I think the only things that device_lock protects are things shared by
multiple stripes, so adding a per-stripe spinlock isn't going to help remove
device_lock.
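
For reference, the exclusion referred to here is the bit-lock at the top of
handle_stripe(), shown schematically (existing behaviour, not new code):

	if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state))
		return;		/* another thread is handling this stripe */
	/* ... per-stripe handling runs exclusively here ... */
	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);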

Thanks,
NeilBrown


> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   17 +++++++++++++++++
>  drivers/md/raid5.h |    1 +
>  2 files changed, 18 insertions(+)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-01 13:38:54.705210229 +0800
> +++ linux/drivers/md/raid5.c	2012-06-01 13:43:05.594056130 +0800
> @@ -749,6 +749,7 @@ static void ops_complete_biofill(void *s
>  
>  	/* clear completed biofills */
>  	spin_lock_irq(&conf->device_lock);
> +	spin_lock_irq(&sh->stripe_lock);
>  	for (i = sh->disks; i--; ) {
>  		struct r5dev *dev = &sh->dev[i];
>  
> @@ -774,6 +775,7 @@ static void ops_complete_biofill(void *s
>  			}
>  		}
>  	}
> +	spin_unlock_irq(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
>  
> @@ -798,8 +800,10 @@ static void ops_run_biofill(struct strip
>  		if (test_bit(R5_Wantfill, &dev->flags)) {
>  			struct bio *rbi;
>  			spin_lock_irq(&conf->device_lock);
> +			spin_lock_irq(&sh->stripe_lock);
>  			dev->read = rbi = dev->toread;
>  			dev->toread = NULL;
> +			spin_unlock_irq(&sh->stripe_lock);
>  			spin_unlock_irq(&conf->device_lock);
>  			while (rbi && rbi->bi_sector <
>  				dev->sector + STRIPE_SECTORS) {
> @@ -1137,10 +1141,12 @@ ops_run_biodrain(struct stripe_head *sh,
>  			struct bio *wbi;
>  
>  			spin_lock_irq(&sh->raid_conf->device_lock);
> +			spin_lock_irq(&sh->stripe_lock);
>  			chosen = dev->towrite;
>  			dev->towrite = NULL;
>  			BUG_ON(dev->written);
>  			wbi = dev->written = chosen;
> +			spin_unlock_irq(&sh->stripe_lock);
>  			spin_unlock_irq(&sh->raid_conf->device_lock);
>  
>  			while (wbi && wbi->bi_sector <
> @@ -1446,6 +1452,8 @@ static int grow_one_stripe(struct r5conf
>  	init_waitqueue_head(&sh->ops.wait_for_ops);
>  	#endif
>  
> +	spin_lock_init(&sh->stripe_lock);
> +
>  	if (grow_buffers(sh)) {
>  		shrink_buffers(sh);
>  		kmem_cache_free(conf->slab_cache, sh);
> @@ -2327,6 +2335,7 @@ static int add_stripe_bio(struct stripe_
>  
>  
>  	spin_lock_irq(&conf->device_lock);
> +	spin_lock_irq(&sh->stripe_lock);
>  	if (forwrite) {
>  		bip = &sh->dev[dd_idx].towrite;
>  		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
> @@ -2360,6 +2369,7 @@ static int add_stripe_bio(struct stripe_
>  		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
>  			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
>  	}
> +	spin_unlock_irq(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> @@ -2376,6 +2386,7 @@ static int add_stripe_bio(struct stripe_
>  
>   overlap:
>  	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
> +	spin_unlock_irq(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  	return 0;
>  }
> @@ -2427,6 +2438,7 @@ handle_failed_stripe(struct r5conf *conf
>  			}
>  		}
>  		spin_lock_irq(&conf->device_lock);
> +		spin_lock_irq(&sh->stripe_lock);
>  		/* fail all writes first */
>  		bi = sh->dev[i].towrite;
>  		sh->dev[i].towrite = NULL;
> @@ -2488,6 +2500,7 @@ handle_failed_stripe(struct r5conf *conf
>  				bi = nextbi;
>  			}
>  		}
> +		spin_unlock_irq(&sh->stripe_lock);
>  		spin_unlock_irq(&conf->device_lock);
>  		if (bitmap_end)
>  			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> @@ -2695,6 +2708,7 @@ static void handle_stripe_clean_event(st
>  				int bitmap_end = 0;
>  				pr_debug("Return write for disc %d\n", i);
>  				spin_lock_irq(&conf->device_lock);
> +				spin_lock_irq(&sh->stripe_lock);
>  				wbi = dev->written;
>  				dev->written = NULL;
>  				while (wbi && wbi->bi_sector <
> @@ -2709,6 +2723,7 @@ static void handle_stripe_clean_event(st
>  				}
>  				if (dev->towrite == NULL)
>  					bitmap_end = 1;
> +				spin_unlock_irq(&sh->stripe_lock);
>  				spin_unlock_irq(&conf->device_lock);
>  				if (bitmap_end)
>  					bitmap_endwrite(conf->mddev->bitmap,
> @@ -3168,6 +3183,7 @@ static void analyse_stripe(struct stripe
>  	/* Now to look around and see what can be done */
>  	rcu_read_lock();
>  	spin_lock_irq(&conf->device_lock);
> +	spin_lock_irq(&sh->stripe_lock);
>  	for (i=disks; i--; ) {
>  		struct md_rdev *rdev;
>  		sector_t first_bad;
> @@ -3313,6 +3329,7 @@ static void analyse_stripe(struct stripe
>  				do_recovery = 1;
>  		}
>  	}
> +	spin_unlock_irq(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  	if (test_bit(STRIPE_SYNCING, &sh->state)) {
>  		/* If there is a failed device being replaced,
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h	2012-06-01 13:38:54.717210079 +0800
> +++ linux/drivers/md/raid5.h	2012-06-01 13:44:19.229127709 +0800
> @@ -210,6 +210,7 @@ struct stripe_head {
>  	int			disks;		/* disks in stripe */
>  	enum check_states	check_state;
>  	enum reconstruct_states reconstruct_state;
> +	spinlock_t		stripe_lock;
>  	/**
>  	 * struct stripe_operations
>  	 * @target - STRIPE_OP_COMPUTE_BLK target




* Re: [patch 2/8] raid5: lockless access raid5 overridden bi_phys_segments
  2012-06-04  8:01 ` [patch 2/8] raid5: lockless access raid5 overridden bi_phys_segments Shaohua Li
@ 2012-06-07  1:06   ` NeilBrown
  2012-06-12 20:41     ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  1:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 3513 bytes --]

On Mon, 04 Jun 2012 16:01:54 +0800 Shaohua Li <shli@kernel.org> wrote:

> Raid5 overrides bio->bi_phys_segments and accesses it with device_lock held,
> which is unnecessary. We can actually make it lockless.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>

I cannot say that I like this (casting fields in the bio structure), but I
can see the value and it should work.  'atomic_t' is currently always the same
size as an 'int', and I doubt that will change.

So maybe I'll get used to the idea.

I think we should take the opportunity to change the names to refer to
"active" and "processed" rather than "phys" and "hw".

Thanks,

NeilBrown


> ---
>  drivers/md/raid5.c |   37 ++++++++++++++++++++++++-------------
>  1 file changed, 24 insertions(+), 13 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-01 13:43:05.594056130 +0800
> +++ linux/drivers/md/raid5.c	2012-06-01 13:50:39.852349690 +0800
> @@ -101,32 +101,43 @@ static inline struct bio *r5_next_bio(st
>   */
>  static inline int raid5_bi_phys_segments(struct bio *bio)
>  {
> -	return bio->bi_phys_segments & 0xffff;
> +	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +	return atomic_read(segments) & 0xffff;
>  }
>  
>  static inline int raid5_bi_hw_segments(struct bio *bio)
>  {
> -	return (bio->bi_phys_segments >> 16) & 0xffff;
> +	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +	return (atomic_read(segments) >> 16) & 0xffff;
>  }
>  
>  static inline int raid5_dec_bi_phys_segments(struct bio *bio)
>  {
> -	--bio->bi_phys_segments;
> -	return raid5_bi_phys_segments(bio);
> +	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +	return atomic_sub_return(1, segments) & 0xffff;
>  }
>  
> -static inline int raid5_dec_bi_hw_segments(struct bio *bio)
> +static inline void raid5_inc_bi_phys_segments(struct bio *bio)
>  {
> -	unsigned short val = raid5_bi_hw_segments(bio);
> -
> -	--val;
> -	bio->bi_phys_segments = (val << 16) | raid5_bi_phys_segments(bio);
> -	return val;
> +	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +	atomic_inc(segments);
>  }
>  
>  static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
>  {
> -	bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
> +	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +	int old, new;
> +
> +	do {
> +		old = atomic_read(segments);
> +		new = (old & 0xffff) | (cnt << 16);
> +	} while (atomic_cmpxchg(segments, old, new) != old);
> +}
> +
> +static inline void raid5_set_bi_segments(struct bio *bio, unsigned int cnt)
> +{
> +	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +	atomic_set(segments, cnt);
>  }
>  
>  /* Find first data disk in a raid6 stripe */
> @@ -2354,7 +2365,7 @@ static int add_stripe_bio(struct stripe_
>  	if (*bip)
>  		bi->bi_next = *bip;
>  	*bip = bi;
> -	bi->bi_phys_segments++;
> +	raid5_inc_bi_phys_segments(bi);
>  
>  	if (forwrite) {
>  		/* check if page is covered */
> @@ -3783,7 +3794,7 @@ static struct bio *remove_bio_from_retry
>  		 * this sets the active strip count to 1 and the processed
>  		 * strip count to zero (upper 8 bits)
>  		 */
> -		bi->bi_phys_segments = 1; /* biased count of active stripes */
> +		raid5_set_bi_segments(bi, 1); /* biased count of active stripes */
>  	}
>  
>  	return bi;




* Re: [patch 6/8] raid5: make_request use batch stripe release
  2012-06-04  8:01 ` [patch 6/8] raid5: make_request use " Shaohua Li
@ 2012-06-07  1:23   ` NeilBrown
  2012-06-07  6:33     ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  1:23 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 3862 bytes --]

On Mon, 04 Jun 2012 16:01:58 +0800 Shaohua Li <shli@kernel.org> wrote:

> make_request() does a stripe release for every stripe, and the stripe usually
> has count 1, which defeats the previous release_stripe() optimization. In my
> test, this release_stripe() became the heaviest place taking conf->device_lock
> after the previous patches were applied.
> 
> The patch below batches stripe releases. When the maximum number of stripes in
> a batch is reached, the batch is flushed out. The batch is also flushed when
> unplug is called.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>

I like the idea of a batched release.
I don't like the per-cpu variables... and I don't think it is safe to only
allocate them for_each_present_cpu without supporting cpu-hotplug.

I would much rather keep a list of stripes (linked on ->lru) in struct
md_plug_cb (or maybe in some structure which contains that) and release them
all on unplug - and only on unplug.

Maybe pass a size to mddev_check_unplugged and have it allocate that much
extra space, and get mddev_check_unplugged to return the md_plug_cb structure.
If the extra space is freshly allocated, INIT_LIST_HEAD the list in it and
change the cb.callback to a raid5-specific function.
Then add any stripe to the md_plug_cb, and in the unplug function release
them all.

Does that make sense?
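
Something like this, perhaps (names are illustrative; allocation and
registration of the cb on plug->cb_list are omitted):

	struct raid5_plug_cb {
		struct blk_plug_cb	cb;		/* on plug->cb_list */
		struct r5conf		*conf;
		struct list_head	stripes;	/* linked on sh->lru */
	};

	static void raid5_unplug(struct blk_plug_cb *blk_cb)
	{
		struct raid5_plug_cb *rcb =
			container_of(blk_cb, struct raid5_plug_cb, cb);
		struct stripe_head *sh;

		/* one device_lock round-trip for the whole list */
		spin_lock_irq(&rcb->conf->device_lock);
		while (!list_empty(&rcb->stripes)) {
			sh = list_first_entry(&rcb->stripes,
					      struct stripe_head, lru);
			list_del_init(&sh->lru);
			__release_stripe(rcb->conf, sh, 0);
		}
		spin_unlock_irq(&rcb->conf->device_lock);
		kfree(rcb);
	}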

Also I would rather the batched stripe release code were defined in the same
patch that used it.  It isn't big enough to justify a separate patch.

Thanks,
NeilBrown



> ---
>  drivers/md/raid5.c |   42 +++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 41 insertions(+), 1 deletion(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-01 14:13:46.846909398 +0800
> +++ linux/drivers/md/raid5.c	2012-06-01 14:22:07.944611949 +0800
> @@ -4023,6 +4023,38 @@ static struct stripe_head *__get_priorit
>  	return sh;
>  }
>  
> +struct raid5_plug {
> +	struct blk_plug_cb cb;
> +	struct stripe_head_batch batch;
> +};
> +static DEFINE_PER_CPU(struct raid5_plug, raid5_plugs);
> +
> +static void raid5_do_plug(struct blk_plug_cb *cb)
> +{
> +	struct raid5_plug *plug = container_of(cb, struct raid5_plug, cb);
> +
> +	release_stripe_flush_batch(&plug->batch);
> +	INIT_LIST_HEAD(&plug->cb.list);
> +}
> +
> +static void release_stripe_plug(struct stripe_head *sh)
> +{
> +	struct blk_plug *plug = current->plug;
> +	struct raid5_plug *raid5_plug;
> +
> +	if (!plug) {
> +		release_stripe(sh);
> +		return;
> +	}
> +	preempt_disable();
> +	raid5_plug = &__raw_get_cpu_var(raid5_plugs);
> +	release_stripe_add_batch(&raid5_plug->batch, sh);
> +
> +	if (list_empty(&raid5_plug->cb.list))
> +		list_add(&raid5_plug->cb.list, &plug->cb_list);
> +	preempt_enable();
> +}
> +
>  static void make_request(struct mddev *mddev, struct bio * bi)
>  {
>  	struct r5conf *conf = mddev->private;
> @@ -4153,7 +4185,7 @@ static void make_request(struct mddev *m
>  			if ((bi->bi_rw & REQ_SYNC) &&
>  			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				atomic_inc(&conf->preread_active_stripes);
> -			release_stripe(sh);
> +			release_stripe_plug(sh);
>  		} else {
>  			/* cannot get stripe for read-ahead, just give-up */
>  			clear_bit(BIO_UPTODATE, &bi->bi_flags);
> @@ -6170,6 +6202,14 @@ static struct md_personality raid4_perso
>  
>  static int __init raid5_init(void)
>  {
> +	int i;
> +
> +	for_each_present_cpu(i) {
> +		struct raid5_plug *plug = &per_cpu(raid5_plugs, i);
> +		plug->batch.count = 0;
> +		INIT_LIST_HEAD(&plug->cb.list);
> +		plug->cb.callback = raid5_do_plug;
> +	}
>  	register_md_personality(&raid6_personality);
>  	register_md_personality(&raid5_personality);
>  	register_md_personality(&raid4_personality);


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 7/8] raid5: raid5d handle stripe in batch way
  2012-06-04  8:01 ` [patch 7/8] raid5: raid5d handle stripe in batch way Shaohua Li
@ 2012-06-07  1:32   ` NeilBrown
  2012-06-07  6:35     ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  1:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 2708 bytes --]

On Mon, 04 Jun 2012 16:01:59 +0800 Shaohua Li <shli@kernel.org> wrote:

> Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>

I like this.
I don't think it justifies a separate function.

#define MAX_STRIPE_BATCH 8
struct stripe_head *batch[MAX_STRIPE_BATCH];
int batch_size = 0;

...

while (batch_size < MAX_STRIPE_BATCH &&
       (sh = __get_priority_stripe(conf)) != NULL)
     batch[batch_size++] = sh;

spin_unlock_irq(&conf->device_lock);
if (batch_size == 0)
     break;

handled += batch_size;

for (i = 0; i < batch_size; i++)
     handle_stripe(batch[i]);
cond_resched();
if (....) md_check_recovery(mddev);

spin_lock_irq(&conf->device_lock);
for (i = 0; i < batch_size; i++)
     __release_stripe(conf, batch[i]);


something like that?

Thanks,
NeilBrown



> ---
>  drivers/md/raid5.c |   35 ++++++++++++++++++++++++++---------
>  1 file changed, 26 insertions(+), 9 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-01 14:34:03.987606911 +0800
> +++ linux/drivers/md/raid5.c	2012-06-01 14:49:26.388010973 +0800
> @@ -4585,6 +4585,22 @@ static int  retry_aligned_read(struct r5
>  	return handled;
>  }
>  
> +static int __get_stripe_batch(struct r5conf *conf,
> +		struct stripe_head_batch *batch)
> +{
> +	struct stripe_head *sh;
> +
> +	batch->count = 0;
> +	do {
> +		sh = __get_priority_stripe(conf);
> +		if (sh) {
> +			batch->stripes[batch->count] = sh;
> +			batch->count++;
> +		}
> +	} while (sh && batch->count < MAX_STRIPE_BATCH);
> +
> +	return batch->count;
> +}
>  
>  /*
>   * This is our raid5 kernel thread.
> @@ -4595,10 +4611,10 @@ static int  retry_aligned_read(struct r5
>   */
>  static void raid5d(struct mddev *mddev)
>  {
> -	struct stripe_head *sh;
>  	struct r5conf *conf = mddev->private;
> -	int handled;
> +	int handled, i;
>  	struct blk_plug plug;
> +	struct stripe_head_batch batch;
>  
>  	pr_debug("+++ raid5d active\n");
>  
> @@ -4633,15 +4649,16 @@ static void raid5d(struct mddev *mddev)
>  			handled++;
>  		}
>  
> -		sh = __get_priority_stripe(conf);
> -
> -		if (!sh)
> +		if (!__get_stripe_batch(conf, &batch))
>  			break;
>  		spin_unlock_irq(&conf->device_lock);
> -		
> -		handled++;
> -		handle_stripe(sh);
> -		release_stripe(sh);
> +
> +		for (i = 0; i < batch.count; i++) {
> +			handled++;
> +			handle_stripe(batch.stripes[i]);
> +		}
> +
> +		release_stripe_flush_batch(&batch);
>  		cond_resched();
>  
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 8/8] raid5: create multiple threads to handle stripes
  2012-06-04  8:02 ` [patch 8/8] raid5: create multiple threads to handle stripes Shaohua Li
@ 2012-06-07  1:39   ` NeilBrown
  2012-06-07  6:45     ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  1:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 7335 bytes --]

On Mon, 04 Jun 2012 16:02:00 +0800 Shaohua Li <shli@kernel.org> wrote:

> Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, the
> thread becomes a bottleneck. raid5 can offload calculation like checksum to
> async threads, but if the storage is fast, scheduling and running the async
> work introduces heavy lock contention in the workqueue, which makes such
> optimization useless. And calculation isn't the only bottleneck: for example,
> in my test the raid5 thread must handle > 450k requests per second, and just
> doing dispatch and completion is enough to saturate it. The only way to scale
> is to use several threads to handle stripes.
> 
> With this patch, the user can create several extra threads to handle stripes.
> How many threads are best depends on the number of disks, so the thread count
> can be changed from userspace. By default, the thread number is 0, which
> means no extra threads.
> 
> In a 3-disk raid5 setup, 2 extra threads provide a 130% throughput
> improvement (with double stripe_cache_size), and the throughput is pretty
> close to the theoretical value. With >=4 disks the improvement is even
> bigger, for example 200% for a 4-disk setup, but the throughput is far below
> the theoretical value, which is caused by several factors like request queue
> lock contention, cache issues, and the latency introduced by how a stripe is
> handled on different disks. Those factors need further investigation.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>

I think it is great that you have got RAID5 to the point where multiple
threads improve performance.
I really don't like the idea of having to configure that number of threads.

It would be great if it would auto-configure.
Maybe the main thread could fork aux threads when it notices a high load.
e.g. if it has been servicing requests for more than 100ms without a break,
and the number of threads is less than the number of CPUs, then it forks a new
helper and resets the timer.

If a thread has been idle for more than 30 minutes, it exits.

Might that be reasonable?
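
Made concrete, the heuristic could look something like this (the busy_since
field, the thresholds, and the helper itself are all hypothetical):

/* hypothetical: called from raid5d's main loop while it is busy;
 * conf->busy_since would be reset whenever raid5d goes idle, and
 * conf->aux_threads is assumed to have spare capacity */
static void raid5_maybe_fork_helper(struct r5conf *conf)
{
	if (time_before(jiffies, conf->busy_since + msecs_to_jiffies(100)))
		return;				/* not busy for long enough */
	if (conf->aux_thread_num >= num_online_cpus())
		return;				/* one thread per CPU max */

	conf->aux_threads[conf->aux_thread_num] =
		md_register_thread(raid5auxd, conf->mddev, "aux");
	if (conf->aux_threads[conf->aux_thread_num])
		conf->aux_thread_num++;
	conf->busy_since = jiffies;		/* reset the timer */
}

The 30-minute idle exit would be the mirror image on the raid5auxd side.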

Thanks,
NeilBrown

> ---
>  drivers/md/raid5.c |  118 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/md/raid5.h |    2 
>  2 files changed, 120 insertions(+)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-01 15:23:56.001992200 +0800
> +++ linux/drivers/md/raid5.c	2012-06-04 11:54:25.775331361 +0800
> @@ -4602,6 +4602,41 @@ static int __get_stripe_batch(struct r5c
>  	return batch->count;
>  }
>  
> +static void raid5auxd(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct blk_plug plug;
> +	struct stripe_head_batch batch;
> +	int handled, i;
> +
> +	pr_debug("+++ raid5auxd active\n");
> +
> +	blk_start_plug(&plug);
> +	handled = 0;
> +	spin_lock_irq(&conf->device_lock);
> +	while (1) {
> +		if (!__get_stripe_batch(conf, &batch))
> +			break;
> +		spin_unlock_irq(&conf->device_lock);
> +
> +		for (i = 0; i < batch.count; i++) {
> +			handled++;
> +			handle_stripe(batch.stripes[i]);
> +		}
> +
> +		release_stripe_flush_batch(&batch);
> +		cond_resched();
> +
> +		spin_lock_irq(&conf->device_lock);
> +	}
> +	pr_debug("%d stripes handled\n", handled);
> +
> +	spin_unlock_irq(&conf->device_lock);
> +	blk_finish_plug(&plug);
> +
> +	pr_debug("--- raid5auxd inactive\n");
> +}
> +
>  /*
>   * This is our raid5 kernel thread.
>   *
> @@ -4651,6 +4686,8 @@ static void raid5d(struct mddev *mddev)
>  
>  		if (!__get_stripe_batch(conf, &batch))
>  			break;
> +		for (i = 0; i < conf->aux_thread_num; i++)
> +			md_wakeup_thread(conf->aux_threads[i]);
>  		spin_unlock_irq(&conf->device_lock);
>  
>  		for (i = 0; i < batch.count; i++) {
> @@ -4784,10 +4821,86 @@ stripe_cache_active_show(struct mddev *m
>  static struct md_sysfs_entry
>  raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
>  
> +static ssize_t
> +raid5_show_auxthread_number(struct mddev *mddev, char *page)
> +{
> +	struct r5conf *conf = mddev->private;
> +	if (conf)
> +		return sprintf(page, "%d\n", conf->aux_thread_num);
> +	else
> +		return 0;
> +}
> +
> +static ssize_t
> +raid5_store_auxthread_number(struct mddev *mddev, const char *page, size_t len)
> +{
> +	struct r5conf *conf = mddev->private;
> +	unsigned long new;
> +	int i;
> +	struct md_thread **threads;
> +
> +	if (len >= PAGE_SIZE)
> +		return -EINVAL;
> +	if (!conf)
> +		return -ENODEV;
> +
> +	if (strict_strtoul(page, 10, &new))
> +		return -EINVAL;
> +
> +	if (new == conf->aux_thread_num)
> +		return len;
> +
> +	if (new > conf->aux_thread_num) {
> +
> +		threads = kmalloc(sizeof(struct md_thread *) * new, GFP_KERNEL);
> +		if (!threads)
> +			return -EFAULT;
> +
> +		i = conf->aux_thread_num;
> +		while (i < new) {
> +			char name[10];
> +
> +			sprintf(name, "aux%d", i);
> +			threads[i] = md_register_thread(raid5auxd, mddev, name);
> +			if (!threads[i])
> +				goto error;
> +			i++;
> +		}
> +		memcpy(threads, conf->aux_threads,
> +			sizeof(struct md_thread *) * conf->aux_thread_num);
> +		spin_lock_irq(&conf->device_lock);
> +		kfree(conf->aux_threads);
> +		conf->aux_threads = threads;
> +		conf->aux_thread_num = new;
> +		spin_unlock_irq(&conf->device_lock);
> +	} else {
> +		int old = conf->aux_thread_num;
> +
> +		spin_lock_irq(&conf->device_lock);
> +		conf->aux_thread_num = new;
> +		spin_unlock_irq(&conf->device_lock);
> +		for (i = new; i < old; i++)
> +			md_unregister_thread(&conf->aux_threads[i]);
> +	}
> +
> +	return len;
> +error:
> +	while (--i >= conf->aux_thread_num)
> +		md_unregister_thread(&threads[i]);
> +	kfree(threads);
> +	return -EFAULT;
> +}
> +
> +static struct md_sysfs_entry
> +raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
> +				raid5_show_auxthread_number,
> +				raid5_store_auxthread_number);
> +
>  static struct attribute *raid5_attrs[] =  {
>  	&raid5_stripecache_size.attr,
>  	&raid5_stripecache_active.attr,
>  	&raid5_preread_bypass_threshold.attr,
> +	&raid5_auxthread_number.attr,
>  	NULL,
>  };
>  static struct attribute_group raid5_attrs_group = {
> @@ -4835,6 +4948,7 @@ static void raid5_free_percpu(struct r5c
>  
>  static void free_conf(struct r5conf *conf)
>  {
> +	kfree(conf->aux_threads);
>  	shrink_stripes(conf);
>  	raid5_free_percpu(conf);
>  	kfree(conf->disks);
> @@ -5391,6 +5505,10 @@ abort:
>  static int stop(struct mddev *mddev)
>  {
>  	struct r5conf *conf = mddev->private;
> +	int i;
> +
> +	for (i = 0; i < conf->aux_thread_num; i++)
> +		md_unregister_thread(&conf->aux_threads[i]);
>  
>  	md_unregister_thread(&mddev->thread);
>  	if (mddev->queue)
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h	2012-06-01 15:23:56.017991998 +0800
> +++ linux/drivers/md/raid5.h	2012-06-01 15:27:12.515521685 +0800
> @@ -463,6 +463,8 @@ struct r5conf {
>  	 * the new thread here until we fully activate the array.
>  	 */
>  	struct md_thread	*thread;
> +	int			aux_thread_num;
> +	struct md_thread	**aux_threads;
>  };
>  
>  /*


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-07  0:54   ` NeilBrown
@ 2012-06-07  6:29     ` Shaohua Li
  2012-06-07  6:35       ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-07  6:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 10:54:10AM +1000, NeilBrown wrote:
> On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > Add a per-stripe lock to protect stripe specific data, like dev->read,
> > written, ... The purpose is to reduce lock contention of conf->device_lock.
> 
> I'm not convinced that you need to add a lock.
> I am convinced that if you do add one you need to explain exactly what it is
> protecting.
> 
> The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
> be in handle_stripe at a time.
> So I don't think dev->read actually needs any protection (though I haven't
> checked thoroughly).
> 
> I think the only things that device_lock protects are things shared by
> multiple stripes, so adding a per-stripe spinlock isn't going to help remove
> device_lock.

This doesn't sound right to me. Both the async callbacks and request
completion access stripe data, like dev->read. Such things are not protected
by the STRIPE_ACTIVE bit. Though we could delete the STRIPE_ACTIVE bit with
the stripe lock introduced.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 6/8] raid5: make_request use batch stripe release
  2012-06-07  1:23   ` NeilBrown
@ 2012-06-07  6:33     ` Shaohua Li
  2012-06-07  7:33       ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-07  6:33 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 11:23:45AM +1000, NeilBrown wrote:
> On Mon, 04 Jun 2012 16:01:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > make_request() does a stripe release for every stripe, and the stripe usually
> > has count 1, which makes the previous release_stripe() optimization not work.
> > In my test, this release_stripe() became the heaviest place to take
> > conf->device_lock after the previous patches were applied.
> > 
> > This patch batches stripe release: when the maximum number of stripes in a
> > batch is reached, the batch is flushed out. The flush also happens when
> > unplug is called.
> > 
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> 
> I like the idea of a batched release.
> I don't like the per-cpu variables... and I don't think it is safe to only
> allocate them for_each_present_cpu without supporting cpu-hotplug.
> 
> I would much rather keep a list of stripes (linked on ->lru) in struct
> md_plug_cb (or maybe in some structure which contains that) and release them
> all on unplug - and only on unplug.
> 
> Maybe pass a size to mddev_check_unplugged, and it allocates that much more
> space.  Get mddev_check_unplugged to return the md_plug_cb structure.
> If the new space is NULL, then list_head_init it, and change the cb.callback
> to a raid5 specific function.
> Then add any stripe to the md_plug_cb, and in the unplug function, release
> them all.
> 
> Does that make sense?
> 
> Also I would rather the batched stripe release code were defined in the same
> patch that used it.  It isn't big enough to justify a separate patch.

stripe->lru needs the protection of device_lock, so I can't use a list; an
array is preferred. I really don't like the idea of allocating memory,
especially for an array. I'll fix the code for cpu-hotplug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-07  6:29     ` Shaohua Li
@ 2012-06-07  6:35       ` NeilBrown
  2012-06-07  6:52         ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  6:35 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 1324 bytes --]

On Thu, 7 Jun 2012 14:29:39 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, Jun 07, 2012 at 10:54:10AM +1000, NeilBrown wrote:
> > On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> > > Add a per-stripe lock to protect stripe specific data, like dev->read,
> > > written, ... The purpose is to reduce lock contention of conf->device_lock.
> > 
> > I'm not convinced that you need to add a lock.
> > I am convinced that if you do add one you need to explain exactly what it is
> > protecting.
> > 
> > The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
> > be in handle_stripe at a time.
> > So I don't think dev->read actually needs any protection (though I haven't
> > checked thoroughly).
> > 
> > I think the only things that device_lock protects are things shared by
> > multiple stripes, so adding a per-stripe spinlock isn't going to help remove
> > device_lock.
> 
> This doesn't sound right to me. Both the async callbacks and request
> completion access stripe data, like dev->read. Such things are not protected
> by the STRIPE_ACTIVE bit. Though we could delete the STRIPE_ACTIVE bit with
> the stripe lock introduced.

Please give specifics.  What race do you see with access to dev->read that is
not protected by STRIPE_ACTIVE ?

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 7/8] raid5: raid5d handle stripe in batch way
  2012-06-07  1:32   ` NeilBrown
@ 2012-06-07  6:35     ` Shaohua Li
  2012-06-07  7:38       ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-07  6:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 11:32:39AM +1000, NeilBrown wrote:
> On Mon, 04 Jun 2012 16:01:59 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
> > 
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> 
> I like this.
> I don't think it justifies a separate function.
> 
> #define MAX_STRIPE_BATCH 8
> struct stripe_head *batch[MAX_STRIPE_BATCH];
> int batch_size = 0;
> 
> ...
> 
> while (batch_size < MAX_STRIPE_BATCH &&
>        (sh = __get_priority_stripe(conf)) != NULL)
>      batch[batch_size++] = sh;
> 
> spin_unlock_irq(&conf->device_lock);
> if (batch_size == 0)
>      break;
> 
> handled += batch_size;
> 
> for (i = 0; i < batch_size; i++)
>      handle_stripe(batch[i]);
> cond_resched();
> if (....) md_check_recovery(mddev);
> 
> spin_lock_irq(&conf->device_lock);
> for (i = 0; i < batch_size; i++)
>      __release_stripe(conf, batch[i]);
> 
> 
> something like that?

the 8th patch does the same thing, so I moved the code to a separate function.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 8/8] raid5: create multiple threads to handle stripes
  2012-06-07  1:39   ` NeilBrown
@ 2012-06-07  6:45     ` Shaohua Li
  2012-06-13  4:08       ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-07  6:45 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 11:39:58AM +1000, NeilBrown wrote:
> On Mon, 04 Jun 2012 16:02:00 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, the
> > thread becomes a bottleneck. raid5 can offload calculation like checksum to
> > async threads, but if the storage is fast, scheduling and running the async
> > work introduces heavy lock contention in the workqueue, which makes such
> > optimization useless. And calculation isn't the only bottleneck: for example,
> > in my test the raid5 thread must handle > 450k requests per second, and just
> > doing dispatch and completion is enough to saturate it. The only way to scale
> > is to use several threads to handle stripes.
> > 
> > With this patch, the user can create several extra threads to handle stripes.
> > How many threads are best depends on the number of disks, so the thread count
> > can be changed from userspace. By default, the thread number is 0, which
> > means no extra threads.
> > 
> > In a 3-disk raid5 setup, 2 extra threads provide a 130% throughput
> > improvement (with double stripe_cache_size), and the throughput is pretty
> > close to the theoretical value. With >=4 disks the improvement is even
> > bigger, for example 200% for a 4-disk setup, but the throughput is far below
> > the theoretical value, which is caused by several factors like request queue
> > lock contention, cache issues, and the latency introduced by how a stripe is
> > handled on different disks. Those factors need further investigation.
> > 
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> 
> I think it is great that you have got RAID5 to the point where multiple
> threads improve performance.
> I really don't like the idea of having to configure that number of threads.
> 
> It would be great if it would auto-configure.
> Maybe the main thread could fork aux threads when it notices a high load.
> e.g. if it has been servicing requests for more than 100ms without a break,
> and the number of threads is less than the number of CPUs, then it forks a new
> helper and resets the timer.
> 
> If a thread has been idle for more than 30 minutes, it exits.
> 
> Might that be reasonable?

Yep, I bet this patch needs more discussion; auto-configure is preferred, and
your idea is worth doing. However, the concern is that with automatic thread
fork/kill the user can't do NUMA binding, which is important for high-speed
storage. Maybe have a reasonable default thread number, like one thread per
disk? This needs more investigation; I'm open to any suggestion on this side.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-07  6:35       ` NeilBrown
@ 2012-06-07  6:52         ` Shaohua Li
  2012-06-12 21:02           ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-07  6:52 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 04:35:22PM +1000, NeilBrown wrote:
> On Thu, 7 Jun 2012 14:29:39 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > On Thu, Jun 07, 2012 at 10:54:10AM +1000, NeilBrown wrote:
> > > On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > 
> > > > Add a per-stripe lock to protect stripe specific data, like dev->read,
> > > > written, ... The purpose is to reduce lock contention of conf->device_lock.
> > > 
> > > I'm not convinced that you need to add a lock.
> > > I am convinced that if you do add one you need to explain exactly what it is
> > > protecting.
> > > 
> > > The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
> > > be in handle_stripe at a time.
> > > So I don't think dev->read actually needs any protection (though I haven't
> > > checked thoroughly).
> > > 
> > > I think the only things that device_lock protects are things shared by
> > > multiple stripes, so adding a per-stripe spinlock isn't going to help remove
> > > device_lock.
> > 
> > This doesn't sound right to me. Both the async callbacks and request
> > completion access stripe data, like dev->read. Such things are not protected
> > by the STRIPE_ACTIVE bit. Though we could delete the STRIPE_ACTIVE bit with
> > the stripe lock introduced.
> 
> Please give specifics.  What race do you see with access to dev->read that is
> not protected by STRIPE_ACTIVE ?

For example, ops_complete_biofill() will change dev->read, which isn't
protected by STRIPE_ACTIVE. add_stripe_bio() checks ->toread and ->towrite,
which aren't protected by the bit either. Am I missing anything?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 6/8] raid5: make_request use batch stripe release
  2012-06-07  6:33     ` Shaohua Li
@ 2012-06-07  7:33       ` NeilBrown
  2012-06-07  7:58         ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: NeilBrown @ 2012-06-07  7:33 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 2675 bytes --]

On Thu, 7 Jun 2012 14:33:58 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, Jun 07, 2012 at 11:23:45AM +1000, NeilBrown wrote:
> > On Mon, 04 Jun 2012 16:01:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> > > make_request() does a stripe release for every stripe, and the stripe usually
> > > has count 1, which makes the previous release_stripe() optimization not work.
> > > In my test, this release_stripe() became the heaviest place to take
> > > conf->device_lock after the previous patches were applied.
> > > 
> > > This patch batches stripe release: when the maximum number of stripes in a
> > > batch is reached, the batch is flushed out. The flush also happens when
> > > unplug is called.
> > > 
> > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > 
> > I like the idea of a batched release.
> > I don't like the per-cpu variables... and I don't think it is safe to only
> > allocate them for_each_present_cpu without supporting cpu-hotplug.
> > 
> > I would much rather keep a list of stripes (linked on ->lru) in struct
> > md_plug_cb (or maybe in some structure which contains that) and release them
> > all on unplug - and only on unplug.
> > 
> > Maybe pass a size to mddev_check_unplugged, and it allocates that much more
> > space.  Get mddev_check_unplugged to return the md_plug_cb structure.
> > If the new space is NULL, then list_head_init it, and change the cb.callback
> > to a raid5 specific function.
> > Then add any stripe to the md_plug_cb, and in the unplug function, release
> > them all.
> > 
> > Does that make sense?
> > 
> > Also I would rather the batched stripe release code were defined in the same
> > patch that used it.  It isn't big enough to justify a separate patch.
> 
> stripe->lru needs the protection of device_lock, so I can't use a list; an
> array is preferred. I really don't like the idea of allocating memory,
> especially for an array. I'll fix the code for cpu-hotplug.

You don't need device_lock to use ->lru.
Currently the lru is not used when sh->count is non-zero unless
STRIPE_EXPANDING is set - and we never attach IO requests if STRIPE_EXPANDING
is set.
So when make_request wants to release a stripe_head, ->lru is currently
unused.
So we can use it to put the stripe on a per-thread list without locking.

We need another stripe_head flag to say "is on a per-thread unplug list" to
avoid racing between processes, but we don't need a spinlock for that.
ie.
  if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
           list_add(&sh->lru, &plug->list);

or similar.

Please don't waste time on cpuhotplug - it isn't the right solution.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 7/8] raid5: raid5d handle stripe in batch way
  2012-06-07  6:35     ` Shaohua Li
@ 2012-06-07  7:38       ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2012-06-07  7:38 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 1693 bytes --]

On Thu, 7 Jun 2012 14:35:47 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, Jun 07, 2012 at 11:32:39AM +1000, NeilBrown wrote:
> > On Mon, 04 Jun 2012 16:01:59 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> > > Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
> > > 
> > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > 
> > I like this.
> > I don't think it justifies a separate function.
> > 
> > #define MAX_STRIPE_BATCH 8
> > struct stripe_head *batch[MAX_STRIPE_BATCH];
> > int batch_size = 0;
> > 
> > ...
> > 
> > while (batch_size < MAX_STRIPE_BATCH &&
> >        (sh = __get_priority_stripe(conf)) != NULL)
> >      batch[batch_size++] = sh;
> > 
> > spin_unlock_irq(&conf->device_lock);
> > if (batch_size == 0)
> >      break;
> > 
> > handled += batch_size;
> > 
> > for (i = 0; i < batch_size; i++)
> >      handle_stripe(batch[i]);
> > cond_resched();
> > if (....) md_check_recovery(mddev);
> > 
> > spin_lock_irq(&conf->device_lock);
> > for (i = 0; i < batch_size; i++)
> >      __release_stripe(conf, batch[i]);
> > 
> > 
> > something like that?
> 
> the 8th patch does the same thing, so I moved the code to a separate function.

The 8th patch should instead move all of the above into a separate function,
then call it both from raid5d and raid5auxd.
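
That shared helper might look roughly like this (a sketch only; it assumes it
is called with device_lock held and returns with it still held):

static int handle_active_stripes(struct r5conf *conf)
{
	struct stripe_head *batch[MAX_STRIPE_BATCH];
	int i, batch_size = 0;

	/* fill the batch under device_lock */
	while (batch_size < MAX_STRIPE_BATCH &&
	       (batch[batch_size] = __get_priority_stripe(conf)) != NULL)
		batch_size++;
	if (batch_size == 0)
		return 0;
	spin_unlock_irq(&conf->device_lock);

	/* handle the batch without the lock */
	for (i = 0; i < batch_size; i++)
		handle_stripe(batch[i]);
	cond_resched();

	/* release the batch under the lock again */
	spin_lock_irq(&conf->device_lock);
	for (i = 0; i < batch_size; i++)
		__release_stripe(conf, batch[i]);
	return batch_size;
}

raid5d would then wrap this in its existing loop, adding the return value to
handled and doing the md_check_recovery check between batches.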
Maybe keep the md_check_recovery bit separate; in raid5d it would look like
  if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
      spin_unlock_irq(&conf->device_lock);
      md_check_recovery(mddev);
      spin_lock_irq(&conf->device_lock);
  }

Having the 
   fill the batch
   handle the batch
   release the batch
all open coded in the one place significantly aids readability.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 6/8] raid5: make_request use batch stripe release
  2012-06-07  7:33       ` NeilBrown
@ 2012-06-07  7:58         ` Shaohua Li
  2012-06-08  6:16           ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-07  7:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 05:33:10PM +1000, NeilBrown wrote:
> On Thu, 7 Jun 2012 14:33:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > On Thu, Jun 07, 2012 at 11:23:45AM +1000, NeilBrown wrote:
> > > On Mon, 04 Jun 2012 16:01:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > 
> > > > make_request() does a stripe release for every stripe, and the stripe usually
> > > > has count 1, which makes the previous release_stripe() optimization not work.
> > > > In my test, this release_stripe() became the heaviest place to take
> > > > conf->device_lock after the previous patches were applied.
> > > > 
> > > > This patch batches stripe release: when the maximum number of stripes in a
> > > > batch is reached, the batch is flushed out. The flush also happens when
> > > > unplug is called.
> > > > 
> > > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > 
> > > I like the idea of a batched release.
> > > I don't like the per-cpu variables... and I don't think it is safe to only
> > > allocate them for_each_present_cpu without supporting cpu-hotplug.
> > > 
> > > I would much rather keep a list of stripes (linked on ->lru) in struct
> > > md_plug_cb (or maybe in some structure which contains that) and release them
> > > all on unplug - and only on unplug.
> > > 
> > > Maybe pass a size to mddev_check_unplugged, and it allocates that much more
> > > space.  Get mddev_check_unplugged to return the md_plug_cb structure.
> > > If the new space is NULL, then list_head_init it, and change the cb.callback
> > > to a raid5 specific function.
> > > Then add any stripe to the md_plug_cb, and in the unplug function, release
> > > them all.
> > > 
> > > Does that make sense?
> > > 
> > > Also I would rather the batched stripe release code were defined in the same
> > > patch that used it.  It isn't big enough to justify a separate patch.
> > 
> > stripe->lru needs the protection of device_lock, so I can't use a list; an
> > array is preferred. I really don't like the idea of allocating memory,
> > especially for an array. I'll fix the code for cpu-hotplug.
> 
> You don't need device_lock to use ->lru.
> Currently the lru is not used when sh->count is non-zero unless
> STRIPE_EXPANDING is set - and we never attach IO requests if STRIPE_EXPANDING
> is set.
> So when make_request wants to release a stripe_head, ->lru is currently
> unused.
> So we can use it to put the stripe on a per-thread list without locking.
> 
> We need another stripe_head flag to say "is on a per-thread unplug list" to
> avoid racing between processes, but we don't need a spinlock for that.
> ie.
>   if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
>            list_add(&sh->lru, &plug->list);
> 
> or similar.

I did see some BUG_ONs trigger when I accessed ->lru without device_lock held
before; for example, get_active_stripe() will remove it from the list. Maybe
the same bit can be used to avoid that. Let me try.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 6/8] raid5: make_request use batch stripe release
  2012-06-07  7:58         ` Shaohua Li
@ 2012-06-08  6:16           ` Shaohua Li
  2012-06-08  6:42             ` NeilBrown
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-08  6:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 07, 2012 at 03:58:16PM +0800, Shaohua Li wrote:
> On Thu, Jun 07, 2012 at 05:33:10PM +1000, NeilBrown wrote:
> > On Thu, 7 Jun 2012 14:33:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> > > On Thu, Jun 07, 2012 at 11:23:45AM +1000, NeilBrown wrote:
> > > > On Mon, 04 Jun 2012 16:01:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > > 
> > > > > make_request() does a stripe release for every stripe, and the stripe usually
> > > > > has count 1, which makes the previous release_stripe() optimization not work.
> > > > > In my test, this release_stripe() became the heaviest place to take
> > > > > conf->device_lock after the previous patches were applied.
> > > > > 
> > > > > This patch batches stripe release: when the maximum number of stripes in a
> > > > > batch is reached, the batch is flushed out. The flush also happens when
> > > > > unplug is called.
> > > > > 
> > > > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > > 
> > > > I like the idea of a batched release.
> > > > I don't like the per-cpu variables... and I don't think it is safe to only
> > > > allocate them for_each_present_cpu without supporting cpu-hotplug.
> > > > 
> > > > I would much rather keep a list of stripes (linked on ->lru) in struct
> > > > md_plug_cb (or maybe in some structure which contains that) and release them
> > > > all on unplug - and only on unplug.
> > > > 
> > > > Maybe pass a size to mddev_check_unplugged, and it allocates that much more
> > > > space.  Get mddev_check_unplugged to return the md_plug_cb structure.
> > > > If the new space is NULL, then list_head_init it, and change the cb.callback
> > > > to a raid5 specific function.
> > > > Then add any stripe to the md_plug_cb, and in the unplug function, release
> > > > them all.
> > > > 
> > > > Does that make sense?
> > > > 
> > > > Also I would rather the batched stripe release code were defined in the same
> > > > patch that used it.  It isn't big enough to justify a separate patch.
> > > 
> > > stripe->lru needs the protection of device_lock, so I can't use a list; an
> > > array is preferred. I really don't like the idea of allocating memory,
> > > especially for an array. I'll fix the code for cpu-hotplug.
> > 
> > You don't need device_lock to use ->lru.
> > Currently the lru is not used when sh->count is non-zero unless
> > STRIPE_EXPANDING is set - and we never attach IO requests if STRIPE_EXPANDING
> > is set.
> > So when make_request wants to release a stripe_head, ->lru is currently
> > unused.
> > So we can use it to put the stripe on a per-thread list without locking.
> > 
> > We need another stripe_head flag to say "is on a per-thread unplug list" to
> > avoid racing between processes, but we don't need a spinlock for that.
> > ie.
> >   if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
> >            list_add(&sh->lru, &plug->list);
> > 
> > or similar.
> 
> I did see some BUG_ONs trigger when I accessed ->lru without device_lock held
> before; for example, get_active_stripe() will remove it from the list. Maybe
> the same bit can be used to avoid that. Let me try.

Thinking a bit more, the STRIPE_ON_UNPLUG_LIST bit can't avoid races. For
example, task 1 hits a stripe; assume the stripe count is 0 (it could also be
non-zero). It does:
1. inc count
2. set STRIPE_ON_UNPLUG_LIST
3. add stripe to plug list
4. unplug to release the stripe
Between 3 and 4, task 2 hits the stripe. It does:
A: inc count. Since the bit is set, do nothing more
B: unplug
If the order is 3, A, 4, B, task 1 will not release the stripe, since the
count is 2, and task 2 will not release the stripe, since the stripe isn't on
its list. The stripe will never be handled.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 6/8] raid5: make_request use batch stripe release
  2012-06-08  6:16           ` Shaohua Li
@ 2012-06-08  6:42             ` NeilBrown
  0 siblings, 0 replies; 34+ messages in thread
From: NeilBrown @ 2012-06-08  6:42 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 4532 bytes --]

On Fri, 8 Jun 2012 14:16:57 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, Jun 07, 2012 at 03:58:16PM +0800, Shaohua Li wrote:
> > On Thu, Jun 07, 2012 at 05:33:10PM +1000, NeilBrown wrote:
> > > On Thu, 7 Jun 2012 14:33:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > 
> > > > On Thu, Jun 07, 2012 at 11:23:45AM +1000, NeilBrown wrote:
> > > > > On Mon, 04 Jun 2012 16:01:58 +0800 Shaohua Li <shli@kernel.org> wrote:
> > > > > 
> > > > > > make_request() does a stripe release for every stripe, and the stripe usually
> > > > > > has count 1, which makes the previous release_stripe() optimization not work.
> > > > > > In my test, this release_stripe() became the heaviest place to take
> > > > > > conf->device_lock after the previous patches were applied.
> > > > > > 
> > > > > > This patch batches stripe release: when the maximum number of stripes in a
> > > > > > batch is reached, the batch is flushed out. The flush also happens when
> > > > > > unplug is called.
> > > > > > 
> > > > > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > > > 
> > > > > I like the idea of a batched release.
> > > > > I don't like the per-cpu variables... and I don't think it is safe to only
> > > > > allocate them for_each_present_cpu without supporting cpu-hotplug.
> > > > > 
> > > > > I would much rather keep a list of stripes (linked on ->lru) in struct
> > > > > md_plug_cb (or maybe in some structure which contains that) and release them
> > > > > all on unplug - and only on unplug.
> > > > > 
> > > > > Maybe pass a size to mddev_check_unplugged, and it allocates that much more
> > > > > space.  Get mddev_check_unplugged to return the md_plug_cb structure.
> > > > > If the new space is NULL, then list_head_init it, and change the cb.callback
> > > > > to a raid5 specific function.
> > > > > Then add any stripe to the md_plug_cb, and in the unplug function, release
> > > > > them all.
> > > > > 
> > > > > Does that make sense?
> > > > > 
> > > > > Also I would rather the batched stripe release code were defined in the same
> > > > > patch that used it.  It isn't big enough to justify a separate patch.
> > > > 
> > > > stripe->lru needs the protection of device_lock, so I can't use a list; an
> > > > array is preferred. I really don't like the idea of allocating memory,
> > > > especially for an array. I'll fix the code for cpu-hotplug.
> > > 
> > > You don't need device_lock to use ->lru.
> > > Currently the lru is not used when sh->count is non-zero unless
> > > STRIPE_EXPANDING is set - and we never attach IO requests if STRIPE_EXPANDING
> > > is set.
> > > So when make_request wants to release a stripe_head, ->lru is currently
> > > unused.
> > > So we can use it to put the stripe on a per-thread list without locking.
> > > 
> > > We need another stripe_head flag to say "is on a per-thread unplug list" to
> > > avoid racing between processes, but we don't need a spinlock for that.
> > > ie.
> > >   if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
> > >            list_add(&sh->lru, &plug->list);
> > > 
> > > or similar.
> > 
> > I did see some BUG_ONs trigger when I accessed ->lru without device_lock held
> > before; for example, get_active_stripe() will remove it from the list. Maybe
> > the same bit can be used to avoid that. Let me try.
> 
> Thinking a bit more, the STRIPE_ON_UNPLUG_LIST bit can't avoid races. For
> example, task 1 hits a stripe; assume the stripe count is 0 (it could also be
> non-zero). It does:
> 1. inc count
> 2. set STRIPE_ON_UNPLUG_LIST
> 3. add stripe to plug list
> 4. unplug to release the stripe
> Between 3 and 4, task 2 hits the stripe. It does:
> A: inc count. Since the bit is set, do nothing more
> B: unplug
> If the order is 3, A, 4, B, task 1 will not release the stripe, since the
> count is 2, and task 2 will not release the stripe, since the stripe isn't on
> its list. The stripe will never be handled.

"Since the bit set, do nothing" isn't correct - we need to release the
reference.
So it should be
   if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
          list_add(&sh->lru, &plug->list);
   else
          release_stripe(&sh);

We expect that in most cases release_stripe will just decrement the counter
and not need to take the lock.
Then the unplug takes the lock and calls __release_stripe() on all the
stripes.  So the stripe always gets released, either immediately or
at unplug.

So my initial attempt at a code fragment was incomplete.
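
Putting the pieces together, the make_request() side could look roughly like
this (the helper that finds or allocates the raid5-specific plug structure is
hypothetical):

static void release_stripe_plug(struct mddev *mddev, struct stripe_head *sh)
{
	struct raid5_plug_cb *rcb = raid5_check_plugged(mddev); /* hypothetical */

	if (!rcb) {
		/* no plug active: release immediately as before */
		release_stripe(sh);
		return;
	}
	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
		list_add_tail(&sh->lru, &rcb->list);
	else
		release_stripe(sh);	/* already queued: drop this reference */
}

with the unplug callback clearing STRIPE_ON_UNPLUG_LIST before calling
__release_stripe() on each queued stripe, all under one device_lock hold.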

Thanks,
NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 2/8] raid5: lockless access raid5 overrided bi_phys_segments
  2012-06-07  1:06   ` NeilBrown
@ 2012-06-12 20:41     ` Dan Williams
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2012-06-12 20:41 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, axboe, shli

On Wed, Jun 6, 2012 at 6:06 PM, NeilBrown <neilb@suse.de> wrote:
> On Mon, 04 Jun 2012 16:01:54 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>> Raid5 overrides bio->bi_phys_segments and accesses it with device_lock held,
>> which is unnecessary; we can make it lockless.
>>
>> Signed-off-by: Shaohua Li <shli@fusionio.com>
>
> I cannot say that I like this (casting fields in the bio structure), but I
> can see the value and it should work.  'atomic_t' is currently always the same
> size as an 'int', and I doubt that will change.
>
> So maybe I'll get used to the idea.

I think we should just bite the bullet and acknowledge that this field
has other meanings depending on the context, and make it a union of int
and atomic_t.
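
Spelled out, that might look like (a sketch, not a merged layout):

struct bio {
	/* ... */
	union {
		unsigned int	bi_phys_segments; /* block-layer meaning */
		atomic_t	raid5_segments;	  /* raid5's private counter */
	};
	/* ... */
};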

--
Dan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-07  6:52         ` Shaohua Li
@ 2012-06-12 21:02           ` Dan Williams
  2012-06-13  4:08             ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2012-06-12 21:02 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, axboe, shli

On Wed, Jun 6, 2012 at 11:52 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Jun 07, 2012 at 04:35:22PM +1000, NeilBrown wrote:
>> On Thu, 7 Jun 2012 14:29:39 +0800 Shaohua Li <shli@kernel.org> wrote:
>>
>> > On Thu, Jun 07, 2012 at 10:54:10AM +1000, NeilBrown wrote:
>> > > On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:
>> > >
>> > > > Add a per-stripe lock to protect stripe specific data, like dev->read,
>> > > > written, ... The purpose is to reduce lock contention of conf->device_lock.
>> > >
>> > > I'm not convinced that you need to add a lock.
>> > > I am convinced that if you do add one you need to explain exactly what it is
>> > > protecting.
>> > >
>> > > The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
>> > > be in handle_stripe at a time.
>> > > So I don't think dev->read actually needs any protection (though I haven't
>> > > checked thoroughly).
>> > >
>> > > I think the only things that device_lock protects are things shared by
>> > > multiple stripes, so adding a per-stripe spinlock isn't going to help remove
>> > > device_lock.
>> >
>> > This doesn't sound right to me. Both the async callbacks and request
>> > completion access stripe data, like dev->read. Such things are not protected
>> > by the STRIPE_ACTIVE bit. Though we could delete the STRIPE_ACTIVE bit with
>> > the stripe lock introduced.
>>
>> Please give specifics.  What race do you see with access to dev->read that is
>> not protected by STRIPE_ACTIVE ?
>
> For example, ops_complete_biofill() will change dev->read, which isn't
> protected by STRIPE_ACTIVE. add_stripe_bio() checks ->toread and ->towrite,
> which aren't protected by the bit either. Am I missing anything?

STRIPE_ACTIVE is the replacement for the old per-stripe lock.  That
lock never was meant/able to synchronize add_stripe_bio() vs ops_run_*
(producer vs consumer).  That's always been device_lock's job because
an individual bio may be added to several stripes.  If device_lock is
gone we need a different scheme.  That's what tripped me up last time
I looked at this.
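
For reference, STRIPE_ACTIVE gates handle_stripe() roughly like this in the
current code:

	clear_bit(STRIPE_HANDLE, &sh->state);
	if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state)) {
		/* already being handled; make sure it gets handled again
		 * when the current handler finishes */
		set_bit(STRIPE_HANDLE, &sh->state);
		return;
	}
	/* ... analyse and act on the stripe ... */
	clear_bit_unlock(STRIPE_ACTIVE, &sh->state);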

--
Dan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-04  8:01 ` [patch 1/8] raid5: add a per-stripe lock Shaohua Li
  2012-06-07  0:54   ` NeilBrown
@ 2012-06-12 21:10   ` Dan Williams
  1 sibling, 0 replies; 34+ messages in thread
From: Dan Williams @ 2012-06-12 21:10 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe, shli

On Mon, Jun 4, 2012 at 1:01 AM, Shaohua Li <shli@kernel.org> wrote:
> Add a per-stripe lock to protect stripe specific data, like dev->read,
> written, ... The purpose is to reduce lock contention of conf->device_lock.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   17 +++++++++++++++++
>  drivers/md/raid5.h |    1 +
>  2 files changed, 18 insertions(+)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c       2012-06-01 13:38:54.705210229 +0800
> +++ linux/drivers/md/raid5.c    2012-06-01 13:43:05.594056130 +0800
> @@ -749,6 +749,7 @@ static void ops_complete_biofill(void *s
>
>        /* clear completed biofills */
>        spin_lock_irq(&conf->device_lock);
> +       spin_lock_irq(&sh->stripe_lock);
>        for (i = sh->disks; i--; ) {
>                struct r5dev *dev = &sh->dev[i];
>
> @@ -774,6 +775,7 @@ static void ops_complete_biofill(void *s
>                        }
>                }
>        }
> +       spin_unlock_irq(&sh->stripe_lock);
>        spin_unlock_irq(&conf->device_lock);

Btw... I know this is fixed up in a later patch with the deletion of
device_lock, but bisection may land on this patch which enables irqs a
bit too early.
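
Concretely: with both locks taken via spin_lock_irq(), the inner
spin_unlock_irq() re-enables interrupts while device_lock is still held. At
this bisection point the inner lock would want to leave the irq state alone,
e.g.:

	spin_lock_irq(&conf->device_lock);
	spin_lock(&sh->stripe_lock);	/* irqs are already disabled */
	/* ... */
	spin_unlock(&sh->stripe_lock);
	spin_unlock_irq(&conf->device_lock);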

--
Dan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 8/8] raid5: create multiple threads to handle stripes
  2012-06-07  6:45     ` Shaohua Li
@ 2012-06-13  4:08       ` Dan Williams
  2012-06-21 10:09         ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2012-06-13  4:08 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, axboe, shli

On Wed, Jun 6, 2012 at 11:45 PM, Shaohua Li <shli@kernel.org> wrote:
> On Thu, Jun 07, 2012 at 11:39:58AM +1000, NeilBrown wrote:
>> On Mon, 04 Jun 2012 16:02:00 +0800 Shaohua Li <shli@kernel.org> wrote:
>>
>> > Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, the
>> > thread becomes a bottleneck. raid5 can offload calculation like checksum to
>> > async threads, but if the storage is fast, scheduling and running the async
>> > work introduces heavy lock contention in the workqueue, which makes such
>> > optimization useless. And calculation isn't the only bottleneck: for example,
>> > in my test the raid5 thread must handle > 450k requests per second, and just
>> > doing dispatch and completion is enough to saturate it. The only way to scale
>> > is to use several threads to handle stripes.
>> >
>> > With this patch, the user can create several extra threads to handle stripes.
>> > How many threads are best depends on the number of disks, so the thread count
>> > can be changed from userspace. By default, the thread number is 0, which
>> > means no extra threads.
>> >
>> > In a 3-disk raid5 setup, 2 extra threads provide a 130% throughput
>> > improvement (with double stripe_cache_size), and the throughput is pretty
>> > close to the theoretical value. With >=4 disks the improvement is even
>> > bigger, for example 200% for a 4-disk setup, but the throughput is far below
>> > the theoretical value, which is caused by several factors like request queue
>> > lock contention, cache issues, and the latency introduced by how a stripe is
>> > handled on different disks. Those factors need further investigation.
>> >
>> > Signed-off-by: Shaohua Li <shli@fusionio.com>
>>
>> I think it is great that you have got RAID5 to the point where multiple
>> threads improve performance.
>> I really don't like the idea of having to configure that number of threads.
>>
>> It would be great if it would auto-configure.
>> Maybe the main thread could fork aux threads when it notices a high load.
>> e.g. if it has been servicing requests for more than 100ms without a break,
>> and the number of threads is less than the number of CPUs, then it forks a new
>> helper and resets the timer.
>>
>> If a thread has been idle for more than 30 minutes, it exits.
>>
>> Might that be reasonable?
>
> Yep, I bet this patch needs more discussion; auto-configure is preferred, and
> your idea is worth doing. However, the concern is that with automatic thread
> fork/kill the user can't do NUMA binding, which is important for high-speed
> storage. Maybe have a reasonable default thread number, like one thread per
> disk? This needs more investigation; I'm open to any suggestion on this side.

The last time I looked at this the btrfs thread pool looked like a
good candidate:

  http://marc.info/?l=linux-raid&m=126944260704907&w=2

...have not looked if Tejun has made this available as a generic workqueue mode.

--
Dan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-12 21:02           ` Dan Williams
@ 2012-06-13  4:08             ` Dan Williams
  2012-06-13  4:23               ` Shaohua Li
  0 siblings, 1 reply; 34+ messages in thread
From: Dan Williams @ 2012-06-13  4:08 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, axboe, shli

On Tue, Jun 12, 2012 at 2:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Wed, Jun 6, 2012 at 11:52 PM, Shaohua Li <shli@kernel.org> wrote:
>> On Thu, Jun 07, 2012 at 04:35:22PM +1000, NeilBrown wrote:
>>> On Thu, 7 Jun 2012 14:29:39 +0800 Shaohua Li <shli@kernel.org> wrote:
>>>
>>> > On Thu, Jun 07, 2012 at 10:54:10AM +1000, NeilBrown wrote:
>>> > > On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:
>>> > >
>>> > > > Add a per-stripe lock to protect stripe specific data, like dev->read,
>>> > > > written, ... The purpose is to reduce lock contention of conf->device_lock.
>>> > >
>>> > > I'm not convinced that you need to add a lock.
>>> > > I am convinced that if you do add one you need to explain exactly what it is
>>> > > protecting.
>>> > >
>>> > > The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
>>> > > be in handle_stripe at a time.
>>> > > So I don't think dev->read actually needs any protection (though I haven't
>>> > > checked thoroughly).
>>> > >
>>> > > I think the only things that device_lock protects are things shared by
>>> > > multiple stripes, so adding a per-stripe spinlock isn't going to help remove
>>> > > device_lock.
>>> >
>>> > This doesn't sound right to me. Both the async callbacks and request
>>> > completion access stripe data, like dev->read. Such things are not protected
>>> > by the STRIPE_ACTIVE bit. Though we could delete the STRIPE_ACTIVE bit with
>>> > the stripe lock introduced.
>>>
>>> Please give specifics.  What race do you see with access to dev->read that is
>>> not protected by STRIPE_ACTIVE ?
>>
>> For example, ops_complete_biofill() will change dev->read, which isn't
>> protected by STRIPE_ACTIVE. add_stripe_bio() checks ->toread and ->towrite,
>> which aren't protected by the bit either. Am I missing anything?
>
> STRIPE_ACTIVE is the replacement for the old per-stripe lock.  That
> lock never was meant/able to synchronize add_stripe_bio() vs ops_run_*
> (producer vs consumer).  That's always been device_lock's job because
> an individual bio may be added to several stripes.  If device_lock is
> gone we need a different scheme.  That's what tripped me up last time
> I looked at this.

Actually now that I look at add_stripe_bio again, I think it could be
made to work if:

1/ bi_phys_segments is incremented prior to publishing the bio on
to{read|write}; otherwise we potentially race with a consumer without a
reference

2/ making sure the overlap checking does not walk off into invalid
bios as it may do once we no longer have a global lock

Outside of that we need a lock for making sure bi->next fields are
updated properly and two threads only collide there on the same
stripe.  But I need to think more about the other usages.
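
For point 1, the ordering in add_stripe_bio() would need to be roughly (a
sketch; the barrier placement is the point, not the exact helpers):

	/* take the reference before the bio becomes visible to any
	 * consumer walking ->toread/->towrite */
	raid5_inc_bi_phys_segments(bi);
	smp_wmb();		/* publish the count before the bio */
	bi->bi_next = *bip;
	*bip = bi;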

--
Dan

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch 1/8] raid5: add a per-stripe lock
  2012-06-13  4:08             ` Dan Williams
@ 2012-06-13  4:23               ` Shaohua Li
  0 siblings, 0 replies; 34+ messages in thread
From: Shaohua Li @ 2012-06-13  4:23 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, axboe

2012/6/13 Dan Williams <dan.j.williams@intel.com>:
> On Tue, Jun 12, 2012 at 2:02 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> On Wed, Jun 6, 2012 at 11:52 PM, Shaohua Li <shli@kernel.org> wrote:
>>> On Thu, Jun 07, 2012 at 04:35:22PM +1000, NeilBrown wrote:
>>>> On Thu, 7 Jun 2012 14:29:39 +0800 Shaohua Li <shli@kernel.org> wrote:
>>>>
>>>> > On Thu, Jun 07, 2012 at 10:54:10AM +1000, NeilBrown wrote:
>>>> > > On Mon, 04 Jun 2012 16:01:53 +0800 Shaohua Li <shli@kernel.org> wrote:
>>>> > >
>>>> > > > Add a per-stripe lock to protect stripe specific data, like dev->read,
>>>> > > > written, ... The purpose is to reduce lock contention of conf->device_lock.
>>>> > >
>>>> > > I'm not convinced that you need to add a lock.
>>>> > > I am convinced that if you do add one you need to explain exactly what it is
>>>> > > protecting.
>>>> > >
>>>> > > The STRIPE_ACTIVE bit serves as a lock and ensures that only one process can
>>>> > > be in handle_stripe at a time.
>>>> > > So I don't think dev->read actually needs any protection (though I haven't
>>>> > > checked thoroughly).
>>>> > >
>>>> > > I think the only things that device_lock protects are things shared by
>>>> > > multiple stripes, so adding a per-stripe spinlock isn't going to help remove
>>>> > > device_lock.
>>>> >
>>>> > This doesn't sound right to me. Both the async callbacks and request
>>>> > completion access stripe data, like dev->read. Such things are not protected
>>>> > by the STRIPE_ACTIVE bit. Though we could delete the STRIPE_ACTIVE bit with
>>>> > the stripe lock introduced.
>>>>
>>>> Please give specifics.  What race do you see with access to dev->read that is
>>>> not protected by STRIPE_ACTIVE ?
>>>
>>> For example, ops_complete_biofill() will change dev->read, which isn't
>>> protected by STRIPE_ACTIVE. add_stripe_bio() checks ->toread and ->towrite,
>>> which aren't protected by the bit either. Am I missing anything?
>>
>> STRIPE_ACTIVE is the replacement for the old per-stripe lock.  That
>> lock never was meant/able to synchronize add_stripe_bio() vs ops_run_*
>> (producer vs consumer).  That's always been device_lock's job because
>> an individual bio may be added to several stripes.  If device_lock is
>> gone we need a different scheme.  That's what tripped me up last time
>> I looked at this.
>
> Actually now that I look at add_stripe_bio again, I think it could be
> made to work if:

Yes, I'm currently checking whether bi_phys_segments can completely
avoid the problem you described too. It's a kind of reference count.
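
Roughly, what I'm experimenting with (this is where patch 2/8 is heading;
helper names are from my work-in-progress tree) is to treat the 32-bit
bi_phys_segments as two 16-bit counters updated with atomic ops, so no
lock is needed:

static inline int raid5_bi_processed_stripes(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	/* upper 16 bits: stripes already processed for this bio */
	return (atomic_read(segments) >> 16) & 0xffff;
}

static inline int raid5_dec_bi_active_stripes(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	/* lower 16 bits: reference count of stripes holding this bio */
	return atomic_sub_return(1, segments) & 0xffff;
}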

> 1/ bi_phys_segments is incremented prior to publishing the bio on
> to{read|write}; otherwise we potentially race with a consumer without a
> reference

We still have a per-stripe lock, so the same stripe isn't a
big problem. If it's a different stripe, the reference count
should cover it.

> 2/ making sure the overlap checking does not walk off into invalid
> bios as it may do once we no longer have a global lock

I assume we already do this; r5_next_bio will check it. But
I need to double check.
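
For reference, the bounds check I mean is the existing helper in
raid5.h; quoting from memory, so treat the exact field names as
approximate:

static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
{
	int sectors = bio->bi_size >> 9;

	/* stop the walk once the next bio no longer overlaps this stripe */
	if (bio->bi_sector + sectors < sector + STRIPE_SECTORS)
		return bio->bi_next;
	else
		return NULL;
}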

Thanks,
Shaohua


* Re: [patch 8/8] raid5: create multiple threads to handle stripes
  2012-06-13  4:08       ` Dan Williams
@ 2012-06-21 10:09         ` Shaohua Li
  2012-07-02 20:43           ` Dan Williams
  0 siblings, 1 reply; 34+ messages in thread
From: Shaohua Li @ 2012-06-21 10:09 UTC (permalink / raw)
  To: Dan Williams; +Cc: NeilBrown, linux-raid, axboe, shli

On Tue, Jun 12, 2012 at 09:08:17PM -0700, Dan Williams wrote:
> On Wed, Jun 6, 2012 at 11:45 PM, Shaohua Li <shli@kernel.org> wrote:
> > On Thu, Jun 07, 2012 at 11:39:58AM +1000, NeilBrown wrote:
> >> On Mon, 04 Jun 2012 16:02:00 +0800 Shaohua Li <shli@kernel.org> wrote:
> >>
> >> > Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, the
> >> > thread becomes a bottleneck. raid5 can offload calculation like checksum to
> >> > async threads. And if storage is fast, scheduling async work and running async
> >> > work will introduce heavy lock contention in the workqueue, which makes such
> >> > optimization useless. And calculation isn't the only bottleneck. For example,
> >> > in my test the raid5 thread must handle > 450k requests per second. Just doing
> >> > dispatch and completion overwhelms the raid5 thread. The only chance to
> >> > scale is using several threads to handle stripes.
> >> >
> >> > With this patch, users can create several extra threads to handle stripes. How
> >> > many threads work best depends on the number of disks, so the thread count can
> >> > be changed from userspace. By default, the thread number is 0, which means no
> >> > extra threads.
> >> >
> >> > In a 3-disk raid5 setup, 2 extra threads can provide a 130% throughput
> >> > improvement (with double stripe_cache_size), and the throughput is pretty
> >> > close to the theoretical value. With >=4 disks, the improvement is even
> >> > bigger; for example, it can improve 200% for a 4-disk setup, but the
> >> > throughput is far less than the theoretical value, which is caused by several
> >> > factors like request queue lock contention, cache issues, and latency
> >> > introduced by how a stripe is handled across different disks. Those factors
> >> > need further investigation.
> >> >
> >> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> >>
> >> I think it is great that you have got RAID5 to the point where multiple
> >> threads improve performance.
> >> I really don't like the idea of having to configure that number of threads.
> >>
> >> It would be great if it would auto-configure.
> >> Maybe the main thread could fork aux threads when it notices a high load.
> >> e.g. if it has been servicing requests for more than 100ms without a break,
> >> and the number of threads is less than the number of CPUs, then it forks a new
> >> helper and resets the timer.
> >>
> >> If a thread has been idle for more than 30 minutes, it exits.
> >>
> >> Might that be reasonable?
> >
> > Yep, I bet this patch needs more discussion. Auto-configure is preferred, and
> > your idea is worth doing. However, the concern is that with automatic
> > forking/killing of threads, users can't do NUMA binding, which is important
> > for high-speed storage. Maybe have a reasonable default thread number, like
> > one thread per disk? This needs more investigation; I'm open to any
> > suggestions here.
> 
> The last time I looked at this the btrfs thread pool looked like a
> good candidate:
> 
>   http://marc.info/?l=linux-raid&m=126944260704907&w=2
> 
> ...I have not checked whether Tejun has made this available as a generic workqueue mode.

I tried creating an UNBOUND workqueue with max_active set to the cpu number, so
each cpu handles one work item, and in each work item the cpu handles 8
stripes. The throughput is relatively ok, but CPU utilization is very high
compared to just creating 3 or 4 threads like the patch does. There is heavy
lock contention on the block queue_lock, since every cpu now dispatches
requests. There are other issues too, like cache behavior, and raid5
device_lock sees more contention. It appears that using too many threads to
handle stripes isn't as good as expected.
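
Roughly, the experiment looked like this (condensed from my test patch;
STRIPE_BATCH, conf->stripe_work, and the queue setup are from that
patch, not anything upstream):

#define STRIPE_BATCH	8

static struct workqueue_struct *raid5_wq;

static void raid5_do_work(struct work_struct *work)
{
	struct r5conf *conf = container_of(work, struct r5conf, stripe_work);
	int handled = 0;

	spin_lock_irq(&conf->device_lock);
	while (handled < STRIPE_BATCH) {
		struct stripe_head *sh = __get_priority_stripe(conf);

		if (!sh)
			break;
		/* drop the lock while the stripe is handled, as raid5d does */
		spin_unlock_irq(&conf->device_lock);
		handle_stripe(sh);
		release_stripe(sh);
		handled++;
		spin_lock_irq(&conf->device_lock);
	}
	spin_unlock_irq(&conf->device_lock);
}

/* setup: unbound, capped at one in-flight work item per cpu */
raid5_wq = alloc_workqueue("raid5", WQ_UNBOUND, num_online_cpus());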

Thanks,
Shaohua


* Re: [patch 8/8] raid5: create multiple threads to handle stripes
  2012-06-21 10:09         ` Shaohua Li
@ 2012-07-02 20:43           ` Dan Williams
  0 siblings, 0 replies; 34+ messages in thread
From: Dan Williams @ 2012-07-02 20:43 UTC (permalink / raw)
  To: Shaohua Li; +Cc: NeilBrown, linux-raid, axboe, shli

On Thu, Jun 21, 2012 at 3:09 AM, Shaohua Li <shli@kernel.org> wrote:
> On Tue, Jun 12, 2012 at 09:08:17PM -0700, Dan Williams wrote:
>> On Wed, Jun 6, 2012 at 11:45 PM, Shaohua Li <shli@kernel.org> wrote:
>> > On Thu, Jun 07, 2012 at 11:39:58AM +1000, NeilBrown wrote:
>> >> On Mon, 04 Jun 2012 16:02:00 +0800 Shaohua Li <shli@kernel.org> wrote:
>> >>
>> >> > Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, the
>> >> > thread becomes a bottleneck. raid5 can offload calculation like checksum to
>> >> > async threads. And if storage is fast, scheduling async work and running async
>> >> > work will introduce heavy lock contention in the workqueue, which makes such
>> >> > optimization useless. And calculation isn't the only bottleneck. For example,
>> >> > in my test the raid5 thread must handle > 450k requests per second. Just doing
>> >> > dispatch and completion overwhelms the raid5 thread. The only chance to
>> >> > scale is using several threads to handle stripes.
>> >> >
>> >> > With this patch, users can create several extra threads to handle stripes. How
>> >> > many threads work best depends on the number of disks, so the thread count can
>> >> > be changed from userspace. By default, the thread number is 0, which means no
>> >> > extra threads.
>> >> >
>> >> > In a 3-disk raid5 setup, 2 extra threads can provide a 130% throughput
>> >> > improvement (with double stripe_cache_size), and the throughput is pretty
>> >> > close to the theoretical value. With >=4 disks, the improvement is even
>> >> > bigger; for example, it can improve 200% for a 4-disk setup, but the
>> >> > throughput is far less than the theoretical value, which is caused by several
>> >> > factors like request queue lock contention, cache issues, and latency
>> >> > introduced by how a stripe is handled across different disks. Those factors
>> >> > need further investigation.
>> >> >
>> >> > Signed-off-by: Shaohua Li <shli@fusionio.com>
>> >>
>> >> I think it is great that you have got RAID5 to the point where multiple
>> >> threads improve performance.
>> >> I really don't like the idea of having to configure that number of threads.
>> >>
>> >> It would be great if it would auto-configure.
>> >> Maybe the main thread could fork aux threads when it notices a high load.
>> >> e.g. if it has been servicing requests for more than 100ms without a break,
>> >> and the number of threads is less than the number of CPUs, then it forks a new
>> >> helper and resets the timer.
>> >>
>> >> If a thread has been idle for more than 30 minutes, it exits.
>> >>
>> >> Might that be reasonable?
>> >
>> > Yep, I bet this patch needs more discussion. Auto-configure is preferred, and
>> > your idea is worth doing. However, the concern is that with automatic
>> > forking/killing of threads, users can't do NUMA binding, which is important
>> > for high-speed storage. Maybe have a reasonable default thread number, like
>> > one thread per disk? This needs more investigation; I'm open to any
>> > suggestions here.
>>
>> The last time I looked at this the btrfs thread pool looked like a
>> good candidate:
>>
>>   http://marc.info/?l=linux-raid&m=126944260704907&w=2
>>
>> ...I have not checked whether Tejun has made this available as a generic workqueue mode.
>
> I tried creating an UNBOUND workqueue with max_active set to the cpu number, so
> each cpu handles one work item, and in each work item the cpu handles 8
> stripes. The throughput is relatively ok, but CPU utilization is very high
> compared to just creating 3 or 4 threads like the patch does. There is heavy
> lock contention on the block queue_lock, since every cpu now dispatches
> requests. There are other issues too, like cache behavior, and raid5
> device_lock sees more contention. It appears that using too many threads to
> handle stripes isn't as good as expected.

Yes, the unbounded workqueue is too many threads because it will keep
creating threads as long as there is work.  That's the behavior you
want for async_schedule() but not for raid.  This was the reasoning for
exploring the btrfs thread pool: it has a threshold parameter to push
back on thread creation.  This goes back to my other question about
which workload triggers the cpu bottleneck.
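
Conceptually the threshold looks like this (all names here are
hypothetical; the point is only that the backlog must cross 'thresh'
before another worker gets woken or created):

static void queue_stripe_work(struct r5pool *pool)
{
	atomic_inc(&pool->pending);

	/* always keep at least one worker running */
	if (atomic_read(&pool->active) == 0)
		goto wake;

	/* add workers only once the backlog crosses the threshold */
	if (atomic_read(&pool->pending) > pool->thresh &&
	    atomic_read(&pool->active) < pool->max_workers)
		goto wake;

	return;
wake:
	wake_or_create_worker(pool);	/* hypothetical helper */
}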

The other side of the coin is what to do about the "too fast" stripe
processing problem.  Currently get_priority_stripe() operates on the
principle that stripe processing naturally backs up the submission
queue, allowing more full-stripe writes to coalesce.  The better we get
at stripe processing, the worse we may do at coalescing.
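
Heavily condensed, the shape of that trade-off is something like the
following sketch (paraphrased -- the real get_priority_stripe() also
maintains bypass_count and the preread-active accounting):

static struct stripe_head *priority_stripe_shape(struct r5conf *conf)
{
	struct stripe_head *sh = NULL;

	if (!list_empty(&conf->handle_list))
		/* normal work first, so held stripes keep accumulating bios */
		sh = list_entry(conf->handle_list.next,
				struct stripe_head, lru);
	else if (!list_empty(&conf->hold_list) &&
		 conf->bypass_count > conf->bypass_threshold)
		/* only then drain stripes held back for full-stripe writes */
		sh = list_entry(conf->hold_list.next,
				struct stripe_head, lru);
	if (sh)
		list_del_init(&sh->lru);
	return sh;
}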

--
Dan



Thread overview: 34+ messages
2012-06-04  8:01 [patch 0/8] raid5: improve write performance for fast storage Shaohua Li
2012-06-04  8:01 ` [patch 1/8] raid5: add a per-stripe lock Shaohua Li
2012-06-07  0:54   ` NeilBrown
2012-06-07  6:29     ` Shaohua Li
2012-06-07  6:35       ` NeilBrown
2012-06-07  6:52         ` Shaohua Li
2012-06-12 21:02           ` Dan Williams
2012-06-13  4:08             ` Dan Williams
2012-06-13  4:23               ` Shaohua Li
2012-06-12 21:10   ` Dan Williams
2012-06-04  8:01 ` [patch 2/8] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
2012-06-07  1:06   ` NeilBrown
2012-06-12 20:41     ` Dan Williams
2012-06-04  8:01 ` [patch 3/8] raid5: remove some device_lock locking places Shaohua Li
2012-06-04  8:01 ` [patch 4/8] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
2012-06-07  0:50   ` NeilBrown
2012-06-04  8:01 ` [patch 5/8] raid5: add batch stripe release Shaohua Li
2012-06-04  8:01 ` [patch 6/8] raid5: make_request use " Shaohua Li
2012-06-07  1:23   ` NeilBrown
2012-06-07  6:33     ` Shaohua Li
2012-06-07  7:33       ` NeilBrown
2012-06-07  7:58         ` Shaohua Li
2012-06-08  6:16           ` Shaohua Li
2012-06-08  6:42             ` NeilBrown
2012-06-04  8:01 ` [patch 7/8] raid5: raid5d handle stripe in batch way Shaohua Li
2012-06-07  1:32   ` NeilBrown
2012-06-07  6:35     ` Shaohua Li
2012-06-07  7:38       ` NeilBrown
2012-06-04  8:02 ` [patch 8/8] raid5: create multiple threads to handle stripes Shaohua Li
2012-06-07  1:39   ` NeilBrown
2012-06-07  6:45     ` Shaohua Li
2012-06-13  4:08       ` Dan Williams
2012-06-21 10:09         ` Shaohua Li
2012-07-02 20:43           ` Dan Williams
