* [patch 00/10 v3] raid5: improve write performance for fast storage
@ 2012-06-25  7:24 Shaohua Li
  2012-06-25  7:24 ` [patch 01/10 v3] raid5: use wake_up_all for overlap waking Shaohua Li
                   ` (9 more replies)
  0 siblings, 10 replies; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

Hi,

Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, that
thread becomes a bottleneck. raid5 can offload calculations like checksums to
async threads, but if the storage is fast, scheduling and running the async
work introduces heavy lock contention in the workqueue, which makes such
optimization useless. And calculation isn't the only bottleneck: for example,
in my test the raid5 thread must handle > 450k requests per second, and just
doing dispatch and completion is enough to overwhelm it. The only way to
scale is to use several threads to handle stripes.

Simply using several threads doesn't work: conf->device_lock is a global lock
which is heavily contended. Patches 3-9 in the set try to address this
problem. With them, when several threads are handling stripes, device_lock is
still contended but takes much less CPU time and is no longer the heaviest
lock. Even if the 10th patch isn't accepted, patches 3-9 look good to merge.

I did a stress test (block size range 1k - 64k with a small total size, so
overlap/stripe sharing is guaranteed) with the patches and everything looks
fine except for some issues fixed in the first two patches. Those issues aren't
related to the series, but I need the fixes for the stress test.

With the locking issue largely solved, switching stripe handling to
multiple threads is trivial.

Threads are still created in advance (the default thread number is the disk
number) and can be reconfigured by the user. Automatically creating and reaping
threads would be great, but I'm worried about NUMA binding.

In a 3-disk raid5 setup, 2 extra threads provide a 130% throughput
improvement (with stripe_cache_size doubled) and the throughput is pretty
close to the theoretical value. With >= 4 disks, the improvement is even
bigger, for example 200% for a 4-disk setup, but the throughput is far below
the theoretical value. That is caused by several factors such as request queue
lock contention, cache issues, and the latency introduced by how a stripe is
handled on the different disks. Those factors need further investigation.

V2->V3:
1. fixed a hang caused by a stripe with both the STRIPE_DELAYED and
STRIPE_PREREAD_ACTIVE bits set.
2. fixed an issue pointed out by Dan.
3. don't always wake up all worker threads any more.

V1->V2:
1. fixed several issues pointed out by Neil and Dan.
2. fixed a wake_up issue.

Thanks,
Shaohua


* [patch 01/10 v3] raid5: use wake_up_all for overlap waking
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-06-28  7:26   ` NeilBrown
  2012-06-25  7:24 ` [patch 02/10 v3] raid5: delayed stripe fix Shaohua Li
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-wakeupall.patch --]
[-- Type: text/plain, Size: 3879 bytes --]

It's possible that several tasks are waiting for a stripe overlap. We clear the
R5_Overlap bit and call wake_up(), but wake_up() just wakes one task. So if
there are several tasks in the wait queue, some of them will not be woken up
even though their stripe's R5_Overlap bit is clear. The end result is tasks
hanging in make_request().

wake_up_all() should not introduce a performance issue here, since the overlap
case is rare.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-19 08:11:10.021688417 +0800
+++ linux/drivers/md/raid5.c	2012-06-19 08:11:29.833439339 +0800
@@ -1399,7 +1399,7 @@ static void __raid_run_ops(struct stripe
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&sh->raid_conf->wait_for_overlap);
+				wake_up_all(&sh->raid_conf->wait_for_overlap);
 		}
 	put_cpu();
 }
@@ -2436,7 +2436,7 @@ handle_failed_stripe(struct r5conf *conf
 		}
 
 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-			wake_up(&conf->wait_for_overlap);
+			wake_up_all(&conf->wait_for_overlap);
 
 		while (bi && bi->bi_sector <
 			sh->dev[i].sector + STRIPE_SECTORS) {
@@ -2474,7 +2474,7 @@ handle_failed_stripe(struct r5conf *conf
 			bi = sh->dev[i].toread;
 			sh->dev[i].toread = NULL;
 			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-				wake_up(&conf->wait_for_overlap);
+				wake_up_all(&conf->wait_for_overlap);
 			if (bi) s->to_read--;
 			while (bi && bi->bi_sector <
 			       sh->dev[i].sector + STRIPE_SECTORS) {
@@ -3572,7 +3572,7 @@ static void handle_stripe(struct stripe_
 	} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
@@ -4249,7 +4249,7 @@ static sector_t reshape_request(struct m
 		spin_lock_irq(&conf->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
 
@@ -4340,7 +4340,7 @@ static sector_t reshape_request(struct m
 		spin_lock_irq(&conf->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
 	return reshape_sectors;
@@ -5718,7 +5718,7 @@ static void end_reshape(struct r5conf *c
 		smp_wmb();
 		conf->reshape_progress = MaxSector;
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 
 		/* read-ahead size must cover two whole stripes, which is
 		 * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
@@ -5776,7 +5776,7 @@ static void raid5_quiesce(struct mddev *
 
 	switch(state) {
 	case 2: /* resume for a suspend */
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		break;
 
 	case 1: /* stop all writes */
@@ -5792,14 +5792,14 @@ static void raid5_quiesce(struct mddev *
 		conf->quiesce = 1;
 		spin_unlock_irq(&conf->device_lock);
 		/* allow reshape to continue */
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		break;
 
 	case 0: /* re-enable writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 0;
 		wake_up(&conf->wait_for_stripe);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		spin_unlock_irq(&conf->device_lock);
 		break;
 	}



* [patch 02/10 v3] raid5: delayed stripe fix
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
  2012-06-25  7:24 ` [patch 01/10 v3] raid5: use wake_up_all for overlap waking Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  0:46   ` NeilBrown
  2012-06-25  7:24 ` [patch 03/10 v3] raid5: add a per-stripe lock Shaohua Li
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-delayed-stripe-fix.patch --]
[-- Type: text/plain, Size: 1429 bytes --]

There is no locking when setting the STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE
bits, but the two bits are related. A delayed stripe can be moved to the hold
list only when the preread active stripe count is below IO_THRESHOLD. If a
stripe has both bits set, it will sit in the delayed list while the preread
count is not 0, so it will never leave the delayed list.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
@@ -196,12 +196,14 @@ static void __release_stripe(struct r5co
 		BUG_ON(!list_empty(&sh->lru));
 		BUG_ON(atomic_read(&conf->active_stripes)==0);
 		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
+			if (test_bit(STRIPE_DELAYED, &sh->state) &&
+			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				list_add_tail(&sh->lru, &conf->delayed_list);
 			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
 				   sh->bm_seq - conf->seq_write > 0)
 				list_add_tail(&sh->lru, &conf->bitmap_list);
 			else {
+				clear_bit(STRIPE_DELAYED, &sh->state);
 				clear_bit(STRIPE_BIT_DELAY, &sh->state);
 				list_add_tail(&sh->lru, &conf->handle_list);
 			}



* [patch 03/10 v3] raid5: add a per-stripe lock
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
  2012-06-25  7:24 ` [patch 01/10 v3] raid5: use wake_up_all for overlap waking Shaohua Li
  2012-06-25  7:24 ` [patch 02/10 v3] raid5: delayed stripe fix Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  0:50   ` NeilBrown
  2012-06-25  7:24 ` [patch 04/10 v3] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-add-perstripe-lock.patch --]
[-- Type: text/plain, Size: 4862 bytes --]

Add a per-stripe lock to protect stripe-specific data such as dev->read,
dev->written, etc. The purpose is to reduce contention on conf->device_lock.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   17 +++++++++++++++++
 drivers/md/raid5.h |    1 +
 2 files changed, 18 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:37:13.651888057 +0800
@@ -751,6 +751,7 @@ static void ops_complete_biofill(void *s
 
 	/* clear completed biofills */
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
@@ -776,6 +777,7 @@ static void ops_complete_biofill(void *s
 			}
 		}
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
@@ -800,8 +802,10 @@ static void ops_run_biofill(struct strip
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
 			spin_lock_irq(&conf->device_lock);
+			spin_lock(&sh->stripe_lock);
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
+			spin_unlock(&sh->stripe_lock);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
@@ -1139,10 +1143,12 @@ ops_run_biodrain(struct stripe_head *sh,
 			struct bio *wbi;
 
 			spin_lock_irq(&sh->raid_conf->device_lock);
+			spin_lock(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
+			spin_unlock(&sh->stripe_lock);
 			spin_unlock_irq(&sh->raid_conf->device_lock);
 
 			while (wbi && wbi->bi_sector <
@@ -1448,6 +1454,8 @@ static int grow_one_stripe(struct r5conf
 	init_waitqueue_head(&sh->ops.wait_for_ops);
 	#endif
 
+	spin_lock_init(&sh->stripe_lock);
+
 	if (grow_buffers(sh)) {
 		shrink_buffers(sh);
 		kmem_cache_free(conf->slab_cache, sh);
@@ -2329,6 +2337,7 @@ static int add_stripe_bio(struct stripe_
 
 
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
@@ -2362,6 +2371,7 @@ static int add_stripe_bio(struct stripe_
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
@@ -2378,6 +2388,7 @@ static int add_stripe_bio(struct stripe_
 
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	return 0;
 }
@@ -2429,6 +2440,7 @@ handle_failed_stripe(struct r5conf *conf
 			}
 		}
 		spin_lock_irq(&conf->device_lock);
+		spin_lock(&sh->stripe_lock);
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
@@ -2490,6 +2502,7 @@ handle_failed_stripe(struct r5conf *conf
 				bi = nextbi;
 			}
 		}
+		spin_unlock(&sh->stripe_lock);
 		spin_unlock_irq(&conf->device_lock);
 		if (bitmap_end)
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -2697,6 +2710,7 @@ static void handle_stripe_clean_event(st
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
 				spin_lock_irq(&conf->device_lock);
+				spin_lock(&sh->stripe_lock);
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_sector <
@@ -2711,6 +2725,7 @@ static void handle_stripe_clean_event(st
 				}
 				if (dev->towrite == NULL)
 					bitmap_end = 1;
+				spin_unlock(&sh->stripe_lock);
 				spin_unlock_irq(&conf->device_lock);
 				if (bitmap_end)
 					bitmap_endwrite(conf->mddev->bitmap,
@@ -3170,6 +3185,7 @@ static void analyse_stripe(struct stripe
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	for (i=disks; i--; ) {
 		struct md_rdev *rdev;
 		sector_t first_bad;
@@ -3315,6 +3331,7 @@ static void analyse_stripe(struct stripe
 				do_recovery = 1;
 		}
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-25 14:36:13.940638627 +0800
+++ linux/drivers/md/raid5.h	2012-06-25 14:37:13.651888057 +0800
@@ -210,6 +210,7 @@ struct stripe_head {
 	int			disks;		/* disks in stripe */
 	enum check_states	check_state;
 	enum reconstruct_states reconstruct_state;
+	spinlock_t		stripe_lock;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target



* [patch 04/10 v3] raid5: lockless access raid5 overrided bi_phys_segments
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (2 preceding siblings ...)
  2012-06-25  7:24 ` [patch 03/10 v3] raid5: add a per-stripe lock Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-06-25  7:24 ` [patch 05/10 v3] raid5: remove some device_lock locking places Shaohua Li
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-atomic-segment-accounting.patch --]
[-- Type: text/plain, Size: 6243 bytes --]

raid5 overloads bio->bi_phys_segments and currently accesses it with
device_lock held, which is unnecessary. We can actually make the accounting
lockless with atomic operations.
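
Purely for illustration, here is a minimal self-contained userspace sketch of
the same split-counter idea (lower 16 bits count active stripes, upper 16 bits
count processed stripes). The names are made up; only the update pattern is
meant to mirror the accessors in the patch below:

#include <stdatomic.h>
#include <stdio.h>

/* lower 16 bits: active stripe count, upper 16 bits: processed stripe count */
static _Atomic unsigned int segments;

static void inc_active(void)
{
	atomic_fetch_add(&segments, 1);
}

static unsigned int dec_active(void)
{
	/* value after the decrement, masked to the active-count half */
	return (atomic_fetch_sub(&segments, 1) - 1) & 0xffff;
}

static unsigned int processed(void)
{
	return (atomic_load(&segments) >> 16) & 0xffff;
}

static void set_processed(unsigned int cnt)
{
	unsigned int old, val;

	/* cmpxchg loop: replace only the upper half, leave the active count alone */
	do {
		old = atomic_load(&segments);
		val = (old & 0xffff) | (cnt << 16);
	} while (!atomic_compare_exchange_weak(&segments, &old, val));
}

int main(void)
{
	atomic_store(&segments, 1);	/* biased count, as in the raid5 code */
	inc_active();
	set_processed(3);
	printf("active=%u processed=%u\n", dec_active(), processed());
	return 0;
}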

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   58 +++++++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:13.651888057 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:37:18.743841116 +0800
@@ -99,34 +99,40 @@ static inline struct bio *r5_next_bio(st
  * We maintain a biased count of active stripes in the bottom 16 bits of
  * bi_phys_segments, and a count of processed stripes in the upper 16 bits
  */
-static inline int raid5_bi_phys_segments(struct bio *bio)
+static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
-	return bio->bi_phys_segments & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
-static inline int raid5_bi_hw_segments(struct bio *bio)
+static inline int raid5_dec_bi_active_stripes(struct bio *bio)
 {
-	return (bio->bi_phys_segments >> 16) & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return atomic_sub_return(1, segments) & 0xffff;
 }
 
-static inline int raid5_dec_bi_phys_segments(struct bio *bio)
+static inline void raid5_inc_bi_active_stripes(struct bio *bio)
 {
-	--bio->bi_phys_segments;
-	return raid5_bi_phys_segments(bio);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	atomic_inc(segments);
 }
 
-static inline int raid5_dec_bi_hw_segments(struct bio *bio)
+static inline void raid5_set_bi_processed_stripes(struct bio *bio,
+	unsigned int cnt)
 {
-	unsigned short val = raid5_bi_hw_segments(bio);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	int old, new;
 
-	--val;
-	bio->bi_phys_segments = (val << 16) | raid5_bi_phys_segments(bio);
-	return val;
+	do {
+		old = atomic_read(segments);
+		new = (old & 0xffff) | (cnt << 16);
+	} while (atomic_cmpxchg(segments, old, new) != old);
 }
 
-static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
+static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
 {
-	bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	atomic_set(segments, cnt);
 }
 
 /* Find first data disk in a raid6 stripe */
@@ -769,7 +775,7 @@ static void ops_complete_biofill(void *s
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
-				if (!raid5_dec_bi_phys_segments(rbi)) {
+				if (!raid5_dec_bi_active_stripes(rbi)) {
 					rbi->bi_next = return_bi;
 					return_bi = rbi;
 				}
@@ -2356,7 +2362,7 @@ static int add_stripe_bio(struct stripe_
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	bi->bi_phys_segments++;
+	raid5_inc_bi_active_stripes(bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -2456,7 +2462,7 @@ handle_failed_stripe(struct r5conf *conf
 			sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
-			if (!raid5_dec_bi_phys_segments(bi)) {
+			if (!raid5_dec_bi_active_stripes(bi)) {
 				md_write_end(conf->mddev);
 				bi->bi_next = *return_bi;
 				*return_bi = bi;
@@ -2471,7 +2477,7 @@ handle_failed_stripe(struct r5conf *conf
 		       sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
-			if (!raid5_dec_bi_phys_segments(bi)) {
+			if (!raid5_dec_bi_active_stripes(bi)) {
 				md_write_end(conf->mddev);
 				bi->bi_next = *return_bi;
 				*return_bi = bi;
@@ -2495,7 +2501,7 @@ handle_failed_stripe(struct r5conf *conf
 				struct bio *nextbi =
 					r5_next_bio(bi, sh->dev[i].sector);
 				clear_bit(BIO_UPTODATE, &bi->bi_flags);
-				if (!raid5_dec_bi_phys_segments(bi)) {
+				if (!raid5_dec_bi_active_stripes(bi)) {
 					bi->bi_next = *return_bi;
 					*return_bi = bi;
 				}
@@ -2716,7 +2722,7 @@ static void handle_stripe_clean_event(st
 				while (wbi && wbi->bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_phys_segments(wbi)) {
+					if (!raid5_dec_bi_active_stripes(wbi)) {
 						md_write_end(conf->mddev);
 						wbi->bi_next = *return_bi;
 						*return_bi = wbi;
@@ -3785,7 +3791,7 @@ static struct bio *remove_bio_from_retry
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		bi->bi_phys_segments = 1; /* biased count of active stripes */
+		raid5_set_bi_stripes(bi, 1); /* biased count of active stripes */
 	}
 
 	return bi;
@@ -4124,7 +4130,7 @@ static void make_request(struct mddev *m
 		md_wakeup_thread(mddev->thread);
 
 	spin_lock_irq(&conf->device_lock);
-	remaining = raid5_dec_bi_phys_segments(bi);
+	remaining = raid5_dec_bi_active_stripes(bi);
 	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0) {
 
@@ -4481,7 +4487,7 @@ static int  retry_aligned_read(struct r5
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
 
-		if (scnt < raid5_bi_hw_segments(raid_bio))
+		if (scnt < raid5_bi_processed_stripes(raid_bio))
 			/* already done this stripe */
 			continue;
 
@@ -4489,14 +4495,14 @@ static int  retry_aligned_read(struct r5
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
-			raid5_set_bi_hw_segments(raid_bio, scnt);
+			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
 
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
-			raid5_set_bi_hw_segments(raid_bio, scnt);
+			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
@@ -4506,7 +4512,7 @@ static int  retry_aligned_read(struct r5
 		handled++;
 	}
 	spin_lock_irq(&conf->device_lock);
-	remaining = raid5_dec_bi_phys_segments(raid_bio);
+	remaining = raid5_dec_bi_active_stripes(raid_bio);
 	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0)
 		bio_endio(raid_bio, 0);



* [patch 05/10 v3] raid5: remove some device_lock locking places
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (3 preceding siblings ...)
  2012-06-25  7:24 ` [patch 04/10 v3] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-06-25  7:24 ` [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-remove-some-lock.patch --]
[-- Type: text/plain, Size: 7548 bytes --]

With the per-stripe lock and lockless bi_phys_segments accounting, we can
safely remove some of the places where device_lock is taken.

The stripe fields ->read, ->toread, etc. are protected by the per-stripe lock,
so access to a stripe's bio lists is always serialized by this lock. If a bio
in the ->read/->toread lists is shared by multiple stripes, there are two
protections:
1. bi_phys_segments acts as a reference count
2. the lists are traversed with r5_next_bio(), so a traversal never touches a
bio that doesn't belong to the stripe

Let's have an example:
|  stripe1 |  stripe2    |  stripe3  |
...bio1......|bio2|bio3|....bio4.....

stripe2 overlaps 4 bios. When it's finished, it will decrement bi_phys_segments
for all of them, but only call end_bio for bio2 and bio3. bio1->bi_next still
points to bio2, but that doesn't matter: when stripe1 is finished it will not
touch bio2 because of the r5_next_bio() check. Later, stripe1 will end_bio bio1
and stripe3 will end_bio bio4.
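
To make the example above concrete, here is a small self-contained sketch of
the bounding check (toy types and numbers, not the kernel code; the real
r5_next_bio() lives in raid5.c):

#include <stdio.h>

#define STRIPE_SECTORS 8ULL

/* toy bio: just enough to model the chain-bounding check */
struct bio {
	unsigned long long bi_sector;	/* first sector covered */
	unsigned long long sectors;	/* length in sectors */
	struct bio *bi_next;
};

/* follow ->bi_next only while the current bio ends inside this stripe;
 * a bio extending past the stripe is the last one this stripe may touch */
static struct bio *r5_next_bio(struct bio *bio, unsigned long long sector)
{
	if (bio->bi_sector + bio->sectors < sector + STRIPE_SECTORS)
		return bio->bi_next;
	return NULL;
}

static void walk(const char *name, struct bio *head, unsigned long long sector)
{
	struct bio *b;

	for (b = head; b; b = r5_next_bio(b, sector))
		printf("%s visits bio at sector %llu\n", name, b->bi_sector);
}

int main(void)
{
	/* the layout from the example: stripe1 = [0,8), stripe2 = [8,16) */
	struct bio bio4 = { 14, 6, NULL };	/* spills over into stripe3 */
	struct bio bio3 = { 12, 2, &bio4 };
	struct bio bio2 = { 10, 2, &bio3 };
	struct bio bio1 = {  4, 6, &bio2 };	/* spills in from stripe1 */

	walk("stripe1", &bio1, 0);	/* stops after bio1, never touches bio2 */
	walk("stripe2", &bio1, 8);	/* visits bio1, bio2, bio3 and bio4 */
	return 0;
}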

Before add_stripe_bio() adds a bio to a stripe, the bio's bi_phys_segments has
already been incremented, so we don't need to worry about other stripes
releasing the bio.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   60 ++++++++++++++++++++---------------------------------
 1 file changed, 23 insertions(+), 37 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:18.743841116 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:37:21.423789830 +0800
@@ -749,15 +749,13 @@ static void ops_complete_biofill(void *s
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	struct r5conf *conf = sh->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
 	/* clear completed biofills */
-	spin_lock_irq(&conf->device_lock);
-	spin_lock(&sh->stripe_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
@@ -783,8 +781,7 @@ static void ops_complete_biofill(void *s
 			}
 		}
 	}
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
 	return_io(return_bi);
@@ -796,7 +793,6 @@ static void ops_complete_biofill(void *s
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	struct r5conf *conf = sh->raid_conf;
 	struct async_submit_ctl submit;
 	int i;
 
@@ -807,12 +803,10 @@ static void ops_run_biofill(struct strip
 		struct r5dev *dev = &sh->dev[i];
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
-			spin_lock_irq(&conf->device_lock);
-			spin_lock(&sh->stripe_lock);
+			spin_lock_irq(&sh->stripe_lock);
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
-			spin_unlock(&sh->stripe_lock);
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&sh->stripe_lock);
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, dev->page,
@@ -1148,14 +1142,12 @@ ops_run_biodrain(struct stripe_head *sh,
 		if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
 			struct bio *wbi;
 
-			spin_lock_irq(&sh->raid_conf->device_lock);
-			spin_lock(&sh->stripe_lock);
+			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
-			spin_unlock(&sh->stripe_lock);
-			spin_unlock_irq(&sh->raid_conf->device_lock);
+			spin_unlock_irq(&sh->stripe_lock);
 
 			while (wbi && wbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
@@ -2341,9 +2333,15 @@ static int add_stripe_bio(struct stripe_
 		(unsigned long long)bi->bi_sector,
 		(unsigned long long)sh->sector);
 
-
-	spin_lock_irq(&conf->device_lock);
-	spin_lock(&sh->stripe_lock);
+	/*
+	 * If several bio share a stripe. The bio bi_phys_segments acts as a
+	 * reference count to avoid race. The reference count should already be
+	 * increased before this function is called (for example, in
+	 * make_request()), so other bio sharing this stripe will not free the
+	 * stripe. If a stripe is owned by one stripe, the stripe lock will
+	 * protect it.
+	 */
+	spin_lock_irq(&sh->stripe_lock);
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
@@ -2377,8 +2375,7 @@ static int add_stripe_bio(struct stripe_
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)(*bip)->bi_sector,
@@ -2394,8 +2391,7 @@ static int add_stripe_bio(struct stripe_
 
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 	return 0;
 }
 
@@ -2445,8 +2441,7 @@ handle_failed_stripe(struct r5conf *conf
 				rdev_dec_pending(rdev, conf->mddev);
 			}
 		}
-		spin_lock_irq(&conf->device_lock);
-		spin_lock(&sh->stripe_lock);
+		spin_lock_irq(&sh->stripe_lock);
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
@@ -2508,8 +2503,7 @@ handle_failed_stripe(struct r5conf *conf
 				bi = nextbi;
 			}
 		}
-		spin_unlock(&sh->stripe_lock);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&sh->stripe_lock);
 		if (bitmap_end)
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
 					STRIPE_SECTORS, 0, 0);
@@ -2715,8 +2709,7 @@ static void handle_stripe_clean_event(st
 				struct bio *wbi, *wbi2;
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
-				spin_lock_irq(&conf->device_lock);
-				spin_lock(&sh->stripe_lock);
+				spin_lock_irq(&sh->stripe_lock);
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_sector <
@@ -2731,8 +2724,7 @@ static void handle_stripe_clean_event(st
 				}
 				if (dev->towrite == NULL)
 					bitmap_end = 1;
-				spin_unlock(&sh->stripe_lock);
-				spin_unlock_irq(&conf->device_lock);
+				spin_unlock_irq(&sh->stripe_lock);
 				if (bitmap_end)
 					bitmap_endwrite(conf->mddev->bitmap,
 							sh->sector,
@@ -3190,8 +3182,7 @@ static void analyse_stripe(struct stripe
 
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
-	spin_lock_irq(&conf->device_lock);
-	spin_lock(&sh->stripe_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	for (i=disks; i--; ) {
 		struct md_rdev *rdev;
 		sector_t first_bad;
@@ -3337,8 +3328,7 @@ static void analyse_stripe(struct stripe
 				do_recovery = 1;
 		}
 	}
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
 		 *     we must be recovering.
@@ -4129,9 +4119,7 @@ static void make_request(struct mddev *m
 	if (!plugged)
 		md_wakeup_thread(mddev->thread);
 
-	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_active_stripes(bi);
-	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0) {
 
 		if ( rw == WRITE )
@@ -4511,9 +4499,7 @@ static int  retry_aligned_read(struct r5
 		release_stripe(sh);
 		handled++;
 	}
-	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0)
 		bio_endio(raid_bio, 0);
 	if (atomic_dec_and_test(&conf->active_aligned_reads))



* [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (4 preceding siblings ...)
  2012-06-25  7:24 ` [patch 05/10 v3] raid5: remove some device_lock locking places Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  0:57   ` NeilBrown
  2012-06-25  7:24 ` [patch 07/10 v3] md: personality can provide unplug private data Shaohua Li
                   ` (3 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-reduce-release_stripe-lock.patch --]
[-- Type: text/plain, Size: 3794 bytes --]

release_stripe() is a place where conf->device_lock is heavily contended. We
take the lock even when the stripe count isn't 1 (so the decrement cannot
reach zero), which isn't required.
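
The trick is atomic_dec_and_lock(): the lock is only taken when the reference
count actually drops to zero. A rough userspace model of that semantic, with
made-up names and a mutex standing in for the spinlock:

#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>
#include <stdio.h>

/* decrement *cnt; take the lock and return true only if it reached zero */
static bool dec_and_lock(_Atomic int *cnt, pthread_mutex_t *lock)
{
	int old = atomic_load(cnt);

	/* fast path: decrement locklessly unless this put would hit zero */
	while (old > 1) {
		if (atomic_compare_exchange_weak(cnt, &old, old - 1))
			return false;
	}

	/* slow path: the final decrement happens under the lock */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(cnt, 1) == 1)
		return true;	/* caller does the release work, then unlocks */
	pthread_mutex_unlock(lock);
	return false;
}

int main(void)
{
	pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	_Atomic int count = 2;

	printf("first put took lock: %d\n", dec_and_lock(&count, &lock));
	if (dec_and_lock(&count, &lock)) {
		printf("second put: count hit zero with the lock held\n");
		pthread_mutex_unlock(&lock);
	}
	return 0;
}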

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   73 +++++++++++++++++++++++++++++------------------------
 1 file changed, 40 insertions(+), 33 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:21.000000000 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:13.899130571 +0800
@@ -196,49 +196,56 @@ static int stripe_operations_active(stru
 	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
 }
 
-static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+static void handle_release_stripe(struct r5conf *conf, struct stripe_head *sh)
 {
-	if (atomic_dec_and_test(&sh->count)) {
-		BUG_ON(!list_empty(&sh->lru));
-		BUG_ON(atomic_read(&conf->active_stripes)==0);
-		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state) &&
-			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				list_add_tail(&sh->lru, &conf->delayed_list);
-			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-				   sh->bm_seq - conf->seq_write > 0)
-				list_add_tail(&sh->lru, &conf->bitmap_list);
-			else {
-				clear_bit(STRIPE_DELAYED, &sh->state);
-				clear_bit(STRIPE_BIT_DELAY, &sh->state);
-				list_add_tail(&sh->lru, &conf->handle_list);
-			}
-			md_wakeup_thread(conf->mddev->thread);
-		} else {
-			BUG_ON(stripe_operations_active(sh));
-			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				if (atomic_dec_return(&conf->preread_active_stripes)
-				    < IO_THRESHOLD)
-					md_wakeup_thread(conf->mddev->thread);
-			atomic_dec(&conf->active_stripes);
-			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->inactive_list);
-				wake_up(&conf->wait_for_stripe);
-				if (conf->retry_read_aligned)
-					md_wakeup_thread(conf->mddev->thread);
-			}
+	BUG_ON(!list_empty(&sh->lru));
+	BUG_ON(atomic_read(&conf->active_stripes)==0);
+	if (test_bit(STRIPE_HANDLE, &sh->state)) {
+		if (test_bit(STRIPE_DELAYED, &sh->state) &&
+		    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			list_add_tail(&sh->lru, &conf->delayed_list);
+		else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			   sh->bm_seq - conf->seq_write > 0)
+			list_add_tail(&sh->lru, &conf->bitmap_list);
+		else {
+			clear_bit(STRIPE_DELAYED, &sh->state);
+			clear_bit(STRIPE_BIT_DELAY, &sh->state);
+			list_add_tail(&sh->lru, &conf->handle_list);
+		}
+		md_wakeup_thread(conf->mddev->thread);
+	} else {
+		BUG_ON(stripe_operations_active(sh));
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			if (atomic_dec_return(&conf->preread_active_stripes)
+			    < IO_THRESHOLD)
+				md_wakeup_thread(conf->mddev->thread);
+		atomic_dec(&conf->active_stripes);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			list_add_tail(&sh->lru, &conf->inactive_list);
+			wake_up(&conf->wait_for_stripe);
+			if (conf->retry_read_aligned)
+				md_wakeup_thread(conf->mddev->thread);
 		}
 	}
 }
 
+static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+{
+	if (atomic_dec_and_test(&sh->count))
+		handle_release_stripe(conf, sh);
+}
+
 static void release_stripe(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 	unsigned long flags;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
-	__release_stripe(conf, sh);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	local_irq_save(flags);
+	if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
+		handle_release_stripe(conf, sh);
+		spin_unlock(&conf->device_lock);
+	}
+	local_irq_restore(flags);
 }
 
 static inline void remove_hash(struct stripe_head *sh)



* [patch 07/10 v3] md: personality can provide unplug private data
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (5 preceding siblings ...)
  2012-06-25  7:24 ` [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  1:06   ` NeilBrown
  2012-06-25  7:24 ` [patch 08/10 v3] raid5: make_request use batch stripe release Shaohua Li
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: md-plug-add-callback.patch --]
[-- Type: text/plain, Size: 5624 bytes --]

Allow a personality to provide private data for the unplug callback. The next
patch will use it.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/md.c     |   31 +++++++++++++------------------
 drivers/md/md.h     |   20 +++++++++++++++++++-
 drivers/md/raid1.c  |    2 +-
 drivers/md/raid10.c |    2 +-
 drivers/md/raid5.c  |    2 +-
 5 files changed, 35 insertions(+), 22 deletions(-)

Index: linux/drivers/md/md.c
===================================================================
--- linux.orig/drivers/md/md.c	2012-06-25 14:36:13.668642048 +0800
+++ linux/drivers/md/md.c	2012-06-25 14:38:33.106889041 +0800
@@ -498,22 +498,13 @@ void md_flush_request(struct mddev *mdde
 }
 EXPORT_SYMBOL(md_flush_request);
 
-/* Support for plugging.
- * This mirrors the plugging support in request_queue, but does not
- * require having a whole queue or request structures.
- * We allocate an md_plug_cb for each md device and each thread it gets
- * plugged on.  This links tot the private plug_handle structure in the
- * personality data where we keep a count of the number of outstanding
- * plugs so other code can see if a plug is active.
- */
-struct md_plug_cb {
-	struct blk_plug_cb cb;
-	struct mddev *mddev;
-};
 
 static void plugger_unplug(struct blk_plug_cb *cb)
 {
 	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);
+
+	if (mdcb->unplug)
+		mdcb->unplug(mdcb);
 	if (atomic_dec_and_test(&mdcb->mddev->plug_cnt))
 		md_wakeup_thread(mdcb->mddev->thread);
 	kfree(mdcb);
@@ -522,13 +513,14 @@ static void plugger_unplug(struct blk_pl
 /* Check that an unplug wakeup will come shortly.
  * If not, wakeup the md thread immediately
  */
-int mddev_check_plugged(struct mddev *mddev)
+struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
+	md_unplug_func_t unplug, size_t size)
 {
 	struct blk_plug *plug = current->plug;
 	struct md_plug_cb *mdcb;
 
 	if (!plug)
-		return 0;
+		return NULL;
 
 	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
 		if (mdcb->cb.callback == plugger_unplug &&
@@ -538,19 +530,22 @@ int mddev_check_plugged(struct mddev *md
 						    struct md_plug_cb,
 						    cb.list))
 				list_move(&mdcb->cb.list, &plug->cb_list);
-			return 1;
+			return mdcb;
 		}
 	}
 	/* Not currently on the callback list */
-	mdcb = kmalloc(sizeof(*mdcb), GFP_ATOMIC);
+	mdcb = kmalloc(sizeof(*mdcb) + size, GFP_ATOMIC);
 	if (!mdcb)
-		return 0;
+		return NULL;
 
 	mdcb->mddev = mddev;
 	mdcb->cb.callback = plugger_unplug;
 	atomic_inc(&mddev->plug_cnt);
 	list_add(&mdcb->cb.list, &plug->cb_list);
-	return 1;
+	mdcb->unplug = unplug;
+	if (size)
+		memset((void *)(mdcb + 1), 0, size);
+	return mdcb;
 }
 EXPORT_SYMBOL_GPL(mddev_check_plugged);
 
Index: linux/drivers/md/md.h
===================================================================
--- linux.orig/drivers/md/md.h	2012-06-25 14:36:13.676641948 +0800
+++ linux/drivers/md/md.h	2012-06-25 14:38:33.106889041 +0800
@@ -630,6 +630,24 @@ extern struct bio *bio_clone_mddev(struc
 				   struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 				   struct mddev *mddev);
-extern int mddev_check_plugged(struct mddev *mddev);
+
+/* Support for plugging.
+ * This mirrors the plugging support in request_queue, but does not
+ * require having a whole queue or request structures.
+ * We allocate an md_plug_cb for each md device and each thread it gets
+ * plugged on.  This links tot the private plug_handle structure in the
+ * personality data where we keep a count of the number of outstanding
+ * plugs so other code can see if a plug is active.
+ */
+struct md_plug_cb;
+typedef void (*md_unplug_func_t)(struct md_plug_cb *mdcb);
+struct md_plug_cb {
+	struct blk_plug_cb cb;
+	struct mddev *mddev;
+	md_unplug_func_t unplug;
+};
+
+extern struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
+	md_unplug_func_t unplug, size_t size);
 extern void md_trim_bio(struct bio *bio, int offset, int size);
 #endif /* _MD_MD_H */
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-25 14:36:13.696641695 +0800
+++ linux/drivers/md/raid1.c	2012-06-25 14:38:33.110889008 +0800
@@ -1034,7 +1034,7 @@ read_again:
 	 * the bad blocks.  Each set of writes gets it's own r1bio
 	 * with a set of bios attached.
 	 */
-	plugged = mddev_check_plugged(mddev);
+	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 
 	disks = conf->raid_disks * 2;
  retry_write:
Index: linux/drivers/md/raid10.c
===================================================================
--- linux.orig/drivers/md/raid10.c	2012-06-25 14:36:13.684641847 +0800
+++ linux/drivers/md/raid10.c	2012-06-25 14:38:33.110889008 +0800
@@ -1239,7 +1239,7 @@ read_again:
 	 * of r10_bios is recored in bio->bi_phys_segments just as with
 	 * the read case.
 	 */
-	plugged = mddev_check_plugged(mddev);
+	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 
 	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
 	raid10_find_phys(conf, r10_bio);
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:13.899130571 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:33.110889008 +0800
@@ -4012,7 +4012,7 @@ static void make_request(struct mddev *m
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
-	plugged = mddev_check_plugged(mddev);
+	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int previous;



* [patch 08/10 v3] raid5: make_request use batch stripe release
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (6 preceding siblings ...)
  2012-06-25  7:24 ` [patch 07/10 v3] md: personality can provide unplug private data Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  2:31   ` NeilBrown
  2012-06-25  7:24 ` [patch 09/10 v3] raid5: raid5d handle stripe in batch way Shaohua Li
  2012-06-25  7:24 ` [patch 10/10 v3] raid5: create multiple threads to handle stripes Shaohua Li
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-make_request-relase_stripe-batch.patch --]
[-- Type: text/plain, Size: 4473 bytes --]

make_request() does a stripe release for every stripe, and the stripe usually
has count 1, which makes the previous release_stripe() optimization
ineffective. In my test, this release_stripe() becomes the heaviest place
taking conf->device_lock after the previous patches are applied.

This patch batches the stripe releases: all the stripes are released at unplug
time. The STRIPE_ON_UNPLUG_LIST bit protects against concurrent access to the
stripe lru.
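
The point is simply to amortize the lock: stripes are queued on a plug-local
list and conf->device_lock is taken once per batch instead of once per stripe.
A toy illustration of the two paths (made-up names, not the raid5 code):

#include <pthread.h>
#include <stdio.h>

struct stripe { struct stripe *next; int id; };

static pthread_mutex_t device_lock = PTHREAD_MUTEX_INITIALIZER;

/* unbatched path: one lock round trip per stripe */
static void release_one(struct stripe *sh)
{
	pthread_mutex_lock(&device_lock);
	printf("released %d\n", sh->id);
	pthread_mutex_unlock(&device_lock);
}

/* batched path: queue on a plug-local list ... */
static void queue_on_plug(struct stripe **plug, struct stripe *sh)
{
	sh->next = *plug;
	*plug = sh;
}

/* ... and release everything under a single lock acquisition at unplug time */
static void unplug(struct stripe **plug)
{
	pthread_mutex_lock(&device_lock);
	while (*plug) {
		struct stripe *sh = *plug;

		*plug = sh->next;
		printf("released %d\n", sh->id);
	}
	pthread_mutex_unlock(&device_lock);
}

int main(void)
{
	struct stripe a = { NULL, 1 }, b = { NULL, 2 }, c = { NULL, 3 };
	struct stripe *plug = NULL;

	release_one(&a);		/* old behaviour */
	queue_on_plug(&plug, &b);	/* new behaviour */
	queue_on_plug(&plug, &c);
	unplug(&plug);
	return 0;
}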

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.h |    1 
 2 files changed, 60 insertions(+), 5 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:33.110889008 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:37.378835415 +0800
@@ -484,7 +484,8 @@ get_active_stripe(struct r5conf *conf, s
 		} else {
 			if (atomic_read(&sh->count)) {
 				BUG_ON(!list_empty(&sh->lru)
-				    && !test_bit(STRIPE_EXPANDING, &sh->state));
+				    && !test_bit(STRIPE_EXPANDING, &sh->state)
+				    && !test_bit(STRIPE_ON_UNPLUG_LIST, &sh->state));
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
@@ -3984,6 +3985,51 @@ static struct stripe_head *__get_priorit
 	return sh;
 }
 
+#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)
+static void raid5_unplug(struct md_plug_cb *mdcb)
+{
+	struct list_head *list = raid5_unplug_list(mdcb);
+	struct stripe_head *sh;
+	struct r5conf *conf = mdcb->mddev->private;
+
+	if (list->next == NULL || list_empty(list))
+		return;
+	spin_lock_irq(&conf->device_lock);
+	while (!list_empty(list)) {
+		sh = list_entry(list->next, struct stripe_head, lru);
+		list_del_init(&sh->lru);
+		/*
+		 * avoid race release_stripe_plug() sees STRIPE_ON_UNPLUG_LIST
+		 * clear but the stripe is still in our list
+		 */
+		smp_mb__before_clear_bit();
+		clear_bit(STRIPE_ON_UNPLUG_LIST, &sh->state);
+		__release_stripe(conf, sh);
+	}
+	spin_unlock_irq(&conf->device_lock);
+}
+
+static void release_stripe_plug(struct md_plug_cb *mdcb,
+				struct stripe_head *sh)
+{
+	struct list_head *list = raid5_unplug_list(mdcb);
+
+	if (!mdcb) {
+		release_stripe(sh);
+		return;
+	}
+
+	if (list->next == NULL) {
+		INIT_LIST_HEAD(list);
+		mdcb->unplug = raid5_unplug;
+	}
+
+	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
+		list_add_tail(&sh->lru, list);
+	else
+		release_stripe(sh);
+}
+
 static void make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
@@ -3993,7 +4039,7 @@ static void make_request(struct mddev *m
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
 	int remaining;
-	int plugged;
+	struct md_plug_cb *mdcb;
 
 	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
 		md_flush_request(mddev, bi);
@@ -4012,7 +4058,8 @@ static void make_request(struct mddev *m
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
-	plugged = !!mddev_check_plugged(mddev, NULL, 0);
+	mdcb = mddev_check_plugged(mddev, raid5_unplug,
+				   sizeof(struct list_head));
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int previous;
@@ -4114,7 +4161,14 @@ static void make_request(struct mddev *m
 			if ((bi->bi_rw & REQ_SYNC) &&
 			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
-			release_stripe(sh);
+			/*
+			 * We must recheck here. schedule() might be called
+			 * above which makes unplug invoked already, so the old
+			 * mdcb is invalid
+			 */
+			mdcb = mddev_check_plugged(mddev, raid5_unplug,
+						   sizeof(struct list_head));
+			release_stripe_plug(mdcb, sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -4123,7 +4177,7 @@ static void make_request(struct mddev *m
 		}
 			
 	}
-	if (!plugged)
+	if (!mdcb)
 		md_wakeup_thread(mddev->thread);
 
 	remaining = raid5_dec_bi_active_stripes(bi);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-25 14:37:13.651888057 +0800
+++ linux/drivers/md/raid5.h	2012-06-25 14:38:37.382835318 +0800
@@ -320,6 +320,7 @@ enum {
 	STRIPE_BIOFILL_RUN,
 	STRIPE_COMPUTE_RUN,
 	STRIPE_OPS_REQ_PENDING,
+	STRIPE_ON_UNPLUG_LIST,
 };
 
 /*



* [patch 09/10 v3] raid5: raid5d handle stripe in batch way
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (7 preceding siblings ...)
  2012-06-25  7:24 ` [patch 08/10 v3] raid5: make_request use batch stripe release Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  2:32   ` NeilBrown
  2012-06-25  7:24 ` [patch 10/10 v3] raid5: create multiple threads to handle stripes Shaohua Li
  9 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-raid5d-fetch-stripe-batch.patch --]
[-- Type: text/plain, Size: 2330 bytes --]

Let raid5d handle stripes in batches to reduce conf->device_lock locking.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:37.378835415 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:38:40.150800523 +0800
@@ -4568,6 +4568,30 @@ static int  retry_aligned_read(struct r5
 	return handled;
 }
 
+#define MAX_STRIPE_BATCH 8
+static int handle_active_stripes(struct r5conf *conf)
+{
+	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
+	int i, batch_size = 0;
+
+	while (batch_size < MAX_STRIPE_BATCH &&
+			(sh = __get_priority_stripe(conf)) != NULL)
+		batch[batch_size++] = sh;
+
+	if (batch_size == 0)
+		return batch_size;
+	spin_unlock_irq(&conf->device_lock);
+
+	for (i = 0; i < batch_size; i++)
+		handle_stripe(batch[i]);
+
+	cond_resched();
+
+	spin_lock_irq(&conf->device_lock);
+	for (i = 0; i < batch_size; i++)
+		__release_stripe(conf, batch[i]);
+	return batch_size;
+}
 
 /*
  * This is our raid5 kernel thread.
@@ -4578,7 +4602,6 @@ static int  retry_aligned_read(struct r5
  */
 static void raid5d(struct mddev *mddev)
 {
-	struct stripe_head *sh;
 	struct r5conf *conf = mddev->private;
 	int handled;
 	struct blk_plug plug;
@@ -4592,6 +4615,7 @@ static void raid5d(struct mddev *mddev)
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct bio *bio;
+		int batch_size;
 
 		if (atomic_read(&mddev->plug_cnt) == 0 &&
 		    !list_empty(&conf->bitmap_list)) {
@@ -4616,21 +4640,16 @@ static void raid5d(struct mddev *mddev)
 			handled++;
 		}
 
-		sh = __get_priority_stripe(conf);
-
-		if (!sh)
+		batch_size = handle_active_stripes(conf);
+		if (!batch_size)
 			break;
-		spin_unlock_irq(&conf->device_lock);
-		
-		handled++;
-		handle_stripe(sh);
-		release_stripe(sh);
-		cond_resched();
+		handled += batch_size;
 
-		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
+			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
-
-		spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->device_lock);
+		}
 	}
 	pr_debug("%d stripes handled\n", handled);
 



* [patch 10/10 v3] raid5: create multiple threads to handle stripes
  2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
                   ` (8 preceding siblings ...)
  2012-06-25  7:24 ` [patch 09/10 v3] raid5: raid5d handle stripe in batch way Shaohua Li
@ 2012-06-25  7:24 ` Shaohua Li
  2012-07-02  2:39   ` NeilBrown
  2012-07-02 20:03   ` Dan Williams
  9 siblings, 2 replies; 32+ messages in thread
From: Shaohua Li @ 2012-06-25  7:24 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-multiple-threads.patch --]
[-- Type: text/plain, Size: 8566 bytes --]

Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, that
thread becomes a bottleneck. raid5 can offload calculations like checksums to
async threads, but if the storage is fast, scheduling and running the async
work introduces heavy lock contention in the workqueue, which makes such
optimization useless. And calculation isn't the only bottleneck: for example,
in my test the raid5 thread must handle > 450k requests per second, and just
doing dispatch and completion is enough to overwhelm it. The only way to
scale is to use several threads to handle stripes.

With this patch, the user can create several extra threads to handle stripes.
How many threads work best depends on the disk number, so the thread number
can be changed from userspace. By default, the thread number is 0, which means
no extra threads.
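
For example, assuming the array is md0 and following where the other raid5
attributes such as stripe_cache_size live, the auxthread_number attribute
added below would be tuned with something like:

  echo 2 > /sys/block/md0/md/auxthread_number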

In a 3-disk raid5 setup, 2 extra threads provide a 130% throughput
improvement (with stripe_cache_size doubled) and the throughput is pretty
close to the theoretical value. With >= 4 disks, the improvement is even
bigger, for example 200% for a 4-disk setup, but the throughput is far below
the theoretical value. That is caused by several factors such as request queue
lock contention, cache issues, and the latency introduced by how a stripe is
handled on the different disks. Those factors need further investigation.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |  137 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.h |    3 +
 2 files changed, 139 insertions(+), 1 deletion(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-25 14:58:06.420138526 +0800
+++ linux/drivers/md/raid5.c	2012-06-25 14:58:06.428138426 +0800
@@ -211,6 +211,7 @@ static void handle_release_stripe(struct
 			clear_bit(STRIPE_DELAYED, &sh->state);
 			clear_bit(STRIPE_BIT_DELAY, &sh->state);
 			list_add_tail(&sh->lru, &conf->handle_list);
+			conf->pending_stripes++;
 		}
 		md_wakeup_thread(conf->mddev->thread);
 	} else {
@@ -489,6 +490,10 @@ get_active_stripe(struct r5conf *conf, s
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
+				else if (!list_empty(&sh->lru)
+					 && !test_bit(STRIPE_DELAYED, &sh->state)
+					 && !test_bit(STRIPE_BIT_DELAY, &sh->state))
+					conf->pending_stripes--;
 				if (list_empty(&sh->lru) &&
 				    !test_bit(STRIPE_EXPANDING, &sh->state))
 					BUG();
@@ -3670,6 +3675,7 @@ static void raid5_activate_delayed(struc
 			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
 			list_add_tail(&sh->lru, &conf->hold_list);
+			conf->pending_stripes++;
 		}
 	}
 }
@@ -3979,6 +3985,7 @@ static struct stripe_head *__get_priorit
 	} else
 		return NULL;
 
+	conf->pending_stripes--;
 	list_del_init(&sh->lru);
 	atomic_inc(&sh->count);
 	BUG_ON(atomic_read(&sh->count) != 1);
@@ -4593,6 +4600,33 @@ static int handle_active_stripes(struct
 	return batch_size;
 }
 
+static void raid5auxd(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+	struct blk_plug plug;
+	int handled;
+
+	pr_debug("+++ raid5auxd active\n");
+
+	blk_start_plug(&plug);
+	handled = 0;
+	spin_lock_irq(&conf->device_lock);
+	while (1) {
+		int batch_size;
+
+		batch_size = handle_active_stripes(conf);
+		if (!batch_size)
+			break;
+		handled += batch_size;
+	}
+	pr_debug("%d stripes handled\n", handled);
+
+	spin_unlock_irq(&conf->device_lock);
+	blk_finish_plug(&plug);
+
+	pr_debug("--- raid5auxd inactive\n");
+}
+
 /*
  * This is our raid5 kernel thread.
  *
@@ -4615,7 +4649,7 @@ static void raid5d(struct mddev *mddev)
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct bio *bio;
-		int batch_size;
+		int batch_size, i;
 
 		if (atomic_read(&mddev->plug_cnt) == 0 &&
 		    !list_empty(&conf->bitmap_list)) {
@@ -4645,6 +4679,10 @@ static void raid5d(struct mddev *mddev)
 			break;
 		handled += batch_size;
 
+		for (i = 0; i < conf->aux_thread_num
+		     && i < conf->pending_stripes/MAX_STRIPE_BATCH + 1; i++)
+			md_wakeup_thread(conf->aux_threads[i]);
+
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
@@ -4769,10 +4807,85 @@ stripe_cache_active_show(struct mddev *m
 static struct md_sysfs_entry
 raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 
+static ssize_t
+raid5_show_auxthread_number(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	if (conf)
+		return sprintf(page, "%d\n", conf->aux_thread_num);
+	else
+		return 0;
+}
+
+static ssize_t
+raid5_store_auxthread_number(struct mddev *mddev, const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	unsigned long new;
+	int i;
+	struct md_thread **threads;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+
+	if (strict_strtoul(page, 10, &new))
+		return -EINVAL;
+
+	if (new == conf->aux_thread_num)
+		return len;
+
+	if (new > conf->aux_thread_num) {
+		threads = kmalloc(sizeof(struct md_thread *) * new, GFP_KERNEL);
+		if (!threads)
+			return -EFAULT;
+
+		i = conf->aux_thread_num;
+		while (i < new) {
+			char name[10];
+
+			sprintf(name, "aux%d", i);
+			threads[i] = md_register_thread(raid5auxd, mddev, name);
+			if (!threads[i])
+				goto error;
+			i++;
+		}
+		memcpy(threads, conf->aux_threads,
+			sizeof(struct md_thread *) * conf->aux_thread_num);
+		spin_lock_irq(&conf->device_lock);
+		kfree(conf->aux_threads);
+		conf->aux_threads = threads;
+		conf->aux_thread_num = new;
+		spin_unlock_irq(&conf->device_lock);
+	} else {
+		int old = conf->aux_thread_num;
+
+		spin_lock_irq(&conf->device_lock);
+		conf->aux_thread_num = new;
+		spin_unlock_irq(&conf->device_lock);
+		for (i = new; i < old; i++)
+			md_unregister_thread(&conf->aux_threads[i]);
+	}
+
+	return len;
+error:
+	while (--i >= conf->aux_thread_num)
+		md_unregister_thread(&threads[i]);
+	kfree(threads);
+	return -EFAULT;
+}
+
+static struct md_sysfs_entry
+raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
+				raid5_show_auxthread_number,
+				raid5_store_auxthread_number);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
 	&raid5_preread_bypass_threshold.attr,
+	&raid5_auxthread_number.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -4820,6 +4933,7 @@ static void raid5_free_percpu(struct r5c
 
 static void free_conf(struct r5conf *conf)
 {
+	kfree(conf->aux_threads);
 	shrink_stripes(conf);
 	raid5_free_percpu(conf);
 	kfree(conf->disks);
@@ -4914,6 +5028,7 @@ static struct r5conf *setup_conf(struct
 	int raid_disk, memory, max_disks;
 	struct md_rdev *rdev;
 	struct disk_info *disk;
+	int i;
 
 	if (mddev->new_level != 5
 	    && mddev->new_level != 4
@@ -5037,6 +5152,22 @@ static struct r5conf *setup_conf(struct
 		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
 		       mdname(mddev), memory);
 
+	/* By default, auxthread number equals to disk number */
+	conf->aux_threads = kmalloc(sizeof(struct md_thread *) * max_disks,
+				    GFP_KERNEL);
+	if (!conf->aux_threads)
+		goto abort;
+	for (i = 0; i < max_disks; i++) {
+		char name[10];
+
+		sprintf(name, "aux%d", i);
+		conf->aux_threads[i] = md_register_thread(raid5auxd, mddev, name);
+		if (!conf->aux_threads[i])
+			break;
+	}
+
+	conf->aux_thread_num = i;
+
 	conf->thread = md_register_thread(raid5d, mddev, NULL);
 	if (!conf->thread) {
 		printk(KERN_ERR
@@ -5376,6 +5507,10 @@ abort:
 static int stop(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
+	int i;
+
+	for (i = 0; i < conf->aux_thread_num; i++)
+		md_unregister_thread(&conf->aux_threads[i]);
 
 	md_unregister_thread(&mddev->thread);
 	if (mddev->queue)
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-25 14:58:06.408138677 +0800
+++ linux/drivers/md/raid5.h	2012-06-25 14:58:06.432138376 +0800
@@ -450,6 +450,7 @@ struct r5conf {
 	int			inactive_blocked;	/* release of inactive stripes blocked,
 							 * waiting for 25% to be free
 							 */
+	int			pending_stripes;
 	int			pool_size; /* number of disks in stripeheads in pool */
 	spinlock_t		device_lock;
 	struct disk_info	*disks;
@@ -458,6 +459,8 @@ struct r5conf {
 	 * the new thread here until we fully activate the array.
 	 */
 	struct md_thread	*thread;
+	int			aux_thread_num;
+	struct md_thread	**aux_threads;
 };
 
 /*


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 01/10 v3] raid5: use wake_up_all for overlap waking
  2012-06-25  7:24 ` [patch 01/10 v3] raid5: use wake_up_all for overlap waking Shaohua Li
@ 2012-06-28  7:26   ` NeilBrown
  2012-06-28  8:53     ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2012-06-28  7:26 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 4566 bytes --]

On Mon, 25 Jun 2012 15:24:48 +0800 Shaohua Li <shli@kernel.org> wrote:

> It's possible several tasks are waiting for stripe overlap. We clear R5_Overlap
> bit and wake_up, but wake_up just wakes one task. So if there are several tasks
> in the wait queue, some tasks will not be woken up even its strip R5_Overlap
> clear. The end result is tasks hang in make_request.
> 
> wake_up_all should not introduce performance issue here, since overlap case is
> rare.

This is not necessary.
wake_up_all is only different from wake_up if WQ_FLAG_EXCLUSIVE is set, e.g.
by prepare_to_wait_exclusive.
As we don't use an exclusive wait to wait on wait_for_overlap, there is no
point in using wake_up_all; wake_up already wakes everything up.
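
To illustrate (a minimal sketch of the generic wait-queue semantics only,
not code from this series): wait_for_overlap is waited on with a plain,
non-exclusive wait, so WQ_FLAG_EXCLUSIVE is never set and a single
wake_up() already wakes every waiter.

	/* waiter side, as in make_request() */
	DEFINE_WAIT(w);

	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
	/* ...notice the overlap, release the stripe... */
	schedule();
	finish_wait(&conf->wait_for_overlap, &w);

	/* waker side: for non-exclusive waiters these are equivalent */
	wake_up(&conf->wait_for_overlap);
	wake_up_all(&conf->wait_for_overlap);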

NeilBrown


> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-19 08:11:10.021688417 +0800
> +++ linux/drivers/md/raid5.c	2012-06-19 08:11:29.833439339 +0800
> @@ -1399,7 +1399,7 @@ static void __raid_run_ops(struct stripe
>  		for (i = disks; i--; ) {
>  			struct r5dev *dev = &sh->dev[i];
>  			if (test_and_clear_bit(R5_Overlap, &dev->flags))
> -				wake_up(&sh->raid_conf->wait_for_overlap);
> +				wake_up_all(&sh->raid_conf->wait_for_overlap);
>  		}
>  	put_cpu();
>  }
> @@ -2436,7 +2436,7 @@ handle_failed_stripe(struct r5conf *conf
>  		}
>  
>  		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
> -			wake_up(&conf->wait_for_overlap);
> +			wake_up_all(&conf->wait_for_overlap);
>  
>  		while (bi && bi->bi_sector <
>  			sh->dev[i].sector + STRIPE_SECTORS) {
> @@ -2474,7 +2474,7 @@ handle_failed_stripe(struct r5conf *conf
>  			bi = sh->dev[i].toread;
>  			sh->dev[i].toread = NULL;
>  			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
> -				wake_up(&conf->wait_for_overlap);
> +				wake_up_all(&conf->wait_for_overlap);
>  			if (bi) s->to_read--;
>  			while (bi && bi->bi_sector <
>  			       sh->dev[i].sector + STRIPE_SECTORS) {
> @@ -3572,7 +3572,7 @@ static void handle_stripe(struct stripe_
>  	} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
>  		clear_bit(STRIPE_EXPAND_READY, &sh->state);
>  		atomic_dec(&conf->reshape_stripes);
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
>  	}
>  
> @@ -4249,7 +4249,7 @@ static sector_t reshape_request(struct m
>  		spin_lock_irq(&conf->device_lock);
>  		conf->reshape_safe = mddev->reshape_position;
>  		spin_unlock_irq(&conf->device_lock);
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
>  	}
>  
> @@ -4340,7 +4340,7 @@ static sector_t reshape_request(struct m
>  		spin_lock_irq(&conf->device_lock);
>  		conf->reshape_safe = mddev->reshape_position;
>  		spin_unlock_irq(&conf->device_lock);
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
>  	}
>  	return reshape_sectors;
> @@ -5718,7 +5718,7 @@ static void end_reshape(struct r5conf *c
>  		smp_wmb();
>  		conf->reshape_progress = MaxSector;
>  		spin_unlock_irq(&conf->device_lock);
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  
>  		/* read-ahead size must cover two whole stripes, which is
>  		 * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
> @@ -5776,7 +5776,7 @@ static void raid5_quiesce(struct mddev *
>  
>  	switch(state) {
>  	case 2: /* resume for a suspend */
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  		break;
>  
>  	case 1: /* stop all writes */
> @@ -5792,14 +5792,14 @@ static void raid5_quiesce(struct mddev *
>  		conf->quiesce = 1;
>  		spin_unlock_irq(&conf->device_lock);
>  		/* allow reshape to continue */
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  		break;
>  
>  	case 0: /* re-enable writes */
>  		spin_lock_irq(&conf->device_lock);
>  		conf->quiesce = 0;
>  		wake_up(&conf->wait_for_stripe);
> -		wake_up(&conf->wait_for_overlap);
> +		wake_up_all(&conf->wait_for_overlap);
>  		spin_unlock_irq(&conf->device_lock);
>  		break;
>  	}


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 01/10 v3] raid5: use wake_up_all for overlap waking
  2012-06-28  7:26   ` NeilBrown
@ 2012-06-28  8:53     ` Shaohua Li
  0 siblings, 0 replies; 32+ messages in thread
From: Shaohua Li @ 2012-06-28  8:53 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Thu, Jun 28, 2012 at 05:26:21PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:48 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > It's possible several tasks are waiting for stripe overlap. We clear R5_Overlap
> > bit and wake_up, but wake_up just wakes one task. So if there are several tasks
> > in the wait queue, some tasks will not be woken up even its strip R5_Overlap
> > clear. The end result is tasks hang in make_request.
> > 
> > wake_up_all should not introduce performance issue here, since overlap case is
> > rare.
> 
> This is not necessary.
> wake_up_all is only different from wake_up if WQ_FLAG_EXCLUSIVE is set, e.g.
> by prepare_to_wait_exclusive.
> As we don't use an exclusive wait to wait on wait_for_overlap, there is no
> point in using wake_up_all; wake_up already wakes everything up.

Oh, this is silly, sorry about it. So the only hang I hit is the one I
fixed in the second patch. The other patches can still be applied without
this one.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 02/10 v3] raid5: delayed stripe fix
  2012-06-25  7:24 ` [patch 02/10 v3] raid5: delayed stripe fix Shaohua Li
@ 2012-07-02  0:46   ` NeilBrown
  2012-07-02  0:49     ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2012-07-02  0:46 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 1827 bytes --]

On Mon, 25 Jun 2012 15:24:49 +0800 Shaohua Li <shli@kernel.org> wrote:

> There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
> the two bits have relationship. A delayed stripe can be moved to hold list only
> when preread active stripe count is below IO_THRESHOLD. If a stripe has both
> the bits set, such stripe will be in delayed list and preread count not 0,
> which will make such stripe never leave delayed list.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |    4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
> @@ -196,12 +196,14 @@ static void __release_stripe(struct r5co
>  		BUG_ON(!list_empty(&sh->lru));
>  		BUG_ON(atomic_read(&conf->active_stripes)==0);
>  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> -			if (test_bit(STRIPE_DELAYED, &sh->state))
> +			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> +			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				list_add_tail(&sh->lru, &conf->delayed_list);
>  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
>  				   sh->bm_seq - conf->seq_write > 0)
>  				list_add_tail(&sh->lru, &conf->bitmap_list);
>  			else {
> +				clear_bit(STRIPE_DELAYED, &sh->state);
>  				clear_bit(STRIPE_BIT_DELAY, &sh->state);
>  				list_add_tail(&sh->lru, &conf->handle_list);
>  			}

Thanks.  I've applied this patch and will submit it upstream shortly.

Have you actually seen a stripe get trapped with both bits set, or is this
just a theoretical problem discovered by code inspection?

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 02/10 v3] raid5: delayed stripe fix
  2012-07-02  0:46   ` NeilBrown
@ 2012-07-02  0:49     ` Shaohua Li
  2012-07-02  0:55       ` NeilBrown
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-07-02  0:49 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Mon, Jul 02, 2012 at 10:46:48AM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:49 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
> > the two bits have relationship. A delayed stripe can be moved to hold list only
> > when preread active stripe count is below IO_THRESHOLD. If a stripe has both
> > the bits set, such stripe will be in delayed list and preread count not 0,
> > which will make such stripe never leave delayed list.
> > 
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > ---
> >  drivers/md/raid5.c |    4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > Index: linux/drivers/md/raid5.c
> > ===================================================================
> > --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
> > +++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
> > @@ -196,12 +196,14 @@ static void __release_stripe(struct r5co
> >  		BUG_ON(!list_empty(&sh->lru));
> >  		BUG_ON(atomic_read(&conf->active_stripes)==0);
> >  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> > -			if (test_bit(STRIPE_DELAYED, &sh->state))
> > +			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> > +			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> >  				list_add_tail(&sh->lru, &conf->delayed_list);
> >  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> >  				   sh->bm_seq - conf->seq_write > 0)
> >  				list_add_tail(&sh->lru, &conf->bitmap_list);
> >  			else {
> > +				clear_bit(STRIPE_DELAYED, &sh->state);
> >  				clear_bit(STRIPE_BIT_DELAY, &sh->state);
> >  				list_add_tail(&sh->lru, &conf->handle_list);
> >  			}
> 
> Thanks.  I've applied this patch and will submit it upstream shortly.
> 
> Have you actually seen a stripe get trapped with both bits set, or is this
> just a theoretical problem discovered by code inspection?

I printed the flags of the stripe when there was an overlap sleep in
make_request(), and found this case, so this is real.

How do you think about the other patches in the series?

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-06-25  7:24 ` [patch 03/10 v3] raid5: add a per-stripe lock Shaohua Li
@ 2012-07-02  0:50   ` NeilBrown
  2012-07-02  3:16     ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2012-07-02  0:50 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 6119 bytes --]

On Mon, 25 Jun 2012 15:24:50 +0800 Shaohua Li <shli@kernel.org> wrote:

> Add a per-stripe lock to protect stripe specific data, like dev->read,
> written, ... The purpose is to reduce lock contention of conf->device_lock.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>

I had hoped to avoid having a per-stripe lock again, but it does look like it
is needed.
However I don't like the way you have split up these three patches - it makes
them a little hard to review.

I would like to see one patch which converts the bi_phys_segments access to
be atomic and also removes all the spin_lock calls that were just for
protecting that.

Then another patch which adds the new stripe_lock, clearly documenting
exactly what it protects (not just "like dev->read" but an explicit list)
and also removes any spin_lock of device_lock that is no longer needed.

Then I could see what is being added and what is being removed all in the one
patch and I can be sure that they balance.
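
Purely as a sketch of the sort of thing I mean for the first patch (the
exact helper shape is up to you - this is not code from the series), the
overloaded bi_phys_segments counter could be updated through atomic
helpers, e.g.:

	static inline int raid5_dec_bi_phys_segments(struct bio *bio)
	{
		atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

		/* low 16 bits hold the active-stripe count */
		return atomic_sub_return(1, segments) & 0xffff;
	}

Once every access goes through helpers like that, the spin_lock of
device_lock taken purely to protect bi_phys_segments can be dropped in
the same patch.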

Thanks,
NeilBrown


> ---
>  drivers/md/raid5.c |   17 +++++++++++++++++
>  drivers/md/raid5.h |    1 +
>  2 files changed, 18 insertions(+)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:37:13.651888057 +0800
> @@ -751,6 +751,7 @@ static void ops_complete_biofill(void *s
>  
>  	/* clear completed biofills */
>  	spin_lock_irq(&conf->device_lock);
> +	spin_lock(&sh->stripe_lock);
>  	for (i = sh->disks; i--; ) {
>  		struct r5dev *dev = &sh->dev[i];
>  
> @@ -776,6 +777,7 @@ static void ops_complete_biofill(void *s
>  			}
>  		}
>  	}
> +	spin_unlock(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
>  
> @@ -800,8 +802,10 @@ static void ops_run_biofill(struct strip
>  		if (test_bit(R5_Wantfill, &dev->flags)) {
>  			struct bio *rbi;
>  			spin_lock_irq(&conf->device_lock);
> +			spin_lock(&sh->stripe_lock);
>  			dev->read = rbi = dev->toread;
>  			dev->toread = NULL;
> +			spin_unlock(&sh->stripe_lock);
>  			spin_unlock_irq(&conf->device_lock);
>  			while (rbi && rbi->bi_sector <
>  				dev->sector + STRIPE_SECTORS) {
> @@ -1139,10 +1143,12 @@ ops_run_biodrain(struct stripe_head *sh,
>  			struct bio *wbi;
>  
>  			spin_lock_irq(&sh->raid_conf->device_lock);
> +			spin_lock(&sh->stripe_lock);
>  			chosen = dev->towrite;
>  			dev->towrite = NULL;
>  			BUG_ON(dev->written);
>  			wbi = dev->written = chosen;
> +			spin_unlock(&sh->stripe_lock);
>  			spin_unlock_irq(&sh->raid_conf->device_lock);
>  
>  			while (wbi && wbi->bi_sector <
> @@ -1448,6 +1454,8 @@ static int grow_one_stripe(struct r5conf
>  	init_waitqueue_head(&sh->ops.wait_for_ops);
>  	#endif
>  
> +	spin_lock_init(&sh->stripe_lock);
> +
>  	if (grow_buffers(sh)) {
>  		shrink_buffers(sh);
>  		kmem_cache_free(conf->slab_cache, sh);
> @@ -2329,6 +2337,7 @@ static int add_stripe_bio(struct stripe_
>  
>  
>  	spin_lock_irq(&conf->device_lock);
> +	spin_lock(&sh->stripe_lock);
>  	if (forwrite) {
>  		bip = &sh->dev[dd_idx].towrite;
>  		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
> @@ -2362,6 +2371,7 @@ static int add_stripe_bio(struct stripe_
>  		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
>  			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
>  	}
> +	spin_unlock(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  
>  	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
> @@ -2378,6 +2388,7 @@ static int add_stripe_bio(struct stripe_
>  
>   overlap:
>  	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
> +	spin_unlock(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  	return 0;
>  }
> @@ -2429,6 +2440,7 @@ handle_failed_stripe(struct r5conf *conf
>  			}
>  		}
>  		spin_lock_irq(&conf->device_lock);
> +		spin_lock(&sh->stripe_lock);
>  		/* fail all writes first */
>  		bi = sh->dev[i].towrite;
>  		sh->dev[i].towrite = NULL;
> @@ -2490,6 +2502,7 @@ handle_failed_stripe(struct r5conf *conf
>  				bi = nextbi;
>  			}
>  		}
> +		spin_unlock(&sh->stripe_lock);
>  		spin_unlock_irq(&conf->device_lock);
>  		if (bitmap_end)
>  			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
> @@ -2697,6 +2710,7 @@ static void handle_stripe_clean_event(st
>  				int bitmap_end = 0;
>  				pr_debug("Return write for disc %d\n", i);
>  				spin_lock_irq(&conf->device_lock);
> +				spin_lock(&sh->stripe_lock);
>  				wbi = dev->written;
>  				dev->written = NULL;
>  				while (wbi && wbi->bi_sector <
> @@ -2711,6 +2725,7 @@ static void handle_stripe_clean_event(st
>  				}
>  				if (dev->towrite == NULL)
>  					bitmap_end = 1;
> +				spin_unlock(&sh->stripe_lock);
>  				spin_unlock_irq(&conf->device_lock);
>  				if (bitmap_end)
>  					bitmap_endwrite(conf->mddev->bitmap,
> @@ -3170,6 +3185,7 @@ static void analyse_stripe(struct stripe
>  	/* Now to look around and see what can be done */
>  	rcu_read_lock();
>  	spin_lock_irq(&conf->device_lock);
> +	spin_lock(&sh->stripe_lock);
>  	for (i=disks; i--; ) {
>  		struct md_rdev *rdev;
>  		sector_t first_bad;
> @@ -3315,6 +3331,7 @@ static void analyse_stripe(struct stripe
>  				do_recovery = 1;
>  		}
>  	}
> +	spin_unlock(&sh->stripe_lock);
>  	spin_unlock_irq(&conf->device_lock);
>  	if (test_bit(STRIPE_SYNCING, &sh->state)) {
>  		/* If there is a failed device being replaced,
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h	2012-06-25 14:36:13.940638627 +0800
> +++ linux/drivers/md/raid5.h	2012-06-25 14:37:13.651888057 +0800
> @@ -210,6 +210,7 @@ struct stripe_head {
>  	int			disks;		/* disks in stripe */
>  	enum check_states	check_state;
>  	enum reconstruct_states reconstruct_state;
> +	spinlock_t		stripe_lock;
>  	/**
>  	 * struct stripe_operations
>  	 * @target - STRIPE_OP_COMPUTE_BLK target


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 02/10 v3] raid5: delayed stripe fix
  2012-07-02  0:49     ` Shaohua Li
@ 2012-07-02  0:55       ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  0:55 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 2337 bytes --]

On Mon, 2 Jul 2012 08:49:55 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Mon, Jul 02, 2012 at 10:46:48AM +1000, NeilBrown wrote:
> > On Mon, 25 Jun 2012 15:24:49 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> > > There isn't locking setting STRIPE_DELAYED and STRIPE_PREREAD_ACTIVE bits, but
> > > the two bits have relationship. A delayed stripe can be moved to hold list only
> > > when preread active stripe count is below IO_THRESHOLD. If a stripe has both
> > > the bits set, such stripe will be in delayed list and preread count not 0,
> > > which will make such stripe never leave delayed list.
> > > 
> > > Signed-off-by: Shaohua Li <shli@fusionio.com>
> > > ---
> > >  drivers/md/raid5.c |    4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > Index: linux/drivers/md/raid5.c
> > > ===================================================================
> > > --- linux.orig/drivers/md/raid5.c	2012-06-25 14:36:15.964613183 +0800
> > > +++ linux/drivers/md/raid5.c	2012-06-25 14:36:57.280096788 +0800
> > > @@ -196,12 +196,14 @@ static void __release_stripe(struct r5co
> > >  		BUG_ON(!list_empty(&sh->lru));
> > >  		BUG_ON(atomic_read(&conf->active_stripes)==0);
> > >  		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> > > -			if (test_bit(STRIPE_DELAYED, &sh->state))
> > > +			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> > > +			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> > >  				list_add_tail(&sh->lru, &conf->delayed_list);
> > >  			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> > >  				   sh->bm_seq - conf->seq_write > 0)
> > >  				list_add_tail(&sh->lru, &conf->bitmap_list);
> > >  			else {
> > > +				clear_bit(STRIPE_DELAYED, &sh->state);
> > >  				clear_bit(STRIPE_BIT_DELAY, &sh->state);
> > >  				list_add_tail(&sh->lru, &conf->handle_list);
> > >  			}
> > 
> > Thanks.  I've applied this patch and will submit it upstream shortly.
> > 
> > Have you actually seen a stripe get trapped with both bits set, or is this
> > just a theoretical problem discovered by code inspection?
> 
> I printed the flags of the stripe when there was an overlap sleep in
> make_request(), and found this case, so this is real.

OK, thanks.

> 
> How do you think about the other patches in the series?

I'll let you know shortly.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock
  2012-06-25  7:24 ` [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
@ 2012-07-02  0:57   ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  0:57 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 4513 bytes --]

On Mon, 25 Jun 2012 15:24:53 +0800 Shaohua Li <shli@kernel.org> wrote:

> release_stripe() is a place conf->device_lock is heavily contended. We take the
> lock even stripe count isn't 1, which isn't required.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   73 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 40 insertions(+), 33 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:37:21.000000000 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:38:13.899130571 +0800
> @@ -196,49 +196,56 @@ static int stripe_operations_active(stru
>  	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
>  }
>  
> -static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
> +static void handle_release_stripe(struct r5conf *conf, struct stripe_head *sh)
>  {
> -	if (atomic_dec_and_test(&sh->count)) {
> -		BUG_ON(!list_empty(&sh->lru));
> -		BUG_ON(atomic_read(&conf->active_stripes)==0);
> -		if (test_bit(STRIPE_HANDLE, &sh->state)) {
> -			if (test_bit(STRIPE_DELAYED, &sh->state) &&
> -			    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -				list_add_tail(&sh->lru, &conf->delayed_list);
> -			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> -				   sh->bm_seq - conf->seq_write > 0)
> -				list_add_tail(&sh->lru, &conf->bitmap_list);
> -			else {
> -				clear_bit(STRIPE_DELAYED, &sh->state);
> -				clear_bit(STRIPE_BIT_DELAY, &sh->state);
> -				list_add_tail(&sh->lru, &conf->handle_list);
> -			}
> -			md_wakeup_thread(conf->mddev->thread);
> -		} else {
> -			BUG_ON(stripe_operations_active(sh));
> -			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -				if (atomic_dec_return(&conf->preread_active_stripes)
> -				    < IO_THRESHOLD)
> -					md_wakeup_thread(conf->mddev->thread);
> -			atomic_dec(&conf->active_stripes);
> -			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> -				list_add_tail(&sh->lru, &conf->inactive_list);
> -				wake_up(&conf->wait_for_stripe);
> -				if (conf->retry_read_aligned)
> -					md_wakeup_thread(conf->mddev->thread);
> -			}
> +	BUG_ON(!list_empty(&sh->lru));
> +	BUG_ON(atomic_read(&conf->active_stripes)==0);
> +	if (test_bit(STRIPE_HANDLE, &sh->state)) {
> +		if (test_bit(STRIPE_DELAYED, &sh->state) &&
> +		    !test_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +			list_add_tail(&sh->lru, &conf->delayed_list);
> +		else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> +			   sh->bm_seq - conf->seq_write > 0)
> +			list_add_tail(&sh->lru, &conf->bitmap_list);
> +		else {
> +			clear_bit(STRIPE_DELAYED, &sh->state);
> +			clear_bit(STRIPE_BIT_DELAY, &sh->state);
> +			list_add_tail(&sh->lru, &conf->handle_list);
> +		}
> +		md_wakeup_thread(conf->mddev->thread);
> +	} else {
> +		BUG_ON(stripe_operations_active(sh));
> +		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +			if (atomic_dec_return(&conf->preread_active_stripes)
> +			    < IO_THRESHOLD)
> +				md_wakeup_thread(conf->mddev->thread);
> +		atomic_dec(&conf->active_stripes);
> +		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> +			list_add_tail(&sh->lru, &conf->inactive_list);
> +			wake_up(&conf->wait_for_stripe);
> +			if (conf->retry_read_aligned)
> +				md_wakeup_thread(conf->mddev->thread);
>  		}
>  	}
>  }
>  
> +static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
> +{
> +	if (atomic_dec_and_test(&sh->count))
> +		handle_release_stripe(conf, sh);
> +}
> +
>  static void release_stripe(struct stripe_head *sh)
>  {
>  	struct r5conf *conf = sh->raid_conf;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&conf->device_lock, flags);
> -	__release_stripe(conf, sh);
> -	spin_unlock_irqrestore(&conf->device_lock, flags);
> +	local_irq_save(flags);
> +	if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
> +		handle_release_stripe(conf, sh);
> +		spin_unlock(&conf->device_lock);
> +	}
> +	local_irq_restore(flags);
>  }
>  
>  static inline void remove_hash(struct stripe_head *sh)


Thanks.  I've applied this patch and it should appear in my -next shortly.
I renamed "handle_release_stripe" to "do_release_stripe", partly because I
think that is more consistent with practice in Linux, but mostly because
"handle" means something else inside raid5.c and I don't want to encourage
confusion.

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 07/10 v3] md: personality can provide unplug private data
  2012-06-25  7:24 ` [patch 07/10 v3] md: personality can provide unplug private data Shaohua Li
@ 2012-07-02  1:06   ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  1:06 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 6432 bytes --]

On Mon, 25 Jun 2012 15:24:54 +0800 Shaohua Li <shli@kernel.org> wrote:

> Allow personality providing unplug private data. Next patch will use it.

Thanks. I've applied this with a couple of minor changes.

In particular I changed the 'size' arg to be the total size of the
plug structure, not the amount to add to the end.  I also changed it
to use kzalloc rather than an extra memset.
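
With that change, a personality embeds md_plug_cb at the start of its own
structure and passes the full size; as a sketch only (reusing the
raid5_plug_cb structure from the reworked patch 8 later in this thread):

	struct raid5_plug_cb {
		struct md_plug_cb	cb;
		struct list_head	list;
	};

	mdcb = mddev_check_plugged(mddev, raid5_unplug,
				   sizeof(struct raid5_plug_cb));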

Thanks,
NeilBrown


> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/md.c     |   31 +++++++++++++------------------
>  drivers/md/md.h     |   20 +++++++++++++++++++-
>  drivers/md/raid1.c  |    2 +-
>  drivers/md/raid10.c |    2 +-
>  drivers/md/raid5.c  |    2 +-
>  5 files changed, 35 insertions(+), 22 deletions(-)
> 
> Index: linux/drivers/md/md.c
> ===================================================================
> --- linux.orig/drivers/md/md.c	2012-06-25 14:36:13.668642048 +0800
> +++ linux/drivers/md/md.c	2012-06-25 14:38:33.106889041 +0800
> @@ -498,22 +498,13 @@ void md_flush_request(struct mddev *mdde
>  }
>  EXPORT_SYMBOL(md_flush_request);
>  
> -/* Support for plugging.
> - * This mirrors the plugging support in request_queue, but does not
> - * require having a whole queue or request structures.
> - * We allocate an md_plug_cb for each md device and each thread it gets
> - * plugged on.  This links tot the private plug_handle structure in the
> - * personality data where we keep a count of the number of outstanding
> - * plugs so other code can see if a plug is active.
> - */
> -struct md_plug_cb {
> -	struct blk_plug_cb cb;
> -	struct mddev *mddev;
> -};
>  
>  static void plugger_unplug(struct blk_plug_cb *cb)
>  {
>  	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);
> +
> +	if (mdcb->unplug)
> +		mdcb->unplug(mdcb);
>  	if (atomic_dec_and_test(&mdcb->mddev->plug_cnt))
>  		md_wakeup_thread(mdcb->mddev->thread);
>  	kfree(mdcb);
> @@ -522,13 +513,14 @@ static void plugger_unplug(struct blk_pl
>  /* Check that an unplug wakeup will come shortly.
>   * If not, wakeup the md thread immediately
>   */
> -int mddev_check_plugged(struct mddev *mddev)
> +struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
> +	md_unplug_func_t unplug, size_t size)
>  {
>  	struct blk_plug *plug = current->plug;
>  	struct md_plug_cb *mdcb;
>  
>  	if (!plug)
> -		return 0;
> +		return NULL;
>  
>  	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
>  		if (mdcb->cb.callback == plugger_unplug &&
> @@ -538,19 +530,22 @@ int mddev_check_plugged(struct mddev *md
>  						    struct md_plug_cb,
>  						    cb.list))
>  				list_move(&mdcb->cb.list, &plug->cb_list);
> -			return 1;
> +			return mdcb;
>  		}
>  	}
>  	/* Not currently on the callback list */
> -	mdcb = kmalloc(sizeof(*mdcb), GFP_ATOMIC);
> +	mdcb = kmalloc(sizeof(*mdcb) + size, GFP_ATOMIC);
>  	if (!mdcb)
> -		return 0;
> +		return NULL;
>  
>  	mdcb->mddev = mddev;
>  	mdcb->cb.callback = plugger_unplug;
>  	atomic_inc(&mddev->plug_cnt);
>  	list_add(&mdcb->cb.list, &plug->cb_list);
> -	return 1;
> +	mdcb->unplug = unplug;
> +	if (size)
> +		memset((void *)(mdcb + 1), 0, size);
> +	return mdcb;
>  }
>  EXPORT_SYMBOL_GPL(mddev_check_plugged);
>  
> Index: linux/drivers/md/md.h
> ===================================================================
> --- linux.orig/drivers/md/md.h	2012-06-25 14:36:13.676641948 +0800
> +++ linux/drivers/md/md.h	2012-06-25 14:38:33.106889041 +0800
> @@ -630,6 +630,24 @@ extern struct bio *bio_clone_mddev(struc
>  				   struct mddev *mddev);
>  extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
>  				   struct mddev *mddev);
> -extern int mddev_check_plugged(struct mddev *mddev);
> +
> +/* Support for plugging.
> + * This mirrors the plugging support in request_queue, but does not
> + * require having a whole queue or request structures.
> + * We allocate an md_plug_cb for each md device and each thread it gets
> + * plugged on.  This links tot the private plug_handle structure in the
> + * personality data where we keep a count of the number of outstanding
> + * plugs so other code can see if a plug is active.
> + */
> +struct md_plug_cb;
> +typedef void (*md_unplug_func_t)(struct md_plug_cb *mdcb);
> +struct md_plug_cb {
> +	struct blk_plug_cb cb;
> +	struct mddev *mddev;
> +	md_unplug_func_t unplug;
> +};
> +
> +extern struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
> +	md_unplug_func_t unplug, size_t size);
>  extern void md_trim_bio(struct bio *bio, int offset, int size);
>  #endif /* _MD_MD_H */
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c	2012-06-25 14:36:13.696641695 +0800
> +++ linux/drivers/md/raid1.c	2012-06-25 14:38:33.110889008 +0800
> @@ -1034,7 +1034,7 @@ read_again:
>  	 * the bad blocks.  Each set of writes gets it's own r1bio
>  	 * with a set of bios attached.
>  	 */
> -	plugged = mddev_check_plugged(mddev);
> +	plugged = !!mddev_check_plugged(mddev, NULL, 0);
>  
>  	disks = conf->raid_disks * 2;
>   retry_write:
> Index: linux/drivers/md/raid10.c
> ===================================================================
> --- linux.orig/drivers/md/raid10.c	2012-06-25 14:36:13.684641847 +0800
> +++ linux/drivers/md/raid10.c	2012-06-25 14:38:33.110889008 +0800
> @@ -1239,7 +1239,7 @@ read_again:
>  	 * of r10_bios is recored in bio->bi_phys_segments just as with
>  	 * the read case.
>  	 */
> -	plugged = mddev_check_plugged(mddev);
> +	plugged = !!mddev_check_plugged(mddev, NULL, 0);
>  
>  	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
>  	raid10_find_phys(conf, r10_bio);
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:13.899130571 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:38:33.110889008 +0800
> @@ -4012,7 +4012,7 @@ static void make_request(struct mddev *m
>  	bi->bi_next = NULL;
>  	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
>  
> -	plugged = mddev_check_plugged(mddev);
> +	plugged = !!mddev_check_plugged(mddev, NULL, 0);
>  	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
>  		DEFINE_WAIT(w);
>  		int previous;


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 08/10 v3] raid5: make_request use batch stripe release
  2012-06-25  7:24 ` [patch 08/10 v3] raid5: make_request use batch stripe release Shaohua Li
@ 2012-07-02  2:31   ` NeilBrown
  2012-07-02  2:59     ` Shaohua Li
  0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2012-07-02  2:31 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 9493 bytes --]

On Mon, 25 Jun 2012 15:24:55 +0800 Shaohua Li <shli@kernel.org> wrote:

> make_request() does stripe release for every stripe and the stripe usually has
> count 1, which makes previous release_stripe() optimization not work. In my
> test, this release_stripe() becomes the heaviest pleace to take
> conf->device_lock after previous patches applied.
> 
> Below patch makes stripe release batch. All the stripes will be released in
> unplug. The STRIPE_ON_UNPLUG_LIST bit is to protect concurrent access stripe
> lru.
> 

I've applied this patch, but I'm afraid I butchered it a bit first :-)


> @@ -3984,6 +3985,51 @@ static struct stripe_head *__get_priorit
>  	return sh;
>  }
>  
> +#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)

I really don't like this sort of construct.  It is much cleaner (I think) to
add to a structure by embedding it in a larger structure, then using
"container_of" to map from the inner to the outer structure.  So I have
changed that.

> @@ -4114,7 +4161,14 @@ static void make_request(struct mddev *m
>  			if ((bi->bi_rw & REQ_SYNC) &&
>  			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				atomic_inc(&conf->preread_active_stripes);
> -			release_stripe(sh);
> +			/*
> +			 * We must recheck here. schedule() might be called
> +			 * above which makes unplug invoked already, so the old
> +			 * mdcb is invalid
> +			 */

I agree that this is an important check, but as a 'schedule()' can
theoretically happen at any time that preempt isn't explicitly disabled, we
really need to be even more careful.  So I have changed the md code to
disable preempt, and require the caller to re-enable preempt after it has
used the returned value.

The resulting series should appear in my for-next shortly.  However for
easier review I'll include two patches below.  The first changes
mddev_check_plugged to disable preemption.
The second is a diff against your patch which changes it to use an embedded
structure and container_of.
I haven't actually tested this yet, so there may be further changes.

Thanks,
NeilBrown

From 04b7dd7d0ad4a21622cad7c10821f914a8d9ccd3 Mon Sep 17 00:00:00 2001
From: NeilBrown <neilb@suse.de>
Date: Mon, 2 Jul 2012 12:14:49 +1000
Subject: [PATCH] md/plug: disable preempt when reported a plug is present.

As 'schedule' will unplug a queue, a plug added by
mddev_check_plugged is only valid until the next schedule().
So call preempt_disable before installing the plug, and require the
caller to call preempt_enable once the value has been used.

Signed-off-by: NeilBrown  <neilb@suse.de>

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 1369c9d..63ea6d6 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -512,6 +512,10 @@ static void plugger_unplug(struct blk_plug_cb *cb)
 
 /* Check that an unplug wakeup will come shortly.
  * If not, wakeup the md thread immediately
+ * Note that the structure returned is only valid until
+ * the next schedule(), so preemption is disabled when it
+ * is not NULL, and must be re-enabled after the value
+ * has been used.
  */
 struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
 				       md_unplug_func_t unplug, size_t size)
@@ -522,6 +526,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
 	if (!plug)
 		return NULL;
 
+	preempt_disable();
 	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
 		if (mdcb->cb.callback == plugger_unplug &&
 		    mdcb->mddev == mddev) {
@@ -533,6 +538,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
 			return mdcb;
 		}
 	}
+	preempt_enable();
 	/* Not currently on the callback list */
 	if (size < sizeof(*mdcb))
 		size = sizeof(*mdcb);
@@ -540,6 +546,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
 	if (!mdcb)
 		return NULL;
 
+	preempt_disable();
 	mdcb->mddev = mddev;
 	mdcb->cb.callback = plugger_unplug;
 	atomic_inc(&mddev->plug_cnt);
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index ebce488..2e19b68 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -883,7 +883,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
 	const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	struct md_rdev *blocked_rdev;
-	int plugged;
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
@@ -1034,8 +1033,6 @@ read_again:
 	 * the bad blocks.  Each set of writes gets it's own r1bio
 	 * with a set of bios attached.
 	 */
-	plugged = !!mddev_check_plugged(mddev, NULL, 0);
-
 	disks = conf->raid_disks * 2;
  retry_write:
 	blocked_rdev = NULL;
@@ -1214,8 +1211,11 @@ read_again:
 	/* In case raid1d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
 
-	if (do_sync || !bitmap || !plugged)
+	if (do_sync ||
+	    !mddev_check_plugged(mddev, NULL, 0))
 		md_wakeup_thread(mddev->thread);
+	else
+		preempt_enable();
 }
 
 static void status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 978a996..54f4b33 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1039,7 +1039,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	const unsigned long do_fua = (bio->bi_rw & REQ_FUA);
 	unsigned long flags;
 	struct md_rdev *blocked_rdev;
-	int plugged;
 	int sectors_handled;
 	int max_sectors;
 	int sectors;
@@ -1239,7 +1238,6 @@ read_again:
 	 * of r10_bios is recored in bio->bi_phys_segments just as with
 	 * the read case.
 	 */
-	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 
 	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
 	raid10_find_phys(conf, r10_bio);
@@ -1449,8 +1447,11 @@ retry_write:
 	/* In case raid10d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
 
-	if (do_sync || !mddev->bitmap || !plugged)
+	if (do_sync ||
+	    !mddev_check_plugged(mddev, NULL, 0))
 		md_wakeup_thread(mddev->thread);
+	else
+		preempt_enable();
 }
 
 static void status(struct seq_file *seq, struct mddev *mddev)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 0a23037..e9e920c 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4004,7 +4004,6 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
 	int remaining;
-	int plugged;
 
 	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
 		md_flush_request(mddev, bi);
@@ -4023,7 +4022,6 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
-	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int previous;
@@ -4134,8 +4132,10 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 		}
 			
 	}
-	if (!plugged)
+	if (!mddev_check_plugged(mddev, NULL, 0))
 		md_wakeup_thread(mddev->thread);
+	else
+		preempt_enable();
 
 	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_phys_segments(bi);






diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index fc98086..ef3baa4 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3996,18 +3996,23 @@ static struct stripe_head *__get_priority_stripe(struct r5conf *conf)
 	return sh;
 }
 
-#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)
+struct raid5_plug_cb {
+	struct md_plug_cb	cb;
+	struct list_head	list;
+};
+
 static void raid5_unplug(struct md_plug_cb *mdcb)
 {
-	struct list_head *list = raid5_unplug_list(mdcb);
+	struct raid5_plug_cb *cb = container_of(
+		mdcb, struct raid5_plug_cb, cb);
 	struct stripe_head *sh;
 	struct r5conf *conf = mdcb->mddev->private;
 
-	if (list->next == NULL || list_empty(list))
+	if (cb->list.next == NULL || list_empty(&cb->list))
 		return;
 	spin_lock_irq(&conf->device_lock);
-	while (!list_empty(list)) {
-		sh = list_entry(list->next, struct stripe_head, lru);
+	while (!list_empty(&cb->list)) {
+		sh = list_first_entry(&cb->list, struct stripe_head, lru);
 		list_del_init(&sh->lru);
 		/*
 		 * avoid race release_stripe_plug() sees STRIPE_ON_UNPLUG_LIST
@@ -4024,20 +4029,20 @@ static void release_stripe_plug(struct mddev *mddev,
 				struct stripe_head *sh)
 {
 	struct md_plug_cb *mdcb = mddev_check_plugged(mddev, raid5_unplug,
-						      sizeof(struct list_head));
-	struct list_head *list = raid5_unplug_list(mdcb);
+						      sizeof(struct raid5_plug_cb));
+	struct raid5_plug_cb *cb;
 	if (!mdcb) {
 		release_stripe(sh);
 		return;
 	}
 
-	if (list->next == NULL) {
-		INIT_LIST_HEAD(list);
-		mdcb->unplug = raid5_unplug;
-	}
+	cb = container_of(mdcb, struct raid5_plug_cb, cb);
+
+	if (cb->list.next == NULL)
+		INIT_LIST_HEAD(&cb->list);
 
 	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
-		list_add_tail(&sh->lru, list);
+		list_add_tail(&sh->lru, &cb->list);
 	else
 		release_stripe(sh);
 	preempt_enable();
@@ -4180,7 +4185,7 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 		}
 			
 	}
-	if (!mddev_check_plugged(mddev, raid5_unplug, sizeof(struct list_head)))
+	if (!mddev_check_plugged(mddev, raid5_unplug, sizeof(struct raid5_plug_cb)))
 		md_wakeup_thread(mddev->thread);
 	else
 		preempt_enable();



[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [patch 09/10 v3] raid5: raid5d handle stripe in batch way
  2012-06-25  7:24 ` [patch 09/10 v3] raid5: raid5d handle stripe in batch way Shaohua Li
@ 2012-07-02  2:32   ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  2:32 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 2721 bytes --]

On Mon, 25 Jun 2012 15:24:56 +0800 Shaohua Li <shli@kernel.org> wrote:

> Let raid5d handle stripe in batch way to reduce conf->device_lock locking.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   45 ++++++++++++++++++++++++++++++++-------------
>  1 file changed, 32 insertions(+), 13 deletions(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:38:37.378835415 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:38:40.150800523 +0800
> @@ -4568,6 +4568,30 @@ static int  retry_aligned_read(struct r5
>  	return handled;
>  }
>  
> +#define MAX_STRIPE_BATCH 8
> +static int handle_active_stripes(struct r5conf *conf)
> +{
> +	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
> +	int i, batch_size = 0;
> +
> +	while (batch_size < MAX_STRIPE_BATCH &&
> +			(sh = __get_priority_stripe(conf)) != NULL)
> +		batch[batch_size++] = sh;
> +
> +	if (batch_size == 0)
> +		return batch_size;
> +	spin_unlock_irq(&conf->device_lock);
> +
> +	for (i = 0; i < batch_size; i++)
> +		handle_stripe(batch[i]);
> +
> +	cond_resched();
> +
> +	spin_lock_irq(&conf->device_lock);
> +	for (i = 0; i < batch_size; i++)
> +		__release_stripe(conf, batch[i]);
> +	return batch_size;
> +}
>  
>  /*
>   * This is our raid5 kernel thread.
> @@ -4578,7 +4602,6 @@ static int  retry_aligned_read(struct r5
>   */
>  static void raid5d(struct mddev *mddev)
>  {
> -	struct stripe_head *sh;
>  	struct r5conf *conf = mddev->private;
>  	int handled;
>  	struct blk_plug plug;
> @@ -4592,6 +4615,7 @@ static void raid5d(struct mddev *mddev)
>  	spin_lock_irq(&conf->device_lock);
>  	while (1) {
>  		struct bio *bio;
> +		int batch_size;
>  
>  		if (atomic_read(&mddev->plug_cnt) == 0 &&
>  		    !list_empty(&conf->bitmap_list)) {
> @@ -4616,21 +4640,16 @@ static void raid5d(struct mddev *mddev)
>  			handled++;
>  		}
>  
> -		sh = __get_priority_stripe(conf);
> -
> -		if (!sh)
> +		batch_size = handle_active_stripes(conf);
> +		if (!batch_size)
>  			break;
> -		spin_unlock_irq(&conf->device_lock);
> -		
> -		handled++;
> -		handle_stripe(sh);
> -		release_stripe(sh);
> -		cond_resched();
> +		handled += batch_size;
>  
> -		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
> +		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
> +			spin_unlock_irq(&conf->device_lock);
>  			md_check_recovery(mddev);
> -
> -		spin_lock_irq(&conf->device_lock);
> +			spin_lock_irq(&conf->device_lock);
> +		}
>  	}
>  	pr_debug("%d stripes handled\n", handled);
>  
> 


I've applied this patch, thanks.

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 10/10 v3] raid5: create multiple threads to handle stripes
  2012-06-25  7:24 ` [patch 10/10 v3] raid5: create multiple threads to handle stripes Shaohua Li
@ 2012-07-02  2:39   ` NeilBrown
  2012-07-02 20:03   ` Dan Williams
  1 sibling, 0 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  2:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 10513 bytes --]

On Mon, 25 Jun 2012 15:24:57 +0800 Shaohua Li <shli@kernel.org> wrote:

> Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
> thread becomes a bottleneck. raid5 can offload calculation like checksum to
> async threads. And if storge is fast, scheduling async work and running async
> work will introduce heavy lock contention of workqueue, which makes such
> optimization useless. And calculation isn't the only bottleneck. For example,
> in my test raid5 thread must handle > 450k requests per second. Just doing
> dispatch and completion will make raid5 thread incapable. The only chance to
> scale is using several threads to handle stripe.
> 
> With this patch, user can create several extra threads to handle stripe. How
> many threads are better depending on disk number, so the thread number can be
> changed in userspace. By default, the thread number is 0, which means no extra
> thread.
> 
> In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
> improvement (double stripe_cache_size) and the throughput is pretty close to
> theory value. With >=4 disks, the improvement is even bigger, for example, can
> improve 200% for 4-disk setup, but the throughput is far less than theory
> value, which is caused by several factors like request queue lock contention,
> cache issue, latency introduced by how a stripe is handled in different disks.
> Those factors need further investigations.
> 
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |  137 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  drivers/md/raid5.h |    3 +
>  2 files changed, 139 insertions(+), 1 deletion(-)
> 
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c	2012-06-25 14:58:06.420138526 +0800
> +++ linux/drivers/md/raid5.c	2012-06-25 14:58:06.428138426 +0800
> @@ -211,6 +211,7 @@ static void handle_release_stripe(struct
>  			clear_bit(STRIPE_DELAYED, &sh->state);
>  			clear_bit(STRIPE_BIT_DELAY, &sh->state);
>  			list_add_tail(&sh->lru, &conf->handle_list);
> +			conf->pending_stripes++;
>  		}
>  		md_wakeup_thread(conf->mddev->thread);
>  	} else {
> @@ -489,6 +490,10 @@ get_active_stripe(struct r5conf *conf, s
>  			} else {
>  				if (!test_bit(STRIPE_HANDLE, &sh->state))
>  					atomic_inc(&conf->active_stripes);
> +				else if (!list_empty(&sh->lru)
> +					 && !test_bit(STRIPE_DELAYED, &sh->state)
> +					 && !test_bit(STRIPE_BIT_DELAY, &sh->state))
> +					conf->pending_stripes--;
>  				if (list_empty(&sh->lru) &&
>  				    !test_bit(STRIPE_EXPANDING, &sh->state))
>  					BUG();
> @@ -3670,6 +3675,7 @@ static void raid5_activate_delayed(struc
>  			if (!test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
>  				atomic_inc(&conf->preread_active_stripes);
>  			list_add_tail(&sh->lru, &conf->hold_list);
> +			conf->pending_stripes++;
>  		}
>  	}
>  }
> @@ -3979,6 +3985,7 @@ static struct stripe_head *__get_priorit
>  	} else
>  		return NULL;
>  
> +	conf->pending_stripes--;
>  	list_del_init(&sh->lru);
>  	atomic_inc(&sh->count);
>  	BUG_ON(atomic_read(&sh->count) != 1);
> @@ -4593,6 +4600,33 @@ static int handle_active_stripes(struct
>  	return batch_size;
>  }
>  
> +static void raid5auxd(struct mddev *mddev)
> +{
> +	struct r5conf *conf = mddev->private;
> +	struct blk_plug plug;
> +	int handled;
> +
> +	pr_debug("+++ raid5auxd active\n");
> +
> +	blk_start_plug(&plug);
> +	handled = 0;
> +	spin_lock_irq(&conf->device_lock);
> +	while (1) {
> +		int batch_size;
> +
> +		batch_size = handle_active_stripes(conf);
> +		if (!batch_size)
> +			break;
> +		handled += batch_size;
> +	}
> +	pr_debug("%d stripes handled\n", handled);
> +
> +	spin_unlock_irq(&conf->device_lock);
> +	blk_finish_plug(&plug);
> +
> +	pr_debug("--- raid5auxd inactive\n");
> +}
> +
>  /*
>   * This is our raid5 kernel thread.
>   *
> @@ -4615,7 +4649,7 @@ static void raid5d(struct mddev *mddev)
>  	spin_lock_irq(&conf->device_lock);
>  	while (1) {
>  		struct bio *bio;
> -		int batch_size;
> +		int batch_size, i;
>  
>  		if (atomic_read(&mddev->plug_cnt) == 0 &&
>  		    !list_empty(&conf->bitmap_list)) {
> @@ -4645,6 +4679,10 @@ static void raid5d(struct mddev *mddev)
>  			break;
>  		handled += batch_size;
>  
> +		for (i = 0; i < conf->aux_thread_num
> +		     && i < conf->pending_stripes/MAX_STRIPE_BATCH + 1; i++)
> +			md_wakeup_thread(conf->aux_threads[i]);
> +
>  		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
>  			spin_unlock_irq(&conf->device_lock);
>  			md_check_recovery(mddev);
> @@ -4769,10 +4807,85 @@ stripe_cache_active_show(struct mddev *m
>  static struct md_sysfs_entry
>  raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
>  
> +static ssize_t
> +raid5_show_auxthread_number(struct mddev *mddev, char *page)
> +{
> +	struct r5conf *conf = mddev->private;
> +	if (conf)
> +		return sprintf(page, "%d\n", conf->aux_thread_num);
> +	else
> +		return 0;
> +}
> +
> +static ssize_t
> +raid5_store_auxthread_number(struct mddev *mddev, const char *page, size_t len)
> +{
> +	struct r5conf *conf = mddev->private;
> +	unsigned long new;
> +	int i;
> +	struct md_thread **threads;
> +
> +	if (len >= PAGE_SIZE)
> +		return -EINVAL;
> +	if (!conf)
> +		return -ENODEV;
> +
> +	if (strict_strtoul(page, 10, &new))
> +		return -EINVAL;
> +
> +	if (new == conf->aux_thread_num)
> +		return len;
> +
> +	if (new > conf->aux_thread_num) {
> +		threads = kmalloc(sizeof(struct md_thread *) * new, GFP_KERNEL);
> +		if (!threads)
> +			return -EFAULT;
> +
> +		i = conf->aux_thread_num;
> +		while (i < new) {
> +			char name[10];
> +
> +			sprintf(name, "aux%d", i);
> +			threads[i] = md_register_thread(raid5auxd, mddev, name);
> +			if (!threads[i])
> +				goto error;
> +			i++;
> +		}
> +		memcpy(threads, conf->aux_threads,
> +			sizeof(struct md_thread *) * conf->aux_thread_num);
> +		spin_lock_irq(&conf->device_lock);
> +		kfree(conf->aux_threads);
> +		conf->aux_threads = threads;
> +		conf->aux_thread_num = new;
> +		spin_unlock_irq(&conf->device_lock);
> +	} else {
> +		int old = conf->aux_thread_num;
> +
> +		spin_lock_irq(&conf->device_lock);
> +		conf->aux_thread_num = new;
> +		spin_unlock_irq(&conf->device_lock);
> +		for (i = new; i < old; i++)
> +			md_unregister_thread(&conf->aux_threads[i]);
> +	}
> +
> +	return len;
> +error:
> +	while (--i >= conf->aux_thread_num)
> +		md_unregister_thread(&threads[i]);
> +	kfree(threads);
> +	return -EFAULT;
> +}
> +
> +static struct md_sysfs_entry
> +raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
> +				raid5_show_auxthread_number,
> +				raid5_store_auxthread_number);
> +
>  static struct attribute *raid5_attrs[] =  {
>  	&raid5_stripecache_size.attr,
>  	&raid5_stripecache_active.attr,
>  	&raid5_preread_bypass_threshold.attr,
> +	&raid5_auxthread_number.attr,
>  	NULL,
>  };
>  static struct attribute_group raid5_attrs_group = {
> @@ -4820,6 +4933,7 @@ static void raid5_free_percpu(struct r5c
>  
>  static void free_conf(struct r5conf *conf)
>  {
> +	kfree(conf->aux_threads);
>  	shrink_stripes(conf);
>  	raid5_free_percpu(conf);
>  	kfree(conf->disks);
> @@ -4914,6 +5028,7 @@ static struct r5conf *setup_conf(struct
>  	int raid_disk, memory, max_disks;
>  	struct md_rdev *rdev;
>  	struct disk_info *disk;
> +	int i;
>  
>  	if (mddev->new_level != 5
>  	    && mddev->new_level != 4
> @@ -5037,6 +5152,22 @@ static struct r5conf *setup_conf(struct
>  		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
>  		       mdname(mddev), memory);
>  
> +	/* By default, auxthread number equals to disk number */
> +	conf->aux_threads = kmalloc(sizeof(struct md_thread *) * max_disks,
> +				    GFP_KERNEL);
> +	if (!conf->aux_threads)
> +		goto abort;
> +	for (i = 0; i < max_disks; i++) {
> +		char name[10];
> +
> +		sprintf(name, "aux%d", i);
> +		conf->aux_threads[i] = md_register_thread(raid5auxd, mddev, name);
> +		if (!conf->aux_threads[i])
> +			break;
> +	}
> +
> +	conf->aux_thread_num = i;
> +
>  	conf->thread = md_register_thread(raid5d, mddev, NULL);
>  	if (!conf->thread) {
>  		printk(KERN_ERR
> @@ -5376,6 +5507,10 @@ abort:
>  static int stop(struct mddev *mddev)
>  {
>  	struct r5conf *conf = mddev->private;
> +	int i;
> +
> +	for (i = 0; i < conf->aux_thread_num; i++)
> +		md_unregister_thread(&conf->aux_threads[i]);
>  
>  	md_unregister_thread(&mddev->thread);
>  	if (mddev->queue)
> Index: linux/drivers/md/raid5.h
> ===================================================================
> --- linux.orig/drivers/md/raid5.h	2012-06-25 14:58:06.408138677 +0800
> +++ linux/drivers/md/raid5.h	2012-06-25 14:58:06.432138376 +0800
> @@ -450,6 +450,7 @@ struct r5conf {
>  	int			inactive_blocked;	/* release of inactive stripes blocked,
>  							 * waiting for 25% to be free
>  							 */
> +	int			pending_stripes;
>  	int			pool_size; /* number of disks in stripeheads in pool */
>  	spinlock_t		device_lock;
>  	struct disk_info	*disks;
> @@ -458,6 +459,8 @@ struct r5conf {
>  	 * the new thread here until we fully activate the array.
>  	 */
>  	struct md_thread	*thread;
> +	int			aux_thread_num;
> +	struct md_thread	**aux_threads;
>  };
>  
>  /*


Hi,
 I'm certainly interested in this patch, but I'm not going to apply it yet,
 partly because I want all the other bits to settle and be well tested first.

 I'm still uncomfortable about setting an explicit number of threads...

 I wonder if a different approach might be useful, i.e. add an ioctl (or
 similar) which allows a normal user thread to start handling raid5 requests.
 Then instead of telling the kernel how many threads to start, we just start
 the right number of processes, bind them to CPUs or whatever might be
 wanted, then call the ioctl.
 Possibly the ioctl would return whenever it runs out of work to do, and this
 could be used somehow to dynamically adjust the number of threads.

 I haven't really thought this through fully yet so it might not work, but
 I'd like to explore the possibility of having the number of threads
 adjusted automatically, and that probably means allowing user-space a fair
 bit of control and providing it with a fair bit of information.
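
 A rough userspace sketch of that idea follows.  The MD_HANDLE_STRIPES ioctl,
 its number and its return-when-idle behaviour are all invented here for
 illustration; no such interface exists yet:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MD_HANDLE_STRIPES	_IO(0x9b, 0x42)	/* made-up ioctl number */

static void *stripe_worker(void *arg)
{
	long cpu = (long)arg;
	cpu_set_t mask;
	int fd;

	/* bind this worker wherever userspace policy wants it */
	CPU_ZERO(&mask);
	CPU_SET(cpu, &mask);
	pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);

	fd = open("/dev/md0", O_RDWR);
	if (fd < 0)
		return NULL;
	/* donate this thread to the array; the (imaginary) ioctl returns
	 * when there are no more stripes to handle, so looping or exiting
	 * here is how the pool would grow or shrink dynamically */
	while (ioctl(fd, MD_HANDLE_STRIPES, 0) == 0)
		;
	close(fd);
	return NULL;
}

int main(void)
{
	long i, n = sysconf(_SC_NPROCESSORS_ONLN);
	pthread_t tid[64];

	if (n > 64)
		n = 64;
	for (i = 0; i < n; i++)
		pthread_create(&tid[i], NULL, stripe_worker, (void *)i);
	for (i = 0; i < n; i++)
		pthread_join(tid[i], NULL);
	return 0;
}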

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 08/10 v3] raid5: make_request use batch stripe release
  2012-07-02  2:31   ` NeilBrown
@ 2012-07-02  2:59     ` Shaohua Li
  2012-07-02  5:07       ` NeilBrown
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-07-02  2:59 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Mon, Jul 02, 2012 at 12:31:12PM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:55 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > make_request() does stripe release for every stripe and the stripe usually has
> > count 1, which makes the previous release_stripe() optimization ineffective. In my
> > test, this release_stripe() becomes the heaviest place to take
> > conf->device_lock after the previous patches are applied.
> > 
> > The patch below batches stripe release. All the stripes will be released at
> > unplug time. The STRIPE_ON_UNPLUG_LIST bit protects against concurrent access
> > to the stripe lru.
> > 
> 
> I've applied this patch, but I'm afraid I butchered it a bit first :-)
> 
> 
> > @@ -3984,6 +3985,51 @@ static struct stripe_head *__get_priorit
> >  	return sh;
> >  }
> >  
> > +#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)
> 
> I really don't like this sort of construct.  It is much cleaner (I think) to
> add to a structure by embedding it in a larger structure, then using
> "container_of" to map from the inner to the outer structure.  So I have
> changed that.

Thanks.
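
For reference, the embed-and-container_of construction being suggested looks
roughly like this; the structure, field and callback names below are purely
illustrative, not the actual change:

/* Embed the generic plug callback in a raid5-specific structure instead of
 * allocating extra space behind it, then recover the outer structure with
 * container_of() inside the unplug callback (assuming the personality
 * callback receives the md_plug_cb).
 */
struct raid5_plug_cb {
	struct md_plug_cb	mdcb;		/* generic part, embedded */
	struct list_head	stripe_list;	/* stripes parked until unplug */
};

static void raid5_unplug(struct md_plug_cb *mdcb)
{
	struct raid5_plug_cb *cb = container_of(mdcb, struct raid5_plug_cb, mdcb);
	struct stripe_head *sh, *next;

	/* release every stripe that make_request() parked on the list */
	list_for_each_entry_safe(sh, next, &cb->stripe_list, lru) {
		list_del_init(&sh->lru);
		release_stripe(sh);
	}
}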
 
> > @@ -4114,7 +4161,14 @@ static void make_request(struct mddev *m
> >  			if ((bi->bi_rw & REQ_SYNC) &&
> >  			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> >  				atomic_inc(&conf->preread_active_stripes);
> > -			release_stripe(sh);
> > +			/*
> > +			 * We must recheck here. schedule() might have been
> > +			 * called above, which would have run the unplug
> > +			 * already, making the old mdcb invalid
> > +			 */
> 
> I agree that this is an important check, but as a 'schedule()' can
> theoretically happen at any time that preempt isn't explicitly disabled, we
> really need to be even more careful.  So I have changed the md code to
> disable preempt, and require the caller to re-enable preempt after it has
> used the returned value.
> 
> The resulting series should appear in my for-next shortly.  However for
> easier review I'll include two patches below.  The first changes
> mddev_check_plugged to disable preemption.
> The second is a diff against your patch which changes it to use an embedded
> structure and container_of.
> I haven't actually tested this yet, so there may be further changes.
> 
> Thanks,
> NeilBrown
> 
> From 04b7dd7d0ad4a21622cad7c10821f914a8d9ccd3 Mon Sep 17 00:00:00 2001
> From: NeilBrown <neilb@suse.de>
> Date: Mon, 2 Jul 2012 12:14:49 +1000
> Subject: [PATCH] md/plug: disable preempt when reported a plug is present.
> 
> As 'schedule' will unplug a queue, a plug added by
> mddev_check_plugged is only valid until the next schedule().
> So call preempt_disable before installing the plug, and require the
> caller to call preempt_enable once the value has been used.
> 
> Signed-off-by: NeilBrown  <neilb@suse.de>
> 
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1369c9d..63ea6d6 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -512,6 +512,10 @@ static void plugger_unplug(struct blk_plug_cb *cb)
>  
>  /* Check that an unplug wakeup will come shortly.
>   * If not, wakeup the md thread immediately
> + * Note that the structure returned is only valid until
> + * the next schedule(), so preemption is disabled when it
> + * is not NULL, and must be re-enabled after the value
> + * has been used.
>   */
>  struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
>  				       md_unplug_func_t unplug, size_t size)
> @@ -522,6 +526,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
>  	if (!plug)
>  		return NULL;
>  
> +	preempt_disable();
>  	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
>  		if (mdcb->cb.callback == plugger_unplug &&
>  		    mdcb->mddev == mddev) {
> @@ -533,6 +538,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
>  			return mdcb;
>  		}
>  	}
> +	preempt_enable();

Preemption doesn't do the unplug, only an explicit yield (schedule()) does, so
I don't like this; just redoing mddev_check_plugged before checking the return
value is fine with me.

>  	/* Not currently on the callback list */
>  	if (size < sizeof(*mdcb))
>  		size = sizeof(*mdcb);
> @@ -540,6 +546,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
>  	if (!mdcb)
>  		return NULL;
>  
> +	preempt_disable();
>  	mdcb->mddev = mddev;
>  	mdcb->cb.callback = plugger_unplug;
>  	atomic_inc(&mddev->plug_cnt);
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index ebce488..2e19b68 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -883,7 +883,6 @@ static void make_request(struct mddev *mddev, struct bio * bio)
>  	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
>  	const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
>  	struct md_rdev *blocked_rdev;
> -	int plugged;
>  	int first_clone;
>  	int sectors_handled;
>  	int max_sectors;
> @@ -1034,8 +1033,6 @@ read_again:
>  	 * the bad blocks.  Each set of writes gets it's own r1bio
>  	 * with a set of bios attached.
>  	 */
> -	plugged = !!mddev_check_plugged(mddev, NULL, 0);
> -
>  	disks = conf->raid_disks * 2;
>   retry_write:
>  	blocked_rdev = NULL;
> @@ -1214,8 +1211,11 @@ read_again:
>  	/* In case raid1d snuck in to freeze_array */
>  	wake_up(&conf->wait_barrier);
>  
> -	if (do_sync || !bitmap || !plugged)
> +	if (do_sync ||
> +	    !mddev_check_plugged(mddev, NULL, 0))
>  		md_wakeup_thread(mddev->thread);

Do we really need to recheck here? It's just a wakeup.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-07-02  0:50   ` NeilBrown
@ 2012-07-02  3:16     ` Shaohua Li
  2012-07-02  7:39       ` NeilBrown
  0 siblings, 1 reply; 32+ messages in thread
From: Shaohua Li @ 2012-07-02  3:16 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Mon, Jul 02, 2012 at 10:50:46AM +1000, NeilBrown wrote:
> On Mon, 25 Jun 2012 15:24:50 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > Add a per-stripe lock to protect stripe specific data, like dev->read,
> > written, ... The purpose is to reduce lock contention of conf->device_lock.
> > 
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> 
> I had hoped to avoid having a per-stripe lock again, but it does look like it
> is needed.
> However I don't like the way you have split up these three patches - it makes
> them a little hard to review.
> 
> I would like to see one patch which converts the bi_phys_segments access to
> be atomic and also removes all the spin_lock calls that were just for
> protecting that.
> 
> Then another patch which adds the new stripe_lock, clearly documenting
> exactly what it protects (not just "like dev->read" but an explicit list)
> and also removes any spin_lock of device_lock that is no longer needed.
> 
> Then I could see what is being added and what is being removed all in the one
> patch and I can be sure that they balance.

reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
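
For reference, "convert the bi_phys_segments access to be atomic" comes down
to something like the sketch below; the helper names are illustrative and the
real patch may differ in detail:

/* raid5 overloads bio->bi_phys_segments as a reference count.  Accessing it
 * through atomic ops means device_lock no longer has to be held around every
 * increment/decrement.
 */
static inline void raid5_inc_bi_phys_segments(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	atomic_inc(segments);
}

static inline int raid5_dec_bi_phys_segments(struct bio *bio)
{
	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;

	/* return the remaining count so the caller knows when the bio is done */
	return atomic_dec_return(segments);
}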

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 08/10 v3] raid5: make_request use batch stripe release
  2012-07-02  2:59     ` Shaohua Li
@ 2012-07-02  5:07       ` NeilBrown
  0 siblings, 0 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  5:07 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 2745 bytes --]

On Mon, 2 Jul 2012 10:59:50 +0800 Shaohua Li <shli@kernel.org> wrote:


> > From 04b7dd7d0ad4a21622cad7c10821f914a8d9ccd3 Mon Sep 17 00:00:00 2001
> > From: NeilBrown <neilb@suse.de>
> > Date: Mon, 2 Jul 2012 12:14:49 +1000
> > Subject: [PATCH] md/plug: disable preempt when reported a plug is present.
> > 
> > As 'schedule' will unplug a queue, a plug added by
> > mddev_check_plugged is only valid until the next schedule().
> > So call preempt_disable before installing the plug, and require the
> > caller to call preempt_enable once the value has been used.
> > 
> > Signed-off-by: NeilBrown  <neilb@suse.de>
> > 
> > diff --git a/drivers/md/md.c b/drivers/md/md.c
> > index 1369c9d..63ea6d6 100644
> > --- a/drivers/md/md.c
> > +++ b/drivers/md/md.c
> > @@ -512,6 +512,10 @@ static void plugger_unplug(struct blk_plug_cb *cb)
> >  
> >  /* Check that an unplug wakeup will come shortly.
> >   * If not, wakeup the md thread immediately
> > + * Note that the structure returned is only valid until
> > + * the next schedule(), so preemption is disabled when it
> > + * is not NULL, and must be re-enabled after the value
> > + * has been used.
> >   */
> >  struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
> >  				       md_unplug_func_t unplug, size_t size)
> > @@ -522,6 +526,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
> >  	if (!plug)
> >  		return NULL;
> >  
> > +	preempt_disable();
> >  	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
> >  		if (mdcb->cb.callback == plugger_unplug &&
> >  		    mdcb->mddev == mddev) {
> > @@ -533,6 +538,7 @@ struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
> >  			return mdcb;
> >  		}
> >  	}
> > +	preempt_enable();
> 
> Preemption doesn't do the unplug, only an explicit yield (schedule()) does, so
> I don't like this; just redoing mddev_check_plugged before checking the return
> value is fine with me.

<digs through code...>

Ahh, I see.  preempt calls __schedule(), not schedule(), and schedule() does
the unplug before calling __schedule().
So you are right -  I'll remove that extra patch.
Thanks.



> > -	if (do_sync || !bitmap || !plugged)
> > +	if (do_sync ||
> > +	    !mddev_check_plugged(mddev, NULL, 0))
> >  		md_wakeup_thread(mddev->thread);
> 
> Do we really need to recheck here? It's just a wakeup.
>

This isn't a re-check.  This is the one-and-only check.
In the current code each request gets queued for raid1d and they all get
processed next time the thread wakes up.  So we would prefer to create the
plug and do the wakeup later.  But if creating the plug fails, we do the
wakeup immediately.

Once we are queueing the requests in the plug, this will change.

NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-07-02  3:16     ` Shaohua Li
@ 2012-07-02  7:39       ` NeilBrown
  2012-07-03  1:27         ` Shaohua Li
  2012-07-03 12:16         ` majianpeng
  0 siblings, 2 replies; 32+ messages in thread
From: NeilBrown @ 2012-07-02  7:39 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, dan.j.williams, shli

[-- Attachment #1: Type: text/plain, Size: 1928 bytes --]

On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli@kernel.org> wrote:


> > Then I could see what is being added and what is being removed all in the one
> > patch and I can be sure that they balance.
> 
> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.

Thanks.  That's looking really good.

However I think we can do better.  I've been looking more closely at the code
and I think that the only things that we need stripe_lock to protect are
->toread and ->towrite, together with the following bios.  e.g.
->toread->bi_next etc.

->read and ->written don't need stripe_lock protection, as they are only
manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
and refcounts for exclusion.

So add_stripe_bio needs to take the lock while adding a bio to the
->toread and ->towrite lists, and ops_run_biodrain() and ops_run_biofill
need to take the lock while they move the list from ->to{read,write} to
->{read,written}.  But we don't need it anywhere else.  e.g. analyse_stripe
shouldn't need the lock at all.  Any change that could happen during the loop
could equally happen after the lock was released so we don't lose by not
having the lock.

There is another current user of the lock, but I think that should be
discarded as a false optimisation.
We currently try to optimise out extra calls to bitmap_startwrite and
bitmap_endwrite when we see back-to-back writes to the one stripe.  However I
suspect that is extremely unlikely and it just imposes a pointless need for
synchronisation in raid5.

We just simply call bitmap_startwrite whenever ->towrite changes from NULL to
non-NULL, and call bitmap_endwrite whenever we clear ->written, regardless of
what value ->towrite now has.

Would you like to experiment with that?  If I haven't described it well
enough I can write a patch to show what I mean.
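
A rough sketch of that locking rule in code (simplified, with hypothetical
helper names; not the actual patch):

/* stripe_lock only guards ->toread/->towrite and the bio chains hanging off
 * them.  ->read and ->written stay unlocked because only the handle_stripe()
 * state machine, serialised by STRIPE_ACTIVE and the stripe refcount,
 * touches them.
 */
static void add_stripe_bio_sketch(struct stripe_head *sh, struct bio *bi,
				  int dd_idx, int forwrite)
{
	struct bio **bip;

	spin_lock_irq(&sh->stripe_lock);
	bip = forwrite ? &sh->dev[dd_idx].towrite : &sh->dev[dd_idx].toread;
	/* walk the sector-sorted chain to find the insertion point */
	while (*bip && (*bip)->bi_sector < bi->bi_sector)
		bip = &(*bip)->bi_next;
	bi->bi_next = *bip;
	*bip = bi;
	spin_unlock_irq(&sh->stripe_lock);
}

static void ops_run_biodrain_sketch(struct stripe_head *sh, int dd_idx)
{
	struct r5dev *dev = &sh->dev[dd_idx];
	struct bio *wbi;

	spin_lock_irq(&sh->stripe_lock);
	wbi = dev->towrite;			/* detach under the lock */
	dev->towrite = NULL;
	spin_unlock_irq(&sh->stripe_lock);

	dev->written = wbi;	/* no lock: only handle_stripe touches this */
	/* ... copy data from the bios into the stripe cache here ... */
}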

Thanks,
NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 10/10 v3] raid5: create multiple threads to handle stripes
  2012-06-25  7:24 ` [patch 10/10 v3] raid5: create multiple threads to handle stripes Shaohua Li
  2012-07-02  2:39   ` NeilBrown
@ 2012-07-02 20:03   ` Dan Williams
  2012-07-03  8:04     ` Shaohua Li
  1 sibling, 1 reply; 32+ messages in thread
From: Dan Williams @ 2012-07-02 20:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe, shli

On Mon, Jun 25, 2012 at 12:24 AM, Shaohua Li <shli@kernel.org> wrote:
> Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
> thread becomes a bottleneck. raid5 can offload calculation like checksum to
> async threads. And if storage is fast, scheduling async work and running async
> work will introduce heavy lock contention of workqueue, which makes such
> optimization useless. And calculation isn't the only bottleneck. For example,
> in my test raid5 thread must handle > 450k requests per second. Just doing
> dispatch and completion will make raid5 thread incapable. The only chance to
> scale is using several threads to handle stripe.
>
> With this patch, the user can create several extra threads to handle stripes.
> How many threads work best depends on the disk number, so the thread number
> can be changed from userspace. By default, the thread number is 0, which means
> no extra threads.
>
> In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
> improvement (double stripe_cache_size) and the throughput is pretty close to
> theory value. With >=4 disks, the improvement is even bigger, for example, can
> improve 200% for 4-disk setup, but the throughput is far less than theory
> value, which is caused by several factors like request queue lock contention,
> cache issue, latency introduced by how a stripe is handled in different disks.
> Those factors need further investigations.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>

Can you share a bit more about your test setup?  Is this
single-threaded throughput?  I'm wondering if we can take advantage of
keeping the work cpu local.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-07-02  7:39       ` NeilBrown
@ 2012-07-03  1:27         ` Shaohua Li
  2012-07-03 12:16         ` majianpeng
  1 sibling, 0 replies; 32+ messages in thread
From: Shaohua Li @ 2012-07-03  1:27 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, dan.j.williams, shli

On Mon, Jul 02, 2012 at 05:39:53PM +1000, NeilBrown wrote:
> On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> 
> > > Then I could see what is being added and what is being removed all in the one
> > > patch and I can be sure that they balance.
> > 
> > reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
> 
> Thanks.  That's looking really good.
> 
> However I think we can do better.  I've been looking more closely at the code
> and I think that the only things that we need stripe_lock to protect are
> ->toread and ->towrite, together with the following bios.  e.g.
> ->toread->bi_next etc.
> 
> ->read and ->written don't need stripe_lock protection, as they are only
> manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
> and refcounts for exclusion.
> 
> So add_stripe_bio needs to take the lock while adding a bio to the
> ->toread and ->towrite lists, and ops_run_biodrain() and ops_run_biofill
> need to take the lock while they move the list from ->to{read,write} to
> ->{read,written}.  But we don't need it anywhere else.  e.g. analyse_stripe
> shouldn't need the lock at all.  Any change that could happen during the loop
> could equally happen after the lock was released so we don't lose by not
> having the lock.

Looks reasonable.

> There is another current user of the lock, but I think that should be
> discarded as a false optimisation.
> We currently try to optimise out extra calls to bitmap_startwrite and
> bitmap_endwrite when we see back-to-back writes to the one stripe.  However I
> suspect that is extremely unlikely and it just imposes a pointless need for
> synchronisation in raid5.
> 
> We just simply call bitmap_startwrite whenever ->towrite changes from NULL to
> non-NULL, and call bitmap_endwrite whenever we clear ->written, regardless of
> what value ->towrite now has.

Yep, the chance of back-to-back writes to the same stripe should be very low.
I'll try it in my stress test; if there are no failures, I'll post the patches.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 10/10 v3] raid5: create multiple threads to handle stripes
  2012-07-02 20:03   ` Dan Williams
@ 2012-07-03  8:04     ` Shaohua Li
  0 siblings, 0 replies; 32+ messages in thread
From: Shaohua Li @ 2012-07-03  8:04 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, neilb, axboe, shli

On Mon, Jul 02, 2012 at 01:03:53PM -0700, Dan Williams wrote:
> On Mon, Jun 25, 2012 at 12:24 AM, Shaohua Li <shli@kernel.org> wrote:
> > Like raid 1/10, raid5 uses one thread to handle stripe. In a fast storage, the
> > thread becomes a bottleneck. raid5 can offload calculation like checksum to
> > async threads. And if storage is fast, scheduling async work and running async
> > work will introduce heavy lock contention of workqueue, which makes such
> > optimization useless. And calculation isn't the only bottleneck. For example,
> > in my test raid5 thread must handle > 450k requests per second. Just doing
> > dispatch and completion will make raid5 thread incapable. The only chance to
> > scale is using several threads to handle stripe.
> >
> > With this patch, the user can create several extra threads to handle stripes.
> > How many threads work best depends on the disk number, so the thread number
> > can be changed from userspace. By default, the thread number is 0, which means
> > no extra threads.
> >
> > In a 3-disk raid5 setup, 2 extra threads can provide 130% throughput
> > improvement (double stripe_cache_size) and the throughput is pretty close to
> > theory value. With >=4 disks, the improvement is even bigger, for example, can
> > improve 200% for 4-disk setup, but the throughput is far less than theory
> > value, which is caused by several factors like request queue lock contention,
> > cache issue, latency introduced by how a stripe is handled in different disks.
> > Those factors need further investigations.
> >
> > Signed-off-by: Shaohua Li <shli@fusionio.com>
> 
> Can you share a bit more about your test setup?  Is this
> single-threaded throughput?  I'm wondering if we can take advantage of
> keeping the work cpu local.

I use a 4-thread DIO randwrite 4k test. I'm thinking about keeping the work
cpu-local too; that requires separating the stripe lru list. Async ops and
request completion also need to be cpu-local aware to make this work well.

When I tried the unbound workqueue, I set max_active to the number of online
cpus, so there were 24 worker threads. But even 24 threads are too many.
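
For reference, the workload above is roughly the sketch below.  Everything
except "4 threads, O_DIRECT, 4k random writes" (the device path, request count
and so on) is made up for illustration:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK_SZ	 4096
#define NTHREADS 4

static void *dio_randwrite(void *arg)
{
	const char *dev = arg;
	int fd = open(dev, O_WRONLY | O_DIRECT);
	off_t blocks;
	void *buf;
	long i;

	if (fd < 0)
		return NULL;
	blocks = lseek(fd, 0, SEEK_END) / BLK_SZ;
	if (posix_memalign(&buf, BLK_SZ, BLK_SZ))
		return NULL;
	memset(buf, 0xab, BLK_SZ);

	for (i = 0; i < 1000000; i++) {
		off_t off = (random() % blocks) * BLK_SZ;
		pwrite(fd, buf, BLK_SZ, off);	/* 4k direct random write */
	}
	free(buf);
	close(fd);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	int i;

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, dio_randwrite, (void *)"/dev/md0");
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}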

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-07-02  7:39       ` NeilBrown
  2012-07-03  1:27         ` Shaohua Li
@ 2012-07-03 12:16         ` majianpeng
  2012-07-03 23:56           ` NeilBrown
  1 sibling, 1 reply; 32+ messages in thread
From: majianpeng @ 2012-07-03 12:16 UTC (permalink / raw)
  To: Neil Brown, shli; +Cc: linux-raid, axboe, Dan Williams

On 2012-07-02 15:39 NeilBrown <neilb@suse.de> Wrote:
>On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>
>> > Then I could see what is being added and what is being removed all in the one
>> > patch and I can be sure that they balance.
>> 
>> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
>
>Thanks.  That's looking really good.
>
>However I think we can do better.  I've been looking more closely at the code
>and I think that the only things that we need stripe_lock to protect are
>->toread and ->towrite, together with the following bios.  e.g.
>->toread->bi_next etc.
>
>->read and ->written don't need stripe_lock protection, as they are only
>manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
>and refcounts for exclusion.
>
>So add_stripe_bio needs to take the lock while adding a bio to the
>->toread and ->towrite lists, and ops_run_biodrain() and ops_run_biofill
>need to take the lock while they move the list from ->to{read,write} to
>->{read,written}.  
How about xchg()?
>But we don't need it anywhere else.  e.g. analyse_stripe
>shouldn't need the lock at all.  Any change that could happen during the loop
>could equally happen after the lock was released so we don't lose by not
>having the lock.
>
>There is another current user of the lock, but I think that should be
>discarded as a false optimisation.
>We currently try to optimise out extra calls to bitmap_startwrite and
>bitmap_endwrite when we see back-to-back writes to the one stripe.  However I
>suspect that is extremely unlikely and it just imposes a pointless need for
>synchronisation in raid5.
>
>We just simply call bitmap_startwrite whenever ->towrite changes from NULL to
>non-NULL, and call bitmap_endwrite whenever we clear ->written, regardless of
>what value ->towrite now has.
>
>Would you like to experiment with that?  If I haven't described it well
>enough I can write a patch to show what I mean.
>
>Thanks,
>NeilBrown
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-07-03 12:16         ` majianpeng
@ 2012-07-03 23:56           ` NeilBrown
  2012-07-04  1:09             ` majianpeng
  0 siblings, 1 reply; 32+ messages in thread
From: NeilBrown @ 2012-07-03 23:56 UTC (permalink / raw)
  To: majianpeng; +Cc: shli, linux-raid, axboe, Dan Williams

[-- Attachment #1: Type: text/plain, Size: 1504 bytes --]

On Tue, 3 Jul 2012 20:16:31 +0800 majianpeng <majianpeng@gmail.com> wrote:

> On 2012-07-02 15:39 NeilBrown <neilb@suse.de> Wrote:
> >On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli@kernel.org> wrote:
> >
> >
> >> > Then I could see what is being added and what is being removed all in the one
> >> > patch and I can be sure that they balance.
> >> 
> >> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
> >
> >Thanks.  That's looking really good.
> >
> >However I think we can do better.  I've been looking more closely at the code
> >and I think that the only things that we need stripe_lock to protect are
> >->toread and ->towrite, together with the following bios.  e.g.
> >->toread->bi_next etc.
> >
> >->read and ->written don't need stripe_lock protection, as they are only
> >manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
> >and refcounts for exclusion.
> >
> >So add_stripe_bio needs to take the lock while adding a bio to the
> >->toread and ->towrite lists, and ops_run_biodrain() and ops_run_biofill
> >need to take the lock while they move the list from ->to{read,write} to
> >->{read,written}.  

> How about xchg()?

(it would help a little if you made your comments stand out more, and maybe
be a bit more verbose).

bio_add_stripe() can insert a new bio into a list that starts at ->towrite
or ->toread.
You cannot do that with a simple xchg.  You really need a lock.
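
To spell that out with a fragment (simplified; not actual patch code):

	/* What a bare xchg() can give you is an atomic push onto the head: */
	bi->bi_next = xchg(&sh->dev[dd_idx].towrite, bi);

	/* But add_stripe_bio() has to walk the chain (to keep it sorted by
	 * sector and to detect overlap) and then splice the bio into the
	 * middle, i.e. update the predecessor's bi_next and bi->bi_next
	 * consistently.  No single atomic exchange covers that, so the
	 * per-stripe spinlock stays. */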

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Re: [patch 03/10 v3] raid5: add a per-stripe lock
  2012-07-03 23:56           ` NeilBrown
@ 2012-07-04  1:09             ` majianpeng
  0 siblings, 0 replies; 32+ messages in thread
From: majianpeng @ 2012-07-04  1:09 UTC (permalink / raw)
  To: Neil Brown; +Cc: shli, linux-raid, axboe, Dan Williams

On 2012-07-04 07:56 NeilBrown <neilb@suse.de> Wrote:
>On Tue, 3 Jul 2012 20:16:31 +0800 majianpeng <majianpeng@gmail.com> wrote:
>
>> On 2012-07-02 15:39 NeilBrown <neilb@suse.de> Wrote:
>> >On Mon, 2 Jul 2012 11:16:26 +0800 Shaohua Li <shli@kernel.org> wrote:
>> >
>> >
>> >> > Then I could see what is being added and what is being removed all in the one
>> >> > patch and I can be sure that they balance.
>> >> 
>> >> reworked the patch 3-5 to two patches as you suggested, and sent to you. please check.
>> >
>> >Thanks.  That's looking really good.
>> >
>> >However I think we can do better.  I've been looking more closely at the code
>> >and I think that the only things that we need stripe_lock to protect are
>> >->toread and ->towrite, together with the following bios.  e.g.
>> >->toread->bi_next etc.
>> >
>> >->read and ->written don't need stripe_lock protection, as they are only
>> >manipulated by the handle_stripe state machine which uses STRIPE_ACTIVE 
>> >and refcounts for exclusion.
>> >
>> >So add_stripe_bio needs to take the lock while adding a bio to the
>> >->toread and ->towrite lists, and ops_run_biodrain() and ops_run_biofill
>> >need to take the lock while they move the list from ->to{read,write} to
>> >->{read,written}.  
>
>> How about xchg()?
>
>(it would help a little if you made your comments stand out more, and maybe
>be a bit more verbose).
>
>bio_add_stripe() can insert a new bio into a list that starts at ->towrite
>or ->toread.
>You cannot do that with a simple xchg.  You really need a lock.
>
>NeilBrown
>
>
Sorry, I missed that about xchg.
But when R5_Wantdrain/R5_Wantfill is set on a dev/stripe, can a bio still be added to that dev/stripe?

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-07-04  1:09 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-25  7:24 [patch 00/10 v3] raid5: improve write performance for fast storage Shaohua Li
2012-06-25  7:24 ` [patch 01/10 v3] raid5: use wake_up_all for overlap waking Shaohua Li
2012-06-28  7:26   ` NeilBrown
2012-06-28  8:53     ` Shaohua Li
2012-06-25  7:24 ` [patch 02/10 v3] raid5: delayed stripe fix Shaohua Li
2012-07-02  0:46   ` NeilBrown
2012-07-02  0:49     ` Shaohua Li
2012-07-02  0:55       ` NeilBrown
2012-06-25  7:24 ` [patch 03/10 v3] raid5: add a per-stripe lock Shaohua Li
2012-07-02  0:50   ` NeilBrown
2012-07-02  3:16     ` Shaohua Li
2012-07-02  7:39       ` NeilBrown
2012-07-03  1:27         ` Shaohua Li
2012-07-03 12:16         ` majianpeng
2012-07-03 23:56           ` NeilBrown
2012-07-04  1:09             ` majianpeng
2012-06-25  7:24 ` [patch 04/10 v3] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
2012-06-25  7:24 ` [patch 05/10 v3] raid5: remove some device_lock locking places Shaohua Li
2012-06-25  7:24 ` [patch 06/10 v3] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
2012-07-02  0:57   ` NeilBrown
2012-06-25  7:24 ` [patch 07/10 v3] md: personality can provide unplug private data Shaohua Li
2012-07-02  1:06   ` NeilBrown
2012-06-25  7:24 ` [patch 08/10 v3] raid5: make_request use batch stripe release Shaohua Li
2012-07-02  2:31   ` NeilBrown
2012-07-02  2:59     ` Shaohua Li
2012-07-02  5:07       ` NeilBrown
2012-06-25  7:24 ` [patch 09/10 v3] raid5: raid5d handle stripe in batch way Shaohua Li
2012-07-02  2:32   ` NeilBrown
2012-06-25  7:24 ` [patch 10/10 v3] raid5: create multiple threads to handle stripes Shaohua Li
2012-07-02  2:39   ` NeilBrown
2012-07-02 20:03   ` Dan Williams
2012-07-03  8:04     ` Shaohua Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.