* [patch 0/9 v2] raid5: improve write performance for fast storage
@ 2012-06-19  1:59 Shaohua Li
  2012-06-19  1:59 ` [patch 1/9 v2] raid5: use wake_up_all for overlap waking Shaohua Li
                   ` (8 more replies)
  0 siblings, 9 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

Hi,

Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, that
thread becomes a bottleneck. raid5 can offload calculations like checksum to
async threads, but if the storage is fast, scheduling and running the async
work introduces heavy lock contention in the workqueue, which makes that
optimization useless. And calculation isn't the only bottleneck. For example,
in my test the raid5 thread must handle > 450k requests per second; just doing
dispatch and completion is already more than one thread can keep up with. The
only way to scale is to use several threads to handle stripes.

Simply using several threads doesn't work. conf->device_lock is a global lock
which is heavily contended. Patches 2-8 in the set try to address this
problem. With them, when several threads are handling stripes, device_lock is
still contended but takes much less CPU time and is no longer the heaviest
lock. Even if the 9th patch isn't accepted, patches 2-8 look good to merge.

I did a stress test (block size range 1k - 64k with a small total size, so
overlap/stripe sharing is guaranteed) with the patches and it looks fine,
except for a wake_up issue which I fixed in the first patch. That issue isn't
related to this series, but I need the fix for the stress test.

With the locking issue solved (at least largely), switching stripe handling to
multiple threads is trivial.

Threads are still created in advance (the default thread number is the disk
number) and can be reconfigured by the user. Automatically creating and reaping
threads would be great, but I'm worried about NUMA binding.

In a 3-disk raid5 setup, 2 extra threads can provide a 130% throughput
improvement (with double stripe_cache_size) and the throughput is pretty close
to the theoretical value. With >= 4 disks the improvement is even bigger, for
example 200% for a 4-disk setup, but the throughput is far below the
theoretical value. This is caused by several factors such as request queue lock
contention, cache issues, and the latency introduced by how a stripe is handled
across the different disks. Those factors need further investigation.

V1->V2:
1. fixed several issues pointed out by Neil and Dan.
2. fixed a wake_up issue.

Thanks,
Shaohua


* [patch 1/9 v2] raid5: use wake_up_all for overlap waking
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-19  1:59 ` [patch 2/9 v2] raid5: add a per-stripe lock Shaohua Li
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-wakeupall.patch --]
[-- Type: text/plain, Size: 3879 bytes --]

It's possible that several tasks are waiting for a stripe overlap. We clear the
R5_Overlap bit and call wake_up, but wake_up only wakes one task. So if there
are several tasks in the wait queue, some of them will not be woken up even
though their stripe's R5_Overlap bit is clear. The end result is tasks hanging
in make_request.

wake_up_all should not introduce a performance issue here, since the overlap
case is rare.
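
For context, the waiter side in make_request() looks roughly like this
(paraphrased and simplified from the raid5.c of that time; not part of this
patch). A task that misses its wakeup stays parked in schedule() here, which
is the hang described above:

	DEFINE_WAIT(w);
retry:
	prepare_to_wait(&conf->wait_for_overlap, &w, TASK_UNINTERRUPTIBLE);
	sh = get_active_stripe(conf, new_sector, previous, 0, 0);
	if (sh) {
		if (!add_stripe_bio(sh, bi, dd_idx, rw == WRITE)) {
			/* overlap: add_stripe_bio() set R5_Overlap on the
			 * dev; sleep until the conflicting I/O finishes and
			 * somebody wakes wait_for_overlap */
			md_wakeup_thread(mddev->thread);
			release_stripe(sh);
			schedule();
			goto retry;
		}
		finish_wait(&conf->wait_for_overlap, &w);
	}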

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-19 08:11:10.021688417 +0800
+++ linux/drivers/md/raid5.c	2012-06-19 08:11:29.833439339 +0800
@@ -1399,7 +1399,7 @@ static void __raid_run_ops(struct stripe
 		for (i = disks; i--; ) {
 			struct r5dev *dev = &sh->dev[i];
 			if (test_and_clear_bit(R5_Overlap, &dev->flags))
-				wake_up(&sh->raid_conf->wait_for_overlap);
+				wake_up_all(&sh->raid_conf->wait_for_overlap);
 		}
 	put_cpu();
 }
@@ -2436,7 +2436,7 @@ handle_failed_stripe(struct r5conf *conf
 		}
 
 		if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-			wake_up(&conf->wait_for_overlap);
+			wake_up_all(&conf->wait_for_overlap);
 
 		while (bi && bi->bi_sector <
 			sh->dev[i].sector + STRIPE_SECTORS) {
@@ -2474,7 +2474,7 @@ handle_failed_stripe(struct r5conf *conf
 			bi = sh->dev[i].toread;
 			sh->dev[i].toread = NULL;
 			if (test_and_clear_bit(R5_Overlap, &sh->dev[i].flags))
-				wake_up(&conf->wait_for_overlap);
+				wake_up_all(&conf->wait_for_overlap);
 			if (bi) s->to_read--;
 			while (bi && bi->bi_sector <
 			       sh->dev[i].sector + STRIPE_SECTORS) {
@@ -3572,7 +3572,7 @@ static void handle_stripe(struct stripe_
 	} else if (s.expanded && !sh->reconstruct_state && s.locked == 0) {
 		clear_bit(STRIPE_EXPAND_READY, &sh->state);
 		atomic_dec(&conf->reshape_stripes);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
 	}
 
@@ -4249,7 +4249,7 @@ static sector_t reshape_request(struct m
 		spin_lock_irq(&conf->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
 
@@ -4340,7 +4340,7 @@ static sector_t reshape_request(struct m
 		spin_lock_irq(&conf->device_lock);
 		conf->reshape_safe = mddev->reshape_position;
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		sysfs_notify(&mddev->kobj, NULL, "sync_completed");
 	}
 	return reshape_sectors;
@@ -5718,7 +5718,7 @@ static void end_reshape(struct r5conf *c
 		smp_wmb();
 		conf->reshape_progress = MaxSector;
 		spin_unlock_irq(&conf->device_lock);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 
 		/* read-ahead size must cover two whole stripes, which is
 		 * 2 * (datadisks) * chunksize where 'n' is the number of raid devices
@@ -5776,7 +5776,7 @@ static void raid5_quiesce(struct mddev *
 
 	switch(state) {
 	case 2: /* resume for a suspend */
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		break;
 
 	case 1: /* stop all writes */
@@ -5792,14 +5792,14 @@ static void raid5_quiesce(struct mddev *
 		conf->quiesce = 1;
 		spin_unlock_irq(&conf->device_lock);
 		/* allow reshape to continue */
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		break;
 
 	case 0: /* re-enable writes */
 		spin_lock_irq(&conf->device_lock);
 		conf->quiesce = 0;
 		wake_up(&conf->wait_for_stripe);
-		wake_up(&conf->wait_for_overlap);
+		wake_up_all(&conf->wait_for_overlap);
 		spin_unlock_irq(&conf->device_lock);
 		break;
 	}



* [patch 2/9 v2] raid5: add a per-stripe lock
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
  2012-06-19  1:59 ` [patch 1/9 v2] raid5: use wake_up_all for overlap waking Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-19  1:59 ` [patch 3/9 v2] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-add-perstripe-lock.patch --]
[-- Type: text/plain, Size: 4862 bytes --]

Add a per-stripe lock to protect stripe-specific data, such as dev->read,
dev->written, etc. The purpose is to reduce lock contention on
conf->device_lock.
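
The new lock nests inside conf->device_lock for now, so during the conversion
the per-stripe bio lists stay covered by both locks; patch 4 then drops
device_lock from the sites that only touch per-stripe data. In sketch form
(illustration only, the real hunks are below):

	/* patch 2 (transitional): both locks held around per-stripe data */
	spin_lock_irq(&conf->device_lock);
	spin_lock(&sh->stripe_lock);
	/* ... dev->toread, dev->towrite, dev->written, dev->read ... */
	spin_unlock(&sh->stripe_lock);
	spin_unlock_irq(&conf->device_lock);

	/* patch 4 (final form): stripe_lock alone serializes these fields */
	spin_lock_irq(&sh->stripe_lock);
	/* ... */
	spin_unlock_irq(&sh->stripe_lock);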

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   17 +++++++++++++++++
 drivers/md/raid5.h |    1 +
 2 files changed, 18 insertions(+)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-06 09:35:56.338257401 +0800
+++ linux/drivers/md/raid5.c	2012-06-13 08:37:43.005641057 +0800
@@ -749,6 +749,7 @@ static void ops_complete_biofill(void *s
 
 	/* clear completed biofills */
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
@@ -774,6 +775,7 @@ static void ops_complete_biofill(void *s
 			}
 		}
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
@@ -798,8 +800,10 @@ static void ops_run_biofill(struct strip
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
 			spin_lock_irq(&conf->device_lock);
+			spin_lock(&sh->stripe_lock);
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
+			spin_unlock(&sh->stripe_lock);
 			spin_unlock_irq(&conf->device_lock);
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
@@ -1137,10 +1141,12 @@ ops_run_biodrain(struct stripe_head *sh,
 			struct bio *wbi;
 
 			spin_lock_irq(&sh->raid_conf->device_lock);
+			spin_lock(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
+			spin_unlock(&sh->stripe_lock);
 			spin_unlock_irq(&sh->raid_conf->device_lock);
 
 			while (wbi && wbi->bi_sector <
@@ -1446,6 +1452,8 @@ static int grow_one_stripe(struct r5conf
 	init_waitqueue_head(&sh->ops.wait_for_ops);
 	#endif
 
+	spin_lock_init(&sh->stripe_lock);
+
 	if (grow_buffers(sh)) {
 		shrink_buffers(sh);
 		kmem_cache_free(conf->slab_cache, sh);
@@ -2327,6 +2335,7 @@ static int add_stripe_bio(struct stripe_
 
 
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
@@ -2360,6 +2369,7 @@ static int add_stripe_bio(struct stripe_
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
@@ -2376,6 +2386,7 @@ static int add_stripe_bio(struct stripe_
 
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	return 0;
 }
@@ -2427,6 +2438,7 @@ handle_failed_stripe(struct r5conf *conf
 			}
 		}
 		spin_lock_irq(&conf->device_lock);
+		spin_lock(&sh->stripe_lock);
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
@@ -2488,6 +2500,7 @@ handle_failed_stripe(struct r5conf *conf
 				bi = nextbi;
 			}
 		}
+		spin_unlock(&sh->stripe_lock);
 		spin_unlock_irq(&conf->device_lock);
 		if (bitmap_end)
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
@@ -2695,6 +2708,7 @@ static void handle_stripe_clean_event(st
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
 				spin_lock_irq(&conf->device_lock);
+				spin_lock(&sh->stripe_lock);
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_sector <
@@ -2709,6 +2723,7 @@ static void handle_stripe_clean_event(st
 				}
 				if (dev->towrite == NULL)
 					bitmap_end = 1;
+				spin_unlock(&sh->stripe_lock);
 				spin_unlock_irq(&conf->device_lock);
 				if (bitmap_end)
 					bitmap_endwrite(conf->mddev->bitmap,
@@ -3168,6 +3183,7 @@ static void analyse_stripe(struct stripe
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
 	spin_lock_irq(&conf->device_lock);
+	spin_lock(&sh->stripe_lock);
 	for (i=disks; i--; ) {
 		struct md_rdev *rdev;
 		sector_t first_bad;
@@ -3313,6 +3329,7 @@ static void analyse_stripe(struct stripe
 				do_recovery = 1;
 		}
 	}
+	spin_unlock(&sh->stripe_lock);
 	spin_unlock_irq(&conf->device_lock);
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-06 09:35:56.350257250 +0800
+++ linux/drivers/md/raid5.h	2012-06-13 08:34:13.660272415 +0800
@@ -210,6 +210,7 @@ struct stripe_head {
 	int			disks;		/* disks in stripe */
 	enum check_states	check_state;
 	enum reconstruct_states reconstruct_state;
+	spinlock_t		stripe_lock;
 	/**
 	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target



* [patch 3/9 v2] raid5: lockless access raid5 overrided bi_phys_segments
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
  2012-06-19  1:59 ` [patch 1/9 v2] raid5: use wake_up_all for overlap waking Shaohua Li
  2012-06-19  1:59 ` [patch 2/9 v2] raid5: add a per-stripe lock Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-21  1:56   ` Dan Williams
  2012-06-19  1:59 ` [patch 4/9 v2] raid5: remove some device_lock locking places Shaohua Li
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-atomic-segment-accounting.patch --]
[-- Type: text/plain, Size: 6243 bytes --]

Raid5 overloads bio->bi_phys_segments and accesses it with device_lock held,
which is unnecessary. We can actually make the accesses lockless.
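
The conversion treats the overloaded field as a single 32-bit atomic holding
two 16-bit counters: the active-stripe count in the low half and the
processed-stripe count in the high half. As a minimal standalone sketch of that
packing (C11 atomics; the names are illustrative, the actual kernel helpers are
in the diff below):

#include <stdatomic.h>

/* low 16 bits: active stripes; high 16 bits: processed stripes */
static inline int active_stripes(atomic_uint *seg)
{
	return atomic_load(seg) & 0xffff;
}

static inline int processed_stripes(atomic_uint *seg)
{
	return (atomic_load(seg) >> 16) & 0xffff;
}

static inline void inc_active(atomic_uint *seg)
{
	atomic_fetch_add(seg, 1);	/* counts stay far below 0xffff */
}

static inline int dec_active(atomic_uint *seg)
{
	return (atomic_fetch_sub(seg, 1) - 1) & 0xffff;
}

/* replace the high half without losing concurrent updates to the low half */
static inline void set_processed(atomic_uint *seg, unsigned int cnt)
{
	unsigned int old = atomic_load(seg);

	while (!atomic_compare_exchange_weak(seg, &old,
					     (old & 0xffff) | (cnt << 16)))
		;	/* old is refreshed on failure; retry */
}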

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   58 +++++++++++++++++++++++++++++------------------------
 1 file changed, 32 insertions(+), 26 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-08 11:48:25.652618012 +0800
+++ linux/drivers/md/raid5.c	2012-06-08 13:15:31.458919391 +0800
@@ -99,34 +99,40 @@ static inline struct bio *r5_next_bio(st
  * We maintain a biased count of active stripes in the bottom 16 bits of
  * bi_phys_segments, and a count of processed stripes in the upper 16 bits
  */
-static inline int raid5_bi_phys_segments(struct bio *bio)
+static inline int raid5_bi_processed_stripes(struct bio *bio)
 {
-	return bio->bi_phys_segments & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return (atomic_read(segments) >> 16) & 0xffff;
 }
 
-static inline int raid5_bi_hw_segments(struct bio *bio)
+static inline int raid5_dec_bi_active_stripes(struct bio *bio)
 {
-	return (bio->bi_phys_segments >> 16) & 0xffff;
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	return atomic_sub_return(1, segments) & 0xffff;
 }
 
-static inline int raid5_dec_bi_phys_segments(struct bio *bio)
+static inline void raid5_inc_bi_active_stripes(struct bio *bio)
 {
-	--bio->bi_phys_segments;
-	return raid5_bi_phys_segments(bio);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	atomic_inc(segments);
 }
 
-static inline int raid5_dec_bi_hw_segments(struct bio *bio)
+static inline void raid5_set_bi_processed_stripes(struct bio *bio,
+	unsigned int cnt)
 {
-	unsigned short val = raid5_bi_hw_segments(bio);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	int old, new;
 
-	--val;
-	bio->bi_phys_segments = (val << 16) | raid5_bi_phys_segments(bio);
-	return val;
+	do {
+		old = atomic_read(segments);
+		new = (old & 0xffff) | (cnt << 16);
+	} while (atomic_cmpxchg(segments, old, new) != old);
 }
 
-static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
+static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
 {
-	bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
+	atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
+	atomic_set(segments, cnt);
 }
 
 /* Find first data disk in a raid6 stripe */
@@ -767,7 +773,7 @@ static void ops_complete_biofill(void *s
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				rbi2 = r5_next_bio(rbi, dev->sector);
-				if (!raid5_dec_bi_phys_segments(rbi)) {
+				if (!raid5_dec_bi_active_stripes(rbi)) {
 					rbi->bi_next = return_bi;
 					return_bi = rbi;
 				}
@@ -2354,7 +2360,7 @@ static int add_stripe_bio(struct stripe_
 	if (*bip)
 		bi->bi_next = *bip;
 	*bip = bi;
-	bi->bi_phys_segments++;
+	raid5_inc_bi_active_stripes(bi);
 
 	if (forwrite) {
 		/* check if page is covered */
@@ -2454,7 +2460,7 @@ handle_failed_stripe(struct r5conf *conf
 			sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *nextbi = r5_next_bio(bi, sh->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
-			if (!raid5_dec_bi_phys_segments(bi)) {
+			if (!raid5_dec_bi_active_stripes(bi)) {
 				md_write_end(conf->mddev);
 				bi->bi_next = *return_bi;
 				*return_bi = bi;
@@ -2469,7 +2475,7 @@ handle_failed_stripe(struct r5conf *conf
 		       sh->dev[i].sector + STRIPE_SECTORS) {
 			struct bio *bi2 = r5_next_bio(bi, sh->dev[i].sector);
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
-			if (!raid5_dec_bi_phys_segments(bi)) {
+			if (!raid5_dec_bi_active_stripes(bi)) {
 				md_write_end(conf->mddev);
 				bi->bi_next = *return_bi;
 				*return_bi = bi;
@@ -2493,7 +2499,7 @@ handle_failed_stripe(struct r5conf *conf
 				struct bio *nextbi =
 					r5_next_bio(bi, sh->dev[i].sector);
 				clear_bit(BIO_UPTODATE, &bi->bi_flags);
-				if (!raid5_dec_bi_phys_segments(bi)) {
+				if (!raid5_dec_bi_active_stripes(bi)) {
 					bi->bi_next = *return_bi;
 					*return_bi = bi;
 				}
@@ -2714,7 +2720,7 @@ static void handle_stripe_clean_event(st
 				while (wbi && wbi->bi_sector <
 					dev->sector + STRIPE_SECTORS) {
 					wbi2 = r5_next_bio(wbi, dev->sector);
-					if (!raid5_dec_bi_phys_segments(wbi)) {
+					if (!raid5_dec_bi_active_stripes(wbi)) {
 						md_write_end(conf->mddev);
 						wbi->bi_next = *return_bi;
 						*return_bi = wbi;
@@ -3783,7 +3789,7 @@ static struct bio *remove_bio_from_retry
 		 * this sets the active strip count to 1 and the processed
 		 * strip count to zero (upper 8 bits)
 		 */
-		bi->bi_phys_segments = 1; /* biased count of active stripes */
+		raid5_set_bi_stripes(bi, 1); /* biased count of active stripes */
 	}
 
 	return bi;
@@ -4122,7 +4128,7 @@ static void make_request(struct mddev *m
 		md_wakeup_thread(mddev->thread);
 
 	spin_lock_irq(&conf->device_lock);
-	remaining = raid5_dec_bi_phys_segments(bi);
+	remaining = raid5_dec_bi_active_stripes(bi);
 	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0) {
 
@@ -4479,7 +4485,7 @@ static int  retry_aligned_read(struct r5
 		     sector += STRIPE_SECTORS,
 		     scnt++) {
 
-		if (scnt < raid5_bi_hw_segments(raid_bio))
+		if (scnt < raid5_bi_processed_stripes(raid_bio))
 			/* already done this stripe */
 			continue;
 
@@ -4487,14 +4493,14 @@ static int  retry_aligned_read(struct r5
 
 		if (!sh) {
 			/* failed to get a stripe - must wait */
-			raid5_set_bi_hw_segments(raid_bio, scnt);
+			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
 
 		if (!add_stripe_bio(sh, raid_bio, dd_idx, 0)) {
 			release_stripe(sh);
-			raid5_set_bi_hw_segments(raid_bio, scnt);
+			raid5_set_bi_processed_stripes(raid_bio, scnt);
 			conf->retry_read_aligned = raid_bio;
 			return handled;
 		}
@@ -4504,7 +4510,7 @@ static int  retry_aligned_read(struct r5
 		handled++;
 	}
 	spin_lock_irq(&conf->device_lock);
-	remaining = raid5_dec_bi_phys_segments(raid_bio);
+	remaining = raid5_dec_bi_active_stripes(raid_bio);
 	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0)
 		bio_endio(raid_bio, 0);



* [patch 4/9 v2] raid5: remove some device_lock locking places
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
                   ` (2 preceding siblings ...)
  2012-06-19  1:59 ` [patch 3/9 v2] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-19  1:59 ` [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-remove-some-lock.patch --]
[-- Type: text/plain, Size: 7128 bytes --]

With the per-stripe lock in place and bi_phys_segments lockless, we can safely
remove device_lock from several places.

The stripe ->read, ->toread, ... lists are protected by the per-stripe lock;
access to a stripe's bio lists is always serialized by this lock. If a bio on a
->read, ->toread, ... list is shared by multiple stripes, there are two
protections:
1. bi_phys_segments acts as a reference count
2. list traversal uses r5_next_bio, so a traversal never touches a bio that
does not belong to the stripe (see the sketch after the example below)

Let's have an example:
|  stripe1 |  stripe2    |  stripe3  |
...bio1......|bio2|bio3|....bio4.....

stripe2 has 4 bios; when it finishes, it decrements bi_phys_segments for all of
them, but only calls end_bio for bio2 and bio3. bio1->bi_next still points to
bio2, but that doesn't matter: when stripe1 finishes, it will not touch bio2
because of the r5_next_bio check. Later, stripe1 will end_bio bio1 and stripe3
will end_bio bio4.
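
The r5_next_bio() check mentioned above is, roughly, the following helper
(paraphrased from the raid5.c of that time, before the immutable-biovec
rework; it is not changed by this patch): a bio is followed to bi_next only
while it still begins inside this stripe, so stripe1 never walks into bio2
even though bio1->bi_next points at it.

static inline struct bio *r5_next_bio(struct bio *bio, sector_t sector)
{
	int sectors = bio->bi_size >> 9;

	if (bio->bi_sector + sectors < sector + STRIPE_SECTORS)
		return bio->bi_next;
	else
		return NULL;
}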

Before add_stripe_bio() adds a bio to a stripe, we have already incremented the
bio's bi_phys_segments, so we don't need to worry about other stripes releasing
the bio.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   51 +++++++++++++++------------------------------------
 1 file changed, 15 insertions(+), 36 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-13 14:08:04.000000000 +0800
+++ linux/drivers/md/raid5.c	2012-06-13 14:12:23.989189085 +0800
@@ -747,15 +747,13 @@ static void ops_complete_biofill(void *s
 {
 	struct stripe_head *sh = stripe_head_ref;
 	struct bio *return_bi = NULL;
-	struct r5conf *conf = sh->raid_conf;
 	int i;
 
 	pr_debug("%s: stripe %llu\n", __func__,
 		(unsigned long long)sh->sector);
 
 	/* clear completed biofills */
-	spin_lock_irq(&conf->device_lock);
-	spin_lock(&sh->stripe_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	for (i = sh->disks; i--; ) {
 		struct r5dev *dev = &sh->dev[i];
 
@@ -781,8 +779,7 @@ static void ops_complete_biofill(void *s
 			}
 		}
 	}
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 	clear_bit(STRIPE_BIOFILL_RUN, &sh->state);
 
 	return_io(return_bi);
@@ -794,7 +791,6 @@ static void ops_complete_biofill(void *s
 static void ops_run_biofill(struct stripe_head *sh)
 {
 	struct dma_async_tx_descriptor *tx = NULL;
-	struct r5conf *conf = sh->raid_conf;
 	struct async_submit_ctl submit;
 	int i;
 
@@ -805,12 +801,10 @@ static void ops_run_biofill(struct strip
 		struct r5dev *dev = &sh->dev[i];
 		if (test_bit(R5_Wantfill, &dev->flags)) {
 			struct bio *rbi;
-			spin_lock_irq(&conf->device_lock);
-			spin_lock(&sh->stripe_lock);
+			spin_lock_irq(&sh->stripe_lock);
 			dev->read = rbi = dev->toread;
 			dev->toread = NULL;
-			spin_unlock(&sh->stripe_lock);
-			spin_unlock_irq(&conf->device_lock);
+			spin_unlock_irq(&sh->stripe_lock);
 			while (rbi && rbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
 				tx = async_copy_data(0, rbi, dev->page,
@@ -1146,14 +1140,12 @@ ops_run_biodrain(struct stripe_head *sh,
 		if (test_and_clear_bit(R5_Wantdrain, &dev->flags)) {
 			struct bio *wbi;
 
-			spin_lock_irq(&sh->raid_conf->device_lock);
-			spin_lock(&sh->stripe_lock);
+			spin_lock_irq(&sh->stripe_lock);
 			chosen = dev->towrite;
 			dev->towrite = NULL;
 			BUG_ON(dev->written);
 			wbi = dev->written = chosen;
-			spin_unlock(&sh->stripe_lock);
-			spin_unlock_irq(&sh->raid_conf->device_lock);
+			spin_unlock_irq(&sh->stripe_lock);
 
 			while (wbi && wbi->bi_sector <
 				dev->sector + STRIPE_SECTORS) {
@@ -2340,8 +2332,7 @@ static int add_stripe_bio(struct stripe_
 		(unsigned long long)sh->sector);
 
 
-	spin_lock_irq(&conf->device_lock);
-	spin_lock(&sh->stripe_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	if (forwrite) {
 		bip = &sh->dev[dd_idx].towrite;
 		if (*bip == NULL && sh->dev[dd_idx].written == NULL)
@@ -2375,8 +2366,7 @@ static int add_stripe_bio(struct stripe_
 		if (sector >= sh->dev[dd_idx].sector + STRIPE_SECTORS)
 			set_bit(R5_OVERWRITE, &sh->dev[dd_idx].flags);
 	}
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 
 	pr_debug("added bi b#%llu to stripe s#%llu, disk %d.\n",
 		(unsigned long long)(*bip)->bi_sector,
@@ -2392,8 +2382,7 @@ static int add_stripe_bio(struct stripe_
 
  overlap:
 	set_bit(R5_Overlap, &sh->dev[dd_idx].flags);
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 	return 0;
 }
 
@@ -2443,8 +2432,7 @@ handle_failed_stripe(struct r5conf *conf
 				rdev_dec_pending(rdev, conf->mddev);
 			}
 		}
-		spin_lock_irq(&conf->device_lock);
-		spin_lock(&sh->stripe_lock);
+		spin_lock_irq(&sh->stripe_lock);
 		/* fail all writes first */
 		bi = sh->dev[i].towrite;
 		sh->dev[i].towrite = NULL;
@@ -2506,8 +2494,7 @@ handle_failed_stripe(struct r5conf *conf
 				bi = nextbi;
 			}
 		}
-		spin_unlock(&sh->stripe_lock);
-		spin_unlock_irq(&conf->device_lock);
+		spin_unlock_irq(&sh->stripe_lock);
 		if (bitmap_end)
 			bitmap_endwrite(conf->mddev->bitmap, sh->sector,
 					STRIPE_SECTORS, 0, 0);
@@ -2713,8 +2700,7 @@ static void handle_stripe_clean_event(st
 				struct bio *wbi, *wbi2;
 				int bitmap_end = 0;
 				pr_debug("Return write for disc %d\n", i);
-				spin_lock_irq(&conf->device_lock);
-				spin_lock(&sh->stripe_lock);
+				spin_lock_irq(&sh->stripe_lock);
 				wbi = dev->written;
 				dev->written = NULL;
 				while (wbi && wbi->bi_sector <
@@ -2729,8 +2715,7 @@ static void handle_stripe_clean_event(st
 				}
 				if (dev->towrite == NULL)
 					bitmap_end = 1;
-				spin_unlock(&sh->stripe_lock);
-				spin_unlock_irq(&conf->device_lock);
+				spin_unlock_irq(&sh->stripe_lock);
 				if (bitmap_end)
 					bitmap_endwrite(conf->mddev->bitmap,
 							sh->sector,
@@ -3188,8 +3173,7 @@ static void analyse_stripe(struct stripe
 
 	/* Now to look around and see what can be done */
 	rcu_read_lock();
-	spin_lock_irq(&conf->device_lock);
-	spin_lock(&sh->stripe_lock);
+	spin_lock_irq(&sh->stripe_lock);
 	for (i=disks; i--; ) {
 		struct md_rdev *rdev;
 		sector_t first_bad;
@@ -3335,8 +3319,7 @@ static void analyse_stripe(struct stripe
 				do_recovery = 1;
 		}
 	}
-	spin_unlock(&sh->stripe_lock);
-	spin_unlock_irq(&conf->device_lock);
+	spin_unlock_irq(&sh->stripe_lock);
 	if (test_bit(STRIPE_SYNCING, &sh->state)) {
 		/* If there is a failed device being replaced,
 		 *     we must be recovering.
@@ -4127,9 +4110,7 @@ static void make_request(struct mddev *m
 	if (!plugged)
 		md_wakeup_thread(mddev->thread);
 
-	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_active_stripes(bi);
-	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0) {
 
 		if ( rw == WRITE )
@@ -4509,9 +4490,7 @@ static int  retry_aligned_read(struct r5
 		release_stripe(sh);
 		handled++;
 	}
-	spin_lock_irq(&conf->device_lock);
 	remaining = raid5_dec_bi_active_stripes(raid_bio);
-	spin_unlock_irq(&conf->device_lock);
 	if (remaining == 0)
 		bio_endio(raid_bio, 0);
 	if (atomic_dec_and_test(&conf->active_aligned_reads))



* [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
                   ` (3 preceding siblings ...)
  2012-06-19  1:59 ` [patch 4/9 v2] raid5: remove some device_lock locking places Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-21  0:56   ` Dan Williams
  2012-06-19  1:59 ` [patch 6/9 v2] md: personality can provide unplug private data Shaohua Li
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-reduce-release_stripe-lock.patch --]
[-- Type: text/plain, Size: 4720 bytes --]

release_stripe() is a place where conf->device_lock is heavily contended. We
take the lock even when the stripe count isn't 1, which isn't required. On the
other hand, decrementing the count first and taking the lock only when the
count reaches 0 can expose races:
1. between the decrement and taking the lock, another thread hits the stripe in
the cache and increases the count. The stripe will have been deleted from any
list. In this case the stripe count isn't 0.
2. between the decrement and taking the lock, another thread hits the stripe in
the cache and releases it. In this case the stripe is already on a specific
list. We do list_move to adjust its position.
So both cases look fixable to me.
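
The new release_stripe() fast path is essentially atomic_dec_and_lock(): only
touch device_lock when the count may hit zero. A rough userspace analog of that
primitive, just to show the shape of the fast path (illustration only; the real
code additionally relies on list_move and the stripe cache to cope with the two
races above):

#include <stdatomic.h>
#include <pthread.h>

/* returns 1 with *lock held if the count dropped to zero, else 0 */
static int dec_and_lock(atomic_int *count, pthread_mutex_t *lock)
{
	int old = atomic_load(count);

	/* fast path: while the count stays above 1, never touch the lock */
	while (old > 1) {
		if (atomic_compare_exchange_weak(count, &old, old - 1))
			return 0;
	}

	/* slow path: the count may reach zero; decrement under the lock */
	pthread_mutex_lock(lock);
	if (atomic_fetch_sub(count, 1) == 1)
		return 1;	/* caller does the cleanup, then unlocks */
	pthread_mutex_unlock(lock);
	return 0;
}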

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   79 ++++++++++++++++++++++++++++++-----------------------
 1 file changed, 45 insertions(+), 34 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-08 13:18:39.296561227 +0800
+++ linux/drivers/md/raid5.c	2012-06-08 13:38:38.209487084 +0800
@@ -196,47 +196,61 @@ static int stripe_operations_active(stru
 	       test_bit(STRIPE_COMPUTE_RUN, &sh->state);
 }
 
-static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+static void handle_release_stripe(struct r5conf *conf, struct stripe_head *sh)
 {
-	if (atomic_dec_and_test(&sh->count)) {
-		BUG_ON(!list_empty(&sh->lru));
-		BUG_ON(atomic_read(&conf->active_stripes)==0);
-		if (test_bit(STRIPE_HANDLE, &sh->state)) {
-			if (test_bit(STRIPE_DELAYED, &sh->state))
-				list_add_tail(&sh->lru, &conf->delayed_list);
-			else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
-				   sh->bm_seq - conf->seq_write > 0)
-				list_add_tail(&sh->lru, &conf->bitmap_list);
-			else {
-				clear_bit(STRIPE_BIT_DELAY, &sh->state);
-				list_add_tail(&sh->lru, &conf->handle_list);
-			}
-			md_wakeup_thread(conf->mddev->thread);
-		} else {
-			BUG_ON(stripe_operations_active(sh));
-			if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
-				if (atomic_dec_return(&conf->preread_active_stripes)
-				    < IO_THRESHOLD)
-					md_wakeup_thread(conf->mddev->thread);
-			atomic_dec(&conf->active_stripes);
-			if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
-				list_add_tail(&sh->lru, &conf->inactive_list);
-				wake_up(&conf->wait_for_stripe);
-				if (conf->retry_read_aligned)
-					md_wakeup_thread(conf->mddev->thread);
-			}
+	/*
+	 * Before we hold device_lock, other thread can hit this stripe
+	 * in cache. It could do:
+	 * 1. just get_active_stripe(). The stripe count isn't 0 then.
+	 * 2. do get_active_stripe() and follow release_stripe(). So the
+	 * stripe might be already released and already in specific
+	 * list. we do list_move to adjust its position in the list.
+	 */
+	BUG_ON(atomic_read(&conf->active_stripes)==0);
+	if (test_bit(STRIPE_HANDLE, &sh->state)) {
+		if (test_bit(STRIPE_DELAYED, &sh->state))
+			list_move_tail(&sh->lru, &conf->delayed_list);
+		else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
+			   sh->bm_seq - conf->seq_write > 0)
+			list_move_tail(&sh->lru, &conf->bitmap_list);
+		else {
+			clear_bit(STRIPE_BIT_DELAY, &sh->state);
+			list_move_tail(&sh->lru, &conf->handle_list);
+		}
+		md_wakeup_thread(conf->mddev->thread);
+	} else {
+		BUG_ON(stripe_operations_active(sh));
+		if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
+			if (atomic_dec_return(&conf->preread_active_stripes)
+			    < IO_THRESHOLD)
+				md_wakeup_thread(conf->mddev->thread);
+		atomic_dec(&conf->active_stripes);
+		if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
+			list_move_tail(&sh->lru, &conf->inactive_list);
+			wake_up(&conf->wait_for_stripe);
+			if (conf->retry_read_aligned)
+				md_wakeup_thread(conf->mddev->thread);
 		}
 	}
 }
 
+static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
+{
+	if (atomic_dec_and_test(&sh->count))
+		handle_release_stripe(conf, sh);
+}
+
 static void release_stripe(struct stripe_head *sh)
 {
 	struct r5conf *conf = sh->raid_conf;
 	unsigned long flags;
 
-	spin_lock_irqsave(&conf->device_lock, flags);
-	__release_stripe(conf, sh);
-	spin_unlock_irqrestore(&conf->device_lock, flags);
+	local_irq_save(flags);
+	if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
+		handle_release_stripe(conf, sh);
+		spin_unlock(&conf->device_lock);
+	}
+	local_irq_restore(flags);
 }
 
 static inline void remove_hash(struct stripe_head *sh)
@@ -479,9 +493,6 @@ get_active_stripe(struct r5conf *conf, s
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
-				if (list_empty(&sh->lru) &&
-				    !test_bit(STRIPE_EXPANDING, &sh->state))
-					BUG();
 				list_del_init(&sh->lru);
 			}
 		}



* [patch 6/9 v2] md: personality can provide unplug private data
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
                   ` (4 preceding siblings ...)
  2012-06-19  1:59 ` [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-19  1:59 ` [patch 7/9 v2] raid5: make_request use batch stripe release Shaohua Li
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: md-plug-add-callback.patch --]
[-- Type: text/plain, Size: 5624 bytes --]

Allow a personality to provide unplug private data. The next patch will use it.
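
With this change a personality can ask for a callback plus size bytes of
zeroed private data placed right after the md_plug_cb. A sketch of how a
caller might use the new hook (the names here are illustrative; patch 7 does
essentially this with a bare struct list_head):

/* hypothetical per-plug private data */
struct my_plug_data {
	struct list_head stripes;	/* filled while the task is plugged */
};

static void my_unplug(struct md_plug_cb *mdcb)
{
	struct my_plug_data *data = (struct my_plug_data *)(mdcb + 1);

	/* drain data->stripes here; runs when the task unplugs */
}

static void my_make_request(struct mddev *mddev, struct bio *bi)
{
	struct md_plug_cb *mdcb;

	mdcb = mddev_check_plugged(mddev, my_unplug,
				   sizeof(struct my_plug_data));
	if (mdcb) {
		struct my_plug_data *data = (struct my_plug_data *)(mdcb + 1);

		/* the private area arrives zeroed, so detect first use and
		 * INIT_LIST_HEAD(&data->stripes) before queueing on it */
	} else {
		/* no plug in effect: do the work immediately */
	}
}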

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/md.c     |   31 +++++++++++++------------------
 drivers/md/md.h     |   20 +++++++++++++++++++-
 drivers/md/raid1.c  |    2 +-
 drivers/md/raid10.c |    2 +-
 drivers/md/raid5.c  |    2 +-
 5 files changed, 35 insertions(+), 22 deletions(-)

Index: linux/drivers/md/md.c
===================================================================
--- linux.orig/drivers/md/md.c	2012-06-13 16:18:20.794187026 +0800
+++ linux/drivers/md/md.c	2012-06-18 08:29:26.887317468 +0800
@@ -498,22 +498,13 @@ void md_flush_request(struct mddev *mdde
 }
 EXPORT_SYMBOL(md_flush_request);
 
-/* Support for plugging.
- * This mirrors the plugging support in request_queue, but does not
- * require having a whole queue or request structures.
- * We allocate an md_plug_cb for each md device and each thread it gets
- * plugged on.  This links tot the private plug_handle structure in the
- * personality data where we keep a count of the number of outstanding
- * plugs so other code can see if a plug is active.
- */
-struct md_plug_cb {
-	struct blk_plug_cb cb;
-	struct mddev *mddev;
-};
 
 static void plugger_unplug(struct blk_plug_cb *cb)
 {
 	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);
+
+	if (mdcb->unplug)
+		mdcb->unplug(mdcb);
 	if (atomic_dec_and_test(&mdcb->mddev->plug_cnt))
 		md_wakeup_thread(mdcb->mddev->thread);
 	kfree(mdcb);
@@ -522,13 +513,14 @@ static void plugger_unplug(struct blk_pl
 /* Check that an unplug wakeup will come shortly.
  * If not, wakeup the md thread immediately
  */
-int mddev_check_plugged(struct mddev *mddev)
+struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
+	md_unplug_func_t unplug, size_t size)
 {
 	struct blk_plug *plug = current->plug;
 	struct md_plug_cb *mdcb;
 
 	if (!plug)
-		return 0;
+		return NULL;
 
 	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
 		if (mdcb->cb.callback == plugger_unplug &&
@@ -538,19 +530,22 @@ int mddev_check_plugged(struct mddev *md
 						    struct md_plug_cb,
 						    cb.list))
 				list_move(&mdcb->cb.list, &plug->cb_list);
-			return 1;
+			return mdcb;
 		}
 	}
 	/* Not currently on the callback list */
-	mdcb = kmalloc(sizeof(*mdcb), GFP_ATOMIC);
+	mdcb = kmalloc(sizeof(*mdcb) + size, GFP_ATOMIC);
 	if (!mdcb)
-		return 0;
+		return NULL;
 
 	mdcb->mddev = mddev;
 	mdcb->cb.callback = plugger_unplug;
 	atomic_inc(&mddev->plug_cnt);
 	list_add(&mdcb->cb.list, &plug->cb_list);
-	return 1;
+	mdcb->unplug = unplug;
+	if (size)
+		memset((void *)(mdcb + 1), 0, size);
+	return mdcb;
 }
 EXPORT_SYMBOL_GPL(mddev_check_plugged);
 
Index: linux/drivers/md/md.h
===================================================================
--- linux.orig/drivers/md/md.h	2012-06-13 16:18:20.806186854 +0800
+++ linux/drivers/md/md.h	2012-06-18 08:31:44.745583865 +0800
@@ -630,6 +630,24 @@ extern struct bio *bio_clone_mddev(struc
 				   struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 				   struct mddev *mddev);
-extern int mddev_check_plugged(struct mddev *mddev);
+
+/* Support for plugging.
+ * This mirrors the plugging support in request_queue, but does not
+ * require having a whole queue or request structures.
+ * We allocate an md_plug_cb for each md device and each thread it gets
+ * plugged on.  This links tot the private plug_handle structure in the
+ * personality data where we keep a count of the number of outstanding
+ * plugs so other code can see if a plug is active.
+ */
+struct md_plug_cb;
+typedef void (*md_unplug_func_t)(struct md_plug_cb *mdcb);
+struct md_plug_cb {
+	struct blk_plug_cb cb;
+	struct mddev *mddev;
+	md_unplug_func_t unplug;
+};
+
+extern struct md_plug_cb *mddev_check_plugged(struct mddev *mddev,
+	md_unplug_func_t unplug, size_t size);
 extern void md_trim_bio(struct bio *bio, int offset, int size);
 #endif /* _MD_MD_H */
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-13 16:18:20.766187393 +0800
+++ linux/drivers/md/raid1.c	2012-06-18 08:29:47.923047136 +0800
@@ -1034,7 +1034,7 @@ read_again:
 	 * the bad blocks.  Each set of writes gets it's own r1bio
 	 * with a set of bios attached.
 	 */
-	plugged = mddev_check_plugged(mddev);
+	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 
 	disks = conf->raid_disks * 2;
  retry_write:
Index: linux/drivers/md/raid10.c
===================================================================
--- linux.orig/drivers/md/raid10.c	2012-06-13 16:18:20.730187418 +0800
+++ linux/drivers/md/raid10.c	2012-06-18 08:29:59.338909420 +0800
@@ -1239,7 +1239,7 @@ read_again:
 	 * of r10_bios is recored in bio->bi_phys_segments just as with
 	 * the read case.
 	 */
-	plugged = mddev_check_plugged(mddev);
+	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 
 	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
 	raid10_find_phys(conf, r10_bio);
Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-14 09:03:59.024973998 +0800
+++ linux/drivers/md/raid5.c	2012-06-18 08:30:13.562730340 +0800
@@ -4007,7 +4007,7 @@ static void make_request(struct mddev *m
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
-	plugged = mddev_check_plugged(mddev);
+	plugged = !!mddev_check_plugged(mddev, NULL, 0);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int previous;



* [patch 7/9 v2] raid5: make_request use batch stripe release
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
                   ` (5 preceding siblings ...)
  2012-06-19  1:59 ` [patch 6/9 v2] md: personality can provide unplug private data Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-19  1:59 ` [patch 8/9 v2] raid5: raid5d handle stripe in batch way Shaohua Li
  2012-06-19  1:59 ` [patch 9/9 v2] raid5: create multiple threads to handle stripes Shaohua Li
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-make_request-relase_stripe-batch.patch --]
[-- Type: text/plain, Size: 4473 bytes --]

make_request() does a stripe release for every stripe, and the stripe usually
has count 1, which means the previous release_stripe() optimization doesn't
help. In my test, this release_stripe() became the heaviest place taking
conf->device_lock after the previous patches were applied.

The patch below batches the stripe releases: all the stripes are released at
unplug time. The STRIPE_ON_UNPLUG_LIST bit protects against concurrent access
to the stripe lru.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   64 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 drivers/md/raid5.h |    1 
 2 files changed, 60 insertions(+), 5 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-18 08:30:13.562730340 +0800
+++ linux/drivers/md/raid5.c	2012-06-18 08:34:54.735195847 +0800
@@ -489,7 +489,8 @@ get_active_stripe(struct r5conf *conf, s
 		} else {
 			if (atomic_read(&sh->count)) {
 				BUG_ON(!list_empty(&sh->lru)
-				    && !test_bit(STRIPE_EXPANDING, &sh->state));
+				    && !test_bit(STRIPE_EXPANDING, &sh->state)
+				    && !test_bit(STRIPE_ON_UNPLUG_LIST, &sh->state));
 			} else {
 				if (!test_bit(STRIPE_HANDLE, &sh->state))
 					atomic_inc(&conf->active_stripes);
@@ -3979,6 +3980,51 @@ static struct stripe_head *__get_priorit
 	return sh;
 }
 
+#define raid5_unplug_list(mdcb) (struct list_head *)(mdcb + 1)
+static void raid5_unplug(struct md_plug_cb *mdcb)
+{
+	struct list_head *list = raid5_unplug_list(mdcb);
+	struct stripe_head *sh;
+	struct r5conf *conf = mdcb->mddev->private;
+
+	if (list->next == NULL || list_empty(list))
+		return;
+	spin_lock_irq(&conf->device_lock);
+	while (!list_empty(list)) {
+		sh = list_entry(list->next, struct stripe_head, lru);
+		list_del_init(&sh->lru);
+		/*
+		 * avoid race release_stripe_plug() sees STRIPE_ON_UNPLUG_LIST
+		 * clear but the stripe is still in our list
+		 */
+		smp_mb__before_clear_bit();
+		clear_bit(STRIPE_ON_UNPLUG_LIST, &sh->state);
+		__release_stripe(conf, sh);
+	}
+	spin_unlock_irq(&conf->device_lock);
+}
+
+static void release_stripe_plug(struct md_plug_cb *mdcb,
+				struct stripe_head *sh)
+{
+	struct list_head *list = raid5_unplug_list(mdcb);
+
+	if (!mdcb) {
+		release_stripe(sh);
+		return;
+	}
+
+	if (list->next == NULL) {
+		INIT_LIST_HEAD(list);
+		mdcb->unplug = raid5_unplug;
+	}
+
+	if (!test_and_set_bit(STRIPE_ON_UNPLUG_LIST, &sh->state))
+		list_add_tail(&sh->lru, list);
+	else
+		release_stripe(sh);
+}
+
 static void make_request(struct mddev *mddev, struct bio * bi)
 {
 	struct r5conf *conf = mddev->private;
@@ -3988,7 +4034,7 @@ static void make_request(struct mddev *m
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
 	int remaining;
-	int plugged;
+	struct md_plug_cb *mdcb;
 
 	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
 		md_flush_request(mddev, bi);
@@ -4007,7 +4053,8 @@ static void make_request(struct mddev *m
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
-	plugged = !!mddev_check_plugged(mddev, NULL, 0);
+	mdcb = mddev_check_plugged(mddev, raid5_unplug,
+				   sizeof(struct list_head));
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int previous;
@@ -4109,7 +4156,14 @@ static void make_request(struct mddev *m
 			if ((bi->bi_rw & REQ_SYNC) &&
 			    !test_and_set_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
 				atomic_inc(&conf->preread_active_stripes);
-			release_stripe(sh);
+			/*
+			 * We must recheck here. schedule() might be called
+			 * above which makes unplug invoked already, so the old
+			 * mdcb is invalid
+			 */
+			mdcb = mddev_check_plugged(mddev, raid5_unplug,
+						   sizeof(struct list_head));
+			release_stripe_plug(mdcb, sh);
 		} else {
 			/* cannot get stripe for read-ahead, just give-up */
 			clear_bit(BIO_UPTODATE, &bi->bi_flags);
@@ -4118,7 +4172,7 @@ static void make_request(struct mddev *m
 		}
 			
 	}
-	if (!plugged)
+	if (!mdcb)
 		md_wakeup_thread(mddev->thread);
 
 	remaining = raid5_dec_bi_active_stripes(bi);
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-18 08:23:32.939767146 +0800
+++ linux/drivers/md/raid5.h	2012-06-18 08:32:40.572882491 +0800
@@ -320,6 +320,7 @@ enum {
 	STRIPE_BIOFILL_RUN,
 	STRIPE_COMPUTE_RUN,
 	STRIPE_OPS_REQ_PENDING,
+	STRIPE_ON_UNPLUG_LIST,
 };
 
 /*



* [patch 8/9 v2] raid5: raid5d handle stripe in batch way
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
                   ` (6 preceding siblings ...)
  2012-06-19  1:59 ` [patch 7/9 v2] raid5: make_request use batch stripe release Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  2012-06-19  1:59 ` [patch 9/9 v2] raid5: create multiple threads to handle stripes Shaohua Li
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-raid5d-fetch-stripe-batch.patch --]
[-- Type: text/plain, Size: 2330 bytes --]

Let raid5d handle stripes in batches to reduce conf->device_lock locking.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |   45 ++++++++++++++++++++++++++++++++-------------
 1 file changed, 32 insertions(+), 13 deletions(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-18 08:34:54.735195847 +0800
+++ linux/drivers/md/raid5.c	2012-06-18 08:36:47.833775907 +0800
@@ -4563,6 +4563,30 @@ static int  retry_aligned_read(struct r5
 	return handled;
 }
 
+#define MAX_STRIPE_BATCH 8
+static int handle_active_stripes(struct r5conf *conf)
+{
+	struct stripe_head *batch[MAX_STRIPE_BATCH], *sh;
+	int i, batch_size = 0;
+
+	while (batch_size < MAX_STRIPE_BATCH &&
+			(sh = __get_priority_stripe(conf)) != NULL)
+		batch[batch_size++] = sh;
+
+	if (batch_size == 0)
+		return batch_size;
+	spin_unlock_irq(&conf->device_lock);
+
+	for (i = 0; i < batch_size; i++)
+		handle_stripe(batch[i]);
+
+	cond_resched();
+
+	spin_lock_irq(&conf->device_lock);
+	for (i = 0; i < batch_size; i++)
+		__release_stripe(conf, batch[i]);
+	return batch_size;
+}
 
 /*
  * This is our raid5 kernel thread.
@@ -4573,7 +4597,6 @@ static int  retry_aligned_read(struct r5
  */
 static void raid5d(struct mddev *mddev)
 {
-	struct stripe_head *sh;
 	struct r5conf *conf = mddev->private;
 	int handled;
 	struct blk_plug plug;
@@ -4587,6 +4610,7 @@ static void raid5d(struct mddev *mddev)
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct bio *bio;
+		int batch_size;
 
 		if (atomic_read(&mddev->plug_cnt) == 0 &&
 		    !list_empty(&conf->bitmap_list)) {
@@ -4611,21 +4635,16 @@ static void raid5d(struct mddev *mddev)
 			handled++;
 		}
 
-		sh = __get_priority_stripe(conf);
-
-		if (!sh)
+		batch_size = handle_active_stripes(conf);
+		if (!batch_size)
 			break;
-		spin_unlock_irq(&conf->device_lock);
-		
-		handled++;
-		handle_stripe(sh);
-		release_stripe(sh);
-		cond_resched();
+		handled += batch_size;
 
-		if (mddev->flags & ~(1<<MD_CHANGE_PENDING))
+		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
+			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
-
-		spin_lock_irq(&conf->device_lock);
+			spin_lock_irq(&conf->device_lock);
+		}
 	}
 	pr_debug("%d stripes handled\n", handled);
 



* [patch 9/9 v2] raid5: create multiple threads to handle stripes
  2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
                   ` (7 preceding siblings ...)
  2012-06-19  1:59 ` [patch 8/9 v2] raid5: raid5d handle stripe in batch way Shaohua Li
@ 2012-06-19  1:59 ` Shaohua Li
  8 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-19  1:59 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, dan.j.williams, shli

[-- Attachment #1: raid5-multiple-threads.patch --]
[-- Type: text/plain, Size: 7030 bytes --]

Like raid 1/10, raid5 uses one thread to handle stripes. On fast storage, that
thread becomes a bottleneck. raid5 can offload calculations like checksum to
async threads, but if the storage is fast, scheduling and running the async
work introduces heavy lock contention in the workqueue, which makes that
optimization useless. And calculation isn't the only bottleneck. For example,
in my test the raid5 thread must handle > 450k requests per second; just doing
dispatch and completion is already more than one thread can keep up with. The
only way to scale is to use several threads to handle stripes.

With this patch, the user can create several extra threads to handle stripes.
How many threads work best depends on the disk number, so the thread number can
be changed from userspace. By default, the thread number equals the disk
number.
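
As a usage example (assuming the array is assembled as /dev/md0; the attribute
name comes from the patch below), the current count can be read with
"cat /sys/block/md0/md/auxthread_number" and changed with, e.g.,
"echo 2 > /sys/block/md0/md/auxthread_number".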

In a 3-disk raid5 setup, 2 extra threads can provide a 130% throughput
improvement (with double stripe_cache_size) and the throughput is pretty close
to the theoretical value. With >= 4 disks the improvement is even bigger, for
example 200% for a 4-disk setup, but the throughput is far below the
theoretical value. This is caused by several factors such as request queue lock
contention, cache issues, and the latency introduced by how a stripe is handled
across the different disks. Those factors need further investigation.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid5.c |  129 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 drivers/md/raid5.h |    2 
 2 files changed, 130 insertions(+), 1 deletion(-)

Index: linux/drivers/md/raid5.c
===================================================================
--- linux.orig/drivers/md/raid5.c	2012-06-18 08:36:47.833775907 +0800
+++ linux/drivers/md/raid5.c	2012-06-18 08:36:55.161681879 +0800
@@ -4588,6 +4588,33 @@ static int handle_active_stripes(struct
 	return batch_size;
 }
 
+static void raid5auxd(struct mddev *mddev)
+{
+	struct r5conf *conf = mddev->private;
+	struct blk_plug plug;
+	int handled;
+
+	pr_debug("+++ raid5auxd active\n");
+
+	blk_start_plug(&plug);
+	handled = 0;
+	spin_lock_irq(&conf->device_lock);
+	while (1) {
+		int batch_size;
+
+		batch_size = handle_active_stripes(conf);
+		if (!batch_size)
+			break;
+		handled += batch_size;
+	}
+	pr_debug("%d stripes handled\n", handled);
+
+	spin_unlock_irq(&conf->device_lock);
+	blk_finish_plug(&plug);
+
+	pr_debug("--- raid5auxd inactive\n");
+}
+
 /*
  * This is our raid5 kernel thread.
  *
@@ -4610,7 +4637,7 @@ static void raid5d(struct mddev *mddev)
 	spin_lock_irq(&conf->device_lock);
 	while (1) {
 		struct bio *bio;
-		int batch_size;
+		int batch_size, i;
 
 		if (atomic_read(&mddev->plug_cnt) == 0 &&
 		    !list_empty(&conf->bitmap_list)) {
@@ -4640,6 +4667,9 @@ static void raid5d(struct mddev *mddev)
 			break;
 		handled += batch_size;
 
+		for (i = 0; i < conf->aux_thread_num; i++)
+			md_wakeup_thread(conf->aux_threads[i]);
+
 		if (mddev->flags & ~(1<<MD_CHANGE_PENDING)) {
 			spin_unlock_irq(&conf->device_lock);
 			md_check_recovery(mddev);
@@ -4764,10 +4794,85 @@ stripe_cache_active_show(struct mddev *m
 static struct md_sysfs_entry
 raid5_stripecache_active = __ATTR_RO(stripe_cache_active);
 
+static ssize_t
+raid5_show_auxthread_number(struct mddev *mddev, char *page)
+{
+	struct r5conf *conf = mddev->private;
+	if (conf)
+		return sprintf(page, "%d\n", conf->aux_thread_num);
+	else
+		return 0;
+}
+
+static ssize_t
+raid5_store_auxthread_number(struct mddev *mddev, const char *page, size_t len)
+{
+	struct r5conf *conf = mddev->private;
+	unsigned long new;
+	int i;
+	struct md_thread **threads;
+
+	if (len >= PAGE_SIZE)
+		return -EINVAL;
+	if (!conf)
+		return -ENODEV;
+
+	if (strict_strtoul(page, 10, &new))
+		return -EINVAL;
+
+	if (new == conf->aux_thread_num)
+		return len;
+
+	if (new > conf->aux_thread_num) {
+		threads = kmalloc(sizeof(struct md_thread *) * new, GFP_KERNEL);
+		if (!threads)
+			return -EFAULT;
+
+		i = conf->aux_thread_num;
+		while (i < new) {
+			char name[10];
+
+			sprintf(name, "aux%d", i);
+			threads[i] = md_register_thread(raid5auxd, mddev, name);
+			if (!threads[i])
+				goto error;
+			i++;
+		}
+		memcpy(threads, conf->aux_threads,
+			sizeof(struct md_thread *) * conf->aux_thread_num);
+		spin_lock_irq(&conf->device_lock);
+		kfree(conf->aux_threads);
+		conf->aux_threads = threads;
+		conf->aux_thread_num = new;
+		spin_unlock_irq(&conf->device_lock);
+	} else {
+		int old = conf->aux_thread_num;
+
+		spin_lock_irq(&conf->device_lock);
+		conf->aux_thread_num = new;
+		spin_unlock_irq(&conf->device_lock);
+		for (i = new; i < old; i++)
+			md_unregister_thread(&conf->aux_threads[i]);
+	}
+
+	return len;
+error:
+	while (--i >= conf->aux_thread_num)
+		md_unregister_thread(&threads[i]);
+	kfree(threads);
+	return -EFAULT;
+}
+
+static struct md_sysfs_entry
+raid5_auxthread_number = __ATTR(auxthread_number, S_IRUGO|S_IWUSR,
+				raid5_show_auxthread_number,
+				raid5_store_auxthread_number);
+
 static struct attribute *raid5_attrs[] =  {
 	&raid5_stripecache_size.attr,
 	&raid5_stripecache_active.attr,
 	&raid5_preread_bypass_threshold.attr,
+	&raid5_auxthread_number.attr,
 	NULL,
 };
 static struct attribute_group raid5_attrs_group = {
@@ -4815,6 +4920,7 @@ static void raid5_free_percpu(struct r5c
 
 static void free_conf(struct r5conf *conf)
 {
+	kfree(conf->aux_threads);
 	shrink_stripes(conf);
 	raid5_free_percpu(conf);
 	kfree(conf->disks);
@@ -4909,6 +5015,7 @@ static struct r5conf *setup_conf(struct
 	int raid_disk, memory, max_disks;
 	struct md_rdev *rdev;
 	struct disk_info *disk;
+	int i;
 
 	if (mddev->new_level != 5
 	    && mddev->new_level != 4
@@ -5032,6 +5139,22 @@ static struct r5conf *setup_conf(struct
 		printk(KERN_INFO "md/raid:%s: allocated %dkB\n",
 		       mdname(mddev), memory);
 
+	/* By default, auxthread number equals to disk number */
+	conf->aux_threads = kmalloc(sizeof(struct md_thread *) * max_disks,
+				    GFP_KERNEL);
+	if (!conf->aux_threads)
+		goto abort;
+	for (i = 0; i < max_disks; i++) {
+		char name[10];
+
+		sprintf(name, "aux%d", i);
+		conf->aux_threads[i] = md_register_thread(raid5auxd, mddev, name);
+		if (!conf->aux_threads[i])
+			break;
+	}
+
+	conf->aux_thread_num = i;
+
 	conf->thread = md_register_thread(raid5d, mddev, NULL);
 	if (!conf->thread) {
 		printk(KERN_ERR
@@ -5371,6 +5494,10 @@ abort:
 static int stop(struct mddev *mddev)
 {
 	struct r5conf *conf = mddev->private;
+	int i;
+
+	for (i = 0; i < conf->aux_thread_num; i++)
+		md_unregister_thread(&conf->aux_threads[i]);
 
 	md_unregister_thread(&mddev->thread);
 	if (mddev->queue)
Index: linux/drivers/md/raid5.h
===================================================================
--- linux.orig/drivers/md/raid5.h	2012-06-18 08:32:40.572882491 +0800
+++ linux/drivers/md/raid5.h	2012-06-18 08:36:55.161681879 +0800
@@ -458,6 +458,8 @@ struct r5conf {
 	 * the new thread here until we fully activate the array.
 	 */
 	struct md_thread	*thread;
+	int			aux_thread_num;
+	struct md_thread	**aux_threads;
 };
 
 /*



* Re: [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock
  2012-06-19  1:59 ` [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
@ 2012-06-21  0:56   ` Dan Williams
  2012-06-21  1:34     ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Dan Williams @ 2012-06-21  0:56 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe, shli

On Mon, Jun 18, 2012 at 6:59 PM, Shaohua Li <shli@kernel.org> wrote:
> release_stripe() is a place where conf->device_lock is heavily contended. We
> take the lock even when the stripe count isn't 1, which isn't required. On
> the other hand, decrementing the count first and taking the lock only when
> the count reaches 0 can expose races:
> 1. between the decrement and taking the lock, another thread hits the stripe
> in the cache and increases the count. The stripe will have been deleted from
> any list. In this case the stripe count isn't 0.
> 2. between the decrement and taking the lock, another thread hits the stripe
> in the cache and releases it. In this case the stripe is already on a
> specific list. We do list_move to adjust its position.
> So both cases look fixable to me.
>

These "1" and "2" comments no longer apply, right?  atomic_dec_and_lock
is equivalently safe to lock+dec_and_test.
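
[Illustrative aside: a simplified sketch of what atomic_dec_and_lock() does,
modeled on lib/dec_and_lock.c. The decrement that can actually reach zero is
performed while holding the lock, which is why it is as safe as
lock + dec_and_test:]

static int example_atomic_dec_and_lock(atomic_t *atomic, spinlock_t *lock)
{
	/* Fast path: decrement without the lock unless the count is 1. */
	if (atomic_add_unless(atomic, -1, 1))
		return 0;

	/* Slow path: take the lock, then do the final decrement under it. */
	spin_lock(lock);
	if (atomic_dec_and_test(atomic))
		return 1;	/* count hit 0; caller now holds the lock */
	spin_unlock(lock);
	return 0;
}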


> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   79 ++++++++++++++++++++++++++++++-----------------------
>  1 file changed, 45 insertions(+), 34 deletions(-)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c       2012-06-08 13:18:39.296561227 +0800
> +++ linux/drivers/md/raid5.c    2012-06-08 13:38:38.209487084 +0800
> @@ -196,47 +196,61 @@ static int stripe_operations_active(stru
>               test_bit(STRIPE_COMPUTE_RUN, &sh->state);
>  }
>
> -static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
> +static void handle_release_stripe(struct r5conf *conf, struct stripe_head *sh)
>  {
> -       if (atomic_dec_and_test(&sh->count)) {
> -               BUG_ON(!list_empty(&sh->lru));

Hmm, this BUG_ON does not get carried over?

> -               BUG_ON(atomic_read(&conf->active_stripes)==0);
> -               if (test_bit(STRIPE_HANDLE, &sh->state)) {
> -                       if (test_bit(STRIPE_DELAYED, &sh->state))
> -                               list_add_tail(&sh->lru, &conf->delayed_list);
> -                       else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> -                                  sh->bm_seq - conf->seq_write > 0)
> -                               list_add_tail(&sh->lru, &conf->bitmap_list);
> -                       else {
> -                               clear_bit(STRIPE_BIT_DELAY, &sh->state);
> -                               list_add_tail(&sh->lru, &conf->handle_list);
> -                       }
> -                       md_wakeup_thread(conf->mddev->thread);
> -               } else {
> -                       BUG_ON(stripe_operations_active(sh));
> -                       if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> -                               if (atomic_dec_return(&conf->preread_active_stripes)
> -                                   < IO_THRESHOLD)
> -                                       md_wakeup_thread(conf->mddev->thread);
> -                       atomic_dec(&conf->active_stripes);
> -                       if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> -                               list_add_tail(&sh->lru, &conf->inactive_list);
> -                               wake_up(&conf->wait_for_stripe);
> -                               if (conf->retry_read_aligned)
> -                                       md_wakeup_thread(conf->mddev->thread);
> -                       }
> +       /*
> +        * Before we hold device_lock, other thread can hit this stripe
> +        * in cache. It could do:
> +        * 1. just get_active_stripe(). The stripe count isn't 0 then.
> +        * 2. do get_active_stripe() and follow release_stripe(). So the
> +        * stripe might be already released and already in specific
> +        * list. we do list_move to adjust its position in the list.
> +        */
> +       BUG_ON(atomic_read(&conf->active_stripes)==0);
> +       if (test_bit(STRIPE_HANDLE, &sh->state)) {
> +               if (test_bit(STRIPE_DELAYED, &sh->state))
> +                       list_move_tail(&sh->lru, &conf->delayed_list);
> +               else if (test_bit(STRIPE_BIT_DELAY, &sh->state) &&
> +                          sh->bm_seq - conf->seq_write > 0)
> +                       list_move_tail(&sh->lru, &conf->bitmap_list);
> +               else {
> +                       clear_bit(STRIPE_BIT_DELAY, &sh->state);
> +                       list_move_tail(&sh->lru, &conf->handle_list);
> +               }
> +               md_wakeup_thread(conf->mddev->thread);
> +       } else {
> +               BUG_ON(stripe_operations_active(sh));
> +               if (test_and_clear_bit(STRIPE_PREREAD_ACTIVE, &sh->state))
> +                       if (atomic_dec_return(&conf->preread_active_stripes)
> +                           < IO_THRESHOLD)
> +                               md_wakeup_thread(conf->mddev->thread);
> +               atomic_dec(&conf->active_stripes);
> +               if (!test_bit(STRIPE_EXPANDING, &sh->state)) {
> +                       list_move_tail(&sh->lru, &conf->inactive_list);
> +                       wake_up(&conf->wait_for_stripe);
> +                       if (conf->retry_read_aligned)
> +                               md_wakeup_thread(conf->mddev->thread);
>                }
>        }
>  }
>
> +static void __release_stripe(struct r5conf *conf, struct stripe_head *sh)
> +{
> +       if (atomic_dec_and_test(&sh->count))
> +               handle_release_stripe(conf, sh);
> +}
> +
>  static void release_stripe(struct stripe_head *sh)
>  {
>        struct r5conf *conf = sh->raid_conf;
>        unsigned long flags;
>
> -       spin_lock_irqsave(&conf->device_lock, flags);
> -       __release_stripe(conf, sh);
> -       spin_unlock_irqrestore(&conf->device_lock, flags);
> +       local_irq_save(flags);
> +       if (atomic_dec_and_lock(&sh->count, &conf->device_lock)) {
> +               handle_release_stripe(conf, sh);
> +               spin_unlock(&conf->device_lock);
> +       }
> +       local_irq_restore(flags);
>  }
>
>  static inline void remove_hash(struct stripe_head *sh)
> @@ -479,9 +493,6 @@ get_active_stripe(struct r5conf *conf, s
>                        } else {
>                                if (!test_bit(STRIPE_HANDLE, &sh->state))
>                                        atomic_inc(&conf->active_stripes);
> -                               if (list_empty(&sh->lru) &&
> -                                   !test_bit(STRIPE_EXPANDING, &sh->state))
> -                                       BUG();

Why are these checks removed?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock
  2012-06-21  0:56   ` Dan Williams
@ 2012-06-21  1:34     ` Shaohua Li
  0 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-21  1:34 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid, neilb, axboe, shli

On Wed, Jun 20, 2012 at 05:56:25PM -0700, Dan Williams wrote:
> On Mon, Jun 18, 2012 at 6:59 PM, Shaohua Li <shli@kernel.org> wrote:
> > release_stripe() is a place where conf->device_lock is heavily contended. We
> > take the lock even when the stripe count isn't 1, which isn't required. On
> > the other hand, decreasing the count first and only taking the lock when the
> > count reaches 0 can expose races:
> > 1. between decrementing the count and taking the lock, another thread hits
> > the stripe in the cache and increases the count. The stripe will have been
> > removed from whatever list it was on. In this case the stripe count isn't 0.
> > 2. between decrementing the count and taking the lock, another thread hits
> > the stripe in the cache and releases it. In this case the stripe is already
> > on a specific list; we do list_move to adjust its position.
> > Both cases look fixable to me.
> >
> 
> These "1" and "2" comments no longer apply, right?  atomic_dec_and_lock
> is equivalently safe to lock+dec_and_test.

Oh, I had misunderstood atomic_dec_and_lock as "dec, then take the lock"
rather than behaving like "lock, then dec", so there is no race. The other
two deletions aren't required either. I'll repost this patch.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [patch 3/9 v2] raid5: lockless access raid5 overrided bi_phys_segments
  2012-06-19  1:59 ` [patch 3/9 v2] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
@ 2012-06-21  1:56   ` Dan Williams
  0 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2012-06-21  1:56 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, neilb, axboe, shli

On Mon, Jun 18, 2012 at 6:59 PM, Shaohua Li <shli@kernel.org> wrote:
> Raid5 overrides bio->bi_phys_segments, and it is currently accessed with
> device_lock held, which is unnecessary. We can actually make it lockless.
>
> Signed-off-by: Shaohua Li <shli@fusionio.com>
> ---
>  drivers/md/raid5.c |   58 +++++++++++++++++++++++++++++------------------------
>  1 file changed, 32 insertions(+), 26 deletions(-)
>
> Index: linux/drivers/md/raid5.c
> ===================================================================
> --- linux.orig/drivers/md/raid5.c       2012-06-08 11:48:25.652618012 +0800
> +++ linux/drivers/md/raid5.c    2012-06-08 13:15:31.458919391 +0800
> @@ -99,34 +99,40 @@ static inline struct bio *r5_next_bio(st
>  * We maintain a biased count of active stripes in the bottom 16 bits of
>  * bi_phys_segments, and a count of processed stripes in the upper 16 bits
>  */
> -static inline int raid5_bi_phys_segments(struct bio *bio)
> +static inline int raid5_bi_processed_stripes(struct bio *bio)
>  {
> -       return bio->bi_phys_segments & 0xffff;
> +       atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +       return (atomic_read(segments) >> 16) & 0xffff;
>  }
>
> -static inline int raid5_bi_hw_segments(struct bio *bio)
> +static inline int raid5_dec_bi_active_stripes(struct bio *bio)
>  {
> -       return (bio->bi_phys_segments >> 16) & 0xffff;
> +       atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +       return atomic_sub_return(1, segments) & 0xffff;
>  }
>
> -static inline int raid5_dec_bi_phys_segments(struct bio *bio)
> +static inline void raid5_inc_bi_active_stripes(struct bio *bio)
>  {
> -       --bio->bi_phys_segments;
> -       return raid5_bi_phys_segments(bio);
> +       atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +       atomic_inc(segments);
>  }
>
> -static inline int raid5_dec_bi_hw_segments(struct bio *bio)
> +static inline void raid5_set_bi_processed_stripes(struct bio *bio,
> +       unsigned int cnt)
>  {
> -       unsigned short val = raid5_bi_hw_segments(bio);
> +       atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +       int old, new;
>
> -       --val;
> -       bio->bi_phys_segments = (val << 16) | raid5_bi_phys_segments(bio);
> -       return val;
> +       do {
> +               old = atomic_read(segments);
> +               new = (old & 0xffff) | (cnt << 16);
> +       } while (atomic_cmpxchg(segments, old, new) != old);
>  }
>
> -static inline void raid5_set_bi_hw_segments(struct bio *bio, unsigned int cnt)
> +static inline void raid5_set_bi_stripes(struct bio *bio, unsigned int cnt)
>  {
> -       bio->bi_phys_segments = raid5_bi_phys_segments(bio) | (cnt << 16);
> +       atomic_t *segments = (atomic_t *)&bio->bi_phys_segments;
> +       atomic_set(segments, cnt);
>  }
>
>  /* Find first data disk in a raid6 stripe */
> @@ -767,7 +773,7 @@ static void ops_complete_biofill(void *s
>                        while (rbi && rbi->bi_sector <
>                                dev->sector + STRIPE_SECTORS) {
>                                rbi2 = r5_next_bio(rbi, dev->sector);
> -                               if (!raid5_dec_bi_phys_segments(rbi)) {
> +                               if (!raid5_dec_bi_active_stripes(rbi)) {
>                                        rbi->bi_next = return_bi;
>                                        return_bi = rbi;
>                                }
> @@ -2354,7 +2360,7 @@ static int add_stripe_bio(struct stripe_
>        if (*bip)
>                bi->bi_next = *bip;
>        *bip = bi;
> -       bi->bi_phys_segments++;
> +       raid5_inc_bi_active_stripes(bi);

Now that device_lock does not globally sync bio ingress/egress, would
it be worth having a comment here that says this ordering of adding
bi to the to{read|write} list is safe because make_request holds a
'segments' reference count until the bio has been added to all
affected stripes, and that within a given stripe bios are still synced
via stripe_lock?
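
[For illustration only, one possible wording of such a comment; this is a
hypothetical sketch, not text from any posted patch:]

	/*
	 * No device_lock is needed when queueing the bio: make_request()
	 * holds an extra 'segments' (active stripes) count on the bio until
	 * it has been added to every stripe it covers, so the bio cannot be
	 * completed early; within one stripe the bio lists are still
	 * serialized by sh->stripe_lock.
	 */
	raid5_inc_bi_active_stripes(bi);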

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-06-21  1:56 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-06-19  1:59 [patch 0/9 v2] raid5: improve write performance for fast storage Shaohua Li
2012-06-19  1:59 ` [patch 1/9 v2] raid5: use wake_up_all for overlap waking Shaohua Li
2012-06-19  1:59 ` [patch 2/9 v2] raid5: add a per-stripe lock Shaohua Li
2012-06-19  1:59 ` [patch 3/9 v2] raid5: lockless access raid5 overrided bi_phys_segments Shaohua Li
2012-06-21  1:56   ` Dan Williams
2012-06-19  1:59 ` [patch 4/9 v2] raid5: remove some device_lock locking places Shaohua Li
2012-06-19  1:59 ` [patch 5/9 v2] raid5: reduce chance release_stripe() taking device_lock Shaohua Li
2012-06-21  0:56   ` Dan Williams
2012-06-21  1:34     ` Shaohua Li
2012-06-19  1:59 ` [patch 6/9 v2] md: personality can provide unplug private data Shaohua Li
2012-06-19  1:59 ` [patch 7/9 v2] raid5: make_request use batch stripe release Shaohua Li
2012-06-19  1:59 ` [patch 8/9 v2] raid5: raid5d handle stripe in batch way Shaohua Li
2012-06-19  1:59 ` [patch 9/9 v2] raid5: create multiple threads to handle stripes Shaohua Li
