* [patch 0/4] MD: improve raid1/10 write performance for fast storage
@ 2012-05-23  7:26 Shaohua Li
  2012-05-23  7:26 ` [patch 1/4] raid1: directly dispatch write request if no bitmap Shaohua Li
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Shaohua Li @ 2012-05-23  7:26 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, shli

In raid1/10, all write requests are dispatched in a single thread. With fast
storage, the thread becomes a bottleneck, because it dispatches requests too
slowly. The thread also migrates freely, so the request completion cpu need not
match the submission cpu even when the driver/block layer supports such
affinity. This causes bad cache behaviour. Neither issue is a big deal for slow
storage.

Switching to percpu/per-thread dispatch dramatically increases performance: the
more raid disks there are, the bigger the boost. In a 4-disk raid10 setup, this
can double the throughput.

percpu/per-thread dispatch doesn't harm slow storage either. It is the same way
a raw device is accessed, and the correct block plug is already set, which
helps with request merging and reduces lock contention.
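
Concretely, with a bitmap the submission path becomes the following (taken
from patch 2/4; the raid10 version in patch 4/4 is analogous):

	struct write_list *list;

	/* queue the cloned write on the submitting cpu's pending list */
	spin_lock_irqsave(&conf->device_lock, flags);
	list = this_cpu_ptr(conf->write_list);
	bio_list_add(&list->pending_bio_list, mbio);
	conf->pending_count++;
	spin_unlock_irqrestore(&conf->device_lock, flags);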


* [patch 1/4] raid1: directly dispatch write request if no bitmap
  2012-05-23  7:26 [patch 0/4] MD: improve raid1/10 write performance for fast storage Shaohua Li
@ 2012-05-23  7:26 ` Shaohua Li
  2012-05-24  2:21   ` NeilBrown
  2012-05-23  7:26 ` [patch 2/4] raid1: percpu dispatch for write request if bitmap supported Shaohua Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-05-23  7:26 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, shli

In raid1, all write requests are dispatched in the raid1d thread. With fast
storage, the raid1d thread is a bottleneck, because it dispatches requests too
slowly. The raid1d thread also migrates freely, so the request completion cpu
need not match the submission cpu even when the driver/block layer supports
such affinity. This causes bad cache behaviour.

Without a bitmap, there is no point in queueing a bio to a thread and
dispatching it from there. Dispatching the bio directly doesn't affect
correctness and removes the above bottleneck.

Dispatching requests from multiple threads could potentially reduce request
merging and increase lock contention. For slow storage, only request merging is
a real concern. The caller of .make_request should already have the correct
block plug set, which takes care of request merging and locking just as when
accessing a raw device, so we don't need to worry about this too much.
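
For reference, a caller already wraps its submissions in a block plug, roughly
like this (a sketch, not part of the patch):

	struct blk_plug plug;

	blk_start_plug(&plug);
	/* enters raid1 make_request and, with this patch, goes straight
	 * down via generic_make_request */
	submit_bio(WRITE, bio);
	blk_finish_plug(&plug);	/* queued requests are merged/flushed here */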

In a 4k randwrite test with a 2-disk setup, the patch below provides 20% ~ 50%
performance improvement depending on numa binding.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-05-22 13:50:26.989820654 +0800
+++ linux/drivers/md/raid1.c	2012-05-22 13:56:46.117054559 +0800
@@ -1187,10 +1187,13 @@ read_again:
 		mbio->bi_private = r1_bio;
 
 		atomic_inc(&r1_bio->remaining);
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		if (bitmap) {
+			spin_lock_irqsave(&conf->device_lock, flags);
+			bio_list_add(&conf->pending_bio_list, mbio);
+			conf->pending_count++;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+		} else
+			generic_make_request(mbio);
 	}
 	/* Mustn't call r1_bio_write_done before this next test,
 	 * as it could result in the bio being freed.



* [patch 2/4] raid1: percpu dispatch for write request if bitmap supported
  2012-05-23  7:26 [patch 0/4] MD: improve raid1/10 write performance for fast storage Shaohua Li
  2012-05-23  7:26 ` [patch 1/4] raid1: directly dispatch write request if no bitmap Shaohua Li
@ 2012-05-23  7:26 ` Shaohua Li
  2012-05-24  3:34   ` NeilBrown
  2012-05-23  7:26 ` [patch 3/4] raid10: directly dispatch write request if no bitmap Shaohua Li
  2012-05-23  7:26 ` [patch 4/4] raid10: percpu dispatch for write request if bitmap supported Shaohua Li
  3 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-05-23  7:26 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, shli

In raid1, all write requests are dispatched in the raid1d thread. With fast
storage, the raid1d thread is a bottleneck, because it dispatches requests too
slowly. The raid1d thread also migrates freely, so the request completion cpu
need not match the submission cpu even when the driver/block layer supports
such affinity. This causes bad cache behaviour.

If bitmap support is enabled, write requests can only be dispatched after the
dirty bitmap is flushed out. Once the bitmap is flushed, how the write requests
are dispatched doesn't affect correctness, so a natural idea is to distribute
the dispatch across several threads. With this patch, requests are first added
to a percpu list; after the bitmap is flushed, each percpu list is dispatched
from a workqueue. This removes the above bottleneck.
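
The flush path in the diff below reduces to three steps (identifiers as in the
patch):

	/* 1) under device_lock, move each cpu's pending bios to a tmp list */
	bio_list_merge(&list->tmp_bio_list, &list->pending_bio_list);
	bio_list_init(&list->pending_bio_list);

	/* 2) flush the dirty bitmap once, outside the lock */
	bitmap_unplug(conf->mddev->bitmap);

	/* 3) move tmp to running and let each cpu's worker submit it */
	bio_list_merge(&list->running_bio_list, &list->tmp_bio_list);
	schedule_work_on(c, &list->work);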

In a 4k randwrite test with a 2-disk setup, the patch below provides 10% ~ 50%
performance improvement depending on numa binding.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   96 +++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid1.h |   10 ++++-
 2 files changed, 87 insertions(+), 19 deletions(-)

Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h	2012-05-22 19:07:40.806533172 +0800
+++ linux/drivers/md/raid1.h	2012-05-22 19:07:44.526486376 +0800
@@ -22,6 +22,14 @@ struct pool_info {
 	int	raid_disks;
 };
 
+struct write_list {
+	struct bio_list	pending_bio_list;
+	struct bio_list	running_bio_list;
+	struct bio_list	tmp_bio_list;
+	struct work_struct work;
+	struct r1conf *conf;
+};
+
 struct r1conf {
 	struct mddev		*mddev;
 	struct mirror_info	*mirrors;	/* twice 'raid_disks' to
@@ -50,7 +58,7 @@ struct r1conf {
 	struct list_head	retry_list;
 
 	/* queue pending writes to be submitted on unplug */
-	struct bio_list		pending_bio_list;
+	struct write_list __percpu *write_list;
 	int			pending_count;
 
 	/* for use when syncing mirrors:
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-05-22 19:07:44.514486551 +0800
+++ linux/drivers/md/raid1.c	2012-05-22 19:20:54.480552239 +0800
@@ -687,22 +687,21 @@ static int raid1_congested(void *data, i
 		md_raid1_congested(mddev, bits);
 }
 
-static void flush_pending_writes(struct r1conf *conf)
+static void raid1_write_work(struct work_struct *work)
 {
-	/* Any writes that have been queued but are awaiting
-	 * bitmap updates get flushed here.
-	 */
-	spin_lock_irq(&conf->device_lock);
+	struct write_list *list = container_of(work, struct write_list, work);
+	struct bio *bio;
+	struct blk_plug plug;
+	bool try_again = true;
 
-	if (conf->pending_bio_list.head) {
-		struct bio *bio;
-		bio = bio_list_get(&conf->pending_bio_list);
-		conf->pending_count = 0;
-		spin_unlock_irq(&conf->device_lock);
-		/* flush any pending bitmap writes to
-		 * disk before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
-		wake_up(&conf->wait_barrier);
+	blk_start_plug(&plug);
+
+	while (try_again) {
+		spin_lock_irq(&list->conf->device_lock);
+		bio = bio_list_get(&list->running_bio_list);
+		spin_unlock_irq(&list->conf->device_lock);
+
+		try_again = (bio != NULL);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -710,8 +709,52 @@ static void flush_pending_writes(struct
 			generic_make_request(bio);
 			bio = next;
 		}
-	} else
-		spin_unlock_irq(&conf->device_lock);
+	}
+	blk_finish_plug(&plug);
+}
+
+static void flush_pending_writes(struct r1conf *conf)
+{
+	int c;
+	struct write_list *list;
+
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 */
+	spin_lock_irq(&conf->device_lock);
+
+	for_each_present_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->pending_bio_list)) {
+			bio_list_merge(&list->tmp_bio_list,
+				      &list->pending_bio_list);
+			bio_list_init(&list->pending_bio_list);
+		}
+	}
+
+	conf->pending_count = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
+	bitmap_unplug(conf->mddev->bitmap);
+	wake_up(&conf->wait_barrier);
+
+	spin_lock_irq(&conf->device_lock);
+	for_each_present_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->tmp_bio_list)) {
+			bio_list_merge(&list->running_bio_list,
+				       &list->tmp_bio_list);
+			bio_list_init(&list->tmp_bio_list);
+			if (likely(cpu_online(c)))
+				schedule_work_on(c, &list->work);
+			else
+				schedule_work_on(cpumask_any(cpu_online_mask),
+						 &list->work);
+		}
+	}
+
+	spin_unlock_irq(&conf->device_lock);
 }
 
 /* Barriers....
@@ -1188,8 +1231,10 @@ read_again:
 
 		atomic_inc(&r1_bio->remaining);
 		if (bitmap) {
+			struct write_list *list;
 			spin_lock_irqsave(&conf->device_lock, flags);
-			bio_list_add(&conf->pending_bio_list, mbio);
+			list = this_cpu_ptr(conf->write_list);
+			bio_list_add(&list->pending_bio_list, mbio);
 			conf->pending_count++;
 			spin_unlock_irqrestore(&conf->device_lock, flags);
 		} else
@@ -2573,7 +2618,6 @@ static struct r1conf *setup_conf(struct
 	spin_lock_init(&conf->resync_lock);
 	init_waitqueue_head(&conf->wait_barrier);
 
-	bio_list_init(&conf->pending_bio_list);
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
@@ -2617,6 +2661,19 @@ static struct r1conf *setup_conf(struct
 		goto abort;
 	}
 	err = -ENOMEM;
+
+	conf->write_list = alloc_percpu(struct write_list);
+	if (!conf->write_list)
+		goto abort;
+	for_each_present_cpu(i) {
+		struct write_list *list = per_cpu_ptr(conf->write_list, i);
+		bio_list_init(&list->pending_bio_list);
+		bio_list_init(&list->running_bio_list);
+		bio_list_init(&list->tmp_bio_list);
+		INIT_WORK(&list->work, raid1_write_work);
+		list->conf = conf;
+	}
+
 	conf->thread = md_register_thread(raid1d, mddev, NULL);
 	if (!conf->thread) {
 		printk(KERN_ERR
@@ -2629,6 +2686,7 @@ static struct r1conf *setup_conf(struct
 
  abort:
 	if (conf) {
+		free_percpu(conf->write_list);
 		if (conf->r1bio_pool)
 			mempool_destroy(conf->r1bio_pool);
 		kfree(conf->mirrors);
@@ -2735,6 +2793,8 @@ static int stop(struct mddev *mddev)
 	lower_barrier(conf);
 
 	md_unregister_thread(&mddev->thread);
+	free_percpu(conf->write_list);
+
 	if (conf->r1bio_pool)
 		mempool_destroy(conf->r1bio_pool);
 	kfree(conf->mirrors);



* [patch 3/4] raid10: directly dispatch write request if no bitmap
  2012-05-23  7:26 [patch 0/4] MD: improve raid1/10 write performance for fast storage Shaohua Li
  2012-05-23  7:26 ` [patch 1/4] raid1: directly dispatch write request if no bitmap Shaohua Li
  2012-05-23  7:26 ` [patch 2/4] raid1: percpu dispatch for write request if bitmap supported Shaohua Li
@ 2012-05-23  7:26 ` Shaohua Li
  2012-05-23  7:26 ` [patch 4/4] raid10: percpu dispatch for write request if bitmap supported Shaohua Li
  3 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2012-05-23  7:26 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, shli

In raid10, all write requests are dispatched in the raid10d thread. With fast
storage, the raid10d thread is a bottleneck, because it dispatches requests too
slowly. The raid10d thread also migrates freely, so the request completion cpu
need not match the submission cpu even when the driver/block layer supports
such affinity. This causes bad cache behaviour.

Without a bitmap, there is no point in queueing a bio to a thread and
dispatching it from there. Dispatching the bio directly doesn't affect
correctness and removes the above bottleneck.

Dispatching requests from multiple threads could potentially reduce request
merging and increase lock contention. For slow storage, only request merging is
a real concern. The caller of .make_request should already have the correct
block plug set, which takes care of request merging and locking just as when
accessing a raw device, so we don't need to worry about this too much.

In a 4k randwrite test with a 4-disk setup, the patch below provides 95% ~ 135%
performance improvement depending on numa binding.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid10.c |   22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

Index: linux/drivers/md/raid10.c
===================================================================
--- linux.orig/drivers/md/raid10.c	2012-05-22 19:05:52.495894815 +0800
+++ linux/drivers/md/raid10.c	2012-05-22 19:06:30.955411279 +0800
@@ -1304,10 +1304,13 @@ retry_write:
 		mbio->bi_private = r10_bio;
 
 		atomic_inc(&r10_bio->remaining);
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		if (mddev->bitmap) {
+			spin_lock_irqsave(&conf->device_lock, flags);
+			bio_list_add(&conf->pending_bio_list, mbio);
+			conf->pending_count++;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+		} else
+			generic_make_request(mbio);
 
 		if (!r10_bio->devs[i].repl_bio)
 			continue;
@@ -1329,10 +1332,13 @@ retry_write:
 		mbio->bi_private = r10_bio;
 
 		atomic_inc(&r10_bio->remaining);
-		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
-		spin_unlock_irqrestore(&conf->device_lock, flags);
+		if (mddev->bitmap) {
+			spin_lock_irqsave(&conf->device_lock, flags);
+			bio_list_add(&conf->pending_bio_list, mbio);
+			conf->pending_count++;
+			spin_unlock_irqrestore(&conf->device_lock, flags);
+		} else
+			generic_make_request(mbio);
 	}
 
 	/* Don't remove the bias on 'remaining' (one_write_done) until



* [patch 4/4] raid10: percpu dispatch for write request if bitmap supported
  2012-05-23  7:26 [patch 0/4] MD: improve raid1/10 write performance for fast storage Shaohua Li
                   ` (2 preceding siblings ...)
  2012-05-23  7:26 ` [patch 3/4] raid10: directly dispatch write request if no bitmap Shaohua Li
@ 2012-05-23  7:26 ` Shaohua Li
  3 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2012-05-23  7:26 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe, shli

In raid10, all write requests are dispatched in the raid10d thread. With fast
storage, the raid10d thread is a bottleneck, because it dispatches requests too
slowly. The raid10d thread also migrates freely, so the request completion cpu
need not match the submission cpu even when the driver/block layer supports
such affinity. This causes bad cache behaviour.

If bitmap support is enabled, write requests can only be dispatched after the
dirty bitmap is flushed out. Once the bitmap is flushed, how the write requests
are dispatched doesn't affect correctness, so a natural idea is to distribute
the dispatch across several threads. With this patch, requests are first added
to a percpu list; after the bitmap is flushed, each percpu list is dispatched
from a workqueue. This removes the above bottleneck.

In a 4k randwrite test with a 4-disk setup, the patch below provides about 95%
performance improvement, depending on numa binding.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid10.c |   97 ++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid10.h |   10 ++++-
 2 files changed, 88 insertions(+), 19 deletions(-)

Index: linux/drivers/md/raid10.h
===================================================================
--- linux.orig/drivers/md/raid10.h	2012-05-22 19:19:42.945454678 +0800
+++ linux/drivers/md/raid10.h	2012-05-22 19:21:18.100256510 +0800
@@ -11,6 +11,14 @@ struct mirror_info {
 						 */
 };
 
+struct write_list {
+	struct bio_list	pending_bio_list;
+	struct bio_list	running_bio_list;
+	struct bio_list	tmp_bio_list;
+	struct work_struct work;
+	struct r10conf *conf;
+};
+
 struct r10conf {
 	struct mddev		*mddev;
 	struct mirror_info	*mirrors;
@@ -43,7 +51,7 @@ struct r10conf {
 
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
-	struct bio_list		pending_bio_list;
+	struct write_list __percpu *write_list;
 	int			pending_count;
 
 	spinlock_t		resync_lock;
Index: linux/drivers/md/raid10.c
===================================================================
--- linux.orig/drivers/md/raid10.c	2012-05-22 19:21:17.216269521 +0800
+++ linux/drivers/md/raid10.c	2012-05-22 19:21:46.867897039 +0800
@@ -824,22 +824,21 @@ static int raid10_congested(void *data,
 	return ret;
 }
 
-static void flush_pending_writes(struct r10conf *conf)
+static void raid10_write_work(struct work_struct *work)
 {
-	/* Any writes that have been queued but are awaiting
-	 * bitmap updates get flushed here.
-	 */
-	spin_lock_irq(&conf->device_lock);
+	struct write_list *list = container_of(work, struct write_list, work);
+	struct bio *bio;
+	struct blk_plug plug;
+	bool try_again = true;
 
-	if (conf->pending_bio_list.head) {
-		struct bio *bio;
-		bio = bio_list_get(&conf->pending_bio_list);
-		conf->pending_count = 0;
-		spin_unlock_irq(&conf->device_lock);
-		/* flush any pending bitmap writes to disk
-		 * before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
-		wake_up(&conf->wait_barrier);
+	blk_start_plug(&plug);
+
+	while (try_again) {
+		spin_lock_irq(&list->conf->device_lock);
+		bio = bio_list_get(&list->running_bio_list);
+		spin_unlock_irq(&list->conf->device_lock);
+
+		try_again = (bio != NULL);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -847,8 +846,52 @@ static void flush_pending_writes(struct
 			generic_make_request(bio);
 			bio = next;
 		}
-	} else
-		spin_unlock_irq(&conf->device_lock);
+	}
+	blk_finish_plug(&plug);
+}
+
+static void flush_pending_writes(struct r10conf *conf)
+{
+	int c;
+	struct write_list *list;
+
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 */
+	spin_lock_irq(&conf->device_lock);
+
+	for_each_present_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->pending_bio_list)) {
+			bio_list_merge(&list->tmp_bio_list,
+				      &list->pending_bio_list);
+			bio_list_init(&list->pending_bio_list);
+		}
+	}
+
+	conf->pending_count = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
+	bitmap_unplug(conf->mddev->bitmap);
+	wake_up(&conf->wait_barrier);
+
+	spin_lock_irq(&conf->device_lock);
+	for_each_present_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->tmp_bio_list)) {
+			bio_list_merge(&list->running_bio_list,
+				       &list->tmp_bio_list);
+			bio_list_init(&list->tmp_bio_list);
+			if (likely(cpu_online(c)))
+				schedule_work_on(c, &list->work);
+			else
+				schedule_work_on(cpumask_any(cpu_online_mask),
+						 &list->work);
+		}
+	}
+
+	spin_unlock_irq(&conf->device_lock);
 }
 
 /* Barriers....
@@ -1305,8 +1348,10 @@ retry_write:
 
 		atomic_inc(&r10_bio->remaining);
 		if (mddev->bitmap) {
+			struct write_list *list;
 			spin_lock_irqsave(&conf->device_lock, flags);
-			bio_list_add(&conf->pending_bio_list, mbio);
+			list = this_cpu_ptr(conf->write_list);
+			bio_list_add(&list->pending_bio_list, mbio);
 			conf->pending_count++;
 			spin_unlock_irqrestore(&conf->device_lock, flags);
 		} else
@@ -1333,8 +1378,10 @@ retry_write:
 
 		atomic_inc(&r10_bio->remaining);
 		if (mddev->bitmap) {
+			struct write_list *list;
 			spin_lock_irqsave(&conf->device_lock, flags);
-			bio_list_add(&conf->pending_bio_list, mbio);
+			list = this_cpu_ptr(conf->write_list);
+			bio_list_add(&list->pending_bio_list, mbio);
 			conf->pending_count++;
 			spin_unlock_irqrestore(&conf->device_lock, flags);
 		} else
@@ -3261,6 +3308,18 @@ static struct r10conf *setup_conf(struct
 	spin_lock_init(&conf->resync_lock);
 	init_waitqueue_head(&conf->wait_barrier);
 
+	conf->write_list = alloc_percpu(struct write_list);
+	if (!conf->write_list)
+		goto out;
+	for_each_present_cpu(nc) {
+		struct write_list *list = per_cpu_ptr(conf->write_list, nc);
+		bio_list_init(&list->pending_bio_list);
+		bio_list_init(&list->running_bio_list);
+		bio_list_init(&list->tmp_bio_list);
+		INIT_WORK(&list->work, raid10_write_work);
+		list->conf = conf;
+	}
+
 	conf->thread = md_register_thread(raid10d, mddev, NULL);
 	if (!conf->thread)
 		goto out;
@@ -3272,6 +3331,7 @@ static struct r10conf *setup_conf(struct
 	printk(KERN_ERR "md/raid10:%s: couldn't allocate memory.\n",
 	       mdname(mddev));
 	if (conf) {
+		free_percpu(conf->write_list);
 		if (conf->r10bio_pool)
 			mempool_destroy(conf->r10bio_pool);
 		kfree(conf->mirrors);
@@ -3427,6 +3487,7 @@ static int stop(struct mddev *mddev)
 
 	md_unregister_thread(&mddev->thread);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
+	free_percpu(conf->write_list);
 	if (conf->r10bio_pool)
 		mempool_destroy(conf->r10bio_pool);
 	kfree(conf->mirrors);



* Re: [patch 1/4] raid1: directly dispatch write request if no bitmap
  2012-05-23  7:26 ` [patch 1/4] raid1: directly dispatch write request if no bitmap Shaohua Li
@ 2012-05-24  2:21   ` NeilBrown
  2012-05-24  2:54     ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-05-24  2:21 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, shli

On Wed, 23 May 2012 15:26:20 +0800 Shaohua Li <shli@kernel.org> wrote:

> [...]
> Index: linux/drivers/md/raid1.c
> ===================================================================
> --- linux.orig/drivers/md/raid1.c	2012-05-22 13:50:26.989820654 +0800
> +++ linux/drivers/md/raid1.c	2012-05-22 13:56:46.117054559 +0800
> @@ -1187,10 +1187,13 @@ read_again:
>  		mbio->bi_private = r1_bio;
>  
>  		atomic_inc(&r1_bio->remaining);
> -		spin_lock_irqsave(&conf->device_lock, flags);
> -		bio_list_add(&conf->pending_bio_list, mbio);
> -		conf->pending_count++;
> -		spin_unlock_irqrestore(&conf->device_lock, flags);
> +		if (bitmap) {
> +			spin_lock_irqsave(&conf->device_lock, flags);
> +			bio_list_add(&conf->pending_bio_list, mbio);
> +			conf->pending_count++;
> +			spin_unlock_irqrestore(&conf->device_lock, flags);
> +		} else
> +			generic_make_request(mbio);
>  	}
>  	/* Mustn't call r1_bio_write_done before this next test,
>  	 * as it could result in the bio being freed.

This looks like it should be 'obviously correct' but unfortunately it isn't.

If this raid1 is beneath a dm device (e.g. LVM), then generic_make_request
will queue the request internally and not actually start it.
In particular this means that the cloned bio has no chance of being released
before the next clone_bio call, which is made the next time around the loop.

This can lead to a deadlock as the clone_bio() might be waiting for that
first cloned bio to be released (if memory is really tight).

i.e. when allocating multiple bios from a mempool, we need to arrange for
them to be submitted by a separate thread.
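
The queueing in question is the recursion-avoidance path at the top of
generic_make_request:

	if (current->bio_list) {
		/* nested inside generic_make_request: queue the bio and
		 * let the outermost caller submit it later */
		bio_list_add(current->bio_list, bio);
		return;
	}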

Thanks,
NeilBrown


* Re: [patch 1/4] raid1: directly dispatch write request if no bitmap
  2012-05-24  2:21   ` NeilBrown
@ 2012-05-24  2:54     ` Shaohua Li
  2012-05-24  3:09       ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-05-24  2:54 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, shli

On Thu, May 24, 2012 at 12:21:12PM +1000, NeilBrown wrote:
> On Wed, 23 May 2012 15:26:20 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > [...]
> 
> This looks like it should be 'obviously correct' but unfortunately it isn't.
> 
> If this raid1 is beneath a dm device (e.g. LVM), then generic_make_request
> will queue the request internally and not actually start it.
> In particular this means that the cloned bio has no chance of being released
> before the next clone_bio call, which is made the next time around the loop.
> 
> This can lead to a deadlock as the clone_bio() might be waiting for that
> first cloned bio to be released (if memory is really tight).
> 
> i.e. when allocating multiple bios from a mempool, we need to arrange for
> them to be submitted by a separate thread.

If there isn't a block plug, generic_make_request will dispatch the request
directly. If there is, then when clone_bio() waits for memory it will
sleep/schedule, which triggers a block unplug and dispatches the first bio, so
I see no deadlock. Am I misunderstanding you?

Thanks,
Shaohua


* Re: [patch 1/4] raid1: directly dispatch write request if no bitmap
  2012-05-24  2:54     ` Shaohua Li
@ 2012-05-24  3:09       ` NeilBrown
  2012-05-24  3:31         ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-05-24  3:09 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, shli

On Thu, 24 May 2012 10:54:25 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, May 24, 2012 at 12:21:12PM +1000, NeilBrown wrote:
> > [...]
> 
> If there isn't a block plug, generic_make_request will dispatch the request
> directly.

Not necessarily.  It might queue it:
	if (current->bio_list) {
		bio_list_add(current->bio_list, bio);
		return;
	}

NeilBrown



> If there is, then when clone_bio() waits for memory it will
> sleep/schedule, which triggers a block unplug and dispatches the first bio, so
> I see no deadlock. Am I misunderstanding you?
> 
> Thanks,
> Shaohua



* Re: [patch 1/4] raid1: directly dispatch write request if no bitmap
  2012-05-24  3:09       ` NeilBrown
@ 2012-05-24  3:31         ` Shaohua Li
  2012-05-24  3:51           ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-05-24  3:31 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, shli

On Thu, May 24, 2012 at 01:09:12PM +1000, NeilBrown wrote:
> On Thu, 24 May 2012 10:54:25 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > > [...]
> > 
> > If there isn't a block plug, generic_make_request will dispatch the request
> > directly.
> 
> Not necessarily.  It might queue it:
> 	if (current->bio_list) {
> 		bio_list_add(current->bio_list, bio);
> 		return;
> 	}

Ah, yes. Looks like we can work around it by increasing the bioset entry count
to raid_disks*2. Not much memory would be wasted by that.

Thanks,
Shaohua


* Re: [patch 2/4] raid1: percpu dispatch for write request if bitmap supported
  2012-05-23  7:26 ` [patch 2/4] raid1: percpu dispatch for write request if bitmap supported Shaohua Li
@ 2012-05-24  3:34   ` NeilBrown
  2012-05-24  5:17     ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-05-24  3:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, shli

On Wed, 23 May 2012 15:26:21 +0800 Shaohua Li <shli@kernel.org> wrote:

> [...]
> 
> In a 4k randwrite test with a 2-disk setup, the patch below provides 10% ~ 50%
> performance improvement depending on numa binding.

Those numbers are quite impressive so there is certainly room for improvement
here.  I'm not sure that I'm entirely comfortable with the approach though.

Passing the request to a per-cpu thread does make sense, but a generic
per-cpu thread feels dangerous, as we don't know what else might be queued to
that thread, and because of the potential for deadlocks between memory
allocation and generic_make_request (as mentioned in the previous email) I find
it hard to convince myself that this approach is entirely safe.

I wonder if we might take a very different approach and try to do everything
in the one process.  i.e. don't hand tasks off to other threads at all - at
least in the common case.

So we could change plugger_unplug (in md.c) so that:

 - if current->bio_list is NULL, (meaning all requests have been submitted
   and there is no risk of deadlock) we call bitmap_unplug and then submit
   all the queued writes.
 - if current->bio_list is not NULL, then we just wakeup the md thread to
   do the work.

In the common case bio_list should be NULL (I think) and everything is done
in the one process.  It is only when memory is tight and we reschedule while
waiting for memory that the md thread will be called on and hopefully slower
performance then won't be so noticeable.

We probably want separate queues of pending writes for each CPU (or each
thread) so that each "plugger_unplug" only submits the writes that were
queued by that process - so that if multiple threads are submitting writes at
the same time, they don't end up handling each other's requests.

This might end up flushing the bitmap slightly more often - the current code
allocates one "md_plug_cb" for each different thread that is concurrently
writing to a given md device, and don't trigger the flush until all of them
unplug.  My proposal above would trigger it when any thread triggers the
final unplug.  I wonder if that makes any difference in practice.

Anyway, could you explore the possibility of submitting the requests in
plugger_unplug when current->bio_list is NULL?
Possibly you could:
  get mddev_check_plugged to return the md_plug_cb
  link all the md_plug_cb's for one mddev together
  store the list of queued write requests in that md_plug_cb
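
Putting those pieces together, something like this sketch (untested;
submit_queued_writes() is a hypothetical helper for whatever form the queued
list ends up taking):

static void plugger_unplug(struct blk_plug_cb *cb)
{
	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);

	if (!current->bio_list) {
		/* not nested inside generic_make_request, so no deadlock
		 * risk: flush the bitmap and submit the queued writes */
		bitmap_unplug(mdcb->mddev->bitmap);
		submit_queued_writes(mdcb);
	} else
		/* unsafe to submit here; defer to the md thread */
		md_wakeup_thread(mdcb->mddev->thread);
}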

Make sense?

Thanks,
NeilBrown


* Re: [patch 1/4] raid1: directly dispatch write request if no bitmap
  2012-05-24  3:31         ` Shaohua Li
@ 2012-05-24  3:51           ` NeilBrown
  0 siblings, 0 replies; 14+ messages in thread
From: NeilBrown @ 2012-05-24  3:51 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, shli

On Thu, 24 May 2012 11:31:26 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, May 24, 2012 at 01:09:12PM +1000, NeilBrown wrote:

> > Not necessarily.  It might queue it:
> > 	if (current->bio_list) {
> > 		bio_list_add(current->bio_list, bio);
> > 		return;
> > 	}
> 
> Ah, yes. Looks like we can work around it by increasing the bioset entry count
> to raid_disks*2. Not much memory would be wasted by that.

No, that doesn't work.  There might be raid_disks*2 separate threads which
have each allocated one entry from the pool, and then they all block waiting
for another.

Adding to the pool can change the probability of a deadlock but not the
possibility.

NeilBrown



* Re: [patch 2/4] raid1: percpu dispatch for write request if bitmap supported
  2012-05-24  3:34   ` NeilBrown
@ 2012-05-24  5:17     ` Shaohua Li
  2012-05-24  5:34       ` NeilBrown
  0 siblings, 1 reply; 14+ messages in thread
From: Shaohua Li @ 2012-05-24  5:17 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, shli

On Thu, May 24, 2012 at 01:34:02PM +1000, NeilBrown wrote:
> On Wed, 23 May 2012 15:26:21 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > [...]
> 
> Those numbers are quite impressive so there is certainly room for improvement
> here.  I'm not sure that I'm entirely comfortable with the approach though.
> 
> Passing the request to a per-cpu thread does make sense, but a generic
> per-cpu thread feels dangerous, as we don't know what else might be queued to
> that thread, and because of the potential for deadlocks between memory
> allocation and generic_make_request (as mentioned in the previous email) I find
> it hard to convince myself that this approach is entirely safe.

No problem, we can use a separate workqueue.

> I wonder if we might take a very different approach and try to do everything
> in the one process.  i.e. don't hand tasks off to other threads at all - at
> least in the common case. 
> So we could change plugger_unplug (in md.c) so that:
> 
>  - if current->bio_list is NULL, (meaning all requests have been submitted
>    and there is no risk of deadlock) we call bitmap_unplug and then submit
>    all the queued writes.
>  - if current->bio_list is not NULL, then we just wakeup the md thread to
>    do the work.

The current->bio_list check does make sense. I'm going to add the check to
patches 1 and 3.

But I believe we can't call generic_make_request in unplug. For example,
schedule()->unplug->generic_make_request->get_request_wait()->schedule(). At
least this will cause some requests not to be dispatched. And last time I did
a similar experiment for raid0 (requests added to a per-disk plug_cb list
instead of dispatched directly) to reduce lock contention, I hit nasty oopses.

Thanks,
Shaohua


* Re: [patch 2/4] raid1: percpu dispatch for write request if bitmap supported
  2012-05-24  5:17     ` Shaohua Li
@ 2012-05-24  5:34       ` NeilBrown
  2012-05-24  5:50         ` Shaohua Li
  0 siblings, 1 reply; 14+ messages in thread
From: NeilBrown @ 2012-05-24  5:34 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe, shli

On Thu, 24 May 2012 13:17:09 +0800 Shaohua Li <shli@kernel.org> wrote:

> On Thu, May 24, 2012 at 01:34:02PM +1000, NeilBrown wrote:
> > On Wed, 23 May 2012 15:26:21 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> > > [...]
> > 
> > Those numbers are quite impressive so there is certainly room for improvement
> > here.  I'm not sure that I'm entirely comfortable with the approach though.
> > 
> > Passing the request to a per-cpu thread does make sense, but a generic
> > per-cpu thread feels dangerous, as we don't know what else might be queued to
> > that thread, and because of the potential for deadlocks between memory
> > allocation and generic_make_request (as mentioned in the previous email) I find
> > it hard to convince myself that this approach is entirely safe.
> 
> No problem, we can use a separate workqueue.

A separate set of per-cpu workqueues for each md array doesn't sound like a
good idea to me - if we can possibly avoid it.

> 
> > [...]
> 
> The current->bio_list check does make sense. I'm going to add the check to
> patches 1 and 3.
> 
> But I believe we can't call generic_make_request in unplug. For example,
> schedule()->unplug->generic_make_request->get_request_wait()->schedule(). At

My idea was to only submit the queued writes from a normal call to
blk_finish_plug.  Calls that come from schedule would use the current
approach of signalling the md thread to take over.  There must be a way to
tell the difference - maybe just a new flag to the unplug callback.

NeilBrown


> least this will cause some requests not to be dispatched. And last time I did
> a similar experiment for raid0 (requests added to a per-disk plug_cb list
> instead of dispatched directly) to reduce lock contention, I hit nasty oopses.
> 
> Thanks,
> Shaohua



* Re: [patch 2/4] raid1: percpu dispatch for write request if bitmap supported
  2012-05-24  5:34       ` NeilBrown
@ 2012-05-24  5:50         ` Shaohua Li
  0 siblings, 0 replies; 14+ messages in thread
From: Shaohua Li @ 2012-05-24  5:50 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe, shli

On Thu, May 24, 2012 at 03:34:12PM +1000, NeilBrown wrote:
> On Thu, 24 May 2012 13:17:09 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
> > On Thu, May 24, 2012 at 01:34:02PM +1000, NeilBrown wrote:
> > > [...]
> > 
> > No problem, we can use a separate workqueue.
> 
> A separate set of per-cpu workqueues for each md array doesn't sound like a
> good idea to me - if we can possibly avoid it.

It can be a separate workqueue shared by all md arrays.
  
> > > [...]
> > 
> > The current->bio_list check does make sense. I'm going to add the check to
> > patches 1 and 3.
> > 
> > But I believe we can't call generic_make_request in unplug. For example,
> > schedule()->unplug->generic_make_request->get_request_wait()->schedule(). [...]
> 
> My idea was to only submit the queued writes from a normal call to
> blk_finish_plug.  Calls that come from schedule would use the current
> approach of signalling the md thread to take over.  There must be a way to
> tell the difference - maybe just a new flag to the unplug callback.

It should be the same, I think. For example, in a blk_finish_plug unplug, the
plug_cb generates several bios, but on the first bio get_request_wait sleeps
and flushes the plug list again. The later bios have no chance to actually be
queued until the next time the task runs, so the unplug loses its meaning.

Thanks,
Shaohua


