* [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
@ 2012-06-13  9:11 Shaohua Li
  2012-06-13  9:11 ` [patch 1/3 v3] MD: add a specific workqueue to do dispatch Shaohua Li
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-13  9:11 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

In raid1/10, all write requests are dispatched from a single thread. With fast
storage, that thread becomes a bottleneck, because it dispatches requests too
slowly. The thread also migrates freely, so the CPU that completes a request
often doesn't match the CPU that submitted it, even though the driver/block
layer supports such affinity. This causes poor cache behavior. Neither issue
is a big deal for slow storage.

Switching dispatch to a per-cpu/per-thread basis dramatically increases
performance. The more RAID disks there are, the bigger the boost; in a 4-disk
raid10 setup this can double the throughput.

Per-cpu/per-thread dispatch doesn't harm slow storage. It matches the way a
raw device is accessed, and the correct block plug is in place, which helps
merge requests and reduces lock contention.

V2->V3:
Rebased to the latest tree and fixed a CPU hotplug issue.

V1->V2:
1. Dropped the direct dispatch patches. They gave a bigger performance
improvement, but could not be made correct.
2. Added an MD-specific workqueue to do per-cpu dispatch.


* [patch 1/3 v3] MD: add a specific workqueue to do dispatch
  2012-06-13  9:11 [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage Shaohua Li
@ 2012-06-13  9:11 ` Shaohua Li
  2012-06-13  9:11 ` [patch 2/3 v3] raid1: percpu dispatch for write request if bitmap supported Shaohua Li
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-13  9:11 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

[-- Attachment #1: md-workqueue.patch --]
[-- Type: text/plain, Size: 2510 bytes --]

Add a dedicated workqueue for dispatching requests. Later patches will use it
to do per-cpu request dispatch.
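As a rough illustration, a caller of the new helper is expected to look
something like the following (a hypothetical sketch, not part of this patch;
the dispatch_ctx structure and function names are made up for the example):

#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/workqueue.h>
#include "md.h"

/* hypothetical per-cpu dispatch context, for illustration only */
struct dispatch_ctx {
	struct work_struct work;
	struct bio_list bios;		/* bios queued on this cpu */
};

static void dispatch_fn(struct work_struct *work)
{
	struct dispatch_ctx *ctx =
		container_of(work, struct dispatch_ctx, work);
	/* locking omitted: submit whatever was queued on this cpu */
	struct bio *bio = bio_list_get(&ctx->bios);

	while (bio) {
		struct bio *next = bio->bi_next;

		bio->bi_next = NULL;
		generic_make_request(bio);
		bio = next;
	}
}

static void dispatch_ctx_init(struct dispatch_ctx *ctx)
{
	bio_list_init(&ctx->bios);
	INIT_WORK(&ctx->work, dispatch_fn);
}

static void dispatch_ctx_kick(struct dispatch_ctx *ctx, int cpu)
{
	/* run the dispatch work on the cpu that queued the requests */
	md_schedule_work_on(cpu, &ctx->work);
}

The point of md_schedule_work_on() over a plain schedule_work() is that the
caller can pin the work to the submitting CPU while still using an MD-owned
workqueue; as the hunk below shows, md_run_wq is allocated with
WQ_MEM_RECLAIM, presumably because the dispatch sits in the writeback path and
needs guaranteed forward progress.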

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/md.c |   14 ++++++++++++++
 drivers/md/md.h |    1 +
 2 files changed, 15 insertions(+)

Index: linux/drivers/md/md.c
===================================================================
--- linux.orig/drivers/md/md.c	2012-05-24 15:42:56.892251685 +0800
+++ linux/drivers/md/md.c	2012-05-24 15:44:55.618759412 +0800
@@ -71,6 +71,7 @@ static void md_print_devices(void);
 static DECLARE_WAIT_QUEUE_HEAD(resync_wait);
 static struct workqueue_struct *md_wq;
 static struct workqueue_struct *md_misc_wq;
+static struct workqueue_struct *md_run_wq;
 
 #define MD_BUG(x...) { printk("md: bug in file %s, line %d\n", __FILE__, __LINE__); md_print_devices(); }
 
@@ -8446,6 +8447,12 @@ static void md_geninit(void)
 	proc_create("mdstat", S_IRUGO, NULL, &md_seq_fops);
 }
 
+int md_schedule_work_on(int cpu, struct work_struct *work)
+{
+	return queue_work_on(cpu, md_run_wq, work);
+}
+EXPORT_SYMBOL(md_schedule_work_on);
+
 static int __init md_init(void)
 {
 	int ret = -ENOMEM;
@@ -8458,6 +8465,10 @@ static int __init md_init(void)
 	if (!md_misc_wq)
 		goto err_misc_wq;
 
+	md_run_wq = alloc_workqueue("md_run", WQ_MEM_RECLAIM, 0);
+	if (!md_run_wq)
+		goto err_run_wq;
+
 	if ((ret = register_blkdev(MD_MAJOR, "md")) < 0)
 		goto err_md;
 
@@ -8479,6 +8490,8 @@ static int __init md_init(void)
 err_mdp:
 	unregister_blkdev(MD_MAJOR, "md");
 err_md:
+	destroy_workqueue(md_run_wq);
+err_run_wq:
 	destroy_workqueue(md_misc_wq);
 err_misc_wq:
 	destroy_workqueue(md_wq);
@@ -8571,6 +8584,7 @@ static __exit void md_exit(void)
 		export_array(mddev);
 		mddev->hold_active = 0;
 	}
+	destroy_workqueue(md_run_wq);
 	destroy_workqueue(md_misc_wq);
 	destroy_workqueue(md_wq);
 }
Index: linux/drivers/md/md.h
===================================================================
--- linux.orig/drivers/md/md.h	2012-05-24 15:42:56.896251635 +0800
+++ linux/drivers/md/md.h	2012-05-24 15:44:55.622759274 +0800
@@ -616,6 +616,7 @@ extern int md_integrity_register(struct
 extern void md_integrity_add_rdev(struct md_rdev *rdev, struct mddev *mddev);
 extern int strict_strtoul_scaled(const char *cp, unsigned long *res, int scale);
 extern void restore_bitmap_write_access(struct file *file);
+extern int md_schedule_work_on(int cpu, struct work_struct *work);
 
 extern void mddev_init(struct mddev *mddev);
 extern int md_run(struct mddev *mddev);



* [patch 2/3 v3] raid1: percpu dispatch for write request if bitmap supported
  2012-06-13  9:11 [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage Shaohua Li
  2012-06-13  9:11 ` [patch 1/3 v3] MD: add a specific workqueue to do dispatch Shaohua Li
@ 2012-06-13  9:11 ` Shaohua Li
  2012-06-13  9:11 ` [patch 3/3 v3] raid10: " Shaohua Li
  2012-06-28  9:03 ` [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage NeilBrown
  3 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-13  9:11 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

[-- Attachment #1: raid1-write-percpulist.patch --]
[-- Type: text/plain, Size: 6587 bytes --]

In raid1, all write requests are dispatched from the raid1d thread. With fast
storage, the raid1d thread becomes a bottleneck, because it dispatches
requests too slowly. The raid1d thread also migrates freely, so the CPU that
completes a request often doesn't match the CPU that submitted it, even though
the driver/block layer supports such affinity. This causes poor cache behavior.

If bitmap support is enabled, write requests can only be dispatched after the
dirty bitmap has been flushed out. Once the bitmap is flushed, how the write
requests are dispatched doesn't affect correctness, so a natural idea is to
distribute dispatch across several threads. With this patch, requests are
first added to a per-cpu list. After the bitmap is flushed, the per-cpu lists
are dispatched from a workqueue. This removes the bottleneck described above.

In a 4k random-write test on a 2-disk setup, this patch provides a 10% ~ 50%
performance improvement depending on NUMA binding.
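Condensed, the flush path below does roughly the following (an illustrative
sketch of the patch, meant to be read against the raid1.c changes below, not a
replacement for them; empty-list checks, pending_count accounting, the
wait_barrier wakeup and the CPU-hotplug fallback are omitted):

static void flush_pending_writes_sketch(struct r1conf *conf)
{
	int cpu;

	/* 1) under the lock, snapshot each cpu's pending list into tmp */
	spin_lock_irq(&conf->device_lock);
	for_each_possible_cpu(cpu) {
		struct write_list *wl = per_cpu_ptr(conf->write_list, cpu);

		bio_list_merge(&wl->tmp_bio_list, &wl->pending_bio_list);
		bio_list_init(&wl->pending_bio_list);
	}
	spin_unlock_irq(&conf->device_lock);

	/* 2) flush the dirty bitmap once, outside the lock; writes queued
	 *    after this point stay in pending_bio_list for the next round,
	 *    so they are never issued before their bitmap bits hit disk */
	bitmap_unplug(conf->mddev->bitmap);

	/* 3) hand the snapshot to the per-cpu works, which submit
	 *    running_bio_list under a block plug */
	spin_lock_irq(&conf->device_lock);
	for_each_possible_cpu(cpu) {
		struct write_list *wl = per_cpu_ptr(conf->write_list, cpu);

		bio_list_merge(&wl->running_bio_list, &wl->tmp_bio_list);
		bio_list_init(&wl->tmp_bio_list);
		md_schedule_work_on(cpu, &wl->work);
	}
	spin_unlock_irq(&conf->device_lock);
}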

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid1.c |   97 +++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid1.h |   10 ++++-
 2 files changed, 88 insertions(+), 19 deletions(-)

Index: linux/drivers/md/raid1.h
===================================================================
--- linux.orig/drivers/md/raid1.h	2012-06-08 09:35:53.268593019 +0800
+++ linux/drivers/md/raid1.h	2012-06-13 16:15:31.172319440 +0800
@@ -22,6 +22,14 @@ struct pool_info {
 	int	raid_disks;
 };
 
+struct write_list {
+	struct bio_list	pending_bio_list;
+	struct bio_list	running_bio_list;
+	struct bio_list	tmp_bio_list;
+	struct work_struct work;
+	struct r1conf *conf;
+};
+
 struct r1conf {
 	struct mddev		*mddev;
 	struct mirror_info	*mirrors;	/* twice 'raid_disks' to
@@ -50,7 +58,7 @@ struct r1conf {
 	struct list_head	retry_list;
 
 	/* queue pending writes to be submitted on unplug */
-	struct bio_list		pending_bio_list;
+	struct write_list __percpu *write_list;
 	int			pending_count;
 
 	/* for use when syncing mirrors:
Index: linux/drivers/md/raid1.c
===================================================================
--- linux.orig/drivers/md/raid1.c	2012-06-13 15:51:49.678190041 +0800
+++ linux/drivers/md/raid1.c	2012-06-13 16:15:31.172319440 +0800
@@ -687,22 +687,21 @@ static int raid1_congested(void *data, i
 		md_raid1_congested(mddev, bits);
 }
 
-static void flush_pending_writes(struct r1conf *conf)
+static void raid1_write_work(struct work_struct *work)
 {
-	/* Any writes that have been queued but are awaiting
-	 * bitmap updates get flushed here.
-	 */
-	spin_lock_irq(&conf->device_lock);
+	struct write_list *list = container_of(work, struct write_list, work);
+	struct bio *bio;
+	struct blk_plug plug;
+	bool try_again = true;
 
-	if (conf->pending_bio_list.head) {
-		struct bio *bio;
-		bio = bio_list_get(&conf->pending_bio_list);
-		conf->pending_count = 0;
-		spin_unlock_irq(&conf->device_lock);
-		/* flush any pending bitmap writes to
-		 * disk before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
-		wake_up(&conf->wait_barrier);
+	blk_start_plug(&plug);
+
+	while (try_again) {
+		spin_lock_irq(&list->conf->device_lock);
+		bio = bio_list_get(&list->running_bio_list);
+		spin_unlock_irq(&list->conf->device_lock);
+
+		try_again = (bio != NULL);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -710,8 +709,53 @@ static void flush_pending_writes(struct
 			generic_make_request(bio);
 			bio = next;
 		}
-	} else
-		spin_unlock_irq(&conf->device_lock);
+	}
+	blk_finish_plug(&plug);
+}
+
+static void flush_pending_writes(struct r1conf *conf)
+{
+	int c;
+	struct write_list *list;
+
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 */
+	spin_lock_irq(&conf->device_lock);
+
+	for_each_possible_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->pending_bio_list)) {
+			bio_list_merge(&list->tmp_bio_list,
+				      &list->pending_bio_list);
+			bio_list_init(&list->pending_bio_list);
+		}
+	}
+
+	conf->pending_count = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
+	bitmap_unplug(conf->mddev->bitmap);
+	wake_up(&conf->wait_barrier);
+
+	spin_lock_irq(&conf->device_lock);
+	for_each_possible_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->tmp_bio_list)) {
+			bio_list_merge(&list->running_bio_list,
+				       &list->tmp_bio_list);
+			bio_list_init(&list->tmp_bio_list);
+			if (likely(cpu_online(c)))
+				md_schedule_work_on(c, &list->work);
+			else {
+				int cpu = cpumask_any(cpu_online_mask);
+				md_schedule_work_on(cpu, &list->work);
+			}
+		}
+	}
+
+	spin_unlock_irq(&conf->device_lock);
 }
 
 /* Barriers....
@@ -1137,6 +1181,7 @@ read_again:
 	first_clone = 1;
 	for (i = 0; i < disks; i++) {
 		struct bio *mbio;
+		struct write_list *list;
 		if (!r1_bio->bios[i])
 			continue;
 
@@ -1188,7 +1233,8 @@ read_again:
 
 		atomic_inc(&r1_bio->remaining);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
+		list = this_cpu_ptr(conf->write_list);
+		bio_list_add(&list->pending_bio_list, mbio);
 		conf->pending_count++;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
@@ -2576,7 +2622,6 @@ static struct r1conf *setup_conf(struct
 	spin_lock_init(&conf->resync_lock);
 	init_waitqueue_head(&conf->wait_barrier);
 
-	bio_list_init(&conf->pending_bio_list);
 	conf->pending_count = 0;
 	conf->recovery_disabled = mddev->recovery_disabled - 1;
 
@@ -2621,6 +2666,19 @@ static struct r1conf *setup_conf(struct
 		goto abort;
 	}
 	err = -ENOMEM;
+
+	conf->write_list = alloc_percpu(struct write_list);
+	if (!conf->write_list)
+		goto abort;
+	for_each_possible_cpu(i) {
+		struct write_list *list = per_cpu_ptr(conf->write_list, i);
+		bio_list_init(&list->pending_bio_list);
+		bio_list_init(&list->running_bio_list);
+		bio_list_init(&list->tmp_bio_list);
+		INIT_WORK(&list->work, raid1_write_work);
+		list->conf = conf;
+	}
+
 	conf->thread = md_register_thread(raid1d, mddev, NULL);
 	if (!conf->thread) {
 		printk(KERN_ERR
@@ -2633,6 +2691,7 @@ static struct r1conf *setup_conf(struct
 
  abort:
 	if (conf) {
+		free_percpu(conf->write_list);
 		if (conf->r1bio_pool)
 			mempool_destroy(conf->r1bio_pool);
 		kfree(conf->mirrors);
@@ -2739,6 +2798,8 @@ static int stop(struct mddev *mddev)
 	lower_barrier(conf);
 
 	md_unregister_thread(&mddev->thread);
+	free_percpu(conf->write_list);
+
 	if (conf->r1bio_pool)
 		mempool_destroy(conf->r1bio_pool);
 	kfree(conf->mirrors);



* [patch 3/3 v3] raid10: percpu dispatch for write request if bitmap supported
  2012-06-13  9:11 [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage Shaohua Li
  2012-06-13  9:11 ` [patch 1/3 v3] MD: add a specific workqueue to do dispatch Shaohua Li
  2012-06-13  9:11 ` [patch 2/3 v3] raid1: percpu dispatch for write request if bitmap supported Shaohua Li
@ 2012-06-13  9:11 ` Shaohua Li
  2012-06-28  9:03 ` [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage NeilBrown
  3 siblings, 0 replies; 13+ messages in thread
From: Shaohua Li @ 2012-06-13  9:11 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, axboe

[-- Attachment #1: raid10-write-percpulist.patch --]
[-- Type: text/plain, Size: 6989 bytes --]

In raid10, all write requests are dispatched from the raid10d thread. With
fast storage, the raid10d thread becomes a bottleneck, because it dispatches
requests too slowly. The raid10d thread also migrates freely, so the CPU that
completes a request often doesn't match the CPU that submitted it, even though
the driver/block layer supports such affinity. This causes poor cache behavior.

If bitmap support is enabled, write requests can only be dispatched after the
dirty bitmap has been flushed out. Once the bitmap is flushed, how the write
requests are dispatched doesn't affect correctness, so a natural idea is to
distribute dispatch across several threads. With this patch, requests are
first added to a per-cpu list. After the bitmap is flushed, the per-cpu lists
are dispatched from a workqueue. This removes the bottleneck described above.

In a 4k random-write test on a 4-disk setup, this patch provides about a 95%
performance improvement depending on NUMA binding.

Signed-off-by: Shaohua Li <shli@fusionio.com>
---
 drivers/md/raid10.c |   98 ++++++++++++++++++++++++++++++++++++++++++----------
 drivers/md/raid10.h |   10 ++++-
 2 files changed, 89 insertions(+), 19 deletions(-)

Index: linux/drivers/md/raid10.h
===================================================================
--- linux.orig/drivers/md/raid10.h	2012-06-08 09:35:53.232593471 +0800
+++ linux/drivers/md/raid10.h	2012-06-13 16:15:38.972223880 +0800
@@ -11,6 +11,14 @@ struct mirror_info {
 						 */
 };
 
+struct write_list {
+	struct bio_list	pending_bio_list;
+	struct bio_list	running_bio_list;
+	struct bio_list	tmp_bio_list;
+	struct work_struct work;
+	struct r10conf *conf;
+};
+
 struct r10conf {
 	struct mddev		*mddev;
 	struct mirror_info	*mirrors;
@@ -49,7 +57,7 @@ struct r10conf {
 
 	struct list_head	retry_list;
 	/* queue pending writes and submit them on unplug */
-	struct bio_list		pending_bio_list;
+	struct write_list __percpu *write_list;
 	int			pending_count;
 
 	spinlock_t		resync_lock;
Index: linux/drivers/md/raid10.c
===================================================================
--- linux.orig/drivers/md/raid10.c	2012-06-13 15:51:49.666190191 +0800
+++ linux/drivers/md/raid10.c	2012-06-13 16:15:38.972223880 +0800
@@ -867,22 +867,21 @@ static int raid10_congested(void *data,
 	return ret;
 }
 
-static void flush_pending_writes(struct r10conf *conf)
+static void raid10_write_work(struct work_struct *work)
 {
-	/* Any writes that have been queued but are awaiting
-	 * bitmap updates get flushed here.
-	 */
-	spin_lock_irq(&conf->device_lock);
+	struct write_list *list = container_of(work, struct write_list, work);
+	struct bio *bio;
+	struct blk_plug plug;
+	bool try_again = true;
 
-	if (conf->pending_bio_list.head) {
-		struct bio *bio;
-		bio = bio_list_get(&conf->pending_bio_list);
-		conf->pending_count = 0;
-		spin_unlock_irq(&conf->device_lock);
-		/* flush any pending bitmap writes to disk
-		 * before proceeding w/ I/O */
-		bitmap_unplug(conf->mddev->bitmap);
-		wake_up(&conf->wait_barrier);
+	blk_start_plug(&plug);
+
+	while (try_again) {
+		spin_lock_irq(&list->conf->device_lock);
+		bio = bio_list_get(&list->running_bio_list);
+		spin_unlock_irq(&list->conf->device_lock);
+
+		try_again = (bio != NULL);
 
 		while (bio) { /* submit pending writes */
 			struct bio *next = bio->bi_next;
@@ -890,8 +889,53 @@ static void flush_pending_writes(struct
 			generic_make_request(bio);
 			bio = next;
 		}
-	} else
-		spin_unlock_irq(&conf->device_lock);
+	}
+	blk_finish_plug(&plug);
+}
+
+static void flush_pending_writes(struct r10conf *conf)
+{
+	int c;
+	struct write_list *list;
+
+	/* Any writes that have been queued but are awaiting
+	 * bitmap updates get flushed here.
+	 */
+	spin_lock_irq(&conf->device_lock);
+
+	for_each_possible_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->pending_bio_list)) {
+			bio_list_merge(&list->tmp_bio_list,
+				      &list->pending_bio_list);
+			bio_list_init(&list->pending_bio_list);
+		}
+	}
+
+	conf->pending_count = 0;
+	spin_unlock_irq(&conf->device_lock);
+
+	/* flush any pending bitmap writes to disk before proceeding w/ I/O */
+	bitmap_unplug(conf->mddev->bitmap);
+	wake_up(&conf->wait_barrier);
+
+	spin_lock_irq(&conf->device_lock);
+	for_each_possible_cpu(c) {
+		list = per_cpu_ptr(conf->write_list, c);
+		if (!bio_list_empty(&list->tmp_bio_list)) {
+			bio_list_merge(&list->running_bio_list,
+				       &list->tmp_bio_list);
+			bio_list_init(&list->tmp_bio_list);
+			if (likely(cpu_online(c)))
+				md_schedule_work_on(c, &list->work);
+			else {
+				int cpu = cpumask_any(cpu_online_mask);
+				md_schedule_work_on(cpu, &list->work);
+			}
+		}
+	}
+
+	spin_unlock_irq(&conf->device_lock);
 }
 
 /* Barriers....
@@ -1374,6 +1418,7 @@ retry_write:
 
 	for (i = 0; i < conf->copies; i++) {
 		struct bio *mbio;
+		struct write_list *list;
 		int d = r10_bio->devs[i].devnum;
 		if (!r10_bio->devs[i].bio)
 			continue;
@@ -1393,7 +1438,8 @@ retry_write:
 
 		atomic_inc(&r10_bio->remaining);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
+		list = this_cpu_ptr(conf->write_list);
+		bio_list_add(&list->pending_bio_list, mbio);
 		conf->pending_count++;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 
@@ -1420,7 +1466,8 @@ retry_write:
 
 		atomic_inc(&r10_bio->remaining);
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
+		list = this_cpu_ptr(conf->write_list);
+		bio_list_add(&list->pending_bio_list, mbio);
 		conf->pending_count++;
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
@@ -3360,6 +3407,7 @@ static struct r10conf *setup_conf(struct
 	int err = -EINVAL;
 	struct geom geo;
 	int copies;
+	int cpu;
 
 	copies = setup_geo(&geo, mddev, geo_new);
 
@@ -3421,6 +3469,18 @@ static struct r10conf *setup_conf(struct
 	spin_lock_init(&conf->resync_lock);
 	init_waitqueue_head(&conf->wait_barrier);
 
+	conf->write_list = alloc_percpu(struct write_list);
+	if (!conf->write_list)
+		goto out;
+	for_each_possible_cpu(cpu) {
+		struct write_list *list = per_cpu_ptr(conf->write_list, cpu);
+		bio_list_init(&list->pending_bio_list);
+		bio_list_init(&list->running_bio_list);
+		bio_list_init(&list->tmp_bio_list);
+		INIT_WORK(&list->work, raid10_write_work);
+		list->conf = conf;
+	}
+
 	conf->thread = md_register_thread(raid10d, mddev, NULL);
 	if (!conf->thread)
 		goto out;
@@ -3433,6 +3493,7 @@ static struct r10conf *setup_conf(struct
 		printk(KERN_ERR "md/raid10:%s: couldn't allocate memory.\n",
 		       mdname(mddev));
 	if (conf) {
+		free_percpu(conf->write_list);
 		if (conf->r10bio_pool)
 			mempool_destroy(conf->r10bio_pool);
 		kfree(conf->mirrors);
@@ -3636,6 +3697,7 @@ static int stop(struct mddev *mddev)
 
 	md_unregister_thread(&mddev->thread);
 	blk_sync_queue(mddev->queue); /* the unplug fn references 'conf'*/
+	free_percpu(conf->write_list);
 	if (conf->r10bio_pool)
 		mempool_destroy(conf->r10bio_pool);
 	kfree(conf->mirrors);



* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-13  9:11 [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage Shaohua Li
                   ` (2 preceding siblings ...)
  2012-06-13  9:11 ` [patch 3/3 v3] raid10: " Shaohua Li
@ 2012-06-28  9:03 ` NeilBrown
  2012-06-29  1:29   ` Stan Hoeppner
  2012-06-29  6:10   ` Shaohua Li
  3 siblings, 2 replies; 13+ messages in thread
From: NeilBrown @ 2012-06-28  9:03 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe

[-- Attachment #1: Type: text/plain, Size: 11228 bytes --]

On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:

> In raid1/10, all write requests are dispatched in a single thread. In fast
> storage, the thread is a bottleneck, because it dispatches request too slow.
> Also the thread migrates freely, which makes request completion cpu not match
> with submission cpu even driver/block layer has such capability. This will
> cause bad cache issue. Both these are not a big deal for slow storage.
> 
> Switching the dispatching to percpu/perthread based dramatically increases
> performance.  The more raid disk number is, the more performance boosts. In a
> 4-disk raid10 setup, this can double the throughput.
> 
> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
> raw device is accessed, and there is correct block plug set which can help do
> request merge and reduce lock contention.
> 
> V2->V3:
> rebase to latest tree and fix cpuhotplug issue
> 
> V1->V2:
> 1. droped direct dispatch patches. That has better performance imporvement, but
> is hopelessly made correct.
> 2. Add a MD specific workqueue to do percpu dispatch.


Hi.

I still don't like the per-cpu allocations and the extra work queues.

The following patch demonstrates how I would like to address this issue.  It
should submit requests from the same thread that initially made the request -
at least in most cases.

It leverages the plugging code and pushes everything out on unplug, unless
the unplug comes from a scheduler call (which should be uncommon).  In that
case it falls back to passing all the requests to the md thread.
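In condensed form, the callback at the heart of the patch below does roughly
this (an illustrative reading of raid1_unplug() from the diff further down,
with the plug_cnt and pending_count bookkeeping stripped out):

static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
{
	struct raid1_plug_cb *plug = container_of(cb, struct raid1_plug_cb,
						  cb.cb);
	struct mddev *mddev = plug->cb.mddev;
	struct r1conf *conf = mddev->private;
	struct bio *bio;

	if (from_schedule) {
		/* called from the scheduler: no I/O allowed here, so hand
		 * the queued bios back to the md thread as before */
		spin_lock_irq(&conf->device_lock);
		bio_list_merge(&conf->pending_bio_list, &plug->pending);
		spin_unlock_irq(&conf->device_lock);
		md_wakeup_thread(mddev->thread);
		kfree(plug);
		return;
	}

	/* normal unplug: flush the bitmap and submit the writes directly
	 * from the thread that originally made the requests */
	bio = bio_list_get(&plug->pending);
	bitmap_unplug(mddev->bitmap);
	wake_up(&conf->wait_barrier);
	while (bio) {
		struct bio *next = bio->bi_next;

		bio->bi_next = NULL;
		generic_make_request(bio);
		bio = next;
	}
	kfree(plug);
}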

Obviously if we proceed with this I'll split this up into neat reviewable
patches.  However before that it would help to know if it really helps as I
think it should.

So would you be able to test it on your SSD hardware and see how it compares
to the current code, and to your code?  Thanks.

I have only tested it lightly myself so there could still be bugs, but
hopefully not obvious ones.

A simple "time mkfs" test on very modest hardware show as 25% reduction in
total time (168s -> 127s).  I guess that's a 33% increase in speed?
However sequential writes with 'dd' seem a little slower (14MB/s -> 13.6MB/s)

There are some hacks in there that need to be cleaned up, but I think the
general structure looks good.

Thanks,
NeilBrown

diff --git a/block/blk-core.c b/block/blk-core.c
index 3c923a7..17e2b4e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2883,7 +2883,7 @@ static void queue_unplugged(struct request_queue *q, unsigned int depth,
 
 }
 
-static void flush_plug_callbacks(struct blk_plug *plug)
+static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule)
 {
 	LIST_HEAD(callbacks);
 
@@ -2897,7 +2897,7 @@ static void flush_plug_callbacks(struct blk_plug *plug)
 							  struct blk_plug_cb,
 							  list);
 		list_del(&cb->list);
-		cb->callback(cb);
+		cb->callback(cb, from_schedule);
 	}
 }
 
@@ -2911,7 +2911,7 @@ void blk_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 
 	BUG_ON(plug->magic != PLUG_MAGIC);
 
-	flush_plug_callbacks(plug);
+	flush_plug_callbacks(plug, from_schedule);
 	if (list_empty(&plug->list))
 		return;
 
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 15dbe03..94e7f6b 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -1305,7 +1305,7 @@ int bitmap_startwrite(struct bitmap *bitmap, sector_t offset, unsigned long sect
 			prepare_to_wait(&bitmap->overflow_wait, &__wait,
 					TASK_UNINTERRUPTIBLE);
 			spin_unlock_irq(&bitmap->counts.lock);
-			io_schedule();
+			schedule();
 			finish_wait(&bitmap->overflow_wait, &__wait);
 			continue;
 		}
diff --git a/drivers/md/md.c b/drivers/md/md.c
index c601c4b..bc31165 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -506,12 +506,7 @@ EXPORT_SYMBOL(md_flush_request);
  * personality data where we keep a count of the number of outstanding
  * plugs so other code can see if a plug is active.
  */
-struct md_plug_cb {
-	struct blk_plug_cb cb;
-	struct mddev *mddev;
-};
-
-static void plugger_unplug(struct blk_plug_cb *cb)
+static void plugger_unplug(struct blk_plug_cb *cb, bool from_schedule)
 {
 	struct md_plug_cb *mdcb = container_of(cb, struct md_plug_cb, cb);
 	if (atomic_dec_and_test(&mdcb->mddev->plug_cnt))
@@ -522,35 +517,41 @@ static void plugger_unplug(struct blk_plug_cb *cb)
 /* Check that an unplug wakeup will come shortly.
  * If not, wakeup the md thread immediately
  */
-int mddev_check_plugged(struct mddev *mddev)
+struct md_plug_cb *mddev_check_plugged(struct mddev *mddev, size_t size,
+				      void (*callback)(struct blk_plug_cb *,bool))
 {
 	struct blk_plug *plug = current->plug;
 	struct md_plug_cb *mdcb;
 
 	if (!plug)
-		return 0;
+		return NULL;
+
+	if (callback == NULL)
+		callback = plugger_unplug;
+	if (size == 0)
+		size = sizeof(*mdcb);
 
 	list_for_each_entry(mdcb, &plug->cb_list, cb.list) {
-		if (mdcb->cb.callback == plugger_unplug &&
+		if (mdcb->cb.callback == callback &&
 		    mdcb->mddev == mddev) {
 			/* Already on the list, move to top */
 			if (mdcb != list_first_entry(&plug->cb_list,
 						    struct md_plug_cb,
 						    cb.list))
 				list_move(&mdcb->cb.list, &plug->cb_list);
-			return 1;
+			return mdcb;
 		}
 	}
 	/* Not currently on the callback list */
-	mdcb = kmalloc(sizeof(*mdcb), GFP_ATOMIC);
+	mdcb = kzalloc(size, GFP_ATOMIC);
 	if (!mdcb)
-		return 0;
+		return NULL;
 
 	mdcb->mddev = mddev;
-	mdcb->cb.callback = plugger_unplug;
+	mdcb->cb.callback = callback;
 	atomic_inc(&mddev->plug_cnt);
 	list_add(&mdcb->cb.list, &plug->cb_list);
-	return 1;
+	return mdcb;
 }
 EXPORT_SYMBOL_GPL(mddev_check_plugged);
 
diff --git a/drivers/md/md.h b/drivers/md/md.h
index 7b4a3c3..c59fd53 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -630,6 +630,13 @@ extern struct bio *bio_clone_mddev(struct bio *bio, gfp_t gfp_mask,
 				   struct mddev *mddev);
 extern struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
 				   struct mddev *mddev);
-extern int mddev_check_plugged(struct mddev *mddev);
+struct md_plug_cb {
+	struct blk_plug_cb cb;
+	struct mddev *mddev;
+};
+
+extern struct md_plug_cb *mddev_check_plugged(
+	struct mddev *mddev, size_t size,
+	void (*callback)(struct blk_plug_cb *,bool));
 extern void md_trim_bio(struct bio *bio, int offset, int size);
 #endif /* _MD_MD_H */
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 39b2a8a..c8c5d62 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -870,6 +870,47 @@ do_sync_io:
 	pr_debug("%dB behind alloc failed, doing sync I/O\n", bio->bi_size);
 }
 
+struct raid1_plug_cb {
+	struct md_plug_cb	cb;
+	struct bio_list		pending;
+	int			pending_cnt;
+};
+
+static void raid1_unplug(struct blk_plug_cb *cb, bool from_schedule)
+{
+	struct raid1_plug_cb *plug = container_of(cb, struct raid1_plug_cb,
+						  cb.cb);
+	struct mddev *mddev = plug->cb.mddev;
+	struct r1conf *conf = mddev->private;
+	struct bio *bio;
+
+	spin_lock_irq(&conf->device_lock);
+	if (from_schedule) {
+		bio_list_merge(&conf->pending_bio_list, &plug->pending);
+		conf->pending_count += plug->pending_cnt;
+	}
+	if (atomic_dec_and_test(&mddev->plug_cnt))
+		md_wakeup_thread(mddev->thread);
+	spin_unlock_irq(&conf->device_lock);
+	if (from_schedule) {
+		kfree(plug);
+		return;
+	}
+
+	/* we aren't scheduling, so we can do the write-out directly. */
+	bio = bio_list_get(&plug->pending);
+	bitmap_unplug(mddev->bitmap);
+	wake_up(&conf->wait_barrier);
+
+	while (bio) { /* submit pending writes */
+		struct bio *next = bio->bi_next;
+		bio->bi_next = NULL;
+		generic_make_request(bio);
+		bio = next;
+	}
+	kfree(plug);
+}
+
 static void make_request(struct mddev *mddev, struct bio * bio)
 {
 	struct r1conf *conf = mddev->private;
@@ -883,7 +924,8 @@ static void make_request(struct mddev *mddev, struct bio * bio)
 	const unsigned long do_sync = (bio->bi_rw & REQ_SYNC);
 	const unsigned long do_flush_fua = (bio->bi_rw & (REQ_FLUSH | REQ_FUA));
 	struct md_rdev *blocked_rdev;
-	int plugged;
+	struct md_plug_cb *cb;
+	struct raid1_plug_cb *plug = NULL;
 	int first_clone;
 	int sectors_handled;
 	int max_sectors;
@@ -1034,7 +1076,6 @@ read_again:
 	 * the bad blocks.  Each set of writes gets it's own r1bio
 	 * with a set of bios attached.
 	 */
-	plugged = mddev_check_plugged(mddev);
 
 	disks = conf->raid_disks * 2;
  retry_write:
@@ -1187,9 +1228,20 @@ read_again:
 		mbio->bi_private = r1_bio;
 
 		atomic_inc(&r1_bio->remaining);
+
+		cb = mddev_check_plugged(mddev, sizeof(*plug), raid1_unplug);
+		if (cb)
+			plug = container_of(cb, struct raid1_plug_cb, cb);
+		else
+			plug = NULL;
 		spin_lock_irqsave(&conf->device_lock, flags);
-		bio_list_add(&conf->pending_bio_list, mbio);
-		conf->pending_count++;
+		if (plug) {
+			bio_list_add(&plug->pending, mbio);
+			plug->pending_cnt++;
+		} else {
+			bio_list_add(&conf->pending_bio_list, mbio);
+			conf->pending_count++;
+		}
 		spin_unlock_irqrestore(&conf->device_lock, flags);
 	}
 	/* Mustn't call r1_bio_write_done before this next test,
@@ -1214,7 +1266,7 @@ read_again:
 	/* In case raid1d snuck in to freeze_array */
 	wake_up(&conf->wait_barrier);
 
-	if (do_sync || !bitmap || !plugged)
+	if (do_sync || !bitmap || !plug)
 		md_wakeup_thread(mddev->thread);
 }
 
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index 0ee7602..ada2429 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -1239,7 +1239,7 @@ read_again:
 	 * of r10_bios is recored in bio->bi_phys_segments just as with
 	 * the read case.
 	 */
-	plugged = mddev_check_plugged(mddev);
+	plugged = mddev_check_plugged(mddev, 0, NULL);
 
 	r10_bio->read_slot = -1; /* make sure repl_bio gets freed */
 	raid10_find_phys(conf, r10_bio);
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 51169ec..5cb037b 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -3995,7 +3995,7 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 	struct stripe_head *sh;
 	const int rw = bio_data_dir(bi);
 	int remaining;
-	int plugged;
+	struct md_plug_cb *plugged;
 
 	if (unlikely(bi->bi_rw & REQ_FLUSH)) {
 		md_flush_request(mddev, bi);
@@ -4014,7 +4014,7 @@ static void make_request(struct mddev *mddev, struct bio * bi)
 	bi->bi_next = NULL;
 	bi->bi_phys_segments = 1;	/* over-loaded to count active stripes */
 
-	plugged = mddev_check_plugged(mddev);
+	plugged = mddev_check_plugged(mddev, 0, NULL);
 	for (;logical_sector < last_sector; logical_sector += STRIPE_SECTORS) {
 		DEFINE_WAIT(w);
 		int previous;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index ba43f40..390e660 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -914,7 +914,7 @@ struct blk_plug {
 
 struct blk_plug_cb {
 	struct list_head list;
-	void (*callback)(struct blk_plug_cb *);
+	void (*callback)(struct blk_plug_cb *, bool);
 };
 
 extern void blk_start_plug(struct blk_plug *);

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-28  9:03 ` [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage NeilBrown
@ 2012-06-29  1:29   ` Stan Hoeppner
  2012-06-29  2:52     ` NeilBrown
  2012-06-29  6:10   ` Shaohua Li
  1 sibling, 1 reply; 13+ messages in thread
From: Stan Hoeppner @ 2012-06-29  1:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, axboe

On 6/28/2012 4:03 AM, NeilBrown wrote:
> On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> 
>> In raid1/10, all write requests are dispatched in a single thread. In fast
>> storage, the thread is a bottleneck, because it dispatches request too slow.
>> Also the thread migrates freely, which makes request completion cpu not match
>> with submission cpu even driver/block layer has such capability. This will
>> cause bad cache issue. Both these are not a big deal for slow storage.
>>
>> Switching the dispatching to percpu/perthread based dramatically increases
>> performance.  The more raid disk number is, the more performance boosts. In a
>> 4-disk raid10 setup, this can double the throughput.
>>
>> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
>> raw device is accessed, and there is correct block plug set which can help do
>> request merge and reduce lock contention.
>>
>> V2->V3:
>> rebase to latest tree and fix cpuhotplug issue
>>
>> V1->V2:
>> 1. droped direct dispatch patches. That has better performance imporvement, but
>> is hopelessly made correct.
>> 2. Add a MD specific workqueue to do percpu dispatch.


> I still don't like the per-cpu allocations and the extra work queues.

Why don't you like this method Neil?  Complexity?  The performance seems
to be there.

-- 
Stan


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-29  1:29   ` Stan Hoeppner
@ 2012-06-29  2:52     ` NeilBrown
  2012-06-29  3:02       ` Roberto Spadim
  2012-06-30  4:37       ` Stan Hoeppner
  0 siblings, 2 replies; 13+ messages in thread
From: NeilBrown @ 2012-06-29  2:52 UTC (permalink / raw)
  To: stan; +Cc: Shaohua Li, linux-raid, axboe

[-- Attachment #1: Type: text/plain, Size: 2018 bytes --]

On Thu, 28 Jun 2012 20:29:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
wrote:

> On 6/28/2012 4:03 AM, NeilBrown wrote:
> > On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> > 
> >> In raid1/10, all write requests are dispatched in a single thread. In fast
> >> storage, the thread is a bottleneck, because it dispatches request too slow.
> >> Also the thread migrates freely, which makes request completion cpu not match
> >> with submission cpu even driver/block layer has such capability. This will
> >> cause bad cache issue. Both these are not a big deal for slow storage.
> >>
> >> Switching the dispatching to percpu/perthread based dramatically increases
> >> performance.  The more raid disk number is, the more performance boosts. In a
> >> 4-disk raid10 setup, this can double the throughput.
> >>
> >> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
> >> raw device is accessed, and there is correct block plug set which can help do
> >> request merge and reduce lock contention.
> >>
> >> V2->V3:
> >> rebase to latest tree and fix cpuhotplug issue
> >>
> >> V1->V2:
> >> 1. droped direct dispatch patches. That has better performance imporvement, but
> >> is hopelessly made correct.
> >> 2. Add a MD specific workqueue to do percpu dispatch.
> 
> 
> > I still don't like the per-cpu allocations and the extra work queues.
> 
> Why don't you like this method Neil?  Complexity?  The performance seems
> to be there.
> 

Not an easy question to answer.  It just doesn't "taste" nice.
I certainly like the performance and if this is the only way to get that
performance then we'll probably go that way.  But I'm not convinced it is the
only way and I want to explore other options first.
I guess it feels a bit heavy-handed.  On machines with 1024 cores, per-cpu
allocations and per-cpu threads are not as cheap as they are on 2-core
machines.  And I'm hoping for a 1024-core phone soon :-)

NeilBrown


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-29  2:52     ` NeilBrown
@ 2012-06-29  3:02       ` Roberto Spadim
  2012-06-30  4:37       ` Stan Hoeppner
  1 sibling, 0 replies; 13+ messages in thread
From: Roberto Spadim @ 2012-06-29  3:02 UTC (permalink / raw)
  To: NeilBrown; +Cc: stan, Shaohua Li, linux-raid, axboe

Heh, 1024 cores.
I will mark this email and come back to it when I get a 1024-core CPU.
Maybe in 2 years? 10 years?

2012/6/28 NeilBrown <neilb@suse.de>
>
> On Thu, 28 Jun 2012 20:29:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
>
> > On 6/28/2012 4:03 AM, NeilBrown wrote:
> > > On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> > >
> > >> In raid1/10, all write requests are dispatched in a single thread. In fast
> > >> storage, the thread is a bottleneck, because it dispatches request too slow.
> > >> Also the thread migrates freely, which makes request completion cpu not match
> > >> with submission cpu even driver/block layer has such capability. This will
> > >> cause bad cache issue. Both these are not a big deal for slow storage.
> > >>
> > >> Switching the dispatching to percpu/perthread based dramatically increases
> > >> performance.  The more raid disk number is, the more performance boosts. In a
> > >> 4-disk raid10 setup, this can double the throughput.
> > >>
> > >> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
> > >> raw device is accessed, and there is correct block plug set which can help do
> > >> request merge and reduce lock contention.
> > >>
> > >> V2->V3:
> > >> rebase to latest tree and fix cpuhotplug issue
> > >>
> > >> V1->V2:
> > >> 1. droped direct dispatch patches. That has better performance imporvement, but
> > >> is hopelessly made correct.
> > >> 2. Add a MD specific workqueue to do percpu dispatch.
> >
> >
> > > I still don't like the per-cpu allocations and the extra work queues.
> >
> > Why don't you like this method Neil?  Complexity?  The performance seems
> > to be there.
> >
>
> Not an easy question to answer.  It just doesn't "taste" nice.
> I certainly like the performance and if this is the only way to get that
> performance then we'll probably go that way.  But I'm not convinced it is the
> only way and I want to explore other options first.
> I guess it feels a bit heavy handed.  On machines with 1024 cores, per-cpu
> allocations and per-cpu threads are not as cheap as they are one 2-core
> machines.  And I'm hoping for a 1024-core phone soon :-)
>
> NeilBrown
>



--
Roberto Spadim
Spadim Technology / SPAEmpresarial


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-28  9:03 ` [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage NeilBrown
  2012-06-29  1:29   ` Stan Hoeppner
@ 2012-06-29  6:10   ` Shaohua Li
  2012-07-02  7:36     ` Shaohua Li
  1 sibling, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2012-06-29  6:10 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe

2012/6/28 NeilBrown <neilb@suse.de>:
> On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>
>> In raid1/10, all write requests are dispatched in a single thread. In fast
>> storage, the thread is a bottleneck, because it dispatches request too slow.
>> Also the thread migrates freely, which makes request completion cpu not match
>> with submission cpu even driver/block layer has such capability. This will
>> cause bad cache issue. Both these are not a big deal for slow storage.
>>
>> Switching the dispatching to percpu/perthread based dramatically increases
>> performance.  The more raid disk number is, the more performance boosts. In a
>> 4-disk raid10 setup, this can double the throughput.
>>
>> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
>> raw device is accessed, and there is correct block plug set which can help do
>> request merge and reduce lock contention.
>>
>> V2->V3:
>> rebase to latest tree and fix cpuhotplug issue
>>
>> V1->V2:
>> 1. droped direct dispatch patches. That has better performance imporvement, but
>> is hopelessly made correct.
>> 2. Add a MD specific workqueue to do percpu dispatch.
>
>
> Hi.
>
> I still don't like the per-cpu allocations and the extra work queues.
>
> The following patch demonstrates how I would like to address this issue.  It
> should submit requests from the same thread that initially made the request -
> at least in most cases.
>
> It leverages the plugging code and pushed everything out on the unplug,
> unless that comes from a scheduler call (which should be uncommon).  In that
> case it falls back on passing all the requests to the md thread.
>
> Obviously if we proceed with this I'll split this up into neat reviewable
> patches.  However before that it would help to know if it really helps as I
> think it should.
>
> So would you be able to test it on your SSD hardware and see how it compares
> the current code, and to you code?  Thanks.
>
> I have only tested it lightly myself so there could still be bugs, but
> hopefully not obvious ones.
>
> A simple "time mkfs" test on very modest hardware show as 25% reduction in
> total time (168s -> 127s).  I guess that's a 33% increase in speed?
> However sequential writes with 'dd' seem a little slower (14MB/s -> 13.6MB/s)
>
> There are some hacks in there that need to be cleaned up, but I think the
> general structure looks good.

I did consider this approach before, and scheduling from the unplug callback
was an issue. Maybe I overlooked something at that time; the from_schedule
check looks promising.

Thanks,
Shaohua


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-29  2:52     ` NeilBrown
  2012-06-29  3:02       ` Roberto Spadim
@ 2012-06-30  4:37       ` Stan Hoeppner
  1 sibling, 0 replies; 13+ messages in thread
From: Stan Hoeppner @ 2012-06-30  4:37 UTC (permalink / raw)
  To: NeilBrown; +Cc: Shaohua Li, linux-raid, axboe

On 6/28/2012 9:52 PM, NeilBrown wrote:
> On Thu, 28 Jun 2012 20:29:21 -0500 Stan Hoeppner <stan@hardwarefreak.com>
> wrote:
> 
>> On 6/28/2012 4:03 AM, NeilBrown wrote:
>>> On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
>>>
>>>> In raid1/10, all write requests are dispatched in a single thread. In fast
>>>> storage, the thread is a bottleneck, because it dispatches request too slow.
>>>> Also the thread migrates freely, which makes request completion cpu not match
>>>> with submission cpu even driver/block layer has such capability. This will
>>>> cause bad cache issue. Both these are not a big deal for slow storage.
>>>>
>>>> Switching the dispatching to percpu/perthread based dramatically increases
>>>> performance.  The more raid disk number is, the more performance boosts. In a
>>>> 4-disk raid10 setup, this can double the throughput.
>>>>
>>>> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
>>>> raw device is accessed, and there is correct block plug set which can help do
>>>> request merge and reduce lock contention.
>>>>
>>>> V2->V3:
>>>> rebase to latest tree and fix cpuhotplug issue
>>>>
>>>> V1->V2:
>>>> 1. droped direct dispatch patches. That has better performance imporvement, but
>>>> is hopelessly made correct.
>>>> 2. Add a MD specific workqueue to do percpu dispatch.
>>
>>
>>> I still don't like the per-cpu allocations and the extra work queues.
>>
>> Why don't you like this method Neil?  Complexity?  The performance seems
>> to be there.
>>
> 
> Not an easy question to answer.  It just doesn't "taste" nice.
> I certainly like the performance and if this is the only way to get that
> performance then we'll probably go that way.  But I'm not convinced it is the
> only way and I want to explore other options first.

I completely agree with the philosophy of exploring multiple options.

> I guess it feels a bit heavy handed.  On machines with 1024 cores, per-cpu
> allocations and per-cpu threads are not as cheap as they are one 2-core
> machines.  And I'm hoping for a 1024-core phone soon :-)

The only 1024-core machines on the planet are SGI Altix UV (up to 2560
cores).  And they make extensive use of per-cpu allocations and threads
in both XVM (the SGI Linux volume manager) and XFS.  Keep in mind that
the CpuMemSets API which enables this originated at SGI.  The storage is
FC SAN RAID, and XVM is used to stripe or concatenate the hw RAID LUNs.
 Without per-cpu threads this machine's IO couldn't scale.

Quoting Geoffrey Wehrman of SGI, from a post to the XFS list:

"With an SGI IS16000 array which supports up to 1,200 drives,
filesystems with large numbers of drives isn't difficult.  Most
configurations using the IS16000 have 8+2 RAID6 luns.  I've seen
sustained 15 GB/s to a single filesystem on one of the arrays with a 600
drive configuration.  The scalability of XFS is impressive."

Without per-cpu threads in XVM and XFS this level of throughput wouldn't
be possible.  XVM is closed source, but the XFS devs would probably be
open to discussing how they do this, their beef with your current
default stripe chunk size notwithstanding. ;)

-- 
Stan


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-06-29  6:10   ` Shaohua Li
@ 2012-07-02  7:36     ` Shaohua Li
  2012-07-03  8:58       ` Shaohua Li
  0 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2012-07-02  7:36 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe

On Fri, Jun 29, 2012 at 02:10:30PM +0800, Shaohua Li wrote:
> 2012/6/28 NeilBrown <neilb@suse.de>:
> > On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> >
> >> In raid1/10, all write requests are dispatched in a single thread. In fast
> >> storage, the thread is a bottleneck, because it dispatches request too slow.
> >> Also the thread migrates freely, which makes request completion cpu not match
> >> with submission cpu even driver/block layer has such capability. This will
> >> cause bad cache issue. Both these are not a big deal for slow storage.
> >>
> >> Switching the dispatching to percpu/perthread based dramatically increases
> >> performance.  The more raid disk number is, the more performance boosts. In a
> >> 4-disk raid10 setup, this can double the throughput.
> >>
> >> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
> >> raw device is accessed, and there is correct block plug set which can help do
> >> request merge and reduce lock contention.
> >>
> >> V2->V3:
> >> rebase to latest tree and fix cpuhotplug issue
> >>
> >> V1->V2:
> >> 1. droped direct dispatch patches. That has better performance imporvement, but
> >> is hopelessly made correct.
> >> 2. Add a MD specific workqueue to do percpu dispatch.
> >
> >
> > Hi.
> >
> > I still don't like the per-cpu allocations and the extra work queues.
> >
> > The following patch demonstrates how I would like to address this issue.  It
> > should submit requests from the same thread that initially made the request -
> > at least in most cases.
> >
> > It leverages the plugging code and pushed everything out on the unplug,
> > unless that comes from a scheduler call (which should be uncommon).  In that
> > case it falls back on passing all the requests to the md thread.
> >
> > Obviously if we proceed with this I'll split this up into neat reviewable
> > patches.  However before that it would help to know if it really helps as I
> > think it should.
> >
> > So would you be able to test it on your SSD hardware and see how it compares
> > the current code, and to you code?  Thanks.
> >
> > I have only tested it lightly myself so there could still be bugs, but
> > hopefully not obvious ones.
> >
> > A simple "time mkfs" test on very modest hardware show as 25% reduction in
> > total time (168s -> 127s).  I guess that's a 33% increase in speed?
> > However sequential writes with 'dd' seem a little slower (14MB/s -> 13.6MB/s)
> >
> > There are some hacks in there that need to be cleaned up, but I think the
> > general structure looks good.
> 
> Thought I consider this approach before, and schedule from the unplug
> callback is an issue. Maybe I overlooked it at that time, the from_schedule
> check looks promising.

I tested raid1/raid10 performance with this patch (with a similar change for
raid10, plus a plug added in the raid1/10 unplug functions for dispatching),
and the result is OK. The from_schedule check does the trick; the race I
mentioned before isn't there. I also double-checked the rate at which unplug
is called from schedule, and it is very low.

Now the only question is whether the extra bitmap flush could be an overhead.
Our card doesn't show such overhead, so I'm not sure.

Thanks,
Shaohua


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-07-02  7:36     ` Shaohua Li
@ 2012-07-03  8:58       ` Shaohua Li
  2012-07-04  1:45         ` NeilBrown
  0 siblings, 1 reply; 13+ messages in thread
From: Shaohua Li @ 2012-07-03  8:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, axboe

On Mon, Jul 02, 2012 at 03:36:45PM +0800, Shaohua Li wrote:
> On Fri, Jun 29, 2012 at 02:10:30PM +0800, Shaohua Li wrote:
> > 2012/6/28 NeilBrown <neilb@suse.de>:
> > > On Wed, 13 Jun 2012 17:11:43 +0800 Shaohua Li <shli@kernel.org> wrote:
> > >
> > >> In raid1/10, all write requests are dispatched in a single thread. In fast
> > >> storage, the thread is a bottleneck, because it dispatches request too slow.
> > >> Also the thread migrates freely, which makes request completion cpu not match
> > >> with submission cpu even driver/block layer has such capability. This will
> > >> cause bad cache issue. Both these are not a big deal for slow storage.
> > >>
> > >> Switching the dispatching to percpu/perthread based dramatically increases
> > >> performance.  The more raid disk number is, the more performance boosts. In a
> > >> 4-disk raid10 setup, this can double the throughput.
> > >>
> > >> percpu/perthread based dispatch doesn't harm slow storage. This is the way how
> > >> raw device is accessed, and there is correct block plug set which can help do
> > >> request merge and reduce lock contention.
> > >>
> > >> V2->V3:
> > >> rebase to latest tree and fix cpuhotplug issue
> > >>
> > >> V1->V2:
> > >> 1. droped direct dispatch patches. That has better performance imporvement, but
> > >> is hopelessly made correct.
> > >> 2. Add a MD specific workqueue to do percpu dispatch.
> > >
> > >
> > > Hi.
> > >
> > > I still don't like the per-cpu allocations and the extra work queues.
> > >
> > > The following patch demonstrates how I would like to address this issue.  It
> > > should submit requests from the same thread that initially made the request -
> > > at least in most cases.
> > >
> > > It leverages the plugging code and pushed everything out on the unplug,
> > > unless that comes from a scheduler call (which should be uncommon).  In that
> > > case it falls back on passing all the requests to the md thread.
> > >
> > > Obviously if we proceed with this I'll split this up into neat reviewable
> > > patches.  However before that it would help to know if it really helps as I
> > > think it should.
> > >
> > > So would you be able to test it on your SSD hardware and see how it compares
> > > the current code, and to you code?  Thanks.
> > >
> > > I have only tested it lightly myself so there could still be bugs, but
> > > hopefully not obvious ones.
> > >
> > > A simple "time mkfs" test on very modest hardware show as 25% reduction in
> > > total time (168s -> 127s).  I guess that's a 33% increase in speed?
> > > However sequential writes with 'dd' seem a little slower (14MB/s -> 13.6MB/s)
> > >
> > > There are some hacks in there that need to be cleaned up, but I think the
> > > general structure looks good.
> > 
> > Thought I consider this approach before, and schedule from the unplug
> > callback is an issue. Maybe I overlooked it at that time, the from_schedule
> > check looks promising.
> 
> I tried raid1/raid10 performance with this patch (with similar change for
> raid10, and add plug in the raid1/10 unplug function for dispatching), the
> result is ok. The from_schedule check does the trick, there isn't race I
> mentioned before. And I double checked the rate unplug is called from schedule,
> which is very very low.
> 
> Now the only problem is if extra bitmap flush could be an overhead. Our card
> hasn't such overhead, so not sure.

Looks like you merged the patch into your tree, great! raid1_unplug() still
lacks blk_start_plug/blk_finish_plug. Will you add a similar patch for raid10?

Thanks,
Shaohua


* Re: [patch 0/3 v3] MD: improve raid1/10 write performance for fast storage
  2012-07-03  8:58       ` Shaohua Li
@ 2012-07-04  1:45         ` NeilBrown
  0 siblings, 0 replies; 13+ messages in thread
From: NeilBrown @ 2012-07-04  1:45 UTC (permalink / raw)
  To: Shaohua Li; +Cc: linux-raid, axboe

[-- Attachment #1: Type: text/plain, Size: 565 bytes --]

On Tue, 3 Jul 2012 16:58:58 +0800 Shaohua Li <shli@kernel.org> wrote:

> Looks you merged the patch to your tree, great! The raid1_unplug() still lacks
> blk_start_plug/blk_finish_plug.

Is that needed?  raid1_unplug will be called from a context where there is
still a plug active.
That is one aspect that I need to carefully review - make sure that requests
submitted from unplug also get unplugged etc.

> Will you add a similar patch for raid10?

Once I'm sure that the raid1 patch is good, which requires more study and
testing.


NeilBrown

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

