* [PATCH 0/3] md fixes for 2.6.32-rc
@ 2009-10-02  1:18 Dan Williams
  2009-10-02  1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02  1:18 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid

Hi Neil,

A few fixes:
1/ The multicore option is not ready for prime time
2/ A fix for raid5 init failures
3/ A reminder that the fix you sent to Greg a while back has not found
   its way upstream.

Thanks,
Dan

---

Dan Williams (2):
      Revert "md/raid456: distribute raid processing over multiple cores"
      md/raid5: initialize conf->device_lock earlier

NeilBrown (1):
      Allow sysfs_notify_dirent to be called from interrupt context.


 drivers/md/Kconfig |   11 ---------
 drivers/md/raid5.c |   62 +++++++++++++---------------------------------------
 fs/sysfs/file.c    |   14 +++++++-----
 3 files changed, 23 insertions(+), 64 deletions(-)


* [PATCH 1/3] md/raid5: initialize conf->device_lock earlier
  2009-10-02  1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
@ 2009-10-02  1:18 ` Dan Williams
  2009-10-02  1:18 ` [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores" Dan Williams
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02  1:18 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid

Deallocating a raid5_conf_t structure requires taking 'device_lock'.
Ensure it is initialized before it is used, i.e. initialize the lock
before attempting any further initializations that might fail.
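
For illustration, the required ordering looks roughly like this
(simplified sketch with a made-up struct conf and a PAGE_SIZE allocation
standing in for the later, failure-prone initializations; not the actual
raid5.c code):

	#include <linux/slab.h>
	#include <linux/spinlock.h>

	struct conf {
		spinlock_t device_lock;
		void *scribble;		/* stands in for later allocations */
	};

	static void free_conf(struct conf *conf)
	{
		/* teardown needs the lock, e.g. to drain stripe lists */
		spin_lock_irq(&conf->device_lock);
		/* ... release anything still queued ... */
		spin_unlock_irq(&conf->device_lock);
		kfree(conf->scribble);
		kfree(conf);
	}

	static struct conf *setup_conf(void)
	{
		struct conf *conf = kzalloc(sizeof(*conf), GFP_KERNEL);

		if (!conf)
			return NULL;
		spin_lock_init(&conf->device_lock);	/* before the first failure point */

		conf->scribble = kzalloc(PAGE_SIZE, GFP_KERNEL);
		if (!conf->scribble)
			goto abort;
		return conf;
	 abort:
		free_conf(conf);	/* safe: device_lock is already live */
		return NULL;
	}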

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/md/raid5.c |   25 ++++++++++++-------------
 1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9482980..c21cc50 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4722,6 +4722,18 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 	conf = kzalloc(sizeof(raid5_conf_t), GFP_KERNEL);
 	if (conf == NULL)
 		goto abort;
+	spin_lock_init(&conf->device_lock);
+	init_waitqueue_head(&conf->wait_for_stripe);
+	init_waitqueue_head(&conf->wait_for_overlap);
+	INIT_LIST_HEAD(&conf->handle_list);
+	INIT_LIST_HEAD(&conf->hold_list);
+	INIT_LIST_HEAD(&conf->delayed_list);
+	INIT_LIST_HEAD(&conf->bitmap_list);
+	INIT_LIST_HEAD(&conf->inactive_list);
+	atomic_set(&conf->active_stripes, 0);
+	atomic_set(&conf->preread_active_stripes, 0);
+	atomic_set(&conf->active_aligned_reads, 0);
+	conf->bypass_threshold = BYPASS_THRESHOLD;
 
 	conf->raid_disks = mddev->raid_disks;
 	conf->scribble_len = scribble_len(conf->raid_disks);
@@ -4744,19 +4756,6 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 	if (raid5_alloc_percpu(conf) != 0)
 		goto abort;
 
-	spin_lock_init(&conf->device_lock);
-	init_waitqueue_head(&conf->wait_for_stripe);
-	init_waitqueue_head(&conf->wait_for_overlap);
-	INIT_LIST_HEAD(&conf->handle_list);
-	INIT_LIST_HEAD(&conf->hold_list);
-	INIT_LIST_HEAD(&conf->delayed_list);
-	INIT_LIST_HEAD(&conf->bitmap_list);
-	INIT_LIST_HEAD(&conf->inactive_list);
-	atomic_set(&conf->active_stripes, 0);
-	atomic_set(&conf->preread_active_stripes, 0);
-	atomic_set(&conf->active_aligned_reads, 0);
-	conf->bypass_threshold = BYPASS_THRESHOLD;
-
 	pr_debug("raid5: run(%s) called.\n", mdname(mddev));
 
 	list_for_each_entry(rdev, &mddev->disks, same_set) {



* [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores"
  2009-10-02  1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
  2009-10-02  1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
@ 2009-10-02  1:18 ` Dan Williams
  2009-10-02  1:18 ` [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context Dan Williams
  2009-10-02  4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
  3 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02  1:18 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Holger Kiehl

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest gains
(+4%).  However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe().  As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

Revert this for now, to be resurrected once we have a more workable
thread pool.

Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/md/Kconfig |   11 -----------
 drivers/md/raid5.c |   37 +++----------------------------------
 2 files changed, 3 insertions(+), 45 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2158377..f0fac58 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -154,17 +154,6 @@ config MD_RAID456
 
 	  If unsure, say Y.
 
-config MULTICORE_RAID456
-	bool "RAID-4/RAID-5/RAID-6 Multicore processing (EXPERIMENTAL)"
-	depends on MD_RAID456
-	depends on SMP
-	depends on EXPERIMENTAL
-	---help---
-	  Enable the raid456 module to dispatch per-stripe raid operations to a
-	  thread pool.
-
-	  If unsure, say N.
-
 config MD_RAID6_PQ
 	tristate
 
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c21cc50..11cdfe6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -47,7 +47,6 @@
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/async_tx.h>
-#include <linux/async.h>
 #include <linux/seq_file.h>
 #include <linux/cpu.h>
 #include "md.h"
@@ -4349,36 +4348,6 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	return handled;
 }
 
-#ifdef CONFIG_MULTICORE_RAID456
-static void __process_stripe(void *param, async_cookie_t cookie)
-{
-	struct stripe_head *sh = param;
-
-	handle_stripe(sh);
-	release_stripe(sh);
-}
-
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	async_schedule_domain(__process_stripe, sh, domain);
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-	async_synchronize_full_domain(domain);
-}
-#else
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	handle_stripe(sh);
-	release_stripe(sh);
-	cond_resched();
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-}
-#endif
 
 
 /*
@@ -4393,7 +4362,6 @@ static void raid5d(mddev_t *mddev)
 	struct stripe_head *sh;
 	raid5_conf_t *conf = mddev->private;
 	int handled;
-	LIST_HEAD(raid_domain);
 
 	pr_debug("+++ raid5d active\n");
 
@@ -4430,7 +4398,9 @@ static void raid5d(mddev_t *mddev)
 		spin_unlock_irq(&conf->device_lock);
 		
 		handled++;
-		process_stripe(sh, &raid_domain);
+		handle_stripe(sh);
+		release_stripe(sh);
+		cond_resched();
 
 		spin_lock_irq(&conf->device_lock);
 	}
@@ -4438,7 +4408,6 @@ static void raid5d(mddev_t *mddev)
 
 	spin_unlock_irq(&conf->device_lock);
 
-	synchronize_stripe_processing(&raid_domain);
 	async_tx_issue_pending_all();
 	unplug_slaves(mddev);
 



* [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context.
  2009-10-02  1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
  2009-10-02  1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
  2009-10-02  1:18 ` [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores" Dan Williams
@ 2009-10-02  1:18 ` Dan Williams
  2009-10-02  4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
  3 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02  1:18 UTC (permalink / raw)
  To: neilb; +Cc: linux-raid, Joel Andres Granados, stable

From: NeilBrown <neilb@suse.de>

sysfs_notify_dirent is a simple atomic operation that can be used to
alert user-space that new data can be read from a sysfs attribute.
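
For reference, a user-space consumer of such a notification typically
looks like the minimal sketch below (the attribute path is only an
example, e.g. md's array_state; not code from this patch):

	#include <fcntl.h>
	#include <poll.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];
		ssize_t n;
		int fd = open("/sys/block/md0/md/array_state", O_RDONLY);

		if (fd < 0)
			return 1;
		for (;;) {
			struct pollfd pfd = { .fd = fd, .events = POLLPRI | POLLERR };

			lseek(fd, 0, SEEK_SET);
			n = read(fd, buf, sizeof(buf) - 1);	/* reading (re)arms the event */
			if (n > 0) {
				buf[n] = '\0';
				printf("array_state: %s", buf);
			}
			poll(&pfd, 1, -1);	/* returns when sysfs_notify_dirent() fires */
		}
	}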

Unfortunately it cannot currently be called from non-process context
because of its use of spin_lock, which is sometimes taken with
interrupts enabled.

So change all lockers of sysfs_open_dirent_lock to disable interrupts,
thus making sysfs_notify_dirent safe to be called from non-process
context (as drivers/md does in md_safemode_timeout).

sysfs_get_open_dirent is (documented as being) only called from
process context, so it uses spin_lock_irq.  Other places
use spin_lock_irqsave.
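
To illustrate the problem being fixed (simplified sketch with a stand-in
lock and made-up helpers, not code from this patch):

	#include <linux/spinlock.h>

	static DEFINE_SPINLOCK(example_lock);	/* stands in for sysfs_open_dirent_lock */

	/* interrupt context, e.g. md_safemode_timeout() -> sysfs_notify_dirent() */
	static void notify_event(void)
	{
		unsigned long flags;

		spin_lock_irqsave(&example_lock, flags);
		/* ... bump the event count, wake up pollers ... */
		spin_unlock_irqrestore(&example_lock, flags);
	}

	/* process context, e.g. sysfs_get_open_dirent() */
	static void attach_buffer(void)
	{
		/*
		 * With a plain spin_lock() here, an interrupt arriving on this
		 * CPU while the lock is held would enter notify_event() and spin
		 * forever on a lock its own CPU already owns.  Disabling local
		 * interrupts for the critical section closes that window.
		 */
		spin_lock_irq(&example_lock);
		/* ... add the buffer to the open dirent's list ... */
		spin_unlock_irq(&example_lock);
	}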

The usage for sysfs_notify_dirent in md_safemode_timeout was
introduced in 2.6.28, so this patch is suitable for that and more
recent kernels.

Cc: <stable@kernel.org>
Reported-by: Joel Andres Granados <jgranado@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/sysfs/file.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 561a9c0..f5ea468 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -268,7 +268,7 @@ static int sysfs_get_open_dirent(struct sysfs_dirent *sd,
 	struct sysfs_open_dirent *od, *new_od = NULL;
 
  retry:
-	spin_lock(&sysfs_open_dirent_lock);
+	spin_lock_irq(&sysfs_open_dirent_lock);
 
 	if (!sd->s_attr.open && new_od) {
 		sd->s_attr.open = new_od;
@@ -281,7 +281,7 @@ static int sysfs_get_open_dirent(struct sysfs_dirent *sd,
 		list_add_tail(&buffer->list, &od->buffers);
 	}
 
-	spin_unlock(&sysfs_open_dirent_lock);
+	spin_unlock_irq(&sysfs_open_dirent_lock);
 
 	if (od) {
 		kfree(new_od);
@@ -315,8 +315,9 @@ static void sysfs_put_open_dirent(struct sysfs_dirent *sd,
 				  struct sysfs_buffer *buffer)
 {
 	struct sysfs_open_dirent *od = sd->s_attr.open;
+	unsigned long flags;
 
-	spin_lock(&sysfs_open_dirent_lock);
+	spin_lock_irqsave(&sysfs_open_dirent_lock, flags);
 
 	list_del(&buffer->list);
 	if (atomic_dec_and_test(&od->refcnt))
@@ -324,7 +325,7 @@ static void sysfs_put_open_dirent(struct sysfs_dirent *sd,
 	else
 		od = NULL;
 
-	spin_unlock(&sysfs_open_dirent_lock);
+	spin_unlock_irqrestore(&sysfs_open_dirent_lock, flags);
 
 	kfree(od);
 }
@@ -456,8 +457,9 @@ static unsigned int sysfs_poll(struct file *filp, poll_table *wait)
 void sysfs_notify_dirent(struct sysfs_dirent *sd)
 {
 	struct sysfs_open_dirent *od;
+	unsigned long flags;
 
-	spin_lock(&sysfs_open_dirent_lock);
+	spin_lock_irqsave(&sysfs_open_dirent_lock, flags);
 
 	od = sd->s_attr.open;
 	if (od) {
@@ -465,7 +467,7 @@ void sysfs_notify_dirent(struct sysfs_dirent *sd)
 		wake_up_interruptible(&od->poll);
 	}
 
-	spin_unlock(&sysfs_open_dirent_lock);
+	spin_unlock_irqrestore(&sysfs_open_dirent_lock, flags);
 }
 EXPORT_SYMBOL_GPL(sysfs_notify_dirent);
 



* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-02  1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
                   ` (2 preceding siblings ...)
  2009-10-02  1:18 ` [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context Dan Williams
@ 2009-10-02  4:13 ` Neil Brown
  2009-10-03 15:54   ` Dan Williams
  2009-10-07  0:36   ` Dan Williams
  3 siblings, 2 replies; 13+ messages in thread
From: Neil Brown @ 2009-10-02  4:13 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid

On Thursday October 1, dan.j.williams@intel.com wrote:
> Hi Neil,
> 
> A few fixes:
> 1/ The multicore option is not ready for prime time

But it is already marked experimental...
So do we really need to revert?  or is the current code broken beyond
repair?

> 2/ A fix for raid5 init failures

Yep, thanks!

> 3/ A reminder that the fix you sent to Greg a while back has not found
>    its way upstream.

I've nagged Greg about this, thanks.

NeilBrown


* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-02  4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
@ 2009-10-03 15:54   ` Dan Williams
  2009-10-07  0:36   ` Dan Williams
  1 sibling, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-03 15:54 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> On Thursday October 1, dan.j.williams@intel.com wrote:
>> Hi Neil,
>>
>> A few fixes:
>> 1/ The multicore option is not ready for prime time
> 
> But it is already marked experimental...
> So do we really need to revert?  or is the current code broken beyond
> repair?

On second thought perhaps it is salvageable.  I'll take a look at moving 
the parallelism down from handle_stripe() to raid_run_ops().

--
Dan


* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-02  4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
  2009-10-03 15:54   ` Dan Williams
@ 2009-10-07  0:36   ` Dan Williams
  2009-10-07  4:34     ` Neil Brown
  2009-10-07 12:05     ` Holger Kiehl
  1 sibling, 2 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-07  0:36 UTC (permalink / raw)
  To: Neil Brown; +Cc: linux-raid

From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Sat, 3 Oct 2009 13:47:05 -0700
Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest gains
(+4%).  However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe().  As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated.  This version still exhibits an
average 11% throughput loss for:

	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
	echo 1024 > /sys/block/md0/md/stripe_cache_size
	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

...but the results are at least stable and can be used as a base for
further multicore experimentation.

Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
On Thu, 2009-10-01 at 21:13 -0700, Neil Brown wrote:
> On Thursday October 1, dan.j.williams@intel.com wrote:
> > Hi Neil,
> > 
> > A few fixes:
> > 1/ The multicore option is not ready for prime time
> 
> But it is already marked experimental...
> So do we really need to revert?  or is the current code broken beyond
> repair?

So we don't need a revert; this fixes up the unpredictability of the
original implementation.  It surprised me that the overhead of passing
raid_run_ops to the async thread pool amounted to an 11% performance
regression.  In any event I think this is a better baseline for future
multicore experimentation than the current implementation.

--
Dan

 drivers/md/raid5.c |   72 ++++++++++++++++++++++++++-------------------------
 drivers/md/raid5.h |   12 ++++++++-
 2 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c21cc50..6b4a09f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1139,7 +1139,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
 			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
 }
 
-static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
+static void __raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 {
 	int overlap_clear = 0, i, disks = sh->disks;
 	struct dma_async_tx_descriptor *tx = NULL;
@@ -1204,6 +1204,36 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 	put_cpu();
 }
 
+#ifdef CONFIG_MULTICORE_RAID456
+static void async_run_ops(void *param, async_cookie_t cookie)
+{
+	struct stripe_head *sh = param;
+	unsigned long ops_request = sh->ops.request;
+
+	clear_bit_unlock(STRIPE_OPS_REQ_PENDING, &sh->state);
+	wake_up(&sh->ops.wait_for_ops);
+
+	__raid_run_ops(sh, ops_request);
+	release_stripe(sh);
+}
+
+static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
+{
+	/* since handle_stripe can be called outside of raid5d context
+	 * we need to ensure sh->ops.request is de-staged before another
+	 * request arrives
+	 */
+	wait_event(sh->ops.wait_for_ops,
+		   !test_and_set_bit_lock(STRIPE_OPS_REQ_PENDING, &sh->state));
+	sh->ops.request = ops_request;
+
+	atomic_inc(&sh->count);
+	async_schedule(async_run_ops, sh);
+}
+#else
+#define raid_run_ops __raid_run_ops
+#endif
+
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
@@ -1213,6 +1243,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
 	sh->raid_conf = conf;
 	spin_lock_init(&sh->lock);
+	#ifdef CONFIG_MULTICORE_RAID456
+	init_waitqueue_head(&sh->ops.wait_for_ops);
+	#endif
 
 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
@@ -4349,37 +4382,6 @@ static int  retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	return handled;
 }
 
-#ifdef CONFIG_MULTICORE_RAID456
-static void __process_stripe(void *param, async_cookie_t cookie)
-{
-	struct stripe_head *sh = param;
-
-	handle_stripe(sh);
-	release_stripe(sh);
-}
-
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	async_schedule_domain(__process_stripe, sh, domain);
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-	async_synchronize_full_domain(domain);
-}
-#else
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	handle_stripe(sh);
-	release_stripe(sh);
-	cond_resched();
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-}
-#endif
-
 
 /*
  * This is our raid5 kernel thread.
@@ -4393,7 +4395,6 @@ static void raid5d(mddev_t *mddev)
 	struct stripe_head *sh;
 	raid5_conf_t *conf = mddev->private;
 	int handled;
-	LIST_HEAD(raid_domain);
 
 	pr_debug("+++ raid5d active\n");
 
@@ -4430,7 +4431,9 @@ static void raid5d(mddev_t *mddev)
 		spin_unlock_irq(&conf->device_lock);
 		
 		handled++;
-		process_stripe(sh, &raid_domain);
+		handle_stripe(sh);
+		release_stripe(sh);
+		cond_resched();
 
 		spin_lock_irq(&conf->device_lock);
 	}
@@ -4438,7 +4441,6 @@ static void raid5d(mddev_t *mddev)
 
 	spin_unlock_irq(&conf->device_lock);
 
-	synchronize_stripe_processing(&raid_domain);
 	async_tx_issue_pending_all();
 	unplug_slaves(mddev);
 
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 2390e0e..dcefdc9 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -214,12 +214,20 @@ struct stripe_head {
 	int			disks;		/* disks in stripe */
 	enum check_states	check_state;
 	enum reconstruct_states reconstruct_state;
-	/* stripe_operations
+	/**
+	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
+	 * @target2 - 2nd compute target in the raid6 case
+	 * @zero_sum_result - P and Q verification flags
+	 * @request - async service request flags for raid_run_ops
 	 */
 	struct stripe_operations {
 		int 		     target, target2;
 		enum sum_check_flags zero_sum_result;
+		#ifdef CONFIG_MULTICORE_RAID456
+		unsigned long	     request;
+		wait_queue_head_t    wait_for_ops;
+		#endif
 	} ops;
 	struct r5dev {
 		struct bio	req;
@@ -294,6 +302,8 @@ struct r6_state {
 #define	STRIPE_FULL_WRITE	13 /* all blocks are set to be overwritten */
 #define	STRIPE_BIOFILL_RUN	14
 #define	STRIPE_COMPUTE_RUN	15
+#define	STRIPE_OPS_REQ_PENDING	16
+
 /*
  * Operation request flags
  */
-- 
1.6.0.6





* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-07  0:36   ` Dan Williams
@ 2009-10-07  4:34     ` Neil Brown
  2009-10-07 12:05     ` Holger Kiehl
  1 sibling, 0 replies; 13+ messages in thread
From: Neil Brown @ 2009-10-07  4:34 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-raid



On Tuesday October 6, dan.j.williams@intel.com wrote:
> >From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Sat, 3 Oct 2009 13:47:05 -0700
> Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
> 
> The percpu conversion allowed a straightforward handoff of stripe
> processing to the async subsystem that initially showed some modest gains
> (+4%).  However, this model is too simplistic and leads to stripes
> bouncing between raid5d and the async thread pool for every invocation
> of handle_stripe().  As reported by Holger this can fall into a
> pathological situation severely impacting throughput (6x performance
> loss).
> 
> By downleveling the parallelism to raid_run_ops the pathological
> stripe_head bouncing is eliminated.  This version still exhibits an
> average 11% throughput loss for:
> 
> 	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
> 	echo 1024 > /sys/block/md0/md/stripe_cache_size
> 	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
> 
> ...but the results are at least stable and can be used as a base for
> further multicore experimentation.

Thanks.

One little change needed:


>  static int grow_one_stripe(raid5_conf_t *conf)
>  {
>  	struct stripe_head *sh;
> @@ -1213,6 +1243,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
>  	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
>  	sh->raid_conf = conf;
>  	spin_lock_init(&sh->lock);
> +	#ifdef CONFIG_MULTICORE_RAID456
> +	init_waitqueue_head(&sh->ops.wait_for_ops);
> +	#endif
>  
>  	if (grow_buffers(sh, conf->raid_disks)) {
>  		shrink_buffers(sh, conf->raid_disks);

This addition is needed in resize_stripes too.
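
Something along these lines in the resize_stripes() allocation loop
should do it (sketch only; the surrounding lines are paraphrased, not an
exact diff):

	nsh = kmem_cache_alloc(sc, GFP_KERNEL);
	if (!nsh)
		break;
	memset(nsh, 0, sizeof(*nsh) + (newsize - 1) * sizeof(struct r5dev));
	nsh->raid_conf = conf;
	spin_lock_init(&nsh->lock);
	#ifdef CONFIG_MULTICORE_RAID456
	init_waitqueue_head(&nsh->ops.wait_for_ops);	/* same init as grow_one_stripe() */
	#endif
	list_add(&nsh->lru, &newstripes);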

NeilBrown


* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-07  0:36   ` Dan Williams
  2009-10-07  4:34     ` Neil Brown
@ 2009-10-07 12:05     ` Holger Kiehl
  2009-10-07 18:33       ` Asdo
  1 sibling, 1 reply; 13+ messages in thread
From: Holger Kiehl @ 2009-10-07 12:05 UTC (permalink / raw)
  To: Dan Williams; +Cc: Neil Brown, linux-raid

On Tue, 6 Oct 2009, Dan Williams wrote:

>> From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Sat, 3 Oct 2009 13:47:05 -0700
> Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
>
> The percpu conversion allowed a straightforward handoff of stripe
> processing to the async subsytem that initially showed some modest gains
> (+4%).  However, this model is too simplistic and leads to stripes
> bouncing between raid5d and the async thread pool for every invocation
> of handle_stripe().  As reported by Holger this can fall into a
> pathological situation severely impacting throughput (6x performance
> loss).
>
> By downleveling the parallelism to raid_run_ops the pathological
> stripe_head bouncing is eliminated.  This version still exhibits an
> average 11% throughput loss for:
>
> 	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
> 	echo 1024 > /sys/block/md0/md/stripe_cache_size
> 	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
>
> ...but the results are at least stable and can be used as a base for
> further multicore experimentation.
>
> Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> On Thu, 2009-10-01 at 21:13 -0700, Neil Brown wrote:
>> On Thursday October 1, dan.j.williams@intel.com wrote:
>>> Hi Neil,
>>>
>>> A few fixes:
>>> 1/ The multicore option is not ready for prime time
>>
>> But it is already marked experimental...
>> So do we really need to revert?  or is the current code broken beyond
>> repair?
>
> So we don't need a revert, this fixes up the unpredictability of the
> original implementation.  It surprised me that the overhead of passing
> raid_run_ops to the async thread pool amounted to an 11% performance
> regression.  In any event I think this is a better baseline for future
> multicore experimentation than the current implementation.
>
Just to add some more information, I did try this patch with
2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
performance regression. The test I am doing has several (appr. 60)
processes sending about 100000 small files (average size below 4096
bytes) via FTP or SFTP to localhost in a loop for 30 minutes. Here are
the real numbers:

  with multicore support enabled (with your patch)    3276.77 files per second
  with multicore support enabled (without your patch) 1014.47 files per second
  without multicore support                           7549.24 files per second

Holger


* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-07 12:05     ` Holger Kiehl
@ 2009-10-07 18:33       ` Asdo
  2009-10-08  8:50         ` Holger Kiehl
  0 siblings, 1 reply; 13+ messages in thread
From: Asdo @ 2009-10-07 18:33 UTC (permalink / raw)
  To: Holger Kiehl, linux-raid

Holger Kiehl wrote:
> On Tue, 6 Oct 2009, Dan Williams wrote:
>
>> ....
>> By downleveling the parallelism to raid_run_ops the pathological
>> stripe_head bouncing is eliminated.  This version still exhibits an
>> average 11% throughput....
>>
>> So we don't need a revert, this fixes up the unpredictability of the
>> original implementation.  It surprised me that the overhead of passing
>> raid_run_ops to the async thread pool amounted to an 11% performance
>> regression.  In any event I think this is a better baseline for future
>> multicore experimentation than the current implementation.
>>
> Just to add some more information, I did try this patch with
> 2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
> performance regression. 
Hi Holger

 From the above sentence it seems you get worse performance now than 
with the original multicore implementation, while from the numbers below 
it seems you get better performance now.

Which is correct?

(BTW a performance regression higher than 100% is impossible :-) )

> The tests I am doing is have several (appr. 60
> process) sending via FTP or SFTP about 100000 small files (average size
> below 4096 bytes) to localhost in a loop for 30 minutes. Here the
> real numbers:
>
>  with multicore support enabled (with your patch)    3276.77 files per 
> second
>  with multicore support enabled (without your patch) 1014.47 files per 
> second
>  without multicore support                           7549.24 files per 
> second
>
> Holger

Also, could you tell us some details about your machine and the RAID?
Like the model of CPU (Nehalems and AMDs have much faster memory access 
than earlier intels) and if it's a single-cpu or a dual cpu mainboard...
Amount of RAM
also: stripe_cache_size current setting for your RAID
Raid level, number of disks, chunk size, filesystem...

Thank you!



* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-07 18:33       ` Asdo
@ 2009-10-08  8:50         ` Holger Kiehl
  2009-10-11 12:16           ` Asdo
  0 siblings, 1 reply; 13+ messages in thread
From: Holger Kiehl @ 2009-10-08  8:50 UTC (permalink / raw)
  To: Asdo; +Cc: linux-raid

On Wed, 7 Oct 2009, Asdo wrote:

> Holger Kiehl wrote:
>> On Tue, 6 Oct 2009, Dan Williams wrote:
>> 
>>> ....
>>> By downleveling the parallelism to raid_run_ops the pathological
>>> stripe_head bouncing is eliminated.  This version still exhibits an
>>> average 11% throughput....
>>> 
>>> So we don't need a revert, this fixes up the unpredictability of the
>>> original implementation.  It surprised me that the overhead of passing
>>> raid_run_ops to the async thread pool amounted to an 11% performance
>>> regression.  In any event I think this is a better baseline for future
>>> multicore experimentation than the current implementation.
>>> 
>> Just to add some more information, I did try this patch with
>> 2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
>> performance regression. 
> Hi Holger
>
> From the above sentence it seems you get worse performance now than with the 
> original multicore implementation, while from the numbers below it seems you 
> get better performances now.
>
> Which is correct?
>
I only wanted to express that with my testing I get a higher performance
regression than in the test Dan did. So with the patch it is much better
than without, as the numbers below show.

> (BTW a performance regression higher than 100% is impossible :-) )
>
>> The tests I am doing is have several (appr. 60
>> process) sending via FTP or SFTP about 100000 small files (average size
>> below 4096 bytes) to localhost in a loop for 30 minutes. Here the
>> real numbers:
>>
>>  with multicore support enabled (with your patch)    3276.77 files per 
>> second
>>  with multicore support enabled (without your patch) 1014.47 files per 
>> second
>>  without multicore support                           7549.24 files per 
>> second
>> 
>> Holger
>
> Also, could you tell us some details about your machine and the RAID?
> Like the model of CPU (Nehalems and AMDs have much faster memory access than 
> earlier intels) and if it's a single-cpu or a dual cpu mainboard...
> Amount of RAM
>
It's a dual-CPU mainboard with two Xeon X5460s and 32 GiB of RAM.

> also: stripe_cache_size current setting for your RAID
> Raid level, number of disks, chunk size, filesystem...
>
Raid level is Raid6 over 8 disks (actually 16, which are made up of 8
HW Raid1 disks) and the chunk size is 2048k. Here is the output from /proc/mdstat:

    md4 : active raid6 sdi1[2] sdl1[5] sdj1[3] sdn1[7] sdk1[4] sdm1[6] sdh1[1] sdg1[0]
          1754480640 blocks level 6, 2048k chunk, algorithm 2 [8/8] [UUUUUUUU]

Filesystem is ext4; here are the details:

    dumpe2fs 1.41.4 (27-Jan-2009)
    Filesystem volume name:   <none>
    Last mounted on:          /home
    Filesystem UUID:          1e3ff0c9-07a0-412e-938c-b9a242ae7d42
    Filesystem magic number:  0xEF53
    Filesystem revision #:    1 (dynamic)
    Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
    Filesystem flags:         signed_directory_hash
    Default mount options:    (none)
    Filesystem state:         clean
    Errors behavior:          Continue
    Filesystem OS type:       Linux
    Inode count:              109658112
    Block count:              438620160
    Reserved block count:     0
    Free blocks:              429498966
    Free inodes:              109493636
    First block:              0
    Block size:               4096
    Fragment size:            4096
    Reserved GDT blocks:      919
    Blocks per group:         32768
    Fragments per group:      32768
    Inodes per group:         8192
    Inode blocks per group:   512
    RAID stride:              512
    RAID stripe width:        3072
    Flex block group size:    16
    Filesystem created:       Thu Sep 17 14:37:08 2009
    Last mount time:          Wed Oct  7 09:04:00 2009
    Last write time:          Wed Oct  7 09:04:00 2009
    Mount count:              12
    Maximum mount count:      -1
    Last checked:             Thu Sep 17 14:37:08 2009
    Check interval:           0 (<none>)
    Reserved blocks uid:      0 (user root)
    Reserved blocks gid:      0 (group root)
    First inode:              11
    Inode size:               256
    Required extra isize:     28
    Desired extra isize:      28
    Journal inode:            8
    Default directory hash:   half_md4
    Directory Hash Seed:      149fb3b1-2e99-46de-bd31-7031f677deb6
    Journal backup:           inode blocks
    Journal size:             768M

Thanks,
Holger


* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-08  8:50         ` Holger Kiehl
@ 2009-10-11 12:16           ` Asdo
  2009-10-11 13:17             ` Asdo
  0 siblings, 1 reply; 13+ messages in thread
From: Asdo @ 2009-10-11 12:16 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid

Holger Kiehl wrote:
> Its a dual cpu mainboard with two Xeon X5460 and 32 GiB Ram.
>
Nice machine...
Earlier intels though, with kinda slow RAM access (who knows if this has 
something to do...)
A few more observations:

>> also: stripe_cache_size current setting for your RAID
>> Raid level, number of disks, chunk size, filesystem...
> Raid level is Raid6 over 8 disks (actually 16, which are made up of 8
> HW Raid1 disks) and chunksize is 2048k. Here the output from /proc/mdstat
Ok the filesystem appears to be aligned then (I am assuming you are not 
using LVM, otherwise pls tell because there can be other tweaks)

I don't really know this multicore implementation; however, here is one 
thing you could try:

The default stripe_cache_size might be too small for a multicore 
implementation, considering that there might be many more in-flight 
operations than with single-core (and even with single core it is 
beneficial to raise it).
What is the current value? (cat /sys/block/md4/md/stripe_cache_size)
You can try:
echo 32768 > /sys/block/md4/md/stripe_cache_size

You can also try to raise /sys/block/md4/md/nr_requests by echoing a 
higher value into it, but this operation hangs up the kernel immediately 
on our machine with Ubuntu kernel 2.6.31. This is actually a bug: Neil / 
Dan, are you aware of this bug? Is it fixed in 2.6.32?

Secondly, it's surprising that it is slower than single core. How busy 
are the 8 cores you have, running on the MD threads (try htop)? If you 
see all 8 cores maxed, it's one kind of implementation problem; if you 
see just 1-2 cores maxed, it's another kind...

Maybe stack traces can be used to see where the bottleneck is... I am 
not a profiling expert actually, but you might try this:
cat /proc/12345/stack
Replace 12345 with the PID of the md kernel thread: you can probably see 
the PIDs with ps auxH | less or with htop (K option enabled).
Catting the stack many times should show where they are stuck for most 
of the time.
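
If you want to automate that, a rough user-space sampler along these
lines should also work (untested sketch; pass the md thread's PID as the
first argument, the 12345 fallback is only a placeholder):

	#include <stdio.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char path[64], buf[8192];

		snprintf(path, sizeof(path), "/proc/%s/stack",
			 argc > 1 ? argv[1] : "12345");
		for (;;) {
			FILE *f = fopen(path, "r");	/* needs root */
			size_t n;

			if (!f)
				return 1;
			n = fread(buf, 1, sizeof(buf) - 1, f);
			fclose(f);
			buf[n] = '\0';
			printf("----\n%s", buf);
			usleep(100 * 1000);	/* ~10 samples per second */
		}
	}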



* Re: [PATCH 0/3] md fixes for 2.6.32-rc
  2009-10-11 12:16           ` Asdo
@ 2009-10-11 13:17             ` Asdo
  0 siblings, 0 replies; 13+ messages in thread
From: Asdo @ 2009-10-11 13:17 UTC (permalink / raw)
  To: Holger Kiehl; +Cc: linux-raid

Asdo wrote:
> ... How busy are the 8 cores you have, running on the MD threads (try 
> htop) ...

I meant to say: "try htop with the K option enabled, i.e. press K while 
running htop, so as to see kernel threads (see help by pressing h)"

