* [PATCH 0/3] md fixes for 2.6.32-rc
@ 2009-10-02 1:18 Dan Williams
2009-10-02 1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02 1:18 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Hi Neil,
A few fixes:
1/ The multicore option is not ready for prime time
2/ A fix for raid5 init failures
3/ A reminder that the fix you sent to Greg a while back has not found
its way upstream.
Thanks,
Dan
---
Dan Williams (2):
Revert "md/raid456: distribute raid processing over multiple cores"
md/raid5: initialize conf->device_lock earlier
NeilBrown (1):
Allow sysfs_notify_dirent to be called from interrupt context.
drivers/md/Kconfig | 11 ---------
drivers/md/raid5.c | 62 +++++++++++++---------------------------------------
fs/sysfs/file.c | 14 +++++++-----
3 files changed, 23 insertions(+), 64 deletions(-)
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH 1/3] md/raid5: initialize conf->device_lock earlier
2009-10-02 1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
@ 2009-10-02 1:18 ` Dan Williams
2009-10-02 1:18 ` [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores" Dan Williams
` (2 subsequent siblings)
3 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02 1:18 UTC (permalink / raw)
To: neilb; +Cc: linux-raid
Deallocating a raid5_conf_t structure requires taking 'device_lock'.
Ensure it is initialized before it is used, i.e. initialize the lock
before attempting any further initializations that might fail.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/md/raid5.c | 25 ++++++++++++-------------
1 files changed, 12 insertions(+), 13 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9482980..c21cc50 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4722,6 +4722,18 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
conf = kzalloc(sizeof(raid5_conf_t), GFP_KERNEL);
if (conf == NULL)
goto abort;
+ spin_lock_init(&conf->device_lock);
+ init_waitqueue_head(&conf->wait_for_stripe);
+ init_waitqueue_head(&conf->wait_for_overlap);
+ INIT_LIST_HEAD(&conf->handle_list);
+ INIT_LIST_HEAD(&conf->hold_list);
+ INIT_LIST_HEAD(&conf->delayed_list);
+ INIT_LIST_HEAD(&conf->bitmap_list);
+ INIT_LIST_HEAD(&conf->inactive_list);
+ atomic_set(&conf->active_stripes, 0);
+ atomic_set(&conf->preread_active_stripes, 0);
+ atomic_set(&conf->active_aligned_reads, 0);
+ conf->bypass_threshold = BYPASS_THRESHOLD;
conf->raid_disks = mddev->raid_disks;
conf->scribble_len = scribble_len(conf->raid_disks);
@@ -4744,19 +4756,6 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
if (raid5_alloc_percpu(conf) != 0)
goto abort;
- spin_lock_init(&conf->device_lock);
- init_waitqueue_head(&conf->wait_for_stripe);
- init_waitqueue_head(&conf->wait_for_overlap);
- INIT_LIST_HEAD(&conf->handle_list);
- INIT_LIST_HEAD(&conf->hold_list);
- INIT_LIST_HEAD(&conf->delayed_list);
- INIT_LIST_HEAD(&conf->bitmap_list);
- INIT_LIST_HEAD(&conf->inactive_list);
- atomic_set(&conf->active_stripes, 0);
- atomic_set(&conf->preread_active_stripes, 0);
- atomic_set(&conf->active_aligned_reads, 0);
- conf->bypass_threshold = BYPASS_THRESHOLD;
-
pr_debug("raid5: run(%s) called.\n", mdname(mddev));
list_for_each_entry(rdev, &mddev->disks, same_set) {
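[Editor's note] The ordering bug this patch fixes can be sketched in a small userspace analog (hypothetical names, pthread mutex standing in for the kernel spinlock; this is not the raid5 code itself): the cleanup path takes the lock unconditionally, so the lock must be valid no matter how early setup bails out.

```c
#include <pthread.h>
#include <stdlib.h>

struct conf {
    pthread_mutex_t device_lock;
    void *buffers;
};

/* cleanup path: takes the lock unconditionally, so it must already
 * be initialized even if setup failed partway through */
static void free_conf(struct conf *c)
{
    pthread_mutex_lock(&c->device_lock);
    free(c->buffers);
    pthread_mutex_unlock(&c->device_lock);
    pthread_mutex_destroy(&c->device_lock);
    free(c);
}

struct conf *setup_conf(size_t len)
{
    struct conf *c = calloc(1, sizeof(*c));

    if (!c)
        return NULL;
    /* initialize the lock before anything that can fail ... */
    pthread_mutex_init(&c->device_lock, NULL);
    c->buffers = malloc(len);
    if (!c->buffers) {
        free_conf(c);   /* ... so this is always safe */
        return NULL;
    }
    return c;
}
```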
* [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores"
2009-10-02 1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
2009-10-02 1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
@ 2009-10-02 1:18 ` Dan Williams
2009-10-02 1:18 ` [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context Dan Williams
2009-10-02 4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
3 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02 1:18 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Holger Kiehl
The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest gains
(+4%). However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe(). As reported by Holger, this can fall into a
pathological situation that severely impacts throughput (a 6x performance
loss).
Revert this for now, to be resurrected once we have a more workable
thread pool.
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
drivers/md/Kconfig | 11 -----------
drivers/md/raid5.c | 37 +++----------------------------------
2 files changed, 3 insertions(+), 45 deletions(-)
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2158377..f0fac58 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -154,17 +154,6 @@ config MD_RAID456
If unsure, say Y.
-config MULTICORE_RAID456
- bool "RAID-4/RAID-5/RAID-6 Multicore processing (EXPERIMENTAL)"
- depends on MD_RAID456
- depends on SMP
- depends on EXPERIMENTAL
- ---help---
- Enable the raid456 module to dispatch per-stripe raid operations to a
- thread pool.
-
- If unsure, say N.
-
config MD_RAID6_PQ
tristate
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c21cc50..11cdfe6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -47,7 +47,6 @@
#include <linux/kthread.h>
#include <linux/raid/pq.h>
#include <linux/async_tx.h>
-#include <linux/async.h>
#include <linux/seq_file.h>
#include <linux/cpu.h>
#include "md.h"
@@ -4349,36 +4348,6 @@ static int retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
return handled;
}
-#ifdef CONFIG_MULTICORE_RAID456
-static void __process_stripe(void *param, async_cookie_t cookie)
-{
- struct stripe_head *sh = param;
-
- handle_stripe(sh);
- release_stripe(sh);
-}
-
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
- async_schedule_domain(__process_stripe, sh, domain);
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
- async_synchronize_full_domain(domain);
-}
-#else
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
- handle_stripe(sh);
- release_stripe(sh);
- cond_resched();
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-}
-#endif
/*
@@ -4393,7 +4362,6 @@ static void raid5d(mddev_t *mddev)
struct stripe_head *sh;
raid5_conf_t *conf = mddev->private;
int handled;
- LIST_HEAD(raid_domain);
pr_debug("+++ raid5d active\n");
@@ -4430,7 +4398,9 @@ static void raid5d(mddev_t *mddev)
spin_unlock_irq(&conf->device_lock);
handled++;
- process_stripe(sh, &raid_domain);
+ handle_stripe(sh);
+ release_stripe(sh);
+ cond_resched();
spin_lock_irq(&conf->device_lock);
}
@@ -4438,7 +4408,6 @@ static void raid5d(mddev_t *mddev)
spin_unlock_irq(&conf->device_lock);
- synchronize_stripe_processing(&raid_domain);
async_tx_issue_pending_all();
unplug_slaves(mddev);
* [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context.
2009-10-02 1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
2009-10-02 1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
2009-10-02 1:18 ` [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores" Dan Williams
@ 2009-10-02 1:18 ` Dan Williams
2009-10-02 4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
3 siblings, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-02 1:18 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, Joel Andres Granados, stable
From: NeilBrown <neilb@suse.de>
sysfs_notify_dirent is a simple atomic operation that can be used to
alert user-space that new data can be read from a sysfs attribute.
Unfortunately, it cannot currently be called from non-process context
because of its use of spin_lock, which is sometimes taken with
interrupts enabled.
So change all lockers of sysfs_open_dirent_lock to disable interrupts,
thus making sysfs_notify_dirent safe to be called from non-process
context (as drivers/md does in md_safemode_timeout).
sysfs_get_open_dirent is (documented as being) only called from
process context, so it uses spin_lock_irq. Other places
use spin_lock_irqsave.
The usage for sysfs_notify_dirent in md_safemode_timeout was
introduced in 2.6.28, so this patch is suitable for that and more
recent kernels.
Cc: <stable@kernel.org>
Reported-by: Joel Andres Granados <jgranado@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
fs/sysfs/file.c | 14 ++++++++------
1 files changed, 8 insertions(+), 6 deletions(-)
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 561a9c0..f5ea468 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -268,7 +268,7 @@ static int sysfs_get_open_dirent(struct sysfs_dirent *sd,
struct sysfs_open_dirent *od, *new_od = NULL;
retry:
- spin_lock(&sysfs_open_dirent_lock);
+ spin_lock_irq(&sysfs_open_dirent_lock);
if (!sd->s_attr.open && new_od) {
sd->s_attr.open = new_od;
@@ -281,7 +281,7 @@ static int sysfs_get_open_dirent(struct sysfs_dirent *sd,
list_add_tail(&buffer->list, &od->buffers);
}
- spin_unlock(&sysfs_open_dirent_lock);
+ spin_unlock_irq(&sysfs_open_dirent_lock);
if (od) {
kfree(new_od);
@@ -315,8 +315,9 @@ static void sysfs_put_open_dirent(struct sysfs_dirent *sd,
struct sysfs_buffer *buffer)
{
struct sysfs_open_dirent *od = sd->s_attr.open;
+ unsigned long flags;
- spin_lock(&sysfs_open_dirent_lock);
+ spin_lock_irqsave(&sysfs_open_dirent_lock, flags);
list_del(&buffer->list);
if (atomic_dec_and_test(&od->refcnt))
@@ -324,7 +325,7 @@ static void sysfs_put_open_dirent(struct sysfs_dirent *sd,
else
od = NULL;
- spin_unlock(&sysfs_open_dirent_lock);
+ spin_unlock_irqrestore(&sysfs_open_dirent_lock, flags);
kfree(od);
}
@@ -456,8 +457,9 @@ static unsigned int sysfs_poll(struct file *filp, poll_table *wait)
void sysfs_notify_dirent(struct sysfs_dirent *sd)
{
struct sysfs_open_dirent *od;
+ unsigned long flags;
- spin_lock(&sysfs_open_dirent_lock);
+ spin_lock_irqsave(&sysfs_open_dirent_lock, flags);
od = sd->s_attr.open;
if (od) {
@@ -465,7 +467,7 @@ void sysfs_notify_dirent(struct sysfs_dirent *sd)
wake_up_interruptible(&od->poll);
}
- spin_unlock(&sysfs_open_dirent_lock);
+ spin_unlock_irqrestore(&sysfs_open_dirent_lock, flags);
}
EXPORT_SYMBOL_GPL(sysfs_notify_dirent);
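[Editor's note] The locking rule the patch applies can be modeled in userspace (hypothetical API, not the kernel's; interrupt state is just a flag here): paths only ever entered from process context may use the _irq variants, which unconditionally re-enable interrupts on unlock, while anything reachable from timer/interrupt context must use _irqsave/_irqrestore so the caller's interrupt state is preserved.

```c
#include <stdbool.h>

static bool irqs_enabled = true;

/* process-context-only variants: unlock blindly re-enables interrupts */
static void spin_lock_irq(void)   { irqs_enabled = false; /* take lock */ }
static void spin_unlock_irq(void) { /* drop lock */ irqs_enabled = true; }

/* interrupt-safe variants: save and restore the caller's state */
static void spin_lock_irqsave(unsigned long *flags)
{
    *flags = irqs_enabled;
    irqs_enabled = false;   /* take lock with interrupts off */
}

static void spin_unlock_irqrestore(unsigned long flags)
{
    irqs_enabled = flags;   /* drop lock, restore caller's state */
}

/* analog of sysfs_notify_dirent: may run with interrupts already off
 * (e.g. from a timer), so it must not blindly re-enable them */
static void notify(void)
{
    unsigned long flags;

    spin_lock_irqsave(&flags);
    /* ... wake any pollers ... */
    spin_unlock_irqrestore(flags);
}
```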
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-02 1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
` (2 preceding siblings ...)
2009-10-02 1:18 ` [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context Dan Williams
@ 2009-10-02 4:13 ` Neil Brown
2009-10-03 15:54 ` Dan Williams
2009-10-07 0:36 ` Dan Williams
3 siblings, 2 replies; 13+ messages in thread
From: Neil Brown @ 2009-10-02 4:13 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid
On Thursday October 1, dan.j.williams@intel.com wrote:
> Hi Neil,
>
> A few fixes:
> 1/ The multicore option is not ready for prime time
But it is already marked experimental...
So do we really need to revert? Or is the current code broken beyond
repair?
> 2/ A fix for raid5 init failures
Yep, thanks!
> 3/ A reminder that the fix you sent to Greg a while back has not found
> its way upstream.
I've nagged Greg about this, thanks.
NeilBrown
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-02 4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
@ 2009-10-03 15:54 ` Dan Williams
2009-10-07 0:36 ` Dan Williams
1 sibling, 0 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-03 15:54 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
Neil Brown wrote:
> On Thursday October 1, dan.j.williams@intel.com wrote:
>> Hi Neil,
>>
>> A few fixes:
>> 1/ The multicore option is not ready for prime time
>
> But it is already marked experimental...
> So do we really need to revert? or is the current code broken beyond
> repair?
On second thought, perhaps it is salvageable. I'll take a look at moving
the parallelism down from handle_stripe() to raid_run_ops().
--
Dan
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-02 4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
2009-10-03 15:54 ` Dan Williams
@ 2009-10-07 0:36 ` Dan Williams
2009-10-07 4:34 ` Neil Brown
2009-10-07 12:05 ` Holger Kiehl
1 sibling, 2 replies; 13+ messages in thread
From: Dan Williams @ 2009-10-07 0:36 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Sat, 3 Oct 2009 13:47:05 -0700
Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest gains
(+4%). However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe(). As reported by Holger, this can fall into a
pathological situation that severely impacts throughput (a 6x performance
loss).
By downleveling the parallelism to raid_run_ops, the pathological
stripe_head bouncing is eliminated. This version still exhibits an
average 11% throughput loss for:
mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
echo 1024 > /sys/block/md0/md/stripe_cache_size
dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
...but the results are at least stable and can be used as a base for
further multicore experimentation.
Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
On Thu, 2009-10-01 at 21:13 -0700, Neil Brown wrote:
> On Thursday October 1, dan.j.williams@intel.com wrote:
> > Hi Neil,
> >
> > A few fixes:
> > 1/ The multicore option is not ready for prime time
>
> But it is already marked experimental...
> So do we really need to revert? or is the current code broken beyond
> repair?
So we don't need a revert; this fixes up the unpredictability of the
original implementation. It surprised me that the overhead of passing
raid_run_ops to the async thread pool amounted to an 11% performance
regression. In any event I think this is a better baseline for future
multicore experimentation than the current implementation.
--
Dan
drivers/md/raid5.c | 72 ++++++++++++++++++++++++++-------------------------
drivers/md/raid5.h | 12 ++++++++-
2 files changed, 48 insertions(+), 36 deletions(-)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c21cc50..6b4a09f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1139,7 +1139,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
&sh->ops.zero_sum_result, percpu->spare_page, &submit);
}
-static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
+static void __raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
{
int overlap_clear = 0, i, disks = sh->disks;
struct dma_async_tx_descriptor *tx = NULL;
@@ -1204,6 +1204,36 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
put_cpu();
}
+#ifdef CONFIG_MULTICORE_RAID456
+static void async_run_ops(void *param, async_cookie_t cookie)
+{
+ struct stripe_head *sh = param;
+ unsigned long ops_request = sh->ops.request;
+
+ clear_bit_unlock(STRIPE_OPS_REQ_PENDING, &sh->state);
+ wake_up(&sh->ops.wait_for_ops);
+
+ __raid_run_ops(sh, ops_request);
+ release_stripe(sh);
+}
+
+static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
+{
+ /* since handle_stripe can be called outside of raid5d context
+ * we need to ensure sh->ops.request is de-staged before another
+ * request arrives
+ */
+ wait_event(sh->ops.wait_for_ops,
+ !test_and_set_bit_lock(STRIPE_OPS_REQ_PENDING, &sh->state));
+ sh->ops.request = ops_request;
+
+ atomic_inc(&sh->count);
+ async_schedule(async_run_ops, sh);
+}
+#else
+#define raid_run_ops __raid_run_ops
+#endif
+
static int grow_one_stripe(raid5_conf_t *conf)
{
struct stripe_head *sh;
@@ -1213,6 +1243,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
sh->raid_conf = conf;
spin_lock_init(&sh->lock);
+ #ifdef CONFIG_MULTICORE_RAID456
+ init_waitqueue_head(&sh->ops.wait_for_ops);
+ #endif
if (grow_buffers(sh, conf->raid_disks)) {
shrink_buffers(sh, conf->raid_disks);
@@ -4349,37 +4382,6 @@ static int retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
return handled;
}
-#ifdef CONFIG_MULTICORE_RAID456
-static void __process_stripe(void *param, async_cookie_t cookie)
-{
- struct stripe_head *sh = param;
-
- handle_stripe(sh);
- release_stripe(sh);
-}
-
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
- async_schedule_domain(__process_stripe, sh, domain);
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
- async_synchronize_full_domain(domain);
-}
-#else
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
- handle_stripe(sh);
- release_stripe(sh);
- cond_resched();
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-}
-#endif
-
/*
* This is our raid5 kernel thread.
@@ -4393,7 +4395,6 @@ static void raid5d(mddev_t *mddev)
struct stripe_head *sh;
raid5_conf_t *conf = mddev->private;
int handled;
- LIST_HEAD(raid_domain);
pr_debug("+++ raid5d active\n");
@@ -4430,7 +4431,9 @@ static void raid5d(mddev_t *mddev)
spin_unlock_irq(&conf->device_lock);
handled++;
- process_stripe(sh, &raid_domain);
+ handle_stripe(sh);
+ release_stripe(sh);
+ cond_resched();
spin_lock_irq(&conf->device_lock);
}
@@ -4438,7 +4441,6 @@ static void raid5d(mddev_t *mddev)
spin_unlock_irq(&conf->device_lock);
- synchronize_stripe_processing(&raid_domain);
async_tx_issue_pending_all();
unplug_slaves(mddev);
diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 2390e0e..dcefdc9 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -214,12 +214,20 @@ struct stripe_head {
int disks; /* disks in stripe */
enum check_states check_state;
enum reconstruct_states reconstruct_state;
- /* stripe_operations
+ /**
+ * struct stripe_operations
* @target - STRIPE_OP_COMPUTE_BLK target
+ * @target2 - 2nd compute target in the raid6 case
+ * @zero_sum_result - P and Q verification flags
+ * @request - async service request flags for raid_run_ops
*/
struct stripe_operations {
int target, target2;
enum sum_check_flags zero_sum_result;
+ #ifdef CONFIG_MULTICORE_RAID456
+ unsigned long request;
+ wait_queue_head_t wait_for_ops;
+ #endif
} ops;
struct r5dev {
struct bio req;
@@ -294,6 +302,8 @@ struct r6_state {
#define STRIPE_FULL_WRITE 13 /* all blocks are set to be overwritten */
#define STRIPE_BIOFILL_RUN 14
#define STRIPE_COMPUTE_RUN 15
+#define STRIPE_OPS_REQ_PENDING 16
+
/*
* Operation request flags
*/
--
1.6.0.6
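[Editor's note] The single-slot handoff that raid_run_ops gates with wait_event + test_and_set_bit_lock can be sketched as a userspace analog (pthreads, hypothetical names): sh->ops.request is one slot, so a submitter must wait until the async worker has de-staged the previous request before storing a new one.

```c
#include <pthread.h>

struct ops_slot {
    pthread_mutex_t lock;
    pthread_cond_t  drained;
    int pending;                 /* models STRIPE_OPS_REQ_PENDING */
    unsigned long request;       /* models sh->ops.request */
};

static void slot_init(struct ops_slot *s)
{
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->drained, NULL);
    s->pending = 0;
}

/* submitter side: wait for the slot to drain, then stage the request */
static void submit(struct ops_slot *s, unsigned long req)
{
    pthread_mutex_lock(&s->lock);
    while (s->pending)
        pthread_cond_wait(&s->drained, &s->lock);
    s->pending = 1;
    s->request = req;
    pthread_mutex_unlock(&s->lock);
}

/* async worker side: de-stage the request, wake any waiting submitter */
static unsigned long destage(struct ops_slot *s)
{
    unsigned long req;

    pthread_mutex_lock(&s->lock);
    req = s->request;
    s->pending = 0;
    pthread_cond_signal(&s->drained);
    pthread_mutex_unlock(&s->lock);
    return req;
}
```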
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-07 0:36 ` Dan Williams
@ 2009-10-07 4:34 ` Neil Brown
2009-10-07 12:05 ` Holger Kiehl
1 sibling, 0 replies; 13+ messages in thread
From: Neil Brown @ 2009-10-07 4:34 UTC (permalink / raw)
To: Dan Williams; +Cc: linux-raid
On Tuesday October 6, dan.j.williams@intel.com wrote:
> >From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Sat, 3 Oct 2009 13:47:05 -0700
> Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
>
> The percpu conversion allowed a straightforward handoff of stripe
> processing to the async subsytem that initially showed some modest gains
> (+4%). However, this model is too simplistic and leads to stripes
> bouncing between raid5d and the async thread pool for every invocation
> of handle_stripe(). As reported by Holger this can fall into a
> pathological situation severely impacting throughput (6x performance
> loss).
>
> By downleveling the parallelism to raid_run_ops the pathological
> stripe_head bouncing is eliminated. This version still exhibits an
> average 11% throughput loss for:
>
> mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
> echo 1024 > /sys/block/md0/md/stripe_cache_size
> dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
>
> ...but the results are at least stable and can be used as a base for
> further multicore experimentation.
Thanks.
One little change needed:
> static int grow_one_stripe(raid5_conf_t *conf)
> {
> struct stripe_head *sh;
> @@ -1213,6 +1243,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
> memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
> sh->raid_conf = conf;
> spin_lock_init(&sh->lock);
> + #ifdef CONFIG_MULTICORE_RAID456
> + init_waitqueue_head(&sh->ops.wait_for_ops);
> + #endif
>
> if (grow_buffers(sh, conf->raid_disks)) {
> shrink_buffers(sh, conf->raid_disks);
This addition is needed in resize_stripes too.
NeilBrown
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-07 0:36 ` Dan Williams
2009-10-07 4:34 ` Neil Brown
@ 2009-10-07 12:05 ` Holger Kiehl
2009-10-07 18:33 ` Asdo
1 sibling, 1 reply; 13+ messages in thread
From: Holger Kiehl @ 2009-10-07 12:05 UTC (permalink / raw)
To: Dan Williams; +Cc: Neil Brown, linux-raid
On Tue, 6 Oct 2009, Dan Williams wrote:
>> From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Sat, 3 Oct 2009 13:47:05 -0700
> Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
>
> The percpu conversion allowed a straightforward handoff of stripe
> processing to the async subsytem that initially showed some modest gains
> (+4%). However, this model is too simplistic and leads to stripes
> bouncing between raid5d and the async thread pool for every invocation
> of handle_stripe(). As reported by Holger this can fall into a
> pathological situation severely impacting throughput (6x performance
> loss).
>
> By downleveling the parallelism to raid_run_ops the pathological
> stripe_head bouncing is eliminated. This version still exhibits an
> average 11% throughput loss for:
>
> mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
> echo 1024 > /sys/block/md0/md/stripe_cache_size
> dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
>
> ...but the results are at least stable and can be used as a base for
> further multicore experimentation.
>
> Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> On Thu, 2009-10-01 at 21:13 -0700, Neil Brown wrote:
>> On Thursday October 1, dan.j.williams@intel.com wrote:
>>> Hi Neil,
>>>
>>> A few fixes:
>>> 1/ The multicore option is not ready for prime time
>>
>> But it is already marked experimental...
>> So do we really need to revert? or is the current code broken beyond
>> repair?
>
> So we don't need a revert, this fixes up the unpredictability of the
> original implementation. It surprised me that the overhead of passing
> raid_run_ops to the async thread pool amounted to an 11% performance
> regression. In any event I think this is a better baseline for future
> multicore experimentation than the current implementation.
>
Just to add some more information: I did try this patch with
2.6.32-rc3-git1, and with the testing I am doing I get appr. 125%
performance regression. The test I am doing has several (appr. 60)
processes sending about 100000 small files (average size below 4096
bytes) via FTP or SFTP to localhost in a loop for 30 minutes. Here are
the real numbers:
with multicore support enabled (with your patch) 3276.77 files per second
with multicore support enabled (without your patch) 1014.47 files per second
without multicore support 7549.24 files per second
Holger
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-07 12:05 ` Holger Kiehl
@ 2009-10-07 18:33 ` Asdo
2009-10-08 8:50 ` Holger Kiehl
0 siblings, 1 reply; 13+ messages in thread
From: Asdo @ 2009-10-07 18:33 UTC (permalink / raw)
To: Holger Kiehl, linux-raid
Holger Kiehl wrote:
> On Tue, 6 Oct 2009, Dan Williams wrote:
>
>> ....
>> By downleveling the parallelism to raid_run_ops the pathological
>> stripe_head bouncing is eliminated. This version still exhibits an
>> average 11% throughput....
>>
>> So we don't need a revert, this fixes up the unpredictability of the
>> original implementation. It surprised me that the overhead of passing
>> raid_run_ops to the async thread pool amounted to an 11% performance
>> regression. In any event I think this is a better baseline for future
>> multicore experimentation than the current implementation.
>>
> Just to add some more information, I did try this patch with
> 2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
> performance regression.
Hi Holger
From the above sentence it seems you get worse performance now than
with the original multicore implementation, while from the numbers below
it seems you get better performance now.
Which is correct?
(BTW a performance regression higher than 100% is impossible :-) )
> The tests I am doing is have several (appr. 60
> process) sending via FTP or SFTP about 100000 small files (average size
> below 4096 bytes) to localhost in a loop for 30 minutes. Here the
> real numbers:
>
> with multicore support enabled (with your patch) 3276.77 files per
> second
> with multicore support enabled (without your patch) 1014.47 files per
> second
> without multicore support 7549.24 files per
> second
>
> Holger
Also, could you tell us some details about your machine and the RAID?
Like the model of CPU (Nehalems and AMDs have much faster memory access
than earlier intels) and if it's a single-cpu or a dual cpu mainboard...
Amount of RAM
also: stripe_cache_size current setting for your RAID
Raid level, number of disks, chunk size, filesystem...
Thank you!
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-07 18:33 ` Asdo
@ 2009-10-08 8:50 ` Holger Kiehl
2009-10-11 12:16 ` Asdo
0 siblings, 1 reply; 13+ messages in thread
From: Holger Kiehl @ 2009-10-08 8:50 UTC (permalink / raw)
To: Asdo; +Cc: linux-raid
On Wed, 7 Oct 2009, Asdo wrote:
> Holger Kiehl wrote:
>> On Tue, 6 Oct 2009, Dan Williams wrote:
>>
>>> ....
>>> By downleveling the parallelism to raid_run_ops the pathological
>>> stripe_head bouncing is eliminated. This version still exhibits an
>>> average 11% throughput....
>>>
>>> So we don't need a revert, this fixes up the unpredictability of the
>>> original implementation. It surprised me that the overhead of passing
>>> raid_run_ops to the async thread pool amounted to an 11% performance
>>> regression. In any event I think this is a better baseline for future
>>> multicore experimentation than the current implementation.
>>>
>> Just to add some more information, I did try this patch with
>> 2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
>> performance regression.
> Hi Holger
>
> From the above sentence it seems you get worse performance now than with the
> original multicore implementation, while from the numbers below it seems you
> get better performances now.
>
> Which is correct?
>
I only wanted to express that with my testing I get a higher performance
regression than in the test Dan did. So with the patch it is much better
than without, as the numbers below show.
> (BTW a performance regression higher than 100% is impossible :-) )
>
>> The tests I am doing is have several (appr. 60
>> process) sending via FTP or SFTP about 100000 small files (average size
>> below 4096 bytes) to localhost in a loop for 30 minutes. Here the
>> real numbers:
>>
>> with multicore support enabled (with your patch) 3276.77 files per
>> second
>> with multicore support enabled (without your patch) 1014.47 files per
>> second
>> without multicore support 7549.24 files per
>> second
>>
>> Holger
>
> Also, could you tell us some details about your machine and the RAID?
> Like the model of CPU (Nehalems and AMDs have much faster memory access than
> earlier intels) and if it's a single-cpu or a dual cpu mainboard...
> Amount of RAM
>
It's a dual-CPU mainboard with two Xeon X5460s and 32 GiB of RAM.
> also: stripe_cache_size current setting for your RAID
> Raid level, number of disks, chunk size, filesystem...
>
The RAID level is RAID6 over 8 disks (actually 16, made up as 8 HW
RAID1 pairs) and the chunk size is 2048k. Here is the output from /proc/mdstat:
md4 : active raid6 sdi1[2] sdl1[5] sdj1[3] sdn1[7] sdk1[4] sdm1[6] sdh1[1] sdg1[0]
1754480640 blocks level 6, 2048k chunk, algorithm 2 [8/8] [UUUUUUUU]
The filesystem is ext4; here are the details:
dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name: <none>
Last mounted on: /home
Filesystem UUID: 1e3ff0c9-07a0-412e-938c-b9a242ae7d42
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 109658112
Block count: 438620160
Reserved block count: 0
Free blocks: 429498966
Free inodes: 109493636
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 919
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 512
RAID stripe width: 3072
Flex block group size: 16
Filesystem created: Thu Sep 17 14:37:08 2009
Last mount time: Wed Oct 7 09:04:00 2009
Last write time: Wed Oct 7 09:04:00 2009
Mount count: 12
Maximum mount count: -1
Last checked: Thu Sep 17 14:37:08 2009
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 149fb3b1-2e99-46de-bd31-7031f677deb6
Journal backup: inode blocks
Journal size: 768M
Thanks,
Holger
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-08 8:50 ` Holger Kiehl
@ 2009-10-11 12:16 ` Asdo
2009-10-11 13:17 ` Asdo
0 siblings, 1 reply; 13+ messages in thread
From: Asdo @ 2009-10-11 12:16 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid
Holger Kiehl wrote:
> Its a dual cpu mainboard with two Xeon X5460 and 32 GiB Ram.
>
Nice machine...
Earlier Intels though, with kinda slow RAM access (who knows if this
has something to do with it...)
A few more observations:
>> also: stripe_cache_size current setting for your RAID
>> Raid level, number of disks, chunk size, filesystem...
> Raid level is Raid6 over 8 disks (actually 16, which are made up of 8
> HW Raid1 disks) and chunksize is 2048k. Here the output from /proc/mdstat
Ok, the filesystem appears to be aligned then. (I am assuming you are
not using LVM; otherwise please tell us, because there can be other tweaks.)
I don't really know this multicore implementation; however, here is
one thing you could try:
The default stripe_cache_size might be too small for a multicore
implementation, considering that there might be many more in-flight
operations than with single-core (and even with single core it is
beneficial to raise it).
What is the current value? (cat /sys/block/md4/md/stripe_cache_size)
You can try:
echo 32768 > /sys/block/md4/md/stripe_cache_size
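[Editor's note] Before raising it, the memory footprint is worth a back-of-envelope check. The sketch below uses hypothetical values and assumes the usual cost of one 4 KiB page per member device per cached stripe (per-stripe bookkeeping overhead is not counted):

```shell
# Rough memory cost of a stripe_cache_size setting
stripe_cache_size=32768
raid_disks=8
echo "$(( stripe_cache_size * raid_disks * 4 / 1024 )) MiB"
# prints "1024 MiB"
```

So 32768 stripes on an 8-disk array pins about a gigabyte; scale the number to what the machine can spare.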
You can also try to raise /sys/block/md4/md/nr_requests by echoing a
higher value into it, but this operation hangs the kernel immediately on
our machine with Ubuntu kernel 2.6.31. This is actually a bug: Neil /
Dan, are you aware of it? Is it fixed in 2.6.32?
Secondly, it's surprising that it is slower than single core. How busy
are the 8 cores you have when running the MD threads (try htop)? If you
see all 8 cores maxed, it's one kind of implementation problem; if you
see just 1-2 cores maxed, it's another kind...
Maybe stack traces can be used to see where the bottleneck is... I am
not a profiling expert actually, but you might try this:
cat /proc/12345/stack
(replace 12345 with the PIDs of the md kernel threads; you can probably
see them with ps auxH | less or with htop with the K option enabled)
Catting the stack many times should show where they are stuck most of
the time.
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
2009-10-11 12:16 ` Asdo
@ 2009-10-11 13:17 ` Asdo
0 siblings, 0 replies; 13+ messages in thread
From: Asdo @ 2009-10-11 13:17 UTC (permalink / raw)
To: Holger Kiehl; +Cc: linux-raid
Asdo wrote:
> ... How busy are the 8 cores you have, running on the MD threads (try
> htop) ...
I meant to say: "try htop with the K option enabled, i.e. press K while
running htop, so as to see kernel threads (see help by pressing h)"