* [PATCH 0/3] md fixes for 2.6.32-rc
From: Dan Williams @ 2009-10-02  1:18 UTC
To: neilb; +Cc: linux-raid

Hi Neil,

A few fixes:
1/ The multicore option is not ready for prime time
2/ A fix for raid5 init failures
3/ A reminder that the fix you sent to Greg a while back has not found
   its way upstream.

Thanks,
Dan

---

Dan Williams (2):
      Revert "md/raid456: distribute raid processing over multiple cores"
      md/raid5: initialize conf->device_lock earlier

NeilBrown (1):
      Allow sysfs_notify_dirent to be called from interrupt context.

 drivers/md/Kconfig |   11 ---------
 drivers/md/raid5.c |   62 +++++++++++++---------------------------------------
 fs/sysfs/file.c    |   14 +++++++-----
 3 files changed, 23 insertions(+), 64 deletions(-)
* [PATCH 1/3] md/raid5: initialize conf->device_lock earlier
From: Dan Williams @ 2009-10-02  1:18 UTC
To: neilb; +Cc: linux-raid

Deallocating a raid5_conf_t structure requires taking 'device_lock'.
Ensure it is initialized before it is used, i.e. initialize the lock
before attempting any further initializations that might fail.

Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/md/raid5.c |   25 ++++++++++++-------------
 1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 9482980..c21cc50 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -4722,6 +4722,18 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 	conf = kzalloc(sizeof(raid5_conf_t), GFP_KERNEL);
 	if (conf == NULL)
 		goto abort;
+	spin_lock_init(&conf->device_lock);
+	init_waitqueue_head(&conf->wait_for_stripe);
+	init_waitqueue_head(&conf->wait_for_overlap);
+	INIT_LIST_HEAD(&conf->handle_list);
+	INIT_LIST_HEAD(&conf->hold_list);
+	INIT_LIST_HEAD(&conf->delayed_list);
+	INIT_LIST_HEAD(&conf->bitmap_list);
+	INIT_LIST_HEAD(&conf->inactive_list);
+	atomic_set(&conf->active_stripes, 0);
+	atomic_set(&conf->preread_active_stripes, 0);
+	atomic_set(&conf->active_aligned_reads, 0);
+	conf->bypass_threshold = BYPASS_THRESHOLD;

 	conf->raid_disks = mddev->raid_disks;
 	conf->scribble_len = scribble_len(conf->raid_disks);
@@ -4744,19 +4756,6 @@ static raid5_conf_t *setup_conf(mddev_t *mddev)
 	if (raid5_alloc_percpu(conf) != 0)
 		goto abort;

-	spin_lock_init(&conf->device_lock);
-	init_waitqueue_head(&conf->wait_for_stripe);
-	init_waitqueue_head(&conf->wait_for_overlap);
-	INIT_LIST_HEAD(&conf->handle_list);
-	INIT_LIST_HEAD(&conf->hold_list);
-	INIT_LIST_HEAD(&conf->delayed_list);
-	INIT_LIST_HEAD(&conf->bitmap_list);
-	INIT_LIST_HEAD(&conf->inactive_list);
-	atomic_set(&conf->active_stripes, 0);
-	atomic_set(&conf->preread_active_stripes, 0);
-	atomic_set(&conf->active_aligned_reads, 0);
-	conf->bypass_threshold = BYPASS_THRESHOLD;
-
 	pr_debug("raid5: run(%s) called.\n", mdname(mddev));

 	list_for_each_entry(rdev, &mddev->disks, same_set) {
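For context, the ordering matters because the abort path deallocates the
half-built raid5_conf_t, and (as the changelog notes) that teardown takes
'device_lock'. A minimal, hypothetical sketch of the pattern -- the
structure and helper names below are invented for illustration, not taken
from raid5.c:

	/* Hypothetical illustration of the init-ordering bug fixed above. */
	struct foo {
		spinlock_t lock;
		/* ... */
	};

	static struct foo *setup_foo(void)
	{
		struct foo *conf = kzalloc(sizeof(*conf), GFP_KERNEL);

		if (!conf)
			return NULL;
		spin_lock_init(&conf->lock);	/* must precede any 'goto abort' */

		if (allocate_resources(conf))	/* any later step may fail */
			goto abort;
		return conf;

	abort:
		free_foo(conf);			/* takes conf->lock while tearing down */
		return NULL;
	}

If the lock were initialized only after the fallible steps, an early
failure would hand an uninitialized spinlock to the teardown path.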
* [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores"
From: Dan Williams @ 2009-10-02  1:18 UTC
To: neilb; +Cc: linux-raid, Holger Kiehl

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest
gains (+4%). However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe(). As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

Revert this for now, to be resurrected once we have a more workable
thread pool.

Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/md/Kconfig |   11 -----------
 drivers/md/raid5.c |   37 +++----------------------------------
 2 files changed, 3 insertions(+), 45 deletions(-)

diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 2158377..f0fac58 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -154,17 +154,6 @@ config MD_RAID456

 	  If unsure, say Y.

-config MULTICORE_RAID456
-	bool "RAID-4/RAID-5/RAID-6 Multicore processing (EXPERIMENTAL)"
-	depends on MD_RAID456
-	depends on SMP
-	depends on EXPERIMENTAL
-	---help---
-	  Enable the raid456 module to dispatch per-stripe raid operations to a
-	  thread pool.
-
-	  If unsure, say N.
-
 config MD_RAID6_PQ
 	tristate

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c21cc50..11cdfe6 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -47,7 +47,6 @@
 #include <linux/kthread.h>
 #include <linux/raid/pq.h>
 #include <linux/async_tx.h>
-#include <linux/async.h>
 #include <linux/seq_file.h>
 #include <linux/cpu.h>
 #include "md.h"
@@ -4349,36 +4348,6 @@ static int retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	return handled;
 }

-#ifdef CONFIG_MULTICORE_RAID456
-static void __process_stripe(void *param, async_cookie_t cookie)
-{
-	struct stripe_head *sh = param;
-
-	handle_stripe(sh);
-	release_stripe(sh);
-}
-
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	async_schedule_domain(__process_stripe, sh, domain);
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-	async_synchronize_full_domain(domain);
-}
-#else
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	handle_stripe(sh);
-	release_stripe(sh);
-	cond_resched();
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-}
-#endif
 /*
  * This is our raid5 kernel thread.
@@ -4393,7 +4362,6 @@ static void raid5d(mddev_t *mddev)
 	struct stripe_head *sh;
 	raid5_conf_t *conf = mddev->private;
 	int handled;
-	LIST_HEAD(raid_domain);

 	pr_debug("+++ raid5d active\n");
@@ -4430,7 +4398,9 @@ static void raid5d(mddev_t *mddev)
 			spin_unlock_irq(&conf->device_lock);

 		handled++;
-		process_stripe(sh, &raid_domain);
+		handle_stripe(sh);
+		release_stripe(sh);
+		cond_resched();

 		spin_lock_irq(&conf->device_lock);
 	}
@@ -4438,7 +4408,6 @@ static void raid5d(mddev_t *mddev)

 	spin_unlock_irq(&conf->device_lock);

-	synchronize_stripe_processing(&raid_domain);
 	async_tx_issue_pending_all();
 	unplug_slaves(mddev);
* [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context.
From: Dan Williams @ 2009-10-02  1:18 UTC
To: neilb; +Cc: linux-raid, Joel Andres Granados, stable

From: NeilBrown <neilb@suse.de>

sysfs_notify_dirent is a simple atomic operation that can be used to
alert user-space that new data can be read from a sysfs attribute.

Unfortunately it cannot currently be called from non-process context
because of its use of spin_lock, which is sometimes taken with
interrupts enabled.

So change all lockers of sysfs_open_dirent_lock to disable interrupts,
thus making sysfs_notify_dirent safe to be called from non-process
context (as drivers/md does in md_safemode_timeout).

sysfs_get_open_dirent is (documented as being) only called from process
context, so it uses spin_lock_irq. Other places use spin_lock_irqsave.

The usage of sysfs_notify_dirent in md_safemode_timeout was introduced
in 2.6.28, so this patch is suitable for that and more recent kernels.

Cc: <stable@kernel.org>
Reported-by: Joel Andres Granados <jgranado@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 fs/sysfs/file.c |   14 ++++++++------
 1 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 561a9c0..f5ea468 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -268,7 +268,7 @@ static int sysfs_get_open_dirent(struct sysfs_dirent *sd,
 	struct sysfs_open_dirent *od, *new_od = NULL;

  retry:
-	spin_lock(&sysfs_open_dirent_lock);
+	spin_lock_irq(&sysfs_open_dirent_lock);

 	if (!sd->s_attr.open && new_od) {
 		sd->s_attr.open = new_od;
@@ -281,7 +281,7 @@ static int sysfs_get_open_dirent(struct sysfs_dirent *sd,
 		list_add_tail(&buffer->list, &od->buffers);
 	}

-	spin_unlock(&sysfs_open_dirent_lock);
+	spin_unlock_irq(&sysfs_open_dirent_lock);

 	if (od) {
 		kfree(new_od);
@@ -315,8 +315,9 @@ static void sysfs_put_open_dirent(struct sysfs_dirent *sd,
 				  struct sysfs_buffer *buffer)
 {
 	struct sysfs_open_dirent *od = sd->s_attr.open;
+	unsigned long flags;

-	spin_lock(&sysfs_open_dirent_lock);
+	spin_lock_irqsave(&sysfs_open_dirent_lock, flags);

 	list_del(&buffer->list);
 	if (atomic_dec_and_test(&od->refcnt))
@@ -324,7 +325,7 @@ static void sysfs_put_open_dirent(struct sysfs_dirent *sd,
 	else
 		od = NULL;

-	spin_unlock(&sysfs_open_dirent_lock);
+	spin_unlock_irqrestore(&sysfs_open_dirent_lock, flags);

 	kfree(od);
 }
@@ -456,8 +457,9 @@ static unsigned int sysfs_poll(struct file *filp, poll_table *wait)
 void sysfs_notify_dirent(struct sysfs_dirent *sd)
 {
 	struct sysfs_open_dirent *od;
+	unsigned long flags;

-	spin_lock(&sysfs_open_dirent_lock);
+	spin_lock_irqsave(&sysfs_open_dirent_lock, flags);

 	od = sd->s_attr.open;
 	if (od) {
@@ -465,7 +467,7 @@ void sysfs_notify_dirent(struct sysfs_dirent *sd)
 		wake_up_interruptible(&od->poll);
 	}

-	spin_unlock(&sysfs_open_dirent_lock);
+	spin_unlock_irqrestore(&sysfs_open_dirent_lock, flags);
 }
 EXPORT_SYMBOL_GPL(sysfs_notify_dirent);
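As an aside, the deadlock that the plain spin_lock variant allows looks
roughly like the following. This is a minimal, hypothetical sketch (the
lock and function names are invented), not the actual md/sysfs call
chain:

	/* Illustration only -- not code from the patch above. */
	static DEFINE_SPINLOCK(demo_lock);

	static void process_context_path(void)
	{
		spin_lock(&demo_lock);		/* interrupts stay enabled */
		/* ... a timer/interrupt fires on this CPU here ... */
		spin_unlock(&demo_lock);
	}

	static void timer_or_irq_path(void)	/* e.g. a safemode timeout */
	{
		spin_lock(&demo_lock);		/* spins forever: this CPU
						 * already holds demo_lock */
		/* ... notify waiters ... */
		spin_unlock(&demo_lock);
	}

	static void safe_process_context_path(void)
	{
		unsigned long flags;

		/* Disabling interrupts while holding the lock closes the window. */
		spin_lock_irqsave(&demo_lock, flags);
		/* interrupt handlers on this CPU are deferred until release */
		spin_unlock_irqrestore(&demo_lock, flags);
	}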
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Neil Brown @ 2009-10-02  4:13 UTC
To: Dan Williams; +Cc: linux-raid

On Thursday October 1, dan.j.williams@intel.com wrote:
> Hi Neil,
>
> A few fixes:
> 1/ The multicore option is not ready for prime time

But it is already marked experimental...
So do we really need to revert? or is the current code broken beyond
repair?

> 2/ A fix for raid5 init failures

Yep, thanks!

> 3/ A reminder that the fix you sent to Greg a while back has not found
>    its way upstream.

I've nagged Greg about this, thanks.

NeilBrown
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Dan Williams @ 2009-10-03 15:54 UTC
To: Neil Brown; +Cc: linux-raid

Neil Brown wrote:
> On Thursday October 1, dan.j.williams@intel.com wrote:
>> Hi Neil,
>>
>> A few fixes:
>> 1/ The multicore option is not ready for prime time
>
> But it is already marked experimental...
> So do we really need to revert? or is the current code broken beyond
> repair?

On second thought perhaps it is salvageable. I'll take a look at moving
the parallelism down from handle_stripe() to raid_run_ops().

--
Dan
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Dan Williams @ 2009-10-07  0:36 UTC
To: Neil Brown; +Cc: linux-raid

From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Sat, 3 Oct 2009 13:47:05 -0700
Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops

The percpu conversion allowed a straightforward handoff of stripe
processing to the async subsystem that initially showed some modest
gains (+4%). However, this model is too simplistic and leads to stripes
bouncing between raid5d and the async thread pool for every invocation
of handle_stripe(). As reported by Holger this can fall into a
pathological situation severely impacting throughput (6x performance
loss).

By downleveling the parallelism to raid_run_ops the pathological
stripe_head bouncing is eliminated. This version still exhibits an
average 11% throughput loss for:

	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
	echo 1024 > /sys/block/md0/md/stripe_cache_size
	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048

...but the results are at least stable and can be used as a base for
further multicore experimentation.

Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
On Thu, 2009-10-01 at 21:13 -0700, Neil Brown wrote:
> On Thursday October 1, dan.j.williams@intel.com wrote:
> > Hi Neil,
> >
> > A few fixes:
> > 1/ The multicore option is not ready for prime time
>
> But it is already marked experimental...
> So do we really need to revert? or is the current code broken beyond
> repair?

So we don't need a revert; this fixes up the unpredictability of the
original implementation. It surprised me that the overhead of passing
raid_run_ops to the async thread pool amounted to an 11% performance
regression. In any event I think this is a better baseline for future
multicore experimentation than the current implementation.

--
Dan

 drivers/md/raid5.c |   72 ++++++++++++++++++++++++++-------------------------
 drivers/md/raid5.h |   12 ++++++++-
 2 files changed, 48 insertions(+), 36 deletions(-)

diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index c21cc50..6b4a09f 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -1139,7 +1139,7 @@ static void ops_run_check_pq(struct stripe_head *sh, struct raid5_percpu *percpu
 			   &sh->ops.zero_sum_result, percpu->spare_page, &submit);
 }

-static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
+static void __raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 {
 	int overlap_clear = 0, i, disks = sh->disks;
 	struct dma_async_tx_descriptor *tx = NULL;
@@ -1204,6 +1204,36 @@ static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
 	put_cpu();
 }

+#ifdef CONFIG_MULTICORE_RAID456
+static void async_run_ops(void *param, async_cookie_t cookie)
+{
+	struct stripe_head *sh = param;
+	unsigned long ops_request = sh->ops.request;
+
+	clear_bit_unlock(STRIPE_OPS_REQ_PENDING, &sh->state);
+	wake_up(&sh->ops.wait_for_ops);
+
+	__raid_run_ops(sh, ops_request);
+	release_stripe(sh);
+}
+
+static void raid_run_ops(struct stripe_head *sh, unsigned long ops_request)
+{
+	/* since handle_stripe can be called outside of raid5d context
+	 * we need to ensure sh->ops.request is de-staged before another
+	 * request arrives
+	 */
+	wait_event(sh->ops.wait_for_ops,
+		   !test_and_set_bit_lock(STRIPE_OPS_REQ_PENDING, &sh->state));
+	sh->ops.request = ops_request;
+
+	atomic_inc(&sh->count);
+	async_schedule(async_run_ops, sh);
+}
+#else
+#define raid_run_ops __raid_run_ops
+#endif
+
 static int grow_one_stripe(raid5_conf_t *conf)
 {
 	struct stripe_head *sh;
@@ -1213,6 +1243,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
 	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
 	sh->raid_conf = conf;
 	spin_lock_init(&sh->lock);
+	#ifdef CONFIG_MULTICORE_RAID456
+	init_waitqueue_head(&sh->ops.wait_for_ops);
+	#endif

 	if (grow_buffers(sh, conf->raid_disks)) {
 		shrink_buffers(sh, conf->raid_disks);
@@ -4349,37 +4382,6 @@ static int retry_aligned_read(raid5_conf_t *conf, struct bio *raid_bio)
 	return handled;
 }

-#ifdef CONFIG_MULTICORE_RAID456
-static void __process_stripe(void *param, async_cookie_t cookie)
-{
-	struct stripe_head *sh = param;
-
-	handle_stripe(sh);
-	release_stripe(sh);
-}
-
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	async_schedule_domain(__process_stripe, sh, domain);
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-	async_synchronize_full_domain(domain);
-}
-#else
-static void process_stripe(struct stripe_head *sh, struct list_head *domain)
-{
-	handle_stripe(sh);
-	release_stripe(sh);
-	cond_resched();
-}
-
-static void synchronize_stripe_processing(struct list_head *domain)
-{
-}
-#endif
-
 /*
  * This is our raid5 kernel thread.
@@ -4393,7 +4395,6 @@ static void raid5d(mddev_t *mddev)
 	struct stripe_head *sh;
 	raid5_conf_t *conf = mddev->private;
 	int handled;
-	LIST_HEAD(raid_domain);

 	pr_debug("+++ raid5d active\n");
@@ -4430,7 +4431,9 @@ static void raid5d(mddev_t *mddev)
 			spin_unlock_irq(&conf->device_lock);

 		handled++;
-		process_stripe(sh, &raid_domain);
+		handle_stripe(sh);
+		release_stripe(sh);
+		cond_resched();

 		spin_lock_irq(&conf->device_lock);
 	}
@@ -4438,7 +4441,6 @@ static void raid5d(mddev_t *mddev)

 	spin_unlock_irq(&conf->device_lock);

-	synchronize_stripe_processing(&raid_domain);
 	async_tx_issue_pending_all();
 	unplug_slaves(mddev);

diff --git a/drivers/md/raid5.h b/drivers/md/raid5.h
index 2390e0e..dcefdc9 100644
--- a/drivers/md/raid5.h
+++ b/drivers/md/raid5.h
@@ -214,12 +214,20 @@ struct stripe_head {
 	int			   disks;	   /* disks in stripe */
 	enum check_states	   check_state;
 	enum reconstruct_states    reconstruct_state;
-	/* stripe_operations
+	/**
+	 * struct stripe_operations
 	 * @target - STRIPE_OP_COMPUTE_BLK target
+	 * @target2 - 2nd compute target in the raid6 case
+	 * @zero_sum_result - P and Q verification flags
+	 * @request - async service request flags for raid_run_ops
 	 */
 	struct stripe_operations {
 		int		     target, target2;
 		enum sum_check_flags zero_sum_result;
+		#ifdef CONFIG_MULTICORE_RAID456
+		unsigned long	     request;
+		wait_queue_head_t    wait_for_ops;
+		#endif
 	} ops;
 	struct r5dev {
 		struct bio	req;
@@ -294,6 +302,8 @@ struct r6_state {
 #define	STRIPE_FULL_WRITE	13 /* all blocks are set to be overwritten */
 #define	STRIPE_BIOFILL_RUN	14
 #define	STRIPE_COMPUTE_RUN	15
+#define	STRIPE_OPS_REQ_PENDING	16
+
 /*
  * Operation request flags
  */
--
1.6.0.6
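For readers unfamiliar with the async_schedule() interface the patch
relies on, here is a minimal sketch of the general pattern. The function
and variable names are invented for illustration; only async_schedule()
and async_synchronize_full() are real kernel interfaces (from
<linux/async.h>):

	#include <linux/async.h>

	/* Runs later in one of the async-pool threads. */
	static void demo_work(void *data, async_cookie_t cookie)
	{
		/* ... do the deferred work on 'data' ... */
	}

	static void demo_submit(void *data)
	{
		/* Queue the work and return immediately. */
		async_schedule(demo_work, data);
	}

	static void demo_barrier(void)
	{
		/* Wait for all previously scheduled async work to finish. */
		async_synchronize_full();
	}

The patch keeps the submission side (raid_run_ops) but drops the
synchronization barrier from raid5d, since completion is now tracked per
stripe via STRIPE_OPS_REQ_PENDING instead of per batch.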
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Neil Brown @ 2009-10-07  4:34 UTC
To: Dan Williams; +Cc: linux-raid

On Tuesday October 6, dan.j.williams@intel.com wrote:
> >From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Sat, 3 Oct 2009 13:47:05 -0700
> Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
>
> The percpu conversion allowed a straightforward handoff of stripe
> processing to the async subsystem that initially showed some modest gains
> (+4%). However, this model is too simplistic and leads to stripes
> bouncing between raid5d and the async thread pool for every invocation
> of handle_stripe(). As reported by Holger this can fall into a
> pathological situation severely impacting throughput (6x performance
> loss).
>
> By downleveling the parallelism to raid_run_ops the pathological
> stripe_head bouncing is eliminated. This version still exhibits an
> average 11% throughput loss for:
>
>	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
>	echo 1024 > /sys/block/md0/md/stripe_cache_size
>	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
>
> ...but the results are at least stable and can be used as a base for
> further multicore experimentation.

Thanks.

One little change needed:

> static int grow_one_stripe(raid5_conf_t *conf)
> {
>	struct stripe_head *sh;
> @@ -1213,6 +1243,9 @@ static int grow_one_stripe(raid5_conf_t *conf)
>	memset(sh, 0, sizeof(*sh) + (conf->raid_disks-1)*sizeof(struct r5dev));
>	sh->raid_conf = conf;
>	spin_lock_init(&sh->lock);
> +	#ifdef CONFIG_MULTICORE_RAID456
> +	init_waitqueue_head(&sh->ops.wait_for_ops);
> +	#endif
>
>	if (grow_buffers(sh, conf->raid_disks)) {
>		shrink_buffers(sh, conf->raid_disks);

This addition is needed in resize_stripes too.

NeilBrown
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Holger Kiehl @ 2009-10-07 12:05 UTC
To: Dan Williams; +Cc: Neil Brown, linux-raid

On Tue, 6 Oct 2009, Dan Williams wrote:

> >From 0496c92cf6ac1f4f7dde6d416707988991d87d41 Mon Sep 17 00:00:00 2001
> From: Dan Williams <dan.j.williams@intel.com>
> Date: Sat, 3 Oct 2009 13:47:05 -0700
> Subject: [PATCH] md/raid456: downlevel multicore operations to raid_run_ops
>
> The percpu conversion allowed a straightforward handoff of stripe
> processing to the async subsystem that initially showed some modest gains
> (+4%). However, this model is too simplistic and leads to stripes
> bouncing between raid5d and the async thread pool for every invocation
> of handle_stripe(). As reported by Holger this can fall into a
> pathological situation severely impacting throughput (6x performance
> loss).
>
> By downleveling the parallelism to raid_run_ops the pathological
> stripe_head bouncing is eliminated. This version still exhibits an
> average 11% throughput loss for:
>
>	mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6
>	echo 1024 > /sys/block/md0/md/stripe_cache_size
>	dd if=/dev/zero of=/dev/md0 bs=1024k count=2048
>
> ...but the results are at least stable and can be used as a base for
> further multicore experimentation.
>
> Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> On Thu, 2009-10-01 at 21:13 -0700, Neil Brown wrote:
>> On Thursday October 1, dan.j.williams@intel.com wrote:
>>> Hi Neil,
>>>
>>> A few fixes:
>>> 1/ The multicore option is not ready for prime time
>>
>> But it is already marked experimental...
>> So do we really need to revert? or is the current code broken beyond
>> repair?
>
> So we don't need a revert, this fixes up the unpredictability of the
> original implementation. It surprised me that the overhead of passing
> raid_run_ops to the async thread pool amounted to an 11% performance
> regression. In any event I think this is a better baseline for future
> multicore experimentation than the current implementation.
>
Just to add some more information, I did try this patch with
2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
performance regression. The test I am doing has several (appr. 60)
processes sending via FTP or SFTP about 100000 small files (average size
below 4096 bytes) to localhost in a loop for 30 minutes. Here the
real numbers:

   with multicore support enabled (with your patch)      3276.77 files per second
   with multicore support enabled (without your patch)   1014.47 files per second
   without multicore support                             7549.24 files per second

Holger
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Asdo @ 2009-10-07 18:33 UTC
To: Holger Kiehl, linux-raid

Holger Kiehl wrote:
> On Tue, 6 Oct 2009, Dan Williams wrote:
>
>> ....
>> By downleveling the parallelism to raid_run_ops the pathological
>> stripe_head bouncing is eliminated. This version still exhibits an
>> average 11% throughput....
>>
>> So we don't need a revert, this fixes up the unpredictability of the
>> original implementation. It surprised me that the overhead of passing
>> raid_run_ops to the async thread pool amounted to an 11% performance
>> regression. In any event I think this is a better baseline for future
>> multicore experimentation than the current implementation.
>>
> Just to add some more information, I did try this patch with
> 2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
> performance regression.

Hi Holger

From the above sentence it seems you get worse performance now than with
the original multicore implementation, while from the numbers below it
seems you get better performance now.

Which is correct?

(BTW a performance regression higher than 100% is impossible :-) )

> The test I am doing has several (appr. 60)
> processes sending via FTP or SFTP about 100000 small files (average size
> below 4096 bytes) to localhost in a loop for 30 minutes. Here the
> real numbers:
>
>    with multicore support enabled (with your patch)      3276.77 files per second
>    with multicore support enabled (without your patch)   1014.47 files per second
>    without multicore support                             7549.24 files per second
>
> Holger

Also, could you tell us some details about your machine and the RAID?
Like the model of CPU (Nehalems and AMDs have much faster memory access
than earlier Intels) and if it's a single-cpu or a dual-cpu mainboard...
Amount of RAM
also: stripe_cache_size current setting for your RAID
Raid level, number of disks, chunk size, filesystem...

Thank you!
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Holger Kiehl @ 2009-10-08  8:50 UTC
To: Asdo; +Cc: linux-raid

On Wed, 7 Oct 2009, Asdo wrote:

> Holger Kiehl wrote:
>> On Tue, 6 Oct 2009, Dan Williams wrote:
>>
>>> ....
>>> By downleveling the parallelism to raid_run_ops the pathological
>>> stripe_head bouncing is eliminated. This version still exhibits an
>>> average 11% throughput....
>>>
>>> So we don't need a revert, this fixes up the unpredictability of the
>>> original implementation. It surprised me that the overhead of passing
>>> raid_run_ops to the async thread pool amounted to an 11% performance
>>> regression. In any event I think this is a better baseline for future
>>> multicore experimentation than the current implementation.
>>>
>> Just to add some more information, I did try this patch with
>> 2.6.32-rc3-git1 and with the testing I am doing I get appr. 125%
>> performance regression.
> Hi Holger
>
> From the above sentence it seems you get worse performance now than with
> the original multicore implementation, while from the numbers below it
> seems you get better performance now.
>
> Which is correct?
>
I only wanted to express that with my testing I get a higher performance
regression than the test Dan did. So with the patch it is much better
than without, as the numbers below show.

> (BTW a performance regression higher than 100% is impossible :-) )
>
>> The test I am doing has several (appr. 60)
>> processes sending via FTP or SFTP about 100000 small files (average size
>> below 4096 bytes) to localhost in a loop for 30 minutes. Here the
>> real numbers:
>>
>>    with multicore support enabled (with your patch)      3276.77 files per second
>>    with multicore support enabled (without your patch)   1014.47 files per second
>>    without multicore support                             7549.24 files per second
>>
>> Holger
>
> Also, could you tell us some details about your machine and the RAID?
> Like the model of CPU (Nehalems and AMDs have much faster memory access
> than earlier Intels) and if it's a single-cpu or a dual-cpu mainboard...
> Amount of RAM
>
It is a dual-cpu mainboard with two Xeon X5460 and 32 GiB RAM.

> also: stripe_cache_size current setting for your RAID
> Raid level, number of disks, chunk size, filesystem...
>
Raid level is Raid6 over 8 disks (actually 16, which are made up of 8
HW Raid1 disks) and chunksize is 2048k.

Here the output from /proc/mdstat:

md4 : active raid6 sdi1[2] sdl1[5] sdj1[3] sdn1[7] sdk1[4] sdm1[6] sdh1[1] sdg1[0]
      1754480640 blocks level 6, 2048k chunk, algorithm 2 [8/8] [UUUUUUUU]

Filesystem is ext4, here the details:

dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name:   <none>
Last mounted on:          /home
Filesystem UUID:          1e3ff0c9-07a0-412e-938c-b9a242ae7d42
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              109658112
Block count:              438620160
Reserved block count:     0
Free blocks:              429498966
Free inodes:              109493636
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      919
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              512
RAID stripe width:        3072
Flex block group size:    16
Filesystem created:       Thu Sep 17 14:37:08 2009
Last mount time:          Wed Oct  7 09:04:00 2009
Last write time:          Wed Oct  7 09:04:00 2009
Mount count:              12
Maximum mount count:      -1
Last checked:             Thu Sep 17 14:37:08 2009
Check interval:           0 (<none>)
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      149fb3b1-2e99-46de-bd31-7031f677deb6
Journal backup:           inode blocks
Journal size:             768M

Thanks,
Holger
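As a side note, the dumpe2fs values line up with the array geometry:
with a 4 KiB block size, a 2048 KiB chunk is 2048/4 = 512 blocks of
stride, and RAID6 over 8 disks leaves 6 data disks, so the stripe width
is 6 * 512 = 3072 blocks. A hypothetical mkfs invocation matching those
values (the device name is taken from the mdstat output above; this is
not necessarily the exact command Holger ran):

	# stride      = chunk / block size   = 2048 KiB / 4 KiB = 512
	# stripe-width = data disks * stride = 6 * 512          = 3072
	mkfs.ext4 -b 4096 -E stride=512,stripe-width=3072 /dev/md4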
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Asdo @ 2009-10-11 12:16 UTC
To: Holger Kiehl; +Cc: linux-raid

Holger Kiehl wrote:
> It is a dual-cpu mainboard with two Xeon X5460 and 32 GiB RAM.
>
Nice machine... Earlier Intels though, with kinda slow RAM access (who
knows if this has something to do with it...)

A few more observations:

>> also: stripe_cache_size current setting for your RAID
>> Raid level, number of disks, chunk size, filesystem...
> Raid level is Raid6 over 8 disks (actually 16, which are made up of 8
> HW Raid1 disks) and chunksize is 2048k. Here the output from /proc/mdstat

Ok, the filesystem appears to be aligned then. (I am assuming you are not
using LVM, otherwise please tell, because there can be other tweaks.)

I don't really know this multicore implementation, however here is one
thing you could try:

The default stripe_cache_size might be too small for a multicore
implementation, considering that there might be many more in-flight
operations than with single-core (and even with single core it is
beneficial to raise it).

What is the current value? (cat /sys/block/md4/md/stripe_cache_size)

You can try:
	echo 32768 > /sys/block/md4/md/stripe_cache_size

You can also try to raise /sys/block/md4/md/nr_requests by echoing a
higher value into it, but this operation hangs up the kernel immediately
on our machine with Ubuntu kernel 2.6.31. This is actually a bug:
Neil / Dan, are you aware of this bug? Is it fixed in 2.6.32?

Secondly, it's surprising that it is slower than single core.

How busy are the 8 cores you have, running on the MD threads (try htop)?
If you see all 8 cores maxed, it's one kind of implementation problem;
if you see just 1-2 cores maxed, it's another kind...

Maybe stack traces can be used to see where the bottleneck is... I am not
a profiling expert actually, but you might try this:
	cat /proc/12345/stack
replacing 12345 with the kernel threads for md: you can probably see the
PIDs with
	ps auxH | less
or with htop (K option enabled).
Catting the stack many times should show where they are stuck most of
the time.
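One caveat worth keeping in mind alongside the stripe_cache_size
suggestion above: the stripe cache is allocated per member device, so the
commonly cited rule of thumb for its memory footprint is roughly
stripe_cache_size * 4 KiB * member disks (an approximation, worth checking
against the running kernel before committing to a large value). A sketch
for Holger's 8-disk md4:

	# Approximate memory cost (rule of thumb, assuming 4 KiB pages):
	#   stripe_cache_size * 4096 bytes * member disks
	# For the suggested value of 32768 on an 8-disk array:
	#   32768 * 4096 * 8 = 1 GiB of RAM
	cat /sys/block/md4/md/stripe_cache_size          # current value
	echo 32768 > /sys/block/md4/md/stripe_cache_size # try a larger cache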
* Re: [PATCH 0/3] md fixes for 2.6.32-rc
From: Asdo @ 2009-10-11 13:17 UTC
To: Holger Kiehl; +Cc: linux-raid

Asdo wrote:
> ... How busy are the 8 cores you have, running on the MD threads (try
> htop) ...

I meant to say: "try htop with the K option enabled, i.e. press K while
running htop, so as to see kernel threads (see the help by pressing h)".
Thread overview: 13+ messages
2009-10-02  1:18 [PATCH 0/3] md fixes for 2.6.32-rc Dan Williams
2009-10-02  1:18 ` [PATCH 1/3] md/raid5: initialize conf->device_lock earlier Dan Williams
2009-10-02  1:18 ` [PATCH 2/3] Revert "md/raid456: distribute raid processing over multiple cores" Dan Williams
2009-10-02  1:18 ` [PATCH 3/3] Allow sysfs_notify_dirent to be called from interrupt context Dan Williams
2009-10-02  4:13 ` [PATCH 0/3] md fixes for 2.6.32-rc Neil Brown
2009-10-03 15:54   ` Dan Williams
2009-10-07  0:36   ` Dan Williams
2009-10-07  4:34     ` Neil Brown
2009-10-07 12:05     ` Holger Kiehl
2009-10-07 18:33       ` Asdo
2009-10-08  8:50         ` Holger Kiehl
2009-10-11 12:16           ` Asdo
2009-10-11 13:17             ` Asdo