* [RFC PATCH 0/2] block: rq_affinity performance fixes
From: Dan Williams @ 2011-07-22 20:59 UTC
To: jaxboe
Cc: linux-kernel, linux-scsi

Jens,

Per the "rq_affinity doesn't seem to work?" thread [1], we found that
the default cpu steering that rq_affinity performs is insufficient to
keep up with some completion loads. Dave collected the following
relative iops measurements:

  pre-patches  rq_affinity=0: 1x
  pre-patches  rq_affinity=1: 1x
  post-patches rq_affinity=1: 1.08x
  post-patches rq_affinity=2: 1.35x

The "adaptive rq_affinity" patch is more of a discussion point than a
real fix. However, it shows that even something straightforward and
seemingly aggressive does not really come close to the performance of
just turning cpu grouping off altogether. The "strict rq_affinity"
patch follows the "nomerges" interface precedent of setting 2 to
indicate "I really mean it".

So, at a minimum consider taking patch 1, and we can leave the search
for a better adaptive algorithm as future work.

[1]: http://marc.info/?l=linux-kernel&m=131049742925413&w=4

---

Dan Williams (2):
      block: strict rq_affinity
      block: adaptive rq_affinity

 Documentation/block/queue-sysfs.txt |   10 +++++++---
 block/blk-core.c                    |    6 ++----
 block/blk-softirq.c                 |   23 ++++++++++++++++++-----
 block/blk-sysfs.c                   |   13 +++++++++----
 include/linux/blkdev.h              |    3 ++-
 5 files changed, 38 insertions(+), 17 deletions(-)
* [RFC PATCH 1/2] block: strict rq_affinity
From: Dan Williams @ 2011-07-22 20:59 UTC
To: jaxboe
Cc: Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

Some storage controllers benefit from completions always being steered
to the strict requester cpu rather than the looser "per-socket" steering
that blk_cpu_to_group() attempts by default.

	echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/block/queue-sysfs.txt |   10 +++++++---
 block/blk-core.c                    |    6 ++----
 block/blk-softirq.c                 |   11 +++++++----
 block/blk-sysfs.c                   |   13 +++++++++----
 include/linux/blkdev.h              |    3 ++-
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f652740..d8147b3 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
 
 rq_affinity (RW)
 ----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
 
 scheduler (RW)
 --------------
diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..9c7ba87 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c216..475fab8 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
-	group_cpu = blk_cpu_to_group(cpu);
 
 	/*
 	 * Select completion CPU
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-	else
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
+	} else
 		ccpu = cpu;
 
 	if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd8..0ee17b5 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
 
-	return queue_var_show(set, page);
+	return queue_var_show(set << force, page);
 }
 
 static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 	ret = queue_var_store(&val, page, count);
 
 	spin_lock_irq(q->queue_lock);
-	if (val)
+	if (val) {
 		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		if (val == 2)
+			queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+	} else {
+		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+	}
 	spin_unlock_irq(q->queue_lock);
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1a23722..b228825 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,7 +393,7 @@ struct request_queue
 #define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP	9	/* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO     10	/* fake timeout */
 #define QUEUE_FLAG_STACKABLE   11	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      12	/* non-rotational device (SSD) */
@@ -403,6 +403,7 @@ struct request_queue
 #define QUEUE_FLAG_NOXMERGES   15	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM  16	/* Contributes to random pool */
 #define QUEUE_FLAG_SECDISCARD  17	/* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE  18	/* force complete on same CPU */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\
* Re: [RFC PATCH 1/2] block: strict rq_affinity
From: Christoph Hellwig @ 2011-07-23  1:46 UTC
To: Dan Williams
Cc: jaxboe, Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

On Fri, Jul 22, 2011 at 01:59:39PM -0700, Dan Williams wrote:
> Some storage controllers benefit from completions always being steered
> to the strict requester cpu rather than the looser "per-socket" steering
> that blk_cpu_to_group() attempts by default.

Isn't this actually dependent on the cpu, and not the storage
controller?
* Re: [RFC PATCH 1/2] block: strict rq_affinity
From: Jens Axboe @ 2011-07-23 18:38 UTC
To: Christoph Hellwig
Cc: Dan Williams, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

On 2011-07-23 03:46, Christoph Hellwig wrote:
> On Fri, Jul 22, 2011 at 01:59:39PM -0700, Dan Williams wrote:
>> Some storage controllers benefit from completions always being steered
>> to the strict requester cpu rather than the looser "per-socket" steering
>> that blk_cpu_to_group() attempts by default.
>
> Isn't this actually dependent on the cpu, and not the storage
> controller?

It is; it's completely independent of the controller used. Perhaps some
drivers could have very heavy end-io completion handling, making the
problem larger, but in general it's an artifact of the CPU and not the
controller.

-- 
Jens Axboe
* Re: [RFC PATCH 1/2] block: strict rq_affinity
From: Jens Axboe @ 2011-07-23 18:41 UTC
To: Dan Williams
Cc: Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

On 2011-07-22 22:59, Dan Williams wrote:
> Some storage controllers benefit from completions always being steered
> to the strict requester cpu rather than the looser "per-socket" steering
> that blk_cpu_to_group() attempts by default.
>
> echo 2 > /sys/block/<bdev>/queue/rq_affinity

I have applied this one, with a modified patch description.

I like the adaptive solution, but it should be rewritten to not declare
and expose softirq internals. Essentially, have an API in
kernel/softirq.c that can return whether a given (or perhaps just the
local) softirq handler is busy or not.

-- 
Jens Axboe
* Re: [RFC PATCH 1/2] block: strict rq_affinity
From: Shaohua Li @ 2011-07-25  1:14 UTC
To: Jens Axboe
Cc: Dan Williams, Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

2011/7/24 Jens Axboe <jaxboe@fusionio.com>:
> On 2011-07-22 22:59, Dan Williams wrote:
>> Some storage controllers benefit from completions always being steered
>> to the strict requester cpu rather than the looser "per-socket" steering
>> that blk_cpu_to_group() attempts by default.
>>
>> echo 2 > /sys/block/<bdev>/queue/rq_affinity
>
> I have applied this one, with a modified patch description.
>
> I like the adaptive solution, but it should be rewritten to not declare
> and expose softirq internals. Essentially have an API from
> kernel/softirq.c that can return whether a given (or perhaps just local)
> softirq handler is busy or not.

Jens,

I posted a similar patch about two years ago
(http://marc.info/?l=linux-kernel&m=126136252929329&w=2). At that time
you ran a lot of tests and said the same-cpu approach would cause huge
lock contention and bouncing. Has that been fixed?

Thanks,
Shaohua
* Re: [RFC PATCH 1/2] block: strict rq_affinity
From: Jens Axboe @ 2011-07-25  8:21 UTC
To: Shaohua Li
Cc: Dan Williams, Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

On 2011-07-25 03:14, Shaohua Li wrote:
> 2011/7/24 Jens Axboe <jaxboe@fusionio.com>:
>> On 2011-07-22 22:59, Dan Williams wrote:
>>> Some storage controllers benefit from completions always being steered
>>> to the strict requester cpu rather than the looser "per-socket" steering
>>> that blk_cpu_to_group() attempts by default.
>>>
>>> echo 2 > /sys/block/<bdev>/queue/rq_affinity
>>
>> I have applied this one, with a modified patch description.
>>
>> I like the adaptive solution, but it should be rewritten to not declare
>> and expose softirq internals. Essentially have an API from
>> kernel/softirq.c that can return whether a given (or perhaps just local)
>> softirq handler is busy or not.
>
> Jens,
> I posted a similar patch about two years ago
> (http://marc.info/?l=linux-kernel&m=126136252929329&w=2).
> At that time, you actually did a lot of tests and said the same cpu
> approach will cause huge lock contention and bounce. Is that get fixed?

Yep, it's not ideal. But if we are running out of steam on a single
processor, there's really not much of an option currently.

-- 
Jens Axboe
* [RFC PATCH 2/2] block: adaptive rq_affinity
From: Dan Williams @ 2011-07-22 20:59 UTC
To: jaxboe
Cc: Roland Dreier, Dave Jiang, linux-scsi, Matthew Wilcox, linux-kernel, Christoph Hellwig

For some storage configurations the coarse grained cpu grouping (socket)
does not supply enough cpu to keep up with the demands of high iops.
Bypass the grouping and complete on the direct requester cpu when the
local cpu is under softirq pressure (as measured by ksoftirqd being in
the running state).

Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-softirq.c |   12 +++++++++++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 475fab8..f0cda19 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -101,16 +101,20 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 	.notifier_call	= blk_cpu_notify,
 };
 
+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
 void __blk_complete_request(struct request *req)
 {
 	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
+	struct task_struct *tsk;
 	unsigned long flags;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
+	tsk = per_cpu(ksoftirqd, cpu);
 
 	/*
 	 * Select completion CPU
@@ -124,7 +128,13 @@ void __blk_complete_request(struct request *req)
 	} else
 		ccpu = cpu;
 
-	if (ccpu == cpu || ccpu == group_cpu) {
+	/*
+	 * try to skip a remote softirq-trigger if the completion is
+	 * within the same group, but not if local softirqs have already
+	 * spilled to ksoftirqd
+	 */
+	if (ccpu == cpu ||
+	    (ccpu == group_cpu && tsk->state != TASK_RUNNING)) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);