* [RFC PATCH 0/2] block: rq_affinity performance fixes
@ 2011-07-22 20:59 Dan Williams
  2011-07-22 20:59 ` [RFC PATCH 1/2] block: strict rq_affinity Dan Williams
  2011-07-22 20:59 ` [RFC PATCH 2/2] block: adaptive rq_affinity Dan Williams
  0 siblings, 2 replies; 8+ messages in thread
From: Dan Williams @ 2011-07-22 20:59 UTC (permalink / raw)
  To: jaxboe; +Cc: linux-kernel, linux-scsi

Jens,

Per the "rq_affinity doesn't seem to work?" thread [1].  We found that the
default cpu steering that rq_affinity performs is insufficient to keep up with
some completion loads.

Dave collected the following relative iops measurements:
pre-patches  rq_affinity=0: 1x
pre-patches  rq_affinity=1: 1x
post-patches rq_affinity=1: 1.08x
post-patches rq_affinity=2: 1.35x

The "adaptive rq_affinity" patch is more of a discussion point than a
real fix.  However, it shows that even something straightforward and
seemingly aggressive does not really come close to the performance of
just turning cpu grouping off altogether.  The "strict rq_affinity"
patch follows the "nomerges" interface precedent of setting 2 to
indicate "I really mean it".

So, at a minimum consider taking patch 1, and we can leave the search
for a better adaptive algorithm as future work.

[1]: http://marc.info/?l=linux-kernel&m=131049742925413&w=4

---

Dan Williams (2):
      block: strict rq_affinity
      block: adaptive rq_affinity

 Documentation/block/queue-sysfs.txt |   10 +++++++---
 block/blk-core.c                    |    6 ++----
 block/blk-softirq.c                 |   23 ++++++++++++++++++-----
 block/blk-sysfs.c                   |   13 +++++++++----
 include/linux/blkdev.h              |    3 ++-
 5 files changed, 38 insertions(+), 17 deletions(-)

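For context, the "cpu grouping" the cover letter refers to is the
blk_cpu_to_group() mapping, which collapses a cpu number to one
representative cpu per core/thread sibling group. A rough sketch of that
helper as it existed around this time (paraphrased from
include/linux/blkdev.h, not the verbatim source):

/* Map a cpu to the first cpu of its sibling group, so that completions
 * steered with rq_affinity=1 land on one representative cpu per group.
 */
static inline int blk_cpu_to_group(int cpu)
{
	int group = NR_CPUS;
#ifdef CONFIG_SCHED_MC
	const struct cpumask *mask = cpu_coregroup_mask(cpu);
	group = cpumask_first(mask);
#elif defined(CONFIG_SCHED_SMT)
	group = cpumask_first(topology_thread_cpumask(cpu));
#else
	return cpu;
#endif
	if (likely(group < NR_CPUS))
		return group;
	return cpu;
}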

* [RFC PATCH 1/2] block: strict rq_affinity
  2011-07-22 20:59 [RFC PATCH 0/2] block: rq_affinity performance fixes Dan Williams
@ 2011-07-22 20:59 ` Dan Williams
  2011-07-23  1:46   ` Christoph Hellwig
  2011-07-23 18:41   ` Jens Axboe
  2011-07-22 20:59 ` [RFC PATCH 2/2] block: adaptive rq_affinity Dan Williams
  1 sibling, 2 replies; 8+ messages in thread
From: Dan Williams @ 2011-07-22 20:59 UTC (permalink / raw)
  To: jaxboe
  Cc: Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

Some storage controllers benefit from completions always being steered
to the strict requester cpu rather than the looser "per-socket" steering
that blk_cpu_to_group() attempts by default.

echo 2 > /sys/block/<bdev>/queue/rq_affinity

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 Documentation/block/queue-sysfs.txt |   10 +++++++---
 block/blk-core.c                    |    6 ++----
 block/blk-softirq.c                 |   11 +++++++----
 block/blk-sysfs.c                   |   13 +++++++++----
 include/linux/blkdev.h              |    3 ++-
 5 files changed, 27 insertions(+), 16 deletions(-)

diff --git a/Documentation/block/queue-sysfs.txt b/Documentation/block/queue-sysfs.txt
index f652740..d8147b3 100644
--- a/Documentation/block/queue-sysfs.txt
+++ b/Documentation/block/queue-sysfs.txt
@@ -45,9 +45,13 @@ device.
 
 rq_affinity (RW)
 ----------------
-If this option is enabled, the block layer will migrate request completions
-to the CPU that originally submitted the request. For some workloads
-this provides a significant reduction in CPU cycles due to caching effects.
+If this option is '1', the block layer will migrate request completions to the
+cpu "group" that originally submitted the request. For some workloads this
+provides a significant reduction in CPU cycles due to caching effects.
+
+For storage configurations that need to maximize distribution of completion
+processing, setting this option to '2' forces the completion to run on the
+requesting cpu (bypassing the "group" aggregation logic).
 
 scheduler (RW)
 --------------
diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..9c7ba87 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c216..475fab8 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -103,22 +103,25 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
-	group_cpu = blk_cpu_to_group(cpu);
 
 	/*
 	 * Select completion CPU
 	 */
-	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1)
+	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) && req->cpu != -1) {
 		ccpu = req->cpu;
-	else
+		if (!test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags)) {
+			ccpu = blk_cpu_to_group(ccpu);
+			group_cpu = blk_cpu_to_group(cpu);
+		}
+	} else
 		ccpu = cpu;
 
 	if (ccpu == cpu || ccpu == group_cpu) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index d935bd8..0ee17b5 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -244,8 +244,9 @@ static ssize_t queue_nomerges_store(struct request_queue *q, const char *page,
 static ssize_t queue_rq_affinity_show(struct request_queue *q, char *page)
 {
 	bool set = test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags);
+	bool force = test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags);
 
-	return queue_var_show(set, page);
+	return queue_var_show(set << force, page);
 }
 
 static ssize_t
@@ -257,10 +258,14 @@ queue_rq_affinity_store(struct request_queue *q, const char *page, size_t count)
 
 	ret = queue_var_store(&val, page, count);
 	spin_lock_irq(q->queue_lock);
-	if (val)
+	if (val) {
 		queue_flag_set(QUEUE_FLAG_SAME_COMP, q);
-	else
-		queue_flag_clear(QUEUE_FLAG_SAME_COMP,  q);
+		if (val == 2)
+			queue_flag_set(QUEUE_FLAG_SAME_FORCE, q);
+	} else {
+		queue_flag_clear(QUEUE_FLAG_SAME_COMP, q);
+		queue_flag_clear(QUEUE_FLAG_SAME_FORCE, q);
+	}
 	spin_unlock_irq(q->queue_lock);
 #endif
 	return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1a23722..b228825 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -393,7 +393,7 @@ struct request_queue
 #define QUEUE_FLAG_ELVSWITCH	6	/* don't use elevator, just do FIFO */
 #define QUEUE_FLAG_BIDI		7	/* queue supports bidi requests */
 #define QUEUE_FLAG_NOMERGES     8	/* disable merge attempts */
-#define QUEUE_FLAG_SAME_COMP	9	/* force complete on same CPU */
+#define QUEUE_FLAG_SAME_COMP	9	/* complete on same CPU-group */
 #define QUEUE_FLAG_FAIL_IO     10	/* fake timeout */
 #define QUEUE_FLAG_STACKABLE   11	/* supports request stacking */
 #define QUEUE_FLAG_NONROT      12	/* non-rotational device (SSD) */
@@ -403,6 +403,7 @@ struct request_queue
 #define QUEUE_FLAG_NOXMERGES   15	/* No extended merges */
 #define QUEUE_FLAG_ADD_RANDOM  16	/* Contributes to random pool */
 #define QUEUE_FLAG_SECDISCARD  17	/* supports SECDISCARD */
+#define QUEUE_FLAG_SAME_FORCE  18	/* force complete on same CPU */
 
 #define QUEUE_FLAG_DEFAULT	((1 << QUEUE_FLAG_IO_STAT) |		\
 				 (1 << QUEUE_FLAG_STACKABLE)	|	\


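In short, the completion-cpu policy after this patch is as sketched below.
This is a restatement of the logic in the hunks above; completion_cpu() is
an illustrative helper name, not a function the patch adds:

/*
 * rq_affinity = 0: QUEUE_FLAG_SAME_COMP clear, complete on the irq cpu
 * rq_affinity = 1: SAME_COMP set, complete on the submitter's cpu group
 * rq_affinity = 2: SAME_COMP and SAME_FORCE set, complete on the exact
 *                  submitting cpu
 */
static int completion_cpu(struct request_queue *q, int submit_cpu, int irq_cpu)
{
	if (!test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) || submit_cpu == -1)
		return irq_cpu;
	if (test_bit(QUEUE_FLAG_SAME_FORCE, &q->queue_flags))
		return submit_cpu;
	return blk_cpu_to_group(submit_cpu);
}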

* [RFC PATCH 2/2] block: adaptive rq_affinity
  2011-07-22 20:59 [RFC PATCH 0/2] block: rq_affinity performance fixes Dan Williams
  2011-07-22 20:59 ` [RFC PATCH 1/2] block: strict rq_affinity Dan Williams
@ 2011-07-22 20:59 ` Dan Williams
  1 sibling, 0 replies; 8+ messages in thread
From: Dan Williams @ 2011-07-22 20:59 UTC (permalink / raw)
  To: jaxboe
  Cc: Roland Dreier, Dave Jiang, linux-scsi, Matthew Wilcox,
	linux-kernel, Christoph Hellwig

For some storage configurations the coarse-grained cpu grouping (socket)
does not supply enough cpu to keep up with the demands of high iops.
Bypass the grouping and complete on the direct requester cpu when the
local cpu is under softirq pressure (as measured by ksoftirqd being in
the running state).

Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Roland Dreier <roland@purestorage.com>
Tested-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 block/blk-softirq.c |   12 +++++++++++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 475fab8..f0cda19 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -101,16 +101,20 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 	.notifier_call	= blk_cpu_notify,
 };
 
+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
 void __blk_complete_request(struct request *req)
 {
 	int ccpu, cpu, group_cpu = NR_CPUS;
 	struct request_queue *q = req->q;
+	struct task_struct *tsk;
 	unsigned long flags;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
+	tsk = per_cpu(ksoftirqd, cpu);
 
 	/*
 	 * Select completion CPU
@@ -124,7 +128,13 @@ void __blk_complete_request(struct request *req)
 	} else
 		ccpu = cpu;
 
-	if (ccpu == cpu || ccpu == group_cpu) {
+	/*
+	 * try to skip a remote softirq-trigger if the completion is
+	 * within the same group, but not if local softirqs have already
+	 * spilled to ksoftirqd
+	 */
+	if (ccpu == cpu ||
+	    (ccpu == group_cpu && tsk->state != TASK_RUNNING)) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);


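Stated compactly, the adaptive heuristic above amounts to the following.
This is a sketch only; complete_locally() is an illustrative name, and the
patch open-codes the check inside __blk_complete_request():

/*
 * Stay on the local cpu for same-group completions only while local
 * softirq handling has not already spilled over to ksoftirqd; once
 * ksoftirqd is runnable, fall back to an IPI to the submitting cpu.
 * Relies on the DECLARE_PER_CPU(ksoftirqd) from the patch above.
 */
static bool complete_locally(int ccpu, int cpu, int group_cpu)
{
	struct task_struct *tsk = per_cpu(ksoftirqd, cpu);

	if (ccpu == cpu)
		return true;
	return ccpu == group_cpu && tsk->state != TASK_RUNNING;
}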

* Re: [RFC PATCH 1/2] block: strict rq_affinity
  2011-07-22 20:59 ` [RFC PATCH 1/2] block: strict rq_affinity Dan Williams
@ 2011-07-23  1:46   ` Christoph Hellwig
  2011-07-23 18:38     ` Jens Axboe
  2011-07-23 18:41   ` Jens Axboe
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Hellwig @ 2011-07-23  1:46 UTC (permalink / raw)
  To: Dan Williams
  Cc: jaxboe, Christoph Hellwig, Roland Dreier, Dave Jiang,
	linux-kernel, linux-scsi

On Fri, Jul 22, 2011 at 01:59:39PM -0700, Dan Williams wrote:
> Some storage controllers benefit from completions always being steered
> to the strict requester cpu rather than the looser "per-socket" steering
> that blk_cpu_to_group() attempts by default.

Isn't this actually dependent on the cpu, and not the storage
controller?



* Re: [RFC PATCH 1/2] block: strict rq_affinity
  2011-07-23  1:46   ` Christoph Hellwig
@ 2011-07-23 18:38     ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2011-07-23 18:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Dan Williams, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

On 2011-07-23 03:46, Christoph Hellwig wrote:
> On Fri, Jul 22, 2011 at 01:59:39PM -0700, Dan Williams wrote:
>> Some storage controllers benefit from completions always being steered
>> to the strict requester cpu rather than the looser "per-socket" steering
>> that blk_cpu_to_group() attempts by default.
> 
> Isn't this actually dependent on the cpu, and not the storage
> controller?

It is, it's completely independent of the controller used. Perhaps some
drivers could have very heavy end-io completion handling that makes the
problem larger, but in general it's an artifact of the CPU and not the
controller.

-- 
Jens Axboe



* Re: [RFC PATCH 1/2] block: strict rq_affinity
  2011-07-22 20:59 ` [RFC PATCH 1/2] block: strict rq_affinity Dan Williams
  2011-07-23  1:46   ` Christoph Hellwig
@ 2011-07-23 18:41   ` Jens Axboe
  2011-07-25  1:14     ` Shaohua Li
  1 sibling, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2011-07-23 18:41 UTC (permalink / raw)
  To: Dan Williams
  Cc: Christoph Hellwig, Roland Dreier, Dave Jiang, linux-kernel, linux-scsi

On 2011-07-22 22:59, Dan Williams wrote:
> Some storage controllers benefit from completions always being steered
> to the strict requester cpu rather than the looser "per-socket" steering
> that blk_cpu_to_group() attempts by default.
> 
> echo 2 > /sys/block/<bdev>/queue/rq_affinity

I have applied this one, with a modified patch description.

I like the adaptive solution, but it should be rewritten to not declare
and expose softirq internals. Essentially have an API from
kernel/softirq.c that can return whether a given (or perhaps just local)
softirq handler is busy or not.


-- 
Jens Axboe


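Such a kernel/softirq.c helper might look roughly like the sketch below.
The name and exact form are invented here for illustration; this is not
necessarily what was eventually merged:

/*
 * Report whether softirq work on the local cpu has already spilled over
 * to ksoftirqd, without callers poking at softirq internals directly.
 * Caller is expected to have preemption (or interrupts) disabled.
 */
bool local_softirq_overloaded(void)
{
	return __this_cpu_read(ksoftirqd)->state == TASK_RUNNING;
}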

* Re: [RFC PATCH 1/2] block: strict rq_affinity
  2011-07-23 18:41   ` Jens Axboe
@ 2011-07-25  1:14     ` Shaohua Li
  2011-07-25  8:21       ` Jens Axboe
  0 siblings, 1 reply; 8+ messages in thread
From: Shaohua Li @ 2011-07-25  1:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Dan Williams, Christoph Hellwig, Roland Dreier, Dave Jiang,
	linux-kernel, linux-scsi

2011/7/24 Jens Axboe <jaxboe@fusionio.com>:
> On 2011-07-22 22:59, Dan Williams wrote:
>> Some storage controllers benefit from completions always being steered
>> to the strict requester cpu rather than the looser "per-socket" steering
>> that blk_cpu_to_group() attempts by default.
>>
>> echo 2 > /sys/block/<bdev>/queue/rq_affinity
>
> I have applied this one, with a modified patch description.
>
> I like the adaptive solution, but it should be rewritten to not declare
> and expose softirq internals. Essentially have an API from
> kernel/softirq.c that can return whether a given (or perhaps just local)
> softirq handler is busy or not.
Jens,
I posted a similar patch about two years ago
(http://marc.info/?l=linux-kernel&m=126136252929329&w=2).
At that time you did a lot of tests and said the same-cpu approach
would cause huge lock contention and bouncing. Has that been fixed?

Thanks,
Shaohua


* Re: [RFC PATCH 1/2] block: strict rq_affinity
  2011-07-25  1:14     ` Shaohua Li
@ 2011-07-25  8:21       ` Jens Axboe
  0 siblings, 0 replies; 8+ messages in thread
From: Jens Axboe @ 2011-07-25  8:21 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Dan Williams, Christoph Hellwig, Roland Dreier, Dave Jiang,
	linux-kernel, linux-scsi

On 2011-07-25 03:14, Shaohua Li wrote:
> 2011/7/24 Jens Axboe <jaxboe@fusionio.com>:
>> On 2011-07-22 22:59, Dan Williams wrote:
>>> Some storage controllers benefit from completions always being steered
>>> to the strict requester cpu rather than the looser "per-socket" steering
>>> that blk_cpu_to_group() attempts by default.
>>>
>>> echo 2 > /sys/block/<bdev>/queue/rq_affinity
>>
>> I have applied this one, with a modified patch description.
>>
>> I like the adaptive solution, but it should be rewritten to not declare
>> and expose softirq internals. Essentially have an API from
>> kernel/softirq.c that can return whether a given (or perhaps just local)
>> softirq handler is busy or not.
> Jens,
> I posted a similar patch about two years ago
> (http://marc.info/?l=linux-kernel&m=126136252929329&w=2).
> At that time you did a lot of tests and said the same-cpu approach
> would cause huge lock contention and bouncing. Has that been fixed?

Yep, it's not ideal. But if we are running out of steam on a single
processor, there's really not much of an option currently.

-- 
Jens Axboe



