* rq_affinity doesn't seem to work?
@ 2011-07-12 19:03 Jiang, Dave
  2011-07-12 20:30 ` Jens Axboe
  0 siblings, 1 reply; 10+ messages in thread
From: Jiang, Dave @ 2011-07-12 19:03 UTC (permalink / raw)
  To: axboe
  Cc: Williams, Dan J, Foong, Annie, linux-scsi, linux-kernel,
	Nadolski, Edmund, Skirvin, Jeffrey D

Jens,
I'm doing some performance tuning for the Intel isci SAS controller driver, and I noticed some interesting numbers with mpstat. Looking at the numbers, it seems that rq_affinity is not moving request completions to the request submission CPU. Using fio to saturate the system with 512B I/Os, I noticed that all I/Os are bound to the CPUs (CPUs 6 and 7) that service the hard irqs. I put a quick hack in the driver so that it records the CPU during request construction and then steers the scsi->done() calls to the request CPUs. With this simple hack, mpstat shows that the soft irq contexts are now distributed, and I observed a significant performance increase: the iowait% went from the 30s and 40s to low single digits, approaching 0. Any ideas what could be happening with the rq_affinity logic? I'm assuming rq_affinity should behave the way my hacked solution behaves. This is running on an 8-core, single-socket Sandy Bridge system with hyper-threading turned off. The two MSI-X interrupts on the controller are tied to CPUs 6 and 7, respectively, via /proc/irq/X/smp_affinity. I'm running fio with 8 SAS disks and 8 threads.
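
As an illustration, a minimal hypothetical sketch of this kind of steering (invented context structure and helper names; not the actual isci change): record the submitting CPU when the request is built, then bounce the done() call back to that CPU from the hard irq handler.

#include <linux/smp.h>
#include <scsi/scsi_cmnd.h>

struct my_req_ctx {
	struct scsi_cmnd *cmd;
	struct call_single_data csd;	/* used to bounce the completion */
	int submit_cpu;			/* CPU that built the request */
};

static void my_build_request(struct my_req_ctx *ctx, struct scsi_cmnd *cmd)
{
	ctx->cmd = cmd;
	ctx->submit_cpu = get_cpu();	/* record the submission CPU */
	put_cpu();
}

static void my_done_on_cpu(void *data)
{
	struct my_req_ctx *ctx = data;

	ctx->cmd->scsi_done(ctx->cmd);	/* now runs on the submitting CPU */
}

/* called from the controller's (MSI-X) hard irq handler */
static void my_complete_request(struct my_req_ctx *ctx)
{
	if (ctx->submit_cpu == smp_processor_id()) {
		ctx->cmd->scsi_done(ctx->cmd);
	} else {
		/* IPI the submitting CPU so the done() call runs there */
		ctx->csd.func = my_done_on_cpu;
		ctx->csd.info = ctx;
		ctx->csd.flags = 0;
		__smp_call_function_single(ctx->submit_cpu, &ctx->csd, 0);
	}
}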

no rq_affinity:
09:23:31 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
09:23:36 AM  all    9.65    0.00   41.75   23.60    0.00   24.98    0.00    0.00    0.03
09:23:36 AM    0   13.40    0.00   59.60   27.00    0.00    0.00    0.00    0.00    0.00
09:23:36 AM    1   14.00    0.00   58.80   27.20    0.00    0.00    0.00    0.00    0.00
09:23:36 AM    2   13.20    0.00   57.40   29.40    0.00    0.00    0.00    0.00    0.00
09:23:36 AM    3   12.40    0.00   57.00   30.60    0.00    0.00    0.00    0.00    0.00
09:23:36 AM    4   12.60    0.00   52.80   34.60    0.00    0.00    0.00    0.00    0.00
09:23:36 AM    5   11.62    0.00   48.30   40.08    0.00    0.00    0.00    0.00    0.00
09:23:36 AM    6    0.00    0.00    0.20    0.00    0.00   99.80    0.00    0.00    0.00
09:23:36 AM    7    0.00    0.00    0.00    0.00    0.00   99.80    0.00    0.00    0.20

with rq_affinity:
09:25:04 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
09:25:09 AM  all    9.50    0.00   42.32   23.19    0.00   24.99    0.00    0.00    0.00
09:25:09 AM    0   13.80    0.00   61.60   24.60    0.00    0.00    0.00    0.00    0.00
09:25:09 AM    1   13.03    0.00   60.32   26.65    0.00    0.00    0.00    0.00    0.00
09:25:09 AM    2   12.83    0.00   58.52   28.66    0.00    0.00    0.00    0.00    0.00
09:25:09 AM    3   12.20    0.00   56.60   31.20    0.00    0.00    0.00    0.00    0.00
09:25:09 AM    4   12.20    0.00   52.40   35.40    0.00    0.00    0.00    0.00    0.00
09:25:09 AM    5   11.78    0.00   49.30   38.92    0.00    0.00    0.00    0.00    0.00
09:25:09 AM    6    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
09:25:09 AM    7    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00

with soft irq steering:
09:31:57 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
09:32:02 AM  all   12.73    0.00   46.82    1.63    8.03   28.59    0.00    0.00    2.20
09:32:02 AM    0   16.20    0.00   55.00    3.20   10.20   15.40    0.00    0.00    0.00
09:32:02 AM    1   15.60    0.00   57.60    0.00   10.00   16.80    0.00    0.00    0.00
09:32:02 AM    2   16.03    0.00   56.91    0.20   10.62   16.23    0.00    0.00    0.00
09:32:02 AM    3   15.77    0.00   58.48    0.20   10.18   15.17    0.00    0.00    0.20
09:32:02 AM    4   16.17    0.00   56.09    0.00   10.18   17.56    0.00    0.00    0.00
09:32:02 AM    5   16.00    0.00   56.60    0.20   10.60   16.60    0.00    0.00    0.00
09:32:02 AM    6    3.41    0.00   18.64    3.81    0.80   60.52    0.00    0.00   12.83
09:32:02 AM    7    2.79    0.00   14.97    5.79    1.40   70.26    0.00    0.00    4.79


* Re: rq_affinity doesn't seem to work?
  2011-07-12 19:03 rq_affinity doesn't seem to work? Jiang, Dave
@ 2011-07-12 20:30 ` Jens Axboe
  2011-07-12 21:17   ` Jiang, Dave
  2011-07-13 17:10   ` Matthew Wilcox
  0 siblings, 2 replies; 10+ messages in thread
From: Jens Axboe @ 2011-07-12 20:30 UTC (permalink / raw)
  To: Jiang, Dave
  Cc: Williams, Dan J, Foong, Annie, linux-scsi, linux-kernel,
	Nadolski, Edmund, Skirvin, Jeffrey D

On 2011-07-12 21:03, Jiang, Dave wrote:
> Jens,
> I'm doing some performance tuning for the Intel isci SAS controller
> driver, and I noticed some interesting numbers with mpstat. Looking at
> the numbers, it seems that rq_affinity is not moving request
> completions to the request submission CPU. Using fio to saturate the
> system with 512B I/Os, I noticed that all I/Os are bound to the CPUs
> (CPUs 6 and 7) that service the hard irqs. I put a quick hack in the
> driver so that it records the CPU during request construction and then
> steers the scsi->done() calls to the request CPUs. With this simple
> hack, mpstat shows that the soft irq contexts are now distributed, and
> I observed a significant performance increase: the iowait% went from
> the 30s and 40s to low single digits, approaching 0. Any ideas what
> could be happening with the rq_affinity logic? I'm assuming
> rq_affinity should behave the way my hacked solution behaves. This is
> running on an 8-core, single-socket Sandy Bridge system with
> hyper-threading turned off. The two MSI-X interrupts on the controller
> are tied to CPUs 6 and 7, respectively, via /proc/irq/X/smp_affinity.
> I'm running fio with 8 SAS disks and 8 threads.

It's probably the grouping, we need to do something about that. Does the
below patch make it behave as you expect?

diff --git a/block/blk.h b/block/blk.h
index d658628..17d53d8 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -157,6 +157,7 @@ static inline int queue_congestion_off_threshold(struct request_queue *q)
 
 static inline int blk_cpu_to_group(int cpu)
 {
+#if 0
 	int group = NR_CPUS;
 #ifdef CONFIG_SCHED_MC
 	const struct cpumask *mask = cpu_coregroup_mask(cpu);
@@ -168,6 +169,7 @@ static inline int blk_cpu_to_group(int cpu)
 #endif
 	if (likely(group < NR_CPUS))
 		return group;
+#endif
 	return cpu;
 }
 

-- 
Jens Axboe
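
With the #if 0/#endif above in place, blk_cpu_to_group() compiles down to
the identity mapping; for illustration, the effective result is simply:

	static inline int blk_cpu_to_group(int cpu)
	{
		return cpu;	/* no grouping: complete on the submitting CPU */
	}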



* RE: rq_affinity doesn't seem to work?
  2011-07-12 20:30 ` Jens Axboe
@ 2011-07-12 21:17   ` Jiang, Dave
  2011-07-13 17:10   ` Matthew Wilcox
  1 sibling, 0 replies; 10+ messages in thread
From: Jiang, Dave @ 2011-07-12 21:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Williams, Dan J, Foong, Annie, linux-scsi, linux-kernel,
	Nadolski, Edmund, Skirvin, Jeffrey D

> -----Original Message-----
> From: Jens Axboe [mailto:axboe@kernel.dk]
> Sent: Tuesday, July 12, 2011 1:31 PM
> To: Jiang, Dave
> Cc: Williams, Dan J; Foong, Annie; linux-scsi@vger.kernel.org; linux-
> kernel@vger.kernel.org; Nadolski, Edmund; Skirvin, Jeffrey D
> Subject: Re: rq_affinity doesn't seem to work?
> 
> On 2011-07-12 21:03, Jiang, Dave wrote:
> > Jens,
> > I'm doing some performance tuning for the Intel isci SAS controller
> > driver, and I noticed some interesting numbers with mpstat. Looking
> > at the numbers, it seems that rq_affinity is not moving request
> > completions to the request submission CPU. Using fio to saturate the
> > system with 512B I/Os, I noticed that all I/Os are bound to the CPUs
> > (CPUs 6 and 7) that service the hard irqs. I put a quick hack in the
> > driver so that it records the CPU during request construction and
> > then steers the scsi->done() calls to the request CPUs. With this
> > simple hack, mpstat shows that the soft irq contexts are now
> > distributed, and I observed a significant performance increase: the
> > iowait% went from the 30s and 40s to low single digits, approaching
> > 0. Any ideas what could be happening with the rq_affinity logic? I'm
> > assuming rq_affinity should behave the way my hacked solution
> > behaves. This is running on an 8-core, single-socket Sandy Bridge
> > system with hyper-threading turned off. The two MSI-X interrupts on
> > the controller are tied to CPUs 6 and 7, respectively, via
> > /proc/irq/X/smp_affinity. I'm running fio with 8 SAS disks and 8
> > threads.
> 
> It's probably the grouping, we need to do something about that. Does the
> below patch make it behave as you expect?

Yep that is it.

02:14:12 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:14:17 PM  all   11.98    0.00   46.62    1.18    0.00   37.79    0.00    0.00    2.43
02:14:17 PM    0   15.43    0.00   55.31    0.00    0.00   29.26    0.00    0.00    0.00
02:14:17 PM    1   14.83    0.00   56.71    0.00    0.00   28.46    0.00    0.00    0.00
02:14:17 PM    2   14.80    0.00   56.00    0.00    0.00   29.20    0.00    0.00    0.00
02:14:17 PM    3   14.63    0.00   57.11    0.00    0.00   28.26    0.00    0.00    0.00
02:14:17 PM    4   14.80    0.00   57.60    0.00    0.00   27.60    0.00    0.00    0.00
02:14:17 PM    5   15.03    0.00   56.11    0.00    0.00   28.86    0.00    0.00    0.00
02:14:17 PM    6    3.79    0.00   20.16    5.99    0.00   59.68    0.00    0.00   10.38
02:14:17 PM    7    2.80    0.00   14.20    3.20    0.00   70.80    0.00    0.00    9.00


* Re: rq_affinity doesn't seem to work?
  2011-07-12 20:30 ` Jens Axboe
  2011-07-12 21:17   ` Jiang, Dave
@ 2011-07-13 17:10   ` Matthew Wilcox
  2011-07-13 18:00     ` Jens Axboe
  2011-07-14 17:02     ` Roland Dreier
  1 sibling, 2 replies; 10+ messages in thread
From: Matthew Wilcox @ 2011-07-13 17:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jiang, Dave, Williams, Dan J, Foong, Annie, linux-scsi,
	linux-kernel, Nadolski, Edmund, Skirvin, Jeffrey D

On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
> It's probably the grouping, we need to do something about that. Does the
> below patch make it behave as you expect?

"something", absolutely.  But there is benefit from doing some aggregation
(we tried disabling it entirely with the "well-known OLTP benchmark" and
performance went down).

Ideally we'd do something like "if the softirq is taking up more than 10%
of a core, split the grouping".  Do we have enough stats to do that kind
of monitoring?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."


* Re: rq_affinity doesn't seem to work?
  2011-07-13 17:10   ` Matthew Wilcox
@ 2011-07-13 18:00     ` Jens Axboe
  2011-07-14 17:02     ` Roland Dreier
  1 sibling, 0 replies; 10+ messages in thread
From: Jens Axboe @ 2011-07-13 18:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jiang, Dave, Williams, Dan J, Foong, Annie, linux-scsi,
	linux-kernel, Nadolski, Edmund, Skirvin, Jeffrey D

On 2011-07-13 19:10, Matthew Wilcox wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does the
>> below patch make it behave as you expect?
> 
> "something", absolutely.  But there is benefit from doing some aggregation
> (we tried disabling it entirely with the "well-known OLTP benchmark" and
> performance went down).

Yep, that's why the current solution is somewhat middle of the road...

> Ideally we'd do something like "if the softirq is taking up more than 10%
> of a core, split the grouping".  Do we have enough stats to do that kind
> of monitoring?

I don't think we have those stats, though they could/should be pulled from
the ksoftirqd/N threads. We could have some metric, a la

        dest_cpu = get_group_completion_cpu(rq->cpu);
        if (ksoftirqd_of(dest_cpu) >= 90% busy)
                dest_cpu = rq->cpu;

to send things completely local to the submitter of the IO, IFF the
current CPU is close to running at full tilt.
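
A rough sketch of one way to express that check (pick_completion_cpu() is a
hypothetical helper; it assumes the code lives in the block layer next to
blk_cpu_to_group() and reuses the per-cpu ksoftirqd pointer that
kernel/softirq.c already defines):

	/* the per-cpu ksoftirqd threads are defined in kernel/softirq.c */
	DECLARE_PER_CPU(struct task_struct *, ksoftirqd);

	static int pick_completion_cpu(struct request *rq)
	{
		int dest_cpu = blk_cpu_to_group(rq->cpu);
		struct task_struct *tsk = per_cpu(ksoftirqd, dest_cpu);

		/*
		 * If ksoftirqd on the group's completion CPU is already
		 * runnable, softirq work has spilled out of the irq return
		 * path there; treat it as overloaded and complete on the
		 * submitter instead.
		 */
		if (tsk && tsk->state == TASK_RUNNING)
			return rq->cpu;

		return dest_cpu;
	}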

-- 
Jens Axboe



* Re: rq_affinity doesn't seem to work?
  2011-07-13 17:10   ` Matthew Wilcox
  2011-07-13 18:00     ` Jens Axboe
@ 2011-07-14 17:02     ` Roland Dreier
  2011-07-15 20:20       ` Dan Williams
  2011-07-15 23:43       ` ersatz splatt
  1 sibling, 2 replies; 10+ messages in thread
From: Roland Dreier @ 2011-07-14 17:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jens Axboe, Jiang, Dave, Williams, Dan J, Foong, Annie,
	linux-scsi, linux-kernel, Nadolski, Edmund, Skirvin, Jeffrey D

On Wed, Jul 13, 2011 at 10:10 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> On Tue, Jul 12, 2011 at 10:30:35PM +0200, Jens Axboe wrote:
>> It's probably the grouping, we need to do something about that. Does the
>> below patch make it behave as you expect?
>
> "something", absolutely.  But there is benefit from doing some aggregation
> (we tried disabling it entirely with the "well-known OLTP benchmark" and
> performance went down).
>
> Ideally we'd do something like "if the softirq is taking up more than 10%
> of a core, split the grouping".  Do we have enough stats to do that kind
> of monitoring?

What platform was your "OLTP benchmark" on?  It seems that as the
number of cores per package goes up, this grouping becomes too coarse,
since almost everyone will have SCHED_MC set in the code:

	static inline int blk_cpu_to_group(int cpu)
	{
		int group = NR_CPUS;
	#ifdef CONFIG_SCHED_MC
		const struct cpumask *mask = cpu_coregroup_mask(cpu);
		group = cpumask_first(mask);
	#elif defined(CONFIG_SCHED_SMT)
		group = cpumask_first(topology_thread_cpumask(cpu));
	#else
		return cpu;
	#endif
		if (likely(group < NR_CPUS))
			return group;
		return cpu;
	}

and so we use cpumask_first(cpu_coregroup_mask(cpu)).  And from

	const struct cpumask *cpu_coregroup_mask(int cpu)
	{
	        struct cpuinfo_x86 *c = &cpu_data(cpu);
	        /*
	         * For perf, we return last level cache shared map.
	         * And for power savings, we return cpu_core_map
	         */
	        if ((sched_mc_power_savings || sched_smt_power_savings) &&
	            !(cpu_has(c, X86_FEATURE_AMD_DCM)))
	                return cpu_core_mask(cpu);
	        else
	                return cpu_llc_shared_mask(cpu);
	}

in the "max performance" case, we use cpu_llc_shared_mask().

The problem as we've seen it is that on a dual-socket Westmere (Xeon
56xx) system, we have two sockets with 6 cores (12 threads) each, all
sharing L3 cache, and so we end up with all block softirqs on only 2
out of 24 threads, which is not enough to handle all the IOPS that
fast storage can provide.

It's not clear to me what the right answer or tradeoffs are here.  It
might make sense to use only one hyperthread per core for block
softirqs.  As I understand the Westmere cache topology, there's not
really an obvious intermediate step -- all the cores in a package
share the L3, and then each core has its own L2.

Limiting softirqs to 10% of a core seems a bit low, since we seem to
be able to use more than 100% of a core handling block softirqs, and
anyway magic numbers like that seem to always be wrong sometimes.
Perhaps we could use the queue length on the destination CPU as a
proxy for how busy ksoftirq is?
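
A minimal sketch of that idea, simplified to a non-empty check rather than a
true queue length (completion_cpu_backlogged() is a made-up name; blk_cpu_done
is the existing per-cpu completion list in block/blk-softirq.c, and the
lockless peek from another CPU is only a heuristic):

	static bool completion_cpu_backlogged(int cpu)
	{
		/*
		 * A non-empty per-cpu completion list means that CPU still
		 * has undrained block softirq work; racy to read remotely,
		 * but good enough as a load hint.
		 */
		return !list_empty(&per_cpu(blk_cpu_done, cpu));
	}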

 - R.


* Re: rq_affinity doesn't seem to work?
  2011-07-14 17:02     ` Roland Dreier
@ 2011-07-15 20:20       ` Dan Williams
  2011-07-15 23:43       ` ersatz splatt
  1 sibling, 0 replies; 10+ messages in thread
From: Dan Williams @ 2011-07-15 20:20 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Matthew Wilcox, Jens Axboe, Jiang, Dave, Foong, Annie,
	linux-scsi, linux-kernel, Nadolski, Edmund, Skirvin, Jeffrey D

On Thu, 2011-07-14 at 10:02 -0700, Roland Dreier wrote:
> On Wed, Jul 13, 2011 at 10:10 AM, Matthew Wilcox <matthew@wil.cx> wrote:
> Limiting softirqs to 10% of a core seems a bit low, since we seem to
> be able to use more than 100% of a core handling block softirqs, and
> anyway magic numbers like that seem to always be wrong sometimes.
> Perhaps we could use the queue length on the destination CPU as a
> proxy for how busy ksoftirq is?

This is likely too aggressive (untested / need to confirm it resolves
the isci issue), but it's at least straightforward to determine, and I
wonder if it prevents the regression Matthew is seeing.  It assumes that
once we have naturally spilled from the irq return path to ksoftirqd,
this cpu is having trouble keeping up with the load.

??

diff --git a/block/blk-core.c b/block/blk-core.c
index d2f8f40..9c7ba87 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1279,10 +1279,8 @@ get_rq:
 	init_request_from_bio(req, bio);
 
 	if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) ||
-	    bio_flagged(bio, BIO_CPU_AFFINE)) {
-		req->cpu = blk_cpu_to_group(get_cpu());
-		put_cpu();
-	}
+	    bio_flagged(bio, BIO_CPU_AFFINE))
+		req->cpu = smp_processor_id();
 
 	plug = current->plug;
 	if (plug) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index ee9c216..720918f 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -101,17 +101,21 @@ static struct notifier_block __cpuinitdata blk_cpu_notifier = {
 	.notifier_call	= blk_cpu_notify,
 };
 
+DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
+
 void __blk_complete_request(struct request *req)
 {
+	int ccpu, cpu, group_ccpu, group_cpu;
 	struct request_queue *q = req->q;
+	struct task_struct *tsk;
 	unsigned long flags;
-	int ccpu, cpu, group_cpu;
 
 	BUG_ON(!q->softirq_done_fn);
 
 	local_irq_save(flags);
 	cpu = smp_processor_id();
 	group_cpu = blk_cpu_to_group(cpu);
+	tsk = per_cpu(ksoftirqd, cpu);
 
 	/*
 	 * Select completion CPU
@@ -120,8 +124,15 @@ void __blk_complete_request(struct request *req)
 		ccpu = req->cpu;
 	else
 		ccpu = cpu;
+	group_ccpu = blk_cpu_to_group(ccpu);
 
-	if (ccpu == cpu || ccpu == group_cpu) {
+	/*
+	 * try to skip a remote softirq-trigger if the completion is
+	 * within the same group, but not if local softirqs have already
+	 * spilled to ksoftirqd
+	 */
+	if (ccpu == cpu ||
+	    (group_ccpu == group_cpu && tsk->state != TASK_RUNNING)) {
 		struct list_head *list;
 do_local:
 		list = &__get_cpu_var(blk_cpu_done);







* Re: rq_affinity doesn't seem to work?
  2011-07-14 17:02     ` Roland Dreier
  2011-07-15 20:20       ` Dan Williams
@ 2011-07-15 23:43       ` ersatz splatt
  2011-07-16  2:12         ` ersatz splatt
  2011-07-16  2:40         ` Christoph Hellwig
  1 sibling, 2 replies; 10+ messages in thread
From: ersatz splatt @ 2011-07-15 23:43 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Matthew Wilcox, Jens Axboe, Jiang, Dave, Williams, Dan J, Foong,
	Annie, linux-scsi, linux-kernel, Nadolski, Edmund, Skirvin,
	Jeffrey D

On Thu, Jul 14, 2011 at 10:02 AM, Roland Dreier <roland@purestorage.com> wrote:

> The problem as we've seen it is that on a dual-socket Westmere (Xeon
> 56xx) system, we have two sockets with 6 cores (12 threads) each, all
> sharing L3 cache, and so we end up with all block softirqs on only 2
> out of 24 threads, which is not enough to handle all the IOPS that
> fast storage can provide.

I have a dual-socket system with the Tylersburg chipset (approximately
Westmere, I gather).  With two Xeon X5660 packages I get this when running
with more IOPS potential than the system can handle:

02:15:00 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
02:15:02 PM  all    2.76    0.00   30.40   28.28    0.00   13.74    0.00    0.00   24.81
02:15:02 PM    0    0.00    0.00    0.00    0.00    0.00  100.00    0.00    0.00    0.00
02:15:02 PM    1    0.00    0.00    0.50    0.00    0.00    0.00    0.00    0.00   99.50
02:15:02 PM    2    3.02    0.00   36.68   52.26    0.00    8.04    0.00    0.00    0.00
02:15:02 PM    3    2.50    0.00   36.00   54.50    0.00    7.00    0.00    0.00    0.00
02:15:02 PM    4    5.47    0.00   64.18   18.91    0.00   11.44    0.00    0.00    0.00
02:15:02 PM    5    3.02    0.00   37.69   53.27    0.00    6.03    0.00    0.00    0.00
02:15:02 PM    6    0.00    0.00    0.50    0.00    0.00   91.54    0.00    0.00    7.96
02:15:02 PM    7    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:15:02 PM    8    3.00    0.00   35.50   55.00    0.00    6.50    0.00    0.00    0.00
02:15:02 PM    9    3.02    0.00   39.70   50.25    0.00    7.04    0.00    0.00    0.00
02:15:02 PM   10    3.50    0.00   36.50   53.00    0.00    7.00    0.00    0.00    0.00
02:15:02 PM   11    6.53    0.00   70.85    9.05    0.00   13.57    0.00    0.00    0.00
02:15:02 PM   12    0.00    0.00    0.57    0.00    0.00    0.00    0.00    0.00   99.43
02:15:02 PM   13    3.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   97.00
02:15:02 PM   14    2.50    0.00   36.50   54.00    0.00    7.00    0.00    0.00    0.00
02:15:02 PM   15    3.52    0.00   36.18   53.77    0.00    6.53    0.00    0.00    0.00
02:15:02 PM   16    5.00    0.00   64.00   21.00    0.00   10.00    0.00    0.00    0.00
02:15:02 PM   17    3.02    0.00   37.19   52.76    0.00    7.04    0.00    0.00    0.00
02:15:02 PM   18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
02:15:02 PM   19    0.00    0.00    1.01    0.00    0.00    0.00    0.00    0.00   98.99
02:15:02 PM   20    3.48    0.00   38.31   52.24    0.00    5.97    0.00    0.00    0.00
02:15:02 PM   21    5.50    0.00   63.00   18.50    0.00   13.00    0.00    0.00    0.00
02:15:02 PM   22    2.50    0.00   35.00   54.50    0.00    8.00    0.00    0.00    0.00
02:15:02 PM   23    5.03    0.00   58.79   23.62    0.00   12.56    0.00    0.00    0.00

By "more IOPS potential than the system can handle", I mean that with
about a quarter of the targets I get the same figure.  The HBA is
known to handle more than twice the IOPS I'm seeing.

I'm using 16 targets, with fio driving one target from each core you
see %sys activity on.  You can see that two additional cores are
getting weighed down -- 0 and 6.  Is that indicative of the
bottleneck?

These results are without using any of the patches suggested in this
e-mail thread.  I'll have to try and see if they help.

What is the top number of IOPS I should hope for with this system and
the Linux kernel?
Dave Jiang (or anyone else) -- can you share the max IOPS that you are seeing?


> It's not clear to me what the right answer or tradeoffs are here.  It
> might make sense to use only one hyperthread per core for block
> softirqs.  As I understand the Westmere cache topology, there's not
> really an obvious intermediate step -- all the cores in a package
> share the L3, and then each core has its own L2.
>
> Limiting softirqs to 10% of a core seems a bit low, since we seem to
> be able to use more than 100% of a core handling block softirqs, and
> anyway magic numbers like that seem to always be wrong sometimes.
> Perhaps we could use the queue length on the destination CPU as a
> proxy for how busy ksoftirq is?
>
>  - R.


* Re: rq_affinity doesn't seem to work?
  2011-07-15 23:43       ` ersatz splatt
@ 2011-07-16  2:12         ` ersatz splatt
  2011-07-16  2:40         ` Christoph Hellwig
  1 sibling, 0 replies; 10+ messages in thread
From: ersatz splatt @ 2011-07-16  2:12 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Matthew Wilcox, Jens Axboe, Jiang, Dave, Williams, Dan J, Foong,
	Annie, linux-scsi, linux-kernel, Nadolski, Edmund, Skirvin,
	Jeffrey D

With the quickest and easiest fix (the first suggestion from Jens
Axboe), I was able to get another 20%+ in IOPS.  Thank you.


Driving more IOPS on the same system looks like this for me:
 CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
 all    2.85    0.00   31.37   12.05    0.00   14.84    0.00    0.00   38.90
   0    2.44    0.00    0.00    0.00    0.00    4.39    0.00    0.00   93.17
   1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
   2    1.51    0.00   23.12   70.85    0.00    4.52    0.00    0.00    0.00
   3    5.05    0.00   51.01   19.70    0.00   24.24    0.00    0.00    0.00
   4    5.47    0.00   62.19    1.00    0.00   31.34    0.00    0.00    0.00
   5    4.00    0.00   50.00   22.50    0.00   23.50    0.00    0.00    0.00
   6    0.00    0.00    0.00    0.00    0.00    0.47    0.00    0.00   99.53
   7    0.00    0.00    0.22    0.00    0.00    0.00    0.00    0.00   99.78
   8    4.48    0.00   53.23   16.92    0.00   25.37    0.00    0.00    0.00
   9    4.48    0.00   50.25   19.40    0.00   25.87    0.00    0.00    0.00
  10    5.53    0.00   63.82    0.50    0.00   30.15    0.00    0.00    0.00
  11    3.50    0.00   52.00   20.50    0.00   24.00    0.00    0.00    0.00
  12    0.50    0.00    1.00    1.49    0.00    0.00    0.00    0.00   97.01
  13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
  14    3.50    0.00   43.50   35.50    0.00   17.50    0.00    0.00    0.00
  15    4.02    0.00   51.26   20.60    0.00   24.12    0.00    0.00    0.00
  16    6.03    0.00   57.29    8.54    0.00   28.14    0.00    0.00    0.00
  17    4.50    0.00   49.00   25.00    0.00   21.50    0.00    0.00    0.00
  18    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
  19    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
  20    4.98    0.00   57.21   11.44    0.00   26.37    0.00    0.00    0.00
  21    4.50    0.00   54.00   16.00    0.00   25.50    0.00    0.00    0.00
  22    5.50    0.00   58.00    7.00    0.00   29.50    0.00    0.00    0.00
  23    4.00    0.00   49.50   22.50    0.00   24.00    0.00    0.00    0.00

I'm happy to have the performance improvement, but I would like to
know how I could do much better.  The storage hardware is all capable
of about twice the IOPS I'm getting now.

I see that "sys" is eating most of the CPU time at this point.  What
do I need to fix?  Is fio too heavy in implementation?  ... or is this
a scsi midlayer bottleneck?

I would be happy to get advice on what I should do to better
illuminate the bottleneck.


* Re: rq_affinity doesn't seem to work?
  2011-07-15 23:43       ` ersatz splatt
  2011-07-16  2:12         ` ersatz splatt
@ 2011-07-16  2:40         ` Christoph Hellwig
  1 sibling, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2011-07-16  2:40 UTC (permalink / raw)
  To: ersatz splatt
  Cc: Roland Dreier, Matthew Wilcox, Jens Axboe, Jiang, Dave, Williams,
	Dan J, Foong, Annie, linux-scsi, linux-kernel, Nadolski, Edmund,
	Skirvin, Jeffrey D

On Fri, Jul 15, 2011 at 04:43:44PM -0700, ersatz splatt wrote:
> I have a dual-socket system with the Tylersburg chipset (approximately
> Westmere, I gather).  With two Xeon X5660 packages I get this when running
> with more IOPS potential than the system can handle:

What HBA do you use? Does it already have a lockless ->queuecommand?
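
For reference, a bare-bones sketch of what a "lockless" ->queuecommand means
since the host_lock push-down: the midlayer calls it without taking
shost->host_lock, so the driver relies on its own locking. Hypothetical
driver code, for illustration only:

	#include <scsi/scsi_host.h>
	#include <scsi/scsi_cmnd.h>

	static int my_queuecommand(struct Scsi_Host *shost,
				   struct scsi_cmnd *cmd)
	{
		/*
		 * Queue cmd to the hardware using only driver-private
		 * locking; the completion comes later from the interrupt
		 * handler via cmd->scsi_done(cmd).
		 */
		return 0;
	}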


