From: Ingo Molnar <mingo@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Alexander Gordeev <agordeev@redhat.com>,
	Arjan van de Ven <arjan@infradead.org>,
	linux-kernel@vger.kernel.org, x86@kernel.org,
	Suresh Siddha <suresh.b.siddha@intel.com>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	Yinghai Lu <yinghai@kernel.org>
Subject: Re: [PATCH 2/3] x86: x2apic/cluster: Make use of lowest priority delivery mode
Date: Mon, 21 May 2012 21:15:46 +0200
Message-ID: <20120521191546.GA28819@gmail.com>
In-Reply-To: <CA+55aFxJxfx+OfDiATCCPojVAPew_Hy+2=jx9g4=2_WRvixBFw@mail.gmail.com>


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 21, 2012 at 7:59 AM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > For example we don't execute tasks for 100 usecs on one CPU, 
> > then jump to another CPU and execute 100 usecs there, then 
> > to yet another CPU to create an 'absolutely balanced use of 
> > CPU resources'. Why? Because the cache-misses would be 
> > killing us.
> 
> That is likely generally not true within a single socket, 
> though.
> 
> Interrupt handlers will basically never hit in the L1 anyway 
> (*maybe* it happens if the CPU is totally idle, but quite 
> frankly, I doubt it). Even the L2 is likely not large enough 
> to have much cache across irqs, unless it's one of the big 
> Core 2 L2's that are largely shared per socket anyway.

Indeed, you are right as far as the L1 cache is concerned:

For networking we have about 180 L1 misses per IRQ handler 
invocation, if the L1 cache is cold - so most IRQ handlers 
easily fit into the L1 cache. I have measured this using a 
constant rate of networking IRQs and doing:

   perf stat -a -C 0 --repeat 10 -e L1-dcache-load-misses:k sleep 1

It only takes another 180 cache misses for these lines to get 
evicted, and this happens very easily:

On the same test system a parallel kernel build will evict about 
25 million L1 cachelines/sec/CPU. That means that an IRQ 
handler's working set in the L1 cache is indeed gone in less 
than 8 microseconds.

When the workload is RAM-bound, the L1 working set of an IRQ 
handler is gone in about 80 usecs. That corresponds to an IRQ 
rate of about 12,500/sec/CPU: if IRQs are coming in faster than 
this then they can still see some of the previous execution's 
footprint in the cache.

Wrt. the L2 cache the numbers argue even more strongly against 
moving IRQ handlers across L2 cache domains - which means that 
allowing the hardware to distribute them within a socket is a 
pretty sensible default.

> So it may well make perfect sense to allow a mask of CPU's for 
> interrupt delivery, but just make sure that the mask all 
> points to CPU's on the same socket. That would give the 
> hardware some leeway in choosing the actual core - it's very 
> possible that hardware could avoid cores that are running with 
> irq's disabled (possibly improving latency) or even more likely 
> - avoid cores that are in deeper powersaving modes.

Indeed, and that's an important argument.

The one negative effect I mentioned, affine wakeups done by the 
scheduler, could still bite us - this has to be measured and 
affine wakeups have to be made less prominent if IRQ handlers 
start jumping around. We definitely don't want tasks to follow 
round-robin IRQs around.

> Avoiding waking up CPU's that are in C6 would not only help 
> latency, it would help power use. I don't know how well the 
> irq handling actually works on a hw level, but that's exactly 
> the kind of thing I would expect HW to do well (and sw would 
> do badly, because the latencies for things like CPU power 
> states are low enough that trying to do SW irq balancing at 
> that level is entirely and completely idiotic).
> 
> So I do think that we should aim for *allowing* hardware to do 
> these kinds of choices for us. Limiting irq delivery to a 
> particular core is very limiting for very little gain (almost 
> no cache benefits), but limiting it to a particular socket 
> could certainly be a valid thing. You might want to limit it 
> to a particular socket anyway, just because the hardware 
> itself may well be closer to one socket (coming off the PCIe 
> lanes of that particular socket) than anything else.

Ok, I'm convinced.

Limiting to a socket is, I suspect, an important constraint: we 
shouldn't just feed the hardware whatever mask user-space sends 
us - user-space might not be aware of (or might be confused 
about) socket boundaries.
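
Something like the sketch below is what I'd have in mind - a 
purely hypothetical helper, not the actual patch: clamp whatever 
affinity mask user-space hands us to the socket of the first 
requested CPU before programming the APIC. topology_core_cpumask() 
is the existing per-package sibling mask; the function name is 
made up.

   #include <linux/cpumask.h>
   #include <linux/topology.h>

   /* Hypothetical sketch - not the actual patch. */
   static void clamp_affinity_to_socket(const struct cpumask *requested,
                                        struct cpumask *effective)
   {
           unsigned int cpu = cpumask_first(requested);

           /* Empty request: nothing sensible to clamp to. */
           if (cpu >= nr_cpu_ids)
                   return;

           /*
            * Keep only the requested CPUs that share a socket with
            * the first requested CPU - the hardware then picks a
            * target within a single cache domain.
            */
           cpumask_and(effective, requested, topology_core_cpumask(cpu));
   }

The effective mask always contains at least the first requested 
CPU, so we'd never hand the hardware an empty destination set.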

Thanks,

	Ingo

Thread overview: 69+ messages
2012-05-18 10:26 [PATCH 2/3] x86: x2apic/cluster: Make use of lowest priority delivery mode Alexander Gordeev
2012-05-18 14:41 ` Cyrill Gorcunov
2012-05-18 15:42   ` Alexander Gordeev
2012-05-18 15:51     ` Cyrill Gorcunov
2012-05-19 10:47       ` Cyrill Gorcunov
2012-05-21  7:11         ` Alexander Gordeev
2012-05-21  9:46           ` Cyrill Gorcunov
2012-05-19 20:53 ` Yinghai Lu
2012-05-21  8:13   ` Alexander Gordeev
2012-05-21 23:02     ` Yinghai Lu
2012-05-21 23:33       ` Yinghai Lu
2012-05-22  9:36         ` Alexander Gordeev
2012-05-21 23:44     ` Suresh Siddha
2012-05-21 23:58       ` [PATCH 1/2] x86, irq: update irq_cfg domain unless the new affinity is a subset of the current domain Suresh Siddha
2012-05-21 23:58         ` [PATCH 2/2] x2apic, cluster: use all the members of one cluster specified in the smp_affinity mask for the interrupt desintation Suresh Siddha
2012-05-22  7:04           ` Ingo Molnar
2012-05-22  7:34             ` Cyrill Gorcunov
2012-05-22 17:21             ` Suresh Siddha
2012-05-22 17:39               ` Cyrill Gorcunov
2012-05-22 17:42                 ` Suresh Siddha
2012-05-22 17:45                   ` Cyrill Gorcunov
2012-05-22 20:03           ` Yinghai Lu
2012-06-06 15:04           ` [tip:x86/apic] x86/x2apic/cluster: Use all the members of one cluster specified in the smp_affinity mask for the interrupt destination tip-bot for Suresh Siddha
2012-06-06 22:21             ` Yinghai Lu
2012-06-06 23:14               ` Suresh Siddha
2012-06-06 15:03         ` [tip:x86/apic] x86/irq: Update irq_cfg domain unless the new affinity is a subset of the current domain tip-bot for Suresh Siddha
2012-08-07 15:31           ` Robert Richter
2012-08-07 15:41             ` do_IRQ: 1.55 No irq handler for vector (irq -1) Borislav Petkov
2012-08-07 16:24               ` Suresh Siddha
2012-08-07 17:28                 ` Robert Richter
2012-08-07 17:47                   ` Suresh Siddha
2012-08-07 17:45                 ` Eric W. Biederman
2012-08-07 20:57                   ` Borislav Petkov
2012-08-07 22:39                     ` Suresh Siddha
2012-08-08  8:58                       ` Robert Richter
2012-08-08 11:04                         ` Borislav Petkov
2012-08-08 19:16                           ` Suresh Siddha
2012-08-14 17:02                             ` [tip:x86/urgent] x86, apic: fix broken legacy interrupts in the logical apic mode tip-bot for Suresh Siddha
2012-06-06 17:20         ` [PATCH 1/2] x86, irq: update irq_cfg domain unless the new affinity is a subset of the current domain Alexander Gordeev
2012-06-06 23:02           ` Suresh Siddha
2012-06-16  0:25           ` Suresh Siddha
2012-06-18  9:17             ` Alexander Gordeev
2012-06-19  0:51               ` Suresh Siddha
2012-06-19 23:43                 ` [PATCH 1/2] x86, apic: optimize cpu traversal in __assign_irq_vector() using domain membership Suresh Siddha
2012-06-19 23:43                   ` [PATCH 2/2] x86, x2apic: limit the vector reservation to the user specified mask Suresh Siddha
2012-06-20  5:56                     ` Yinghai Lu
2012-06-21  9:04                     ` Alexander Gordeev
2012-06-21 21:51                       ` Suresh Siddha
2012-06-20  5:53                   ` [PATCH 1/2] x86, apic: optimize cpu traversal in __assign_irq_vector() using domain membership Yinghai Lu
2012-06-21  8:31                   ` Alexander Gordeev
2012-06-21 21:53                     ` Suresh Siddha
2012-06-20  0:18               ` [PATCH 1/2] x86, irq: update irq_cfg domain unless the new affinity is a subset of the current domain Suresh Siddha
2012-06-21 11:00                 ` Alexander Gordeev
2012-06-21 21:58                   ` Suresh Siddha
2012-05-22 10:12       ` [PATCH 2/3] x86: x2apic/cluster: Make use of lowest priority delivery mode Alexander Gordeev
2012-05-21  8:22 ` Ingo Molnar
2012-05-21  9:36   ` Alexander Gordeev
2012-05-21 12:40     ` Ingo Molnar
2012-05-21 14:48       ` Alexander Gordeev
2012-05-21 14:59         ` Ingo Molnar
2012-05-21 15:22           ` Alexander Gordeev
2012-05-21 15:34           ` Cyrill Gorcunov
2012-05-21 15:36           ` Linus Torvalds
2012-05-21 18:07             ` Suresh Siddha
2012-05-21 18:18               ` Linus Torvalds
2012-05-21 18:37                 ` Suresh Siddha
2012-05-21 19:30                   ` Ingo Molnar
2012-05-21 19:15             ` Ingo Molnar [this message]
2012-05-21 19:56               ` Suresh Siddha
