linux-kernel.vger.kernel.org archive mirror
From: Andrea Arcangeli <andrea@suse.de>
To: William Lee Irwin III <wli@holomorphy.com>,
	"David S. Miller" <davem@redhat.com>,
	akpm@digeo.com, davidsen@tmr.com, haveblue@us.ibm.com,
	habanero@us.ibm.com, mbligh@aracnet.com,
	linux-kernel@vger.kernel.org
Subject: Re: userspace irq balancer
Date: Tue, 27 May 2003 04:14:07 +0200
Message-ID: <20030527021407.GM3767@dualathlon.random>
In-Reply-To: <20030527015307.GC8978@holomorphy.com>

On Mon, May 26, 2003 at 06:53:07PM -0700, William Lee Irwin III wrote:
> On Mon, May 26, 2003 at 05:48:41PM -0700, David S. Miller wrote:
> > Andrea, whether ksoftirqd processes the softirq work or not has
> > nothing to do with what I'm talking about.
> > It is all about what does a hardware IRQ mean in terms of work
> > processed.  And it can mean anything from 1 to 1000 packets worth
> > of work.
> > Therefore, any usage of hardware IRQ activity to determine "load" in
> > any sense is totally inaccurate.
> > So I'm asking you, again, how are you going to measure softirq load in
> > making hardware IRQ load balancing decisions?  Watching the scheduling
> > and running of ksoftirqd is not an answer.  Networking hardware
> > interrupts, with a simplistic and mindless algorithm like the one we
> > have currently in the 2.5.x IRQ balancing code, appear to be
> > contributing very little to "load" and that is wrong.
> 
> I should also point out that the cost of reprogramming the interrupt
> controllers isn't taken into account by the kernel irq balancer. In

Do you want to take that into account in userspace? If there is a place
to take it into account, that place is the kernel. You can even
benchmark it at boot.

> the userspace implementation the reprogramming is done infrequently
> enough to make even significant cost negligible; in-kernel the cost
> is entirely uncontrolled and the rate of reprogramming unlimited.

That depends on the kernel algorithm.

I feel like "the in-kernel algorithm" is assumed to be the one floating
around that reprograms the APIC even when it makes zero changes to the
routing, as if nothing else were possible in the kernel.

Start like this: put the userspace algorithm in the kernel, then add a
few bytes of info to keep an average of the idle cpus every second. Once
a cpu has been idle for 30 seconds, start routing irqs to that idle cpu,
slowly; after 60 seconds, more aggressively, etc. With such an algorithm
you couldn't care less about the reprogramming speed, just like with the
current "userspace" algorithm, but thanks to the kernel info it will be
able to make smarter decisions that would never be possible in userspace
(w/o the waste of tlb flushing, and w/o the waste of implementing a
kernel->user microkernel protocol).
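
As a rough illustration only (this is not code from the thread or from
any kernel tree), the idle-time policy described above could be
simulated in userspace along these lines. The 30 and 60 second
thresholds come from the text, with "60 seconds" read as total idle
time, which is an assumption. Everything else is hypothetical: the
Balancer class, the per-tick migration counts, and the rule of taking
irqs only from cpus that were busy in the last tick.

```python
from dataclasses import dataclass, field

# Thresholds taken from the message; the per-tick move counts are assumptions.
SLOW_AFTER_S = 30   # start routing irqs to a cpu idle this long
FAST_AFTER_S = 60   # route more aggressively after this much idle time
SLOW_MOVES = 1      # hypothetical: irqs moved per tick in the slow phase
FAST_MOVES = 4      # hypothetical: irqs moved per tick in the fast phase


@dataclass
class Balancer:
    routing: dict                                  # irq number -> cpu id
    idle_secs: dict = field(default_factory=dict)  # cpu id -> consecutive idle seconds
    reprograms: int = 0                            # simulated controller writes

    def tick(self, idle_cpus):
        """Call once per second with the set of cpus that were idle."""
        cpus = set(self.routing.values()) | set(idle_cpus) | set(self.idle_secs)
        for cpu in cpus:
            # A busy second resets the idle streak to zero.
            self.idle_secs[cpu] = (
                self.idle_secs.get(cpu, 0) + 1 if cpu in idle_cpus else 0
            )
        for cpu in sorted(idle_cpus):
            self._pull_irqs(cpu, self._budget(self.idle_secs[cpu]))

    @staticmethod
    def _budget(secs):
        """How many irqs may be moved this tick to a cpu idle for `secs`."""
        if secs < SLOW_AFTER_S:
            return 0
        if secs < FAST_AFTER_S:
            return SLOW_MOVES
        return FAST_MOVES

    def _pull_irqs(self, target, budget):
        # Only irqs currently on a busy cpu other than the target are
        # candidates, so a write never happens when routing would not change.
        donors = [irq for irq, cpu in sorted(self.routing.items())
                  if cpu != target and self.idle_secs.get(cpu, 0) == 0]
        for irq in donors[:budget]:
            self.routing[irq] = target
            self.reprograms += 1
```

With 100 irqs on cpu 0 and cpu 1 reported idle on every tick, nothing
is written for the first 29 seconds, one irq moves per second from
second 30, and four per second from second 60, so the write rate is
bounded by construction no matter how expensive a single write is.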

> Also, Linux' i386 IO-APIC programming model is quite fragile and does
> not properly distinguish between physical and logical destinations or
> SAPIC vs. xAPIC (which differ in the physical destination format) to
> keep it coherent with i386 IO-APIC's DESTMOD. I would very much like to
> see that confusion corrected before any significant amount of online
> i386 IO-APIC RTE reprogramming is considered "stable". For instance, I
> know of one subarch that claims to use logical DESTMOD with clustered
> hierarchical DFR, but is using what appears to be SAPIC physical
> broadcast for the RTE's, and a couple of other confusions where the
> types of APIC ID's are ambiguous depending on subarch and broken by
> dynamic reprogramming. It furthermore assumes flat logical DFR by
> virtue of attempting to form APIC destinations representing arbitrary
> sets of cpus in addition to assuming at least logical with
> cpumask_to_logical_apicid() and is one of the major reasons irqbalance
> is either disabled or unusable in various subarches.
> 
> The story of APIC code tripping over itself is an even unfunnier comedy
> of errors, as the lack of TPR adjustment means that within any APIC
> destination at which IO-APIC RTE's are targeted on Pentium IV systems
> there will always be just a single cpu at which all interrupts are
> concentrated. In order to work around this, all of the buggy code
> choking on the fact arbitrary sets of cpus aren't representable as APIC
> destinations is actually unused except as a buggy translation layer
> from cpu ID's to APIC destinations, and the irqbalancing code works
> around this by forming singleton cpumasks, which have historically been
> frequently confused with APIC destinations of all 4 different formats.
> 
> Basically, the kernel has yet to handle IO-APIC RTE programming
> properly, and until there is a remote semblance of action moving it
> toward the correct formation of IO-APIC RTE's, in-kernel irqbalancing
> is a house of cards built on rapidly shifting sands. There is no point

Again, reading this I get the impression that the only possible kernel
algorithm is assumed to be one that bounces and reprograms things as
quickly as it can, like the hardware one did.

> in anything but a userspace driver where the complexity the kernel has
> failed to handle thus far can be punted, or reliance on hardware
> mechanisms like the TPR that insulate the kernel from its prior and
> current embarrassments in handling this complexity, until something is
> done to correct IO-APIC RTE formation.
> 
> 
> -- wli


Andrea
