linux-kernel.vger.kernel.org archive mirror
From: William Lee Irwin III <wli@holomorphy.com>
To: Gerrit Huizenga <gh@us.ibm.com>
Cc: "Nakajima, Jun" <jun.nakajima@intel.com>,
	jamesclv@us.ibm.com, haveblue@us.ibm.com, pbadari@us.ibm.com,
	linux-kernel@vger.kernel.org, johnstul@us.ibm.com,
	mannthey@us.ibm.com
Subject: Re: userspace irq balancer
Date: Wed, 21 May 2003 19:04:43 -0700
Message-ID: <20030522020443.GN2444@holomorphy.com>
In-Reply-To: <E19Idxq-0001LD-00@w-gerrit2>

On Wed, May 21, 2003 at 05:29:45PM -0700, Gerrit Huizenga wrote:
> Yeah, I suppose this userland policy change means we should pull
> the scheduler policy decisions out of the kernel and write user level
> HT, NUMA, SMP and UP schedulers.  Also, the IO schedulers should
> probably be pulled out - I'm sure AS and CFQ and linus_scheduler
> could be user land policies, as well as the elevator.  Memory
> placement and swapping policies, too.
> Oh, wait, some people actually do this - they call it, what,
> Workload Management or some such thing.  But I don't know any
> style of workload management that leaves *no* default, semi-sane
> policy in the kernel.

This is not the case. Interrupt arbitration for sane things generally
balances interrupt load automatically in-hardware. AIUI the TPR was
intended to enable the hardware to do such a thing for xAPIC. Linux
doesn't use the TPR now, which results in decisions made by the
hardware on xAPIC-based SMP systems that are highly detrimental to
performance.

Allowing userspace to exploit more specific knowledge and perform
either static or userspace-controlled dynamic interrupt affinity
is not equivalent to having an insane default policy in-kernel.
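
As a rough illustration of the static case, here is a minimal sketch
of how a userspace tool might pin an interrupt to a set of CPUs. It
assumes the Linux /proc/irq/<irq>/smp_affinity file, which takes a
hexadecimal CPU bitmask; that interface is not named in this message,
and the function names and the sub-32-CPU limit are hypothetical
choices made to keep the example short.

```python
def affinity_mask(cpus):
    """Return the hex bitmask string selecting the given CPU numbers.

    Assumption: fewer than 32 CPUs, so the mask fits in a single
    32-bit group and needs no comma-separated chunks.
    """
    if not cpus:
        raise ValueError("need at least one CPU")
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only handles CPUs 0..31")
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq, cpus, proc_root="/proc"):
    """Write the mask for one IRQ (hypothetical helper).

    Needs root and a kernel that exposes the smp_affinity file.
    """
    path = f"{proc_root}/irq/{irq}/smp_affinity"
    with open(path, "w") as f:
        f.write(affinity_mask(cpus) + "\n")
```

Calling set_irq_affinity(19, [2, 3]) would write the mask "c" for IRQ
19; the IRQ number is made up. A dynamic userspace balancer would do
the same write repeatedly, driven by whatever policy the administrator
wants.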

The task scheduler, the io scheduler, and memory entitlement policies
are very different issues. They deal entirely with managing software
constructs and resource allocation. Memory placement policies sit at
least two or three levels above anything hardware memory management
can do and it is safe to say it's infeasible to implement NUMA memory
placement policies in hardware.

Interrupt load balancing is very much doable in hardware, and prior to
xAPIC it was done that way in all cases. For xAPIC the hardware
mechanism became strictly bound to the TPR and makes less optimal
tiebreak decisions (something on the order of defaulting to the
lowest APIC ID in the event of a tie, which always occurs if the TPR
isn't frobbed). This naturally creates a problem, which these userspace
and kernel mechanisms are meant to address.
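
The tiebreak problem can be shown with a toy model. This is a sketch
under the assumption stated above (lowest TPR wins, ties go to the
lowest APIC ID), not a description of the actual chipset arbitration
logic; the function names are made up for the example.

```python
def pick_destination(tpr_by_apic_id):
    """Return the APIC ID with the lowest TPR; ties go to the lowest ID."""
    return min(tpr_by_apic_id, key=lambda apic_id: (tpr_by_apic_id[apic_id], apic_id))


def deliver(tpr_by_apic_id, count):
    """Deliver `count` interrupts and return how many each CPU received."""
    received = {apic_id: 0 for apic_id in tpr_by_apic_id}
    for _ in range(count):
        received[pick_destination(tpr_by_apic_id)] += 1
    return received


# With the TPR never updated, every CPU reports the same priority, so
# every arbitration is a tie and the lowest APIC ID takes everything.
untouched = {0: 0, 1: 0, 2: 0, 3: 0}
print(deliver(untouched, 1000))
```

In this model all 1000 interrupts land on APIC ID 0, which is the
behavior the balancing mechanisms are trying to avoid.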

The difficulty with the in-kernel policy is that its decisions are not
optimal for all cases, and it has implementation issues that prevent it
from being used fully generally. For example, it does not handle the
physical DESTMOD case for pre-xAPIC systems with multiple APIC buses.
That is a very simple incompleteness in what to all outward appearances
is an already large and feature-rich implementation; the kernel code
merely refrains from calling it in that case as a brute-force
workaround. Furthermore, the complexity of the decisions to be made is
inappropriate for the kernel. It needs unusual (and slow) manipulation
of hardware to be done in code that requires fast response times in
various cases and that is called at an uncontrollable rate. It has
heuristics which may be inaccurate or wrong for various cases.

IMHO Linux on Pentium IV should use the TPR in conjunction with _very_
simplistic interrupt load accounting by default and all more
sophisticated logic should be punted straight to userspace as an
administrative API.
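
To make "very simplistic interrupt load accounting" concrete, here is
one possible sketch, again as a toy model rather than kernel code. It
assumes 16 priority levels (the TPR priority class field), a decaying
per-CPU interrupt counter, and the same lowest-TPR, lowest-APIC-ID
arbitration as described above. The names, the update period, and the
halving decay are all hypothetical choices.

```python
def tpr_from_load(irq_counts, levels=16):
    """Map each CPU's recent interrupt count to a TPR-like level.

    The busiest CPU gets the highest level, so lowest-priority
    delivery steers new interrupts away from it.
    """
    busiest = max(irq_counts.values())
    if busiest == 0:
        return {cpu: 0 for cpu in irq_counts}
    return {cpu: (count * (levels - 1)) // busiest
            for cpu, count in irq_counts.items()}


def simulate(cpus, interrupts, period):
    """Deliver interrupts, refreshing the modeled TPRs every `period`."""
    total = {cpu: 0 for cpu in range(cpus)}
    recent = {cpu: 0 for cpu in range(cpus)}
    tpr = {cpu: 0 for cpu in range(cpus)}
    for n in range(interrupts):
        if n and n % period == 0:
            # Periodic update: recompute levels, then decay the history.
            tpr = tpr_from_load(recent)
            recent = {cpu: count // 2 for cpu, count in recent.items()}
        # Lowest level wins; ties go to the lowest CPU number.
        target = min(tpr, key=lambda cpu: (tpr[cpu], cpu))
        total[target] += 1
        recent[target] += 1
    return total


print(simulate(cpus=4, interrupts=400, period=10))
```

Even this crude accounting is enough in the model to move interrupts
off whichever CPU was loaded most recently, which is the point of the
suggestion: keep the in-kernel default this simple and leave anything
smarter to userspace.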

To quote chapter and verse:

IA-32 Intel Architecture Software Developer's Manual
Volume 3: System Programming Guide
Section 8.6.2.4 "Lowest Priority Delivery Mode"

"In operating systems that use the lowest-priority delivery mode but do
not update the TPR, the TPR information saved in the chipset will
potentially cause the interrupt to always be delivered to the same
processor from the logical set. This behavior is functionally backward
compatible with the P6 family processor but may result in unexpected
performance implications."

i.e. frob the fscking TPR as recommended by the APIC docs every once in
a while by default, punt anything (and everything) fancier up to
userspace, and get the code that doesn't even understand what the fsck
DESTMOD means the Hell out of the kernel and the Hell away from my
IO-APIC RTEs.


-- wli

