Attached is a patch that provides a rudimentary NUMA scheduler.
The patch does two things:

* at exec(), it assigns the task to the least loaded CPU;
* at load_balance() (actually in find_busiest_queue()), it
  prefers to take tasks from CPUs on the same node.
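
To illustrate the two policies above, here is a minimal user-space
sketch. It is not the code from the patch. The names
least_loaded_cpu(), busiest_cpu_for(), cpu_to_node(), the fixed
CPUS_PER_NODE layout, and the NODE_BIAS_PCT threshold are all
assumptions made for the example; the patch itself relies on the
topology macros and the scheduler's runqueue data instead.

```c
#include <stddef.h>

#define NR_CPUS        8
#define CPUS_PER_NODE  4    /* assumption: every node has 4 CPUs */
#define NODE_BIAS_PCT  125  /* hypothetical: remote load is divided by 1.25 */

/* Hypothetical stand-in for an architecture's CPU-to-node mapping. */
static int cpu_to_node(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/*
 * exec()-time placement: return the CPU with the fewest runnable
 * tasks.  Ties go to the lowest-numbered CPU.
 */
static int least_loaded_cpu(const int nr_running[NR_CPUS])
{
    int best = 0;

    for (int cpu = 1; cpu < NR_CPUS; cpu++)
        if (nr_running[cpu] < nr_running[best])
            best = cpu;
    return best;
}

/*
 * Balance-time selection: return the CPU whose queue this_cpu should
 * pull from, or -1 if no queue is busier than its own.  Loads on
 * other nodes are discounted by NODE_BIAS_PCT, so a same-node queue
 * wins unless a remote queue is substantially longer.
 */
static int busiest_cpu_for(int this_cpu, const int nr_running[NR_CPUS])
{
    int busiest = -1;
    int best_weight = nr_running[this_cpu] * 100; /* must beat own load */

    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        int weight;

        if (cpu == this_cpu)
            continue;
        weight = nr_running[cpu] * 100;
        if (cpu_to_node(cpu) != cpu_to_node(this_cpu))
            weight = weight * 100 / NODE_BIAS_PCT; /* discount remote */
        if (weight > best_weight) {
            best_weight = weight;
            busiest = cpu;
        }
    }
    return busiest;
}
```

With queue lengths {3, 1, 2, 5, 6, 0, 4, 4} and CPU 0 balancing,
the sketch picks CPU 3 (same node, 5 tasks) over CPU 4 (remote
node, 6 tasks), because the remote load is discounted to 4.8.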

This has been tested on the IA32-based NUMAQ platform and shows
performance gains for kernbench.  Various microbenchmarks also
show improvements, and show that processes stick to their nodes.
Profiles show that the kernbench performance improvements are
directly attributable to the increased use of local memory that
results from tasks running on the same node throughout their
lifetime.

I will be doing much more testing of this, and likely will be
tweaking some of the algorithms based upon test results.  Any
comments, suggestions, flames are welcome.

The patch applies to 2.5.40 and makes use of the new NUMA topology
API.  This scheduler change should work on other NUMA platforms
once the architecture-specific macros in topology.h are defined.

-- 

Michael Hohnbaum                      503-578-5486
hohnbaum@us.ibm.com                   T/L 775-5486