linux-kernel.vger.kernel.org archive mirror
* Re: [Lse-tech] Re: CPU affinity & IPI latency
@ 2001-07-13 20:17 Shailabh Nagar
  2001-07-15 20:51 ` Davide Libenzi
  0 siblings, 1 reply; 11+ messages in thread
From: Shailabh Nagar @ 2001-07-13 20:17 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel, Larry McVoy


Davide,

> Global scheduling decisions should be triggered in response to load
> unbalancing and not at each schedule() run, otherwise we're going to
> introduce a common lock that will limit the overall scalability.

That's correct, though it begs the question: who will define
"load imbalance", and at what granularity? In the Loadbalancing extensions
to MQ (http://lse.sourceforge.net/scheduling/LB/poolMQ.html) load balancing
is done at a frequency specified at the time the load-balancing module is
loaded. The parameter can be changed dynamically through a /proc interface,
so we are providing a knob for the user/sysadmin to control how much load
balancing is desired.
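
To make the knob concrete, here is a minimal sketch of the kind of tunable
meant here (lb_interval_ms and pool_balance() are made-up names for
illustration, not the actual poolMQ code):

/* Hypothetical sketch: a rebalance period tunable at run time.
 * lb_interval_ms would be exported read/write through /proc so the
 * sysadmin can change the balancing frequency on a live system.
 */
extern void pool_balance(void);         /* made-up: the actual balancing pass */

static int lb_interval_ms = 100;        /* default; tunable via /proc */
static unsigned long last_balance;      /* jiffies at the last rebalance */

static inline void maybe_rebalance(void)
{
        if (time_after(jiffies, last_balance + (lb_interval_ms * HZ) / 1000)) {
                last_balance = jiffies;
                pool_balance();         /* shift tasks between CPU pools */
        }
}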

> My idea about the future of the scheduler is to have a config options
users can
> chose depending upon the machine use.
> By trying to keep a unique scheduler for both UP and MP is like going to
give
> the same answer to different problems and the scaling factor ( of the
scheduler
> itself ) on SMP will never improve.

That is true to an extent. It would be convenient for us as scheduler
rewriters to have neatly differentiated classes like UP, SMP, BIG_SMP, NUMA
etc. But it forces all other scheduler-sensitive code to think of each of
these cases separately and is exactly the reason why #ifdef's are
discouraged for critical kernel code like the scheduler.

It's certainly a challenge to provide SMP/NUMA scalability in the scheduler
(and elsewhere in the kernel) without having to resort to #ifdefs.

Shailabh





* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-13 20:17 [Lse-tech] Re: CPU affinity & IPI latency Shailabh Nagar
@ 2001-07-15 20:51 ` Davide Libenzi
  0 siblings, 0 replies; 11+ messages in thread
From: Davide Libenzi @ 2001-07-15 20:51 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Larry McVoy, linux-kernel, Andi Kleen, lse-tech, Mike Kravetz


On 13-Jul-2001 Shailabh Nagar wrote:
> That is true to an extent. It would be convenient for us as scheduler
> rewriters to have neatly differentiated classes like UP, SMP, BIG_SMP, NUMA
> etc. But it forces all other scheduler-sensitive code to think of each of
> these cases separately and is exactly the reason why #ifdef's are
> discouraged for critical kernel code like the scheduler.

Personally I hate #ifdef's inside the code more than my cat hates water, but
something like:

[sched.c]
#ifdef CONFIG_SCHED_XXX
#include "sched_xxx.c"
#else
#ifdef CONFIG_SCHED_YYY
#include "sched_yyy.c"

...

#endif
#endif


looks pretty clean to me.




- Davide



* Re: [Lse-tech] Re: CPU affinity & IPI latency
       [not found] <OF408C990D.63BC0397-ON85256A88.005CF33B@pok.ibm.com>
  2001-07-13 18:27 ` Hubertus Franke
@ 2001-07-17 14:20 ` Hubertus Franke
  1 sibling, 0 replies; 11+ messages in thread
From: Hubertus Franke @ 2001-07-17 14:20 UTC (permalink / raw)
  To: ak, lse-tech, linux-kernel

Enclosed is a patch fixing the process-bouncing problem, a.k.a.
"CPU affinity & IPI latency".

The patch is against 2.4.5 (all I had last night), so Andi, could you please
test it on your stuff and see whether it works for you. It works on stock apps.

Basic principle is as follows:

When reschedule_idle(p) decides to IPI a task, it sets a pointer
to <p> in schedule_data(target_cpu). We raise the <p->has_cpu> flag
to indicate that p should not be considered for scheduling elsewhere.

In schedule(), after going to <still_running>, we check whether this call
was triggered by such an IPI; if so, we always take the reserved task.

To be functionally correct, we also consider the reservation in
reschedule_idle(), i.e., we first look at the reservation, then at idle
and then at cpu_curr to determine the best task.
If the decision is indeed to "preempt" the reserving task (actually, it's not
running yet), we simply drop the current reservation (has_cpu=0) and
replace it with <p>.

-- Hubertus Franke  (frankeh@us.ibm.com)

diff -uwrbBN linux-2.4.5-van/kernel/sched.c linux-2.4.5-ca/kernel/sched.c
--- linux-2.4.5-van/kernel/sched.c      Fri Apr 20 21:26:16 2001
+++ linux-2.4.5-ca/kernel/sched.c       Tue Jul 17 07:31:10 2001
@@ -97,12 +97,14 @@
 static union {
        struct schedule_data {
                struct task_struct * curr;
+               struct task_struct * resched;
                cycles_t last_schedule;
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
-} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0}}};
+} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0,0}}};

 #define cpu_curr(cpu) aligned_data[(cpu)].schedule_data.curr
+#define cpu_resched(cpu) aligned_data[(cpu)].schedule_data.resched
 #define last_schedule(cpu) aligned_data[(cpu)].schedule_data.last_schedule

 struct kernel_stat kstat;
@@ -208,7 +210,7 @@
 {
 #ifdef CONFIG_SMP
        int this_cpu = smp_processor_id();
-       struct task_struct *tsk, *target_tsk;
+       struct task_struct *tsk, *target_tsk, *rtsk;
        int cpu, best_cpu, i, max_prio;
        cycles_t oldest_idle;

@@ -219,7 +221,9 @@
        best_cpu = p->processor;
        if (can_schedule(p, best_cpu)) {
                tsk = idle_task(best_cpu);
-               if (cpu_curr(best_cpu) == tsk) {
+               if ((cpu_curr(best_cpu) == tsk) &&
+                   (cpu_resched(best_cpu) == NULL))
+               {
                        int need_resched;
 send_now_idle:
                        /*
@@ -244,13 +248,24 @@
         */
        oldest_idle = (cycles_t) -1;
        target_tsk = NULL;
+       best_cpu = 0;
        max_prio = 1;

        for (i = 0; i < smp_num_cpus; i++) {
                cpu = cpu_logical_map(i);
                if (!can_schedule(p, cpu))
                        continue;
+               /* first check whether there is an resched IPI
+                * reservation for that cpu. If so consider priority
+                * of the reservation instead of current.
+                * We do not have to set the need_resched flag again
+                * for the currently running task. It must have been
+                * signalled before
+                */
+               tsk = cpu_resched(cpu);
+               if (tsk == NULL)
                tsk = cpu_curr(cpu);
+
                /*
                 * We use the first available idle CPU. This creates
                 * a priority list between idle CPUs, but this is not
@@ -268,19 +283,30 @@
                                if (prio > max_prio) {
                                        max_prio = prio;
                                        target_tsk = tsk;
+                                       best_cpu = cpu;
                                }
                        }
                }
        }
        tsk = target_tsk;
        if (tsk) {
+               rtsk = cpu_resched(best_cpu);
+               if (rtsk) {
+                       rtsk->has_cpu = 0; /* return rtsk to scheduable */
+                       tsk->has_cpu  = 1; /* can't schedule this one no more */
+                       cpu_resched(best_cpu) = tsk;
+                       return;
+               }
                if (oldest_idle != -1ULL) {
                        best_cpu = tsk->processor;
                        goto send_now_idle;
                }
                tsk->need_resched = 1;
-               if (tsk->processor != this_cpu)
-                       smp_send_reschedule(tsk->processor);
+               if (tsk->processor != this_cpu) {
+                       tsk->has_cpu  = 1;
+                       cpu_resched(best_cpu) = tsk;
+                       smp_send_reschedule(best_cpu);
+               }
        }
        return;

@@ -578,6 +604,15 @@
         */

 repeat_schedule:
+       /* we check whether we have a resched_IPI reservation:
+        * if so simply select the reserving task and next and
+        * go to switch to it.
+        */
+       next = cpu_resched(this_cpu);
+       if (next) {
+               next = p;
+               goto found_next;
+       }
        /*
         * Default process to select..
         */
@@ -604,6 +639,7 @@
         * switching to the next task, save this fact in
         * sched_data.
         */
+found_next:
        sched_data->curr = next;
 #ifdef CONFIG_SMP
        next->has_cpu = 1;






* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15 20:15     ` Andi Kleen
@ 2001-07-16 15:46       ` Mike Kravetz
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2001-07-16 15:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mike Kravetz, David Lang, Larry McVoy, Davide Libenzi, lse-tech,
	linux-kernel

On Sun, Jul 15, 2001 at 10:15:42PM +0200, Andi Kleen wrote:
> On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> > - Define a new field in the task structure 'saved_cpus_allowed'.
> >   With a little collapsing of existing fields, there is room to put
> >   this on the same cache line as 'cpus_allowed'.
> > - In reschedule_idle() if we determine that the best CPU for a task
> >   is the CPU it is associated with (p->processor), then temporarily
> >   bind the task to that CPU.  The task is temporarily bound to the
> >   CPU by overwriting the 'cpus_allowed' field such that the task can
> >   only be scheduled on the target CPU.  Of course, the original
> >   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> > - In schedule(), the loop which examines all tasks on the runqueue
> >   will restore the value of 'cpus_allowed'.
> 
> This sounds racy, at least with your simple description. Even other CPUs
> could walk their run queue while the IPI is being processed and reset the
> cpus_allowed flag too early. I guess it would need a check in the 
> schedule loop that restores cpus_allowed that the current CPU is the only
> one set in that task's cpus_allowed.

You are correct.  It is trivial to allow only the 'target' CPU to reset
the cpus_allowed field.  That was my first thought.  However, as you
state below, we would get into trouble if there were an extremely long
delay before that CPU ran schedule().

The timestamp sounds reasonable, but I was trying to keep it as simple
as possible.

>                                     This unfortunately would hang the
> process if the CPU for some reason cannot process schedules in a timely
> manner, so it may be necessary to also add a timestamp to the task_struct
> that allows the restoration of cpus_allowed even from non-target CPUs
> after some time.
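
Roughly, the scheme being discussed would look something like this
(saved_cpus_allowed, resched_stamp and RESCHED_TIMEOUT are illustrative
names for the sketch, not an actual patch):

/* reschedule_idle(): the best CPU is p->processor, so pin p there
 * temporarily, so no other CPU picks it up while the IPI is in flight. */
p->saved_cpus_allowed = p->cpus_allowed;
p->cpus_allowed = 1UL << p->processor;  /* only the target may run it */
p->resched_stamp = jiffies;

/* schedule(): while scanning the runqueue, restore the mask.  Only the
 * target CPU does so, unless the reservation has gone stale. */
if (p->saved_cpus_allowed &&
    (p->processor == this_cpu ||
     time_after(jiffies, p->resched_stamp + RESCHED_TIMEOUT))) {
        p->cpus_allowed = p->saved_cpus_allowed;
        p->saved_cpus_allowed = 0;
}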

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15 17:00     ` Troy Benjegerdes
@ 2001-07-16  0:58       ` Mike Kravetz
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2001-07-16  0:58 UTC (permalink / raw)
  To: Troy Benjegerdes; +Cc: Andi Kleen, Mike Kravetz, linux-kernel, lse-tech

On Sun, Jul 15, 2001 at 12:00:38PM -0500, Troy Benjegerdes wrote:
> 
> The next very interesting thing I found was that if I run something like
> 'while true; do true; done' to load the other CPU while gzip is running,
> gzip runs faster. The time change isn't very large, and appears to be just
> barely detectable above the noise, but it definitely appears that gzip is
> bouncing back and forth between cpus if both are idle.
> 
> I'm tempted to try the somewhat brute-force approach of increasing
> PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc), to something
> like 100. Would this be an adequate 'short-term' solution until there is
> some sort of multi-queue scheduler that everyone and Linus likes? What are the
> drawbacks of increasing PROC_CHANGE_PENALTY?

I'm pretty sure changing the value of PROC_CHANGE_PENALTY will not help
in this case.  PROC_CHANGE_PENALTY becomes increasingly important as
the number of running tasks is increased.  In the case of simply running
one task (like gzip) on your 2 CPU system, I don't think it will make
any difference.
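
For context, this is roughly where PROC_CHANGE_PENALTY enters the 2.4
goodness() calculation (simplified from memory; realtime and yield handling
omitted, so treat it as a sketch rather than a verbatim copy of sched.c):

static inline int goodness(struct task_struct *p, int this_cpu,
                           struct mm_struct *this_mm)
{
        int weight = p->counter;        /* remaining timeslice */

        if (!weight)
                return 0;
#ifdef CONFIG_SMP
        /* Big bonus for staying on the same CPU: another runnable task
         * only pulls the CPU away if it beats the incumbent by more than
         * PROC_CHANGE_PENALTY.  With a single runnable task there is no
         * competitor at all, so raising the penalty changes nothing. */
        if (p->processor == this_cpu)
                weight += PROC_CHANGE_PENALTY;
#endif
        if (p->mm == this_mm || !p->mm)
                weight += 1;            /* small bonus for sharing the mm */
        weight += 20 - p->nice;
        return weight;
}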

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15 20:02     ` Davide Libenzi
@ 2001-07-15 20:10       ` Andi Kleen
  0 siblings, 0 replies; 11+ messages in thread
From: Andi Kleen @ 2001-07-15 20:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, linux-kernel, Andi Kleen, lse-tech, Larry McVoy,
	David Lang

On Sun, Jul 15, 2001 at 01:02:21PM -0700, Davide Libenzi wrote:
> The problem of the current scheduler is that it acts like an infinite feedback
> system.
> When we're going to decide if we've to move a task we analyze the status at the
> current time without taking into account the system status at previous time
> values.
> But the feedback we send ( IPI to move the task ) takes a finite time to hit the
> target CPU and, meanwhile, the system status changes.
> So we're going to apply a feedback calculated at time T0 to a time Tn, and this
> will result in system auto-oscillation that we perceive as tasks bouncing
> between CPUs.
> This is the classic electronics example, but it applies to all feedback systems.
> The solution to this problem, given that we can't have a zero feedback delivery
> time, is :
> 
> 1) lower the feedback amount, that means, try to minimize task movements
> 
> 2) a low pass filter, that means, when we're going to decide the fate ( move )
>         of a task, we've to weight the system status with the one that it had
>         at previous time values

Nice analysis. I think Mike's proposal of the 'saved_cpus_allowed' field
to temporarily bind the task to the target would just act as such a
low-pass filter.
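
Purely as an illustration of point (2) above, such a low-pass filter on
per-CPU load could be as simple as an exponentially weighted moving average
(nothing like this is in the kernel today; the names are made up):

/* Keep filtered_load[] in fixed point, scaled by 2^LOAD_SHIFT, so the
 * smoothing needs no division.  Each new sample moves the filtered value
 * only 1/8th of the way toward the instantaneous runqueue length, so a
 * migration decision reacts to sustained imbalance rather than to the
 * transient state seen while an IPI is still in flight. */
#define LOAD_SHIFT 3

static unsigned long filtered_load[NR_CPUS];

static inline void update_filtered_load(int cpu, unsigned long nr_running_now)
{
        long delta = (long)(nr_running_now << LOAD_SHIFT)
                     - (long)filtered_load[cpu];

        filtered_load[cpu] += delta >> LOAD_SHIFT;
}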

-Andi


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15  9:05   ` [Lse-tech] " Andi Kleen
@ 2001-07-15 17:00     ` Troy Benjegerdes
  2001-07-16  0:58       ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Troy Benjegerdes @ 2001-07-15 17:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mike Kravetz, linux-kernel, lse-tech

On Sun, Jul 15, 2001 at 11:05:43AM +0200, Andi Kleen wrote:
> On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> > On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> > > This discussion was started on 'lse-tech@lists.sourceforge.net'.
> > > I'm widening the distribution in the hope of getting more input.
> > > 
> > > It started when Andi Kleen noticed that a single 'CPU Hog' task
> > > was being bounced back and forth between the 2 CPUs on his 2-way
> > > system.  I had seen similar behavior when running the context
> > > switching test of LMbench.  When running lat_ctx with only two
> > > threads on an 8 CPU system, one would ?expect? the two threads
> > > to be confined to two of the 8 CPUs in the system.  However, what
> > > I have observed is that the threads are effectively 'round
> > > robined' among all the CPUs and they all end up bearing
> > > an equivalent amount of the CPU load.  To more easily observe
> > > this, increase the number of 'TRIPS' in the benchmark to a really
> > > large number.
> > 
> > Did this 'CPU Hog' task do any I/O that might have caused an interrupt?
> 
> Yes; it did some IO, but most of its time was doing CPU work
> (it was CPU bound)

I shouldn't have sent that first email out so quickly. About 5 minutes
after I sent it, I saw gzip switching CPUs a lot in top.

The next very interesting thing I found was that if I run something like
'while true; do true; done' to load the other CPU while gzip is running,
gzip runs faster. The time change isn't very large, and appears to be just
barely detectable above the noise, but it definitely appears that gzip is
bouncing back and forth between cpus if both are idle.

I'm tempted to try the somewhat brute-force approach of increasing
PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc), to something
like 100. Would this be an adequate 'short-term' solution until there is
some sort of multi-queue scheduler that everyone and Linus likes? What are the
drawbacks of increasing PROC_CHANGE_PENALTY?

> I'm not sure about the Mac, but on x86 linux boxes the SMP bootup synchronizes
> the time stamp counters of the CPUs. If they are synchronized on your box
> also you could store TSC value in the IPI sender and compute the average
> latency in the receiver. 

The PPC has a 'timebase register' which is effectively the same thing as 
the TSC, except it counts processor bus ticks/4, not cpu cycles.

-- 
Troy Benjegerdes | master of mispeeling | 'da hozer' |  hozer@drgw.net
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's 
why I draw cartoons. It's my life." -- Charles Shulz


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15  7:42 ` Troy Benjegerdes
@ 2001-07-15  9:05   ` Andi Kleen
  2001-07-15 17:00     ` Troy Benjegerdes
  0 siblings, 1 reply; 11+ messages in thread
From: Andi Kleen @ 2001-07-15  9:05 UTC (permalink / raw)
  To: Troy Benjegerdes; +Cc: Mike Kravetz, linux-kernel, Andi Kleen, lse-tech

On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> > This discussion was started on 'lse-tech@lists.sourceforge.net'.
> > I'm widening the distribution in the hope of getting more input.
> > 
> > It started when Andi Kleen noticed that a single 'CPU Hog' task
> > was being bounced back and forth between the 2 CPUs on his 2-way
> > system.  I had seen similar behavior when running the context
> > switching test of LMbench.  When running lat_ctx with only two
> > threads on an 8 CPU system, one would ?expect? the two threads
> > to be confined to two of the 8 CPUs in the system.  However, what
> > I have observed is that the threads are effectively 'round
> > robined' among all the CPUs and they all end up bearing
> > an equivalent amount of the CPU load.  To more easily observe
> > this, increase the number of 'TRIPS' in the benchmark to a really
> > large number.
> 
> Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

Yes; it did some IO, but most of its time was doing CPU work
(it was CPU bound)

> 
> I'm wondering if the interrupt distribution has anything to do with this.. 
> are you using any CPU affinity for interrupts? If not, this might explain 
> why the processes wind up doing a 'round robin'.

It was a normal Intel SMP box with no special settings and the interrupts
should have been evenly distributed (I did not check it at this time, but
normally they are about evenly distributed over the two cpus)


> 
> I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering
> if this will be skewed any because all interrupts are directed at cpu0, so
> anything that generates input or output on the network or serial is going
> to tend to wind up on cpu0.
> 
> I'd also like to figure out what the IPI latency actually is.. Does anyone
> have any suggestions on how to measure this? I would hope it's lower on the
> dual 7410 machine I have since I have 4-stage pipelines as opposed to the
> 20-odd stages that Pentium III class machines do..

I'm not sure about the Mac, but on x86 Linux boxes the SMP bootup synchronizes
the time stamp counters of the CPUs. If they are synchronized on your box
too, you could store the TSC value in the IPI sender and compute the average
latency in the receiver.
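
A minimal sketch of that measurement (x86, 2.4-era interfaces; the ipi_*
variables and the two helpers are made up for the experiment):

#include <linux/smp.h>
#include <asm/msr.h>            /* rdtscll() */

static unsigned long long ipi_sent_tsc[NR_CPUS];
static unsigned long long ipi_lat_sum;
static unsigned long ipi_lat_count;

/* sender: use in place of a bare smp_send_reschedule(cpu) */
static inline void timed_send_reschedule(int cpu)
{
        rdtscll(ipi_sent_tsc[cpu]);
        smp_send_reschedule(cpu);
}

/* receiver: call from the reschedule IPI handler on the target CPU */
static inline void account_resched_ipi(void)
{
        unsigned long long now;

        rdtscll(now);
        ipi_lat_sum += now - ipi_sent_tsc[smp_processor_id()];
        ipi_lat_count++;        /* average cycles = ipi_lat_sum / ipi_lat_count */
}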


-Andi


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-13 19:39 ` [Lse-tech] " Gerrit Huizenga
@ 2001-07-13 20:05   ` Davide Libenzi
  0 siblings, 0 replies; 11+ messages in thread
From: Davide Libenzi @ 2001-07-13 20:05 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Larry McVoy, linux-kernel, Andi Kleen, lse-tech, Mike Kravetz


On 13-Jul-2001 Gerrit Huizenga wrote:
> In a lot of cases, UP is just a simplified, degenerate case of SMP (which
> is itself often a degenerate case of NUMA).  Wouldn't it make a lot of
> sense to have a single scheduler which 1) was relatively simple, 2) was as
> good as the current scheduler (or better) on UP, and 3) scaled well on SMP
> (and
> NUMA)?  I think the current lse scheduler meets all of those goals pretty
> well.

It's the concept of 'good enough' that seems to have different meanings for
different people.
Personally I could even grant that the behaviour for the UP case is 'almost'
the same but, as you can see by looking at the lk threads of the last years,
it's pretty hard to get people to agree on the concept of 'good enough'.
Config options would leave the current scheduler for UP ( or even for lightly
loaded MP ) while introducing an option for users that do suffer from scheduler
problems.


> Config options means the user has to choose, I have too many important
> choices to make already when building a kernel.

Even when I go to a restaurant I have to choose, but I still prefer that to
getting the 'menu du jour'.




- Davide



* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-13 19:17 Davide Libenzi
@ 2001-07-13 19:39 ` Gerrit Huizenga
  2001-07-13 20:05   ` Davide Libenzi
  0 siblings, 1 reply; 11+ messages in thread
From: Gerrit Huizenga @ 2001-07-13 19:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel, Larry McVoy

> On Fri, 13 Jul 2001 12:17:37 PDT, Davide Libenzi wrote:
> 
> The problem, IMHO, is that we're trying to extend what is a correct
> behaviour on the UP scheduler ( pick the best task to run ) to SMP
> machines.  Global scheduling decisions should be triggered in response
> to load unbalancing and not at each schedule() run, otherwise we're
> going to introduce a common lock that will limit the overall
> scalability.  My idea about the future of the scheduler is to have
> config options users can choose depending upon the machine use.
> 
> Trying to keep a unique scheduler for both UP and MP is like giving
> the same answer to different problems, and the scaling factor
> (of the scheduler itself) on SMP will never improve.  The code inside
> kernel/sched.c should be reorganized ( it contains even non-scheduler
> code ) so that the various CONFIG_SCHED* will not introduce any mess
> inside the code ( possibly by having the code in separate files
> kernel/sched*.c ).
> 
> - Davide

In a lot of cases, UP is just a simplified, degenerate case of SMP (which
is itself often a degenerate case of NUMA).  Wouldn't it make a lot of
sense to have a single scheduler which 1) was relatively simple, 2) was as
good as the current scheduler (or better) on UP, and 3) scaled well on SMP
(and NUMA)?  I think the current lse scheduler meets all of those goals pretty
well.

Config options mean the user has to choose; I have too many important
choices to make already when building a kernel.

Others have proposed loadable scheduler modules, but the overhead doesn't
seem to justify the separation.  Config options mean more testing, more
stable APIs for low level scheduling (or more times when one or the other
is broken).

gerrit


* Re: [Lse-tech] Re: CPU affinity & IPI latency
       [not found] <OF408C990D.63BC0397-ON85256A88.005CF33B@pok.ibm.com>
@ 2001-07-13 18:27 ` Hubertus Franke
  2001-07-17 14:20 ` Hubertus Franke
  1 sibling, 0 replies; 11+ messages in thread
From: Hubertus Franke @ 2001-07-13 18:27 UTC (permalink / raw)
  To: linux-kernel, lse-tech

> Davide Libenzi <davidel@xmailserver.org>@lists.sourceforge.net on
> 07/13/2001 12:41:44 PM
>
> Sent by:  lse-tech-admin@lists.sourceforge.net
>
> To:   Larry McVoy <lm@bitmover.com>
> cc:   linux-kernel@vger.kernel.org, Andi Kleen <ak@suse.de>,
>       lse-tech@lists.sourceforge.net, Mike Kravetz <mkravetz@sequent.com>
> Subject:  [Lse-tech] Re: CPU affinity & IPI latency
>
> On 13-Jul-2001 Larry McVoy wrote:
> > Be careful tuning for LMbench (says the author :-)
> >
> > Especially this benchmark.  It's certainly possible to get dramatically
> > better
> > SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> > the benchmark is single threaded.  In other words, if we have 5
> processes,
> > call them A, B, C, D, and E, then the benchmark is passing a token from
> > A to B to C to D to E and around again.
> >
> > If the amount of data/instructions needed by all 5 processes fits in the
> > cache and you pin all the processes to the same CPU you'll get much
> > better performance than simply letting them float.
> >
> > But making the system do that naively is a bad idea.
>
> Agree.
>
> >
> > This is a really hard area to get right but you can take a page from all
> > the failed process migration efforts.  In general, moving stuff is a bad
> > idea, it's much better to leave it where it is.  Everything scales better
> > if there is a process queue per CPU and the default is that you leave the
> > processes on the queue on which they last run.  However, if the load
> average
> > for a queue starts going up and there is another queue with a
> substantially
> > lower load average, then and ONLY then, should you move the process.
>
> I personally think that a standard scheduler per CPU is the way to go for SMP.
> I saw the one the IBM guys did and I think that the wrong catch there is trying
> always to grab the best task to run over all CPUs.
> I think that this concept could be relaxed by introducing fewer chains between
> each CPU scheduler.
> A cheap load balancer should run, "time to time"(tm), to move tasks when a
> certain level of unbalancing has been reached.
> This will give each scheduler more independence and will make it more
> scalable, IMVHO.

Davide, this is exactly what we do with the LoadBalancing extensions to the MQ
scheduler. We subdivide into smaller sets, do the scheduling in there and, from
"time to time" (tm), we move tasks over from set to set in order to ensure
load balancing.
The frequency and the load balancing are configurable (module).
http://lse.sourceforge.net/scheduling

>
>
> > This is an area in which I've done a pile of work and I'd be interested
> > in keeping a finger in any efforts to fix up the scheduler.
>
> We've, somehow, understood it :)
>
> - Davide
>




Thread overview: 11+ messages
2001-07-13 20:17 [Lse-tech] Re: CPU affinity & IPI latency Shailabh Nagar
2001-07-15 20:51 ` Davide Libenzi
     [not found] <OF408C990D.63BC0397-ON85256A88.005CF33B@pok.ibm.com>
2001-07-13 18:27 ` Hubertus Franke
2001-07-17 14:20 ` Hubertus Franke
  -- strict thread matches above, loose matches on Subject: below --
2001-07-13 19:17 Davide Libenzi
2001-07-13 19:39 ` [Lse-tech] " Gerrit Huizenga
2001-07-13 20:05   ` Davide Libenzi
2001-07-13 17:05 Mike Kravetz
2001-07-13 19:51 ` David Lang
2001-07-13 22:43   ` Mike Kravetz
2001-07-15 20:02     ` Davide Libenzi
2001-07-15 20:10       ` [Lse-tech] " Andi Kleen
2001-07-15 20:15     ` Andi Kleen
2001-07-16 15:46       ` [Lse-tech] " Mike Kravetz
2001-07-12 23:40 Mike Kravetz
2001-07-15  7:42 ` Troy Benjegerdes
2001-07-15  9:05   ` [Lse-tech] " Andi Kleen
2001-07-15 17:00     ` Troy Benjegerdes
2001-07-16  0:58       ` Mike Kravetz
