linux-kernel.vger.kernel.org archive mirror
* CPU affinity & IPI latency
@ 2001-07-12 23:40 Mike Kravetz
  2001-07-13  0:22 ` Davide Libenzi
  2001-07-15  7:42 ` Troy Benjegerdes
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2001-07-12 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, lse-tech

This discussion was started on 'lse-tech@lists.sourceforge.net'.
I'm widening the distribution in the hope of getting more input.

It started when Andi Kleen noticed that a single 'CPU Hog' task
was being bounced back and forth between the 2 CPUs on his 2-way
system.  I had seen similar behavior when running the context
switching test of LMbench.  When running lat_ctx with only two
threads on an 8 CPU system, one would 'expect' the two threads
to be confined to two of the 8 CPUs in the system.  However, what
I have observed is that the threads are effectively 'round
robined' among all the CPUs, and all the CPUs end up bearing
an equivalent amount of the CPU load.  To more easily observe
this, increase the number of 'TRIPS' in the benchmark to a really
large number.

After a little investigation, I believe this 'situation' is caused
by the latency of the reschedule IPI used by the scheduler.  Recall
that in lat_ctx all threads are in a tight loop consisting of:

pipe_read()
pipe_write()
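
For concreteness, a stripped-down two-process version of that loop might
look like the following (just a rough sketch for illustration, not the
actual lat_ctx source):

#include <stdio.h>
#include <unistd.h>

#define TRIPS 100000    /* crank this up to make the behavior easy to watch */

int main(void)
{
        int p1[2], p2[2];
        char token = 'x';

        if (pipe(p1) < 0 || pipe(p2) < 0) {
                perror("pipe");
                return 1;
        }

        if (fork() == 0) {                      /* child: read p1, write p2 */
                for (int i = 0; i < TRIPS; i++) {
                        read(p1[0], &token, 1);
                        write(p2[1], &token, 1);
                }
                _exit(0);
        }

        for (int i = 0; i < TRIPS; i++) {       /* parent: write p1, read p2 */
                write(p1[1], &token, 1);
                read(p2[0], &token, 1);
        }
        return 0;
}

Each write wakes the peer, which is where reschedule_idle() comes in below.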

Both threads 'start' on the same CPU and are sitting in pipe_read
waiting for data.  A token is written to the pipe and one thread
is awakened.  The awakened thread then immediately writes the token
back to the pipe which ultimately results in a call to reschedule_idle()
that will 'initiate' the scheduling of the other thread.  In
reschedule_idle() we can not take the 'fast path' because WE are
currently executing on the other thread's preferred CPU.  Therefore,
reschedule_idle() chooses the oldest idle CPU and sends the IPI.
However, before the IPI is received (and schedule() run) on the
remote CPU, the currently running thread calls pipe_read which
blocks and calls schedule().  Since the other task has yet to be
scheduled on the other CPU, it is scheduled to run on the current
CPU.  Both tasks continue to execute on the one CPU until such time
that an IPI induced schedule() on the other CPU hits a window where
it finds one of the tasks to schedule.  We continue in this way,
migrating the tasks to the oldest idle CPU and eventually cycling our
way through all the CPUs.

Does this explanation sound reasonable?

If so, it would then follow that booting with 'idle=poll' would
help alleviate this situation.  However, that is not the case.  With
idle=poll the CPU load is not as evenly distributed among the CPUs,
but is still distributed among all of them.

Does the behavior of the 'benchmark' mean anything?  Should one
expect tasks to stay on their preferred CPUs if possible?

Thoughts/comments?
-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* RE: CPU affinity & IPI latency
  2001-07-12 23:40 CPU affinity & IPI latency Mike Kravetz
@ 2001-07-13  0:22 ` Davide Libenzi
  2001-07-13  0:36   ` Larry McVoy
  2001-07-15  7:42 ` Troy Benjegerdes
  1 sibling, 1 reply; 26+ messages in thread
From: Davide Libenzi @ 2001-07-13  0:22 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, Andi Kleen, linux-kernel


On 12-Jul-2001 Mike Kravetz wrote:
> This discussion was started on 'lse-tech@lists.sourceforge.net'.
> I'm widening the distribution in the hope of getting more input.
> 
> It started when Andi Kleen noticed that a single 'CPU Hog' task
> was being bounced back and forth between the 2 CPUs on his 2-way
> system.  I had seen similar behavior when running the context
> switching test of LMbench.  When running lat_ctx with only two
> threads on an 8 CPU system, one would ?expect? the two threads
> to be confined to two of the 8 CPUs in the system.  However, what
> I have observed is that the threads are effectively 'round
> robined' among all the CPUs and they all end up bearing
> an equivalent amount of the CPU load.  To more easily observe
> this, increase the number of 'TRIPS' in the benchmark to a really
> large number.
> 
> After a little investigation, I believe this 'situation' is caused
> by the latency of the reschedule IPI used by the scheduler.  Recall
> that in lat_ctx all threads are in a tight loop consisting of:
> 
> pipe_read()
> pipe_write()
> 
> Both threads 'start' on the same CPU and are sitting in pipe_read
> waiting for data.  A token is written to the pipe and one thread
> is awakened.  The awakened thread, then immediately writes the token
> back to the pipe which ultimately results in a call to reschedule_idle()
> that will 'initiate' the scheduling of the other thread.  In
> reschedule_idle() we can not take the 'fast path' because WE are
> currently executing on the other thread's preferred CPU.  Therefore,
> reschedule_idle() chooses the oldest idle CPU and sends the IPI.
> However, before the IPI is received (and schedule() run) on the
> remote CPU, the currently running thread calls pipe_read which
> blocks and calls schedule().  Since the other task has yet to be
> scheduled on the other CPU, it is scheduled to run on the current
> CPU.  Both tasks continue to execute on the one CPU until such time
> that an IPI induced schedule() on the other CPU hits a window where
> it finds one of the tasks to schedule.  We continue in this way,
> migrating the tasks to the oldest idle CPU and eventually cycling our
> way through all the CPUs.
> 
> Does this explanation sound reasonable?

I would say yes.


> 
> If so, it would then follow that booting with 'idle=poll' would
> help alleviate this situation.  However, that is not the case.  With
> idle=poll the CPU load is not as evenly distributed among the CPUs,
> but is still distributed among all of them.
> 
> Does the behavior of the 'benchmark' mean anything?  Should one
> expect tasks to stay their preferred CPUs if possible?

Maybe we could have a per-CPU wake list where the rescheduled task is queued,
to be woken up by the IPI target CPU.




- Davide



* Re: CPU affinity & IPI latency
  2001-07-13  0:22 ` Davide Libenzi
@ 2001-07-13  0:36   ` Larry McVoy
  2001-07-13  2:06     ` Mark Hahn
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Larry McVoy @ 2001-07-13  0:36 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel

Be careful tuning for LMbench (says the author :-)

Especially this benchmark.  It's certainly possible to get dramatically better
SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
the benchmark is single threaded.  In other words, if we have 5 processes,
call them A, B, C, D, and E, then the benchmark is passing a token from
A to B to C to D to E and around again.  

If the amount of data/instructions needed by all 5 processes fits in the 
cache and you pin all the processes to the same CPU you'll get much 
better performance than simply letting them float.

But making the system do that naively is a bad idea.

This is a really hard area to get right but you can take a page from all
the failed process migration efforts.  In general, moving stuff is a bad
idea, it's much better to leave it where it is.  Everything scales better
if there is a process queue per CPU and the default is that you leave the
processes on the queue on which they last ran.  However, if the load average
for a queue starts going up and there is another queue with a substantially
lower load average, then and ONLY then, should you move the process.
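
As a toy illustration of that policy (made-up queue lengths and threshold,
not code from any real scheduler):

#include <stdio.h>

#define NR_CPUS 4
#define IMBALANCE_THRESHOLD 2   /* only move if the gap is at least this big */

static int runqueue_len[NR_CPUS] = { 5, 1, 2, 2 };

/* Default answer is "stay where you last ran"; pick another CPU only
 * when its queue is substantially shorter. */
static int pick_cpu(int last_cpu)
{
        int best = last_cpu;

        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                if (runqueue_len[cpu] + IMBALANCE_THRESHOLD <= runqueue_len[best])
                        best = cpu;
        return best;
}

int main(void)
{
        printf("task last on cpu0 -> cpu%d\n", pick_cpu(0)); /* moves, cpu0 is loaded */
        printf("task last on cpu2 -> cpu%d\n", pick_cpu(2)); /* stays, gap too small */
        return 0;
}

The point is simply that the default answer is "stay put"; migration only
happens when the imbalance is large.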

I think if you experiment with that you'll see that lat_ctx does well and
so do a lot of other things.

An optimization on that requires hardware support.  If you knew the number
of cache misses associated with each time slice, you could factor that in
and start moving processes that have a "too high" cache miss rate, with the
idea being that we want to keep all processes on the same CPU if we can
but if that is causing an excessive cache miss rate, it's time to move.

Another optimization is to always schedule an exec-ed process (as opposed
to a forked process) on a different CPU than its parent.  In general, when
you exec you have a clear boundary and it's good to spread those out.

All of this is based on my somewhat dated performance efforts that led to
LMbench.  I don't know of any fundamental changes that invalidate these
opinions, but I could be wrong.

This is an area in which I've done a pile of work and I'd be interested
in keeping a finger in any efforts to fix up the scheduler.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 


* Re: CPU affinity & IPI latency
  2001-07-13  0:36   ` Larry McVoy
@ 2001-07-13  2:06     ` Mark Hahn
  2001-07-13 16:41     ` Davide Libenzi
  2001-07-13 17:05     ` Mike Kravetz
  2 siblings, 0 replies; 26+ messages in thread
From: Mark Hahn @ 2001-07-13  2:06 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel

> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.

interesting.  I remember that around 2.3.40, the scheduler handled
"frequent-schedulers" (like lat_ctx) differently, based on how long 
a timeslice the proc used, relative to the estimated time to flush cache.

as I recall, it let them stay on their current CPU.  like letting 
someone with 1 item go ahead of you in a grocery store checkout ;)
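
roughly, the idea as I remember it, as a toy weight calculation (the
numbers and names here are made up for illustration; this is not the
2.3.40 code):

#include <stdio.h>

#define CACHEFLUSH_CYCLES 500000UL  /* pretend estimate of cache refill cost */
#define AFFINITY_BONUS    15

struct fake_task {
        int counter;                  /* remaining timeslice */
        int last_cpu;                 /* CPU it last ran on */
        unsigned long off_cpu_cycles; /* cycles since it was descheduled */
};

/* a frequent scheduler that just ran here is probably still cache-warm,
 * so give its old CPU a bonus while that is likely true */
static int weight(const struct fake_task *p, int this_cpu)
{
        int w = p->counter;

        if (p->last_cpu == this_cpu && p->off_cpu_cycles < CACHEFLUSH_CYCLES)
                w += AFFINITY_BONUS;
        return w;
}

int main(void)
{
        struct fake_task t = { .counter = 10, .last_cpu = 0, .off_cpu_cycles = 1000 };

        printf("weight on cpu0=%d, cpu1=%d\n", weight(&t, 0), weight(&t, 1));
        return 0;
}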

that code is mostly gone (only cacheflush_time remains);  I think it
morphed into the current migrate-to-longest-idle heuristic.

does anyone remember why the frequent-schedulers code was killed?
just because it conflated cache-affinity with timeslice?

regards, mark hahn.



* Re: CPU affinity & IPI latency
  2001-07-13  0:36   ` Larry McVoy
  2001-07-13  2:06     ` Mark Hahn
@ 2001-07-13 16:41     ` Davide Libenzi
  2001-07-13 17:31       ` Mike Kravetz
  2001-07-13 17:05     ` Mike Kravetz
  2 siblings, 1 reply; 26+ messages in thread
From: Davide Libenzi @ 2001-07-13 16:41 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel, Andi Kleen, lse-tech, Mike Kravetz


On 13-Jul-2001 Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> 
> Especially this benchmark.  It's certainly possible to get dramatically
> better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
> the benchmark is single threaded.  In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.  
> 
> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.
> 
> But making the system do that naively is a bad idea.

Agree.


> 
> This is a really hard area to get right but you can take a page from all
> the failed process migration efforts.  In general, moving stuff is a bad
> idea, it's much better to leave it where it is.  Everything scales better
> if there is a process queue per CPU and the default is that you leave the
> processes on the queue on which they last run.  However, if the load average
> for a queue starts going up and there is another queue with a substantially
> lower load average, then and ONLY then, should you move the process.

I personally think that a separate scheduler per CPU is the way to go for SMP.
I saw the one the IBM guys did, and I think the mistake there is trying
always to grab the best task to run across all CPUs.
I think this concept could be relaxed by introducing fewer dependencies between
the per-CPU schedulers.
A cheap load balancer should run, "from time to time"(tm), to move tasks when a
certain level of imbalance has been reached.
This would give each scheduler more independence and make it more scalable,
IMVHO.


> This is an area in which I've done a pile of work and I'd be interested
> in keeping a finger in any efforts to fix up the scheduler.

We've, somehow, understood it :)



- Davide



* Re: CPU affinity & IPI latency
  2001-07-13  0:36   ` Larry McVoy
  2001-07-13  2:06     ` Mark Hahn
  2001-07-13 16:41     ` Davide Libenzi
@ 2001-07-13 17:05     ` Mike Kravetz
  2001-07-13 19:51       ` David Lang
  2001-07-13 19:54       ` Chris Wedgwood
  2 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2001-07-13 17:05 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> 
> Especially this benchmark.  It's certainly possible to get dramatically better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
> the benchmark is single threaded.  In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.  
> 
> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.
> 
> But making the system do that naively is a bad idea.

I agree, and can't imagine the system ever attempting to take this
into account and leave these 5 tasks on the same CPU.

At the other extreme is my observation that 2 tasks on an 8 CPU
system are 'round robined' among all 8 CPUs.  I think having the
2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
to CPU affinity.  Actually, the scheduler does 'try' to do this.

It is clear that the behavior of lat_ctx bypasses almost all of
the scheduler's attempts at CPU affinity.  The real question is,
"How often in running 'real workloads' are the schduler's attempts
at CPU affinity bypassed?".

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: CPU affinity & IPI latency
  2001-07-13 16:41     ` Davide Libenzi
@ 2001-07-13 17:31       ` Mike Kravetz
  2001-07-13 19:17         ` Davide Libenzi
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Kravetz @ 2001-07-13 17:31 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Larry McVoy, linux-kernel, Andi Kleen, lse-tech, Mike Kravetz

On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
> 
> I personally think that a standard scheduler/cpu is the way to go for SMP.
> I saw the one IBM guys did and I think that the wrong catch there is trying
> always to grab the best task to run over all CPUs.

That was me/us.  Most of the reason for making 'global scheduling'
decisions was an attempt to maintain the same behavior as the existing
scheduler.  We are trying to see how well we can make this scheduler
scale, while maintaining this global behavior.  The thought is that if
there was ever any hope of someone adopting this scheduler, they would
be more likely to do so if it attempted to maintain existing behavior.


> I think that this concept could be relaxed introducing less chains between each
> CPU scheduler.
> A cheap load balancer should run, "time to time"(tm), to move tasks when a
> certain level of unbalancing has been reached.
> This will give each scheduler more independence and will make it more scalable,
> IMVHO.

Take a look at the 'CPU pooling & Load balancing' extensions to our
scheduler (lse.sourceforge.net/scheduling).  They allow you to divide
the system into CPU pools and keep scheduling decisions limited to
these pools.  Periodically, load balancing is performed among
the pools.  This has shown some promise on NUMA architectures (as
one would expect).  You could define pool sizes to be 1 CPU and
get the scheduling behavior you desire on SMP.

But, none of this has to do with CPU affinity issues with the
current scheduler.

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: CPU affinity & IPI latency
  2001-07-13 17:31       ` Mike Kravetz
@ 2001-07-13 19:17         ` Davide Libenzi
  2001-07-13 19:39           ` [Lse-tech] " Gerrit Huizenga
  0 siblings, 1 reply; 26+ messages in thread
From: Davide Libenzi @ 2001-07-13 19:17 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, Andi Kleen, linux-kernel, Larry McVoy


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
>> 
>> I personally think that a standard scheduler/cpu is the way to go for SMP.
>> I saw the one IBM guys did and I think that the wrong catch there is trying
>> always to grab the best task to run over all CPUs.
> 
> That was me/us.  Most of the reason for making 'global scheduling'
> decisions was an attempt to maintain the same behavior as the existing
> scheduler.  We are trying to see how well we can make this scheduler
> scale, while maintaining this global behavior.  Thought is that if
> there was ever any hope of someone adopting this scheduler, they would
> be more likely to do so if it attempted to maintain existing behavior.

The problem, IMHO, is that we're trying to extend what is correct behaviour on
the UP scheduler ( pick up the best task to run ) to SMP machines.
Global scheduling decisions should be triggered in response to load imbalance
and not at each schedule() run, otherwise we're going to introduce a common lock
that will limit the overall scalability.
My idea about the future of the scheduler is to have config options users can
choose depending upon the machine's use.
Trying to keep a unique scheduler for both UP and MP is like giving
the same answer to different problems, and the scaling factor ( of the scheduler
itself ) on SMP will never improve.
The code inside kernel/sched.c should be reorganized ( it even contains
non-scheduler code ) so that the various CONFIG_SCHED* options will not
introduce any mess inside the code ( possibly by having the code in separate
files, kernel/sched*.c ).




- Davide



* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-13 19:17         ` Davide Libenzi
@ 2001-07-13 19:39           ` Gerrit Huizenga
  2001-07-13 20:05             ` Davide Libenzi
  0 siblings, 1 reply; 26+ messages in thread
From: Gerrit Huizenga @ 2001-07-13 19:39 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel, Larry McVoy

> On Fri, 13 Jul 2001 12:17:37 PDT, Davide Libenzi wrote:
> 
> The problem, IMHO, is that we're trying to extend what is a correct
> behaviour on the UP scheduler ( pickup the best task to run ) to SMP
> machines.  Global scheduling decisions should be triggered in response
> of load unbalancing and not at each schedule() run otherwise we're
> going to introduce a common lock that will limit the overall
> scalability.  My idea about the future of the scheduler is to have a
> config options users can chose depending upon the machine use.
> 
> By trying to keep a unique scheduler for both UP and MP is like going
> to give the same answer to different problems and the scaling factor
> (of the scheduler itself) on SMP will never improve.  The code inside
> kernel/sched.c should be reorganized ( it contains even not scheduler
> code ) so that the various CONFIG_SCHED* will not introduce any messy
> inside the code ( possibly by having the code in separate files
> kernel/sched*.c ).
> 
> - Davide

In a lot of cases, UP is just a simplified, degenerate case of SMP (which
is itself often a degenerate case of NUMA).  Wouldn't it make a lot of
sense to have a single scheduler which 1) was relatively simple, 2) was as
good as the current scheduler (or better) on UP, and 3) scaled well on SMP (and
NUMA)?  I think the current lse scheduler meets all of those goals pretty
well.

Config options mean the user has to choose; I have too many important
choices to make already when building a kernel.

Others have proposed loadable scheduler modules, but the overhead doesn't
seem to justify the separation.  Config options mean more testing, more
stable APIs for low level scheduling (or more times when one or the other
is broken).

gerrit


* Re: CPU affinity & IPI latency
  2001-07-13 17:05     ` Mike Kravetz
@ 2001-07-13 19:51       ` David Lang
  2001-07-13 22:43         ` Mike Kravetz
  2001-07-13 19:54       ` Chris Wedgwood
  1 sibling, 1 reply; 26+ messages in thread
From: David Lang @ 2001-07-13 19:51 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

A real-world example of this issue.

I was gzipping a large (~800MB) file on a dual Athlon box.  The gzip process
was bouncing back and forth between the two CPUs.  I actually was able to
gzip faster by starting up setiathome to keep one CPU busy and force the
scheduler to keep the gzip on a single CPU (I ran things several times to
verify it was actually faster).

David Lang


 On Fri, 13 Jul 2001, Mike Kravetz wrote:

> Date: Fri, 13 Jul 2001 10:05:21 -0700
> From: Mike Kravetz <mkravetz@sequent.com>
> To: Larry McVoy <lm@bitmover.com>
> Cc: Davide Libenzi <davidel@xmailserver.org>, lse-tech@lists.sourceforge.net,
>      Andi Kleen <ak@suse.de>, linux-kernel@vger.kernel.org
> Subject: Re: CPU affinity & IPI latency
>
> On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> > Be careful tuning for LMbench (says the author :-)
> >
> > Especially this benchmark.  It's certainly possible to get dramatically better
> > SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> > the benchmark is single threaded.  In other words, if we have 5 processes,
> > call them A, B, C, D, and E, then the benchmark is passing a token from
> > A to B to C to D to E and around again.
> >
> > If the amount of data/instructions needed by all 5 processes fits in the
> > cache and you pin all the processes to the same CPU you'll get much
> > better performance than simply letting them float.
> >
> > But making the system do that naively is a bad idea.
>
> I agree, and can't imagine the system ever attempting to take this
> into account and leave these 5 tasks on the same CPU.
>
> At the other extreme is my observation that 2 tasks on an 8 CPU
> system are 'round robined' among all 8 CPUs.  I think having the
> 2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
> to CPU affinity.  Actually, the scheduler does 'try' to do this.
>
> It is clear that the behavior of lat_ctx bypasses almost all of
> the scheduler's attempts at CPU affinity.  The real question is,
> "How often in running 'real workloads' are the schduler's attempts
> at CPU affinity bypassed?".
>
> --
> Mike Kravetz                                 mkravetz@sequent.com
> IBM Linux Technology Center


* Re: CPU affinity & IPI latency
  2001-07-13 17:05     ` Mike Kravetz
  2001-07-13 19:51       ` David Lang
@ 2001-07-13 19:54       ` Chris Wedgwood
  1 sibling, 0 replies; 26+ messages in thread
From: Chris Wedgwood @ 2001-07-13 19:54 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Fri, Jul 13, 2001 at 10:05:21AM -0700, Mike Kravetz wrote:

    It is clear that the behavior of lat_ctx bypasses almost all of
    the scheduler's attempts at CPU affinity.  The real question is,
    "How often in running 'real workloads' are the schduler's attempts
    at CPU affinity bypassed?".

When encoding mp3s on a dual processor system, naturally I try to
encode two at once.

Most of the time this works as expected: one process more or less
sticks to each CPU.  However, I have noticed that if, for some reason,
a third process has to be scheduled (which is inevitable if you
actually want to do anything), then these two processes seem to bounce
back and forth for a few seconds, _even_ after this CPU spike has
gone.

Now, I'm not sure if I'm imagining this or not.  As I said, I have two
CPUs and two CPU-bound tasks, so to instrument this at all I really have
to affect what I am looking at (rather like the uncertainty principle),
so I assumed that perhaps my programs to read and process
/proc/<n>/cpu were simply eating several cycles and each process was
yielding more or less at random, causing what I was seeing.



   --cw


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-13 19:39           ` [Lse-tech] " Gerrit Huizenga
@ 2001-07-13 20:05             ` Davide Libenzi
  0 siblings, 0 replies; 26+ messages in thread
From: Davide Libenzi @ 2001-07-13 20:05 UTC (permalink / raw)
  To: Gerrit Huizenga
  Cc: Larry McVoy, linux-kernel, Andi Kleen, lse-tech, Mike Kravetz


On 13-Jul-2001 Gerrit Huizenga wrote:
> In a lot of cases, UP is just a simplified, degenerate case of SMP (which
> is itself often a degenerate case of NUMA).  Wouldn't it make a lot of
> sense to have a single scheduler which 1) was relively simple, 2) was as
> good as the current scheduler (or better) on UP, and 3) scaled well on SMP
> (and
> NUMA)?  I think the current lse scheduler meets all of those goals pretty
> well.

It's the concept of 'good enough' that seems to have different meanings for
different people.
Personally I might even agree that the behaviour for the UP case is 'almost'
the same but, as you can see by looking at the l-k threads over the last few
years, it's pretty hard to get people to agree on the concept of 'good enough'.
A config option would leave the current scheduler for UP ( or even for lightly
loaded MP ) while introducing an option for users who suffer from scheduler
problems.


> Config options means the user has to choose, I have too many important
> choices to make already when building a kernel.

Even when I go to a restaurant I have to choose, but I still prefer that to
getting the 'menu du jour'.




- Davide



* Re: CPU affinity & IPI latency
  2001-07-13 19:51       ` David Lang
@ 2001-07-13 22:43         ` Mike Kravetz
  2001-07-15 20:02           ` Davide Libenzi
  2001-07-15 20:15           ` Andi Kleen
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2001-07-13 22:43 UTC (permalink / raw)
  To: David Lang
  Cc: Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
> 
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
> was bouncing back and forth between the two CPUs. I actually was able to
> gzip faster by starting up setiathome to keep one CPU busy and force the
> scheduler to keep the gzip on a single CPU (I ran things several times to
> verify it was actually faster)
> 
> David Lang

That does sound like the same behavior I was seeing with lat_ctx.  Like
I said in my previous note, the scheduler does try to take CPU affinity
into account.  reschedule_idle() does a pretty good job of determining
what CPU a task should run on.  In the case of lat_ctx (and I believe
your use of gzip), the 'fast path' in reschedule_idle() is taken because
the CPU associated with the awakened task is idle.  However, before
schedule() is run on the 'target' CPU, schedule() is run on another
CPU and the task is scheduled there.

The root cause of this situation is the delay between the time
reschedule_idle() determines what is the best CPU a task should run
on, and the time schedule() is actually run on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU
for the duration of the delay between reschedule_idle() and schedule().
It would go something like this (a rough sketch in code follows the list):

- Define a new field in the task structure 'saved_cpus_allowed'.
  With a little collapsing of existing fields, there is room to put
  this on the same cache line as 'cpus_allowed'.
- In reschedule_idle() if we determine that the best CPU for a task
  is the CPU it is associated with (p->processor), then temporarily
  bind the task to that CPU.  The task is temporarily bound to the
  CPU by overwriting the 'cpus_allowed' field such that the task can
  only be scheduled on the target CPU.  Of course, the original
  value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue
  will restore the value of 'cpus_allowed'.
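
Sketched as a user-space model (hypothetical types and field handling, just
to make the steps above concrete; this is not actual 2.4 kernel code):

#include <stdio.h>

struct task {
        unsigned long cpus_allowed;        /* bitmask of allowed CPUs */
        unsigned long saved_cpus_allowed;  /* 0 means "no temporary binding" */
        int processor;                     /* CPU the task last ran on */
};

/* step 2: called from reschedule_idle() when p->processor is the best CPU */
static void bind_temporarily(struct task *p, int target_cpu)
{
        p->saved_cpus_allowed = p->cpus_allowed;
        p->cpus_allowed = 1UL << target_cpu;  /* only the target may take it */
}

/* step 3: called from schedule()'s runqueue scan on the target CPU */
static void unbind_if_needed(struct task *p)
{
        if (p->saved_cpus_allowed) {
                p->cpus_allowed = p->saved_cpus_allowed;
                p->saved_cpus_allowed = 0;
        }
}

int main(void)
{
        struct task t = { .cpus_allowed = 0xff, .saved_cpus_allowed = 0, .processor = 3 };

        bind_temporarily(&t, t.processor);
        printf("bound:    cpus_allowed=%#lx\n", t.cpus_allowed);
        unbind_if_needed(&t);
        printf("restored: cpus_allowed=%#lx\n", t.cpus_allowed);
        return 0;
}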

This would preserve the 'best CPU' decision made by reschedule_idle().
On the down side, 'temporarily bound' tasks could not be scheduled
until schedule() is run on their associated CPUs.  This could potentially
waste CPU cycles, and delay context switches.  In addition, it is
quite possible that while a task is 'temporarily bound' the state of
the system could change in such a way that the best CPU is no longer
best.

There appears to be a classic tradeoff between CPU affinity and
context switch time.

Comments?

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: CPU affinity & IPI latency
  2001-07-12 23:40 CPU affinity & IPI latency Mike Kravetz
  2001-07-13  0:22 ` Davide Libenzi
@ 2001-07-15  7:42 ` Troy Benjegerdes
  2001-07-15  9:05   ` [Lse-tech] " Andi Kleen
  1 sibling, 1 reply; 26+ messages in thread
From: Troy Benjegerdes @ 2001-07-15  7:42 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel, Andi Kleen, lse-tech

On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> This discussion was started on 'lse-tech@lists.sourceforge.net'.
> I'm widening the distribution in the hope of getting more input.
> 
> It started when Andi Kleen noticed that a single 'CPU Hog' task
> was being bounced back and forth between the 2 CPUs on his 2-way
> system.  I had seen similar behavior when running the context
> switching test of LMbench.  When running lat_ctx with only two
> threads on an 8 CPU system, one would ?expect? the two threads
> to be confined to two of the 8 CPUs in the system.  However, what
> I have observed is that the threads are effectively 'round
> robined' among all the CPUs and they all end up bearing
> an equivalent amount of the CPU load.  To more easily observe
> this, increase the number of 'TRIPS' in the benchmark to a really
> large number.

Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

I'm wondering if the interrupt distribution has anything to do with this.. 
are you using any CPU affinity for interrupts? If not, this might explain 
why the processes wind up doing a 'round robin'.

I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering
if this will be scewed any because all interrupts are directed at cpu0, so
anything that generates input or output on the network or serial is going
to tend to wind up on cpu0.

I'd also like to figure out what the IPI latency actually is.  Does anyone
have any suggestions on how to measure this?  I would hope it's lower on the
dual 7410 machine I have, since it has 4-stage pipelines as opposed to the
20-odd stages that Pentium III class machines have.

> After a little investigation, I believe this 'situation' is caused
> by the latency of the reschedule IPI used by the scheduler.  Recall
> that in lat_ctx all threads are in a tight loop consisting of:
> 
> pipe_read()
> pipe_write()
> 
> Both threads 'start' on the same CPU and are sitting in pipe_read
> waiting for data.  A token is written to the pipe and one thread
> is awakened.  The awakened thread, then immediately writes the token
> back to the pipe which ultimately results in a call to reschedule_idle()
> that will 'initiate' the scheduling of the other thread.  In
> reschedule_idle() we can not take the 'fast path' because WE are
> currently executing on the other thread's preferred CPU.  Therefore,
> reschedule_idle() chooses the oldest idle CPU and sends the IPI.
> However, before the IPI is received (and schedule() run) on the
> remote CPU, the currently running thread calls pipe_read which
> blocks and calls schedule().  Since the other task has yet to be
> scheduled on the other CPU, it is scheduled to run on the current
> CPU.  Both tasks continue to execute on the one CPU until such time
> that an IPI induced schedule() on the other CPU hits a window where
> it finds one of the tasks to schedule.  We continue in this way,
> migrating the tasks to the oldest idle CPU and eventually cycling our
> way through all the CPUs.
> 
> Does this explanation sound reasonable?
> 
> If so, it would then follow that booting with 'idle=poll' would
> help alleviate this situation.  However, that is not the case.  With
> idle=poll the CPU load is not as evenly distributed among the CPUs,
> but is still distributed among all of them.
> 
> Does the behavior of the 'benchmark' mean anything?  Should one
> expect tasks to stay their preferred CPUs if possible?
> 
> Thoughts/comments
> -- 
> Mike Kravetz                                 mkravetz@sequent.com
> IBM Linux Technology Center

-- 
Troy Benjegerdes | master of mispeeling | 'da hozer' |  hozer@drgw.net
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's 
why I draw cartoons. It's my life." -- Charles Shulz


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15  7:42 ` Troy Benjegerdes
@ 2001-07-15  9:05   ` Andi Kleen
  2001-07-15 17:00     ` Troy Benjegerdes
  0 siblings, 1 reply; 26+ messages in thread
From: Andi Kleen @ 2001-07-15  9:05 UTC (permalink / raw)
  To: Troy Benjegerdes; +Cc: Mike Kravetz, linux-kernel, Andi Kleen, lse-tech

On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> > This discussion was started on 'lse-tech@lists.sourceforge.net'.
> > I'm widening the distribution in the hope of getting more input.
> > 
> > It started when Andi Kleen noticed that a single 'CPU Hog' task
> > was being bounced back and forth between the 2 CPUs on his 2-way
> > system.  I had seen similar behavior when running the context
> > switching test of LMbench.  When running lat_ctx with only two
> > threads on an 8 CPU system, one would ?expect? the two threads
> > to be confined to two of the 8 CPUs in the system.  However, what
> > I have observed is that the threads are effectively 'round
> > robined' among all the CPUs and they all end up bearing
> > an equivalent amount of the CPU load.  To more easily observe
> > this, increase the number of 'TRIPS' in the benchmark to a really
> > large number.
> 
> Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

Yes; it did some IO, but it spent most of its time doing CPU work
(it was CPU bound).

> 
> I'm wondering if the interrupt distribution has anything to do with this.. 
> are you using any CPU affinity for interrupts? If not, this might explain 
> why the processes wind up doing a 'round robin'.

It was a normal Intel SMP box with no special settings, and the interrupts
should have been evenly distributed (I did not check at the time, but
normally they are about evenly distributed over the two CPUs).


> 
> I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering
> if this will be scewed any because all interrupts are directed at cpu0, so
> anything that generates input or output on the network or serial is going
> to tend to wind up on cpu0.
> 
> I'd also like to figure out what th e IPI latency actually is.. Does anyone
> have any suggestions how to measure this? I would hope it's lower on the
> dual 7410 machine I have since I have 4-stage pipelines as opposed to 20
> odd stages that Pentium III class machines do..

I'm not sure about the Mac, but on x86 Linux boxes the SMP bootup synchronizes
the time stamp counters of the CPUs.  If they are synchronized on your box
as well, you could store the TSC value in the IPI sender and compute the
average latency in the receiver.
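
A rough sketch of what that instrumentation could look like, written here as
user-space x86 helpers (in the kernel the first store would go in the IPI
send path and the delta computation in the reschedule-IPI handler; the
variable names are made up):

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

/* written by the sending CPU just before it sends the reschedule IPI */
static volatile uint64_t ipi_sent_tsc;

/* called by the receiving CPU when the IPI arrives */
static void account_ipi_latency(uint64_t *total_cycles, uint64_t *count)
{
        *total_cycles += rdtsc() - ipi_sent_tsc;
        (*count)++;
}

int main(void)
{
        uint64_t total = 0, count = 0;

        ipi_sent_tsc = rdtsc();              /* stand-in for the sender side   */
        account_ipi_latency(&total, &count); /* stand-in for the receiver side */
        printf("average latency: %llu cycles\n",
               (unsigned long long)(total / count));
        return 0;
}

This of course only gives meaningful numbers if the TSCs really are
synchronized, as described above.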


-Andi


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15  9:05   ` [Lse-tech] " Andi Kleen
@ 2001-07-15 17:00     ` Troy Benjegerdes
  2001-07-16  0:58       ` Mike Kravetz
  0 siblings, 1 reply; 26+ messages in thread
From: Troy Benjegerdes @ 2001-07-15 17:00 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Mike Kravetz, linux-kernel, lse-tech

On Sun, Jul 15, 2001 at 11:05:43AM +0200, Andi Kleen wrote:
> On Sun, Jul 15, 2001 at 02:42:55AM -0500, Troy Benjegerdes wrote:
> > On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> > > This discussion was started on 'lse-tech@lists.sourceforge.net'.
> > > I'm widening the distribution in the hope of getting more input.
> > > 
> > > It started when Andi Kleen noticed that a single 'CPU Hog' task
> > > was being bounced back and forth between the 2 CPUs on his 2-way
> > > system.  I had seen similar behavior when running the context
> > > switching test of LMbench.  When running lat_ctx with only two
> > > threads on an 8 CPU system, one would ?expect? the two threads
> > > to be confined to two of the 8 CPUs in the system.  However, what
> > > I have observed is that the threads are effectively 'round
> > > robined' among all the CPUs and they all end up bearing
> > > an equivalent amount of the CPU load.  To more easily observe
> > > this, increase the number of 'TRIPS' in the benchmark to a really
> > > large number.
> > 
> > Did this 'CPU Hog' task do any I/O that might have caused an interrupt?
> 
> Yes; it did some IO, but most of its time was doing CPU work
> (it was CPU bound)

I shouldn't have sent that first email out so quickly.  About 5 minutes
after I sent it, I saw gzip switching CPUs a lot in top.

The next very interesting thing I found was that if I run something like
'while true; do true; done' to load the other CPU while gzip is running,
gzip runs faster.  The time change isn't very large, and appears to be just
barely detectable above the noise, but it definitely appears that gzip is
bouncing back and forth between CPUs if both are idle.

I'm tempted to try the somewhat brute-force approach of increasing
PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc), to something
like 100.  Would this be an adequate 'short-term' solution until there is
some sort of multi-queue scheduler that everyone (and Linus) likes?  What are
the drawbacks of increasing PROC_CHANGE_PENALTY?

> I'm not sure about the Mac, but on x86 linux boxes the SMP bootup synchronizes
> the time stamp counters of the CPUs. If they are synchronized on your box
> also you could store TSC value in the IPI sender and compute the average
> latency in the receiver. 

The PPC has a 'timebase register' which is effectively the same thing as 
the TSC, except it counts processor bus ticks/4, not cpu cycles.

-- 
Troy Benjegerdes | master of mispeeling | 'da hozer' |  hozer@drgw.net
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's 
why I draw cartoons. It's my life." -- Charles Shulz


* Re: CPU affinity & IPI latency
  2001-07-13 22:43         ` Mike Kravetz
@ 2001-07-15 20:02           ` Davide Libenzi
  2001-07-15 20:10             ` [Lse-tech] " Andi Kleen
  2001-07-15 20:15           ` Andi Kleen
  1 sibling, 1 reply; 26+ messages in thread
From: Davide Libenzi @ 2001-07-15 20:02 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel, Andi Kleen, lse-tech, Larry McVoy, David Lang


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>> 
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
>> was bouncing back and forth between the two CPUs. I actually was able to
>> gzip faster by starting up setiathome to keep one CPU busy and force the
>> scheduler to keep the gzip on a single CPU (I ran things several times to
>> verify it was actually faster)
>> 
>> David Lang
> 
> That does sound like the same behavior I was seeing with lat_ctx.  Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account.  reschedule_idle() does a pretty good job of determining
> what CPU a task should run on.  In the case of lat_ctx (and I believe
> your use of gzip), the 'fast path' in reschedule_idle() is taken because
> the CPU associated with the awakened task is idle.  However, before
> schedule() is run on the 'target' CPU, schedule() is run on another
> CPU and the task is scheduled there.
> 
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually ran on that CPU.
> 
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
> 
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU.  The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU.  Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.
> 
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs.  This could potentially
> waste CPU cycles, and delay context switches.  In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
> 
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem with the current scheduler is that it acts like an infinite feedback
system.
When we decide whether to move a task, we analyze the status at the
current time without taking into account the system status at previous time
values.
But the feedback we send ( the IPI to move the task ) takes a finite time to hit
the target CPU and, meanwhile, the system status changes.
So we're applying a feedback calculated at time T0 to a time Tn, and this
will result in system auto-oscillation that we perceive as tasks bouncing
between CPUs.
This is kind of an electronics example, but it applies to all feedback systems.
The solution to this problem, given that we can't have a zero feedback delivery
time, is:

1) lower the feedback amount, that is, try to minimize task movements

2) a low pass filter, that is, when we decide the fate ( move ) of a task,
        we weight the current system status with the one it had at previous
        time values ( a small sketch follows )
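
For example, point 2) could be something like an exponentially decayed
per-CPU load (the constants here are made up; it's only meant to show the
filtering idea):

#include <stdio.h>

#define NR_CPUS 2

/* fixed-point smoothed load: 1/8 new sample + 7/8 history */
static unsigned long smoothed_load[NR_CPUS];

static void update_load(int cpu, unsigned long runqueue_len)
{
        smoothed_load[cpu] = (7 * smoothed_load[cpu] + (runqueue_len << 8)) / 8;
}

/* migrate only if the *filtered* imbalance is large enough */
static int should_migrate(int from, int to)
{
        return smoothed_load[from] > smoothed_load[to] + (2 << 8);
}

int main(void)
{
        /* a short burst on CPU 0 barely moves the filtered value ... */
        for (int t = 0; t < 3; t++) {
                update_load(0, 4);
                update_load(1, 1);
        }
        /* ... so no migration is triggered yet */
        printf("migrate 0->1? %s\n", should_migrate(0, 1) ? "yes" : "no");
        return 0;
}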





- Davide



* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15 20:02           ` Davide Libenzi
@ 2001-07-15 20:10             ` Andi Kleen
  0 siblings, 0 replies; 26+ messages in thread
From: Andi Kleen @ 2001-07-15 20:10 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, linux-kernel, Andi Kleen, lse-tech, Larry McVoy,
	David Lang

On Sun, Jul 15, 2001 at 01:02:21PM -0700, Davide Libenzi wrote:
> The problem of the current scheduler is that it acts like an infinite feedback
> system.
> When we're going to decide if we've to move a task we analyze the status at the
> current time without taking in account the system status at previous time
> values.
> But the feedback we send ( IPI to move the task ) take a finite time to hit the
> target CPU and, meanwhile, the system status changes.
> So we're going to apply a feedback calculated in time T0 to a time Tn, and this
> will result in system auto-oscillation that we perceive as tasks bouncing
> between CPUs.
> This is kind of electronic example but it applies to all feedback systems.
> The solution to this problem, given that we can't have a zero feedback delivery
> time, is :
> 
> 1) lower the feedback amount, that means, try to minimize task movements
> 
> 2) a low pass filter, that means, when we're going to decide the sort ( move )
>         of a task, we've to weight the system status with the one that it had
>         at previous time values

Nice analysis.  I think Mike's proposal of the 'saved_cpus_allowed' field
to temporarily bind the task to the target CPU would act as just such a low
pass filter.

-Andi


* Re: CPU affinity & IPI latency
  2001-07-13 22:43         ` Mike Kravetz
  2001-07-15 20:02           ` Davide Libenzi
@ 2001-07-15 20:15           ` Andi Kleen
  2001-07-15 20:31             ` Davide Libenzi
  2001-07-16 15:46             ` [Lse-tech] " Mike Kravetz
  1 sibling, 2 replies; 26+ messages in thread
From: Andi Kleen @ 2001-07-15 20:15 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Lang, Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen,
	linux-kernel

On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU.  The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU.  Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.

This sounds racy, at least with your simple description.  Even other CPUs
could walk their run queue while the IPI is being processed and reset the
cpus_allowed field too early.  I guess the check in the schedule() loop that
restores cpus_allowed would need to verify that the current CPU is the only
one set in that task's cpus_allowed.  This unfortunately would hang the
process if the CPU for some reason cannot run schedule() in a timely manner,
so it may be necessary to also add a timestamp to the task_struct that allows
the restoration of cpus_allowed even from non-target CPUs after some
time.


-Andi


* Re: CPU affinity & IPI latency
  2001-07-15 20:15           ` Andi Kleen
@ 2001-07-15 20:31             ` Davide Libenzi
  2001-07-16 15:46             ` [Lse-tech] " Mike Kravetz
  1 sibling, 0 replies; 26+ messages in thread
From: Davide Libenzi @ 2001-07-15 20:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, lse-tech, Larry McVoy, David Lang, Mike Kravetz


On 15-Jul-2001 Andi Kleen wrote:
> On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
>> - Define a new field in the task structure 'saved_cpus_allowed'.
>>   With a little collapsing of existing fields, there is room to put
>>   this on the same cache line as 'cpus_allowed'.
>> - In reschedule_idle() if we determine that the best CPU for a task
>>   is the CPU it is associated with (p->processor), then temporarily
>>   bind the task to that CPU.  The task is temporarily bound to the
>>   CPU by overwriting the 'cpus_allowed' field such that the task can
>>   only be scheduled on the target CPU.  Of course, the original
>>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
>> - In schedule(), the loop which examines all tasks on the runqueue
>>   will restore the value of 'cpus_allowed'.
> 
> This sounds racy, at least with your simple description. Even other CPUs
> could walk their run queue while the IPI is being processed and reset the
> cpus_allowed flag too early. I guess it would need a check in the 
> schedule loop that restores cpus_allowed that the current CPU is the only
> on set in that task's cpus_allowed. This unfortunately would hang the
> process if the CPU for some reason cannot process schedules timely, so
> it may be needed to add a timestamp also to the task_struct that allows 
> the restoration of cpus_allowed even from non target CPUs after some
> time.

I previously expressed ( maybe in this thread, I don't remember ) my idea of
not touching ( hacking ) the current scheduler to make it fit SMP, which IMHO
has different problems and needs different answers.
Anyway, as I said a couple of messages ago, the solution to this problem could
be to move the migrated task onto a private ( per CPU ) wakeup list that is run
when the idle CPU comes up.





- Davide



* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15 17:00     ` Troy Benjegerdes
@ 2001-07-16  0:58       ` Mike Kravetz
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2001-07-16  0:58 UTC (permalink / raw)
  To: Troy Benjegerdes; +Cc: Andi Kleen, Mike Kravetz, linux-kernel, lse-tech

On Sun, Jul 15, 2001 at 12:00:38PM -0500, Troy Benjegerdes wrote:
> 
> The next very interesting thing I found was that if I run something like 
> 'while true; do true; done' to load the other CPU while gzip is running, 
> gzip runs faster. The time change isn't very large, and appears to be just 
> barely detectable above the noise, but it definetly appears that gzip is 
> bouncing back and forth between cpus if both are idle.
> 
> I'm tempted to try the somewhat brute-force approach of increasing
> PROC_CHANGE_PENALTY, which is currently set to 20 (on ppc) to something
> like 100. Would this be an adequate 'short-term' solution util there is 
> some sort of multi-queue scheduler that everyone Linus likes? What are the 
> drawbacks of increasing PROC_CHANGE_PENALTY?

I'm pretty sure changing the value of PROC_CHANGE_PENALTY will not help
in this case.  PROC_CHANGE_PENALTY becomes increasingly important as
the number of running tasks is increased.  In the case of simply running
one task (like gzip) on your 2 CPU system, I don't think it will make
any difference.
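
Roughly, the penalty is just a bonus added when schedule() compares runnable
tasks, so it only matters once there is something to compare against.  A toy
model of the calculation (made-up numbers and field names, not the kernel's
goodness() code):

#include <stdio.h>

#define PROC_CHANGE_PENALTY 20

struct fake_task {
        int counter;    /* remaining timeslice */
        int processor;  /* CPU it last ran on */
};

static int weight(const struct fake_task *p, int this_cpu)
{
        int w = p->counter;

        if (p->processor == this_cpu)
                w += PROC_CHANGE_PENALTY;  /* affinity bonus for its last CPU */
        return w;
}

int main(void)
{
        struct fake_task gzip = { .counter = 6, .processor = 0 };
        struct fake_task hog  = { .counter = 6, .processor = 1 };

        /* gzip alone: whichever CPU runs schedule() picks it, penalty or not */
        printf("gzip alone, seen from cpu1: weight=%d\n", weight(&gzip, 1));

        /* with a competitor, the bonus favors keeping each task on its own CPU */
        printf("cpu1 compares gzip=%d vs hog=%d\n",
               weight(&gzip, 1), weight(&hog, 1));
        return 0;
}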

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-15 20:15           ` Andi Kleen
  2001-07-15 20:31             ` Davide Libenzi
@ 2001-07-16 15:46             ` Mike Kravetz
  1 sibling, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2001-07-16 15:46 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Mike Kravetz, David Lang, Larry McVoy, Davide Libenzi, lse-tech,
	linux-kernel

On Sun, Jul 15, 2001 at 10:15:42PM +0200, Andi Kleen wrote:
> On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> > - Define a new field in the task structure 'saved_cpus_allowed'.
> >   With a little collapsing of existing fields, there is room to put
> >   this on the same cache line as 'cpus_allowed'.
> > - In reschedule_idle() if we determine that the best CPU for a task
> >   is the CPU it is associated with (p->processor), then temporarily
> >   bind the task to that CPU.  The task is temporarily bound to the
> >   CPU by overwriting the 'cpus_allowed' field such that the task can
> >   only be scheduled on the target CPU.  Of course, the original
> >   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> > - In schedule(), the loop which examines all tasks on the runqueue
> >   will restore the value of 'cpus_allowed'.
> 
> This sounds racy, at least with your simple description. Even other CPUs
> could walk their run queue while the IPI is being processed and reset the
> cpus_allowed flag too early. I guess it would need a check in the 
> schedule loop that restores cpus_allowed that the current CPU is the only
> on set in that task's cpus_allowed.

You are correct.  It is trivial to allow only the 'target' CPU to reset
the cpus_allowed field.  This was my first thought.  However, as you
state below we would get into trouble if there was an extremely long
delay in that CPU running schedule.

The timestamp sounds reasonable, but I was trying to keep it as simple
as possible.

>                                     This unfortunately would hang the
> process if the CPU for some reason cannot process schedules timely, so
> it may be needed to add a timestamp also to the task_struct that allows 
> the restoration of cpus_allowed even from non target CPUs after some
> time.

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center


* Re: [Lse-tech] Re: CPU affinity & IPI latency
       [not found] <OF408C990D.63BC0397-ON85256A88.005CF33B@pok.ibm.com>
  2001-07-13 18:27 ` Hubertus Franke
@ 2001-07-17 14:20 ` Hubertus Franke
  1 sibling, 0 replies; 26+ messages in thread
From: Hubertus Franke @ 2001-07-17 14:20 UTC (permalink / raw)
  To: ak, lse-tech, linux-kernel

Enclosed is a patch for fixing the process bouncing problem, a.k.a.
"CPU affinity & IPI latency".

The patch is against 2.4.5 (all I had last night), so Andi, could you please
test it on your workload and see whether it works for you.  It works on stock
apps.

Basic principle is as follows:

When reschedule_idle(p) decides to send an IPI for a task, it sets a pointer
to <p> in the schedule_data of the target CPU.  We raise the <p->has_cpu> flag
to indicate that p should not be considered for scheduling.

In schedule(), after going to <still_running>, we check whether this call
is based on such an IPI; if so we always take this task.

To be functionally correct, we also consider the reservation in
reschedule_idle(), i.e., we first look for the reservation, then for idle,
and then cpu_curr to determine the best task.
If the decision is indeed to "preempt" the reserving task ( actually, it's not
running yet ), we simply drop the current reservation ( has_cpu=0 ) and
replace it with <p>.

-- Hubertus Franke  (frankeh@us.ibm.com)

diff -uwrbBN linux-2.4.5-van/kernel/sched.c linux-2.4.5-ca/kernel/sched.c
--- linux-2.4.5-van/kernel/sched.c      Fri Apr 20 21:26:16 2001
+++ linux-2.4.5-ca/kernel/sched.c       Tue Jul 17 07:31:10 2001
@@ -97,12 +97,14 @@
 static union {
        struct schedule_data {
                struct task_struct * curr;
+               struct task_struct * resched;
                cycles_t last_schedule;
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
-} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0}}};
+} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0,0}}};
+} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0,0}}};

 #define cpu_curr(cpu) aligned_data[(cpu)].schedule_data.curr
+#define cpu_resched(cpu) aligned_data[(cpu)].schedule_data.resched
 #define last_schedule(cpu) aligned_data[(cpu)].schedule_data.last_schedule

 struct kernel_stat kstat;
@@ -208,7 +210,7 @@
 {
 #ifdef CONFIG_SMP
        int this_cpu = smp_processor_id();
-       struct task_struct *tsk, *target_tsk;
+       struct task_struct *tsk, *target_tsk, *rtsk;
        int cpu, best_cpu, i, max_prio;
        cycles_t oldest_idle;

@@ -219,7 +221,9 @@
        best_cpu = p->processor;
        if (can_schedule(p, best_cpu)) {
                tsk = idle_task(best_cpu);
-               if (cpu_curr(best_cpu) == tsk) {
+               if ((cpu_curr(best_cpu) == tsk) &&
+                   (cpu_resched(best_cpu) == NULL))
+               {
                        int need_resched;
 send_now_idle:
                        /*
@@ -244,13 +248,24 @@
         */
        oldest_idle = (cycles_t) -1;
        target_tsk = NULL;
+       best_cpu = 0;
        max_prio = 1;

        for (i = 0; i < smp_num_cpus; i++) {
                cpu = cpu_logical_map(i);
                if (!can_schedule(p, cpu))
                        continue;
+               /* first check whether there is a resched IPI
+                * reservation for that cpu. If so consider priority
+                * of the reservation instead of current.
+                * We do not have to set the need_resched flag again
+                * for the currently running task. It must have been
+                * signalled before
+                */
+               tsk = cpu_resched(cpu);
+               if (tsk == NULL)
                tsk = cpu_curr(cpu);
+
                /*
                 * We use the first available idle CPU. This creates
                 * a priority list between idle CPUs, but this is not
@@ -268,19 +283,30 @@
                                if (prio > max_prio) {
                                        max_prio = prio;
                                        target_tsk = tsk;
+                                       best_cpu = cpu;
                                }
                        }
                }
        }
        tsk = target_tsk;
        if (tsk) {
+               rtsk = cpu_resched(best_cpu);
+               if (rtsk) {
+                       rtsk->has_cpu = 0; /* return rtsk to schedulable */
+                       tsk->has_cpu  = 1; /* can't schedule this one no more */
+                       cpu_resched(best_cpu) = tsk;
+                       return;
+               }
                if (oldest_idle != -1ULL) {
                        best_cpu = tsk->processor;
                        goto send_now_idle;
                }
                tsk->need_resched = 1;
-               if (tsk->processor != this_cpu)
-                       smp_send_reschedule(tsk->processor);
+               if (tsk->processor != this_cpu) {
+                       tsk->has_cpu  = 1;
+                       cpu_resched(best_cpu) = tsk;
+                       smp_send_reschedule(best_cpu);
+               }
        }
        return;

@@ -578,6 +604,15 @@
         */

 repeat_schedule:
+       /* we check whether we have a resched_IPI reservation:
+        * if so simply select the reserving task as next and
+        * go switch to it.
+        */
+       next = cpu_resched(this_cpu);
+       if (next) {
+               next = p;
+               goto found_next;
+       }
        /*
         * Default process to select..
         */
@@ -604,6 +639,7 @@
         * switching to the next task, save this fact in
         * sched_data.
         */
+found_next:
        sched_data->curr = next;
 #ifdef CONFIG_SMP
        next->has_cpu = 1;





^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lse-tech] Re: CPU affinity & IPI latency
  2001-07-13 20:17 Shailabh Nagar
@ 2001-07-15 20:51 ` Davide Libenzi
  0 siblings, 0 replies; 26+ messages in thread
From: Davide Libenzi @ 2001-07-15 20:51 UTC (permalink / raw)
  To: Shailabh Nagar
  Cc: Larry McVoy, linux-kernel, Andi Kleen, lse-tech, Mike Kravetz


On 13-Jul-2001 Shailabh Nagar wrote:
> That is true to an extent. It would be convenient for us as scheduler
> rewriters to have neatly differentiated classes like UP, SMP, BIG_SMP, NUMA
> etc. But it forces all other scheduler-sensitive code to think of each of
> these cases separately and is exactly the reason why #ifdef's are
> discouraged for critical kernel code like the scheduler.

Personally I hate #ifdef's inside the code more than my cat hates water, but
something like:

[sched.c]
#ifdef CONFIG_SCHED_XXX
#include "sched_xxx.c"
#else
#ifdef CONFIG_SCHED_YYY
#include "sched_yyy.c"

...

#endif


looks pretty clean to me.




- Davide


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lse-tech] Re: CPU affinity & IPI latency
@ 2001-07-13 20:17 Shailabh Nagar
  2001-07-15 20:51 ` Davide Libenzi
  0 siblings, 1 reply; 26+ messages in thread
From: Shailabh Nagar @ 2001-07-13 20:17 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel, Larry McVoy


Davide,

> Global scheduling decisions should be triggered in response of load
> unbalancing and not at each schedule() run otherwise we're going to
> introduce a common lock that will limit the overall scalability.

That's correct, though it raises the question: who will define
"load imbalance", and at what granularity?  In the load-balancing extensions
to MQ (http://lse.sourceforge.net/scheduling/LB/poolMQ.html), load balancing
is done at a frequency specified when the load-balancing module is loaded.
The parameter can be changed dynamically through a /proc interface, so we
are providing a knob for the user/sysadmin to control how much load
balancing is desired.
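
To make that concrete, one balancing pass in this scheme boils down to
something like the sketch below (illustrative only, not the actual poolMQ
code; lb_interval is the knob that would be exported via /proc, and
pool_nr_running()/steal_task() are invented helpers):

/*
 * Illustrative sketch, not the actual poolMQ code.  lb_interval is the
 * /proc-tunable interval between balancing passes; pool_nr_running()
 * and steal_task() are invented helpers.
 */
static int lb_interval  = 50;   /* ticks between balancing passes */
static int lb_threshold = 2;    /* minimum queue-length difference */

static void maybe_balance(int this_pool, int busiest_pool)
{
        static unsigned long last_balance;

        if (time_before(jiffies, last_balance + lb_interval))
                return;
        last_balance = jiffies;

        /* move a task only when the imbalance is worth the migration cost */
        if (pool_nr_running(busiest_pool) - pool_nr_running(this_pool)
                                                        >= lb_threshold)
                steal_task(this_pool, busiest_pool);
}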

> My idea about the future of the scheduler is to have a config options
> users can chose depending upon the machine use.
> By trying to keep a unique scheduler for both UP and MP is like going to
> give the same answer to different problems and the scaling factor ( of
> the scheduler itself ) on SMP will never improve.

That is true to an extent. It would be convenient for us as scheduler
rewriters to have neatly differentiated classes like UP, SMP, BIG_SMP, NUMA
etc. But it forces all other scheduler-sensitive code to think of each of
these cases separately and is exactly the reason why #ifdef's are
discouraged for critical kernel code like the scheduler.

It's certainly a challenge to provide SMP/NUMA scalability in the scheduler
(and elsewhere in the kernel) without having to resort to an #ifdef.

Shailabh




^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [Lse-tech] Re: CPU affinity & IPI latency
       [not found] <OF408C990D.63BC0397-ON85256A88.005CF33B@pok.ibm.com>
@ 2001-07-13 18:27 ` Hubertus Franke
  2001-07-17 14:20 ` Hubertus Franke
  1 sibling, 0 replies; 26+ messages in thread
From: Hubertus Franke @ 2001-07-13 18:27 UTC (permalink / raw)
  To: linux-kernel, lse-tech

> Davide Libenzi <davidel@xmailserver.org>@lists.sourceforge.net on
> 07/13/2001 12:41:44 PM
>
> Sent by:  lse-tech-admin@lists.sourceforge.net
>
> To:   Larry McVoy <lm@bitmover.com>
> cc:   linux-kernel@vger.kernel.org, Andi Kleen <ak@suse.de>,
>       lse-tech@lists.sourceforge.net, Mike Kravetz <mkravetz@sequent.com>
> Subject:  [Lse-tech] Re: CPU affinity & IPI latency
>
> On 13-Jul-2001 Larry McVoy wrote:
> > Be careful tuning for LMbench (says the author :-)
> >
> > Especially this benchmark.  It's certainly possible to get dramatically
> > better
> > SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> > the benchmark is single threaded.  In other words, if we have 5
> > processes,
> > call them A, B, C, D, and E, then the benchmark is passing a token from
> > A to B to C to D to E and around again.
> >
> > If the amount of data/instructions needed by all 5 processes fits in the
> > cache and you pin all the processes to the same CPU you'll get much
> > better performance than simply letting them float.
> >
> > But making the system do that naively is a bad idea.
>
> Agree.
>
> >
> > This is a really hard area to get right but you can take a page from all
> > the failed process migration efforts.  In general, moving stuff is a bad
> > idea, it's much better to leave it where it is.  Everything scales better
> > if there is a process queue per CPU and the default is that you leave the
> > processes on the queue on which they last run.  However, if the load
> > average
> > for a queue starts going up and there is another queue with a
> > substantially
> > lower load average, then and ONLY then, should you move the process.
>
> I personally think that a standard scheduler/cpu is the way to go for SMP.
> I saw the one IBM guys did and I think that the wrong catch there is trying
> always to grab the best task to run over all CPUs.
> I think that this concept could be relaxed introducing less chains between
> each
> CPU scheduler.
> A cheap load balancer should run, "time to time"(tm), to move tasks when a
> certain level of unbalancing has been reached.
> This will give each scheduler more independence and will make it more
> scalable,
> IMVHO.

Davide, this is exactly what we do with the LoadBalancing extensions to the
MQ scheduler.  We subdivide the CPUs into smaller sets, do the scheduling
within each set, and from "time to time" (tm) we move tasks over from set
to set in order to keep the load balanced.
The frequency and the load-balancing policy are configurable (module).
http://lse.sourceforge.net/scheduling
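
That matches the principle Larry describes above -- leave a task on the
queue it last ran on and move it only when another queue is substantially
less loaded.  Roughly (sketch only, not the actual MQ code; runqueue_len()
and IMBALANCE are invented names):

/*
 * Sketch only, not the actual MQ code.  runqueue_len() and IMBALANCE
 * are invented; smp_num_cpus and cpu_logical_map() are the usual 2.4
 * primitives from kernel/sched.c.
 */
#define IMBALANCE       2

static int pick_cpu(struct task_struct *p)
{
        int i, cpu, best = p->processor;        /* default: CPU it last ran on */

        for (i = 0; i < smp_num_cpus; i++) {
                cpu = cpu_logical_map(i);
                if (runqueue_len(cpu) + IMBALANCE < runqueue_len(best))
                        best = cpu;
        }
        return best;
}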

>
>
> > This is an area in which I've done a pile of work and I'd be interested
> > in keeping a finger in any efforts to fix up the scheduler.
>
> We've, somehow, understood it :)
>
> - Davide
>
> _______________________________________________
> Lse-tech mailing list
> Lse-tech@lists.sourceforge.net
> http://lists.sourceforge.net/lists/listinfo/lse-tech


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2001-07-17 14:22 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-07-12 23:40 CPU affinity & IPI latency Mike Kravetz
2001-07-13  0:22 ` Davide Libenzi
2001-07-13  0:36   ` Larry McVoy
2001-07-13  2:06     ` Mark Hahn
2001-07-13 16:41     ` Davide Libenzi
2001-07-13 17:31       ` Mike Kravetz
2001-07-13 19:17         ` Davide Libenzi
2001-07-13 19:39           ` [Lse-tech] " Gerrit Huizenga
2001-07-13 20:05             ` Davide Libenzi
2001-07-13 17:05     ` Mike Kravetz
2001-07-13 19:51       ` David Lang
2001-07-13 22:43         ` Mike Kravetz
2001-07-15 20:02           ` Davide Libenzi
2001-07-15 20:10             ` [Lse-tech] " Andi Kleen
2001-07-15 20:15           ` Andi Kleen
2001-07-15 20:31             ` Davide Libenzi
2001-07-16 15:46             ` [Lse-tech] " Mike Kravetz
2001-07-13 19:54       ` Chris Wedgwood
2001-07-15  7:42 ` Troy Benjegerdes
2001-07-15  9:05   ` [Lse-tech] " Andi Kleen
2001-07-15 17:00     ` Troy Benjegerdes
2001-07-16  0:58       ` Mike Kravetz
2001-07-13 20:17 Shailabh Nagar
2001-07-15 20:51 ` Davide Libenzi
     [not found] <OF408C990D.63BC0397-ON85256A88.005CF33B@pok.ibm.com>
2001-07-13 18:27 ` Hubertus Franke
2001-07-17 14:20 ` Hubertus Franke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).