linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: CPU affinity & IPI latency
@ 2001-07-16 21:45 Hubertus Franke
  2001-07-16 22:56 ` Davide Libenzi
  0 siblings, 1 reply; 23+ messages in thread
From: Hubertus Franke @ 2001-07-16 21:45 UTC (permalink / raw)
  To: linux-kernel, lse-tech


Clean, but in this solutions, you can lock out tasks for a
several cycles, if indeed several of them where added
in the time before we can service the IPI on the target cpu.
You can end up with strange priority inversion problems,
because these tasks aren't seen by the general scheduler
or let alone are on the runqueue.
That's why I opted in my design for a single slot.

One could opt for the following implementation to ensure only
a single waiting task, namely put the already enqueued one back
if the goodness is worse.


static inline void wpush(aligned_data * ad, struct task_struct * tsk) {
       if (ad->wlist &&
          (preemption_goodness(tsk,ad->wlist,ad->curr->mm) > 0))
        {
            add_to_runqueue(ad->wlist,tsk);
            if (task_on_runqueue(tsk))
                list_del(&tsk->run_list);
            /* obsolete now tsk->wlist_next = ad->wlist; */
            ad->wlist = tsk;
        }
}

I also don't like the fact that with reschedule_idle() you
just put the task into the runqueue, then you take it out again,
just to put it back into it after the IPI and that as it seems
on every reschedule_idle() path.

In my design one simply wouldn't send the IPI to a target CPU that
has a pending IPI waiting and preemption wouldn't happen based on
the goodness values.

I grant, that my design seems a bit more intrusive on the code,
but you were argueing just yesterday not to get stuck up with
closeness to the current code and semantics.

Comments ?

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com



Davide Libenzi <davidel@xmailserver.org>@ewok.dev.mcafeelabs.com on
07/16/2001 05:25:57 PM

Sent by:  davidel@ewok.dev.mcafeelabs.com


To:   Mike Kravetz <mkravetz@sequent.com>
cc:   linux-kernel@vger.kernel.org, lse-tech@lists.sourceforge.net,
      Hubertus Franke/Watson/IBM@IBMUS
Subject:  Re: CPU affinity & IPI latency




On 16-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>>
>> Mike, could we utilize the existing mechanism such as has_cpu.
>>
>
> I like it.  Especially the way you eliminated the situation where
> we would have multiple tasks waiting for schedule.  Hope this is
> not a frequent situation!!!  The only thing I don't like is the
> use of has_cpu to prevent the task from being scheduled.  Right
> now, I can't think of any problems with it.  However, in the past
> I have been bit by using fields for purposes other than what they
> were designed for.

How about this ( draft ) :


struct task_struct {
        ...
        struct task_struct * wlist_next;
        ...
};


static union {
        struct schedule_data {
                struct task_struct * curr;
                struct task_struct * wlist;
                cycles_t last_schedule;
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0}}};


static inline struct task_struct * wpick(aligned_data * ad) {
        struct task_struct * tsk = ad->wlist;
        if (tsk) {
                ad->wlist = tsk->wlist_next;
                add_to_runqueue(tsk);
        }
        return tsk;
}

static inline void wpush(aligned_data * ad, struct task_struct * tsk) {
        if (task_on_runqueue(tsk))
                list_del(&tsk->run_list);
        tsk->wlist_next = ad->wlist;
        ad->wlist = tsk;
}


asmlinkage void schedule(void)
{
        ...
        if ((next = wpick(sched_data)))
                goto ...;
        ...
}

In reschedule_idle() when before sending the IPI we do a wpush().
We modify aligned_data->wlist and tsk->wlist_next under runqueue_lock so we
don't need another one.
A slight change is needed to reschedule_idle() to handle the new field.
Pros to this solution are 1) we are not going to give other fields a
different
meaning 2) when the idle will call schedule it'll pick the task w/o rescan.




- Davide





^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-16 21:45 CPU affinity & IPI latency Hubertus Franke
@ 2001-07-16 22:56 ` Davide Libenzi
  0 siblings, 0 replies; 23+ messages in thread
From: Davide Libenzi @ 2001-07-16 22:56 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: lse-tech, linux-kernel


On 16-Jul-2001 Hubertus Franke wrote:
> 
> Clean, but in this solutions, you can lock out tasks for a
> several cycles, if indeed several of them where added
> in the time before we can service the IPI on the target cpu.
> You can end up with strange priority inversion problems,
> because these tasks aren't seen by the general scheduler
> or let alone are on the runqueue.
> That's why I opted in my design for a single slot.

The idea of list was kind of extension, but to solve this particular problem we
just need a single task ( with no extra member in task_struct ) pointer.


> One could opt for the following implementation to ensure only
> a single waiting task, namely put the already enqueued one back
> if the goodness is worse.
> 
> 
> static inline void wpush(aligned_data * ad, struct task_struct * tsk) {
>        if (ad->wlist &&
>           (preemption_goodness(tsk,ad->wlist,ad->curr->mm) > 0))
>         {
>             add_to_runqueue(ad->wlist,tsk);
>             if (task_on_runqueue(tsk))
>                 list_del(&tsk->run_list);
>             /* obsolete now tsk->wlist_next = ad->wlist; */
>             ad->wlist = tsk;
>         }
> }

wpush() must be called only if *) the target is idle <and> *) the list ( slot )
is empty.
I won't add the goodness checking in there.
Yes, it could happen that another task with greater goodness needs a CPU to run
and, w/o goodness checking this could not be rescheduled on the idle.
But think about at what would happen if the IPI had a zero time response.
The idle would be running the old ( listed or slotted ) task and the CPU would
not be idle anymore.



> 
> I also don't like the fact that with reschedule_idle() you
> just put the task into the runqueue, then you take it out again,
> just to put it back into it after the IPI and that as it seems
> on every reschedule_idle() path.

You can "suspend" the task even by adding an extra field inside the task
struct and add checking of this field around inside the code.
The list removal does not need you to spread extra checking into the code.
It's a simple method to hide a task without modifying a bunch of code.


> In my design one simply wouldn't send the IPI to a target CPU that
> has a pending IPI waiting and preemption wouldn't happen based on
> the goodness values.

You do exactly that with this solution, you reserve the task for
the target idle.
What do You mean with "preemption wouldn't happen based on the goodness values"?



> I grant, that my design seems a bit more intrusive on the code,
> but you were argueing just yesterday not to get stuck up with
> closeness to the current code and semantics.

Pls, don't make me say this stuff.
My was not to make the code more intrusive but, on the contrary, to make it
light by relaxing, in some aspect, the SMP scheduler behaviour compatibility.
I already expressed my opinion about that in previous emails.





- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-16 16:14 ` Mike Kravetz
@ 2001-07-16 21:25   ` Davide Libenzi
  0 siblings, 0 replies; 23+ messages in thread
From: Davide Libenzi @ 2001-07-16 21:25 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel, lse-tech, Hubertus Franke


On 16-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>> 
>> Mike, could we utilize the existing mechanism such as has_cpu.
>> 
> 
> I like it.  Especially the way you eliminated the situation where
> we would have multiple tasks waiting for schedule.  Hope this is
> not a frequent situation!!!  The only thing I don't like is the
> use of has_cpu to prevent the task from being scheduled.  Right
> now, I can't think of any problems with it.  However, in the past
> I have been bit by using fields for purposes other than what they
> were designed for.

How about this ( draft ) :


struct task_struct {
        ...
        struct task_struct * wlist_next;
        ...
};


static union {
        struct schedule_data {
                struct task_struct * curr;
                struct task_struct * wlist;
                cycles_t last_schedule;
        } schedule_data;
        char __pad [SMP_CACHE_BYTES];
} aligned_data [NR_CPUS] __cacheline_aligned = { {{&init_task,0}}};


static inline struct task_struct * wpick(aligned_data * ad) {
        struct task_struct * tsk = ad->wlist;
        if (tsk) {
                ad->wlist = tsk->wlist_next;
                add_to_runqueue(tsk);
        }
        return tsk;
}

static inline void wpush(aligned_data * ad, struct task_struct * tsk) {
        if (task_on_runqueue(tsk))
                list_del(&tsk->run_list);
        tsk->wlist_next = ad->wlist;
        ad->wlist = tsk;
}


asmlinkage void schedule(void)
{
        ...
        if ((next = wpick(sched_data)))
                goto ...; 
        ...
}

In reschedule_idle() when before sending the IPI we do a wpush().
We modify aligned_data->wlist and tsk->wlist_next under runqueue_lock so we
don't need another one.
A slight change is needed to reschedule_idle() to handle the new field.
Pros to this solution are 1) we are not going to give other fields a different
meaning 2) when the idle will call schedule it'll pick the task w/o rescan.




- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
@ 2001-07-16 18:26 Hubertus Franke
  0 siblings, 0 replies; 23+ messages in thread
From: Hubertus Franke @ 2001-07-16 18:26 UTC (permalink / raw)
  To: lse-tech; +Cc: linux-kernel


Well, actually the sole purpose of <has_cpu> is
to lock a task out of a scheduling decision. Remember
during transient states, there are two tasks that have
the <has_cpu> flag set for a particular cpu.
So I think using <has_cpu> is kosher and preferred in
this situation, IMHO.

As you saw from David Levines reply, he thinks that
a list is more appropriate for more generic decision
support.
But I don't like the fact that you lock down several
tasks at that point. In my solution, you return the
task back to general schedulability.


Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com



Mike Kravetz <mkravetz@sequent.com> on 07/16/2001 12:14:46 PM

To:   Hubertus Franke/Watson/IBM@IBMUS
cc:   Mike Kravetz <mkravetz@sequent.com>, lse-tech@lists.sourceforge.net,
      linux-kernel@vger.kernel.org
Subject:  Re: CPU affinity & IPI latency



On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
>
> Mike, could we utilize the existing mechanism such as has_cpu.
>

I like it.  Especially the way you eliminated the situation where
we would have multiple tasks waiting for schedule.  Hope this is
not a frequent situation!!!  The only thing I don't like is the
use of has_cpu to prevent the task from being scheduled.  Right
now, I can't think of any problems with it.  However, in the past
I have been bit by using fields for purposes other than what they
were designed for.

--
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-16 10:10 Hubertus Franke
@ 2001-07-16 16:16 ` Davide Libenzi
  0 siblings, 0 replies; 23+ messages in thread
From: Davide Libenzi @ 2001-07-16 16:16 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: lse-tech, linux-kernel


On 16-Jul-2001 Hubertus Franke wrote:
> 
> David, you are preaching to choir.
> 
> One can not have it both ways, at least without "#ifdef"s.
> As Mike stated, we made the decision to adhere to current scheduling
> semantics
> purely for the purspose of comparision and increased adaptation chances.
> As shown with the LoadBalancing addition to MQ, there are simple ways to
> relax and completely eliminate the feedback between the queues, if one so
> desires.
> 
> As for the solutions you proposed for the "switching problem", namely the
> wakeup
> list. I don't think you want a list here. A list would basically mean that
> you
> would need to walk it and come up with a single decision again. I think
> what
> I proposed, namely a per-CPU reschedule reservation that can be overwritten
> taking
> PROC_CHANGE_PENALTY or some form of it into account, seems a better
> solution.
> Open to discussions...

No, when you're going to decide ( reschedule_idle ) which idle to spin out, you
can inspect the wake list and, based on the content of the list, one can take a
better decision about which idle to wake.
I think that a list, instead of a single task pointer, is a more open solution
that could drive to a more sophisticated choice of the CPU to stock the task to.




- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-14  3:25 Hubertus Franke
@ 2001-07-16 16:14 ` Mike Kravetz
  2001-07-16 21:25   ` Davide Libenzi
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Kravetz @ 2001-07-16 16:14 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: Mike Kravetz, lse-tech, linux-kernel

On Fri, Jul 13, 2001 at 11:25:21PM -0400, Hubertus Franke wrote:
> 
> Mike, could we utilize the existing mechanism such as has_cpu.
> 

I like it.  Especially the way you eliminated the situation where
we would have multiple tasks waiting for schedule.  Hope this is
not a frequent situation!!!  The only thing I don't like is the
use of has_cpu to prevent the task from being scheduled.  Right
now, I can't think of any problems with it.  However, in the past
I have been bit by using fields for purposes other than what they
were designed for.

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
@ 2001-07-16 10:10 Hubertus Franke
  2001-07-16 16:16 ` Davide Libenzi
  0 siblings, 1 reply; 23+ messages in thread
From: Hubertus Franke @ 2001-07-16 10:10 UTC (permalink / raw)
  To: linux-kernel, lse-tech


David, you are preaching to choir.

One can not have it both ways, at least without "#ifdef"s.
As Mike stated, we made the decision to adhere to current scheduling
semantics
purely for the purspose of comparision and increased adaptation chances.
As shown with the LoadBalancing addition to MQ, there are simple ways to
relax and completely eliminate the feedback between the queues, if one so
desires.

As for the solutions you proposed for the "switching problem", namely the
wakeup
list. I don't think you want a list here. A list would basically mean that
you
would need to walk it and come up with a single decision again. I think
what
I proposed, namely a per-CPU reschedule reservation that can be overwritten
taking
PROC_CHANGE_PENALTY or some form of it into account, seems a better
solution.
Open to discussions...

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)

email: frankeh@us.ibm.com



Davide Libenzi <davidel@xmailserver.org>@vger.kernel.org on 07/15/2001
04:02:21 PM

Sent by:  linux-kernel-owner@vger.kernel.org


To:   Mike Kravetz <mkravetz@sequent.com>
cc:   linux-kernel@vger.kernel.org, Andi Kleen <ak@suse.de>,
      lse-tech@lists.sourceforge.net, Larry McVoy <lm@bitmover.com>, David
      Lang <david.lang@digitalinsight.com>
Subject:  Re: CPU affinity & IPI latency




On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>>
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
prcess
>> was bouncing back and forth between the two CPUs. I actually was able to
>> gzip faster by starting up setiathome to keep one CPU busy and force the
>> scheduler to keep the gzip on a single CPU (I ran things several times
to
>> verify it was actually faster)
>>
>> David Lang
>
> That does sound like the same behavior I was seeing with lat_ctx.  Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account.  reschedule_idle() does a pretty good job of determining
> what CPU a task should run on.  In the case of lat_ctx (and I believe
> your use of gzip), the 'fast path' in reschedule_idle() is taken because
> the CPU associated with the awakened task is idle.  However, before
> schedule() is run on the 'target' CPU, schedule() is run on another
> CPU and the task is scheduled there.
>
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually ran on that CPU.
>
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
>
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU.  The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU.  Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.
>
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs.  This could potentially
> waste CPU cycles, and delay context switches.  In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
>
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem of the current scheduler is that it acts like an infinite
feedback
system.
When we're going to decide if we've to move a task we analyze the status at
the
current time without taking in account the system status at previous time
values.
But the feedback we send ( IPI to move the task ) take a finite time to hit
the
target CPU and, meanwhile, the system status changes.
So we're going to apply a feedback calculated in time T0 to a time Tn, and
this
will result in system auto-oscillation that we perceive as tasks bouncing
between CPUs.
This is kind of electronic example but it applies to all feedback systems.
The solution to this problem, given that we can't have a zero feedback
delivery
time, is :

1) lower the feedback amount, that means, try to minimize task movements

2) a low pass filter, that means, when we're going to decide the sort (
move )
        of a task, we've to weight the system status with the one that it
had
        at previous time values





- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-15 20:15           ` Andi Kleen
@ 2001-07-15 20:31             ` Davide Libenzi
  0 siblings, 0 replies; 23+ messages in thread
From: Davide Libenzi @ 2001-07-15 20:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, lse-tech, Larry McVoy, David Lang, Mike Kravetz


On 15-Jul-2001 Andi Kleen wrote:
> On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
>> - Define a new field in the task structure 'saved_cpus_allowed'.
>>   With a little collapsing of existing fields, there is room to put
>>   this on the same cache line as 'cpus_allowed'.
>> - In reschedule_idle() if we determine that the best CPU for a task
>>   is the CPU it is associated with (p->processor), then temporarily
>>   bind the task to that CPU.  The task is temporarily bound to the
>>   CPU by overwriting the 'cpus_allowed' field such that the task can
>>   only be scheduled on the target CPU.  Of course, the original
>>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
>> - In schedule(), the loop which examines all tasks on the runqueue
>>   will restore the value of 'cpus_allowed'.
> 
> This sounds racy, at least with your simple description. Even other CPUs
> could walk their run queue while the IPI is being processed and reset the
> cpus_allowed flag too early. I guess it would need a check in the 
> schedule loop that restores cpus_allowed that the current CPU is the only
> on set in that task's cpus_allowed. This unfortunately would hang the
> process if the CPU for some reason cannot process schedules timely, so
> it may be needed to add a timestamp also to the task_struct that allows 
> the restoration of cpus_allowed even from non target CPUs after some
> time.

I previously expressed ( maybe in this thread, I don't remember ) my idea of
not touching ( hacking ) the current scheduler to let it fit to SMP, that IMHO,
has different problems and needs different answers.
Anyway, as I said a couple of messages ago, the solution of this problem could
be to move the moved task in a private ( per CPU ) wakeup list that, when the
idle comes up, it'll be ran.





- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 22:43         ` Mike Kravetz
  2001-07-15 20:02           ` Davide Libenzi
@ 2001-07-15 20:15           ` Andi Kleen
  2001-07-15 20:31             ` Davide Libenzi
  1 sibling, 1 reply; 23+ messages in thread
From: Andi Kleen @ 2001-07-15 20:15 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Lang, Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen,
	linux-kernel

On Fri, Jul 13, 2001 at 03:43:05PM -0700, Mike Kravetz wrote:
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU.  The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU.  Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.

This sounds racy, at least with your simple description. Even other CPUs
could walk their run queue while the IPI is being processed and reset the
cpus_allowed flag too early. I guess it would need a check in the 
schedule loop that restores cpus_allowed that the current CPU is the only
on set in that task's cpus_allowed. This unfortunately would hang the
process if the CPU for some reason cannot process schedules timely, so
it may be needed to add a timestamp also to the task_struct that allows 
the restoration of cpus_allowed even from non target CPUs after some
time.


-Andi

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 22:43         ` Mike Kravetz
@ 2001-07-15 20:02           ` Davide Libenzi
  2001-07-15 20:15           ` Andi Kleen
  1 sibling, 0 replies; 23+ messages in thread
From: Davide Libenzi @ 2001-07-15 20:02 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel, Andi Kleen, lse-tech, Larry McVoy, David Lang


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
>> A real-world example of this issue.
>> 
>> I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
>> was bouncing back and forth between the two CPUs. I actually was able to
>> gzip faster by starting up setiathome to keep one CPU busy and force the
>> scheduler to keep the gzip on a single CPU (I ran things several times to
>> verify it was actually faster)
>> 
>> David Lang
> 
> That does sound like the same behavior I was seeing with lat_ctx.  Like
> I said in my previous note, the scheduler does try to take CPU affinity
> into account.  reschedule_idle() does a pretty good job of determining
> what CPU a task should run on.  In the case of lat_ctx (and I believe
> your use of gzip), the 'fast path' in reschedule_idle() is taken because
> the CPU associated with the awakened task is idle.  However, before
> schedule() is run on the 'target' CPU, schedule() is run on another
> CPU and the task is scheduled there.
> 
> The root cause of this situation is the delay between the time
> reschedule_idle() determines what is the best CPU a task should run
> on, and the time schedule() is actually ran on that CPU.
> 
> I have toyed with the idea of 'temporarily binding' a task to a CPU
> for the duration of the delay between reschedule_idle() and schedule().
> It would go something like this,
> 
> - Define a new field in the task structure 'saved_cpus_allowed'.
>   With a little collapsing of existing fields, there is room to put
>   this on the same cache line as 'cpus_allowed'.
> - In reschedule_idle() if we determine that the best CPU for a task
>   is the CPU it is associated with (p->processor), then temporarily
>   bind the task to that CPU.  The task is temporarily bound to the
>   CPU by overwriting the 'cpus_allowed' field such that the task can
>   only be scheduled on the target CPU.  Of course, the original
>   value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
> - In schedule(), the loop which examines all tasks on the runqueue
>   will restore the value of 'cpus_allowed'.
> 
> This would preserve the 'best CPU' decision made by reschedule_idle().
> On the down side, 'temporarily bound' tasks could not be scheduled
> until schedule() is run on their associated CPUs.  This could potentially
> waste CPU cycles, and delay context switches.  In addition, it is
> quite possible that while a task is 'temporarily bound' the state of
> the system could change in such a way that the best CPU is no longer
> best.
> 
> There appears to be a classic tradeoff between CPU affinity and
> context switch time.

The problem of the current scheduler is that it acts like an infinite feedback
system.
When we're going to decide if we've to move a task we analyze the status at the
current time without taking in account the system status at previous time
values.
But the feedback we send ( IPI to move the task ) take a finite time to hit the
target CPU and, meanwhile, the system status changes.
So we're going to apply a feedback calculated in time T0 to a time Tn, and this
will result in system auto-oscillation that we perceive as tasks bouncing
between CPUs.
This is kind of electronic example but it applies to all feedback systems.
The solution to this problem, given that we can't have a zero feedback delivery
time, is :

1) lower the feedback amount, that means, try to minimize task movements

2) a low pass filter, that means, when we're going to decide the sort ( move )
        of a task, we've to weight the system status with the one that it had
        at previous time values





- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-12 23:40 Mike Kravetz
  2001-07-13  0:22 ` Davide Libenzi
@ 2001-07-15  7:42 ` Troy Benjegerdes
  1 sibling, 0 replies; 23+ messages in thread
From: Troy Benjegerdes @ 2001-07-15  7:42 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: linux-kernel, Andi Kleen, lse-tech

On Thu, Jul 12, 2001 at 04:40:17PM -0700, Mike Kravetz wrote:
> This discussion was started on 'lse-tech@lists.sourceforge.net'.
> I'm widening the distribution in the hope of getting more input.
> 
> It started when Andi Kleen noticed that a single 'CPU Hog' task
> was being bounced back and forth between the 2 CPUs on his 2-way
> system.  I had seen similar behavior when running the context
> switching test of LMbench.  When running lat_ctx with only two
> threads on an 8 CPU system, one would ?expect? the two threads
> to be confined to two of the 8 CPUs in the system.  However, what
> I have observed is that the threads are effectively 'round
> robined' among all the CPUs and they all end up bearing
> an equivalent amount of the CPU load.  To more easily observe
> this, increase the number of 'TRIPS' in the benchmark to a really
> large number.

Did this 'CPU Hog' task do any I/O that might have caused an interrupt?

I'm wondering if the interrupt distribution has anything to do with this.. 
are you using any CPU affinity for interrupts? If not, this might explain 
why the processes wind up doing a 'round robin'.

I'm trying to reproduce this with gzip on a dual Mac G4, and I'm wondering
if this will be scewed any because all interrupts are directed at cpu0, so
anything that generates input or output on the network or serial is going
to tend to wind up on cpu0.

I'd also like to figure out what the IPI latency actually is.. Does anyone
have any suggestions how to measure this? I would hope it's lower on the
dual 7410 machine I have since I have 4-stage pipelines as opposed to 20
odd stages that Pentium III class machines do..

> After a little investigation, I believe this 'situation' is caused
> by the latency of the reschedule IPI used by the scheduler.  Recall
> that in lat_ctx all threads are in a tight loop consisting of:
> 
> pipe_read()
> pipe_write()
> 
> Both threads 'start' on the same CPU and are sitting in pipe_read
> waiting for data.  A token is written to the pipe and one thread
> is awakened.  The awakened thread, then immediately writes the token
> back to the pipe which ultimately results in a call to reschedule_idle()
> that will 'initiate' the scheduling of the other thread.  In
> reschedule_idle() we can not take the 'fast path' because WE are
> currently executing on the other thread's preferred CPU.  Therefore,
> reschedule_idle() chooses the oldest idle CPU and sends the IPI.
> However, before the IPI is received (and schedule() run) on the
> remote CPU, the currently running thread calls pipe_read which
> blocks and calls schedule().  Since the other task has yet to be
> scheduled on the other CPU, it is scheduled to run on the current
> CPU.  Both tasks continue to execute on the one CPU until such time
> that an IPI induced schedule() on the other CPU hits a window where
> it finds one of the tasks to schedule.  We continue in this way,
> migrating the tasks to the oldest idle CPU and eventually cycling our
> way through all the CPUs.
> 
> Does this explanation sound reasonable?
> 
> If so, it would then follow that booting with 'idle=poll' would
> help alleviate this situation.  However, that is not the case.  With
> idle=poll the CPU load is not as evenly distributed among the CPUs,
> but is still distributed among all of them.
> 
> Does the behavior of the 'benchmark' mean anything?  Should one
> expect tasks to stay their preferred CPUs if possible?
> 
> Thoughts/comments
> -- 
> Mike Kravetz                                 mkravetz@sequent.com
> IBM Linux Technology Center
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Troy Benjegerdes | master of mispeeling | 'da hozer' |  hozer@drgw.net
-----"If this message isn't misspelled, I didn't write it" -- Me -----
"Why do musicians compose symphonies and poets write poems? They do it
because life wouldn't have any meaning for them if they didn't. That's 
why I draw cartoons. It's my life." -- Charles Shulz

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
@ 2001-07-14  3:25 Hubertus Franke
  2001-07-16 16:14 ` Mike Kravetz
  0 siblings, 1 reply; 23+ messages in thread
From: Hubertus Franke @ 2001-07-14  3:25 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, linux-kernel



Mike, could we utilize the existing mechanism such as has_cpu.

Here is my approach/suggestion.
We introduce in the sched_data[cpu] a resched_task slot;

     struct task *resched_task;

When in reschedule_idle() a target cpu is to be decided for task <p>
we check the resched_task slot. If the slot is pointing
to some task, then this task should be considered running
and we should not consider the preemption goodness to the
currently running task as we know it already got IPI'ed.
See also the schedule() function describe later on.

reschedule_idle(struct task *p)
{
     :
     :

     struct task *rst = sched_data[target_cpu].resched_task;
      struct task *rmt = sched_data[target_cpu].current;
     if (rst != NULL) {
          if (preemption_goodness(p,rst, ...) > 1)
                    p->processor = target_cpu;
                    p->has_cpu   = 1;
               rst->has_cpu = 0; /* reset as we overwrite */
               sched_data[target_cpu].resched_task = p;
               /* so we make old <rst> available for scheduling
                * and temp-bind <p> to target_cpu */
                * don't have to send IPI as this to handles race
                * condition and we are holding scheduling lock
                */
          } else {
               continue;
               /* we know that the current priority won't be
                * larger than the one of <rst> otherwise this
                * would have been picked up in the schedule
                */
          }
     } else {
          /* standard stuff that we always do */
     }
}

In schedule() we need to check whether a reschedule reservation is held.
First we go through the standard <stillrunning> check to compute the
initial
<c> goodness value. Then under

still_running_back:

     <loop through the runqueue>
     /* note that <rst> will be ignored due to <has_cpu=1> flag */

     /* check wether reservation <rst> exists */
     rst = sched_data[this_cpu].resched_task;
     if (rst != NULL) {
          c = goodness(rst,..);
          if (c > best_prio) {
               best_prio = goodness(rst,..);
               next = rst;
               sched_data[this_cpu].resched_task = NULL;
          } else {
               /* need to return rst back to scheduable state */
               rst->has_cpu = 0;
          }
      }


This approach would eliminate the need to check during runqueue scan
to check for each task's saved_cpus_allowed and would also make sure that
only one task reserves running on a particular cpu. Reservations are
preempted through the existing mechanism, namely goodness comparision,
and such "preempted" tasks are returned to general scheduability.
are put back into the

Hubertus Franke
Enterprise Linux Group (Mgr),  Linux Technology Center (Member Scalability)
, OS-PIC (Chair)
email: frankeh@us.ibm.com
(w) 914-945-2003    (fax) 914-945-4425   TL: 862-2003



Mike Kravetz <mkravetz@sequent.com>@vger.kernel.org on 07/13/2001 06:43:05
PM

Sent by:  linux-kernel-owner@vger.kernel.org


To:   David Lang <david.lang@digitalinsight.com>
cc:   Larry McVoy <lm@bitmover.com>, Davide Libenzi
      <davidel@xmailserver.org>, lse-tech@lists.sourceforge.net, Andi Kleen
      <ak@suse.de>, linux-kernel@vger.kernel.org
Subject:  Re: CPU affinity & IPI latency



On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
>
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip
prcess
> was bouncing back and forth between the two CPUs. I actually was able to
> gzip faster by starting up setiathome to keep one CPU busy and force the
> scheduler to keep the gzip on a single CPU (I ran things several times to
> verify it was actually faster)
>
> David Lang

That does sound like the same behavior I was seeing with lat_ctx.  Like
I said in my previous note, the scheduler does try to take CPU affinity
into account.  reschedule_idle() does a pretty good job of determining
what CPU a task should run on.  In the case of lat_ctx (and I believe
your use of gzip), the 'fast path' in reschedule_idle() is taken because
the CPU associated with the awakened task is idle.  However, before
schedule() is run on the 'target' CPU, schedule() is run on another
CPU and the task is scheduled there.

The root cause of this situation is the delay between the time
reschedule_idle() determines what is the best CPU a task should run
on, and the time schedule() is actually ran on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU
for the duration of the delay between reschedule_idle() and schedule().
It would go something like this,

- Define a new field in the task structure 'saved_cpus_allowed'.
  With a little collapsing of existing fields, there is room to put
  this on the same cache line as 'cpus_allowed'.
- In reschedule_idle() if we determine that the best CPU for a task
  is the CPU it is associated with (p->processor), then temporarily
  bind the task to that CPU.  The task is temporarily bound to the
  CPU by overwriting the 'cpus_allowed' field such that the task can
  only be scheduled on the target CPU.  Of course, the original
  value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue
  will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle().
On the down side, 'temporarily bound' tasks could not be scheduled
until schedule() is run on their associated CPUs.  This could potentially
waste CPU cycles, and delay context switches.  In addition, it is
quite possible that while a task is 'temporarily bound' the state of
the system could change in such a way that the best CPU is no longer
best.

There appears to be a classic tradeoff between CPU affinity and
context switch time.

Comments?

--
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 19:51       ` David Lang
@ 2001-07-13 22:43         ` Mike Kravetz
  2001-07-15 20:02           ` Davide Libenzi
  2001-07-15 20:15           ` Andi Kleen
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Kravetz @ 2001-07-13 22:43 UTC (permalink / raw)
  To: David Lang
  Cc: Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Fri, Jul 13, 2001 at 12:51:53PM -0700, David Lang wrote:
> A real-world example of this issue.
> 
> I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
> was bouncing back and forth between the two CPUs. I actually was able to
> gzip faster by starting up setiathome to keep one CPU busy and force the
> scheduler to keep the gzip on a single CPU (I ran things several times to
> verify it was actually faster)
> 
> David Lang

That does sound like the same behavior I was seeing with lat_ctx.  Like
I said in my previous note, the scheduler does try to take CPU affinity
into account.  reschedule_idle() does a pretty good job of determining
what CPU a task should run on.  In the case of lat_ctx (and I believe
your use of gzip), the 'fast path' in reschedule_idle() is taken because
the CPU associated with the awakened task is idle.  However, before
schedule() is run on the 'target' CPU, schedule() is run on another
CPU and the task is scheduled there.

The root cause of this situation is the delay between the time
reschedule_idle() determines what is the best CPU a task should run
on, and the time schedule() is actually ran on that CPU.

I have toyed with the idea of 'temporarily binding' a task to a CPU
for the duration of the delay between reschedule_idle() and schedule().
It would go something like this,

- Define a new field in the task structure 'saved_cpus_allowed'.
  With a little collapsing of existing fields, there is room to put
  this on the same cache line as 'cpus_allowed'.
- In reschedule_idle() if we determine that the best CPU for a task
  is the CPU it is associated with (p->processor), then temporarily
  bind the task to that CPU.  The task is temporarily bound to the
  CPU by overwriting the 'cpus_allowed' field such that the task can
  only be scheduled on the target CPU.  Of course, the original
  value of 'cpus_allowed' is saved in 'saved_cpus_allowed'.
- In schedule(), the loop which examines all tasks on the runqueue
  will restore the value of 'cpus_allowed'.

This would preserve the 'best CPU' decision made by reschedule_idle().
On the down side, 'temporarily bound' tasks could not be scheduled
until schedule() is run on their associated CPUs.  This could potentially
waste CPU cycles, and delay context switches.  In addition, it is
quite possible that while a task is 'temporarily bound' the state of
the system could change in such a way that the best CPU is no longer
best.

There appears to be a classic tradeoff between CPU affinity and
context switch time.

Comments?

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 17:05     ` Mike Kravetz
  2001-07-13 19:51       ` David Lang
@ 2001-07-13 19:54       ` Chris Wedgwood
  1 sibling, 0 replies; 23+ messages in thread
From: Chris Wedgwood @ 2001-07-13 19:54 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Fri, Jul 13, 2001 at 10:05:21AM -0700, Mike Kravetz wrote:

    It is clear that the behavior of lat_ctx bypasses almost all of
    the scheduler's attempts at CPU affinity.  The real question is,
    "How often in running 'real workloads' are the schduler's attempts
    at CPU affinity bypassed?".

When encoding mp3s on a dual processor system, naturally I try to
encode two at once.

Most of the time this works as expected, one processed more or less
sticks to each CPU.  However, I have noticed that if for some reason,
a third process has to be scheduled (which is inevitable if you
actually want to do anything), then these two processes seem to bounce
back and forward for a few seconds, _even_ after this CPU spike has
gone.

Now, I'm not sure if I'm imagining this or not, as I said, I have two
CPUs and two CPU-bound tasks, to instrument this as all, I really have
to affect what I am looking at (rather like the uncertainty principal)
so I assumed that perhaps my programs to read and process
/proc/<n>/cpu was simply eating several cycles and each process was
yielding more or less at random causing what I saw seeing.



   --cw

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 17:05     ` Mike Kravetz
@ 2001-07-13 19:51       ` David Lang
  2001-07-13 22:43         ` Mike Kravetz
  2001-07-13 19:54       ` Chris Wedgwood
  1 sibling, 1 reply; 23+ messages in thread
From: David Lang @ 2001-07-13 19:51 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Larry McVoy, Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

A real-world example of this issue.

I was gzipping a large (~800MB) file on a dual athlon box. the gzip prcess
was bouncing back and forth between the two CPUs. I actually was able to
gzip faster by starting up setiathome to keep one CPU busy and force the
scheduler to keep the gzip on a single CPU (I ran things several times to
verify it was actually faster)

David Lang


 On Fri, 13 Jul 2001, Mike Kravetz wrote:

> Date: Fri, 13 Jul 2001 10:05:21 -0700
> From: Mike Kravetz <mkravetz@sequent.com>
> To: Larry McVoy <lm@bitmover.com>
> Cc: Davide Libenzi <davidel@xmailserver.org>, lse-tech@lists.sourceforge.net,
>      Andi Kleen <ak@suse.de>, linux-kernel@vger.kernel.org
> Subject: Re: CPU affinity & IPI latency
>
> On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> > Be careful tuning for LMbench (says the author :-)
> >
> > Especially this benchmark.  It's certainly possible to get dramatically better
> > SMP numbers by pinning all the lat_ctx processes to a single CPU, because
> > the benchmark is single threaded.  In other words, if we have 5 processes,
> > call them A, B, C, D, and E, then the benchmark is passing a token from
> > A to B to C to D to E and around again.
> >
> > If the amount of data/instructions needed by all 5 processes fits in the
> > cache and you pin all the processes to the same CPU you'll get much
> > better performance than simply letting them float.
> >
> > But making the system do that naively is a bad idea.
>
> I agree, and can't imagine the system ever attempting to take this
> into account and leave these 5 tasks on the same CPU.
>
> At the other extreme is my observation that 2 tasks on an 8 CPU
> system are 'round robined' among all 8 CPUs.  I think having the
> 2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
> to CPU affinity.  Actually, the scheduler does 'try' to do this.
>
> It is clear that the behavior of lat_ctx bypasses almost all of
> the scheduler's attempts at CPU affinity.  The real question is,
> "How often in running 'real workloads' are the schduler's attempts
> at CPU affinity bypassed?".
>
> --
> Mike Kravetz                                 mkravetz@sequent.com
> IBM Linux Technology Center
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 17:31       ` Mike Kravetz
@ 2001-07-13 19:17         ` Davide Libenzi
  0 siblings, 0 replies; 23+ messages in thread
From: Davide Libenzi @ 2001-07-13 19:17 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, Andi Kleen, linux-kernel, Larry McVoy


On 13-Jul-2001 Mike Kravetz wrote:
> On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
>> 
>> I personally think that a standard scheduler/cpu is the way to go for SMP.
>> I saw the one IBM guys did and I think that the wrong catch there is trying
>> always to grab the best task to run over all CPUs.
> 
> That was me/us.  Most of the reason for making 'global scheduling'
> decisions was an attempt to maintain the same behavior as the existing
> scheduler.  We are trying to see how well we can make this scheduler
> scale, while maintaining this global behavior.  Thought is that if
> there was ever any hope of someone adopting this scheduler, they would
> be more likely to do so if it attempted to maintain existing behavior.

The problem, IMHO, is that we're trying to extend what is a correct behaviour on
the UP scheduler ( pickup the best task to run ) to SMP machines.
Global scheduling decisions should be triggered in response of load unbalancing
and not at each schedule() run otherwise we're going to introduce a common lock
that will limit the overall scalability.
My idea about the future of the scheduler is to have a config options users can
chose depending upon the machine use.
By trying to keep a unique scheduler for both UP and MP is like going to give
the same answer to different problems and the scaling factor ( of the scheduler
itself ) on SMP will never improve.
The code inside kernel/sched.c should be reorganized ( it contains even not
scheduler code ) so that the various CONFIG_SCHED* will not introduce any messy
inside the code ( possibly by having the code in separate files
kernel/sched*.c ).




- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13 16:41     ` Davide Libenzi
@ 2001-07-13 17:31       ` Mike Kravetz
  2001-07-13 19:17         ` Davide Libenzi
  0 siblings, 1 reply; 23+ messages in thread
From: Mike Kravetz @ 2001-07-13 17:31 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Larry McVoy, linux-kernel, Andi Kleen, lse-tech, Mike Kravetz

On Fri, Jul 13, 2001 at 09:41:44AM -0700, Davide Libenzi wrote:
> 
> I personally think that a standard scheduler/cpu is the way to go for SMP.
> I saw the one IBM guys did and I think that the wrong catch there is trying
> always to grab the best task to run over all CPUs.

That was me/us.  Most of the reason for making 'global scheduling'
decisions was an attempt to maintain the same behavior as the existing
scheduler.  We are trying to see how well we can make this scheduler
scale, while maintaining this global behavior.  Thought is that if
there was ever any hope of someone adopting this scheduler, they would
be more likely to do so if it attempted to maintain existing behavior.


> I think that this concept could be relaxed introducing less chains between each
> CPU scheduler.
> A cheap load balancer should run, "time to time"(tm), to move tasks when a
> certain level of unbalancing has been reached.
> This will give each scheduler more independence and will make it more scalable,
> IMVHO.

Take a look at the 'CPU pooling & Load balancing' extensions to our
scheduler(lse.sourceforge.net/scheduling).  It allows you to divide
the system into CPU pools keep scheduling decisions limited to
these pools.  Periodicly, load balancing will be performed among
the pools.  This has shown some promise on NUMA architectures (as
one would expect).  You could define pool sizes to be 1 CPU and
get the scheduling behavior you desire on SMP.

But, none of this has to do with CPU affinity issues with the
current scheduler.

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13  0:36   ` Larry McVoy
  2001-07-13  2:06     ` Mark Hahn
  2001-07-13 16:41     ` Davide Libenzi
@ 2001-07-13 17:05     ` Mike Kravetz
  2001-07-13 19:51       ` David Lang
  2001-07-13 19:54       ` Chris Wedgwood
  2 siblings, 2 replies; 23+ messages in thread
From: Mike Kravetz @ 2001-07-13 17:05 UTC (permalink / raw)
  To: Larry McVoy; +Cc: Davide Libenzi, lse-tech, Andi Kleen, linux-kernel

On Thu, Jul 12, 2001 at 05:36:41PM -0700, Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> 
> Especially this benchmark.  It's certainly possible to get dramatically better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
> the benchmark is single threaded.  In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.  
> 
> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.
> 
> But making the system do that naively is a bad idea.

I agree, and can't imagine the system ever attempting to take this
into account and leave these 5 tasks on the same CPU.

At the other extreme is my observation that 2 tasks on an 8 CPU
system are 'round robined' among all 8 CPUs.  I think having the
2 tasks stay on 2 of the 8 CPUs would be an improvement with respect
to CPU affinity.  Actually, the scheduler does 'try' to do this.

It is clear that the behavior of lat_ctx bypasses almost all of
the scheduler's attempts at CPU affinity.  The real question is,
"How often in running 'real workloads' are the schduler's attempts
at CPU affinity bypassed?".

-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13  0:36   ` Larry McVoy
  2001-07-13  2:06     ` Mark Hahn
@ 2001-07-13 16:41     ` Davide Libenzi
  2001-07-13 17:31       ` Mike Kravetz
  2001-07-13 17:05     ` Mike Kravetz
  2 siblings, 1 reply; 23+ messages in thread
From: Davide Libenzi @ 2001-07-13 16:41 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel, Andi Kleen, lse-tech, Mike Kravetz


On 13-Jul-2001 Larry McVoy wrote:
> Be careful tuning for LMbench (says the author :-)
> 
> Especially this benchmark.  It's certainly possible to get dramatically
> better
> SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
> the benchmark is single threaded.  In other words, if we have 5 processes,
> call them A, B, C, D, and E, then the benchmark is passing a token from
> A to B to C to D to E and around again.  
> 
> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.
> 
> But making the system do that naively is a bad idea.

Agree.


> 
> This is a really hard area to get right but you can take a page from all
> the failed process migration efforts.  In general, moving stuff is a bad
> idea, it's much better to leave it where it is.  Everything scales better
> if there is a process queue per CPU and the default is that you leave the
> processes on the queue on which they last run.  However, if the load average
> for a queue starts going up and there is another queue with a substantially
> lower load average, then and ONLY then, should you move the process.

I personally think that a standard scheduler/cpu is the way to go for SMP.
I saw the one IBM guys did and I think that the wrong catch there is trying
always to grab the best task to run over all CPUs.
I think that this concept could be relaxed introducing less chains between each
CPU scheduler.
A cheap load balancer should run, "time to time"(tm), to move tasks when a
certain level of unbalancing has been reached.
This will give each scheduler more independence and will make it more scalable,
IMVHO.


> This is an area in which I've done a pile of work and I'd be interested
> in keeping a finger in any efforts to fix up the scheduler.

We've, somehow, understood it :)



- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13  0:36   ` Larry McVoy
@ 2001-07-13  2:06     ` Mark Hahn
  2001-07-13 16:41     ` Davide Libenzi
  2001-07-13 17:05     ` Mike Kravetz
  2 siblings, 0 replies; 23+ messages in thread
From: Mark Hahn @ 2001-07-13  2:06 UTC (permalink / raw)
  To: Larry McVoy; +Cc: linux-kernel

> If the amount of data/instructions needed by all 5 processes fits in the 
> cache and you pin all the processes to the same CPU you'll get much 
> better performance than simply letting them float.

interesting.  I remember that around 2.3.40, the scheduler handled
"frequent-schedulers" (like lat_ctx) differently, based on how long 
a timeslice the proc used, relative to the estimated time to flush cache.

as I recall, it let them stay on their current CPU.  like letting 
someone with 1 item go ahead of you in a grocery store checkout ;)

that code is mostly gone (only cacheflush_time remains);  I think it
morphed into the current migrate-to-longest-idle heuristic.

does anyone remember why the frequent-schedulers code was killed?
just because it conflated cache-affinity with timeslice?

regards, mark hahn.


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: CPU affinity & IPI latency
  2001-07-13  0:22 ` Davide Libenzi
@ 2001-07-13  0:36   ` Larry McVoy
  2001-07-13  2:06     ` Mark Hahn
                       ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Larry McVoy @ 2001-07-13  0:36 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Mike Kravetz, lse-tech, Andi Kleen, linux-kernel

Be careful tuning for LMbench (says the author :-)

Especially this benchmark.  It's certainly possible to get dramatically better
SMP numbers by pinning all the lat_ctx processes to a single CPU, because 
the benchmark is single threaded.  In other words, if we have 5 processes,
call them A, B, C, D, and E, then the benchmark is passing a token from
A to B to C to D to E and around again.  

If the amount of data/instructions needed by all 5 processes fits in the 
cache and you pin all the processes to the same CPU you'll get much 
better performance than simply letting them float.

But making the system do that naively is a bad idea.

This is a really hard area to get right but you can take a page from all
the failed process migration efforts.  In general, moving stuff is a bad
idea, it's much better to leave it where it is.  Everything scales better
if there is a process queue per CPU and the default is that you leave the
processes on the queue on which they last run.  However, if the load average
for a queue starts going up and there is another queue with a substantially
lower load average, then and ONLY then, should you move the process.

I think if you experiment with that you'll see that lat_ctx does well and
so do a lot of other things.

An optimization on that requires hardware support.  If you knew the number
of cache misses associated with each time slice, you could factor that in
and start moving processes that have a "too high" cache miss rate, with the
idea being that we want to keep all processes on the same CPU if we can
but if that is causing an excessive cache miss rate, it's time to move.

Another optimization is to always schedule an exec-ed process (as opposed
to a forked process) on a different CPU than its parent.  In general, when
you exec you have a clear boundary and it's good to spread those out.

All of this is based on my somewhat dated performance efforts that lead to
LMbench.  I don't know of any fundamental changed that invalidate these 
opinions but I could be wrong.

This is an area in which I've done a pile of work and I'd be interested
in keeping a finger in any efforts to fix up the scheduler.
-- 
---
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* RE: CPU affinity & IPI latency
  2001-07-12 23:40 Mike Kravetz
@ 2001-07-13  0:22 ` Davide Libenzi
  2001-07-13  0:36   ` Larry McVoy
  2001-07-15  7:42 ` Troy Benjegerdes
  1 sibling, 1 reply; 23+ messages in thread
From: Davide Libenzi @ 2001-07-13  0:22 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: lse-tech, Andi Kleen, linux-kernel


On 12-Jul-2001 Mike Kravetz wrote:
> This discussion was started on 'lse-tech@lists.sourceforge.net'.
> I'm widening the distribution in the hope of getting more input.
> 
> It started when Andi Kleen noticed that a single 'CPU Hog' task
> was being bounced back and forth between the 2 CPUs on his 2-way
> system.  I had seen similar behavior when running the context
> switching test of LMbench.  When running lat_ctx with only two
> threads on an 8 CPU system, one would ?expect? the two threads
> to be confined to two of the 8 CPUs in the system.  However, what
> I have observed is that the threads are effectively 'round
> robined' among all the CPUs and they all end up bearing
> an equivalent amount of the CPU load.  To more easily observe
> this, increase the number of 'TRIPS' in the benchmark to a really
> large number.
> 
> After a little investigation, I believe this 'situation' is caused
> by the latency of the reschedule IPI used by the scheduler.  Recall
> that in lat_ctx all threads are in a tight loop consisting of:
> 
> pipe_read()
> pipe_write()
> 
> Both threads 'start' on the same CPU and are sitting in pipe_read
> waiting for data.  A token is written to the pipe and one thread
> is awakened.  The awakened thread, then immediately writes the token
> back to the pipe which ultimately results in a call to reschedule_idle()
> that will 'initiate' the scheduling of the other thread.  In
> reschedule_idle() we can not take the 'fast path' because WE are
> currently executing on the other thread's preferred CPU.  Therefore,
> reschedule_idle() chooses the oldest idle CPU and sends the IPI.
> However, before the IPI is received (and schedule() run) on the
> remote CPU, the currently running thread calls pipe_read which
> blocks and calls schedule().  Since the other task has yet to be
> scheduled on the other CPU, it is scheduled to run on the current
> CPU.  Both tasks continue to execute on the one CPU until such time
> that an IPI induced schedule() on the other CPU hits a window where
> it finds one of the tasks to schedule.  We continue in this way,
> migrating the tasks to the oldest idle CPU and eventually cycling our
> way through all the CPUs.
> 
> Does this explanation sound reasonable?

I would say yes.


> 
> If so, it would then follow that booting with 'idle=poll' would
> help alleviate this situation.  However, that is not the case.  With
> idle=poll the CPU load is not as evenly distributed among the CPUs,
> but is still distributed among all of them.
> 
> Does the behavior of the 'benchmark' mean anything?  Should one
> expect tasks to stay their preferred CPUs if possible?

Maybe having a per-cpu wake list where the rescheduled task is moved to be woken
up by IPI target.




- Davide


^ permalink raw reply	[flat|nested] 23+ messages in thread

* CPU affinity & IPI latency
@ 2001-07-12 23:40 Mike Kravetz
  2001-07-13  0:22 ` Davide Libenzi
  2001-07-15  7:42 ` Troy Benjegerdes
  0 siblings, 2 replies; 23+ messages in thread
From: Mike Kravetz @ 2001-07-12 23:40 UTC (permalink / raw)
  To: linux-kernel; +Cc: Andi Kleen, lse-tech

This discussion was started on 'lse-tech@lists.sourceforge.net'.
I'm widening the distribution in the hope of getting more input.

It started when Andi Kleen noticed that a single 'CPU Hog' task
was being bounced back and forth between the 2 CPUs on his 2-way
system.  I had seen similar behavior when running the context
switching test of LMbench.  When running lat_ctx with only two
threads on an 8 CPU system, one would ?expect? the two threads
to be confined to two of the 8 CPUs in the system.  However, what
I have observed is that the threads are effectively 'round
robined' among all the CPUs and they all end up bearing
an equivalent amount of the CPU load.  To more easily observe
this, increase the number of 'TRIPS' in the benchmark to a really
large number.

After a little investigation, I believe this 'situation' is caused
by the latency of the reschedule IPI used by the scheduler.  Recall
that in lat_ctx all threads are in a tight loop consisting of:

pipe_read()
pipe_write()

Both threads 'start' on the same CPU and are sitting in pipe_read
waiting for data.  A token is written to the pipe and one thread
is awakened.  The awakened thread, then immediately writes the token
back to the pipe which ultimately results in a call to reschedule_idle()
that will 'initiate' the scheduling of the other thread.  In
reschedule_idle() we can not take the 'fast path' because WE are
currently executing on the other thread's preferred CPU.  Therefore,
reschedule_idle() chooses the oldest idle CPU and sends the IPI.
However, before the IPI is received (and schedule() run) on the
remote CPU, the currently running thread calls pipe_read which
blocks and calls schedule().  Since the other task has yet to be
scheduled on the other CPU, it is scheduled to run on the current
CPU.  Both tasks continue to execute on the one CPU until such time
that an IPI induced schedule() on the other CPU hits a window where
it finds one of the tasks to schedule.  We continue in this way,
migrating the tasks to the oldest idle CPU and eventually cycling our
way through all the CPUs.

Does this explanation sound reasonable?

If so, it would then follow that booting with 'idle=poll' would
help alleviate this situation.  However, that is not the case.  With
idle=poll the CPU load is not as evenly distributed among the CPUs,
but is still distributed among all of them.

Does the behavior of the 'benchmark' mean anything?  Should one
expect tasks to stay their preferred CPUs if possible?

Thoughts/comments
-- 
Mike Kravetz                                 mkravetz@sequent.com
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2001-07-16 22:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-07-16 21:45 CPU affinity & IPI latency Hubertus Franke
2001-07-16 22:56 ` Davide Libenzi
  -- strict thread matches above, loose matches on Subject: below --
2001-07-16 18:26 Hubertus Franke
2001-07-16 10:10 Hubertus Franke
2001-07-16 16:16 ` Davide Libenzi
2001-07-14  3:25 Hubertus Franke
2001-07-16 16:14 ` Mike Kravetz
2001-07-16 21:25   ` Davide Libenzi
2001-07-12 23:40 Mike Kravetz
2001-07-13  0:22 ` Davide Libenzi
2001-07-13  0:36   ` Larry McVoy
2001-07-13  2:06     ` Mark Hahn
2001-07-13 16:41     ` Davide Libenzi
2001-07-13 17:31       ` Mike Kravetz
2001-07-13 19:17         ` Davide Libenzi
2001-07-13 17:05     ` Mike Kravetz
2001-07-13 19:51       ` David Lang
2001-07-13 22:43         ` Mike Kravetz
2001-07-15 20:02           ` Davide Libenzi
2001-07-15 20:15           ` Andi Kleen
2001-07-15 20:31             ` Davide Libenzi
2001-07-13 19:54       ` Chris Wedgwood
2001-07-15  7:42 ` Troy Benjegerdes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).