On Thu, May 29, 2014 at 09:56:24PM +0200, Vincent Guittot wrote:
> On 29 May 2014 16:02, Peter Zijlstra wrote:
> > On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
> >> @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >>  	 * with a large weight task outweighs the tasks on the system).
> >>  	 */
> >>  	if (prefer_sibling && sds->local &&
> >> -	    sds->local_stat.group_has_capacity)
> >> -		sgs->group_capacity = min(sgs->group_capacity, 1U);
> >> +	    sds->local_stat.group_capacity > 0)
> >> +		sgs->group_capacity = min(sgs->group_capacity, 1L);
> >>
> >>  	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
> >>  		sds->busiest = sg;
> >> @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >>  	 * have to drop below capacity to reach cpu-load equilibrium.
> >>  	 */
> >>  	load_above_capacity =
> >> -		(busiest->sum_nr_running - busiest->group_capacity);
> >> +		(busiest->sum_nr_running - busiest->group_weight);
> >>
> >>  	load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
> >>  	load_above_capacity /= busiest->group_power;
> >
> > I think you just broke PREFER_SIBLING here..
>
> you mean by replacing the capacity, which was reflecting the number of
> cores for SMT, with the group_weight?

Right, so in the first hunk we lower group_capacity to 1 when
prefer_sibling is set, and then in the second hunk you replace that
group_capacity usage with group_weight, with the end result that
prefer_sibling is now ineffective.

That said, I fudged the prefer_sibling usage into the capacity logic,
mostly because I could and because it was already how the SMT stuff was
working. But there is no reason we should continue to intertwine these
two things.

So I think it would be good to have a patch that implements
prefer_sibling on nr_running separate from the existing capacity bits,
and then convert the remaining capacity bits to utilization (or
activity or whatever you did call it, see Morten's comments etc.).
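
To make the breakage concrete, something like the following stand-alone
model of that calculate_imbalance() hunk shows the difference (a sketch
with simplified constants and made-up numbers, not the kernel code; the
underflow guard below is implicit at the real call site, which only
runs this computation for an over-capacity group):

  /* model.c -- toy model of the load_above_capacity computation */
  #include <stdio.h>

  #define SCHED_LOAD_SCALE   1024UL
  #define SCHED_POWER_SCALE  1024UL

  static unsigned long
  load_above(unsigned long sum_nr_running, unsigned long cap,
             unsigned long group_power)
  {
          unsigned long load;

          /* guard so the toy never underflows; the real code only
           * reaches this path when the group is over capacity */
          if (sum_nr_running <= cap)
                  return 0;

          load = sum_nr_running - cap;
          load *= SCHED_LOAD_SCALE * SCHED_POWER_SCALE;
          return load / group_power;
  }

  int main(void)
  {
          /* one SMT-2 group, two runnable tasks, prefer_sibling set:
           * the old code subtracted the clamped capacity (1), the
           * patch substitutes group_weight (2) */
          unsigned long nr = 2, power = 2 * SCHED_POWER_SCALE;

          printf("clamped capacity (old): %lu\n", load_above(nr, 1, power));
          printf("group_weight     (new): %lu\n", load_above(nr, 2, power));
          return 0;
  }

With the clamp the group reports 512 units of load above capacity, so
the balancer has something to pull towards an idle sibling domain; with
group_weight it reports 0 and prefer_sibling never moves anything,
which is exactly the ineffectiveness described above.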