On Thu, May 29, 2014 at 09:56:24PM +0200, Vincent Guittot wrote:
> On 29 May 2014 16:02, Peter Zijlstra wrote:
> > On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
> >> @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >>  	 * with a large weight task outweighs the tasks on the system).
> >>  	 */
> >>  	if (prefer_sibling && sds->local &&
> >> -	    sds->local_stat.group_has_capacity)
> >> -		sgs->group_capacity = min(sgs->group_capacity, 1U);
> >> +	    sds->local_stat.group_capacity > 0)
> >> +		sgs->group_capacity = min(sgs->group_capacity, 1L);
> >>
> >>  	if (update_sd_pick_busiest(env, sds, sg, sgs)) {
> >>  		sds->busiest = sg;
> >> @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >>  	 * have to drop below capacity to reach cpu-load equilibrium.
> >>  	 */
> >>  	load_above_capacity =
> >> -		(busiest->sum_nr_running - busiest->group_capacity);
> >> +		(busiest->sum_nr_running - busiest->group_weight);
> >>
> >>  	load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
> >>  	load_above_capacity /= busiest->group_power;
> >
> > I think you just broke PREFER_SIBLING here..
>
> you mean by replacing the capacity, which was reflecting the number of
> cores for SMT, with the group_weight?

Right, so in the first hunk we lower group_capacity to 1 when
prefer_sibling is set, and then in the second hunk you replace that
group_capacity usage with group_weight, with the end result that
prefer_sibling is now ineffective.

That said, I fudged the prefer_sibling usage into the capacity logic,
mostly because I could and because it was already how the SMT stuff was
working. But there is no reason we should continue to intertwine these
two things.

So I think it would be good to have a patch that implements
prefer_sibling on nr_running separate from the existing capacity bits,
and then convert the remaining capacity bits to utilization (or
activity or whatever you did call it, see Morten's comments etc.).
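
To make the breakage concrete, something like the following stand-alone
model of that calculate_imbalance() hunk shows the difference (a sketch
with simplified constants and made-up numbers, not the kernel code; the
underflow guard below is implicit at the real call site, which only
runs this computation for an over-capacity group):

  /* model.c -- toy model of the load_above_capacity computation */
  #include <stdio.h>

  #define SCHED_LOAD_SCALE   1024UL
  #define SCHED_POWER_SCALE  1024UL

  static unsigned long
  load_above(unsigned long sum_nr_running, unsigned long cap,
             unsigned long group_power)
  {
          unsigned long load;

          /* guard so the toy never underflows; the real code only
           * reaches this path when the group is over capacity */
          if (sum_nr_running <= cap)
                  return 0;

          load = sum_nr_running - cap;
          load *= SCHED_LOAD_SCALE * SCHED_POWER_SCALE;
          return load / group_power;
  }

  int main(void)
  {
          /* one SMT-2 group, two runnable tasks, prefer_sibling set:
           * the old code subtracted the clamped capacity (1), the
           * patch substitutes group_weight (2) */
          unsigned long nr = 2, power = 2 * SCHED_POWER_SCALE;

          printf("clamped capacity (old): %lu\n", load_above(nr, 1, power));
          printf("group_weight     (new): %lu\n", load_above(nr, 2, power));
          return 0;
  }

With the clamp the group reports 512 units of load above capacity, so
the balancer has something to pull towards an idle sibling domain; with
group_weight it reports 0 and prefer_sibling never moves anything,
which is exactly the ineffectiveness described above.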