On Tue, 2017-06-27 at 07:39 +0200, Peter Zijlstra wrote:
> On Mon, Jun 26, 2017 at 03:34:49PM -0400, Rik van Riel wrote:
> > On Mon, 2017-06-26 at 18:12 +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 26, 2017 at 11:20:54AM -0400, Rik van Riel wrote:
> > > >
> > > > Oh, indeed.  I guess in wake_affine() we should test
> > > > whether the CPUs are in the same NUMA node, rather than
> > > > doing cpus_share_cache() ?
> > >
> > > Well, since select_idle_sibling() is on LLC; the early test on
> > > cpus_share_cache(prev, this) seems to actually make sense.
> > >
> > > But then cutting out all the other bits seems wrong. Not in the
> > > least because !NUMA_BALANCING should also still keep working.
> >
> > Even when !NUMA_BALANCING, I suspect it makes little sense
> > to compare the loads on just the cores in question, since
> > select_idle_sibling() will likely move the task somewhere
> > else.
> >
> > I suspect we want to compare the load on the whole LLC
> > for that reason, even with NUMA_BALANCING disabled.
>
> But we don't have that data around :/ One thing we could do is
> try and keep a copy of the last s*_lb_stats around in the
> sched_domain_shared stuff or something, and try and use that.
>
> That way we can strictly keep things at the LLC level and not
> confuse things with NUMA.
>
> Similarly, we could use that same data to then avoid re-computing
> things for the NUMA domain as well, and do away with numa_stats.

That does seem like a useful optimization, though I guess we would
have to invalidate the cached data every time we actually move a
task?

The current code simply walks all the CPUs in the cpumask_t and
adds up capacity and load. Doing that appears to be better than
poor task placement (Jirka's numbers speak for themselves), but
optimizing this code path does seem like a worthwhile goal.

I'll look into it.

> > > > Or, alternatively, have an update_numa_stats() variant
> > > > for numa_wake_affine() that works on the LLC level?
> > >
> > > I think we want to retain the existing behaviour for everything
> > > larger than LLC, and when NUMA_BALANCING, smaller than NUMA.
> >
> > What do you mean by this, exactly?
>
> As you noted, when prev and this are in the same LLC, it doesn't
> matter and select_idle_sibling() will do its thing. So anything
> smaller than the LLC need not do anything.
>
> When NUMA_BALANCING we have the numa_stats thing and we can, as
> you propose, use that.
>
> If LLC < NUMA or !NUMA_BALANCING we have a region that needs to
> do _something_.

Agreed. I will fix this.

Given that this is a bit of a corner case, I guess I can fix it
with follow-up patches, to be merged into -tip before the whole
series is sent on to Linus?

> > > Also note that your use of task_h_load() in the new numa thing
> > > suffers from exactly the problem effective_load() is trying to
> > > solve.
> >
> > Are you saying task_h_load is wrong in task_numa_compare()
> > too, then?  Should both use effective_load()?
>
> I need more than the few minutes I currently have, but probably.
> The question is of course, how much does it matter and how
> painful will it be to do it better.

I suspect it does not matter at all currently, since the load
balancing code does not use effective_load, and having the
wake_affine logic calculate things differently from the load
balancer is likely to result in both pieces of code fighting
against each other.
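To make that concrete, the two calculations look roughly like
this today (a from-memory sketch, not the exact code in either
place):

	/* Sketch only; simplified from kernel/sched/fair.c. */

	/* The new NUMA-aware wake_affine path simply adds the
	 * task's hierarchical load to the destination CPU: */
	this_load = target_load(this_cpu, sd->wake_idx) + task_h_load(p);

	/* The old wake_affine path instead asks effective_load()
	 * what the group hierarchy's load would become if p's
	 * weight moved there, which can give a different answer
	 * once p runs inside a cgroup: */
	weight	   = p->se.avg.load_avg;
	this_load  = target_load(this_cpu, sd->wake_idx);
	this_load += effective_load(task_group(p), this_cpu, weight, weight);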
I suspect we should either use task_h_load everywhere, or
effective_load everywhere, but not a mix-and-match situation
where one is used in some places and the other in others.
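PS: for caching the LLC statistics in sched_domain_shared, the
direction I have in mind looks something like the below. This is
entirely untested, and the llc_stats structure, its members, and
update_llc_stats() are made-up names for illustration, not
existing kernel API:

	/* Hypothetical cached per-LLC statistics; names invented. */
	struct llc_stats {
		unsigned long	load;		/* sum of weighted_cpuload() */
		unsigned long	capacity;	/* sum of capacity_of() */
		unsigned int	nr_running;
		unsigned long	last_update;	/* jiffies at last refresh */
	};

	/*
	 * Refresh the cached stats when they go stale, so wakeups
	 * can reuse them instead of walking every CPU in the LLC
	 * mask on each invocation.
	 */
	static void update_llc_stats(struct sched_domain_shared *sds,
				     const struct cpumask *span)
	{
		struct llc_stats *stats = &sds->stats;	/* invented member */
		int cpu;

		if (!time_after(jiffies, stats->last_update + HZ / 100))
			return;

		stats->load = 0;
		stats->capacity = 0;
		stats->nr_running = 0;

		for_each_cpu(cpu, span) {
			stats->load	  += weighted_cpuload(cpu);
			stats->capacity	  += capacity_of(cpu);
			stats->nr_running += cpu_rq(cpu)->nr_running;
		}
		stats->last_update = jiffies;
	}

The open question from above still applies: whenever we actually
migrate a task we would probably need to invalidate or refresh
this, or at least tolerate it being slightly out of date.

--
All rights reversed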