On Tue, 2017-06-27 at 07:39 +0200, Peter Zijlstra wrote:
> On Mon, Jun 26, 2017 at 03:34:49PM -0400, Rik van Riel wrote:
> > On Mon, 2017-06-26 at 18:12 +0200, Peter Zijlstra wrote:
> > > On Mon, Jun 26, 2017 at 11:20:54AM -0400, Rik van Riel wrote:
> > > >
> > > > Oh, indeed.  I guess in wake_affine() we should test
> > > > whether the CPUs are in the same NUMA node, rather than
> > > > doing cpus_share_cache() ?
> > >
> > > Well, since select_idle_sibling() is on LLC; the early test on
> > > cpus_share_cache(prev, this) seems to actually make sense.
> > >
> > > But then cutting out all the other bits seems wrong. Not in the
> > > least because !NUMA_BALANCING should also still keep working.
> >
> > Even when !NUMA_BALANCING, I suspect it makes little sense
> > to compare the loads on just the cores in question, since
> > select_idle_sibling() will likely move the task somewhere
> > else.
> >
> > I suspect we want to compare the load on the whole LLC
> > for that reason, even with NUMA_BALANCING disabled.
>
> But we don't have that data around :/ One thing we could do is
> try and keep a copy of the last s*_lb_stats around in the
> sched_domain_shared stuff or something, and try and use that.
>
> That way we can strictly keep things at the LLC level and not
> confuse things with NUMA.
>
> Similarly, we could use that same data to then avoid re-computing
> things for the NUMA domain as well, and do away with numa_stats.

That does seem like a useful optimization, though I guess we would
have to invalidate the cached data every time we actually move a
task?

The current code simply walks all the CPUs in the cpumask_t and
adds up capacity and load. Doing that appears to be better than
poor task placement (Jirka's numbers speak for themselves), but
optimizing this code path does seem like a worthwhile goal.

I'll look into it.

> > > > Or, alternatively, have an update_numa_stats() variant
> > > > for numa_wake_affine() that works on the LLC level?
> > >
> > > I think we want to retain the existing behaviour for everything
> > > larger than LLC, and when NUMA_BALANCING, smaller than NUMA.
> >
> > What do you mean by this, exactly?
>
> As you noted, when prev and this are in the same LLC, it doesn't
> matter and select_idle_sibling() will do its thing. So anything
> smaller than the LLC need not do anything.
>
> When NUMA_BALANCING we have the numa_stats thing and we can, as
> you propose, use that.
>
> If LLC < NUMA or !NUMA_BALANCING we have a region that needs to
> do _something_.

Agreed. I will fix this.

Given that this is a bit of a corner case, I guess I can fix it
with follow-up patches, to be merged into -tip before the whole
series is sent on to Linus?

> > > Also note that your use of task_h_load() in the new numa thing
> > > suffers from exactly the problem effective_load() is trying to
> > > solve.
> >
> > Are you saying task_h_load is wrong in task_numa_compare()
> > too, then?  Should both use effective_load()?
>
> I need more than the few minutes I currently have, but probably.
> The question is of course, how much does it matter and how
> painful will it be to do it better.

I suspect it does not matter at all currently, since the load
balancing code does not use effective_load, and having the
wake_affine logic calculate things differently from the load
balancer is likely to result in both pieces of code fighting
against each other.
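To make that concrete, the two calculations look roughly like
this today (a from-memory sketch, not the exact code in either
place):

	/* Sketch only; simplified from kernel/sched/fair.c. */

	/* The new NUMA-aware wake_affine path simply adds the
	 * task's hierarchical load to the destination CPU: */
	this_load = target_load(this_cpu, sd->wake_idx) + task_h_load(p);

	/* The old wake_affine path instead asks effective_load()
	 * what the group hierarchy's load would become if p's
	 * weight moved there, which can give a different answer
	 * once p runs inside a cgroup: */
	weight	   = p->se.avg.load_avg;
	this_load  = target_load(this_cpu, sd->wake_idx);
	this_load += effective_load(task_group(p), this_cpu, weight, weight);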
I suspect we should either use task_h_load everywhere, or
effective_load everywhere, but not a mix-and-match situation
where one is used in some places and the other in others.
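PS: for caching the LLC statistics in sched_domain_shared, the
direction I have in mind looks something like the below. This is
entirely untested, and the llc_stats structure, its members, and
update_llc_stats() are made-up names for illustration, not
existing kernel API:

	/* Hypothetical cached per-LLC statistics; names invented. */
	struct llc_stats {
		unsigned long	load;		/* sum of weighted_cpuload() */
		unsigned long	capacity;	/* sum of capacity_of() */
		unsigned int	nr_running;
		unsigned long	last_update;	/* jiffies at last refresh */
	};

	/*
	 * Refresh the cached stats when they go stale, so wakeups
	 * can reuse them instead of walking every CPU in the LLC
	 * mask on each invocation.
	 */
	static void update_llc_stats(struct sched_domain_shared *sds,
				     const struct cpumask *span)
	{
		struct llc_stats *stats = &sds->stats;	/* invented member */
		int cpu;

		if (!time_after(jiffies, stats->last_update + HZ / 100))
			return;

		stats->load = 0;
		stats->capacity = 0;
		stats->nr_running = 0;

		for_each_cpu(cpu, span) {
			stats->load	  += weighted_cpuload(cpu);
			stats->capacity	  += capacity_of(cpu);
			stats->nr_running += cpu_rq(cpu)->nr_running;
		}
		stats->last_update = jiffies;
	}

The open question from above still applies: whenever we actually
migrate a task we would probably need to invalidate or refresh
this, or at least tolerate it being slightly out of date.

--
All rights reversed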