From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752914AbaJBRSY (ORCPT ); Thu, 2 Oct 2014 13:18:24 -0400
Received: from mx1.redhat.com ([209.132.183.28]:40304 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751510AbaJBRSW (ORCPT ); Thu, 2 Oct 2014 13:18:22 -0400
Date: Thu, 2 Oct 2014 13:15:48 -0400
From: Rik van Riel
To: Nicolas Pitre
Cc: Peter Zijlstra, Ingo Molnar, Daniel Lezcano, "Rafael J. Wysocki",
	linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linaro-kernel@lists.linaro.org
Subject: [PATCH RFC] sched,idle: teach select_idle_sibling about idle states
Message-ID: <20141002131548.6cd377d5@cuia.bos.redhat.com>
In-Reply-To:
References: <1409844730-12273-1-git-send-email-nicolas.pitre@linaro.org>
	<1409844730-12273-3-git-send-email-nicolas.pitre@linaro.org>
	<542B277D.7050103@redhat.com>
Organization: Red Hat, Inc
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, 30 Sep 2014 19:15:00 -0400 (EDT)
Nicolas Pitre wrote:

> On Tue, 30 Sep 2014, Rik van Riel wrote:

> > The main thing it does not cover is already running tasks that
> > get woken up again, since select_idle_sibling() covers everything
> > except for newly forked and newly executed tasks.
>
> True. Now that you bring this up, I remember that Peter mentioned it as
> well.
>
> > I am looking at adding similar logic to select_idle_sibling()
>
> OK thanks.

This patch is ugly. I have not bothered cleaning it up, because it
causes a regression with hackbench. Apparently for hackbench (and
potentially other sync wakeups), locality is more important than
idleness.

We may need to add a third clause before the search, something
along the lines of the following, to ensure target gets selected
if neither target nor i is idle and the wakeup is synchronous...

	if (sync_wakeup && cpu_rq(target)->nr_running == 1)
		return target;

I still need to run tests with other workloads, too.

Another consideration is that search costs with this patch
are potentially much increased. I suspect we may want to
simply propagate the load on each sched_group up the tree
hierarchically, with delta accounting and propagating the
info upwards only when the delta is significant, as is done
in __update_tg_runnable_avg.

---8<---

Subject: sched,idle: teach select_idle_sibling about idle states

Change select_idle_sibling to take cpu idle exit latency into
account. First preference is to select the cpu with the lowest
exit latency from a completely idle sched_group inside the CPU;
if that is not available, we pick the CPU with the lowest exit
latency in any sched_group.

This increases the total search time of select_idle_sibling;
we may want to look into propagating load info up the sched_group
tree in some way. That information would also be useful to prevent
the wake_affine logic from causing a load imbalance between
sched_groups.

It is not clear when locality (from staying on the old CPU)
beats a lower idle exit latency. Having information on whether
the CPU drops content from the CPU caches in certain idle states
would help with that, but with multiple CPUs bound together in
the same physical CPU core, the hardware often does not do what
we tell it, anyway...

Signed-off-by: Rik van Riel
---
 kernel/sched/fair.c | 47 +++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 10a5a28..12540cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4465,41 +4465,76 @@ static int select_idle_sibling(struct task_struct *p, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
+	unsigned int min_exit_latency_thread = UINT_MAX;
+	unsigned int min_exit_latency_core = UINT_MAX;
+	int shallowest_idle_thread = -1;
+	int shallowest_idle_core = -1;
 	int i = task_cpu(p);
 
+	/* target always has some code running and is not in an idle state */
 	if (idle_cpu(target))
 		return target;
 
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
+	 * XXX: does i's exit latency exceed sysctl_sched_migration_cost?
 	 */
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
+	 * First preference is finding a totally idle core with a thread
+	 * in a shallow idle state; second preference is whatever idle
+	 * thread has the shallowest idle state anywhere.
 	 */
 	sd = rcu_dereference(per_cpu(sd_llc, target));
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			unsigned int min_sg_exit_latency = UINT_MAX;
+			int shallowest_sg_idle_thread = -1;
+			bool all_idle = true;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
-				if (i == target || !idle_cpu(i))
-					goto next;
+				struct rq *rq;
+				struct cpuidle_state *idle;
+
+				if (i == target || !idle_cpu(i)) {
+					all_idle = false;
+					continue;
+				}
+
+				rq = cpu_rq(i);
+				idle = idle_get_state(rq);
+
+				if (idle && idle->exit_latency < min_sg_exit_latency) {
+					min_sg_exit_latency = idle->exit_latency;
+					shallowest_sg_idle_thread = i;
+				}
+			}
+
+			if (all_idle && min_sg_exit_latency < min_exit_latency_core) {
+				shallowest_idle_core = shallowest_sg_idle_thread;
+				min_exit_latency_core = min_sg_exit_latency;
+			} else if (min_sg_exit_latency < min_exit_latency_thread) {
+				shallowest_idle_thread = shallowest_sg_idle_thread;
+				min_exit_latency_thread = min_sg_exit_latency;
 			}
 
-			target = cpumask_first_and(sched_group_cpus(sg),
-					tsk_cpus_allowed(p));
-			goto done;
 next:
 			sg = sg->next;
 		} while (sg != sd->groups);
 	}
-done:
+	if (shallowest_idle_core >= 0)
+		target = shallowest_idle_core;
+	else if (shallowest_idle_thread >= 0)
+		target = shallowest_idle_thread;
+
 	return target;
 }
 
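
For experimenting with the selection order outside the kernel, here is a
minimal userspace sketch of the two-tier preference from the changelog:
a thread in a completely idle group first, otherwise the shallowest idle
thread anywhere. It is only a model under stated assumptions, not the
kernel code. The names model_cpu, model_group and model_select_idle are
hypothetical, and the sketch leaves out the affinity mask, the early
returns for an idle target or a cache-affine previous cpu, and the case
where an idle cpu has no cpuidle state recorded.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one hardware thread; not a kernel structure. */
struct model_cpu {
	bool idle;			/* true if nothing is running here */
	unsigned int exit_latency;	/* idle exit latency, used only if idle */
};

/* Hypothetical stand-in for a sched_group covering one core. */
struct model_group {
	const int *cpus;		/* indices into the cpu array */
	size_t nr_cpus;
};

/*
 * Return the preferred cpu: the lowest exit latency thread in a fully
 * idle group if one exists, else the lowest exit latency idle thread in
 * any group, else target itself.
 */
int model_select_idle(const struct model_cpu *cpus,
		      const struct model_group *groups, size_t nr_groups,
		      int target)
{
	unsigned int min_core = UINT_MAX, min_thread = UINT_MAX;
	int best_core = -1, best_thread = -1;

	assert(cpus != NULL && groups != NULL);

	for (size_t g = 0; g < nr_groups; g++) {
		unsigned int min_sg = UINT_MAX;
		int best_sg = -1;
		bool all_idle = true;

		for (size_t k = 0; k < groups[g].nr_cpus; k++) {
			int i = groups[g].cpus[k];

			/* target counts as busy, as in the patch */
			if (i == target || !cpus[i].idle) {
				all_idle = false;
				continue;
			}
			if (cpus[i].exit_latency < min_sg) {
				min_sg = cpus[i].exit_latency;
				best_sg = i;
			}
		}

		if (all_idle && min_sg < min_core) {
			best_core = best_sg;
			min_core = min_sg;
		} else if (min_sg < min_thread) {
			best_thread = best_sg;
			min_thread = min_sg;
		}
	}

	if (best_core >= 0)
		return best_core;
	if (best_thread >= 0)
		return best_thread;
	return target;
}
```

With two groups of two threads each, where the target's sibling is in a
very shallow state and the other group is fully idle, the sketch still
prefers a thread from the fully idle group. Once that group has a busy
thread, it falls back to the shallowest idle thread overall.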