Subject: change in sched cpu_power causing regressions with SCHED_MC
From: Suresh Siddha
Reply-To: Suresh Siddha
To: Peter Zijlstra
Cc: Ingo Molnar, LKML, "Ma, Ling", "Zhang, Yanmin", "ego@in.ibm.com",
 "svaidy@linux.vnet.ibm.com"
In-Reply-To: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
References: <1266023662.2808.118.camel@sbs-t61.sc.intel.com>
Organization: Intel Corp
Date: Fri, 12 Feb 2010 17:31:19 -0800
Message-Id: <1266024679.2808.153.camel@sbs-t61.sc.intel.com>

Peterz,

We have one more problem that Yanmin and Ling Ma reported. On dual-socket
quad-core platforms (for example, platforms based on NHM-EP), we are seeing
scenarios where one socket is completely busy (with all 4 cores running 4
tasks) while the other socket is completely idle.

This causes performance issues, as those 4 tasks share the memory
controller, last-level cache bandwidth, etc. We also won't be taking
advantage of turbo mode as much as we would like. We would get all of these
benefits if we moved two of those tasks to the other socket: both sockets
could then potentially go into turbo mode and improve performance.

In short, your recent change (shown below) broke this behavior. At the
kernel summit you mentioned that you made this change without affecting the
behavior of SMT/MC, and my testing immediately after the kernel summit also
didn't show the problem (perhaps my test didn't hit this specific change).
But apparently we are having performance issues with this patch (Ling Ma's
bisect pointed to it). I will look into this in more detail after the long
weekend (to see if we can catch this scenario in fix_small_imbalance()
etc.), but I wanted to give you a quick heads up.

Thanks.
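To make the capacity arithmetic concrete, here is a minimal userspace
sketch of why the straight sum no longer forces the spread. This is
illustrative only, not kernel code; it assumes, per the comment the patch
below removes, that the balancer treats cpu_power rounded to multiples of
SCHED_LOAD_SCALE as the number of tasks a group can handle:

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024

/*
 * Group capacity in tasks, assuming the balancer rounds cpu_power to
 * the nearest multiple of SCHED_LOAD_SCALE.
 */
static unsigned long group_capacity(unsigned long cpu_power)
{
	return (cpu_power + SCHED_LOAD_SCALE / 2) / SCHED_LOAD_SCALE;
}

int main(void)
{
	unsigned long cores_per_socket = 4;

	/*
	 * Old behavior: a socket whose cores share package resources was
	 * clamped to SCHED_LOAD_SCALE, i.e. capacity for one task, so
	 * four runnable tasks made the socket look 4x overloaded and the
	 * balancer spread them toward the idle socket.
	 */
	unsigned long old_power = SCHED_LOAD_SCALE;

	/*
	 * New behavior: the socket group's power is the straight sum of
	 * its cores, i.e. capacity for four tasks, so four tasks on one
	 * socket look balanced and nothing migrates.
	 */
	unsigned long new_power = cores_per_socket * SCHED_LOAD_SCALE;

	printf("old socket capacity: %lu task(s)\n", group_capacity(old_power));
	printf("new socket capacity: %lu task(s)\n", group_capacity(new_power));
	return 0;
}

With the old clamping, a fully loaded socket reported capacity for a single
task and the imbalance was obvious; with the straight sum, 4 tasks on a
4-core socket look perfectly balanced, so nothing moves to the idle socket.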
commit f93e65c186ab3c05ce2068733ca10e34fd00125e
Author: Peter Zijlstra
Date:   Tue Sep 1 10:34:32 2009 +0200

    sched: Restore __cpu_power to a straight sum of power

    cpu_power is supposed to be a representation of the process capacity
    of the cpu, not a value to randomly tweak in order to affect
    placement.

    Remove the placement hacks.

    Signed-off-by: Peter Zijlstra
    Tested-by: Andreas Herrmann
    Acked-by: Andreas Herrmann
    Acked-by: Gautham R Shenoy
    Cc: Balbir Singh
    LKML-Reference: <20090901083825.810860576@chello.nl>
    Signed-off-by: Ingo Molnar

diff --git a/kernel/sched.c b/kernel/sched.c
index da1edc8..584a122 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8464,15 +8464,13 @@ static void free_sched_groups(const struct cpumask *cpu_map,
  * there are asymmetries in the topology. If there are asymmetries, group
  * having more cpu_power will pickup more load compared to the group having
  * less cpu_power.
- *
- * cpu_power will be a multiple of SCHED_LOAD_SCALE. This multiple represents
- * the maximum number of tasks a group can handle in the presence of other idle
- * or lightly loaded groups in the same sched domain.
  */
 static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 {
 	struct sched_domain *child;
 	struct sched_group *group;
+	long power;
+	int weight;
 
 	WARN_ON(!sd || !sd->groups);
 
@@ -8483,22 +8481,20 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 
 	sd->groups->__cpu_power = 0;
 
-	/*
-	 * For perf policy, if the groups in child domain share resources
-	 * (for example cores sharing some portions of the cache hierarchy
-	 * or SMT), then set this domain groups cpu_power such that each group
-	 * can handle only one task, when there are other idle groups in the
-	 * same sched domain.
-	 */
-	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
-		       (child->flags &
-			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
-		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
+	if (!child) {
+		power = SCHED_LOAD_SCALE;
+		weight = cpumask_weight(sched_domain_span(sd));
+		/*
+		 * SMT siblings share the power of a single core.
+		 */
+		if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1)
+			power /= weight;
+		sg_inc_cpu_power(sd->groups, power);
 		return;
 	}
 
 	/*
-	 * add cpu_power of each child group to this groups cpu_power
+	 * Add cpu_power of each child group to this groups cpu_power.
 	 */
 	group = child->groups;
 	do {
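For completeness, here is the leaf-domain branch the patch adds, extracted
into a standalone userspace sketch (the SD_SHARE_CPUPOWER value below is
made up for illustration; only the arithmetic matches the hunk above):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024
#define SD_SHARE_CPUPOWER	0x01	/* hypothetical flag value */

/*
 * Mirrors the !child (leaf domain) branch added by the patch: SMT
 * siblings split a single core's worth of cpu_power between them.
 */
static long leaf_group_power(int sd_flags, int weight)
{
	long power = SCHED_LOAD_SCALE;

	if ((sd_flags & SD_SHARE_CPUPOWER) && weight > 1)
		power /= weight;
	return power;
}

int main(void)
{
	/* 2-way SMT sibling domain: each thread reports half a core. */
	printf("SMT sibling: %ld\n", leaf_group_power(SD_SHARE_CPUPOWER, 2));
	/* Non-SMT leaf domain: the full SCHED_LOAD_SCALE. */
	printf("full core:   %ld\n", leaf_group_power(0, 1));
	return 0;
}

So at the SMT level each sibling ends up with a fraction of
SCHED_LOAD_SCALE, and every level above it is a straight sum of its
children, with no clamping left anywhere to bias placement.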