From: Dietmar Eggemann
Date: Thu, 26 Mar 2015 15:23:38 +0000
Message-ID: <5514247A.1090009@arm.com>
To: Morten Rasmussen, Peter Zijlstra
CC: Sai Gurrappadi, mingo@redhat.com, vincent.guittot@linaro.org, yuyang.du@intel.com, preeti@linux.vnet.ibm.com, mturquette@linaro.org, nico@linaro.org, rjw@rjwysocki.net, Juri Lelli, linux-kernel@vger.kernel.org, Peter Boonstoppel
Subject: Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
In-Reply-To: <20150324173955.GI18994@e105550-lin.cambridge.arm.com>

On 24/03/15 17:39, Morten Rasmussen wrote:
> On Tue, Mar 24, 2015 at 04:10:37PM +0000, Peter Zijlstra wrote:
>> On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
>>>>> Maybe remind us why this needs to be tied to sched_groups? Why can't we
>>>>> attach the energy information to the domains?
>>
>>> In the current domain hierarchy you don't have domains with just one cpu
>>> in them. If you attach the per-cpu energy data to the MC level domain
>>> which spans the whole cluster, you break the current idea of attaching
>>> information to the cpumask (currently sched_group, but could be
>>> sched_domain as we discuss here) the information is associated with. You
>>> would have to either introduce a level of single cpu domains at the
>>> lowest level or move away from the idea of attaching data to the cpumask
>>> that is associated with it.
>>>
>>> Using sched_groups we do already have single cpu groups that we can
>>> attach per-cpu data to, but we are missing a top level group spanning
>>> the entire system for system wide energy data. So from that point of
>>> view groups and domains are equally bad.
>>
>> Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.
>
> Yeah :( I don't really care which one we choose. Adding another top
> level domain with one big group spanning all cpus, but with all SD flags
> disabled seems less intrusive than adding a level at the bottom.
>
> Better ideas are very welcome.

I had a stab at integrating such a top level (SYS) domain w/ all known SD
flags disabled. This SYS sd exposes itself w/ all counters set to 0 in
/proc/schedstat.

There are still some kludges in the patch below:

- The need for a new topology SD flag to tell sd_init() that we want to
  reset the default sd configuration.
- Don't break in build_sched_domains() at the first sd spanning cpu_map.
- Don't decay newidle max times in rebalance_domains() by bailing out
  early on the SYS sd.

It survived booting on single cluster (MC-SYS) and dual cluster
(MC-DIE-SYS) ARM systems. Would something like this be acceptable?
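For illustration only (this is not part of the patch below), a platform
could then expose the SYS level by appending an entry to its topology
table. The cpu_sys_flags() helper and the table layout here are made up
for the example:

```c
/*
 * Hypothetical ARM topology table with a SYS level on top of MC/DIE.
 * cpu_sys_flags() is an invented helper returning only SD_SHARE_ENERGY,
 * which would make sd_init() reset the default sd configuration for
 * this level. cpu_cpu_mask spans all CPUs, so SYS covers the system.
 */
static inline int cpu_sys_flags(void)
{
	return SD_SHARE_ENERGY;
}

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_corepower_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ cpu_cpu_mask, cpu_sys_flags, SD_INIT_NAME(SYS) },
	{ NULL, },
};
```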
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f984b4e58865..8fbc9976f5d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -904,6 +904,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010	/* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_SHARE_ENERGY		0x0040	/* System-wide energy data */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f52c2e7484e..d058dc1e639f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5529,7 +5529,7 @@ static int sd_degenerate(struct sched_domain *sd)
 	}

 	/* Following flags don't use groups */
-	if (sd->flags & (SD_WAKE_AFFINE))
+	if (sd->flags & (SD_WAKE_AFFINE | SD_SHARE_ENERGY))
 		return 0;

 	return 1;
@@ -6215,8 +6215,9 @@ static int sched_domains_curr_level;
  * SD_SHARE_POWERDOMAIN - describes shared power domain
  * SD_SHARE_CAP_STATES  - describes shared capacity states
  *
- * Odd one out:
+ * Odd two out:
  * SD_ASYM_PACKING      - describes SMT quirks
+ * SD_SHARE_ENERGY      - describes EAS quirks
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\
@@ -6224,7 +6225,8 @@ static int sched_domains_curr_level;
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
 	 SD_SHARE_POWERDOMAIN |		\
-	 SD_SHARE_CAP_STATES)
+	 SD_SHARE_CAP_STATES |		\
+	 SD_SHARE_ENERGY)

 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
@@ -6298,6 +6300,14 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;

+	} else if (sd->flags & SD_SHARE_ENERGY) {
+		/* Reset the default configuration completely */
+		memset(sd, 0, sizeof(*sd));
+
+		sd->flags = 1*SD_SHARE_ENERGY;
+#ifdef CONFIG_SCHED_DEBUG
+		sd->name = tl->name;
+#endif
+
 #ifdef CONFIG_NUMA
 	} else if (sd->flags & SD_NUMA) {
 		sd->cache_nice_tries = 2;
@@ -6826,8 +6836,6 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			*per_cpu_ptr(d.sd, i) = sd;
 			if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
 				sd->flags |= SD_OVERLAP;
-			if (cpumask_equal(cpu_map, sched_domain_span(sd)))
-				break;
 		}
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..8d4cc72f4778 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8073,6 +8073,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)

 	rcu_read_lock();
 	for_each_domain(cpu, sd) {
+
+		if (sd->flags & SD_SHARE_ENERGY)
+			continue;
+
 		/*
 		 * Decay the newidle max times here because this is a regular
 		 * visit to all the domains. Decay ~1% per second.