Date: Tue, 24 Mar 2015 10:44:24 +0000
From: Morten Rasmussen
To: Dietmar Eggemann
Cc: Peter Zijlstra, Sai Gurrappadi, "mingo@redhat.com",
	"vincent.guittot@linaro.org", "yuyang.du@intel.com",
	"preeti@linux.vnet.ibm.com", "mturquette@linaro.org",
	"nico@linaro.org", "rjw@rjwysocki.net", Juri Lelli,
	"linux-kernel@vger.kernel.org", Peter Boonstoppel
Subject: Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
Message-ID: <20150324104423.GC18994@e105550-lin.cambridge.arm.com>
References: <1423074685-6336-1-git-send-email-morten.rasmussen@arm.com>
	<1423074685-6336-31-git-send-email-morten.rasmussen@arm.com>
	<55036AA1.7000801@nvidia.com>
	<20150316141546.GQ4081@e105550-lin.cambridge.arm.com>
	<20150323164702.GL23123@twins.programming.kicks-ass.net>
	<551075D9.2040409@arm.com>
In-Reply-To: <551075D9.2040409@arm.com>

On Mon, Mar 23, 2015 at 08:21:45PM +0000, Dietmar Eggemann wrote:
> On 23/03/15 16:47, Peter Zijlstra wrote:
> > On Mon, Mar 16, 2015 at 02:15:46PM +0000, Morten Rasmussen wrote:
> >> You are absolutely right. The current code is broken for system
> >> topologies where all cpus share the same clock source. To be honest,
> >> it is actually worse than that and you already pointed out the reason.
> >> We don't have a way of representing top-level contributions to power
> >> consumption in RFCv3, as we don't have a sched_group spanning all cpus
> >> in a single-cluster system. For example, we can't represent L2 cache
> >> and interconnect power consumption on such systems.
> >>
> >> In RFCv2 we had a system-wide sched_group dangling by itself for that
> >> purpose. We chose to remove that in this rewrite as it led to messy
> >> code. In my opinion, a more elegant solution is to introduce an
> >> additional sched_domain above the current top level which has a single
> >> sched_group spanning all cpus in the system. That should fix the
> >> SD_SHARE_CAP_STATES problem and allow us to attach power data for the
> >> top level.
> >
> > Maybe remind us why this needs to be tied to sched_groups? Why can't we
> > attach the energy information to the domains?
>
> Currently on our 2-cluster (big.LITTLE) system (cluster0: big cpus,
> cluster1: little cpus) we attach energy information onto all sg's at MC
> level (cpu/core related energy data) and at DIE sd level (cluster
> related energy data).
>
> For the MC level (cpus sharing the same u-arch) attaching the energy
> information onto the sd is clearly much easier than attaching it onto
> the individual sg's.

In the current domain hierarchy you don't have domains with just one cpu
in them. If you attach the per-cpu energy data to the MC level domain,
which spans the whole cluster, you break the current idea of attaching
information to the cpumask it is associated with (currently a
sched_group, but it could be a sched_domain, as we discuss here).

You would have to either introduce a level of single-cpu domains at the
lowest level or move away from the idea of attaching data to the cpumask
it is associated with. Using sched_groups we already have single-cpu
groups that we can attach per-cpu data to, but we are missing a top-level
group spanning the entire system for system-wide energy data. So from
that point of view groups and domains are equally bad.
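To make that concrete, here is a rough sketch of the per-group
attachment. It is simplified, not the actual patch code: it assumes the
kernel's sched_domain/sched_group types plus the sg->sge field the
patches add, the struct members are approximations, and the helper
function name is made up for illustration:

	/* Sketch only -- simplified, names approximate. */
	struct capacity_state {
		unsigned long cap;	/* compute capacity at this P-state */
		unsigned long power;	/* busy power at this P-state */
	};

	struct idle_state {
		unsigned long power;	/* power in this idle state */
	};

	struct sched_group_energy {
		unsigned int nr_idle_states;
		struct idle_state *idle_states;
		unsigned int nr_cap_states;
		struct capacity_state *cap_states;
	};

	/*
	 * Per-group energy data is reached the same way sg->sgc is today:
	 * walk the circular group list of the sched_domain.
	 */
	static void visit_group_energy(struct sched_domain *sd)
	{
		struct sched_group *sg = sd->groups;

		do {
			struct sched_group_energy *sge = sg->sge;

			/* ... use sge->cap_states / sge->idle_states for sg ... */

			sg = sg->next;
		} while (sg != sd->groups);
	}

At MC level each sg spans one cpu, so there is one such struct per cpu;
attaching the same data to the MC domain would give one struct for the
whole cluster instead.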
> But on DIE level, when we want to figure out the cluster energy data for
> a cluster represented by an sg other than the first sg (sg0), then we
> would have to access its cluster energy data via the DIE sd of one of
> the cpus of this cluster. I haven't seen code actually doing that in CFS.
>
> IMHO, the current code is always iterating over the sg's of the sd and
> accessing either sg (sched_group) or sg->sgc (sched_group_capacity)
> data. Our energy data follows the sched_group_capacity example.

Right, using domains we can't directly see sibling domains. Using groups
we can see some sibling groups directly without accessing a different
per-cpu view of the domain hierarchy, but not all of them. We do have to
do the per-cpu thing in some cases, similar to how it is currently done
in select_task_rq_fair().

> > There is an additional problem with groups you've not yet discovered
> > and that is overlapping groups. Certain NUMA topologies result in this.
> > There the sum of cpus over the groups is greater than the total cpus in
> > the domain.
>
> Yeah, we haven't tried EAS on such a system, nor have we enabled the
> FORCE_SD_OVERLAP sched feature in a long time.

There are things to be discussed for NUMA and energy awareness. I'm not
sure whether the NUMA folks would be interested in energy awareness and,
if so, how to couple it in a meaningful way with the NUMA scheduling
strategy, which involves memory-locality aspects. The current patches
should provide the basics to enable partial energy awareness for NUMA
systems, in the sense that energy-aware scheduling decisions will be made
up to the level pointed to by the sd_ea pointer (somewhat similar to
sd_llc). So it should be possible, for example, to do energy-aware
scheduling within a NUMA node and let cross-NUMA-node scheduling be done
as it is currently done. It is entirely untested, but that is at least
how it is intended to work :)
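For reference, sd_ea is meant to be looked up with the same per-cpu
shortcut pattern as the existing sd_llc pointer. Roughly (sketch only;
the helper name is made up, and the sd_ea per-cpu variable is the one
the patches introduce):

	/*
	 * Sketch, not the actual patch code: is scheduling for this cpu
	 * covered by an energy-aware domain level?
	 */
	static bool cpu_has_energy_aware_level(int cpu)
	{
		struct sched_domain *sd;
		bool ea;

		rcu_read_lock();
		sd = rcu_dereference(per_cpu(sd_ea, cpu));
		ea = sd != NULL;	/* decisions are energy-aware up to this level */
		rcu_read_unlock();

		return ea;
	}

Above that level, load balancing would carry on exactly as it does
today.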