From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Wed, 11 May 2016 14:33:45 +0200
From: Peter Zijlstra
To: Matt Fleming
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org, clm@fb.com,
	mgalbraith@suse.de, tglx@linutronix.de, fweisbec@gmail.com,
	srikar@linux.vnet.ibm.com, mikey@neuling.org, anton@samba.org
Subject: Re: [RFC][PATCH 4/7] sched: Replace sd_busy/nr_busy_cpus with
 sched_domain_shared
Message-ID: <20160511123345.GD3192@twins.programming.kicks-ass.net>
References: <20160509104807.284575300@infradead.org>
 <20160509105210.642395937@infradead.org>
 <20160511115555.GT2839@codeblueprint.co.uk>
In-Reply-To: <20160511115555.GT2839@codeblueprint.co.uk>

On Wed, May 11, 2016 at 12:55:56PM +0100, Matt Fleming wrote:
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7842,13 +7842,13 @@ static inline void set_cpu_sd_state_busy
> > 	int cpu = smp_processor_id();
> > 
> > 	rcu_read_lock();
> > -	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> > +	sd = rcu_dereference(per_cpu(sd_llc, cpu));
> > 
> > 	if (!sd || !sd->nohz_idle)
> > 		goto unlock;
> > 	sd->nohz_idle = 0;
> > 
> > -	atomic_inc(&sd->groups->sgc->nr_busy_cpus);
> > +	atomic_inc(&sd->shared->nr_busy_cpus);
> > unlock:
> > 	rcu_read_unlock();
> > }
> 
> This breaks my POWER7 box which presumably doesn't have
> SD_SHARE_PKG_RESOURCES,

Hmm, PPC folks; what does your topology look like?
Currently your sched_domain_topology, as per arch/powerpc/kernel/smp.c,
seems to suggest your cores do not share cache at all.

https://en.wikipedia.org/wiki/POWER7 seems to agree, stating "4 MB L3
cache per C1 core".

And http://www-03.ibm.com/systems/resources/systems_power_software_i_perfmgmt_underthehood.pdf
also explicitly draws pictures with the L3 per core.

_However_, that same document describes L3 inter-core fill and lateral
cast-out, which sounds like the L3s work together to form a node-wide
caching system.

Do we want to model these co-operative L3 slices as a sort of node-wide
LLC for the purposes of the scheduler?

While we should definitely fix the assumption that an LLC exists (and I
also need to look at why it isn't set to the core domain instead), the
scheduler does try to scale things by 'assuming' LLC := node. It does
this for NOHZ, and the patches under discussion would do the same for
idle-core state.

Would this make sense for power, or should we somehow think of something
else?