From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S965327Ab1GOIhx (ORCPT );
	Fri, 15 Jul 2011 04:37:53 -0400
Received: from casper.infradead.org ([85.118.1.10]:33734 "EHLO casper.infradead.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S964947Ab1GOIhv convert rfc822-to-8bit (ORCPT );
	Fri, 15 Jul 2011 04:37:51 -0400
Subject: Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra
To: Anton Blanchard
Cc: mahesh@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, mingo@elte.hu, benh@kernel.crashing.org,
	torvalds@linux-foundation.org
In-Reply-To: <20110715104547.29c3c509@kryten>
References: <20110707102107.GA16666@in.ibm.com>
	<1310036375.3282.509.camel@twins>
	<20110714103418.7ef25b68@kryten>
	<20110714143521.5fe4fab6@kryten>
	<1310649379.2586.273.camel@twins>
	<20110715104547.29c3c509@kryten>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8BIT
Date: Fri, 15 Jul 2011 10:37:35 +0200
Message-ID: <1310719055.2586.285.camel@twins>
Mime-Version: 1.0
X-Mailer: Evolution 2.30.3
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2011-07-15 at 10:45 +1000, Anton Blanchard wrote:
> Hi,
>
> > Urgh.. so those spans are generated by sched_domain_node_span(), and
> > it looks like that simply picks the 15 nearest nodes to the one we've
> > got without consideration for overlap with previously generated spans.
>
> I do wonder if we need this extra level at all on ppc64. From memory
> SGI added it for their massive setups, but our largest setup is 32 nodes
> and breaking that down into 16 node chunks seems overkill.
>
> I just realised we were setting NEWIDLE on our node definition and that
> was causing large amounts of rebalance work even with
> SD_NODES_PER_DOMAIN=16.
>
> After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
> pretty good.
>
> Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
> extra level is only used by SGI boxes.

We can certainly remove the whole topology layer that causes this
problem for 3.0 and try to fix it up again for 3.1.

But I was rather hoping to introduce more of those layers in the near
future: a layer per node_distance() value, such that the load-balancing
is aware of the interconnects.

Now for that I ran into the exact same problem, and at the time didn't
come up with a solution, but I think I now see a way out.

Something like the below ought to avoid the problem..
makes SGI sad though :-)

---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..877b9f1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7203,7 +7203,7 @@ static struct sched_domain_topology_level default_topology[] = {
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+//	{ sd_init_NODE, cpu_node_mask, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },
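
To make the "layer per node_distance() value" idea above concrete, here is a
minimal standalone C sketch: it collects the distinct values from a toy 4-node
distance table, sorts them, and prints each node's span at every distance
level. The table and all names in it are invented for illustration; in the
kernel the values would come from node_distance(), and this is only a sketch
of the idea, not the eventual implementation. Note that spans at the same
level can still overlap between nodes, which is exactly the problem discussed
above, so domain construction built this way has to tolerate overlap.

/*
 * Sketch of "one topology level per node_distance() value".
 * The 4-node distance table is invented for illustration; in the
 * kernel the distances would come from the SLIT via node_distance().
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_NODES 4

/* Toy interconnect: two pairs of close nodes, far from each other. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static int cmp_int(const void *a, const void *b)
{
	return *(const int *)a - *(const int *)b;
}

int main(void)
{
	int levels[NR_NODES * NR_NODES];
	int nr_levels = 0;
	int i, j, k, l;

	/* Collect the distinct distance values and sort them. */
	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = dist[i][j], seen = 0;

			for (k = 0; k < nr_levels; k++)
				if (levels[k] == d)
					seen = 1;
			if (!seen)
				levels[nr_levels++] = d;
		}
	}
	qsort(levels, nr_levels, sizeof(int), cmp_int);

	/*
	 * One level per distance: node i's span at level l is every
	 * node within levels[l].  Spans of different nodes at the same
	 * level can overlap, so the sched domains built from them must
	 * cope with overlap rather than assume disjoint groups.
	 */
	for (l = 0; l < nr_levels; l++) {
		printf("level %d (distance <= %d):\n", l, levels[l]);
		for (i = 0; i < NR_NODES; i++) {
			printf("  node %d span: {", i);
			for (j = 0; j < NR_NODES; j++)
				if (dist[i][j] <= levels[l])
					printf(" %d", j);
			printf(" }\n");
		}
	}
	return 0;
}

With the toy table this prints three levels (distances 10, 20 and 40), with
node 0's spans growing from {0} to {0 1} to all four nodes. Going by distance
values rather than a fixed SD_NODES_PER_DOMAIN is what makes the balancing
layers follow the actual interconnect instead of an arbitrary node count.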