From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S965327Ab1GOIhx (ORCPT );
	Fri, 15 Jul 2011 04:37:53 -0400
Received: from casper.infradead.org ([85.118.1.10]:33734 "EHLO casper.infradead.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S964947Ab1GOIhv convert rfc822-to-8bit (ORCPT );
	Fri, 15 Jul 2011 04:37:51 -0400
Subject: Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra
To: Anton Blanchard
Cc: mahesh@linux.vnet.ibm.com, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, mingo@elte.hu, benh@kernel.crashing.org,
	torvalds@linux-foundation.org
In-Reply-To: <20110715104547.29c3c509@kryten>
References: <20110707102107.GA16666@in.ibm.com>
	<1310036375.3282.509.camel@twins>
	<20110714103418.7ef25b68@kryten>
	<20110714143521.5fe4fab6@kryten>
	<1310649379.2586.273.camel@twins>
	<20110715104547.29c3c509@kryten>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 8BIT
Date: Fri, 15 Jul 2011 10:37:35 +0200
Message-ID: <1310719055.2586.285.camel@twins>
Mime-Version: 1.0
X-Mailer: Evolution 2.30.3
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 2011-07-15 at 10:45 +1000, Anton Blanchard wrote:
> Hi,
>
> > Urgh.. so those spans are generated by sched_domain_node_span(), and
> > it looks like that simply picks the 15 nearest nodes to the one we've
> > got without consideration for overlap with previously generated spans.
>
> I do wonder if we need this extra level at all on ppc64. From memory
> SGI added it for their massive setups, but our largest setup is 32 nodes
> and breaking that down into 16 node chunks seems overkill.
>
> I just realised we were setting NEWIDLE on our node definition and that
> was causing large amounts of rebalance work even with
> SD_NODES_PER_DOMAIN=16.
>
> After removing it and bumping SD_NODES_PER_DOMAIN to 32, things look
> pretty good.
>
> Perhaps we should allow an arch to override SD_NODES_PER_DOMAIN so this
> extra level is only used by SGI boxes.

We can certainly remove the whole topology layer that causes this
problem for 3.0 and try to fix it up again for 3.1.

But I was rather hoping to introduce more of those layers in the near
future: a layer per node_distance() value, such that the load-balancing
is aware of the interconnects.

Now for that I ran into the exact same problem, and at the time didn't
come up with a solution, but I think I now see a way out.

Something like the below ought to avoid the problem..
makes SGI sad though :-)

---
 kernel/sched.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 8fb4245..877b9f1 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7203,7 +7203,7 @@ static struct sched_domain_topology_level default_topology[] = {
 #endif
 	{ sd_init_CPU, cpu_cpu_mask, },
 #ifdef CONFIG_NUMA
-	{ sd_init_NODE, cpu_node_mask, },
+//	{ sd_init_NODE, cpu_node_mask, },
 	{ sd_init_ALLNODES, cpu_allnodes_mask, },
 #endif
 	{ NULL, },
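
To make the "layer per node_distance() value" idea above concrete, here is a
minimal standalone C sketch: it collects the distinct values from a toy 4-node
distance table, sorts them, and prints each node's span at every distance
level. The table and all names in it are invented for illustration; in the
kernel the values would come from node_distance(), and this is only a sketch
of the idea, not the eventual implementation. Note that spans at the same
level can still overlap between nodes, which is exactly the problem discussed
above, so domain construction built this way has to tolerate overlap.

/*
 * Sketch of "one topology level per node_distance() value".
 * The 4-node distance table is invented for illustration; in the
 * kernel the distances would come from the SLIT via node_distance().
 */
#include <stdio.h>
#include <stdlib.h>

#define NR_NODES 4

/* Toy interconnect: two pairs of close nodes, far from each other. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 20, 40, 40 },
	{ 20, 10, 40, 40 },
	{ 40, 40, 10, 20 },
	{ 40, 40, 20, 10 },
};

static int cmp_int(const void *a, const void *b)
{
	return *(const int *)a - *(const int *)b;
}

int main(void)
{
	int levels[NR_NODES * NR_NODES];
	int nr_levels = 0;
	int i, j, k, l;

	/* Collect the distinct distance values and sort them. */
	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = dist[i][j], seen = 0;

			for (k = 0; k < nr_levels; k++)
				if (levels[k] == d)
					seen = 1;
			if (!seen)
				levels[nr_levels++] = d;
		}
	}
	qsort(levels, nr_levels, sizeof(int), cmp_int);

	/*
	 * One level per distance: node i's span at level l is every
	 * node within levels[l].  Spans of different nodes at the same
	 * level can overlap, so the sched domains built from them must
	 * cope with overlap rather than assume disjoint groups.
	 */
	for (l = 0; l < nr_levels; l++) {
		printf("level %d (distance <= %d):\n", l, levels[l]);
		for (i = 0; i < NR_NODES; i++) {
			printf("  node %d span: {", i);
			for (j = 0; j < NR_NODES; j++)
				if (dist[i][j] <= levels[l])
					printf(" %d", j);
			printf(" }\n");
		}
	}
	return 0;
}

With the toy table this prints three levels (distances 10, 20 and 40), with
node 0's spans growing from {0} to {0 1} to all four nodes. Going by distance
values rather than a fixed SD_NODES_PER_DOMAIN is what makes the balancing
layers follow the actual interconnect instead of an arbitrary node count.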