Subject: Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra
To: Anton Blanchard
Cc: mahesh@linux.vnet.ibm.com, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, mingo@elte.hu, benh@kernel.crashing.org, torvalds@linux-foundation.org
Date: Thu, 14 Jul 2011 15:16:19 +0200
Message-ID: <1310649379.2586.273.camel@twins>
In-Reply-To: <20110714143521.5fe4fab6@kryten>
References: <20110707102107.GA16666@in.ibm.com> <1310036375.3282.509.camel@twins> <20110714103418.7ef25b68@kryten> <20110714143521.5fe4fab6@kryten>

On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote:
> I also printed out the cpu spans as we walk through build_sched_groups:
> 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
> Duplicates start appearing in this span:
> 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
>
> So it looks like the overlap of the 16 entry spans
> (SD_NODES_PER_DOMAIN) is causing our problem.

Urgh.. so those spans are generated by sched_domain_node_span(), and it
looks like that simply picks the 15 nearest nodes to the one we've got
without consideration for overlap with previously generated spans.

Now that used to work because it used to simply allocate a new group
instead of using the existing one. The thing is, we want to track state
unique to a group of cpus, so duplicating that is iffy.

Otoh, making these masks non-overlapping is probably sub-optimal from a
NUMA pov.

Looking at a slightly simpler set-up (4 socket AMD magny-cours):

$ cat /sys/devices/system/node/node*/distance
10 16 16 22 16 22 16 22
16 10 22 16 22 16 22 16
16 22 10 16 16 22 16 22
22 16 16 10 22 16 22 16
16 22 16 22 10 16 16 22
22 16 22 16 16 10 22 16
16 22 16 22 16 22 10 16
22 16 22 16 22 16 16 10

We can translate that into groups like:

{0} {0,1,2,4,6} {0-7}
{1} {1,0,3,5,7} {0-7}
...

and we can easily see there's overlap there as well in the NUMA layout
itself.

This seems to suggest we need to separate the unique state from the
sched_group. Now all I need is a way to not consume gobs of memory..
/me goes prod
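
[Editor's note: for reference, a small stand-alone sketch (not the kernel's
sched_domain_node_span() implementation) of how reading each node's row of
the distance table above as "nodes within one hop" produces the overlapping
spans being discussed. The node count and the distance threshold of 16 are
assumptions taken from the quoted magny-cours table.]

/*
 * Build, for each NUMA node, the set of nodes within a given distance,
 * the same way the groups {0,1,2,4,6}, {1,0,3,5,7}, ... were read off
 * the table above, then show that the resulting spans overlap.
 */
#include <stdio.h>

#define NR_NODES 8
#define NEAR     16	/* treat distance <= 16 as "near"; assumption for this example */

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 22, 16, 22, 16, 22 },
	{ 16, 10, 22, 16, 22, 16, 22, 16 },
	{ 16, 22, 10, 16, 16, 22, 16, 22 },
	{ 22, 16, 16, 10, 22, 16, 22, 16 },
	{ 16, 22, 16, 22, 10, 16, 16, 22 },
	{ 22, 16, 22, 16, 16, 10, 22, 16 },
	{ 16, 22, 16, 22, 16, 22, 10, 16 },
	{ 22, 16, 22, 16, 22, 16, 16, 10 },
};

int main(void)
{
	unsigned int span[NR_NODES];
	int i, j;

	/* One span (bitmask of nodes) per node: everything "near" it. */
	for (i = 0; i < NR_NODES; i++) {
		span[i] = 0;
		for (j = 0; j < NR_NODES; j++)
			if (dist[i][j] <= NEAR)
				span[i] |= 1u << j;
	}

	/* Print each span. */
	for (i = 0; i < NR_NODES; i++) {
		printf("node %d span:", i);
		for (j = 0; j < NR_NODES; j++)
			if (span[i] & (1u << j))
				printf(" %d", j);
		printf("\n");
	}

	/* Spans built this way share nodes, i.e. they overlap. */
	for (i = 0; i < NR_NODES; i++)
		for (j = i + 1; j < NR_NODES; j++)
			if (span[i] & span[j])
				printf("spans %d and %d overlap\n", i, j);

	return 0;
}

Against the table above this prints "node 0 span: 0 1 2 4 6" and
"node 1 span: 0 1 3 5 7", i.e. exactly the overlapping groups quoted in the
mail; per the discussion, sched_domain_node_span() likewise gathers the
nearest nodes for each node (SD_NODES_PER_DOMAIN of them) without checking
against previously generated spans, which is where the duplicated
sched_groups come from.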