Subject: Re: [regression] 3.0-rc boot failure -- bisected to cd4ea6ae3982
From: Peter Zijlstra
To: Anton Blanchard
Cc: mahesh@linux.vnet.ibm.com, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, mingo@elte.hu, benh@kernel.crashing.org, torvalds@linux-foundation.org
Date: Thu, 14 Jul 2011 15:16:19 +0200
Message-ID: <1310649379.2586.273.camel@twins>
In-Reply-To: <20110714143521.5fe4fab6@kryten>
References: <20110707102107.GA16666@in.ibm.com> <1310036375.3282.509.camel@twins> <20110714103418.7ef25b68@kryten> <20110714143521.5fe4fab6@kryten>

On Thu, 2011-07-14 at 14:35 +1000, Anton Blanchard wrote:
> I also printed out the cpu spans as we walk through build_sched_groups:
> 0 32 64 96 128 160 192 224 256 288 320 352 384 416 448 480
> Duplicates start appearing in this span:
> 128 160 192 224 256 288 320 352 384 416 448 480 512 544 576 608
>
> So it looks like the overlap of the 16 entry spans
> (SD_NODES_PER_DOMAIN) is causing our problem.

Urgh.. so those spans are generated by sched_domain_node_span(), and it
looks like that simply picks the 15 nearest nodes to the one we've got
without consideration for overlap with previously generated spans.

Now that used to work because it used to simply allocate a new group
instead of using the existing one. The thing is, we want to track state
unique to a group of cpus, so duplicating that is iffy.

Otoh, making these masks non-overlapping is probably sub-optimal from a
NUMA pov.

Looking at a slightly simpler set-up (4 socket AMD magny-cours):

$ cat /sys/devices/system/node/node*/distance
10 16 16 22 16 22 16 22
16 10 22 16 22 16 22 16
16 22 10 16 16 22 16 22
22 16 16 10 22 16 22 16
16 22 16 22 10 16 16 22
22 16 22 16 16 10 22 16
16 22 16 22 16 22 10 16
22 16 22 16 22 16 16 10

We can translate that into groups like:

{0} {0,1,2,4,6} {0-7}
{1} {1,0,3,5,7} {0-7}
...

and we can easily see there's overlap there as well in the NUMA layout
itself.

This seems to suggest we need to separate the unique state from the
sched_group. Now all I need is a way to not consume gobs of memory..
/me goes prod
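
[Editor's note: for reference, a small stand-alone sketch (not the kernel's
sched_domain_node_span() implementation) of how reading each node's row of
the distance table above as "nodes within one hop" produces the overlapping
spans being discussed. The node count and the distance threshold of 16 are
assumptions taken from the quoted magny-cours table.]

/*
 * Build, for each NUMA node, the set of nodes within a given distance,
 * the same way the groups {0,1,2,4,6}, {1,0,3,5,7}, ... were read off
 * the table above, then show that the resulting spans overlap.
 */
#include <stdio.h>

#define NR_NODES 8
#define NEAR     16	/* treat distance <= 16 as "near"; assumption for this example */

static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 22, 16, 22, 16, 22 },
	{ 16, 10, 22, 16, 22, 16, 22, 16 },
	{ 16, 22, 10, 16, 16, 22, 16, 22 },
	{ 22, 16, 16, 10, 22, 16, 22, 16 },
	{ 16, 22, 16, 22, 10, 16, 16, 22 },
	{ 22, 16, 22, 16, 16, 10, 22, 16 },
	{ 16, 22, 16, 22, 16, 22, 10, 16 },
	{ 22, 16, 22, 16, 22, 16, 16, 10 },
};

int main(void)
{
	unsigned int span[NR_NODES];
	int i, j;

	/* One span (bitmask of nodes) per node: everything "near" it. */
	for (i = 0; i < NR_NODES; i++) {
		span[i] = 0;
		for (j = 0; j < NR_NODES; j++)
			if (dist[i][j] <= NEAR)
				span[i] |= 1u << j;
	}

	/* Print each span. */
	for (i = 0; i < NR_NODES; i++) {
		printf("node %d span:", i);
		for (j = 0; j < NR_NODES; j++)
			if (span[i] & (1u << j))
				printf(" %d", j);
		printf("\n");
	}

	/* Spans built this way share nodes, i.e. they overlap. */
	for (i = 0; i < NR_NODES; i++)
		for (j = i + 1; j < NR_NODES; j++)
			if (span[i] & span[j])
				printf("spans %d and %d overlap\n", i, j);

	return 0;
}

Against the table above this prints "node 0 span: 0 1 2 4 6" and
"node 1 span: 0 1 3 5 7", i.e. exactly the overlapping groups quoted in the
mail; per the discussion, sched_domain_node_span() likewise gathers the
nearest nodes for each node (SD_NODES_PER_DOMAIN of them) without checking
against previously generated spans, which is where the duplicated
sched_groups come from.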