From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1422735AbWA1AY1 (ORCPT ); Fri, 27 Jan 2006 19:24:27 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1422734AbWA1AY1 (ORCPT ); Fri, 27 Jan 2006 19:24:27 -0500
Received: from smtp.osdl.org ([65.172.181.4]:41876 "EHLO smtp.osdl.org")
	by vger.kernel.org with ESMTP id S1422730AbWA1AY0 (ORCPT );
	Fri, 27 Jan 2006 19:24:26 -0500
Date: Fri, 27 Jan 2006 16:26:11 -0800
From: Andrew Morton
To: Ravikiran G Thirumalai
Cc: dada1@cosmosbay.com, davem@davemloft.net, linux-kernel@vger.kernel.org,
	shai@scalex86.org, netdev@vger.kernel.org, pravins@calsoftinc.com
Subject: Re: [patch 3/4] net: Percpufy frequently used variables -- proto.sockets_allocated
Message-Id: <20060127162611.5d160638.akpm@osdl.org>
In-Reply-To: <20060128000100.GD3565@localhost.localdomain>
References: <20060126185649.GB3651@localhost.localdomain>
	<20060126190357.GE3651@localhost.localdomain>
	<43D9DFA1.9070802@cosmosbay.com>
	<20060127195227.GA3565@localhost.localdomain>
	<20060127121602.18bc3f25.akpm@osdl.org>
	<20060127224433.GB3565@localhost.localdomain>
	<20060127150106.38b9e041.akpm@osdl.org>
	<20060127150847.48c312c0.akpm@osdl.org>
	<20060128000100.GD3565@localhost.localdomain>
X-Mailer: Sylpheed version 1.0.0 (GTK+ 1.2.10; i386-vine-linux-gnu)
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
X-Mailing-List: linux-kernel@vger.kernel.org

Ravikiran G Thirumalai wrote:
>
> On Fri, Jan 27, 2006 at 03:08:47PM -0800, Andrew Morton wrote:
> > Andrew Morton wrote:
> > >
> > > Oh, and because vm_acct_memory() is counting a singleton object, it can use
> > > DEFINE_PER_CPU rather than alloc_percpu(), so it saves on a bit of kmalloc
> > > overhead.
> >
> > Actually, I don't think that's true. we're allocating a sizeof(long) with
> > kmalloc_node() so there shouldn't be memory wastage.
>
> Oh yeah there is. Each dynamic per-cpu object would have been at least
> (NR_CPUS * sizeof (void *) + num_cpus_possible * cacheline_size ).
> Now kmalloc_node will fall back on size-32 for allocation of long, so
> replace the cacheline_size above with 32 -- which then means dynamic per-cpu
> data are not on a cacheline boundary anymore (most modern cpus have
> 64-byte/128-byte cache lines) which means per-cpu data could end up false
> shared....
>

OK.  But isn't the core of the problem the fact that __alloc_percpu() is
using kmalloc_node() rather than a (new, as-yet-unimplemented)
kmalloc_cpu()?  kmalloc_cpu() wouldn't need the L1 cache alignment.

It might be worth creating just a small number of per-cpu slabs (4-byte,
8-byte).  A kmalloc_cpu() would just need a per-cpu array of
kmem_cache_t*'s and it'd internally use kmalloc_node(cpu_to_node), no?

Or we could just give __alloc_percpu() a custom, hand-rolled,
not-cacheline-padded sizeof(long) slab per CPU and use that if (size ==
sizeof(long)).  Or something.
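
To make the kmalloc_cpu() idea concrete, here is a rough userspace model of
it.  This is only a sketch under assumptions, not kernel code: kmalloc_cpu()
does not exist, struct cpu_pool is a hypothetical stand-in for a per-cpu
kmem_cache_t, malloc() stands in for a node-local allocation through
kmalloc_node() on cpu_to_node(cpu), and NR_CPUS and POOL_OBJS are arbitrary
illustrative values.  There is no locking and no free path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_CPUS   4	/* assumed; the kernel fixes this at build time */
#define POOL_OBJS 64	/* assumed capacity of each per-cpu pool */

/*
 * Hypothetical stand-in for one small per-cpu slab: objects are packed
 * back to back, with no cacheline padding between them.
 */
struct cpu_pool {
	size_t obj_size;	/* size class served by this pool */
	size_t used;		/* objects handed out so far */
	unsigned char *mem;	/* backing store private to one CPU */
};

/* The two size classes suggested above: 4-byte and 8-byte. */
static const size_t size_classes[] = { 4, 8 };
#define NR_CLASSES (sizeof(size_classes) / sizeof(size_classes[0]))

/* The "per-cpu array of caches": one pool per CPU per size class. */
static struct cpu_pool pools[NR_CPUS][NR_CLASSES];

/* Set up every pool.  Returns 0 on success, -1 if an allocation fails. */
int pools_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		for (size_t c = 0; c < NR_CLASSES; c++) {
			struct cpu_pool *p = &pools[cpu][c];

			p->obj_size = size_classes[c];
			p->used = 0;
			/* stands in for memory local to this CPU's node */
			p->mem = malloc(p->obj_size * POOL_OBJS);
			if (!p->mem)
				return -1;
		}
	}
	return 0;
}

/*
 * Hypothetical kmalloc_cpu(): serve the request from the smallest size
 * class that fits, using the pool that belongs to @cpu.  Returns NULL for
 * a bad cpu, an exhausted pool, or a size larger than any class (where a
 * real version would presumably fall back to kmalloc_node()).
 */
void *kmalloc_cpu(size_t size, int cpu)
{
	if (cpu < 0 || cpu >= NR_CPUS)
		return NULL;

	for (size_t c = 0; c < NR_CLASSES; c++) {
		struct cpu_pool *p = &pools[cpu][c];

		if (size > p->obj_size)
			continue;
		if (p->used == POOL_OBJS)
			return NULL;
		return p->mem + p->obj_size * p->used++;
	}
	return NULL;
}
```

After pools_init(), two consecutive kmalloc_cpu(sizeof(long), 0) calls return
addresses sizeof(long) apart.  Packing them that tightly is only safe because
everything in a given pool is touched by the same CPU, so neighbouring
objects cannot be false shared with another CPU.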