Date: Mon, 18 May 2009 22:20:31 +0200
From: Ingo Molnar
To: Linus Torvalds, "H. Peter Anvin", Pekka Enberg, Yinghai Lu
Cc: Jeff Garzik, Alexander Viro, Rusty Russell,
	Linux Kernel Mailing List, Andrew Morton, Peter Zijlstra
Subject: Re: [GIT PULL] scheduler fixes
Message-ID: <20090518202031.GA26549@elte.hu>

* Linus Torvalds wrote:

> On Mon, 18 May 2009, Ingo Molnar wrote:
> >
> > Something like the patch below. It also fixes ->span[] which has
> > a similar problem.
>
> Patch looks good to me.

ok. I've queued it up for .31, with your Acked-by. (which i assume
your reply implies?)

> > But ... i think this needs further clean-ups really. Either go
> > fully static, or go fully dynamic.
>
> I do agree that it would probably be good to try to avoid this
> static allocation, and allocate these data structures dynamically.
> However, if we end up having to use two different allocators
> anyway (one for bootup, and one for regular uptimes), then I think
> that would be an overall loss (compared to just the simplicity of
> statically doing this in a couple of places), rather than an
> overall win.
>
> > Would be nice if bootmem_alloc() was extended with such
> > properties - if SLAB is up (and bootmem is down) it would return
> > kmalloc(GFP_KERNEL) memory buffers.
>
> I would rather say the other way around: no "bootmem_alloc()" at
> all, but just have a regular alloc() that ends up working like the
> "SMP alternatives" code, but instead of being about SMP, it would
> be about how early in the boot sequence it is.
>
> That said, if there are just a couple of places like this that
> care, I don't think it's worth it. The static allocation isn't
> that horrible. I'd rather have a few ugly static allocations with
> comments about _why_ they look the way they do, than try to
> over-design things to look "clean".
>
> Simplicity is a good thing - even if it can then end up meaning
> special cases like this.
>
> That said, if we could move the kmalloc initialization up some
> more (and get at least the "boot node" data structures set up, and
> avoid any bootmem alloc issues _entirely_, then that would be
> good.
>
> I hate that stupid bootmem allocator. I suspect we seriously
> over-use it, and that we _should_ be able to do the SL*B init
> earlier.

Hm, tempting thought - not sure how to pull it off though.
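To make the bootmem_alloc() extension mentioned above a bit more
concrete, here is a rough sketch only - the boot_alloc() name is
invented, but slab_is_available() already exists as the obvious
switch-over test:

  /*
   * Illustrative sketch, not an actual patch: hand out bootmem while
   * we are still early in boot, and switch over to the slab allocator
   * transparently once it has been set up.
   */
  #include <linux/bootmem.h>
  #include <linux/slab.h>

  static void *boot_alloc(size_t size)
  {
          if (slab_is_available())
                  return kmalloc(size, GFP_KERNEL);

          /* slab is not up yet: fall back to the bootmem allocator */
          return alloc_bootmem(size);
  }

Freeing is the awkward part of course: the caller would have to know
which allocator a given buffer came from.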
Among the biggest users of bootmem are the mem_map[] hierarchies and
the page allocator bitmaps. Not sure we can get rid of bootmem there
- those areas are really large, physical memory is often fragmented
and we need a good NUMA sense for them as well.

We might also have a 22-architectures-to-fix problem before we can
get rid of bootmem:

  $ git grep alloc_bootmem arch/ | wc -l
  168

On x86 we recently switched some (but not all) early-pagetable
allocations to the 'early brk' method (which is an utterly simple
early linear allocator, for limited early dynamic allocations), but
even with that we still have ugly bootmem use - see for example the
after_bootmem hacks in arch/x86/mm/init_64.c.

So we have these increasingly complete layers of allocators, which
bootstrap each other gradually:

 - static, build-time allocations

 - early-brk (see extend_brk(), RESERVE_BRK and direct use of
   _brk_end in assembly code)

 - e820 based early allocator (reserve_early()) to bootstrap bootmem

 - bootmem - to bootstrap the page allocator [NUMA aware]

 - page allocator - to bootstrap SLAB

 - SLAB

That's 5 layers until we get to SLAB. Each layer has to be aware of
its own limits, has to interact with pagetable setup and has to
provide NUMA-aware dynamic allocations as early as possible.

And all this complexity definitely _feels_ utterly wrong, as we know
pretty early on what kind of memory we have and how it is laid out
amongst nodes. In the end we really just want to have the page
allocator and SL[AOQU]B.

Looks daunting.

	Ingo
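To make "utterly simple early linear allocator" above concrete: the
early-brk scheme is essentially just a bump pointer over a region
reserved in the kernel image. A rough sketch of the idea - not the
actual arch/x86 code, names and limit handling simplified:

  /*
   * Simplified sketch of a brk-style early linear allocator.  The
   * real thing is extend_brk() in arch/x86/kernel/setup.c; this only
   * illustrates the shape of it.
   */
  #include <linux/kernel.h>
  #include <linux/bug.h>
  #include <linux/string.h>

  static char early_brk_area[64 * 1024];	/* reserved in the image */
  static unsigned long early_brk_end = (unsigned long)early_brk_area;

  static void *early_brk_alloc(size_t size, size_t align)
  {
          void *ret;

          /* align the bump pointer, then hand out the next chunk */
          early_brk_end = ALIGN(early_brk_end, align);
          BUG_ON(early_brk_end + size >
                 (unsigned long)early_brk_area + sizeof(early_brk_area));

          ret = (void *)early_brk_end;
          early_brk_end += size;
          memset(ret, 0, size);

          return ret;
  }

The real extend_brk() also has to coordinate with the early pagetable
setup and the direct _brk_end uses mentioned above - which is where
the simplicity ends.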