Message-ID: <43DAC222.4060805@us.ibm.com>
Date: Fri, 27 Jan 2006 17:00:18 -0800
From: Matthew Dobson
To: Paul Jackson
CC: clameter@engr.sgi.com, linux-kernel@vger.kernel.org, sri@us.ibm.com,
    andrea@suse.de, pavel@suse.cz, linux-mm@kvack.org
Subject: Re: [patch 3/9] mempool - Make mempools NUMA aware
References: <20060125161321.647368000@localhost.localdomain>
 <1138233093.27293.1.camel@localhost.localdomain>
 <43D953C4.5020205@us.ibm.com> <43D95A2E.4020002@us.ibm.com>
 <43D96633.4080900@us.ibm.com> <43D96A93.9000600@us.ibm.com>
 <20060127025126.c95f8002.pj@sgi.com>
In-Reply-To: <20060127025126.c95f8002.pj@sgi.com>

Paul Jackson wrote:
> Matthew wrote:
>
>> I'm glad we're on the same page now. :)  And yes, adding four
>> "duplicate" *_mempool allocators was not my first choice, but I
>> couldn't easily see a better way.
>
> I hope the following comments aren't too far off target.
>
> I too am inclined to prefer the __GFP_CRITICAL approach over this.

OK.  Chalk one more up for that solution...

> That or Andrea's suggestion, which except for a free hook, was entirely
> outside of the page_alloc.c code paths.

This is supposed to be an implementation of Andrea's suggestion.  There
are no hooks in ANY page_alloc.c code paths.  These patches touch mempool
code and some slab code, but not any page allocator code.

> Or Alan's suggested revival of the old code to drop non-critical
> network packets under duress.

Dropping non-critical packets is still in our plan, but I don't think
that is a FULL solution.  As we mentioned before on that topic, you can't
tell whether a packet is critical until AFTER you receive it, by which
point it has already had an skbuff (hopefully) allocated for it.  If your
network traffic is coming in faster than you can receive, examine, and
drop non-critical packets, you're hosed.  I still think some sort of
reserve pool is necessary to give the networking stack a little breathing
room when under both memory pressure and network load.

> I am tempted to think you've taken an approach that raised some
> substantial looking issues:
>
> * how to tell the system when to use the emergency pool

We've dropped the whole "in_emergency" thing.  The system uses the
emergency pool when the normal pool (i.e., the buddy allocator) is out of
pages.

> * this doesn't really solve the problem (network can still starve)

Only if the pool is not large enough.  One can argue that sizing the pool
appropriately is impossible (theoretical incoming traffic over a GigE
card or two for a minute or two is extremely large), but then I guess we
shouldn't even try to fix the problem...?

> * it wastes memory most of the time

True.  Any "real" reserve system will suffer from that problem.
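(To make the reserve idea concrete, here is a rough sketch, not the
actual patches: a subsystem pre-populates a mempool at init time, and the
pre-allocated objects are exactly the memory that sits idle most of the
time.  The rx_desc names, object size, and pool depth below are all made
up for illustration.)

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* Hypothetical numbers, purely for illustration. */
#define RX_RESERVE_NR	64	/* objects held back as the reserve */
#define RX_DESC_SIZE	256	/* size of each reserved object     */

static mempool_t *rx_pool;

static int __init rx_reserve_init(void)
{
	/* Pre-allocate RX_RESERVE_NR kmalloc'd objects up front.  This is
	 * the memory that sits "wasted" until the system is under pressure. */
	rx_pool = mempool_create(RX_RESERVE_NR, mempool_kmalloc,
				 mempool_kfree,
				 (void *)(unsigned long)RX_DESC_SIZE);
	return rx_pool ? 0 : -ENOMEM;
}

static void *rx_desc_get(void)
{
	/* mempool_alloc() tries the normal allocator first and only hands
	 * out a reserved object when that fails; with GFP_ATOMIC it returns
	 * NULL instead of sleeping once the reserve is exhausted too. */
	return mempool_alloc(rx_pool, GFP_ATOMIC);
}

static void rx_desc_put(void *desc)
{
	/* Freed objects refill the reserve before going back to the system. */
	mempool_free(desc, rx_pool);
}

A GFP_ATOMIC caller then uses rx_desc_get()/rx_desc_put() in place of
kmalloc()/kfree() and can keep limping along for a while after the buddy
allocator starts failing.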
Ben LaHaise suggested a reserve system that allows the reserve pages to
be used for trivially reclaimable allocations while not in active use.
An interesting idea.  Regardless, the Linux VM sorta already wastes
memory by keeping min_free_kbytes around, no?

> * it doesn't really improve on GFP_ATOMIC

I disagree.  It improves on GFP_ATOMIC by giving it a second chance.  If
you've got a GFP_ATOMIC allocation that is particularly critical, using a
mempool to back it means that you can keep going for a while when the
rest of the system OOMs/goes into SWAP hell/etc.

> and just added another substantial looking issue:
>
> * it entwines another thread of complexity and performance costs
>   into the important memory allocation code path.

I can't say that it doesn't add any complexity to an important memory
allocation path, but I don't think it is a significant amount of
complexity.  It is just a pointer check in kmem_getpages()...

>> With large machines, especially as those large machines' workloads are
>> more and more likely to be partitioned with something like cpusets,
>> you want to be able to specify where you want your reserve pool to
>> come from.
>
> Cpusets is about performance, not correctness.  Anytime I get cornered
> in the cpuset code, I prefer violating the cpuset containment, over
> serious system failure.

Fair enough.  But if we can keep the same baseline performance and add
this new feature, I'd like to do that.  Doing our best to allocate on a
particular node when requested isn't too much to ask.

-Matt
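P.S.  For anyone wondering what "just a pointer check in kmem_getpages()"
amounts to, the shape of it is roughly the sketch below.  Illustrative
only, not the actual patch: the helper name and the idea of passing in a
page-backed reserve mempool are shorthand here, not the real field or
function names.

#include <linux/gfp.h>
#include <linux/mempool.h>

/*
 * Illustrative sketch: the slab page-allocation path gets one extra
 * test.  If the buddy allocator comes up empty and a reserve pool is
 * attached, take a page from the reserve instead of failing outright.
 * Assumes the pool's alloc/free callbacks trade in struct page pointers.
 */
static struct page *slab_get_page(int nodeid, gfp_t flags,
				  unsigned int order, mempool_t *pool)
{
	struct page *page;

	page = alloc_pages_node(nodeid, flags, order);
	if (!page && pool)
		page = mempool_alloc(pool, flags);	/* the pointer check */
	return page;
}

The hot path is untouched; the extra test only matters once
alloc_pages_node() has already failed.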