* Re: [Patch 3/3] prepopulate/cache cleared pages
@ 2006-02-23 21:10 Chuck Ebbert
  2006-02-23 21:18 ` Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Chuck Ebbert @ 2006-02-23 21:10 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: linux-kernel, Andi Kleen, Andrew Morton
In-Reply-To: <1140686994.4672.4.camel@laptopd505.fenrus.org>

On Thu, 23 Feb 2006 at 10:29:54 +0100, Arjan van de Ven wrote:

> Index: linux-work/mm/mempolicy.c
> ===================================================================
> --- linux-work.orig/mm/mempolicy.c
> +++ linux-work/mm/mempolicy.c
> @@ -1231,6 +1231,13 @@ alloc_page_vma(gfp_t gfp, struct vm_area
>  {
>  	struct mempolicy *pol = get_vma_policy(current, vma, addr);
>
> +	if ( (gfp & __GFP_ZERO) && current->cleared_page) {
> +		struct page *addr;
> +		addr = current->cleared_page;
> +		current->cleared_page = NULL;
> +		return addr;
> +	}
> +
>  	cpuset_update_task_memory_state();
>
>  	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
> @@ -1242,6 +1249,36 @@ alloc_page_vma(gfp_t gfp, struct vm_area
>  	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
>  }
>
> +
> +/**
> + * prepare_cleared_page - populate the per-task zeroed-page cache
> + *
> + * This function populates the per-task cache with one zeroed page
> + * (if there wasn't one already)
> + * The idea is that this (expensive) clearing is done before any
> + * locks are taken, speculatively, and that when the page is actually
> + * needed under a lock, it is ready for immediate use
> + */
> +
> +void prepare_cleared_page(void)
> +{
> +	struct mempolicy *pol = current->mempolicy;
> +
> +	if (current->cleared_page)
> +		return;
> +
> +	cpuset_update_task_memory_state();
> +
> +	if (!pol)
> +		pol = &default_policy;
> +	if (pol->policy == MPOL_INTERLEAVE)
> +		current->cleared_page = alloc_page_interleave(
> +			GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));

======> else ???

> +	current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
> +			0, zonelist_policy(GFP_USER, pol));
> +}
> +
> +
>  /**
>   * alloc_pages_current - Allocate pages.
>   *

--
Chuck
"Equations are the Devil's sentences." --Stephen Colbert

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 21:10 [Patch 3/3] prepopulate/cache cleared pages Chuck Ebbert
@ 2006-02-23 21:18 ` Arjan van de Ven
  0 siblings, 0 replies; 22+ messages in thread
From: Arjan van de Ven @ 2006-02-23 21:18 UTC (permalink / raw)
To: Chuck Ebbert; +Cc: linux-kernel

On Thu, 2006-02-23 at 16:10 -0500, Chuck Ebbert wrote:
> > +	if (pol->policy == MPOL_INTERLEAVE)
> > +		current->cleared_page = alloc_page_interleave(
> > +			GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));
>
> ======> else ???
>
> > +	current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
> > +			0, zonelist_policy(GFP_USER, pol));
> > +}
> > +

good catch, thanks!

^ permalink raw reply	[flat|nested] 22+ messages in thread
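Without the missing `else`, the interleave branch's freshly allocated page is
immediately overwritten (and thus leaked) by the unconditional
`__alloc_pages()` call that follows. For reference, the obvious corrected form
of the function's tail -- implied by the exchange above, though no revised
patch appears in this thread:

	if (pol->policy == MPOL_INTERLEAVE)
		current->cleared_page = alloc_page_interleave(
			GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));
	else
		current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
				0, zonelist_policy(GFP_USER, pol));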
* Re: [Patch 3/3] prepopulate/cache cleared pages
@ 2006-02-23 20:02 Chuck Ebbert
0 siblings, 0 replies; 22+ messages in thread
From: Chuck Ebbert @ 2006-02-23 20:02 UTC (permalink / raw)
To: Andi Kleen; +Cc: Andrew Morton, Arjan van de Ven, Ingo Molnar, linux-kernel
In-Reply-To: <200602231406.43899.ak@suse.de>
On Thu, 23 Feb 2006 at 14:06:43 +0100, Andi Kleen wrote:
> e.g. you might notice that a lot of patches from new contributors
> go smoother into x86-64 than into i386.
That's because, strangely enough, i386 doesn't have an official maintainer.
--
Chuck
"Equations are the Devil's sentences." --Stephen Colbert
^ permalink raw reply [flat|nested] 22+ messages in thread
* [Patch 0/3] threaded mmap tweaks
@ 2006-02-23  9:17 Arjan van de Ven
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
  0 siblings, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2006-02-23 9:17 UTC (permalink / raw)
To: linux-kernel; +Cc: akpm

Hi,

I've been looking at a micro-benchmark that basically starts a few
threads which then each allocate some memory (via mmap), use it briefly
and then free it again (munmap). During this benchmark the mmap_sem
gets contended and as a result things go less well than expected. The
patches in this series improved the benchmark by 3% on a wallclock time
basis.

Greetings,
   Arjan van de Ven

^ permalink raw reply	[flat|nested] 22+ messages in thread
* [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:17 [Patch 0/3] threaded mmap tweaks Arjan van de Ven
@ 2006-02-23  9:29 ` Arjan van de Ven
  2006-02-23  9:41 ` Andi Kleen
  2006-02-23 18:25 ` Paul Jackson
  0 siblings, 2 replies; 22+ messages in thread
From: Arjan van de Ven @ 2006-02-23 9:29 UTC (permalink / raw)
To: linux-kernel; +Cc: ak, akpm

This patch adds an entry for a cleared page to the task struct. The main
purpose of this patch is to be able to pre-allocate and clear a page in a
pagefault scenario before taking any locks (esp mmap_sem),
opportunistically. Allocating+clearing a page is a very expensive
operation that currently increases lock hold times quite a bit (in a
threaded environment that allocates/uses/frees memory on a regular basis,
this leads to contention).

This is probably the most controversial patch of the 3, since there is
a potential to take up 1 page per thread in this cache. In practice it's
not as bad as it sounds (a large fraction of the pagefaults are anonymous
and thus immediately use up the page). One could argue "let the VM reap
these" but that has a few downsides: it increases locking needs but, more
importantly, clearing a page is relatively expensive -- if the VM reaps
the page again in case it wasn't needed, the work was just wasted.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

---
 arch/x86_64/mm/fault.c |    2 ++
 include/linux/mm.h     |    1 +
 include/linux/sched.h  |    1 +
 kernel/exit.c          |    4 ++++
 kernel/fork.c          |    1 +
 mm/mempolicy.c         |   37 +++++++++++++++++++++++++++++++++++++
 6 files changed, 46 insertions(+)

Index: linux-work/arch/x86_64/mm/fault.c
===================================================================
--- linux-work.orig/arch/x86_64/mm/fault.c
+++ linux-work/arch/x86_64/mm/fault.c
@@ -375,6 +375,8 @@ asmlinkage void __kprobes do_page_fault(
 		goto bad_area_nosemaphore;

 again:
+	prepare_cleared_page();
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunatly, in the case of an

Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h
+++ linux-work/include/linux/mm.h
@@ -1052,6 +1052,7 @@ void drop_pagecache(void);
 void drop_slab(void);

 extern void prepopulate_vma(void);
+extern void prepare_cleared_page(void);

 extern int randomize_va_space;

Index: linux-work/include/linux/sched.h
===================================================================
--- linux-work.orig/include/linux/sched.h
+++ linux-work/include/linux/sched.h
@@ -839,6 +839,7 @@ struct task_struct {
 	struct reclaim_state *reclaim_state;

 	struct vm_area_struct *free_vma_cache; /* keep 1 free vma around as cache */
+	struct page *cleared_page; /* optionally keep 1 cleared page around */

 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;

Index: linux-work/kernel/exit.c
===================================================================
--- linux-work.orig/kernel/exit.c
+++ linux-work/kernel/exit.c
@@ -882,6 +882,10 @@ fastcall NORET_TYPE void do_exit(long co
 		kmem_cache_free(vm_area_cachep, tsk->free_vma_cache);
 	tsk->free_vma_cache = NULL;

+	if (tsk->cleared_page)
+		__free_page(tsk->cleared_page);
+	tsk->cleared_page = NULL;
+
 	/* PF_DEAD causes final put_task_struct after we schedule. */
 	preempt_disable();
 	BUG_ON(tsk->flags & PF_DEAD);

Index: linux-work/kernel/fork.c
===================================================================
--- linux-work.orig/kernel/fork.c
+++ linux-work/kernel/fork.c
@@ -180,6 +180,7 @@ static struct task_struct *dup_task_stru
 	atomic_set(&tsk->usage,2);
 	atomic_set(&tsk->fs_excl, 0);
 	tsk->free_vma_cache = NULL;
+	tsk->cleared_page = NULL;

 	return tsk;
 }

Index: linux-work/mm/mempolicy.c
===================================================================
--- linux-work.orig/mm/mempolicy.c
+++ linux-work/mm/mempolicy.c
@@ -1231,6 +1231,13 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);

+	if ( (gfp & __GFP_ZERO) && current->cleared_page) {
+		struct page *addr;
+		addr = current->cleared_page;
+		current->cleared_page = NULL;
+		return addr;
+	}
+
 	cpuset_update_task_memory_state();

 	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
@@ -1242,6 +1249,36 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }

+
+/**
+ * prepare_cleared_page - populate the per-task zeroed-page cache
+ *
+ * This function populates the per-task cache with one zeroed page
+ * (if there wasn't one already).
+ * The idea is that this (expensive) clearing is done before any
+ * locks are taken, speculatively, and that when the page is actually
+ * needed under a lock, it is ready for immediate use.
+ */
+
+void prepare_cleared_page(void)
+{
+	struct mempolicy *pol = current->mempolicy;
+
+	if (current->cleared_page)
+		return;
+
+	cpuset_update_task_memory_state();
+
+	if (!pol)
+		pol = &default_policy;
+	if (pol->policy == MPOL_INTERLEAVE)
+		current->cleared_page = alloc_page_interleave(
+			GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));
+	current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
+			0, zonelist_policy(GFP_USER, pol));
+}
+
+
 /**
  * alloc_pages_current - Allocate pages.
  *

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
@ 2006-02-23  9:41 ` Andi Kleen
  2006-02-23 12:41 ` Ingo Molnar
  2006-02-23 18:25 ` Paul Jackson
  1 sibling, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2006-02-23 9:41 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: linux-kernel, akpm

On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> This patch adds an entry for a cleared page to the task struct. The main
> purpose of this patch is to be able to pre-allocate and clear a page in a
> pagefault scenario before taking any locks (esp mmap_sem),
> opportunistically. Allocating+clearing a page is a very expensive
> operation that currently increases lock hold times quite a bit (in a
> threaded environment that allocates/uses/frees memory on a regular basis,
> this leads to contention).
>
> This is probably the most controversial patch of the 3, since there is
> a potential to take up 1 page per thread in this cache. In practice it's
> not as bad as it sounds (a large fraction of the pagefaults are anonymous
> and thus immediately use up the page). One could argue "let the VM reap
> these" but that has a few downsides: it increases locking needs but, more
> importantly, clearing a page is relatively expensive -- if the VM reaps
> the page again in case it wasn't needed, the work was just wasted.

Looks like an incredibly bad hack. What workload was that again where
it helps? And how much? I think before we can consider adding that ugly
code you would need a far better rationale.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:41 ` Andi Kleen
@ 2006-02-23 12:41 ` Ingo Molnar
  2006-02-23 13:06 ` Andi Kleen
  2006-02-28 22:30 ` Pavel Machek
  0 siblings, 2 replies; 22+ messages in thread
From: Ingo Molnar @ 2006-02-23 12:41 UTC (permalink / raw)
To: Andi Kleen; +Cc: Arjan van de Ven, linux-kernel, akpm

* Andi Kleen <ak@suse.de> wrote:

> On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> > This patch adds an entry for a cleared page to the task struct. The main
> > purpose of this patch is to be able to pre-allocate and clear a page in a
> > pagefault scenario before taking any locks (esp mmap_sem),
> > opportunistically. Allocating+clearing a page is a very expensive
> > operation that currently increases lock hold times quite a bit (in a
> > threaded environment that allocates/uses/frees memory on a regular
> > basis, this leads to contention).
> >
> > This is probably the most controversial patch of the 3, since there is
> > a potential to take up 1 page per thread in this cache. In practice it's
> > not as bad as it sounds (a large fraction of the pagefaults are anonymous
> > and thus immediately use up the page). One could argue "let the VM reap
> > these" but that has a few downsides: it increases locking needs but,
> > more importantly, clearing a page is relatively expensive -- if the VM
> > reaps the page again in case it wasn't needed, the work was just wasted.
>
> Looks like an incredibly bad hack. What workload was that again where
> it helps? And how much? I think before we can consider adding that
> ugly code you would need a far better rationale.

yes, the patch is controversial technologically, and Arjan pointed it
out above. This is nothing new - and Arjan probably submitted this to
lkml so that he can get constructive feedback.

What Arjan did is quite nifty, as it moves the page clearing out from
under the mmap_sem-held critical section. How that is achieved is
really secondary; it's pretty clear that it could be done in some nicer
way. Furthermore, we haven't seen any significant advance in that area
of the kernel [threaded mmap() performance] in quite some time, so
contributions are both welcome and encouraged.

But Andi, regarding the tone of your reply - it is really getting out
of hand! Please lean back and take a deep breath. Maybe you should even
sleep over it. And when you come back, please start being _much_ more
respectful of other people's work. You are not doing Linux _any_ favor
by flaming away so mindlessly ... Arjan is a longtime contributor and
he can probably take your flames just fine (as I took your flames just
fine the other day), but we should really use a _much_ nicer tone on
lkml.

You might not realize it, but replies like this can scare away novice
contributors forever! You could scare away the next DaveM. Or the next
Alan Cox. Or the next Andi Kleen. Heck, much milder replies can scare
away even longtime contributors: see Rusty Russell's comments from a
couple of days ago ...

And no, I don't accept the lame "don't come into the kitchen if you
can't stand the flames" excuse: your reply was totally uncalled for,
was totally undeserved and was totally unnecessary. It was incredibly
mean spirited, and the only effect it can possibly have on the
recipient is harm. And it's not like this happened in the heat of a
longer discussion or due to some misunderstanding: you flamed away
_right at the beginning_, right as the first reply to the patch
submission. Shame on you!

Is it that hard to reply: "Hm, nice idea, I wish it were cleaner though
because <insert explanation here>. How about <insert nifty idea
here>."?

[ Btw., I could possibly have come up with a good solution for this
  during the time I had to waste on this reply. ]

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 12:41 ` Ingo Molnar
@ 2006-02-23 13:06 ` Andi Kleen
  2006-02-23 13:15 ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2006-02-23 13:06 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Arjan van de Ven, linux-kernel, akpm

On Thursday 23 February 2006 13:41, Ingo Molnar wrote:
>
> What Arjan did is quite nifty, as it moves the page clearing out from
> under the mmap_sem-held critical section.

So that was the point, not the rescheduling under lock? Or both?

BTW, since it touches your area of work, you could comment on what you
think about not using voluntary preempt points for fast sleep locks,
like I later proposed.

> How that is achieved is
> really secondary; it's pretty clear that it could be done in some nicer
> way.

Great, we agree then.

>
> And no, I don't accept the lame "don't come into the kitchen if you
> can't stand the flames" excuse: your reply was totally uncalled for,
> was totally undeserved

Well, he didn't supply any data, so I asked for more.

> and was totally unnecessary. It was incredibly
> mean spirited,

Sorry, but I don't think that's true. Mean spirited would be "we don't
care, go away". When I think that, I generally don't answer the email.

I could have perhaps worded it a bit nicer, agreed, but I think the
core of my reply - we need more analysis for that - was quite
constructive. At least for one of the subproblems I even proposed a
better solution. If there is more analysis of the problem maybe I can
help with more of it.

Also it's probably quite clear that adding lots of special-purpose
caches to task_struct for narrow purposes isn't a good way to do
optimization.

The mail was perhaps a bit harsher than it should have been because I
think Arjan should have really known better...

> You might not realize it, but replies like this can scare away novice
> contributors forever! You could scare away the next DaveM. Or the next
> Alan Cox. Or the next Andi Kleen. Heck, much milder replies can scare
> away even longtime contributors: see Rusty Russell's comments from a
> couple of days ago ...

I must say I'm feeling a bit unfairly attacked here because I think I
generally try to help new patch submitters (at least if their basic
ideas are sound even if some details are wrong), e.g. you might notice
that a lot of patches from new contributors go smoother into x86-64
than into i386.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 13:06 ` Andi Kleen
@ 2006-02-23 13:15 ` Nick Piggin
  2006-02-23 13:29 ` Ingo Molnar
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2006-02-23 13:15 UTC (permalink / raw)
To: Andi Kleen; +Cc: Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

Andi Kleen wrote:
> On Thursday 23 February 2006 13:41, Ingo Molnar wrote:
>
>> What Arjan did is quite nifty, as it moves the page clearing out from
>> under the mmap_sem-held critical section.
>
> So that was the point, not the rescheduling under lock? Or both?
>
> BTW, since it touches your area of work, you could comment on what you
> think about not using voluntary preempt points for fast sleep locks,
> like I later proposed.
>
>> How that is achieved is
>> really secondary; it's pretty clear that it could be done in some
>> nicer way.
>
> Great, we agree then.
>

I'm worried about the situation where we allocate but don't use the
new page: it blows quite a bit of cache. Then, when we do get around
to using it, it will be cold(er).

Good to see someone trying to improve this area, though.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 13:15 ` Nick Piggin
@ 2006-02-23 13:29 ` Ingo Molnar
  2006-02-24  6:36 ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2006-02-23 13:29 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> I'm worried about the situation where we allocate but don't use the
> new page: it blows quite a bit of cache. Then, when we do get around
> to using it, it will be cold(er).

couldn't the new pte be flipped in atomically via cmpxchg? That way we
could do the page clearing close to where we are doing it now, but
without holding the mmap_sem.

to solve the pte races we could use a bit in the [otherwise empty] pte
to signal "this pte can be flipped in from now on", which bit would
automatically be cleared if mprotect() or munmap() is called over that
range (without any extra changes to those codepaths). (in the rare case
that the cmpxchg() fails, we go into a slowpath that drops the newly
allocated page, re-looks up the vma and the pte, etc.)

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread
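For illustration only, the flip Ingo describes might look something like the
sketch below. PTE_FLIP_ALLOWED and try_flip_cleared_page() are hypothetical
names invented here (no implementation was ever posted), and the sketch
deliberately ignores the question the replies below raise, namely what keeps
the page-table page itself alive without mmap_sem:

	/*
	 * Hypothetical sketch of the cmpxchg pte flip.  The otherwise-empty
	 * pte carries a software bit meaning "speculative install allowed";
	 * mprotect()/munmap() over the range would clear that bit, so any
	 * racing unmap makes the cmpxchg fail and we take the slow path.
	 */
	#define PTE_FLIP_ALLOWED	_PAGE_UNUSED1	/* assumed free sw bit */

	static int try_flip_cleared_page(pte_t *ptep, struct page *page,
					 pgprot_t prot)
	{
		pte_t old = __pte(PTE_FLIP_ALLOWED);	/* empty + flip bit */
		pte_t new = mk_pte(page, prot);

		if (cmpxchg(&ptep->pte, pte_val(old), pte_val(new))
				!= pte_val(old))
			return 0;	/* raced: drop page, redo locked path */
		return 1;
	}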
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 13:29 ` Ingo Molnar
@ 2006-02-24  6:36 ` Nick Piggin
  2006-02-24  6:49 ` Ingo Molnar
  2006-02-24  9:15 ` Arjan van de Ven
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2006-02-24 6:36 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> I'm worried about the situation where we allocate but don't use the
>> new page: it blows quite a bit of cache. Then, when we do get around
>> to using it, it will be cold(er).
>
> couldn't the new pte be flipped in atomically via cmpxchg? That way we
> could do the page clearing close to where we are doing it now, but
> without holding the mmap_sem.
>

We have nothing to pin the pte page with if we're not holding the
mmap_sem.

> to solve the pte races we could use a bit in the [otherwise empty] pte
> to signal "this pte can be flipped in from now on", which bit would
> automatically be cleared if mprotect() or munmap() is called over that
> range (without any extra changes to those codepaths). (in the rare case
> that the cmpxchg() fails, we go into a slowpath that drops the newly
> allocated page, re-looks up the vma and the pte, etc.)
>

Page still isn't pinned. You might be able to do something wild like
disable preemption and interrupts (to stop the TLB IPI) to get a pin
on the pte pages.

But even in that case, there is nothing in the mmu gather / tlb flush
interface that guarantees an architecture cannot free the page table
pages immediately (ie. without waiting for the flush IPI). This would
make sense on architectures that don't walk the page tables in
hardware.

Arjan, just to get an idea of your workload: obviously it is a mix of
read and write on the mmap_sem (read only will not really benefit from
reducing lock width because cacheline transfers will still be there).

Is it coming from brk() from the allocator? Someone told me a while ago
that glibc doesn't have a decent amount of hysteresis in its allocator
and tends to enter the kernel quite a lot... that might be something to
look into.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  6:36 ` Nick Piggin
@ 2006-02-24  6:49 ` Ingo Molnar
  2006-02-24  7:01 ` Nick Piggin
  0 siblings, 1 reply; 22+ messages in thread
From: Ingo Molnar @ 2006-02-24 6:49 UTC (permalink / raw)
To: Nick Piggin; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > couldn't the new pte be flipped in atomically via cmpxchg? That way
> > we could do the page clearing close to where we are doing it now,
> > but without holding the mmap_sem.
>
> We have nothing to pin the pte page with if we're not holding the
> mmap_sem.

why does it have to be pinned? The page is mostly private to this
thread until it manages to flip it into the pte. Since there's no pte
presence, there's no swapout possible [here I'm assuming anonymous
malloc() memory, which is the main focus of Arjan's patch]. Any
parallel unmapping of that page will be caught and the installation of
the page will be prevented by the 'bit-spin-lock' embedded in the pte.

> But even in that case, there is nothing in the mmu gather / tlb flush
> interface that guarantees an architecture cannot free the page table
> pages immediately (ie. without waiting for the flush IPI). This would
> make sense on architectures that don't walk the page tables in
> hardware.

but the page won't be found by any other CPU, so it won't be freed! It
is private to this CPU. The page has no pte presence. It will only be
present and lookupable as a result of the cmpxchg() flipping the page
into the pte.

	Ingo

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  6:49 ` Ingo Molnar
@ 2006-02-24  7:01 ` Nick Piggin
  2006-02-24 12:33 ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2006-02-24 7:01 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>>> couldn't the new pte be flipped in atomically via cmpxchg? That way
>>> we could do the page clearing close to where we are doing it now,
>>> but without holding the mmap_sem.
>>
>> We have nothing to pin the pte page with if we're not holding the
>> mmap_sem.
>
> why does it have to be pinned? The page is mostly private to this
> thread until it manages to flip it into the pte. Since there's no pte
> presence, there's no swapout possible [here I'm assuming anonymous
> malloc() memory, which is the main focus of Arjan's patch]. Any
> parallel unmapping of that page will be caught and the installation of
> the page will be prevented by the 'bit-spin-lock' embedded in the pte.
>

No, I was talking about page table pages, rather than the newly
allocated page. But I didn't realise you wanted the bit lock to go the
other way as well (ie. a real bit spinlock). Seems like that would have
to add overhead somewhere.

>> But even in that case, there is nothing in the mmu gather / tlb flush
>> interface that guarantees an architecture cannot free the page table
>> pages immediately (ie. without waiting for the flush IPI). This would
>> make sense on architectures that don't walk the page tables in
>> hardware.
>
> but the page won't be found by any other CPU, so it won't be freed! It
> is private to this CPU. The page has no pte presence. It will only be
> present and lookupable as a result of the cmpxchg() flipping the page
> into the pte.
>

Yeah, as I said above, the newly allocated page is fine; it is the
page table pages I'm worried about.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  7:01 ` Nick Piggin
@ 2006-02-24 12:33 ` Andi Kleen
  2006-02-24 12:55 ` Hugh Dickins
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2006-02-24 12:33 UTC (permalink / raw)
To: Nick Piggin; +Cc: Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

On Friday 24 February 2006 08:01, Nick Piggin wrote:
>
> Yeah, as I said above, the newly allocated page is fine; it is the
> page table pages I'm worried about.

page tables are easy because we zero them on free (as a side effect of
all the pte_clears)

I did an experimental hack some time ago to set a new struct page flag
when a page is known to be zeroed on freeing, and use that for a
GFP_ZERO allocation (basically skip the clear_page when that flag was
set).

The idea was to generalize the old page table reuse caches which Ingo
removed at some point. It only works of course if the allocations and
freeing of page tables roughly match up.

In theory one could have split the lists of the buddy allocator too
into zero/non-zero pages to increase the hit rate, but I didn't attempt
this.

I unfortunately don't remember the outcome; I dropped it for some
reason.

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread
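Andi's hack itself isn't shown in the thread. As a rough sketch of the idea,
it might have looked something like this -- PG_zeroed and the
SetPageZeroed()/TestClearPageZeroed() helpers are hypothetical names that
follow the usual page-flag macro convention, not flags that exist in the tree:

	/* Free side: a caller that knows the page's contents are all zero
	 * (e.g. a page table emptied by pte_clear) tags the page rather
	 * than letting the next user re-clear it. */
	static void free_known_zero_page(struct page *page)
	{
		SetPageZeroed(page);		/* hypothetical PG_zeroed */
		__free_page(page);
	}

	/* Alloc side: a __GFP_ZERO request can skip clear_highpage() when
	 * the buddy allocator hands back a page still flagged as zeroed.
	 * The flag is always consumed, since the new owner will dirty it. */
	static struct page *alloc_page_maybe_zeroed(gfp_t gfp)
	{
		struct page *page = alloc_pages(gfp & ~__GFP_ZERO, 0);

		if (page) {
			int was_zero = TestClearPageZeroed(page);
			if ((gfp & __GFP_ZERO) && !was_zero)
				clear_highpage(page);
		}
		return page;
	}

As Andi notes, the hit rate depends on allocations and frees of known-zero
pages roughly matching up; splitting the buddy free lists into zero/non-zero
would raise it at the cost of allocator complexity.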
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24 12:33 ` Andi Kleen
@ 2006-02-24 12:55 ` Hugh Dickins
  0 siblings, 0 replies; 22+ messages in thread
From: Hugh Dickins @ 2006-02-24 12:55 UTC (permalink / raw)
To: Andi Kleen; +Cc: Nick Piggin, Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

On Fri, 24 Feb 2006, Andi Kleen wrote:
> On Friday 24 February 2006 08:01, Nick Piggin wrote:
> >
> > Yeah, as I said above, the newly allocated page is fine; it is the
> > page table pages I'm worried about.
>
> page tables are easy because we zero them on free (as a side effect
> of all the pte_clears)

But once the page table is freed, it's likely to get used for something
else; whether for another page table or something different doesn't
matter: this mm can no longer blindly mess with the entries within it.

Nick's point is that it's mmap_sem (or mm_users 0) guarding against
the page table being freed.

Hugh

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  6:36 ` Nick Piggin
  2006-02-24  6:49 ` Ingo Molnar
@ 2006-02-24  9:15 ` Arjan van de Ven
  2006-02-24  9:26 ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Arjan van de Ven @ 2006-02-24 9:15 UTC (permalink / raw)
To: Nick Piggin; +Cc: Ingo Molnar, Andi Kleen, Arjan van de Ven, linux-kernel, akpm

> Arjan, just to get an idea of your workload: obviously it is a mix of
> read and write on the mmap_sem (read only will not really benefit from
> reducing lock width because cacheline transfers will still be there).

yeah it's threads that each allocate, use and then free memory with
mmap()

> Is it coming from brk() from the allocator? Someone told me a while ago
> that glibc doesn't have a decent amount of hysteresis in its allocator
> and tends to enter the kernel quite a lot... that might be something
> to look into.

we already are working on that angle; I just posted the kernel stuff as
a side effect basically

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  9:15 ` Arjan van de Ven
@ 2006-02-24  9:26 ` Nick Piggin
  2006-02-24 12:27 ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2006-02-24 9:26 UTC (permalink / raw)
To: Arjan van de Ven
Cc: Ingo Molnar, Andi Kleen, Arjan van de Ven, linux-kernel, akpm

Arjan van de Ven wrote:
>> Arjan, just to get an idea of your workload: obviously it is a mix of
>> read and write on the mmap_sem (read only will not really benefit from
>> reducing lock width because cacheline transfers will still be there).
>
> yeah it's threads that each allocate, use and then free memory with
> mmap()
>
>> Is it coming from brk() from the allocator? Someone told me a while ago
>> that glibc doesn't have a decent amount of hysteresis in its allocator
>> and tends to enter the kernel quite a lot... that might be something
>> to look into.
>
> we already are working on that angle; I just posted the kernel stuff as
> a side effect basically
>

OK.

[aside]
Actually I have a scalability improvement for rwsems that moves the
actual task wakeups out from underneath the rwsem spinlock in the up()
paths. This was useful exactly on a mixed read+write workload on
mmap_sem.

The difference was quite large for the "generic rwsem" algorithm
because it uses the spinlock in fastpaths a lot more than the xadd
algorithm. I think x86-64 uses the former, which is what I presume
you're testing with?

Obviously this is a slightly different issue from the one you're trying
to address here, but I'll dig out the patch when I get some time and
you can see if it helps.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  9:26 ` Nick Piggin
@ 2006-02-24 12:27 ` Andi Kleen
  2006-02-24 15:31 ` Andrea Arcangeli
  2006-02-25 16:48 ` Nick Piggin
  0 siblings, 2 replies; 22+ messages in thread
From: Andi Kleen @ 2006-02-24 12:27 UTC (permalink / raw)
To: Nick Piggin
Cc: Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel, akpm, andrea

On Friday 24 February 2006 10:26, Nick Piggin wrote:

> [aside]
> Actually I have a scalability improvement for rwsems that moves the
> actual task wakeups out from underneath the rwsem spinlock in the up()
> paths. This was useful exactly on a mixed read+write workload on
> mmap_sem.
>
> The difference was quite large for the "generic rwsem" algorithm
> because it uses the spinlock in fastpaths a lot more than the xadd
> algorithm. I think x86-64 uses the former, which is what I presume
> you're testing with?

I used the generic algorithm because Andrea originally expressed some
doubts about the correctness of the xadd algorithms, and after trying
to understand them I wasn't sure myself. Generic was the safer choice.

But if someone can show convincing numbers that XADD rwsems are faster
for some workload we can switch. I guess they are tested well enough
now on i386.

Or would your scalability improvement remove that difference (if it
really exists)?

-Andi

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24 12:27 ` Andi Kleen
@ 2006-02-24 15:31 ` Andrea Arcangeli
  0 siblings, 0 replies; 22+ messages in thread
From: Andrea Arcangeli @ 2006-02-24 15:31 UTC (permalink / raw)
To: Andi Kleen
Cc: Nick Piggin, Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

On Fri, Feb 24, 2006 at 01:27:26PM +0100, Andi Kleen wrote:
> I used the generic algorithm because Andrea originally expressed some
> doubts about the correctness of the xadd algorithms, and after trying
> to understand them I wasn't sure myself. Generic was the safer choice.

Admittedly we never had bug reports for the xadd ones, but trust me,
that's not a good reason to assume that they must be correct. I'd be
more comfortable if somebody would provide a demonstration of their
correctness. Overall I gave up also because I felt that the small gain
was by far not worth the risk.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24 12:27 ` Andi Kleen
  2006-02-24 15:31 ` Andrea Arcangeli
@ 2006-02-25 16:48 ` Nick Piggin
  2006-02-25 17:22 ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Nick Piggin @ 2006-02-25 16:48 UTC (permalink / raw)
To: Andi Kleen
Cc: Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel, akpm, andrea

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

Andi Kleen wrote:
> On Friday 24 February 2006 10:26, Nick Piggin wrote:
>
>> [aside]
>> Actually I have a scalability improvement for rwsems that moves the
>> actual task wakeups out from underneath the rwsem spinlock in the up()
>> paths. This was useful exactly on a mixed read+write workload on
>> mmap_sem.
>>
>> The difference was quite large for the "generic rwsem" algorithm
>> because it uses the spinlock in fastpaths a lot more than the xadd
>> algorithm. I think x86-64 uses the former, which is what I presume
>> you're testing with?
>
> I used the generic algorithm because Andrea originally expressed some
> doubts about the correctness of the xadd algorithms, and after trying
> to understand them I wasn't sure myself. Generic was the safer choice.
>
> But if someone can show convincing numbers that XADD rwsems are faster
> for some workload we can switch. I guess they are tested well enough
> now on i386.
>
> Or would your scalability improvement remove that difference (if it
> really exists)?
>

Here is a forward port of my scalability improvement, untested.

Unfortunately I didn't include absolute performance results, but the
changelog indicates that there was some noticeable delta between the
rwsem implementations (and that's what I vaguely remember).

Note: this was with volanomark, which can be quite variable at the
best of times IIRC.

Note2: this was on a 16-way NUMAQ, which had the interconnect
performance of a string between two cans compared to a modern x86-64
of smaller size, so it may be harder to notice an improvement.

--
SUSE Labs, Novell Inc.

[-- Attachment #2: rwsem-scale.patch --]
[-- Type: text/plain, Size: 12243 bytes --]

Move wakeups out from under the rwsem's wait_lock spinlock.

This reduces that lock's contention by a factor of around 10 on a
16-way NUMAQ running volanomark, however cacheline contention on the
rwsem's "activity" drowns out these small improvements when using the
i386 "optimised" rwsem:

unpatched:
55802519 total                      32.3097
23325323 default_idle           364458.1719
22084349 .text.lock.futex        82404.2873
 2369107 queue_me                24678.1979
 1875296 unqueue_me               9767.1667
 1202258 .text.lock.rwsem        46240.6923
  941801 finish_task_switch       7357.8203
  787101 __wake_up               12298.4531
  645252 drop_key_refs           13442.7500
  362789 futex_wait                839.7894
  333294 futex_wake               1487.9196
  146797 rwsem_down_read_failed    436.8958
   82788 .text.lock.dev            221.3583
   81221 try_to_wake_up            133.5872

+rwsem-scale:
58120260 total                      33.6458
25482132 default_idle           398158.3125
22774675 .text.lock.futex        84980.1306
 2517797 queue_me                26227.0521
 1953424 unqueue_me              10174.0833
 1063068 finish_task_switch       8305.2188
  834793 __wake_up               13043.6406
  674570 drop_key_refs           14053.5417
  371811 futex_wait                860.6736
  343398 futex_wake               1533.0268
  155419 try_to_wake_up            255.6234
  114704 .text.lock.rwsem         4411.6923

The rwsem-spinlock implementation, however, is improved significantly
more, and now gets volanomark performance similar to the xadd rwsem:

unpatched:
30850964 total                         18.1787
18986006 default_idle              296656.3438
 3989183 .text.lock.rwsem_spinlock  40294.7778
 2990161 .text.lock.futex           32501.7500
  549707 finish_task_switch          4294.5859
  535327 __down_read                 3717.5486
  452721 queue_me                    4715.8438
  439725 __up_read                   9160.9375
  396273 __wake_up                   6191.7656
  326595 unqueue_me                  1701.0156

+rwsem-scale:
25378268 total                         14.9537
13325514 default_idle              208211.1562
 3675634 .text.lock.futex           39952.5435
 2908629 .text.lock.rwsem_spinlock  28239.1165
  628115 __down_read                 4361.9097
  607417 finish_task_switch          4745.4453
  588031 queue_me                    6125.3229
  571169 __up_read                  11899.3542
  436795 __wake_up                   6824.9219
  416788 unqueue_me                  2170.7708

Index: linux-2.6/lib/rwsem.c
===================================================================
--- linux-2.6.orig/lib/rwsem.c	2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem.c	2006-02-26 03:37:04.000000000 +1100
@@ -36,14 +36,15 @@ void rwsemtrace(struct rw_semaphore *sem
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
+ *
+ * The spinlock will be dropped by this function.
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int downgrading, unsigned long flags)
 {
+	LIST_HEAD(wake_list);
 	struct rwsem_waiter *waiter;
-	struct task_struct *tsk;
-	struct list_head *next;
-	signed long oldcount, woken, loop;
+	signed long oldcount, woken;

 	rwsemtrace(sem, "Entering __rwsem_do_wake");

@@ -72,12 +73,8 @@ __rwsem_do_wake(struct rw_semaphore *sem
 	 * It is an allocated on the waiter's stack and may become invalid at
 	 * any time after that point (due to a wakeup from another source).
 	 */
-	list_del(&waiter->list);
-	tsk = waiter->task;
-	smp_mb();
-	waiter->task = NULL;
-	wake_up_process(tsk);
-	put_task_struct(tsk);
+	list_move_tail(&waiter->list, &wake_list);
+	waiter->flags = 0;
 	goto out;

 	/* don't want to wake any writers */
@@ -94,41 +91,37 @@ __rwsem_do_wake(struct rw_semaphore *sem
  readers_only:
 	woken = 0;
 	do {
-		woken++;
-
-		if (waiter->list.next == &sem->wait_list)
+		list_move_tail(&waiter->list, &wake_list);
+		waiter->flags = 0;
+		woken += RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+		if (list_empty(&sem->wait_list))
 			break;
-		waiter = list_entry(waiter->list.next,
+		waiter = list_entry(sem->wait_list.next,
 				struct rwsem_waiter, list);
-
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);

-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
 	if (!downgrading)
 		/* we'd already done one increment earlier */
 		woken -= RWSEM_ACTIVE_BIAS;

 	rwsem_atomic_add(woken, sem);

-	next = sem->wait_list.next;
-	for (; loop > 0; loop--) {
-		waiter = list_entry(next, struct rwsem_waiter, list);
-		next = waiter->list.next;
+out:
+	spin_unlock_irqrestore(&sem->wait_lock, flags);
+	while (!list_empty(&wake_list)) {
+		struct task_struct *tsk;
+		waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
+		list_del(&waiter->list);
 		tsk = waiter->task;
-		smp_mb();
 		waiter->task = NULL;
+		smp_mb();
 		wake_up_process(tsk);
 		put_task_struct(tsk);
 	}

-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
-
- out:
 	rwsemtrace(sem, "Leaving __rwsem_do_wake");
-	return sem;
+	return;

 	/* undo the change to count, but check for a transition 1->0 */
  undo:
@@ -145,12 +138,13 @@ rwsem_down_failed_common(struct rw_semap
 		     struct rwsem_waiter *waiter, signed long adjustment)
 {
 	struct task_struct *tsk = current;
+	unsigned long flags;
 	signed long count;

 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);

 	/* set up my own style of waitqueue */
-	spin_lock_irq(&sem->wait_lock);
+	spin_lock_irqsave(&sem->wait_lock, flags);
 	waiter->task = tsk;
 	get_task_struct(tsk);

@@ -161,14 +155,12 @@
 	/* if there are no active locks, wake the front queued process(es) up */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem, 0);
-
-	spin_unlock_irq(&sem->wait_lock);
+		__rwsem_do_wake(sem, 0, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);

 	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter->task)
-			break;
+	while (waiter->task) {
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
@@ -227,9 +219,9 @@ struct rw_semaphore fastcall *rwsem_wake

 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 0, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);

 	rwsemtrace(sem, "Leaving rwsem_wake");

@@ -251,9 +243,9 @@ struct rw_semaphore fastcall *rwsem_down

 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 1, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);

 	rwsemtrace(sem, "Leaving rwsem_downgrade_wake");

 	return sem;

Index: linux-2.6/lib/rwsem-spinlock.c
===================================================================
--- linux-2.6.orig/lib/rwsem-spinlock.c	2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem-spinlock.c	2006-02-26 03:37:42.000000000 +1100
@@ -49,11 +49,11 @@ void fastcall init_rwsem(struct rw_semap
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if wakewrite is non-zero
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite, unsigned long flags)
 {
+	LIST_HEAD(wake_list);
 	struct rwsem_waiter *waiter;
-	struct task_struct *tsk;
 	int woken;

 	rwsemtrace(sem, "Entering __rwsem_do_wake");

@@ -73,46 +73,46 @@ __rwsem_do_wake(struct rw_semaphore *sem
 	 */
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
 		sem->activity = -1;
-		list_del(&waiter->list);
-		tsk = waiter->task;
-		/* Don't touch waiter after ->task has been NULLed */
-		smp_mb();
-		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
+		list_move_tail(&waiter->list, &wake_list);
 		goto out;
 	}

 	/* grant an infinite number of read locks to the front of the queue */
  dont_wake_writers:
 	woken = 0;
-	while (waiter->flags & RWSEM_WAITING_FOR_READ) {
-		struct list_head *next = waiter->list.next;
+	do {
+		list_move_tail(&waiter->list, &wake_list);
+		woken++;
+		if (list_empty(&sem->wait_list))
+			break;
+		waiter = list_entry(sem->wait_list.next,
+				struct rwsem_waiter, list);
+	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
+
+	sem->activity += woken;

+out:
+	spin_unlock_irqrestore(&sem->wait_lock, flags);
+	while (!list_empty(&wake_list)) {
+		struct task_struct *tsk;
+		waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
 		list_del(&waiter->list);
 		tsk = waiter->task;
-		smp_mb();
 		waiter->task = NULL;
+		smp_mb();
 		wake_up_process(tsk);
 		put_task_struct(tsk);
-		woken++;
-		if (list_empty(&sem->wait_list))
-			break;
-		waiter = list_entry(next, struct rwsem_waiter, list);
 	}

-	sem->activity += woken;
-
- out:
 	rwsemtrace(sem, "Leaving __rwsem_do_wake");
-	return sem;
 }

 /*
  * wake a single writer
+ * called with wait_lock locked and returns with it unlocked.
  */
-static inline struct rw_semaphore *
-__rwsem_wake_one_writer(struct rw_semaphore *sem)
+static inline void
+__rwsem_wake_one_writer(struct rw_semaphore *sem, unsigned long flags)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;

@@ -121,13 +121,13 @@ __rwsem_wake_one_writer(struct rw_semaph
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	list_del(&waiter->list);
+	spin_unlock_irqrestore(&sem->wait_lock, flags);

 	tsk = waiter->task;
-	smp_mb();
 	waiter->task = NULL;
+	smp_mb();
 	wake_up_process(tsk);
 	put_task_struct(tsk);
-	return sem;
 }

 /*
@@ -163,9 +163,7 @@ void fastcall __sched __down_read(struct
 	spin_unlock_irq(&sem->wait_lock);

 	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.task)
-			break;
+	while (waiter.task) {
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}

@@ -234,9 +232,7 @@ void fastcall __sched __down_write(struc
 	spin_unlock_irq(&sem->wait_lock);

 	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.task)
-			break;
+	while (waiter.task) {
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}

@@ -283,9 +279,9 @@ void fastcall __up_read(struct rw_semaph
 	spin_lock_irqsave(&sem->wait_lock, flags);

 	if (--sem->activity == 0 && !list_empty(&sem->wait_list))
-		sem = __rwsem_wake_one_writer(sem);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_wake_one_writer(sem, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);

 	rwsemtrace(sem, "Leaving __up_read");
 }

@@ -303,9 +299,9 @@ void fastcall __up_write(struct rw_semap
 	sem->activity = 0;

 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 1, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);

 	rwsemtrace(sem, "Leaving __up_write");
 }

@@ -324,9 +320,9 @@ void fastcall __downgrade_write(struct r
 	sem->activity = 1;

 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 0, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);

 	rwsemtrace(sem, "Leaving __downgrade_write");
 }

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-25 16:48 ` Nick Piggin
@ 2006-02-25 17:22 ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2006-02-25 17:22 UTC (permalink / raw)
To: Andi Kleen
Cc: Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel, akpm, andrea

Nick Piggin wrote:

> Here is a forward port of my scalability improvement, untested.
>
> Unfortunately I didn't include absolute performance results, but the
> changelog indicates that there was some noticeable delta between the
> rwsem implementations (and that's what I vaguely remember).
>
> Note: this was with volanomark, which can be quite variable at the
> best of times IIRC.
>
> Note2: this was on a 16-way NUMAQ, which had the interconnect
> performance of a string between two cans compared to a modern x86-64
> of smaller size, so it may be harder to notice an improvement.
>

Oh, I remember one performance caveat of this code on preempt kernels:
at very high loads (ie. lots of tasks being woken in the up path), the
wakeup code would end up doing a lot of context switches to and from
the tasks it is waking up.

This could be easily solved by disabling preempt (and would be still
better in terms of interrupt latency and lock hold times), however I
never got around to implementing that.

--
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 12:41 ` Ingo Molnar
  2006-02-23 13:06 ` Andi Kleen
@ 2006-02-28 22:30 ` Pavel Machek
  1 sibling, 0 replies; 22+ messages in thread
From: Pavel Machek @ 2006-02-28 22:30 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

On Čt 23-02-06 13:41:53, Ingo Molnar wrote:
>
> * Andi Kleen <ak@suse.de> wrote:
>
> > On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> > > This patch adds an entry for a cleared page to the task struct. The main
> > > purpose of this patch is to be able to pre-allocate and clear a page in a
> > > pagefault scenario before taking any locks (esp mmap_sem),
> > > opportunistically. Allocating+clearing a page is a very expensive
> > > operation that currently increases lock hold times quite a bit (in a
> > > threaded environment that allocates/uses/frees memory on a regular
> > > basis, this leads to contention).
> > >
> > > This is probably the most controversial patch of the 3, since there is
> > > a potential to take up 1 page per thread in this cache. In practice it's
> > > not as bad as it sounds (a large fraction of the pagefaults are anonymous
> > > and thus immediately use up the page). One could argue "let the VM reap
> > > these" but that has a few downsides: it increases locking needs but,
> > > more importantly, clearing a page is relatively expensive -- if the VM
> > > reaps the page again in case it wasn't needed, the work was just wasted.
> >
> > Looks like an incredibly bad hack. What workload was that again where
> > it helps? And how much? I think before we can consider adding that
> > ugly code you would need a far better rationale.
>
> yes, the patch is controversial technologically, and Arjan pointed it
> out above. This is nothing new - and Arjan probably submitted this to
> lkml so that he can get constructive feedback.

Actually, I think I have to back Andi here. This looked like a patch
for inclusion (signed-off, cc-ed Andrew). And yes, Arjan pointed out
that it is controversial, but the way the patch was worded I could
imagine Andrew merging it...

								Pavel
--
Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted...

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
  2006-02-23  9:41 ` Andi Kleen
@ 2006-02-23 18:25 ` Paul Jackson
  1 sibling, 0 replies; 22+ messages in thread
From: Paul Jackson @ 2006-02-23 18:25 UTC (permalink / raw)
To: Arjan van de Ven; +Cc: linux-kernel, ak, akpm

Just a random idea, offered with little real understanding of what's
going on ...

Instead of a per-task clear page, how about a per-cpu clear page, or
short queue of clear pages?

This lets the number of clear pages be throttled to whatever is worth
it. And it handles such cases as a few threads using the clear pages
rapidly, while many other threads don't need any, with a much higher
"average usefulness" per clear page (meaning the average time a
cleared page sits around wasting memory prior to its being used is
much shorter).

Some locking would still be needed, but per-cpu locking is a separate,
quicker beast than something like mmap_sem.

Mind you, I am not commenting one way or the other on whether any of
this is a good idea. Not my expertise ...

--
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 22+ messages in thread
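A rough illustration of what Paul is suggesting, in the per-cpu style of the
2.6 kernel of the time. The names (cleared_page_queue, NR_CLEARED_PAGES, both
functions) are invented here; nothing like this was posted in the thread:

	#define NR_CLEARED_PAGES 4	/* arbitrary queue depth */

	struct cleared_page_queue {
		int count;
		struct page *pages[NR_CLEARED_PAGES];
	};
	static DEFINE_PER_CPU(struct cleared_page_queue, cleared_pages);

	/* Take a pre-zeroed page from this CPU's queue, if one is ready. */
	static struct page *get_cleared_page(void)
	{
		struct cleared_page_queue *q;
		struct page *page = NULL;

		preempt_disable();	/* pin us to this CPU's queue */
		q = &__get_cpu_var(cleared_pages);
		if (q->count)
			page = q->pages[--q->count];
		preempt_enable();
		return page;
	}

	/* Refill opportunistically, before any locks are taken.  Pages are
	 * cleared with preemption enabled, since that is the slow part. */
	static void refill_cleared_pages(void)
	{
		struct cleared_page_queue *q;
		struct page *page;

		for (;;) {
			preempt_disable();
			q = &__get_cpu_var(cleared_pages);
			if (q->count >= NR_CLEARED_PAGES) {
				preempt_enable();
				return;
			}
			preempt_enable();

			page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
			if (!page)
				return;

			preempt_disable();
			q = &__get_cpu_var(cleared_pages);	/* may have migrated CPUs */
			if (q->count < NR_CLEARED_PAGES)
				q->pages[q->count++] = page;
			else
				__free_page(page);	/* someone refilled meanwhile */
			preempt_enable();
		}
	}

Compared with the per-task pointer, this caps the footprint at
NR_CLEARED_PAGES per CPU rather than one page per thread, at the cost of the
preempt-disabled sections standing in for the locking Paul alludes to; NUMA
placement (which prepare_cleared_page() takes from the task's mempolicy)
would also become per-CPU rather than per-task.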