linux-kernel.vger.kernel.org archive mirror
* [Patch 0/3] threaded mmap tweaks
@ 2006-02-23  9:17 Arjan van de Ven
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
  2006-02-23  9:30 ` [Patch 2/3] fast VMA recycling Arjan van de Ven
  0 siblings, 2 replies; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-23  9:17 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm

Hi,

I've been looking at a micro-benchmark that basically starts a few
threads which then each allocate some memory (via mmap), use it briefly
and then free it again (munmap). During this benchmark the mmap_sem gets
contended and as a result things go less well than expected. The patches
in this series improved the benchmark by 3% on a wallclock time basis.
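
For reference, the pattern being exercised is roughly the following; this
is an illustrative sketch only, not the actual benchmark (sizes, iteration
and thread counts are made up):

	/* illustrative sketch of the workload, not the real benchmark */
	#include <pthread.h>
	#include <string.h>
	#include <sys/mman.h>

	#define ALLOC_SIZE	(256 * 1024)
	#define ITERATIONS	100000
	#define NR_THREADS	4

	static void *worker(void *arg)
	{
		int i;

		for (i = 0; i < ITERATIONS; i++) {
			/* each thread maps, briefly uses and unmaps memory */
			char *p = mmap(NULL, ALLOC_SIZE, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p == MAP_FAILED)
				break;
			memset(p, 1, ALLOC_SIZE);	/* fault the pages in */
			munmap(p, ALLOC_SIZE);
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t thread[NR_THREADS];
		int i;

		for (i = 0; i < NR_THREADS; i++)
			pthread_create(&thread[i], NULL, worker, NULL);
		for (i = 0; i < NR_THREADS; i++)
			pthread_join(thread[i], NULL);
		return 0;
	}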

Greetings,
   Arjan van de Ven

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:17 [Patch 0/3] threaded mmap tweaks Arjan van de Ven
@ 2006-02-23  9:29 ` Arjan van de Ven
  2006-02-23  9:41   ` Andi Kleen
  2006-02-23 18:25   ` Paul Jackson
  2006-02-23  9:30 ` [Patch 2/3] fast VMA recycling Arjan van de Ven
  1 sibling, 2 replies; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-23  9:29 UTC (permalink / raw)
  To: linux-kernel; +Cc: ak, akpm

This patch adds an entry for a cleared page to the task struct. The main
purpose of this patch is to be able to pre-allocate and clear a page in a
pagefault scenario before taking any locks (esp mmap_sem),
opportunistically. Allocating+clearing a page is a very expensive
operation that currently increases lock hold times quite a bit (in a threaded
environment that allocates/uses/frees memory on a regular basis, this leads
to contention).

This is probably the most controversial patch of the 3, since there is
a potential to take up 1 page per thread in this cache. In practice it's
not as bad as it sounds (a large fraction of the pagefaults are anonymous
and thus immediately use up the page). One could argue "let the VM reap
these", but that has a few downsides: it increases locking needs, and, more
importantly, since clearing a page is relatively expensive, the work is
simply wasted if the VM ends up reaping a page that wasn't needed.
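
Schematically, the flow this patch creates in the (x86_64) fault path is
the following; this is just a simplified sketch, the real hook points are
in the diff below:

	/* simplified sketch; see arch/x86_64/mm/fault.c and mm/mempolicy.c below */
	prepare_cleared_page();		/* allocate + clear a page, no locks held */

	down_read(&mm->mmap_sem);
	/*
	 * find_vma() + handle_mm_fault(): an anonymous fault that needs a
	 * zeroed page ends up in alloc_page_vma(... | __GFP_ZERO), which now
	 * returns current->cleared_page if it is populated, so the expensive
	 * clear_page() no longer happens with mmap_sem held.
	 */
	up_read(&mm->mmap_sem);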


Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

---
 arch/x86_64/mm/fault.c |    2 ++
 include/linux/mm.h     |    1 +
 include/linux/sched.h  |    1 +
 kernel/exit.c          |    4 ++++
 kernel/fork.c          |    1 +
 mm/mempolicy.c         |   38 ++++++++++++++++++++++++++++++++++++++
 6 files changed, 47 insertions(+)

Index: linux-work/arch/x86_64/mm/fault.c
===================================================================
--- linux-work.orig/arch/x86_64/mm/fault.c
+++ linux-work/arch/x86_64/mm/fault.c
@@ -375,6 +375,8 @@ asmlinkage void __kprobes do_page_fault(
 		goto bad_area_nosemaphore;
 
  again:
+	prepare_cleared_page();
+
 	/* When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in the
 	 * kernel and should generate an OOPS.  Unfortunatly, in the case of an
Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h
+++ linux-work/include/linux/mm.h
@@ -1052,6 +1052,7 @@ void drop_pagecache(void);
 void drop_slab(void);
 
 extern void prepopulate_vma(void);
+extern void prepare_cleared_page(void);
 
 extern int randomize_va_space;
 
Index: linux-work/include/linux/sched.h
===================================================================
--- linux-work.orig/include/linux/sched.h
+++ linux-work/include/linux/sched.h
@@ -839,6 +839,7 @@ struct task_struct {
 	struct reclaim_state *reclaim_state;
 
 	struct vm_area_struct *free_vma_cache;  /* keep 1 free vma around as cache */
+	struct page *cleared_page;		/* optionally keep 1 cleared page around */
 
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
Index: linux-work/kernel/exit.c
===================================================================
--- linux-work.orig/kernel/exit.c
+++ linux-work/kernel/exit.c
@@ -882,6 +882,10 @@ fastcall NORET_TYPE void do_exit(long co
 		kmem_cache_free(vm_area_cachep, tsk->free_vma_cache);
 	tsk->free_vma_cache = NULL;
 
+	if (tsk->cleared_page)
+		__free_page(tsk->cleared_page);
+	tsk->cleared_page = NULL;
+
 	/* PF_DEAD causes final put_task_struct after we schedule. */
 	preempt_disable();
 	BUG_ON(tsk->flags & PF_DEAD);
Index: linux-work/kernel/fork.c
===================================================================
--- linux-work.orig/kernel/fork.c
+++ linux-work/kernel/fork.c
@@ -180,6 +180,7 @@ static struct task_struct *dup_task_stru
 	atomic_set(&tsk->usage,2);
 	atomic_set(&tsk->fs_excl, 0);
 	tsk->free_vma_cache = NULL;
+	tsk->cleared_page = NULL;
 
 	return tsk;
 }
Index: linux-work/mm/mempolicy.c
===================================================================
--- linux-work.orig/mm/mempolicy.c
+++ linux-work/mm/mempolicy.c
@@ -1231,6 +1231,13 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 
+	if ((gfp & __GFP_ZERO) && current->cleared_page) {
+		struct page *page = current->cleared_page;
+
+		current->cleared_page = NULL;
+		return page;
+	}
+
 	cpuset_update_task_memory_state();
 
 	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
@@ -1242,6 +1249,37 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
 }
 
+
+/**
+ *	prepare_cleared_page - populate the per-task zeroed-page cache
+ *
+ *	This function populates the per-task cache with one zeroed page
+ *	(if there wasn't one already).
+ *	The idea is that this (expensive) clearing is done speculatively,
+ *	before any locks are taken, so that when the page is actually
+ *	needed under a lock, it is ready for immediate use.
+ */
+
+void prepare_cleared_page(void)
+{
+	struct mempolicy *pol = current->mempolicy;
+
+	if (current->cleared_page)
+		return;
+
+	cpuset_update_task_memory_state();
+
+	if (!pol)
+		pol = &default_policy;
+	if (pol->policy == MPOL_INTERLEAVE)
+		current->cleared_page = alloc_page_interleave(
+			GFP_HIGHUSER | __GFP_ZERO, 0, interleave_nodes(pol));
+	else
+		current->cleared_page = __alloc_pages(GFP_USER | __GFP_ZERO,
+			0, zonelist_policy(GFP_USER, pol));
+}
+
+
 /**
  * 	alloc_pages_current - Allocate pages.
  *


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [Patch 2/3] fast VMA recycling
  2006-02-23  9:17 [Patch 0/3] threaded mmap tweaks Arjan van de Ven
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
@ 2006-02-23  9:30 ` Arjan van de Ven
  2006-02-23  9:42   ` Andi Kleen
  2006-02-23 16:37   ` Benjamin LaHaise
  1 sibling, 2 replies; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-23  9:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, ak

This patch adds a per task-struct cache of a free vma. 

In normal operation, it is a really common action during userspace mmap 
or malloc to first allocate a vma, and then find out that it can be merged,
and thus free it again. In fact this is the case roughly 95% of the time.

In addition, this patch allows code to "prepopulate" the cache, and
this is done as an example for the x86_64 mmap codepath. The advantage of
this prepopulation is that the memory allocation (which is a sleeping
operation due to the GFP_KERNEL flag, potentially causing either a direct
sleep or a voluntary preempt sleep) will happen before the mmap_sem is
taken, which reduces lock hold time (and thus the contention potential).

The cache is only allowed to be accessed for "current", and not from IRQ
context. This allows for lockless access, making it a cheap cache.

One could argue that this (the preloading) should be a generic slab feature,
but that only gives some of the gains, not all of them; and vmas are small
creatures with a high "recycling" rate in typical cases.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>

---
 arch/x86_64/kernel/sys_x86_64.c |    2 +
 include/linux/mm.h              |    2 +
 include/linux/sched.h           |    2 +
 kernel/exit.c                   |    4 +++
 kernel/fork.c                   |    2 +
 mm/mmap.c                       |   50 ++++++++++++++++++++++++++++++++--------
 6 files changed, 52 insertions(+), 10 deletions(-)

Index: linux-work/arch/x86_64/kernel/sys_x86_64.c
===================================================================
--- linux-work.orig/arch/x86_64/kernel/sys_x86_64.c
+++ linux-work/arch/x86_64/kernel/sys_x86_64.c
@@ -55,6 +55,8 @@ asmlinkage long sys_mmap(unsigned long a
 		if (!file)
 			goto out;
 	}
+
+	prepopulate_vma();
 	down_write(&current->mm->mmap_sem);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, off >> PAGE_SHIFT);
 	up_write(&current->mm->mmap_sem);
Index: linux-work/include/linux/mm.h
===================================================================
--- linux-work.orig/include/linux/mm.h
+++ linux-work/include/linux/mm.h
@@ -1051,6 +1051,8 @@ int shrink_slab(unsigned long scanned, g
 void drop_pagecache(void);
 void drop_slab(void);
 
+extern void prepopulate_vma(void);
+
 extern int randomize_va_space;
 
 #endif /* __KERNEL__ */
Index: linux-work/include/linux/sched.h
===================================================================
--- linux-work.orig/include/linux/sched.h
+++ linux-work/include/linux/sched.h
@@ -838,6 +838,8 @@ struct task_struct {
 /* VM state */
 	struct reclaim_state *reclaim_state;
 
+	struct vm_area_struct *free_vma_cache;  /* keep 1 free vma around as cache */
+
 	struct dentry *proc_dentry;
 	struct backing_dev_info *backing_dev_info;
 
Index: linux-work/kernel/exit.c
===================================================================
--- linux-work.orig/kernel/exit.c
+++ linux-work/kernel/exit.c
@@ -878,6 +878,10 @@ fastcall NORET_TYPE void do_exit(long co
 	 */
 	mutex_debug_check_no_locks_held(tsk);
 
+	if (tsk->free_vma_cache)
+		kmem_cache_free(vm_area_cachep, tsk->free_vma_cache);
+	tsk->free_vma_cache = NULL;
+
 	/* PF_DEAD causes final put_task_struct after we schedule. */
 	preempt_disable();
 	BUG_ON(tsk->flags & PF_DEAD);
Index: linux-work/kernel/fork.c
===================================================================
--- linux-work.orig/kernel/fork.c
+++ linux-work/kernel/fork.c
@@ -179,6 +179,8 @@ static struct task_struct *dup_task_stru
 	/* One for us, one for whoever does the "release_task()" (usually parent) */
 	atomic_set(&tsk->usage,2);
 	atomic_set(&tsk->fs_excl, 0);
+	tsk->free_vma_cache = NULL;
+
 	return tsk;
 }
 
Index: linux-work/mm/mmap.c
===================================================================
--- linux-work.orig/mm/mmap.c
+++ linux-work/mm/mmap.c
@@ -65,6 +65,36 @@ int sysctl_overcommit_ratio = 50;	/* def
 int sysctl_max_map_count __read_mostly = DEFAULT_MAX_MAP_COUNT;
 atomic_t vm_committed_space = ATOMIC_INIT(0);
 
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	struct vm_area_struct *oldvma;
+
+	oldvma = current->free_vma_cache;
+	current->free_vma_cache = vma;
+	if (oldvma)
+		kmem_cache_free(vm_area_cachep, oldvma);
+}
+
+static struct vm_area_struct *alloc_vma(void)
+{
+	if (current->free_vma_cache)  {
+		struct vm_area_struct *vma;
+		vma = current->free_vma_cache;
+		current->free_vma_cache = NULL;
+		return vma;
+	}
+	return kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+}
+
+void prepopulate_vma(void)
+{
+	if (!current->free_vma_cache)
+		current->free_vma_cache =
+			kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+}
+
+
 /*
  * Check that a process has enough memory to allocate a new virtual
  * mapping. 0 means there is enough memory for the allocation to
@@ -206,7 +236,7 @@ static struct vm_area_struct *remove_vma
 	if (vma->vm_file)
 		fput(vma->vm_file);
 	mpol_free(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -593,7 +623,7 @@ again:			remove_next = 1 + (end > next->
 			fput(file);
 		mm->map_count--;
 		mpol_free(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -1048,7 +1078,7 @@ munmap_back:
 	 * specific mapper. the address has already been validated, but
 	 * not unmapped, but the maps are removed from the list.
 	 */
-	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+	vma = alloc_vma();
 	if (!vma) {
 		error = -ENOMEM;
 		goto unacct_error;
@@ -1113,7 +1143,7 @@ munmap_back:
 			fput(file);
 		}
 		mpol_free(vma_policy(vma));
-		kmem_cache_free(vm_area_cachep, vma);
+		free_vma(vma);
 	}
 out:	
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -1140,7 +1170,7 @@ unmap_and_free_vma:
 	unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
 	charged = 0;
 free_vma:
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
@@ -1711,7 +1741,7 @@ int split_vma(struct mm_struct * mm, str
 	if (mm->map_count >= sysctl_max_map_count)
 		return -ENOMEM;
 
-	new = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+	new = alloc_vma();
 	if (!new)
 		return -ENOMEM;
 
@@ -1727,7 +1757,7 @@ int split_vma(struct mm_struct * mm, str
 
 	pol = mpol_copy(vma_policy(vma));
 	if (IS_ERR(pol)) {
-		kmem_cache_free(vm_area_cachep, new);
+		free_vma(new);
 		return PTR_ERR(pol);
 	}
 	vma_set_policy(new, pol);
@@ -1904,7 +1934,7 @@ unsigned long do_brk(unsigned long addr,
 	/*
 	 * create a vma struct for an anonymous mapping
 	 */
-	vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+	vma = alloc_vma();
 	if (!vma) {
 		vm_unacct_memory(len >> PAGE_SHIFT);
 		return -ENOMEM;
@@ -2024,12 +2054,12 @@ struct vm_area_struct *copy_vma(struct v
 		    vma_start < new_vma->vm_end)
 			*vmap = new_vma;
 	} else {
-		new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
+		new_vma = alloc_vma();
 		if (new_vma) {
 			*new_vma = *vma;
 			pol = mpol_copy(vma_policy(vma));
 			if (IS_ERR(pol)) {
-				kmem_cache_free(vm_area_cachep, new_vma);
+				free_vma(new_vma);
 				return NULL;
 			}
 			vma_set_policy(new_vma, pol);


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
@ 2006-02-23  9:41   ` Andi Kleen
  2006-02-23 12:41     ` Ingo Molnar
  2006-02-23 18:25   ` Paul Jackson
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-23  9:41 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, akpm

On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> This patch adds an entry for a cleared page to the task struct. The main
> purpose of this patch is to be able to pre-allocate and clear a page in a
> pagefault scenario before taking any locks (esp mmap_sem),
> opportunistically. Allocating+clearing a page is an very expensive 
> operation that currently increases lock hold times quite bit (in a threaded 
> environment that allocates/use/frees memory on a regular basis, this leads
> to contention).
> 
> This is probably the most controversial patch of the 3, since there is
> a potential to take up 1 page per thread in this cache. In practice it's
> not as bad as it sounds (a large degree of the pagefaults are anonymous 
> and thus immediately use up the page). One could argue "let the VM reap
> these" but that has a few downsides; it increases locking needs but more,
> clearing a page is relatively expensive, if the VM reaps the page again
> in case it wasn't needed, the work was just wasted.

Looks like an incredibly bad hack. What workload was that again where it helps?
And how much? I think before we can consider adding that ugly code you would
need a far better rationale.

-Andi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23  9:30 ` [Patch 2/3] fast VMA recycling Arjan van de Ven
@ 2006-02-23  9:42   ` Andi Kleen
  2006-02-23  9:48     ` Arjan van de Ven
  2006-02-23 16:37   ` Benjamin LaHaise
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-23  9:42 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, akpm

On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> This patch adds a per task-struct cache of a free vma. 
> 
> In normal operation, it is a really common action during userspace mmap 
> or malloc to first allocate a vma, and then find out that it can be merged,
> and thus free it again. In fact this is the case roughly 95% of the time.
> 
> In addition, this patch allows code to "prepopulate" the cache, and
> this is done as example for the x86_64 mmap codepath. The advantage of this
> prepopulation is that the memory allocation (which is a sleeping operation
> due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> reduces lock hold time (and thus the contention potential)

The slab fast path doesn't sleep. And if you're short of memory 
then you probably have other issues.
 
> The cache is only allowed to be accessed for "current", and not from IRQ
> context. This allows for lockless access, making it a cheap cache.
> 
> One could argue that this should be a generic slab feature (the preloading) but
> that only gives some of the gains and not all, and vma's are small creatures,
> with a high "recycling" rate in typical cases.

Numbers numbers numbers. What workload? How much did it help? 

Also looks like a very narrow purpose hard to maintain fragile hack to me.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23  9:42   ` Andi Kleen
@ 2006-02-23  9:48     ` Arjan van de Ven
  2006-02-23 10:05       ` Andi Kleen
  2006-02-24 18:52       ` Christoph Hellwig
  0 siblings, 2 replies; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-23  9:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > This patch adds a per task-struct cache of a free vma. 
> > 
> > In normal operation, it is a really common action during userspace mmap 
> > or malloc to first allocate a vma, and then find out that it can be merged,
> > and thus free it again. In fact this is the case roughly 95% of the time.
> > 
> > In addition, this patch allows code to "prepopulate" the cache, and
> > this is done as example for the x86_64 mmap codepath. The advantage of this
> > prepopulation is that the memory allocation (which is a sleeping operation
> > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> > reduces lock hold time (and thus the contention potential)
> 
> The slab fast path doesn't sleep. 

it does via might_sleep()
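
(for reference, the relevant 2.6 definitions, simplified and from memory
rather than quoted verbatim:)

	#ifdef CONFIG_PREEMPT_VOLUNTARY
	# define might_resched()	cond_resched()
	#else
	# define might_resched()	do { } while (0)
	#endif

	#define might_sleep() \
		do { __might_sleep(__FILE__, __LINE__); might_resched(); } while (0)

	/*
	 * and the slab allocator does might_sleep_if(flags & __GFP_WAIT) on
	 * each allocation, so kmem_cache_alloc(..., GFP_KERNEL) is a voluntary
	 * preemption point even when the fast path never enters the page
	 * allocator.
	 */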

> Numbers numbers numbers. What workload? How much did it help? 

see post 0/3

3% on a threaded allocation benchmark (which resembles a webserver with
cgis apparently)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23  9:48     ` Arjan van de Ven
@ 2006-02-23 10:05       ` Andi Kleen
  2006-02-23 10:15         ` Arjan van de Ven
  2006-02-24 18:52       ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-23 10:05 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, akpm

On Thursday 23 February 2006 10:48, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > This patch adds a per task-struct cache of a free vma. 
> > > 
> > > In normal operation, it is a really common action during userspace mmap 
> > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > 
> > > In addition, this patch allows code to "prepopulate" the cache, and
> > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > prepopulation is that the memory allocation (which is a sleeping operation
> > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> > > reduces lock hold time (and thus the contention potential)
> > 
> > The slab fast path doesn't sleep. 
> 
> it does via might_sleep()

Hmm? That shouldn't sleep.

If it takes any time in a real workload then it should move into DEBUG_KERNEL
too. But I doubt it. Something with your analysis is wrong.

-Andi

P.S.: Your email address bounces.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23 10:05       ` Andi Kleen
@ 2006-02-23 10:15         ` Arjan van de Ven
  2006-02-23 11:00           ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-23 10:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel

On Thu, 2006-02-23 at 11:05 +0100, Andi Kleen wrote:
> On Thursday 23 February 2006 10:48, Arjan van de Ven wrote:
> > On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > > This patch adds a per task-struct cache of a free vma. 
> > > > 
> > > > In normal operation, it is a really common action during userspace mmap 
> > > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > > 
> > > > In addition, this patch allows code to "prepopulate" the cache, and
> > > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > > prepopulation is that the memory allocation (which is a sleeping operation
> > > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> > > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> > > > reduces lock hold time (and thus the contention potential)
> > > 
> > > The slab fast path doesn't sleep. 
> > 
> > it does via might_sleep()
> 
> Hmm? That shouldn't sleep.

see voluntary preempt.


> If it takes any time in a real workload then it should move into DEBUG_KERNEL
> too. But I doubt it. Something with your analysis is wrong.

well I'm seeing contention; and this is one of the things that can be
moved out of the lock easily, and especially given the high recycle rate
of these things... looks worth it to me.



> P.S.: Your email address bounces.

sorry about that; made a typo when trying to figure out how to set up
non-corrupting patches for lkml


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23 10:15         ` Arjan van de Ven
@ 2006-02-23 11:00           ` Andi Kleen
  2006-02-23 11:22             ` Arjan van de Ven
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-23 11:00 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: akpm, linux-kernel

On Thursday 23 February 2006 11:15, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 11:05 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:48, Arjan van de Ven wrote:
> > > On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > > > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > > > This patch adds a per task-struct cache of a free vma. 
> > > > > 
> > > > > In normal operation, it is a really common action during userspace mmap 
> > > > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > > > 
> > > > > In addition, this patch allows code to "prepopulate" the cache, and
> > > > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > > > prepopulation is that the memory allocation (which is a sleeping operation
> > > > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> > > > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> > > > > reduces lock hold time (and thus the contention potential)
> > > > 
> > > > The slab fast path doesn't sleep. 
> > > 
> > > it does via might_sleep()
> > 
> > Hmm? That shouldn't sleep.
> 
> see voluntary preempt.

Only when its time slice is used up but then it would sleep a bit later 
in user space. But it should really be an unlikely case and nothing
to optimize for.

> 
> > If it takes any time in a real workload then it should move into DEBUG_KERNEL
> > too. But I doubt it. Something with your analysis is wrong.
> 
> well I'm seeing contention; and this is one of the things that can be
> moved out of the lock easily, and especially given the high recycle rate
> of these things... looks worth it to me.

I think you need a better analysis of what is actually happening
instead of trying all kinds of weird quick fragile hacks.

-Andi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23 11:00           ` Andi Kleen
@ 2006-02-23 11:22             ` Arjan van de Ven
  2006-02-23 11:57               ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-23 11:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, linux-kernel

Andi Kleen wrote:
>> see voluntary preempt.
> 
> Only when its time slice is used up 

or if some other thread gets a higher dynamic prio
> but then it would sleep a bit later 
> in user space. 

... but that is without the semaphore held! (and that is the entire 
point of this patch, move the sleep moments to outside the lock holding 
area as much as possible, to reduce lock hold times)



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23 11:22             ` Arjan van de Ven
@ 2006-02-23 11:57               ` Andi Kleen
  0 siblings, 0 replies; 32+ messages in thread
From: Andi Kleen @ 2006-02-23 11:57 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: akpm, linux-kernel

On Thursday 23 February 2006 12:22, Arjan van de Ven wrote:
> Andi Kleen wrote:
> >> see voluntary preempt.
> > 
> > Only when its time slice is used up 
> 
> or if some other thread gets a higher dynamic prio
> > but then it would sleep a bit later 
> > in user space. 
> 
> ... but that is without the semaphore held! (and that is the entire 
> point of this patch, move the sleep moments to outside the lock holding 
> area as much as possible, to reduce lock hold times)

And you verified this happens often in your workload?

Anyways, how about adding a down_no_preempt() or similar instead
that won't voluntarily preempt?

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:41   ` Andi Kleen
@ 2006-02-23 12:41     ` Ingo Molnar
  2006-02-23 13:06       ` Andi Kleen
  2006-02-28 22:30       ` Pavel Machek
  0 siblings, 2 replies; 32+ messages in thread
From: Ingo Molnar @ 2006-02-23 12:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Arjan van de Ven, linux-kernel, akpm


* Andi Kleen <ak@suse.de> wrote:

> On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> > This patch adds an entry for a cleared page to the task struct. The main
> > purpose of this patch is to be able to pre-allocate and clear a page in a
> > pagefault scenario before taking any locks (esp mmap_sem),
> > opportunistically. Allocating+clearing a page is an very expensive 
> > operation that currently increases lock hold times quite bit (in a threaded 
> > environment that allocates/use/frees memory on a regular basis, this leads
> > to contention).
> > 
> > This is probably the most controversial patch of the 3, since there is
> > a potential to take up 1 page per thread in this cache. In practice it's
> > not as bad as it sounds (a large degree of the pagefaults are anonymous 
> > and thus immediately use up the page). One could argue "let the VM reap
> > these" but that has a few downsides; it increases locking needs but more,
> > clearing a page is relatively expensive, if the VM reaps the page again
> > in case it wasn't needed, the work was just wasted.
> 
> Looks like an incredible bad hack. What workload was that again where 
> it helps? And how much? I think before we can consider adding that 
> ugly code you would a far better rationale.

yes, the patch is controversial technologically, and Arjan pointed it 
out above. This is nothing new - and Arjan probably submitted this to 
lkml so that he can get constructive feedback.

What Arjan did is quite nifty, as it moves the page clearing out from 
under the mmap_sem-held critical section. How that is achieved is really 
secondary, it's pretty clear that it could be done in some nicer way.  
Furthermore, we havent seen any significant advance in that area of the 
kernel [threaded mmap() performance] in quite some time, so 
contributions are both welcome and encouraged.

But Andi, regarding the tone of your reply - it is really getting out of 
hand! Please lean back and take a deep breath. Maybe you should even 
sleep over it. And when you come back, please start being _much_ more 
respectful of other people's work. You are not doing Linux _any_ favor 
by flaming away so mindlessly ... Arjan is a longtime contributor and he 
can probably take your flames just fine (as I took your flames just fine 
the other day), but we should really use a _much_ nicer tone on lkml.

You might not realize it, but replies like this can scare away novice 
contributors forever! You could scare away the next DaveM. Or the next 
Alan Cox. Or the next Andi Kleen. Heck, much milder replies can scare 
away even longtime contributors: see Rusty Russell's comments from a 
couple of days ago ...

And no, i dont accept the lame "dont come into the kitchen if you cant 
stand the flames" excuse: your reply was totally uncalled for, was 
totally undeserved and was totally unnecessary. It was incredibly mean 
spirited, and the only effect it can possibly have on the recipient is 
harm. And it's not like this happened in the heat of a longer discussion 
or due to some misunderstanding: you flamed away _right at the 
beginning_, right as the first reply to the patch submission. Shame on 
you! Is it that hard to reply:

  "Hm, nice idea, I wish it were cleaner though because
   <insert explanation here>. How about <insert nifty idea here>."

[ Btw., i could possibly have come up with a good solution for this
  during the time i had to waste on this reply. ]

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 12:41     ` Ingo Molnar
@ 2006-02-23 13:06       ` Andi Kleen
  2006-02-23 13:15         ` Nick Piggin
  2006-02-28 22:30       ` Pavel Machek
  1 sibling, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-23 13:06 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Arjan van de Ven, linux-kernel, akpm

On Thursday 23 February 2006 13:41, Ingo Molnar wrote:

> 
> What Arjan did is quite nifty, as it moves the page clearing out from 
> under the mmap_sem-held critical section.

So that was the point not the rescheduling under lock? Or both?

BTW since it touches your area of work you could comment what
you think about not using voluntary preempt points for fast sleep locks
like I later proposed.

> How that is achieved is really  
> secondary, it's pretty clear that it could be done in some nicer way.

Great we agree then.
> 
> And no, i dont accept the lame "dont come into the kitchen if you cant 
> stand the flames" excuse: your reply was totally uncalled for, was 
> totally undeserved 

Well he didn't supply any data so I asked for more.

> and was totally unnecessary. It was incredibly mean  
> spirited, 

Sorry, but I don't think that's true. Mean spirited would be
"we don't care, go away".  When I think that I generally don't 
answer the email.

I could have perhaps worded it a bit nicer, agreed, but I think
the core of my reply - we need more analysis for that - was
quite constructive. At least for one of the subproblems I even
proposed a better solution. If there is more analysis of the
problem maybe I can help even for more of it.

Also it's probably quite clear that adding lots of special-purpose caches
to task_struct for narrow purposes isn't a good way to do optimization.

The mail was perhaps a bit harsher than it should have been
because I think Arjan should have really known better...

> You might not realize it, but replies like this can scare away novice 
> contributors forever! You could scare away the next DaveM. Or the next 
> Alan Cox. Or the next Andi Kleen. Heck, much milder replies can scare 
> away even longtime contributors: see Rusty Russell's comments from a 
> couple of days ago ...

I must say I'm feeling a bit unfairly attacked here because I think I generally
try to help new patch submitters (at least if their basic ideas are sound even
if some details are wrong)

e.g. you might notice that a lot of patches from new contributors
go smoother into x86-64 than into i386.

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 13:06       ` Andi Kleen
@ 2006-02-23 13:15         ` Nick Piggin
  2006-02-23 13:29           ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2006-02-23 13:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

Andi Kleen wrote:
> On Thursday 23 February 2006 13:41, Ingo Molnar wrote:
> 
> 
>>What Arjan did is quite nifty, as it moves the page clearing out from 
>>under the mmap_sem-held critical section.
> 
> 
> So that was the point not the rescheduling under lock? Or both?
> 
> BTW since it touches your area of work you could comment what
> you think about not using voluntary preempt points for fast sleep locks
> like I later proposed.
> 
> 
>>How that is achieved is really  
>>secondary, it's pretty clear that it could be done in some nicer way.
> 
> 
> Great we agree then.
> 

I'm worried about the situation where we allocate but don't use the
new page: it blows quite a bit of cache. Then, when we do get around
to using it, it will be cold(er).

Good to see someone trying to improve this area, though.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 13:15         ` Nick Piggin
@ 2006-02-23 13:29           ` Ingo Molnar
  2006-02-24  6:36             ` Nick Piggin
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2006-02-23 13:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> I'm worried about the situation where we allocate but don't use the 
> new page: it blows quite a bit of cache. Then, when we do get around 
> to using it, it will be cold(er).

couldnt the new pte be flipped in atomically via cmpxchg? That way we 
could do the page clearing close to where we are doing it now, but 
without holding the mmap_sem.

to solve the pte races we could use a bit in the [otherwise empty] pte 
to signal "this pte can be flipped in from now on", which bit would 
automatically be cleared if mprotect() or munmap() is called over that 
range (without any extra changes to those codepaths). (in the rare case 
if the cmpxchg() fails, we go into a slowpath that drops the newly 
allocated page, re-lookups the vma and the pte, etc.)
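
in code, the flip itself would be something like this (purely schematic:
_PAGE_MAYFLIP and the helper name are invented, and all the rmap/counter
bookkeeping is left out):

	static int try_flip_in_page(struct vm_area_struct *vma, pte_t *ptep,
				    struct page *page)
	{
		pte_t expected = __pte(_PAGE_MAYFLIP);	/* empty pte, flip allowed */
		pte_t newpte = mk_pte(page, vma->vm_page_prot);

		if (cmpxchg(&ptep->pte, pte_val(expected), pte_val(newpte))
				!= pte_val(expected))
			return 0;	/* lost the race: take the locked slowpath */
		return 1;
	}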

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23  9:30 ` [Patch 2/3] fast VMA recycling Arjan van de Ven
  2006-02-23  9:42   ` Andi Kleen
@ 2006-02-23 16:37   ` Benjamin LaHaise
  1 sibling, 0 replies; 32+ messages in thread
From: Benjamin LaHaise @ 2006-02-23 16:37 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, akpm, ak

On Thu, Feb 23, 2006 at 10:30:29AM +0100, Arjan van de Ven wrote:
> One could argue that this should be a generic slab feature (the preloading) but
> that only gives some of the gains and not all, and vma's are small creatures,
> with a high "recycling" rate in typical cases.

I think this is wrong, and far too narrow in useful scope to be merged.  
It's unnecessary bloat of task_struct and if this is really a problem, then 
there is a [performance] bug in the slab allocator.  Please at least provide 
the slab statistics for what is going on, as the workload might well be 
thrashing the creation and destruction of slabs, which would be a much better 
thing to fix.

		-ben

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
  2006-02-23  9:41   ` Andi Kleen
@ 2006-02-23 18:25   ` Paul Jackson
  1 sibling, 0 replies; 32+ messages in thread
From: Paul Jackson @ 2006-02-23 18:25 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: linux-kernel, ak, akpm

Just a random idea, offered with little real understanding
of what's going on ...

  Instead of a per-task clear page, how about a per-cpu clear page,
  or short queue of clear pages?

  This lets the number of clear pages be throttled to whatever
  is worth it.  And it handles such cases as a few threads using
  the clear pages rapidly, while many other threads don't need any,
  with a much higher "average usefulness" per clear page (meaning
  the average time a cleared page sits around wasting memory prior
  to its being used is much shorter.)

  Some locking would still be needed, but per-cpu locking is
  a separate, quicker beast than something like mmap_sem.
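
  Concretely, something along these lines (all names and the queue depth
  are invented; just to illustrate the shape of it):

	#define CLEARED_PAGES_MAX	4

	struct cleared_pool {
		int count;
		struct page *page[CLEARED_PAGES_MAX];
	};
	static DEFINE_PER_CPU(struct cleared_pool, cleared_pool);

	/* take a pre-zeroed page from this CPU's pool, if one is available */
	static struct page *get_cleared_page(void)
	{
		struct cleared_pool *pool = &get_cpu_var(cleared_pool);
		struct page *page = NULL;

		if (pool->count)
			page = pool->page[--pool->count];
		put_cpu_var(cleared_pool);
		return page;
	}

	/* stash a freshly zeroed page; returns 0 if this CPU's pool is full */
	static int put_cleared_page(struct page *page)
	{
		struct cleared_pool *pool = &get_cpu_var(cleared_pool);
		int ret = 0;

		if (pool->count < CLEARED_PAGES_MAX) {
			pool->page[pool->count++] = page;
			ret = 1;
		}
		put_cpu_var(cleared_pool);
		return ret;
	}

  (Here get_cpu_var() only disables preemption; whether that is enough
  depends on where in the fault path such a pool would be consumed.)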

Mind you, I am not commenting one way or the other on whether any
of this is a good idea.  Not my expertise ...

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 13:29           ` Ingo Molnar
@ 2006-02-24  6:36             ` Nick Piggin
  2006-02-24  6:49               ` Ingo Molnar
  2006-02-24  9:15               ` Arjan van de Ven
  0 siblings, 2 replies; 32+ messages in thread
From: Nick Piggin @ 2006-02-24  6:36 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>I'm worried about the situation where we allocate but don't use the 
>>new page: it blows quite a bit of cache. Then, when we do get around 
>>to using it, it will be cold(er).
> 
> 
> couldnt the new pte be flipped in atomically via cmpxchg? That way we 
> could do the page clearing close to where we are doing it now, but 
> without holding the mmap_sem.
> 

We have nothing to pin the pte page with if we're not holding the
mmap_sem.

> to solve the pte races we could use a bit in the [otherwise empty] pte 
> to signal "this pte can be flipped in from now on", which bit would 
> automatically be cleared if mprotect() or munmap() is called over that 
> range (without any extra changes to those codepaths). (in the rare case 
> if the cmpxchg() fails, we go into a slowpath that drops the newly 
> allocated page, re-lookups the vma and the pte, etc.)
> 

Page still isn't pinned. You might be able to do something wild like
disable preemption and interrupts (to stop the TLB IPI) to get a pin
on the pte pages.

But even in that case, there is nothing in the mmu gather / tlb flush
interface that guarantees an architecture cannot free the page table
pages immediately (ie without waiting for the flush IPI). This would
make sense on architectures that don't walk the page tables in hardware.

Arjan, just to get an idea of your workload: obviously it is a mix of
read and write on the mmap_sem (read only will not really benefit from
reducing lock width because cacheline transfers will still be there).
Is it coming from brk() from the allocator? Someone told me a while ago
that glibc doesn't have a decent amount of hysteresis in its allocator
and tends to enter the kernel quite a lot... that might be something
to look into.
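
(For what it's worth, the userspace knob for that hysteresis is mallopt();
something along these lines, with the values pulled out of thin air:)

	#include <malloc.h>

	int main(void)
	{
		/*
		 * Keep more freed memory cached inside the process instead of
		 * returning it to the kernel, so fewer mmap()/munmap()/brk()
		 * calls and thus fewer mmap_sem acquisitions.
		 */
		mallopt(M_TRIM_THRESHOLD, 16 * 1024 * 1024);
		mallopt(M_MMAP_THRESHOLD, 16 * 1024 * 1024);

		/* ... run the threaded allocation workload here ... */
		return 0;
	}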

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  6:36             ` Nick Piggin
@ 2006-02-24  6:49               ` Ingo Molnar
  2006-02-24  7:01                 ` Nick Piggin
  2006-02-24  9:15               ` Arjan van de Ven
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2006-02-24  6:49 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm


* Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> > couldnt the new pte be flipped in atomically via cmpxchg? That way 
> > we could do the page clearing close to where we are doing it now,
> > but without holding the mmap_sem.
> 
> We have nothing to pin the pte page with if we're not holding the 
> mmap_sem.

why does it have to be pinned? The page is mostly private to this thread 
until it manages to flip it into the pte. Since there's no pte presence, 
there's no swapout possible [here i'm assuming anonymous malloc() 
memory, which is the main focus of Arjan's patch]. Any parallel 
unmapping of that page will be caught and the installation of the page 
will be prevented by the 'bit-spin-lock' embedded in the pte.

> But even in that case, there is nothing in the mmu gather / tlb flush 
> interface that guarantees an architecture cannot free the page table 
> pages immediately (ie without waiting for the flush IPI). This would 
> make sense on architectures that don't walk the page tables in 
> hardware.

but the page wont be found by any other CPU, so it wont be freed! It is 
private to this CPU. The page has no pte presence. It will only be 
present and lookupable as a result of the cmpxchg() flipping the page 
into the pte.

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  6:49               ` Ingo Molnar
@ 2006-02-24  7:01                 ` Nick Piggin
  2006-02-24 12:33                   ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2006-02-24  7:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

Ingo Molnar wrote:
> * Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>>couldnt the new pte be flipped in atomically via cmpxchg? That way 
>>>we could do the page clearing close to where we are doing it now,
>>>but without holding the mmap_sem.
>>
>>We have nothing to pin the pte page with if we're not holding the 
>>mmap_sem.
> 
> 
> why does it have to be pinned? The page is mostly private to this thread 
> until it manages to flip it into the pte. Since there's no pte presence, 
> there's no swapout possible [here i'm assuming anonymous malloc() 
> memory, which is the main focus of Arjan's patch]. Any parallel 
> unmapping of that page will be caught and the installation of the page 
> will be prevented by the 'bit-spin-lock' embedded in the pte.
> 

No, I was talking about page table pages, rather than the newly
allocated page.

But I didn't realise you wanted the bit lock to go the other way
as well (ie. a real bit spinlock). Seems like that would have to
add overhead somewhere.

> 
>>But even in that case, there is nothing in the mmu gather / tlb flush 
>>interface that guarantees an architecture cannot free the page table 
>>pages immediately (ie without waiting for the flush IPI). This would 
>>make sense on architectures that don't walk the page tables in 
>>hardware.
> 
> 
> but the page wont be found by any other CPU, so it wont be freed! It is 
> private to this CPU. The page has no pte presence. It will only be 
> present and lookupable as a result of the cmpxchg() flipping the page 
> into the pte.
> 

Yeah, as I said above, the newly allocated page is fine, it is the
page table pages I'm worried about.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  6:36             ` Nick Piggin
  2006-02-24  6:49               ` Ingo Molnar
@ 2006-02-24  9:15               ` Arjan van de Ven
  2006-02-24  9:26                 ` Nick Piggin
  1 sibling, 1 reply; 32+ messages in thread
From: Arjan van de Ven @ 2006-02-24  9:15 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Andi Kleen, Arjan van de Ven, linux-kernel, akpm


> Arjan, just to get an idea of your workload: obviously it is a mix of
> read and write on the mmap_sem (read only will not really benefit from
> reducing lock width because cacheline transfers will still be there).

yeah it's threads that each allocate, use and then free memory with
mmap()

> Is it coming from brk() from the allocator? Someone told me a while ago
> that glibc doesn't have a decent amount of hysteresis in its allocator
> and tends to enter the kernel quite a lot... that might be something
> to look into.

we already are working on that angle; I just posted the kernel stuff as
a side effect basically 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  9:15               ` Arjan van de Ven
@ 2006-02-24  9:26                 ` Nick Piggin
  2006-02-24 12:27                   ` Andi Kleen
  0 siblings, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2006-02-24  9:26 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Andi Kleen, Arjan van de Ven, linux-kernel, akpm

Arjan van de Ven wrote:
>>Arjan, just to get an idea of your workload: obviously it is a mix of
>>read and write on the mmap_sem (read only will not really benefit from
>>reducing lock width because cacheline transfers will still be there).
> 
> 
> yeah it's threads that each allocate, use and then free memory with
> mmap()
> 
> 
>>Is it coming from brk() from the allocator? Someone told me a while ago
>>that glibc doesn't have a decent amount of hysteresis in its allocator
>>and tends to enter the kernel quite a lot... that might be something
>>to look into.
> 
> 
> we already are working on that angle; I just posted the kernel stuff as
> a side effect basically 
> 

OK.

[aside]
Actually I have a scalability improvement for rwsems, that moves the
actual task wakeups out from underneath the rwsem spinlock in the up()
paths. This was useful exactly on a mixed read+write workload on mmap_sem.

The difference was quite large for the "generic rwsem" algorithm because
it uses the spinlock in fastpaths a lot more than the xadd algorithm. I
think x86-64 uses the former, which is what I presume you're testing with?

Obviously this is a slightly different issue from the one you're trying to
address here, but I'll dig out patch when I get some time and you can see
if it helps.

-- 
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  9:26                 ` Nick Piggin
@ 2006-02-24 12:27                   ` Andi Kleen
  2006-02-24 15:31                     ` Andrea Arcangeli
  2006-02-25 16:48                     ` Nick Piggin
  0 siblings, 2 replies; 32+ messages in thread
From: Andi Kleen @ 2006-02-24 12:27 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel,
	akpm, andrea

On Friday 24 February 2006 10:26, Nick Piggin wrote:

> [aside]
> Actually I have a scalability improvement for rwsems, that moves the
> actual task wakeups out from underneath the rwsem spinlock in the up()
> paths. This was useful exactly on a mixed read+write workload on mmap_sem.
> 
> The difference was quite large for the "generic rwsem" algorithm because
> it uses the spinlock in fastpaths a lot more than the xadd algorithm. I
> think x86-64 uses the former, which is what I presume you're testing with?

I used the generic algorithm because Andrea originally expressed some doubts 
on the correctness of the xadd algorithms, and after trying to understand them
I wasn't sure myself. Generic was the safer choice.

But if someone can show convincing numbers that XADD rwsems are faster
for some workload we can switch. I guess they are tested well enough now
on i386.

Or would your scalability improvement remove that difference (if it really exists)?


-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24  7:01                 ` Nick Piggin
@ 2006-02-24 12:33                   ` Andi Kleen
  2006-02-24 12:55                     ` Hugh Dickins
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-24 12:33 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

On Friday 24 February 2006 08:01, Nick Piggin wrote:

> 
> Yeah, as I said above, the newly allocated page is fine, it is the
> page table pages I'm worried about.

page tables are easy because we zero them on free (as a side effect
of all the pte_clears)

I did an experimental hack some time ago to set a new struct page
flag when a page is known to be zeroed on freeing and use that for 
a GFP_ZERO allocation (basically skip the clear_page when that
flag was set)

The idea was to generalize the old page table reuse caches which
Ingo removed at some point.

It only works of course if the allocations and freeing
of page tables roughly match up. In theory one could have
split the lists of the buddy allocator too into zero/non-zero
pages to increase the hit rate, but I didn't attempt this.
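
The scheme, roughly (illustrative only: PG_zeroed and its accessors are
made-up names, and the hook points are simplified):

	/* when freeing a page that is known to be all zeroes (e.g. a pte page) */
	static inline void free_zeroed_page(struct page *page)
	{
		SetPageZeroed(page);		/* hypothetical PG_zeroed bit */
		__free_page(page);
	}

	/* in the allocator, when handing out a page */
	static inline void prep_page_zero(struct page *page, gfp_t gfp_mask)
	{
		/* the flag is only meaningful while the page sits in the allocator */
		int was_zero = TestClearPageZeroed(page);

		if ((gfp_mask & __GFP_ZERO) && !was_zero)
			clear_highpage(page);	/* only clear if not already zero */
	}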

I unfortunately don't remember the outcome, dropped it for some reason. 

-Andi


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24 12:33                   ` Andi Kleen
@ 2006-02-24 12:55                     ` Hugh Dickins
  0 siblings, 0 replies; 32+ messages in thread
From: Hugh Dickins @ 2006-02-24 12:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Ingo Molnar, Arjan van de Ven, linux-kernel, akpm

On Fri, 24 Feb 2006, Andi Kleen wrote:
> On Friday 24 February 2006 08:01, Nick Piggin wrote:
> > 
> > Yeah, as I said above, the newly allocated page is fine, it is the
> > page table pages I'm worried about.
> 
> page tables are easy because we zero them on free (as a side effect
> of all the pte_clears)

But once the page table is freed, it's likely to get used for something
else, whether for another page table or something different doesn't 
matter: this mm can no longer blindly mess with the entries within it.

Nick's point is that it's mmap_sem (or mm_users 0) guarding against
the page table being freed.

Hugh

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24 12:27                   ` Andi Kleen
@ 2006-02-24 15:31                     ` Andrea Arcangeli
  2006-02-25 16:48                     ` Nick Piggin
  1 sibling, 0 replies; 32+ messages in thread
From: Andrea Arcangeli @ 2006-02-24 15:31 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Nick Piggin, Arjan van de Ven, Ingo Molnar, Arjan van de Ven,
	linux-kernel, akpm

On Fri, Feb 24, 2006 at 01:27:26PM +0100, Andi Kleen wrote:
> I used the generic algorithm because Andrea originally expressed some doubts 
> on the correctness of the xadd algorithms and after trying to understand them 
> myself I wasn't sure myself. Generic was the safer choice.

Admittedly we never had bug reports for the xadd ones, but trust me that's
not a good reason to assume that they must be correct. I'd be more
comfortable if somebody would provide a demonstration of their
correctness. Overall I gave up also because I felt that the small gain
was by far not worth the risk.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-23  9:48     ` Arjan van de Ven
  2006-02-23 10:05       ` Andi Kleen
@ 2006-02-24 18:52       ` Christoph Hellwig
  2006-02-24 19:05         ` Andi Kleen
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2006-02-24 18:52 UTC (permalink / raw)
  To: Arjan van de Ven; +Cc: Andi Kleen, linux-kernel, akpm

On Thu, Feb 23, 2006 at 10:48:50AM +0100, Arjan van de Ven wrote:
> On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > This patch adds a per task-struct cache of a free vma. 
> > > 
> > > In normal operation, it is a really common action during userspace mmap 
> > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > 
> > > In addition, this patch allows code to "prepopulate" the cache, and
> > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > prepopulation is that the memory allocation (which is a sleeping operation
> > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> > > reduces lock hold time (and thus the contention potential)
> > 
> > The slab fast path doesn't sleep. 
> 
> it does via might_sleep()

so turn off the voluntary preempt bullshit. 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-24 18:52       ` Christoph Hellwig
@ 2006-02-24 19:05         ` Andi Kleen
  2006-02-24 19:09           ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Andi Kleen @ 2006-02-24 19:05 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Arjan van de Ven, linux-kernel, akpm, mingo

On Friday 24 February 2006 19:52, Christoph Hellwig wrote:
> On Thu, Feb 23, 2006 at 10:48:50AM +0100, Arjan van de Ven wrote:
> > On Thu, 2006-02-23 at 10:42 +0100, Andi Kleen wrote:
> > > On Thursday 23 February 2006 10:30, Arjan van de Ven wrote:
> > > > This patch adds a per task-struct cache of a free vma. 
> > > > 
> > > > In normal operation, it is a really common action during userspace mmap 
> > > > or malloc to first allocate a vma, and then find out that it can be merged,
> > > > and thus free it again. In fact this is the case roughly 95% of the time.
> > > > 
> > > > In addition, this patch allows code to "prepopulate" the cache, and
> > > > this is done as example for the x86_64 mmap codepath. The advantage of this
> > > > prepopulation is that the memory allocation (which is a sleeping operation
> > > > due to the GFP_KERNEL flag, potentially causing either a direct sleep or a 
> > > > voluntary preempt sleep) will happen before the mmap_sem is taken, and thus 
> > > > reduces lock hold time (and thus the contention potential)
> > > 
> > > The slab fast path doesn't sleep. 
> > 
> > it does via might_sleep()
> 
> so turn of the voluntary preempt bullshit. 

I think voluntary preempt is generally a good idea, but we should make it optional
for down() since it can apparently cause bad side effects (like holding 
semaphores/mutexes for too long) 

There would be two possible ways: 

either have the default mutex_lock()/down() do a might_sleep()
with preemption and add a separate variant that doesn't preempt,
or have the default not preempt and add a separate variant
that does preempt. 

I think I would prefer the latter, because for preemption
it probably often makes more sense to do the preemption
on the up(), not the down().

On the other hand one could argue that it's safer to only
change it for mmap_sem, which would suggest a special variant
without preemption and keep the current default.

Actually the non preempt variants probably should still have a might_sleep() 
for debugging, but in a variant that doesn't do preemption (might_sleep_no_preempt()?) 
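
Roughly like this, for the "keep the current default, add a non-preempting
variant for mmap_sem" option (the _no_preempt names are of course made up):

	#ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
	# define might_sleep_no_preempt()	__might_sleep(__FILE__, __LINE__)
	#else
	# define might_sleep_no_preempt()	do { } while (0)
	#endif

	static inline void down_write_no_preempt(struct rw_semaphore *sem)
	{
		might_sleep_no_preempt();	/* debug check, no cond_resched() */
		__down_write(sem);		/* skips the might_sleep() in down_write() */
	}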

Ingo, what do you think?

-Andi

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 2/3] fast VMA recycling
  2006-02-24 19:05         ` Andi Kleen
@ 2006-02-24 19:09           ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2006-02-24 19:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Christoph Hellwig, Arjan van de Ven, linux-kernel, akpm, mingo

On Fri, Feb 24, 2006 at 08:05:16PM +0100, Andi Kleen wrote:
> I think voluntary preempt is generally a good idea, but we should make it optional
> for down() since it can apparently cause bad side effects (like holding
> semaphores/mutexes for too long).
> 
> There would be two possible ways:
> 
> either have the default mutex_lock()/down() do a might_sleep() with preemption
> and provide a separate variant that doesn't preempt, or have the default not
> preempt and provide a separate variant that does preempt.

Better to remove the stupid cond_resched() from might_sleep(), which from its
naming is already a debug thing, and add it to those places where it makes
sense.
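
To illustrate (a rough sketch only; example_mmap_path() is a made-up caller,
not real code): might_sleep() shrinks to the bare debug check, and the
voluntary preemption point is written out explicitly where it is actually
wanted, outside the critical section:

#include <linux/sched.h>
#include <linux/rwsem.h>

/* made-up caller: the reschedule point moves out of down_write() itself */
static int example_mmap_path(struct mm_struct *mm)
{
	down_write(&mm->mmap_sem);	/* would now only do the debug check */
	/* ... create or merge VMAs ... */
	up_write(&mm->mmap_sem);

	cond_resched();			/* explicit, after the lock is dropped */
	return 0;
}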


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-24 12:27                   ` Andi Kleen
  2006-02-24 15:31                     ` Andrea Arcangeli
@ 2006-02-25 16:48                     ` Nick Piggin
  2006-02-25 17:22                       ` Nick Piggin
  1 sibling, 1 reply; 32+ messages in thread
From: Nick Piggin @ 2006-02-25 16:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel,
	akpm, andrea

[-- Attachment #1: Type: text/plain, Size: 1595 bytes --]

Andi Kleen wrote:
> On Friday 24 February 2006 10:26, Nick Piggin wrote:
> 
> 
>>[aside]
>>Actually I have a scalability improvement for rwsems, that moves the
>>actual task wakeups out from underneath the rwsem spinlock in the up()
>>paths. This was useful exactly on a mixed read+write workload on mmap_sem.
>>
>>The difference was quite large for the "generic rwsem" algorithm because
>>it uses the spinlock in fastpaths a lot more than the xadd algorithm. I
>>think x86-64 uses the former, which is what I presume you're testing with?
> 
> 
> I used the generic algorithm because Andrea originally expressed some doubts
> on the correctness of the xadd algorithms, and after trying to understand them
> I wasn't sure myself. Generic was the safer choice.
> 
> But if someone can show convincing numbers that XADD rwsems are faster
> for some workload we can switch. I guess they are tested well enough now
> on i386.
> 
> Or would your scalability improvement remove that difference (if it really exists)?
> 

Here is a forward port of my scalability improvement, untested.

Unfortunately I didn't include absolute performance results, but the
changelog indicates that there was some noticeable delta between the
rwsem implementations (and that's what I vaguely remember).

Note: this was with volanomark, which can be quite variable at the
best of times IIRC.

Note 2: this was on a 16-way NUMAQ, whose interconnect had the
performance of a string between two cans compared to a smaller modern
x86-64 box, so it may be harder to notice an improvement.

-- 
SUSE Labs, Novell Inc.

[-- Attachment #2: rwsem-scale.patch --]
[-- Type: text/plain, Size: 12243 bytes --]

Move wakeups out from under the rwsem's wait_lock spinlock.

This reduces that lock's contention by a factor of around
10 on a 16-way NUMAQ running volanomark; however, cacheline
contention on the rwsem's "activity" drowns out these
small improvements when using the i386 "optimised" rwsem:

unpatched:
55802519 total                                     32.3097
23325323 default_idle                             364458.1719
22084349 .text.lock.futex                         82404.2873
2369107 queue_me                                 24678.1979
1875296 unqueue_me                               9767.1667
1202258 .text.lock.rwsem                         46240.6923
941801 finish_task_switch                       7357.8203
787101 __wake_up                                12298.4531
645252 drop_key_refs                            13442.7500
362789 futex_wait                               839.7894
333294 futex_wake                               1487.9196
146797 rwsem_down_read_failed                   436.8958
 82788 .text.lock.dev                           221.3583
 81221 try_to_wake_up                           133.5872

+rwsem-scale:
58120260 total                                     33.6458
25482132 default_idle                             398158.3125
22774675 .text.lock.futex                         84980.1306
2517797 queue_me                                 26227.0521
1953424 unqueue_me                               10174.0833
1063068 finish_task_switch                       8305.2188
834793 __wake_up                                13043.6406
674570 drop_key_refs                            14053.5417
371811 futex_wait                               860.6736
343398 futex_wake                               1533.0268
155419 try_to_wake_up                           255.6234
114704 .text.lock.rwsem                         4411.6923

The rwsem-spinlock implementation, however, is improved
significantly more, and now gets volanomark performance
similar to the xadd rwsem:

unpatched:
30850964 total                                     18.1787
18986006 default_idle                             296656.3438
3989183 .text.lock.rwsem_spinlock                40294.7778
2990161 .text.lock.futex                         32501.7500
549707 finish_task_switch                       4294.5859
535327 __down_read                              3717.5486
452721 queue_me                                 4715.8438
439725 __up_read                                9160.9375
396273 __wake_up                                6191.7656
326595 unqueue_me                               1701.0156

+rwsem-scale:
25378268 total                                     14.9537
13325514 default_idle                             208211.1562
3675634 .text.lock.futex                         39952.5435
2908629 .text.lock.rwsem_spinlock                28239.1165
628115 __down_read                              4361.9097
607417 finish_task_switch                       4745.4453
588031 queue_me                                 6125.3229
571169 __up_read                                11899.3542
436795 __wake_up                                6824.9219
416788 unqueue_me                               2170.7708


Index: linux-2.6/lib/rwsem.c
===================================================================
--- linux-2.6.orig/lib/rwsem.c	2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem.c	2006-02-26 03:37:04.000000000 +1100
@@ -36,14 +36,15 @@ void rwsemtrace(struct rw_semaphore *sem
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if downgrading is false
+ *
+ * The spinlock will be dropped by this function.
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int downgrading)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int downgrading, unsigned long flags)
 {
+	LIST_HEAD(wake_list);
 	struct rwsem_waiter *waiter;
-	struct task_struct *tsk;
-	struct list_head *next;
-	signed long oldcount, woken, loop;
+	signed long oldcount, woken;
 
 	rwsemtrace(sem, "Entering __rwsem_do_wake");
 
@@ -72,12 +73,8 @@ __rwsem_do_wake(struct rw_semaphore *sem
 	 * It is an allocated on the waiter's stack and may become invalid at
 	 * any time after that point (due to a wakeup from another source).
 	 */
-	list_del(&waiter->list);
-	tsk = waiter->task;
-	smp_mb();
-	waiter->task = NULL;
-	wake_up_process(tsk);
-	put_task_struct(tsk);
+	list_move_tail(&waiter->list, &wake_list);
+	waiter->flags = 0;
 	goto out;
 
 	/* don't want to wake any writers */
@@ -94,41 +91,37 @@ __rwsem_do_wake(struct rw_semaphore *sem
  readers_only:
 	woken = 0;
 	do {
-		woken++;
-
-		if (waiter->list.next == &sem->wait_list)
+		list_move_tail(&waiter->list, &wake_list);
+		waiter->flags = 0;
+		woken += RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
+		if (list_empty(&sem->wait_list))
 			break;
 
-		waiter = list_entry(waiter->list.next,
+		waiter = list_entry(sem->wait_list.next,
 					struct rwsem_waiter, list);
-
 	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
 
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS - RWSEM_WAITING_BIAS;
 	if (!downgrading)
 		/* we'd already done one increment earlier */
 		woken -= RWSEM_ACTIVE_BIAS;
 
 	rwsem_atomic_add(woken, sem);
 
-	next = sem->wait_list.next;
-	for (; loop > 0; loop--) {
-		waiter = list_entry(next, struct rwsem_waiter, list);
-		next = waiter->list.next;
+out:
+	spin_unlock_irqrestore(&sem->wait_lock, flags);
+	while (!list_empty(&wake_list)) {
+		struct task_struct *tsk;
+		waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
+		list_del(&waiter->list);
 		tsk = waiter->task;
-		smp_mb();
 		waiter->task = NULL;
+		smp_mb();
 		wake_up_process(tsk);
 		put_task_struct(tsk);
 	}
 
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
-
- out:
 	rwsemtrace(sem, "Leaving __rwsem_do_wake");
-	return sem;
+	return;
 
 	/* undo the change to count, but check for a transition 1->0 */
  undo:
@@ -145,12 +138,13 @@ rwsem_down_failed_common(struct rw_semap
 			struct rwsem_waiter *waiter, signed long adjustment)
 {
 	struct task_struct *tsk = current;
+	unsigned long flags;
 	signed long count;
 
 	set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 
 	/* set up my own style of waitqueue */
-	spin_lock_irq(&sem->wait_lock);
+	spin_lock_irqsave(&sem->wait_lock, flags);
 	waiter->task = tsk;
 	get_task_struct(tsk);
 
@@ -161,14 +155,12 @@ rwsem_down_failed_common(struct rw_semap
 
 	/* if there are no active locks, wake the front queued process(es) up */
 	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem, 0);
-
-	spin_unlock_irq(&sem->wait_lock);
+		__rwsem_do_wake(sem, 0, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter->task)
-			break;
+	while (waiter->task) {
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
@@ -227,9 +219,9 @@ struct rw_semaphore fastcall *rwsem_wake
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 0, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	rwsemtrace(sem, "Leaving rwsem_wake");
 
@@ -251,9 +243,9 @@ struct rw_semaphore fastcall *rwsem_down
 
 	/* do nothing if list empty */
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 1, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	rwsemtrace(sem, "Leaving rwsem_downgrade_wake");
 	return sem;
Index: linux-2.6/lib/rwsem-spinlock.c
===================================================================
--- linux-2.6.orig/lib/rwsem-spinlock.c	2006-02-26 03:05:01.000000000 +1100
+++ linux-2.6/lib/rwsem-spinlock.c	2006-02-26 03:37:42.000000000 +1100
@@ -49,11 +49,11 @@ void fastcall init_rwsem(struct rw_semap
  * - woken process blocks are discarded from the list after having task zeroed
  * - writers are only woken if wakewrite is non-zero
  */
-static inline struct rw_semaphore *
-__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite)
+static inline void
+__rwsem_do_wake(struct rw_semaphore *sem, int wakewrite, unsigned long flags)
 {
+	LIST_HEAD(wake_list);
 	struct rwsem_waiter *waiter;
-	struct task_struct *tsk;
 	int woken;
 
 	rwsemtrace(sem, "Entering __rwsem_do_wake");
@@ -73,46 +73,46 @@ __rwsem_do_wake(struct rw_semaphore *sem
 	 */
 	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
 		sem->activity = -1;
-		list_del(&waiter->list);
-		tsk = waiter->task;
-		/* Don't touch waiter after ->task has been NULLed */
-		smp_mb();
-		waiter->task = NULL;
-		wake_up_process(tsk);
-		put_task_struct(tsk);
+		list_move_tail(&waiter->list, &wake_list);
 		goto out;
 	}
 
 	/* grant an infinite number of read locks to the front of the queue */
  dont_wake_writers:
 	woken = 0;
-	while (waiter->flags & RWSEM_WAITING_FOR_READ) {
-		struct list_head *next = waiter->list.next;
+	do {
+		list_move_tail(&waiter->list, &wake_list);
+		woken++;
+		if (list_empty(&sem->wait_list))
+			break;
+		waiter = list_entry(sem->wait_list.next,
+				struct rwsem_waiter, list);
+	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
+
+	sem->activity += woken;
 
+out:
+	spin_unlock_irqrestore(&sem->wait_lock, flags);
+	while (!list_empty(&wake_list)) {
+		struct task_struct *tsk;
+		waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
 		list_del(&waiter->list);
 		tsk = waiter->task;
-		smp_mb();
 		waiter->task = NULL;
+		smp_mb();
 		wake_up_process(tsk);
 		put_task_struct(tsk);
-		woken++;
-		if (list_empty(&sem->wait_list))
-			break;
-		waiter = list_entry(next, struct rwsem_waiter, list);
 	}
 
-	sem->activity += woken;
-
- out:
 	rwsemtrace(sem, "Leaving __rwsem_do_wake");
-	return sem;
 }
 
 /*
  * wake a single writer
+ * called with wait_lock locked and returns with it unlocked.
  */
-static inline struct rw_semaphore *
-__rwsem_wake_one_writer(struct rw_semaphore *sem)
+static inline void
+__rwsem_wake_one_writer(struct rw_semaphore *sem, unsigned long flags)
 {
 	struct rwsem_waiter *waiter;
 	struct task_struct *tsk;
@@ -121,13 +121,13 @@ __rwsem_wake_one_writer(struct rw_semaph
 
 	waiter = list_entry(sem->wait_list.next, struct rwsem_waiter, list);
 	list_del(&waiter->list);
+	spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	tsk = waiter->task;
-	smp_mb();
 	waiter->task = NULL;
+	smp_mb();
 	wake_up_process(tsk);
 	put_task_struct(tsk);
-	return sem;
 }
 
 /*
@@ -163,9 +163,7 @@ void fastcall __sched __down_read(struct
 	spin_unlock_irq(&sem->wait_lock);
 
 	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.task)
-			break;
+	while (waiter.task) {
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
@@ -234,9 +232,7 @@ void fastcall __sched __down_write(struc
 	spin_unlock_irq(&sem->wait_lock);
 
 	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.task)
-			break;
+	while (waiter.task) {
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
@@ -283,9 +279,9 @@ void fastcall __up_read(struct rw_semaph
 	spin_lock_irqsave(&sem->wait_lock, flags);
 
 	if (--sem->activity == 0 && !list_empty(&sem->wait_list))
-		sem = __rwsem_wake_one_writer(sem);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_wake_one_writer(sem, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	rwsemtrace(sem, "Leaving __up_read");
 }
@@ -303,9 +299,9 @@ void fastcall __up_write(struct rw_semap
 
 	sem->activity = 0;
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 1);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 1, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	rwsemtrace(sem, "Leaving __up_write");
 }
@@ -324,9 +320,9 @@ void fastcall __downgrade_write(struct r
 
 	sem->activity = 1;
 	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem, 0);
-
-	spin_unlock_irqrestore(&sem->wait_lock, flags);
+		__rwsem_do_wake(sem, 0, flags);
+	else
+		spin_unlock_irqrestore(&sem->wait_lock, flags);
 
 	rwsemtrace(sem, "Leaving __downgrade_write");
 }

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-25 16:48                     ` Nick Piggin
@ 2006-02-25 17:22                       ` Nick Piggin
  0 siblings, 0 replies; 32+ messages in thread
From: Nick Piggin @ 2006-02-25 17:22 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Arjan van de Ven, Ingo Molnar, Arjan van de Ven, linux-kernel,
	akpm, andrea

Nick Piggin wrote:

> Here is a forward port of my scalability improvement, untested.
> 
> Unfortunately I didn't include absolute performance results, but the
> changelog indicates that there was some noticable delta between the
> rwsem implementations (and that's what I vaguely remember).
> 
> Note: this was with volanomark, which can be quite variable at the
> best of times IIRC.
> 
> Note2: this was on a 16-way NUMAQ, which had the interconnect
> performance of a string between two cans compared to a modern x86-64
> of smaller size, so it may be harder to notice an improvement.
> 

Oh, I remember one performance caveat of this code on preempt kernels:
at very high loads (i.e. lots of tasks being woken in the up path),
the wakeup code would end up doing a lot of context switches to and
from the tasks it is waking up.

This could easily be solved by disabling preemption across the wakeups
(and would still be better in terms of interrupt latency and lock hold
times); however, I never got around to implementing that.
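
A rough, untested sketch of what that could look like, reusing the wakeup
loop from __rwsem_do_wake() in the patch above (the only change is the
hypothetical preempt_disable()/preempt_enable() pair around it):

	preempt_disable();
	spin_unlock_irqrestore(&sem->wait_lock, flags);
	while (!list_empty(&wake_list)) {
		struct task_struct *tsk;

		waiter = list_entry(wake_list.next, struct rwsem_waiter, list);
		list_del(&waiter->list);
		tsk = waiter->task;
		waiter->task = NULL;
		smp_mb();
		wake_up_process(tsk);
		put_task_struct(tsk);
	}
	/* only now let the freshly woken tasks preempt us */
	preempt_enable();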

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Patch 3/3] prepopulate/cache cleared pages
  2006-02-23 12:41     ` Ingo Molnar
  2006-02-23 13:06       ` Andi Kleen
@ 2006-02-28 22:30       ` Pavel Machek
  1 sibling, 0 replies; 32+ messages in thread
From: Pavel Machek @ 2006-02-28 22:30 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andi Kleen, Arjan van de Ven, linux-kernel, akpm

On Thu 23-02-06 13:41:53, Ingo Molnar wrote:
> 
> * Andi Kleen <ak@suse.de> wrote:
> 
> > On Thursday 23 February 2006 10:29, Arjan van de Ven wrote:
> > > This patch adds an entry for a cleared page to the task struct. The main
> > > purpose of this patch is to be able to pre-allocate and clear a page in a
> > > pagefault scenario before taking any locks (esp mmap_sem),
> > > opportunistically. Allocating+clearing a page is an very expensive 
> > > operation that currently increases lock hold times quite bit (in a threaded 
> > > environment that allocates/use/frees memory on a regular basis, this leads
> > > to contention).
> > > 
> > > This is probably the most controversial patch of the 3, since there is
> > > a potential to take up 1 page per thread in this cache. In practice it's
> > > not as bad as it sounds (a large degree of the pagefaults are anonymous 
> > > and thus immediately use up the page). One could argue "let the VM reap
> > > these" but that has a few downsides; it increases locking needs but more,
> > > clearing a page is relatively expensive, if the VM reaps the page again
> > > in case it wasn't needed, the work was just wasted.
> > 
> > Looks like an incredibly bad hack. What workload was that again where
> > it helps? And how much? I think before we can consider adding that
> > ugly code you would need a far better rationale.
> 
> yes, the patch is controversial technologically, and Arjan pointed it 
> out above. This is nothing new - and Arjan probably submitted this to 
> lkml so that he can get constructive feedback.

Actually, I think I have to back Andi here. This looked like a patch for
inclusion (signed-off, cc-ed Andrew). And yes, Arjan pointed out that
it is controversial, but the way the patch was worded I could imagine
Andrew merging it...
								Pavel
-- 
Web maintainer for suspend.sf.net (www.sf.net/projects/suspend) wanted...

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2006-02-28 22:32 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-02-23  9:17 [Patch 0/3] threaded mmap tweaks Arjan van de Ven
2006-02-23  9:29 ` [Patch 3/3] prepopulate/cache cleared pages Arjan van de Ven
2006-02-23  9:41   ` Andi Kleen
2006-02-23 12:41     ` Ingo Molnar
2006-02-23 13:06       ` Andi Kleen
2006-02-23 13:15         ` Nick Piggin
2006-02-23 13:29           ` Ingo Molnar
2006-02-24  6:36             ` Nick Piggin
2006-02-24  6:49               ` Ingo Molnar
2006-02-24  7:01                 ` Nick Piggin
2006-02-24 12:33                   ` Andi Kleen
2006-02-24 12:55                     ` Hugh Dickins
2006-02-24  9:15               ` Arjan van de Ven
2006-02-24  9:26                 ` Nick Piggin
2006-02-24 12:27                   ` Andi Kleen
2006-02-24 15:31                     ` Andrea Arcangeli
2006-02-25 16:48                     ` Nick Piggin
2006-02-25 17:22                       ` Nick Piggin
2006-02-28 22:30       ` Pavel Machek
2006-02-23 18:25   ` Paul Jackson
2006-02-23  9:30 ` [Patch 2/3] fast VMA recycling Arjan van de Ven
2006-02-23  9:42   ` Andi Kleen
2006-02-23  9:48     ` Arjan van de Ven
2006-02-23 10:05       ` Andi Kleen
2006-02-23 10:15         ` Arjan van de Ven
2006-02-23 11:00           ` Andi Kleen
2006-02-23 11:22             ` Arjan van de Ven
2006-02-23 11:57               ` Andi Kleen
2006-02-24 18:52       ` Christoph Hellwig
2006-02-24 19:05         ` Andi Kleen
2006-02-24 19:09           ` Christoph Hellwig
2006-02-23 16:37   ` Benjamin LaHaise

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).