On Tue, 5 Jan 2010 14:30:46 +0900
KAMEZAWA Hiroyuki wrote:
> ==
> Above codes are a bit heavy (and buggy). I have some fixes.
> Here is my latest one.

But I myself think this should be improved further, and I don't think this
patch is bug-free. I have some concerns around fork() etc., and I am still
considering how to avoid atomic ops. I post this just as a reference rather
than as a cut-out from old e-mails.

Here is a comparison before and after the patch. A test program (attached;
a rough illustrative sketch is also appended at the end of this mail) does
the following in multiple threads:

  while (1) {
      touch memory (with write)
      pthread_barrier_wait()
      fork() if I'm the first thread -> the child exit()s immediately
      wait()
      pthread_barrier_wait()
  }

Then look at the overheads with perf. This is the result on an 8-core/2-socket
host; the numbers are highly affected by the number of sockets.

Before:

# Samples: 1164449628090
#
# Overhead          Command             Shared Object  Symbol
# ........  ...............  ........................  ......
#
    43.23%  multi-fault-all  [kernel]                  [k] smp_invalidate_interrupt
    16.27%  multi-fault-all  [kernel]                  [k] flush_tlb_others_ipi
    11.55%  multi-fault-all  [kernel]                  [k] _raw_spin_lock_irqsave   <========(*)
     6.23%  multi-fault-all  [kernel]                  [k] intel_pmu_enable_all
     2.17%  multi-fault-all  [kernel]                  [k] _raw_spin_unlock_irqrestore
     1.59%  multi-fault-all  [kernel]                  [k] page_fault
     1.45%  multi-fault-all  [kernel]                  [k] do_wp_page
     1.35%  multi-fault-all  ./multi-fault-all-fork    [.] worker
     1.26%  multi-fault-all  [kernel]                  [k] handle_mm_fault
     1.19%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
     1.03%  multi-fault-all  [kernel]                  [k] invalidate_interrupt7
     1.00%  multi-fault-all  [kernel]                  [k] invalidate_interrupt6
     0.99%  multi-fault-all  [kernel]                  [k] invalidate_interrupt3

After:

# Samples: 181505050964
#
# Overhead          Command             Shared Object  Symbol
# ........  ...............  ........................  ......
#
    45.08%  multi-fault-all  [kernel]                  [k] smp_invalidate_interrupt
    19.45%  multi-fault-all  [kernel]                  [k] intel_pmu_enable_all
    14.17%  multi-fault-all  [kernel]                  [k] flush_tlb_others_ipi
     1.89%  multi-fault-all  [kernel]                  [k] do_wp_page
     1.58%  multi-fault-all  [kernel]                  [k] page_fault
     1.46%  multi-fault-all  ./multi-fault-all-fork    [.] worker
     1.26%  multi-fault-all  [kernel]                  [k] do_page_fault
     1.14%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
     1.10%  multi-fault-all  [kernel]                  [k] flush_tlb_page
     1.09%  multi-fault-all  [kernel]                  [k] vma_put                  <========(**)
     0.99%  multi-fault-all  [kernel]                  [k] invalidate_interrupt0
     0.98%  multi-fault-all  [kernel]                  [k] find_vma_speculative     <=====(**)
     0.81%  multi-fault-all  [kernel]                  [k] invalidate_interrupt3
     0.81%  multi-fault-all  [kernel]                  [k] native_apic_mem_write
     0.79%  multi-fault-all  [kernel]                  [k] invalidate_interrupt7
     0.76%  multi-fault-all  [kernel]                  [k] invalidate_interrupt5

(*) is the removed overhead of mmap_sem and (**) is the added overhead of the
atomic ops. And yes, ~1% is still not ideally small, and it's unknown how badly
this false sharing of atomic ops can hurt a very big SMP machine... maybe very
badly. BTW, I'm not sure why intel_pmu_enable_all() is so big... because of
fork()?

My patch is below, just as a reference to my work from last year; it is not
very clean yet. I need more time for improvements (and I don't insist on this
particular implementation). When you test this, please turn off SPINLOCK_DEBUG
to enable the split page table lock.

Regards,
-Kame
==
From: KAMEZAWA Hiroyuki

Asynchronous page fault.

This patch avoids taking mmap_sem in the usual page fault path. When running
highly multi-threaded programs, mm->mmap_sem can consume much CPU because of
false sharing when page faults happen in parallel. (Running right after fork()
is a typical case, I think.)

This patch uses a speculative vma lookup to reduce that cost.

Considering the vma lookup itself, i.e. the rb-tree lookup, the only operation
we do is checking node->rb_left/rb_right, and there is no complicated
operation. At page fault time there is, in most cases, no need to access the
sorted vma list or the prev/next vma. Except for stack expansion, we always
need a vma which contains the page-fault address. So we can walk the vma
rb-tree in a speculative way: even if an rb-tree rotation occurs while we walk
the tree for the lookup, we just miss the vma, without an oops. In other
words, we can _try_ to find the vma in a lockless manner. If that fails, a
retry is fine... we take the lock and look the vma up as usual.

For the lockless walk, this uses RCU and adds find_vma_speculative(), plus a
per-vma wait queue and reference count. This refcnt + wait_queue combination
guarantees that no thread is still accessing the vma when we call the
subsystem's unmap functions.

Changelog:
 - removed rcu_xxx macros.
 - fixed reference count handling bug at vma_put().
 - removed atomic_add_unless() etc. A big magic number is used to handle the
   race instead.

Signed-off-by: KAMEZAWA Hiroyuki
---
 arch/x86/mm/fault.c      |   14 ++++++
 include/linux/mm.h       |   14 ++++++
 include/linux/mm_types.h |    5 ++
 mm/mmap.c                |   98 +++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 127 insertions(+), 4 deletions(-)

Index: linux-2.6.33-rc2/include/linux/mm_types.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm_types.h
+++ linux-2.6.33-rc2/include/linux/mm_types.h
@@ -11,6 +11,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -180,6 +181,10 @@ struct vm_area_struct {
        void * vm_private_data;         /* was vm_pte (shared mem) */
        unsigned long vm_truncate_count;/* truncate_count or restart_addr */

+       atomic_t refcnt;
+       wait_queue_head_t wait_queue;
+       struct rcu_head rcuhead;
+
 #ifndef CONFIG_MMU
        struct vm_region *vm_region;    /* NOMMU mapping region */
 #endif
Index: linux-2.6.33-rc2/mm/mmap.c
===================================================================
--- linux-2.6.33-rc2.orig/mm/mmap.c
+++ linux-2.6.33-rc2/mm/mmap.c
@@ -187,6 +187,26 @@ error:
        return -ENOMEM;
 }

+static void __free_vma_rcu_cb(struct rcu_head *head)
+{
+       struct vm_area_struct *vma;
+       vma = container_of(head, struct vm_area_struct, rcuhead);
+       kmem_cache_free(vm_area_cachep, vma);
+}
+/* Call this if the vma was linked to the rb-tree. */
+static void free_vma_rcu(struct vm_area_struct *vma)
+{
+       call_rcu(&vma->rcuhead, __free_vma_rcu_cb);
+}
+#define VMA_FREE_MAGIC (10000000)
+/* Called when the vma is unlinked; waits until all racy accesses are done. */
+static void invalidate_vma_before_free(struct vm_area_struct *vma)
+{
+       atomic_sub(VMA_FREE_MAGIC, &vma->refcnt);
+       wait_event(vma->wait_queue,
+                  (atomic_read(&vma->refcnt) == -VMA_FREE_MAGIC));
+}
+
 /*
  * Requires inode->i_mapping->i_mmap_lock
  */
@@ -238,7 +258,7 @@ static struct vm_area_struct *remove_vma
                        removed_exe_file_vma(vma->vm_mm);
        }
        mpol_put(vma_policy(vma));
-       kmem_cache_free(vm_area_cachep, vma);
+       free_vma_rcu(vma);
        return next;
 }

@@ -404,8 +424,12 @@ __vma_link_list(struct mm_struct *mm, st
 void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
                struct rb_node **rb_link, struct rb_node *rb_parent)
 {
+       atomic_set(&vma->refcnt, 0);
+       init_waitqueue_head(&vma->wait_queue);
        rb_link_node(&vma->vm_rb, rb_parent, rb_link);
        rb_insert_color(&vma->vm_rb, &mm->mm_rb);
+       /* For speculative vma lookup */
+       smp_wmb();
 }

 static void __vma_link_file(struct vm_area_struct *vma)
@@ -490,6 +514,7 @@ __vma_unlink(struct mm_struct *mm, struc
        rb_erase(&vma->vm_rb, &mm->mm_rb);
        if (mm->mmap_cache == vma)
                mm->mmap_cache = prev;
+       smp_wmb();
 }

 /*
@@ -614,6 +639,7 @@ again:                 remove_next = 1 + (end > next->
                 * us to remove next before dropping the locks.
                 */
                __vma_unlink(mm, next, vma);
+               invalidate_vma_before_free(next);
                if (file)
                        __remove_shared_vm_struct(next, file, mapping);
                if (next->anon_vma)
@@ -640,7 +666,7 @@ again:                 remove_next = 1 + (end > next->
                }
                mm->map_count--;
                mpol_put(vma_policy(next));
-               kmem_cache_free(vm_area_cachep, next);
+               free_vma_rcu(next);
                /*
                 * In mprotect's case 6 (see comments on vma_merge),
                 * we must remove another next too. It would clutter
@@ -1544,6 +1570,71 @@ out:
 }
 /*
+ * Returns the vma which contains the given address. This scans the rb-tree
+ * in a speculative way and increments the vma's reference count if found.
+ * Even if the vma exists in the rb-tree, this function may return NULL in a
+ * racy case, so it cannot be used to check whether a given address is valid.
+ */
+
+struct vm_area_struct *
+find_vma_speculative(struct mm_struct *mm, unsigned long addr)
+{
+       struct vm_area_struct *vma = NULL;
+       struct vm_area_struct *vma_tmp;
+       struct rb_node *rb_node;
+
+       if (unlikely(!mm))
+               return NULL;
+
+       rcu_read_lock();
+       /*
+        * Barrier against modification of the rb-tree.
+        * An rb-tree update is not an atomic op and no barrier is used while
+        * modifying it, so updates to the rb-tree can be reordered. This
+        * memory barrier pairs with vma_(un)link_rb() and avoids reading data
+        * that is too old to catch all changes made before rcu_read_lock().
+        *
+        * We may see a broken rb-tree and fail to find an existing vma, but
+        * that is ok: we are allowed to return NULL even if a valid vma
+        * exists. The caller will then use find_vma() with the read semaphore.
+        */
+
+       smp_read_barrier_depends();
+       rb_node = mm->mm_rb.rb_node;
+       vma = NULL;
+       while (rb_node) {
+               vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+
+               if (vma_tmp->vm_end > addr) {
+                       vma = vma_tmp;
+                       if (vma_tmp->vm_start <= addr)
+                               break;
+                       rb_node = rb_node->rb_left;
+               } else
+                       rb_node = rb_node->rb_right;
+       }
+       if (vma) {
+               if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
+                       if (atomic_inc_return(&vma->refcnt) < 0) {
+                               vma_put(vma);
+                               vma = NULL;
+                       }
+               } else
+                       vma = NULL;
+       }
+       rcu_read_unlock();
+       return vma;
+}
+
+void vma_put(struct vm_area_struct *vma)
+{
+       if ((atomic_dec_return(&vma->refcnt) == -VMA_FREE_MAGIC) &&
+           waitqueue_active(&vma->wait_queue))
+               wake_up(&vma->wait_queue);
+       return;
+}
+
+/*
  * Verify that the stack growth is acceptable and
  * update accounting. This is shared with both the
  * grow-up and grow-down cases.
  */
@@ -1808,6 +1899,7 @@ detach_vmas_to_be_unmapped(struct mm_str
        insertion_point = (prev ? &prev->vm_next : &mm->mmap);
        do {
                rb_erase(&vma->vm_rb, &mm->mm_rb);
+               invalidate_vma_before_free(vma);
                mm->map_count--;
                tail_vma = vma;
                vma = vma->vm_next;
Index: linux-2.6.33-rc2/include/linux/mm.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm.h
+++ linux-2.6.33-rc2/include/linux/mm.h
@@ -1235,6 +1235,20 @@ static inline struct vm_area_struct * fi
        return vma;
 }
+/*
+ * Look up the vma for a given address in a speculative way. This allows
+ * lockless lookup of vmas, but in a racy case the vma may not be found; you
+ * have to call find_vma() under the read lock of mm->mmap_sem even if this
+ * function returns NULL. The returned vma's reference count is incremented
+ * to show it is accessed asynchronously, so the caller has to call vma_put().
+ *
+ * Unlike find_vma(), this returns a vma which contains the specified address.
+ * It does not return the nearest vma.
+ */
+extern struct vm_area_struct *find_vma_speculative(struct mm_struct *mm,
+               unsigned long addr);
+void vma_put(struct vm_area_struct *vma);
+
 static inline unsigned long vma_pages(struct vm_area_struct *vma)
 {
        return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
 }
Index: linux-2.6.33-rc2/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.33-rc2.orig/arch/x86/mm/fault.c
+++ linux-2.6.33-rc2/arch/x86/mm/fault.c
@@ -952,6 +952,7 @@ do_page_fault(struct pt_regs *regs, unsi
        struct mm_struct *mm;
        int write;
        int fault;
+       int speculative = 0;

        tsk = current;
        mm = tsk->mm;
@@ -1040,6 +1041,14 @@ do_page_fault(struct pt_regs *regs, unsi
                return;
        }

+       if ((error_code & PF_USER)) {
+               vma = find_vma_speculative(mm, address);
+               if (vma) {
+                       speculative = 1;
+                       goto good_area;
+               }
+       }
+
        /*
         * When running in the kernel we expect faults to occur only to
         * addresses in user space. All other faults represent errors in
@@ -1136,5 +1145,8 @@ good_area:

        check_v8086_mode(regs, address, tsk);

-       up_read(&mm->mmap_sem);
+       if (speculative)
+               vma_put(vma);
+       else
+               up_read(&mm->mmap_sem);
 }
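
==
P.S. For reference, here is a minimal sketch of the kind of test loop
described at the top of this mail. This is NOT the multi-fault-all-fork
program attached to the original mail; the thread count, buffer size, and
helper names below are assumptions chosen only for illustration. Build with
-pthread.

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_THREADS      8                       /* assumed thread count */
#define CHUNK_SIZE      (16 * 1024 * 1024)      /* assumed per-thread working set */

static pthread_barrier_t barrier;

static void *worker(void *arg)
{
        long id = (long)arg;
        char *buf = malloc(CHUNK_SIZE);

        if (!buf)
                return NULL;

        while (1) {
                /* touch memory (with write) -> page faults in parallel */
                memset(buf, id, CHUNK_SIZE);

                pthread_barrier_wait(&barrier);

                /* fork() if I'm the first thread; the child exits immediately */
                if (id == 0) {
                        pid_t pid = fork();

                        if (pid == 0)
                                _exit(0);
                        waitpid(pid, NULL, 0);
                }
                pthread_barrier_wait(&barrier);
        }
        return NULL;
}

int main(void)
{
        pthread_t tid[NR_THREADS];
        long i;

        pthread_barrier_init(&barrier, NULL, NR_THREADS);
        for (i = 0; i < NR_THREADS; i++)
                pthread_create(&tid[i], NULL, worker, (void *)i);
        for (i = 0; i < NR_THREADS; i++)
                pthread_join(tid[i], NULL);
        return 0;
}

After each fork() the parent's writable private pages become write-protected
for copy-on-write, so every thread's next write faults again, which is
presumably why do_wp_page and the TLB-flush IPIs show up so prominently in the
profiles above.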