linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] asynchronous page fault.
@ 2009-12-25  1:51 KAMEZAWA Hiroyuki
  2009-12-27  9:47 ` Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-25  1:51 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, minchan.kim, cl


Speculative page fault v3.

This version is much simpler than the old ones and doesn't use mm_accessor;
it uses RCU instead. It is based on linux-2.6.33-rc2.

This patch is just a toy, but it shows that...
 - Once the RB-tree is RCU-aware and lock-free on the read side, we can avoid
   mmap_sem in the page fault path.
So what we need is not mm_accessor but an RCU-aware RB-tree, I think.

But yes, I may be missing something critical ;)

After the patch, the perf statistics look as follows. The test program is attached.
  
# Samples: 1331231315119
#
# Overhead          Command             Shared Object  Symbol
# ........  ...............  ........................  ......
#
    28.41%  multi-fault-all  [kernel]                  [k] clear_page_c
            |
            --- clear_page_c
                __alloc_pages_nodemask
                handle_mm_fault
                do_page_fault
                page_fault
                0x400950
               |
                --100.00%-- (nil)

    21.69%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--81.85%-- free_pcppages_bulk
               |          free_hot_cold_page
               |          __pagevec_free
               |          release_pages
               |          free_pages_and_swap_cache


I'll be mostly offline next week.

Minchan, in this version I didn't add a CONFIG option and some other things
that were recommended, just out of laziness. Sorry.

=
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Asynchronous page fault.

This patch avoids taking mmap_sem in the usual page fault path. When a highly
multi-threaded program takes page faults in parallel, mm->mmap_sem can burn a
lot of CPU because of false sharing (running right after fork() is a typical
case, I think). This patch uses a speculative vma lookup to reduce that cost.

Considering the vma lookup, i.e. the rb-tree walk, the only operation we do is
checking node->rb_left/rb_right; there is no complicated operation. At page
fault time there is usually no need to access the sorted vma list or the
prev/next vma: except for stack expansion, we always want the one vma that
contains the faulting address. So we can walk the vma rb-tree in a speculative
way. Even if an rb-tree rotation occurs while we walk the tree for the lookup,
we just miss the vma without an oops. In other words, we can _try_ to find the
vma in a lockless manner; if that fails, retrying is fine: we take the lock and
access the vma as usual.

For the lockless walk, this uses RCU and adds find_vma_speculative(), plus a
per-vma wait queue and reference count. The refcnt + wait queue guarantee that
no thread is still accessing the vma when we call the subsystem's unmap
functions.
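
For illustration, the intended caller pattern is roughly the sketch below. This
is an editor's illustration mirroring the arch/x86 hunk further down (the
handle_mm_fault() call is the one already in 2.6.33; error handling omitted),
not part of the patch itself:

	vma = find_vma_speculative(mm, address);
	if (vma) {
		/* lockless fast path: the vma is pinned by its refcount */
		fault = handle_mm_fault(mm, vma, address,
					write ? FAULT_FLAG_WRITE : 0);
		vma_put(vma);
	} else {
		/* racy miss: fall back to the usual locked lookup */
		down_read(&mm->mmap_sem);
		vma = find_vma(mm, address);
		/* ... normal slow path ... */
		up_read(&mm->mmap_sem);
	}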

Here is a test result of my tiny test program on an 8-core/2-socket machine.
It measures how many page faults can occur in parallel within 60 seconds.

[root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8

 Performance counter stats for './multi-fault-all-split 8' (5 runs):

       17481387  page-faults                ( +-   0.409% )
      509914595  cache-misses               ( +-   0.239% )

   60.002277793  seconds time elapsed   ( +-   0.000% )


[root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8


 Performance counter stats for './multi-fault-all-split 8' (5 runs):

       35949073  page-faults                ( +-   0.364% )
      473091100  cache-misses               ( +-   0.304% )

   60.005444117  seconds time elapsed   ( +-   0.004% )



Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 arch/x86/mm/fault.c      |   14 +++++++-
 include/linux/mm.h       |   14 ++++++++
 include/linux/mm_types.h |    5 +++
 lib/rbtree.c             |   34 ++++++++++----------
 mm/mmap.c                |   77 +++++++++++++++++++++++++++++++++++++++++++++--
 5 files changed, 123 insertions(+), 21 deletions(-)

Index: linux-2.6.33-rc2/lib/rbtree.c
===================================================================
--- linux-2.6.33-rc2.orig/lib/rbtree.c
+++ linux-2.6.33-rc2/lib/rbtree.c
@@ -30,19 +30,19 @@ static void __rb_rotate_left(struct rb_n
 
 	if ((node->rb_right = right->rb_left))
 		rb_set_parent(right->rb_left, node);
-	right->rb_left = node;
+	rcu_assign_pointer(right->rb_left, node);
 
 	rb_set_parent(right, parent);
 
 	if (parent)
 	{
 		if (node == parent->rb_left)
-			parent->rb_left = right;
+			rcu_assign_pointer(parent->rb_left, right);
 		else
-			parent->rb_right = right;
+			rcu_assign_pointer(parent->rb_right, right);
 	}
 	else
-		root->rb_node = right;
+		rcu_assign_pointer(root->rb_node, right);
 	rb_set_parent(node, right);
 }
 
@@ -53,19 +53,19 @@ static void __rb_rotate_right(struct rb_
 
 	if ((node->rb_left = left->rb_right))
 		rb_set_parent(left->rb_right, node);
-	left->rb_right = node;
+	rcu_assign_pointer(left->rb_right, node);
 
 	rb_set_parent(left, parent);
 
 	if (parent)
 	{
 		if (node == parent->rb_right)
-			parent->rb_right = left;
+			rcu_assign_pointer(parent->rb_right, left);
 		else
-			parent->rb_left = left;
+			rcu_assign_pointer(parent->rb_left, left);
 	}
 	else
-		root->rb_node = left;
+		rcu_assign_pointer(root->rb_node, left);
 	rb_set_parent(node, left);
 }
 
@@ -234,11 +234,11 @@ void rb_erase(struct rb_node *node, stru
 
 		if (rb_parent(old)) {
 			if (rb_parent(old)->rb_left == old)
-				rb_parent(old)->rb_left = node;
+				rcu_assign_pointer(rb_parent(old)->rb_left, node);
 			else
-				rb_parent(old)->rb_right = node;
+				rcu_assign_pointer(rb_parent(old)->rb_right, node);
 		} else
-			root->rb_node = node;
+			rcu_assign_pointer(root->rb_node, node);
 
 		child = node->rb_right;
 		parent = rb_parent(node);
@@ -249,14 +249,14 @@ void rb_erase(struct rb_node *node, stru
 		} else {
 			if (child)
 				rb_set_parent(child, parent);
-			parent->rb_left = child;
+			rcu_assign_pointer(parent->rb_left, child);
 
-			node->rb_right = old->rb_right;
+			rcu_assign_pointer(node->rb_right, old->rb_right);
 			rb_set_parent(old->rb_right, node);
 		}
 
 		node->rb_parent_color = old->rb_parent_color;
-		node->rb_left = old->rb_left;
+		rcu_assign_pointer(node->rb_left, old->rb_left);
 		rb_set_parent(old->rb_left, node);
 
 		goto color;
@@ -270,12 +270,12 @@ void rb_erase(struct rb_node *node, stru
 	if (parent)
 	{
 		if (parent->rb_left == node)
-			parent->rb_left = child;
+			rcu_assign_pointer(parent->rb_left, child);
 		else
-			parent->rb_right = child;
+			rcu_assign_pointer(parent->rb_right, child);
 	}
 	else
-		root->rb_node = child;
+		rcu_assign_pointer(root->rb_node, child);
 
  color:
 	if (color == RB_BLACK)
Index: linux-2.6.33-rc2/include/linux/mm_types.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm_types.h
+++ linux-2.6.33-rc2/include/linux/mm_types.h
@@ -11,6 +11,7 @@
 #include <linux/rwsem.h>
 #include <linux/completion.h>
 #include <linux/cpumask.h>
+#include <linux/rcupdate.h>
 #include <linux/page-debug-flags.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
@@ -180,6 +181,10 @@ struct vm_area_struct {
 	void * vm_private_data;		/* was vm_pte (shared mem) */
 	unsigned long vm_truncate_count;/* truncate_count or restart_addr */
 
+	atomic_t refcnt;
+	wait_queue_head_t wait_queue;
+	struct rcu_head	rcuhead;
+
 #ifndef CONFIG_MMU
 	struct vm_region *vm_region;	/* NOMMU mapping region */
 #endif
Index: linux-2.6.33-rc2/mm/mmap.c
===================================================================
--- linux-2.6.33-rc2.orig/mm/mmap.c
+++ linux-2.6.33-rc2/mm/mmap.c
@@ -187,6 +187,24 @@ error:
 	return -ENOMEM;
 }
 
+static void __free_vma_rcu_cb(struct rcu_head *head)
+{
+	struct vm_area_struct *vma;
+	vma = container_of(head, struct vm_area_struct, rcuhead);
+	kmem_cache_free(vm_area_cachep, vma);
+}
+/* Call this if vma was linked to rb-tree */
+static void free_vma_rcu(struct vm_area_struct *vma)
+{
+	call_rcu(&vma->rcuhead, __free_vma_rcu_cb);
+}
+/* called when vma is unlinked and wait for all racy access.*/
+static void invalidate_vma_before_free(struct vm_area_struct *vma)
+{
+	atomic_dec(&vma->refcnt);
+	wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
+}
+
 /*
  * Requires inode->i_mapping->i_mmap_lock
  */
@@ -238,7 +256,7 @@ static struct vm_area_struct *remove_vma
 			removed_exe_file_vma(vma->vm_mm);
 	}
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma_rcu(vma);
 	return next;
 }
 
@@ -404,6 +422,8 @@ __vma_link_list(struct mm_struct *mm, st
 void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
 		struct rb_node **rb_link, struct rb_node *rb_parent)
 {
+	atomic_set(&vma->refcnt, 1);
+	init_waitqueue_head(&vma->wait_queue);
 	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
 	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
 }
@@ -614,6 +634,7 @@ again:			remove_next = 1 + (end > next->
 		 * us to remove next before dropping the locks.
 		 */
 		__vma_unlink(mm, next, vma);
+		invalidate_vma_before_free(next);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 		if (next->anon_vma)
@@ -640,7 +661,7 @@ again:			remove_next = 1 + (end > next->
 		}
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
+		free_vma_rcu(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
@@ -1544,6 +1565,55 @@ out:
 }
 
 /*
+ * Returns vma which contains given address. This scans rb-tree in speculative
+ * way and increment a reference count if found. Even if vma exists in rb-tree,
+ * this function may return NULL in racy case. So, this function cannot be used
+ * for checking whether given address is valid or not.
+ */
+struct vm_area_struct *
+find_vma_speculative(struct mm_struct *mm, unsigned long addr)
+{
+	struct vm_area_struct *vma = NULL;
+	struct vm_area_struct *vma_tmp;
+	struct rb_node *rb_node;
+
+	if (unlikely(!mm))
+		return NULL;;
+
+	rcu_read_lock();
+	rb_node = rcu_dereference(mm->mm_rb.rb_node);
+	vma = NULL;
+	while (rb_node) {
+		vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
+
+		if (vma_tmp->vm_end > addr) {
+			vma = vma_tmp;
+			if (vma_tmp->vm_start <= addr)
+				break;
+			rb_node = rcu_dereference(rb_node->rb_left);
+		} else
+			rb_node = rcu_dereference(rb_node->rb_right);
+	}
+	if (vma) {
+		if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
+			if (!atomic_inc_not_zero(&vma->refcnt))
+				vma = NULL;
+		} else
+			vma = NULL;
+	}
+	rcu_read_unlock();
+	return vma;
+}
+
+void vma_put(struct vm_area_struct *vma)
+{
+	if ((atomic_dec_return(&vma->refcnt) == 1) &&
+	    waitqueue_active(&vma->wait_queue))
+		wake_up(&vma->wait_queue);
+	return;
+}
+
+/*
  * Verify that the stack growth is acceptable and
  * update accounting. This is shared with both the
  * grow-up and grow-down cases.
@@ -1808,6 +1878,7 @@ detach_vmas_to_be_unmapped(struct mm_str
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	do {
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
+		invalidate_vma_before_free(vma);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
@@ -1819,7 +1890,7 @@ detach_vmas_to_be_unmapped(struct mm_str
 	else
 		addr = vma ?  vma->vm_start : mm->mmap_base;
 	mm->unmap_area(mm, addr);
-	mm->mmap_cache = NULL;		/* Kill the cache. */
+	mm->mmap_cache = NULL;	/* Kill the cache. */
 }
 
 /*
Index: linux-2.6.33-rc2/include/linux/mm.h
===================================================================
--- linux-2.6.33-rc2.orig/include/linux/mm.h
+++ linux-2.6.33-rc2/include/linux/mm.h
@@ -1235,6 +1235,20 @@ static inline struct vm_area_struct * fi
 	return vma;
 }
 
+/*
+ * Look up the vma covering the given address in a speculative way. This is a
+ * lockless lookup and may miss the vma in racy cases: if it returns NULL, you
+ * still have to retry with find_vma() under a read lock of mm->mmap_sem before
+ * concluding that the address is invalid. The returned vma's reference count
+ * is incremented to mark asynchronous access; the caller must call vma_put().
+ *
+ * Unlike find_vma(), this only returns a vma that contains the specified
+ * address; it does not return the nearest vma.
+ */
+extern struct vm_area_struct *find_vma_speculative(struct mm_struct *mm,
+	unsigned long addr);
+void vma_put(struct vm_area_struct *vma);
+
 static inline unsigned long vma_pages(struct vm_area_struct *vma)
 {
 	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
Index: linux-2.6.33-rc2/arch/x86/mm/fault.c
===================================================================
--- linux-2.6.33-rc2.orig/arch/x86/mm/fault.c
+++ linux-2.6.33-rc2/arch/x86/mm/fault.c
@@ -952,6 +952,7 @@ do_page_fault(struct pt_regs *regs, unsi
 	struct mm_struct *mm;
 	int write;
 	int fault;
+	int speculative = 0;
 
 	tsk = current;
 	mm = tsk->mm;
@@ -1040,6 +1041,14 @@ do_page_fault(struct pt_regs *regs, unsi
 		return;
 	}
 
+	if ((error_code & PF_USER)) {
+		vma = find_vma_speculative(mm, address);
+		if (vma) {
+			speculative = 1;
+			goto good_area;
+		}
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1136,5 +1145,8 @@ good_area:
 
 	check_v8086_mode(regs, address, tsk);
 
-	up_read(&mm->mmap_sem);
+	if (speculative)
+		vma_put(vma);
+	else
+		up_read(&mm->mmap_sem);
 }

[-- Attachment #2: multi-fault-all-split.c --]
[-- Type: text/x-csrc, Size: 1849 bytes --]

/*
 * multi-fault.c :: generates 60 seconds of parallel page faults from multiple threads.
 * % gcc -O2 -o multi-fault multi-fault.c -lpthread
 * % multi-fault <number of cpus>
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <pthread.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>	/* atoi() */
#include <unistd.h>	/* sleep() */

#define NR_THREADS	8
pthread_t threads[NR_THREADS];
/*
 * To avoid contention on the page table lock, the FAULT area is
 * sparse. If FAULT_LENGTH is too large for your CPUs, decrease it.
 */
#define MMAP_LENGTH	(8 * 1024 * 1024)
#define FAULT_LENGTH	(2 * 1024 * 1024)
void *mmap_area[NR_THREADS];
#define PAGE_SIZE	4096

pthread_barrier_t barrier;
int name[NR_THREADS];

void segv_handler(int sig)
{
	sleep(100);
}
void *worker(void *data)
{
	cpu_set_t set;
	int cpu;

	cpu = *(int *)data;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	sched_setaffinity(0, sizeof(set), &set);

	while (1) {
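		/*
		 * Each pass: all threads fault in their private area in
		 * lockstep (one write per page), then drop the pages with
		 * MADV_DONTNEED so the next pass faults them in again.
		 */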
		char *c;
		char *start = mmap_area[cpu];
		char *end = mmap_area[cpu] + FAULT_LENGTH;
		pthread_barrier_wait(&barrier);
		//printf("fault into %p-%p\n",start, end);

		for (c = start; c < end; c += PAGE_SIZE)
			*c = 0;
		pthread_barrier_wait(&barrier);

		madvise(start, FAULT_LENGTH, MADV_DONTNEED);
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	int i, ret;
	unsigned int num;

	if (argc < 2)
		return 0;

	num = atoi(argv[1]);
	if (num > NR_THREADS)	/* stay within the statically sized arrays */
		num = NR_THREADS;
	pthread_barrier_init(&barrier, NULL, num);

	for (i = 0; i < num; i++) {
		mmap_area[i] = mmap(NULL, MMAP_LENGTH * num,
				PROT_WRITE|PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
		if (mmap_area[i] == MAP_FAILED) {
			perror("mmap");
			return 0;
		}
		/* memory hole */
		mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
	}

	for (i = 0; i < num; ++i) {
		name[i] = i;
		ret = pthread_create(&threads[i], NULL, worker, &name[i]);
		if (ret < 0) {
			perror("pthread create");
			return 0;
		}
	}
	sleep(60);
	return 0;
}


* Re: [RFC PATCH] asynchronous page fault.
  2009-12-25  1:51 [RFC PATCH] asynchronous page fault KAMEZAWA Hiroyuki
@ 2009-12-27  9:47 ` Minchan Kim
  2009-12-27 23:59   ` KAMEZAWA Hiroyuki
  2009-12-27 11:19 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Minchan Kim @ 2009-12-27  9:47 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, cl

Hi, Kame. 

KAMEZAWA Hiroyuki wrote:
> Speculative page fault v3.
> 
> This version is much simpler than old versions and doesn't use mm_accessor
> but use RCU. This is based on linux-2.6.33-rc2.
> 
> This patch is just my toy but shows...
>  - Once RB-tree is RCU-aware and no-lock in readside, we can avoid mmap_sem
>    in page fault. 
> So, what we need is not mm_accessor, but RCU-aware RB-tree, I think.
> 
> But yes, I may miss something critical ;)
> 
> After patch, statistics perf show is following. Test progam is attached.
>   
> # Samples: 1331231315119
> #
> # Overhead          Command             Shared Object  Symbol
> # ........  ...............  ........................  ......
> #
>     28.41%  multi-fault-all  [kernel]                  [k] clear_page_c
>             |
>             --- clear_page_c
>                 __alloc_pages_nodemask
>                 handle_mm_fault
>                 do_page_fault
>                 page_fault
>                 0x400950
>                |
>                 --100.00%-- (nil)
> 
>     21.69%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
>             |
>             --- _raw_spin_lock
>                |
>                |--81.85%-- free_pcppages_bulk
>                |          free_hot_cold_page
>                |          __pagevec_free
>                |          release_pages
>                |          free_pages_and_swap_cache
> 
> 
> I'll be almost offline in the next week. 
> 
> Minchan, in this version, I didn't add CONFIG and some others which was
> recommended just for my laziness. Sorry.

No problem :)
Thanks for the X-mas present.

> 
> =
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Asynchronous page fault.
> 
> This patch is for avoidng mmap_sem in usual page fault. At running highly
> multi-threaded programs, mm->mmap_sem can use much CPU because of false
> sharing when it causes page fault in parallel. (Run after fork() is a typical
> case, I think.)
> This patch uses a speculative vma lookup to reduce that cost.
> 
> Considering vma lookup, rb-tree lookup, the only operation we do is checking
> node->rb_left,rb_right. And there are no complicated operation.
> At page fault, there are no demands for accessing sorted-vma-list or access
> prev or next in many case. Except for stack-expansion, we always need a vma
> which contains page-fault address. Then, we can access vma's RB-tree in
> speculative way.
> Even if RB-tree rotation occurs while we walk tree for look-up, we just
> miss vma without oops. In other words, we can _try_ to find vma in lockless
> manner. If failed, retry is ok.... we take lock and access vma.
> 
> For lockess walking, this uses RCU and adds find_vma_speculative(). And
> per-vma wait-queue and reference count. This refcnt+wait_queue guarantees that
> there are no thread which access the vma when we call subsystem's unmap
> functions.
> 
> Test result on my tiny test program on 8core/2socket machine is here.
> This measures how many page fault can occur in 60sec in parallel.
> 
> [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> 
>  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> 
>        17481387  page-faults                ( +-   0.409% )
>       509914595  cache-misses               ( +-   0.239% )
> 
>    60.002277793  seconds time elapsed   ( +-   0.000% )
> 
> 
> [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> 
> 
>  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> 
>        35949073  page-faults                ( +-   0.364% )
>       473091100  cache-misses               ( +-   0.304% )
> 
>    60.005444117  seconds time elapsed   ( +-   0.004% )
> 
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

<snip>


> +/* called when vma is unlinked and wait for all racy access.*/
> +static void invalidate_vma_before_free(struct vm_area_struct *vma)
> +{
> +	atomic_dec(&vma->refcnt);
> +	wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
> +}

I think we have to make sure atomicity of both (atomic_dec and wait_event).

> +
>  /*
>   * Requires inode->i_mapping->i_mmap_lock
>   */
> @@ -238,7 +256,7 @@ static struct vm_area_struct *remove_vma
>  			removed_exe_file_vma(vma->vm_mm);
>  	}
>  	mpol_put(vma_policy(vma));
> -	kmem_cache_free(vm_area_cachep, vma);
> +	free_vma_rcu(vma);
>  	return next;
>  }
>  
> @@ -404,6 +422,8 @@ __vma_link_list(struct mm_struct *mm, st
>  void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
>  		struct rb_node **rb_link, struct rb_node *rb_parent)
>  {
> +	atomic_set(&vma->refcnt, 1);
> +	init_waitqueue_head(&vma->wait_queue);
>  	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
>  	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
>  }
> @@ -614,6 +634,7 @@ again:			remove_next = 1 + (end > next->
>  		 * us to remove next before dropping the locks.
>  		 */
>  		__vma_unlink(mm, next, vma);
> +		invalidate_vma_before_free(next);
>  		if (file)
>  			__remove_shared_vm_struct(next, file, mapping);
>  		if (next->anon_vma)
> @@ -640,7 +661,7 @@ again:			remove_next = 1 + (end > next->
>  		}
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
> -		kmem_cache_free(vm_area_cachep, next);
> +		free_vma_rcu(next);
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
>  		 * we must remove another next too. It would clutter
> @@ -1544,6 +1565,55 @@ out:
>  }
>  
>  /*
> + * Returns vma which contains given address. This scans rb-tree in speculative
> + * way and increment a reference count if found. Even if vma exists in rb-tree,
> + * this function may return NULL in racy case. So, this function cannot be used
> + * for checking whether given address is valid or not.
> + */
> +struct vm_area_struct *
> +find_vma_speculative(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct vm_area_struct *vma = NULL;
> +	struct vm_area_struct *vma_tmp;
> +	struct rb_node *rb_node;
> +
> +	if (unlikely(!mm))
> +		return NULL;;
> +
> +	rcu_read_lock();
> +	rb_node = rcu_dereference(mm->mm_rb.rb_node);
> +	vma = NULL;
> +	while (rb_node) {
> +		vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> +
> +		if (vma_tmp->vm_end > addr) {
> +			vma = vma_tmp;
> +			if (vma_tmp->vm_start <= addr)
> +				break;
> +			rb_node = rcu_dereference(rb_node->rb_left);
> +		} else
> +			rb_node = rcu_dereference(rb_node->rb_right);
> +	}
> +	if (vma) {
> +		if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
> +			if (!atomic_inc_not_zero(&vma->refcnt))
> +				vma = NULL;
> +		} else
> +			vma = NULL;
> +	}
> +	rcu_read_unlock();
> +	return vma;
> +}
> +
> +void vma_put(struct vm_area_struct *vma)
> +{
> +	if ((atomic_dec_return(&vma->refcnt) == 1) &&
> +	    waitqueue_active(&vma->wait_queue))
> +		wake_up(&vma->wait_queue);
> +	return;
> +}
> +

Let's consider following case. 

CPU 0					CPU 1

find_vma_speculative(refcnt = 2)
					do_unmap 
					invalidate_vma_before_free(refcount = 1)
					wait_event
vma_put
refcnt = 0
skip wakeup 

Hmm.. 




* Re: [RFC PATCH] asynchronous page fault.
  2009-12-25  1:51 [RFC PATCH] asynchronous page fault KAMEZAWA Hiroyuki
  2009-12-27  9:47 ` Minchan Kim
@ 2009-12-27 11:19 ` Peter Zijlstra
  2009-12-28  0:00   ` KAMEZAWA Hiroyuki
  2009-12-28  0:57   ` Balbir Singh
  2009-12-27 12:03 ` Peter Zijlstra
  2010-01-02 21:45 ` Benjamin Herrenschmidt
  3 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-27 11:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Fri, 2009-12-25 at 10:51 +0900, KAMEZAWA Hiroyuki wrote:
> Index: linux-2.6.33-rc2/lib/rbtree.c
> ===================================================================
> --- linux-2.6.33-rc2.orig/lib/rbtree.c
> +++ linux-2.6.33-rc2/lib/rbtree.c
> @@ -30,19 +30,19 @@ static void __rb_rotate_left(struct rb_n
>  
>         if ((node->rb_right = right->rb_left))
>                 rb_set_parent(right->rb_left, node);
> -       right->rb_left = node;
> +       rcu_assign_pointer(right->rb_left, node);
>  
>         rb_set_parent(right, parent);
>  
>         if (parent)
>         {
>                 if (node == parent->rb_left)
> -                       parent->rb_left = right;
> +                       rcu_assign_pointer(parent->rb_left, right);
>                 else
> -                       parent->rb_right = right;
> +                       rcu_assign_pointer(parent->rb_right, right);
>         }
>         else
> -               root->rb_node = right;
> +               rcu_assign_pointer(root->rb_node, right);
>         rb_set_parent(node, right);
>  }
>  
> @@ -53,19 +53,19 @@ static void __rb_rotate_right(struct rb_
>  
>         if ((node->rb_left = left->rb_right))
>                 rb_set_parent(left->rb_right, node);
> -       left->rb_right = node;
> +       rcu_assign_pointer(left->rb_right, node);
>  
>         rb_set_parent(left, parent);
>  
>         if (parent)
>         {
>                 if (node == parent->rb_right)
> -                       parent->rb_right = left;
> +                       rcu_assign_pointer(parent->rb_right, left);
>                 else
> -                       parent->rb_left = left;
> +                       rcu_assign_pointer(parent->rb_left, left);
>         }
>         else
> -               root->rb_node = left;
> +               rcu_assign_pointer(root->rb_node, left);
>         rb_set_parent(node, left);
>  }


Consider the tree rotation:


           Q                        P
         /   \                    /   \
       P       C                A       Q
     /   \                            /   \
   A       B                        B       C


Since this comprises 3 assignments (assuming a right rotation):

  Q.left = B
  P.right = Q
  parent = P

it is non-atomic. This in turn means that any lock-less descent into the
tree will be able to miss a whole subtree or worse (imagine us being at
Q, needing to go to A, then the rotation happens, and all we can choose
from is B or C).

Your changelog states as much.

"Even if RB-tree rotation occurs while we walk tree for look-up, we just
miss vma without oops."

However, since this is the case, do we still need the
rcu_assign_pointer() conversion your patch does? All I can see it do is
slow down all RB-tree users, without any gain.



* Re: [RFC PATCH] asynchronous page fault.
  2009-12-25  1:51 [RFC PATCH] asynchronous page fault KAMEZAWA Hiroyuki
  2009-12-27  9:47 ` Minchan Kim
  2009-12-27 11:19 ` Peter Zijlstra
@ 2009-12-27 12:03 ` Peter Zijlstra
  2009-12-28  0:36   ` KAMEZAWA Hiroyuki
  2010-01-02 21:45 ` Benjamin Herrenschmidt
  3 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-27 12:03 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Fri, 2009-12-25 at 10:51 +0900, KAMEZAWA Hiroyuki wrote:
>  /*
> + * Returns vma which contains given address. This scans rb-tree in speculative
> + * way and increment a reference count if found. Even if vma exists in rb-tree,
> + * this function may return NULL in racy case. So, this function cannot be used
> + * for checking whether given address is valid or not.
> + */
> +struct vm_area_struct *
> +find_vma_speculative(struct mm_struct *mm, unsigned long addr)
> +{
> +       struct vm_area_struct *vma = NULL;
> +       struct vm_area_struct *vma_tmp;
> +       struct rb_node *rb_node;
> +
> +       if (unlikely(!mm))
> +               return NULL;;
> +
> +       rcu_read_lock();
> +       rb_node = rcu_dereference(mm->mm_rb.rb_node);
> +       vma = NULL;
> +       while (rb_node) {
> +               vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> +
> +               if (vma_tmp->vm_end > addr) {
> +                       vma = vma_tmp;
> +                       if (vma_tmp->vm_start <= addr)
> +                               break;
> +                       rb_node = rcu_dereference(rb_node->rb_left);
> +               } else
> +                       rb_node = rcu_dereference(rb_node->rb_right);
> +       }
> +       if (vma) {
> +               if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
> +                       if (!atomic_inc_not_zero(&vma->refcnt))

And here you destroy pretty much all advantage of having done the
lockless lookup ;-)

The idea is to let the RCU lock span whatever length you need the vma
for, the easy way is to simply use PREEMPT_RCU=y for now, the hard way
is to also incorporate the drop-mmap_sem on blocking patches from a
while ago.

> +                               vma = NULL;
> +               } else
> +                       vma = NULL;
> +       }
> +       rcu_read_unlock();
> +       return vma;
> +} 



* Re: [RFC PATCH] asynchronous page fault.
  2009-12-27  9:47 ` Minchan Kim
@ 2009-12-27 23:59   ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-27 23:59 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-kernel, linux-mm, cl

On Sun, 27 Dec 2009 18:47:25 +0900
Minchan Kim <minchan.kim@gmail.com> wrote:
> > 
> > =
> > From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > 
> > Asynchronous page fault.
> > 
> > This patch is for avoidng mmap_sem in usual page fault. At running highly
> > multi-threaded programs, mm->mmap_sem can use much CPU because of false
> > sharing when it causes page fault in parallel. (Run after fork() is a typical
> > case, I think.)
> > This patch uses a speculative vma lookup to reduce that cost.
> > 
> > Considering vma lookup, rb-tree lookup, the only operation we do is checking
> > node->rb_left,rb_right. And there are no complicated operation.
> > At page fault, there are no demands for accessing sorted-vma-list or access
> > prev or next in many case. Except for stack-expansion, we always need a vma
> > which contains page-fault address. Then, we can access vma's RB-tree in
> > speculative way.
> > Even if RB-tree rotation occurs while we walk tree for look-up, we just
> > miss vma without oops. In other words, we can _try_ to find vma in lockless
> > manner. If failed, retry is ok.... we take lock and access vma.
> > 
> > For lockess walking, this uses RCU and adds find_vma_speculative(). And
> > per-vma wait-queue and reference count. This refcnt+wait_queue guarantees that
> > there are no thread which access the vma when we call subsystem's unmap
> > functions.
> > 
> > Test result on my tiny test program on 8core/2socket machine is here.
> > This measures how many page fault can occur in 60sec in parallel.
> > 
> > [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> > 
> >  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> > 
> >        17481387  page-faults                ( +-   0.409% )
> >       509914595  cache-misses               ( +-   0.239% )
> > 
> >    60.002277793  seconds time elapsed   ( +-   0.000% )
> > 
> > 
> > [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> > 
> > 
> >  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> > 
> >        35949073  page-faults                ( +-   0.364% )
> >       473091100  cache-misses               ( +-   0.304% )
> > 
> >    60.005444117  seconds time elapsed   ( +-   0.004% )
> > 
> > 
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> <snip>
> 
> 
> > +/* called when vma is unlinked and wait for all racy access.*/
> > +static void invalidate_vma_before_free(struct vm_area_struct *vma)
> > +{
> > +	atomic_dec(&vma->refcnt);
> > +	wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
> > +}
> 
> I think we have to make sure atomicity of both (atomic_dec and wait_event).
> 
I'm still considering how to do this. Something like

	atomic_sub(65536, &vma->refcnt);
	wait_event(..., atomic_read(&vma->refcnt) != 65536);

etc.
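
For illustration, one way to close the window Minchan describes is to make
whoever drops the last reference issue the wakeup unconditionally. The sketch
below is an editor's illustration using atomic_dec_and_test(), assuming the
rest of the refcounting stays as in the patch; it is not the fix that was
eventually settled on in this thread:

	/* hypothetical sketch, not from the patch */
	static void invalidate_vma_before_free(struct vm_area_struct *vma)
	{
		/* drop the initial reference taken in __vma_link_rb() */
		if (!atomic_dec_and_test(&vma->refcnt))
			wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
	}

	void vma_put(struct vm_area_struct *vma)
	{
		/* the final put always wakes the waiter, whoever it is */
		if (atomic_dec_and_test(&vma->refcnt))
			wake_up(&vma->wait_queue);
	}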



> > +
> >  /*
> >   * Requires inode->i_mapping->i_mmap_lock
> >   */
> > @@ -238,7 +256,7 @@ static struct vm_area_struct *remove_vma
> >  			removed_exe_file_vma(vma->vm_mm);
> >  	}
> >  	mpol_put(vma_policy(vma));
> > -	kmem_cache_free(vm_area_cachep, vma);
> > +	free_vma_rcu(vma);
> >  	return next;
> >  }
> >  
> > @@ -404,6 +422,8 @@ __vma_link_list(struct mm_struct *mm, st
> >  void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
> >  		struct rb_node **rb_link, struct rb_node *rb_parent)
> >  {
> > +	atomic_set(&vma->refcnt, 1);
> > +	init_waitqueue_head(&vma->wait_queue);
> >  	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
> >  	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
> >  }
> > @@ -614,6 +634,7 @@ again:			remove_next = 1 + (end > next->
> >  		 * us to remove next before dropping the locks.
> >  		 */
> >  		__vma_unlink(mm, next, vma);
> > +		invalidate_vma_before_free(next);
> >  		if (file)
> >  			__remove_shared_vm_struct(next, file, mapping);
> >  		if (next->anon_vma)
> > @@ -640,7 +661,7 @@ again:			remove_next = 1 + (end > next->
> >  		}
> >  		mm->map_count--;
> >  		mpol_put(vma_policy(next));
> > -		kmem_cache_free(vm_area_cachep, next);
> > +		free_vma_rcu(next);
> >  		/*
> >  		 * In mprotect's case 6 (see comments on vma_merge),
> >  		 * we must remove another next too. It would clutter
> > @@ -1544,6 +1565,55 @@ out:
> >  }
> >  
> >  /*
> > + * Returns vma which contains given address. This scans rb-tree in speculative
> > + * way and increment a reference count if found. Even if vma exists in rb-tree,
> > + * this function may return NULL in racy case. So, this function cannot be used
> > + * for checking whether given address is valid or not.
> > + */
> > +struct vm_area_struct *
> > +find_vma_speculative(struct mm_struct *mm, unsigned long addr)
> > +{
> > +	struct vm_area_struct *vma = NULL;
> > +	struct vm_area_struct *vma_tmp;
> > +	struct rb_node *rb_node;
> > +
> > +	if (unlikely(!mm))
> > +		return NULL;;
> > +
> > +	rcu_read_lock();
> > +	rb_node = rcu_dereference(mm->mm_rb.rb_node);
> > +	vma = NULL;
> > +	while (rb_node) {
> > +		vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> > +
> > +		if (vma_tmp->vm_end > addr) {
> > +			vma = vma_tmp;
> > +			if (vma_tmp->vm_start <= addr)
> > +				break;
> > +			rb_node = rcu_dereference(rb_node->rb_left);
> > +		} else
> > +			rb_node = rcu_dereference(rb_node->rb_right);
> > +	}
> > +	if (vma) {
> > +		if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
> > +			if (!atomic_inc_not_zero(&vma->refcnt))
> > +				vma = NULL;
> > +		} else
> > +			vma = NULL;
> > +	}
> > +	rcu_read_unlock();
> > +	return vma;
> > +}
> > +
> > +void vma_put(struct vm_area_struct *vma)
> > +{
> > +	if ((atomic_dec_return(&vma->refcnt) == 1) &&
> > +	    waitqueue_active(&vma->wait_queue))
> > +		wake_up(&vma->wait_queue);
> > +	return;
> > +}
> > +
> 
> Let's consider following case. 
> 
> CPU 0					CPU 1
> 
> find_vma_speculative(refcnt = 2)
> 					do_unmap 
> 					invaliate_vma_before_free(refcount = 1)
> 					wait_event
> vma_put
> refcnt = 0
> skip wakeup 
> 
> Hmm.. 

Nice catch. I'll change this logic. Maybe some easy trick can fix this.

Thanks,
-Kame



* Re: [RFC PATCH] asynchronous page fault.
  2009-12-27 11:19 ` Peter Zijlstra
@ 2009-12-28  0:00   ` KAMEZAWA Hiroyuki
  2009-12-28  0:57   ` Balbir Singh
  1 sibling, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28  0:00 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Sun, 27 Dec 2009 12:19:56 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2009-12-25 at 10:51 +0900, KAMEZAWA Hiroyuki wrote:
> > Index: linux-2.6.33-rc2/lib/rbtree.c
> > ===================================================================
> > --- linux-2.6.33-rc2.orig/lib/rbtree.c
> > +++ linux-2.6.33-rc2/lib/rbtree.c
> > @@ -30,19 +30,19 @@ static void __rb_rotate_left(struct rb_n
> >  
> >         if ((node->rb_right = right->rb_left))
> >                 rb_set_parent(right->rb_left, node);
> > -       right->rb_left = node;
> > +       rcu_assign_pointer(right->rb_left, node);
> >  
> >         rb_set_parent(right, parent);
> >  
> >         if (parent)
> >         {
> >                 if (node == parent->rb_left)
> > -                       parent->rb_left = right;
> > +                       rcu_assign_pointer(parent->rb_left, right);
> >                 else
> > -                       parent->rb_right = right;
> > +                       rcu_assign_pointer(parent->rb_right, right);
> >         }
> >         else
> > -               root->rb_node = right;
> > +               rcu_assign_pointer(root->rb_node, right);
> >         rb_set_parent(node, right);
> >  }
> >  
> > @@ -53,19 +53,19 @@ static void __rb_rotate_right(struct rb_
> >  
> >         if ((node->rb_left = left->rb_right))
> >                 rb_set_parent(left->rb_right, node);
> > -       left->rb_right = node;
> > +       rcu_assign_pointer(left->rb_right, node);
> >  
> >         rb_set_parent(left, parent);
> >  
> >         if (parent)
> >         {
> >                 if (node == parent->rb_right)
> > -                       parent->rb_right = left;
> > +                       rcu_assign_pointer(parent->rb_right, left);
> >                 else
> > -                       parent->rb_left = left;
> > +                       rcu_assign_pointer(parent->rb_left, left);
> >         }
> >         else
> > -               root->rb_node = left;
> > +               rcu_assign_pointer(root->rb_node, left);
> >         rb_set_parent(node, left);
> >  }
> 
> 
> Consider the tree rotation:
> 
> 
>            Q                        P
>          /   \                    /   \
>        P       C                A       Q
>      /   \                            /   \
>    A       B                        B       C
> 
> 
> Since this comprises of 3 assignments (assuming right rotation):
> 
>   Q.left = B
>   P.right = Q
>   parent = P
> 
> it is non-atomic. This in turn means that any lock-less decent into the
> tree will be able to miss a whole subtree or worse (imagine us being at
> Q, needing to go to A, then the rotation happens, and all we can choose
> from is B or C).
> 
> Your changelog states as much.
> 
> "Even if RB-tree rotation occurs while we walk tree for look-up, we just
> miss vma without oops."
> 
> However, since this is the case, do we still need the
> rcu_assign_pointer() conversion your patch does? All I can see it do is
> slow down all RB-tree users, without any gain.
> 

Ok, I'll remove all.

Thanks,
-Kame





* Re: [RFC PATCH] asynchronous page fault.
  2009-12-27 12:03 ` Peter Zijlstra
@ 2009-12-28  0:36   ` KAMEZAWA Hiroyuki
  2009-12-28  1:19     ` KAMEZAWA Hiroyuki
                       ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28  0:36 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Sun, 27 Dec 2009 13:03:11 +0100
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2009-12-25 at 10:51 +0900, KAMEZAWA Hiroyuki wrote:
> >  /*
> > + * Returns vma which contains given address. This scans rb-tree in speculative
> > + * way and increment a reference count if found. Even if vma exists in rb-tree,
> > + * this function may return NULL in racy case. So, this function cannot be used
> > + * for checking whether given address is valid or not.
> > + */
> > +struct vm_area_struct *
> > +find_vma_speculative(struct mm_struct *mm, unsigned long addr)
> > +{
> > +       struct vm_area_struct *vma = NULL;
> > +       struct vm_area_struct *vma_tmp;
> > +       struct rb_node *rb_node;
> > +
> > +       if (unlikely(!mm))
> > +               return NULL;;
> > +
> > +       rcu_read_lock();
> > +       rb_node = rcu_dereference(mm->mm_rb.rb_node);
> > +       vma = NULL;
> > +       while (rb_node) {
> > +               vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> > +
> > +               if (vma_tmp->vm_end > addr) {
> > +                       vma = vma_tmp;
> > +                       if (vma_tmp->vm_start <= addr)
> > +                               break;
> > +                       rb_node = rcu_dereference(rb_node->rb_left);
> > +               } else
> > +                       rb_node = rcu_dereference(rb_node->rb_right);
> > +       }
> > +       if (vma) {
> > +               if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
> > +                       if (!atomic_inc_not_zero(&vma->refcnt))
> 
> And here you destroy pretty much all advantage of having done the
> lockless lookup ;-)
> 
Hmm? For single-threaded apps? Lockless lookup is not this patch's purpose,
it's just part of the work. My purpose is avoiding false sharing.

2.6.33-rc2's score of the same test program is here.

    75.42%  multi-fault-all  [kernel]                  [k] _raw_spin_lock_irqsav
            |
            --- _raw_spin_lock_irqsave
               |
               |--49.13%-- __down_read_trylock
               |          down_read_trylock
               |          do_page_fault
               |          page_fault
               |          0x400950
               |          |
               |           --100.00%-- (nil)
               |
               |--46.92%-- __up_read
               |          up_read
               |          |
               |          |--99.99%-- do_page_fault
               |          |          page_fault
               |          |          0x400950
               |          |          (nil)
               |           --0.01%-- [...]

Most of the time is spent in up_read/down_read.

Here is a comparison, on x86-64, between
 - page faults by 8 threads on one vma
 - page faults by 8 threads on 8 vmas.

== one vma ==
# Samples: 1338964273489
#
# Overhead          Command             Shared Object  Symbol
# ........  ...............  ........................  ......
#
    26.90%  multi-fault-all  [kernel]                  [k] clear_page_c
            |
            --- clear_page_c
                __alloc_pages_nodemask
                handle_mm_fault
                do_page_fault
                page_fault
                0x400940
               |
                --100.00%-- (nil)

    20.65%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--85.07%-- free_pcppages_bulk
               |          free_hot_cold_page

    ....<snip>
    3.94%  multi-fault-all  [kernel]                  [k] find_vma_speculative
            |
            --- find_vma_speculative
               |
               |--99.40%-- do_page_fault
               |          page_fault
               |          0x400940
               |          |
               |           --100.00%-- (nil)
               |
                --0.60%-- page_fault
                          0x400940
                          |
                           --100.00%-- (nil)
==

== 8 vma ==
    27.98%  multi-fault-all  [kernel]                  [k] clear_page_c
            |
            --- clear_page_c
                __alloc_pages_nodemask
                handle_mm_fault
                do_page_fault
                page_fault
                0x400950
               |
                --100.00%-- (nil)

    21.91%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
               |
               |--77.01%-- free_pcppages_bulk
               |          free_hot_cold_page
               |          __pagevec_free
               |          release_pages
...<snip>

     0.21%  multi-fault-all  [kernel]                  [k] find_vma_speculative
            |
            --- find_vma_speculative
               |
               |--87.50%-- do_page_fault
               |          page_fault
               |          0x400950
               |          |
               |           --100.00%-- (nil)
               |
                --12.50%-- page_fault
                          0x400950
                          |
                           --100.00%-- (nil)
==
Yes, this atomic_inc_not_zero() adds some overhead, but it isn't as bad as the
false sharing on mmap_sem. Anyway, as Minchan pointed out, this code contains a
bug; I'll reconsider this part.


> The idea is to let the RCU lock span whatever length you need the vma
> for, the easy way is to simply use PREEMPT_RCU=y for now, 

I tried to remove this kind of reference count trick, but I can't do that
without a synchronize_rcu() somewhere in the unmap code. I don't like that, so
I use this refcnt.
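
For illustration, the refcount-free alternative referred to here would look
roughly like the sketch below. It is an editor's sketch that assumes readers
keep rcu_read_lock() held across the whole fault (as Peter suggests above); the
exact placement in the munmap path is a guess:

	/* in the munmap path, after the vma is unlinked from the rb-tree
	 * and the vma list, but before the subsystem unmap/free calls run */
	rb_erase(&vma->vm_rb, &mm->mm_rb);
	synchronize_rcu();	/* no speculative walker can still see the vma */
	/* now safe to tear down page tables and free the vma */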

> the hard way
> is to also incorporate the drop-mmap_sem on blocking patches from a
> while ago.
> 
"drop-mmap_sem if block" is no help for this false-sharing problem.

Thanks,
-Kame






* Re: [RFC PATCH] asynchronous page fault.
  2009-12-27 11:19 ` Peter Zijlstra
  2009-12-28  0:00   ` KAMEZAWA Hiroyuki
@ 2009-12-28  0:57   ` Balbir Singh
  2009-12-28  1:05     ` KAMEZAWA Hiroyuki
  2009-12-28  8:32     ` Peter Zijlstra
  1 sibling, 2 replies; 32+ messages in thread
From: Balbir Singh @ 2009-12-28  0:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

* Peter Zijlstra <peterz@infradead.org> [2009-12-27 12:19:56]:

> Your changelog states as much.
> 
> "Even if RB-tree rotation occurs while we walk tree for look-up, we just
> miss vma without oops."
> 
> However, since this is the case, do we still need the
> rcu_assign_pointer() conversion your patch does? All I can see it do is
> slow down all RB-tree users, without any gain.

Don't we need the rcu_assign_pointer() on the read side primarily to
make sure the pointer is still valid and assignments (writes) are not
re-ordered? Are you suggesting that the pointer assignment paths be
completely atomic?

-- 
	Balbir


* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  0:57   ` Balbir Singh
@ 2009-12-28  1:05     ` KAMEZAWA Hiroyuki
  2009-12-28  2:58       ` Balbir Singh
  2009-12-28  8:32     ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28  1:05 UTC (permalink / raw)
  To: balbir; +Cc: Peter Zijlstra, linux-kernel, linux-mm, minchan.kim, cl

On Mon, 28 Dec 2009 06:27:46 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * Peter Zijlstra <peterz@infradead.org> [2009-12-27 12:19:56]:
> 
> > Your changelog states as much.
> > 
> > "Even if RB-tree rotation occurs while we walk tree for look-up, we just
> > miss vma without oops."
> > 
> > However, since this is the case, do we still need the
> > rcu_assign_pointer() conversion your patch does? All I can see it do is
> > slow down all RB-tree users, without any gain.
> 
> Don't we need the rcu_assign_pointer() on the read side primarily to
> make sure the pointer is still valid and assignments (writes) are not
> re-ordered? Are you suggesting that the pointer assignment paths be
> completely atomic?
> 
For the following reasons.
  - What we have to avoid is touching unknown memory via a broken pointer.
    This is a speculative lookup and may miss vmas, so even if the tree looks
    inconsistent there is no problem; only a pointer that points somewhere
    outside the rb-tree would be a problem.
  - An rb-tree node's rb_left and rb_right never point to memory outside the
    rb-tree (or they are NULL), and vmas are not freed/reused while we hold
    rcu_read_lock(). So we never dive into unknown memory.
  - Therefore, we can skip rcu_assign_pointer().

Thanks,
-Kame



* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  0:36   ` KAMEZAWA Hiroyuki
@ 2009-12-28  1:19     ` KAMEZAWA Hiroyuki
  2009-12-28  8:30     ` Peter Zijlstra
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28  1:19 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Peter Zijlstra, linux-kernel, linux-mm, minchan.kim, cl

On Mon, 28 Dec 2009 09:36:06 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> == one vma ==
> # Samples: 1338964273489
> #
> # Overhead          Command             Shared Object  Symbol
> # ........  ...............  ........................  ......
> #
>     26.90%  multi-fault-all  [kernel]                  [k] clear_page_c
>             |
>             --- clear_page_c
>                 __alloc_pages_nodemask
>                 handle_mm_fault
>                 do_page_fault
>                 page_fault
>                 0x400940
>                |
>                 --100.00%-- (nil)
> 
>     20.65%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
>             |
>             --- _raw_spin_lock
>                |
>                |--85.07%-- free_pcppages_bulk
>                |          free_hot_cold_page
> 
>     ....<snip>
>     3.94%  multi-fault-all  [kernel]                  [k] find_vma_speculative
>             |
>             --- find_vma_speculative
>                |
>                |--99.40%-- do_page_fault
>                |          page_fault
>                |          0x400940
>                |          |
>                |           --100.00%-- (nil)
>                |
>                 --0.60%-- page_fault
>                           0x400940
>                           |
>                            --100.00%-- (nil)
> ==
> 
A small modification that avoids atomic_add_unless() gives the following score
(I used atomic_inc() followed by atomic_read() instead).
==
     2.55%  multi-fault-all  [kernel]                  [k] vma_put
            |
            --- vma_put
               |
               |--98.87%-- do_page_fault
               |          page_fault
               |          0x400940
               |          (nil)
               |
                --1.13%-- page_fault
                          0x400940
                          (nil)
     0.65%  multi-fault-all  [kernel]                  [k] find_vma_speculative
            |
            --- find_vma_speculative
               |
               |--95.55%-- do_page_fault
               |          page_fault
               |          0x400940
               |          |
               |           --100.00%-- (nil)
               |
                --4.45%-- page_fault
                          0x400940
                          (nil)
==
Hmm... maybe worth considering further.
I hope something good pops into my mind after the new year vacation.

Regards,
-Kame







* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  1:05     ` KAMEZAWA Hiroyuki
@ 2009-12-28  2:58       ` Balbir Singh
  2009-12-28  3:13         ` KAMEZAWA Hiroyuki
  2009-12-28  8:34         ` Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Balbir Singh @ 2009-12-28  2:58 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: Peter Zijlstra, linux-kernel, linux-mm, minchan.kim, cl

* KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-12-28 10:05:14]:

> On Mon, 28 Dec 2009 06:27:46 +0530
> Balbir Singh <balbir@linux.vnet.ibm.com> wrote:
> 
> > * Peter Zijlstra <peterz@infradead.org> [2009-12-27 12:19:56]:
> > 
> > > Your changelog states as much.
> > > 
> > > "Even if RB-tree rotation occurs while we walk tree for look-up, we just
> > > miss vma without oops."
> > > 
> > > However, since this is the case, do we still need the
> > > rcu_assign_pointer() conversion your patch does? All I can see it do is
> > > slow down all RB-tree users, without any gain.
> > 
> > Don't we need the rcu_assign_pointer() on the read side primarily to
> > make sure the pointer is still valid and assignments (writes) are not
> > re-ordered? Are you suggesting that the pointer assignment paths be
> > completely atomic?
> > 
> >From following reasons.
>   - What we have to avoid is not to touch unkonwn memory via broken pointer.
>     This is speculative look up and can miss vmas. So, even if tree is broken,
>     there is no problem. Broken pointer which points to places other than rb-tree
>     is problem.

Exactly!

>   - rb-tree's rb_left and rb_right don't points to memory other than
>     rb-tree. (or NULL)  And vmas are not freed/reused while rcu_read_lock().
>     Then, we don't dive into unknown memory.
>   - Then, we can skip rcu_assign_pointer().
>

We can, but the data seen on the read side is going to be more out-of-date
than with rcu_assign_pointer(). Do we need variants like an rcu_rb_next() to
avoid the overhead for everyone else?

-- 
	Balbir


* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  2:58       ` Balbir Singh
@ 2009-12-28  3:13         ` KAMEZAWA Hiroyuki
  2009-12-28  8:34         ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28  3:13 UTC (permalink / raw)
  To: balbir; +Cc: Peter Zijlstra, linux-kernel, linux-mm, minchan.kim, cl

On Mon, 28 Dec 2009 08:28:39 +0530
Balbir Singh <balbir@linux.vnet.ibm.com> wrote:

> * KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> [2009-12-28 10:05:14]:

> >   - rb-tree's rb_left and rb_right don't points to memory other than
> >     rb-tree. (or NULL)  And vmas are not freed/reused while rcu_read_lock().
> >     Then, we don't dive into unknown memory.
> >   - Then, we can skip rcu_assign_pointer().
> >
> 
> We can, but the data being on read-side is going to be out-of-date
> more than without the use of rcu_assign_pointer(). Do we need variants
> like to rcu_rb_next() to avoid overheads for everyone?
> 
I myself can't tell how often out-of-date data would be seen (because I use x86).

But I feel we wouldn't see a broken tree very often, because...
  - Single-threaded apps never see a broken tree.
  - Even if rb-tree modifications happen frequently, tree rotations are not
    that frequent, and sub-trees tend to stay stable as a chunk.

Hmm, adding a barrier like this?

static inline void
__vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
                struct vm_area_struct *prev)
{
        prev->vm_next = vma->vm_next;
        rb_erase(&vma->vm_rb, &mm->mm_rb);
        if (mm->mmap_cache == vma)
                mm->mmap_cache = prev;
	smp_wmb(); <=============================================(new)
}



Regards,
-Kame



* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  0:36   ` KAMEZAWA Hiroyuki
  2009-12-28  1:19     ` KAMEZAWA Hiroyuki
@ 2009-12-28  8:30     ` Peter Zijlstra
  2009-12-28  9:58       ` KAMEZAWA Hiroyuki
  2009-12-28  8:55     ` Peter Zijlstra
  2009-12-28 11:43     ` Peter Zijlstra
  3 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28  8:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
> 
> > The idea is to let the RCU lock span whatever length you need the vma
> > for, the easy way is to simply use PREEMPT_RCU=y for now, 
> 
> I tried to remove his kind of reference count trick but I can't do that
> without synchronize_rcu() somewhere in unmap code. I don't like that and
> use this refcnt. 

Why, because otherwise we can access page tables for an already unmapped
vma? Yeah that is the interesting bit ;-)





* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  0:57   ` Balbir Singh
  2009-12-28  1:05     ` KAMEZAWA Hiroyuki
@ 2009-12-28  8:32     ` Peter Zijlstra
  2009-12-29  9:54       ` Balbir Singh
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28  8:32 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 06:27 +0530, Balbir Singh wrote:
> * Peter Zijlstra <peterz@infradead.org> [2009-12-27 12:19:56]:
> 
> > Your changelog states as much.
> > 
> > "Even if RB-tree rotation occurs while we walk tree for look-up, we just
> > miss vma without oops."
> > 
> > However, since this is the case, do we still need the
> > rcu_assign_pointer() conversion your patch does? All I can see it do is
> > slow down all RB-tree users, without any gain.
> 
> Don't we need the rcu_assign_pointer() on the read side primarily to
> make sure the pointer is still valid and assignments (writes) are not
> re-ordered? Are you suggesting that the pointer assignment paths be
> completely atomic?

rcu_assign_pointer() is the write side, but if you need a barrier, you
can make do with a single smp_wmb() after doing the rb-tree op. There is
no need to add multiple barriers in the tree-ops themselves.

You cannot make the assignment paths atomic (without locks) that's the
whole problem.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  2:58       ` Balbir Singh
  2009-12-28  3:13         ` KAMEZAWA Hiroyuki
@ 2009-12-28  8:34         ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28  8:34 UTC (permalink / raw)
  To: balbir; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 08:28 +0530, Balbir Singh wrote:

> We can, but the data being on read-side is going to be out-of-date
> more than without the use of rcu_assign_pointer(). Do we need variants
> like to rcu_rb_next() to avoid overheads for everyone?

More or less doesn't matter! As long as you cannot get it atomic there's
holes and you need to deal with it.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  0:36   ` KAMEZAWA Hiroyuki
  2009-12-28  1:19     ` KAMEZAWA Hiroyuki
  2009-12-28  8:30     ` Peter Zijlstra
@ 2009-12-28  8:55     ` Peter Zijlstra
  2009-12-28 10:08       ` KAMEZAWA Hiroyuki
  2009-12-28 11:43     ` Peter Zijlstra
  3 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28  8:55 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
> Hmm ? for single-thread apps ? This patch's purpose is not for lockless
> lookup, it's just a part of work. My purpose is avoiding false-sharing.

False sharing in the sense of the mmap_sem cacheline containing other
variables? How could that ever be a problem for a single threaded
application?

For multi-threaded apps the contention on that cacheline is the largest
issue, and moving it to a vma cacheline doesn't seem like a big
improvement.

You want something much finer grained than vmas, there's lots of apps
working on a single (or very few) vma(s). Leaving you with pretty much
the same cacheline contention. Only now it's a different cacheline.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  8:30     ` Peter Zijlstra
@ 2009-12-28  9:58       ` KAMEZAWA Hiroyuki
  2009-12-28 10:30         ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28  9:58 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

Peter Zijlstra wrote:
> On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
>>
>> > The idea is to let the RCU lock span whatever length you need the vma
>> > for, the easy way is to simply use PREEMPT_RCU=y for now,
>>
>> I tried to remove this kind of reference count trick but I can't do that
>> without synchronize_rcu() somewhere in unmap code. I don't like that and
>> use this refcnt.
>
> Why, because otherwise we can access page tables for an already unmapped
> vma? Yeah that is the interesting bit ;-)
>
Without that
  vma->a_ops->fault()
and
  vma->a_ops->unmap()
can be called at the same time, and vma->vm_file can be dropped while
vma->a_ops->fault() is called, etc.
Ah, I may be missing something. I'll think about it more next year.
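
For reference, the trade-off looks roughly like this (a sketch only;
vma_get()/vma_put() stand in for the refcnt + wait-queue helpers and are
not the real names):

    /* (a) pure RCU: every munmap pays for a grace period */
    rcu_read_lock();
    vma = find_vma_speculative(mm, address);
    vma->vm_ops->fault(vma, &vmf);   /* may sleep -> needs preemptible RCU */
    rcu_read_unlock();
    /* unmap side: rb_erase(); synchronize_rcu(); ->close(); fput(vm_file); */

    /* (b) refcnt + wait-queue: unmap only waits while a fault is in flight */
    if (!vma_get(vma))               /* fails once the vma has started dying */
            goto retry_with_mmap_sem;
    vma->vm_ops->fault(vma, &vmf);
    vma_put(vma);                    /* the final put wakes up the unmapper */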

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  8:55     ` Peter Zijlstra
@ 2009-12-28 10:08       ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28 10:08 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

Peter Zijlstra wrote:
> On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
>> Hmm ? for single-thread apps ? This patch's purpose is not for lockless
>> lookup, it's just a part of work. My purpose is avoiding false-sharing.
>
> False sharing in the sense of the mmap_sem cacheline containing other
> variables?  How could that ever be a problem for a single threaded
> application?
>
No problem at all. I just couldn't catch what you meant.


> For multi-threaded apps the contention on that cacheline is the largest
> issue, and moving it to a vma cacheline doesn't seem like a big
> improvement.
>
I feel mmap_sem's cacheline ping-pong is worse than a simple atomic_inc().
__down_read() does:
  write (spinlock)
  write (->sem_activity)
  write (unlock)


> You want something much finer grained than vmas, there's lots of apps
> working on a single (or very few) vma(s). Leaving you with pretty much
> the same cacheline contention. Only now its a different cacheline.
>
Yeah, maybe. I hope I can find some magical scheme.
Using a per-cpu counter here, as Christoph did, may be an idea...
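
A very rough sketch of that direction (the field and helper names here
are made up, and it glosses over the cost of a per-cpu counter per vma):

    /* fault side: only ever touches a CPU-local cacheline */
    this_cpu_inc(*vma->vm_fault_count);
    smp_mb();                            /* pairs with the barrier below */
    if (vma->vm_dead) {
            this_cpu_dec(*vma->vm_fault_count);
            return VM_FAULT_RETRY;       /* fall back to the mmap_sem path */
    }
    /* ... handle the fault ... */
    this_cpu_dec(*vma->vm_fault_count);

    /* unmap side: mark the vma dead first, then drain the readers */
    vma->vm_dead = 1;
    smp_mb();
    while (vma_fault_count_sum(vma))     /* for_each_possible_cpu() sum */
            cpu_relax();                 /* or sleep on a per-vma wait-queue */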

Thanks,
-Kame




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  9:58       ` KAMEZAWA Hiroyuki
@ 2009-12-28 10:30         ` Peter Zijlstra
  2009-12-28 10:40           ` Peter Zijlstra
  2009-12-28 10:57           ` [RFC PATCH] asynchronous " KAMEZAWA Hiroyuki
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28 10:30 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 18:58 +0900, KAMEZAWA Hiroyuki wrote:
> Peter Zijlstra wrote:
> > On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
> >>
> >> > The idea is to let the RCU lock span whatever length you need the vma
> >> > for, the easy way is to simply use PREEMPT_RCU=y for now,
> >>
> >> I tried to remove this kind of reference count trick but I can't do that
> >> without synchronize_rcu() somewhere in unmap code. I don't like that and
> >> use this refcnt.
> >
> > Why, because otherwise we can access page tables for an already unmapped
> > vma? Yeah that is the interesting bit ;-)
> >
> Without that
>   vma->a_ops->fault()
> and
>   vma->a_ops->unmap()
> can be called at the same time. and vma->vm_file can be dropped while
> vma->a_ops->fault() is called. etc...

Right, so acquiring the PTE lock will either instantiate page tables for
a non-existing vma, leaving you with an interesting mess to clean up, or
you can also RCU free the page tables (in the same RCU domain as the
vma) which will mostly[*] avoid that issue.

[ To make life really really interesting you could even re-use the
  page-tables and abort the RCU free when the region gets re-mapped
  before the RCU callbacks happen, this will avoid a free/alloc cycle
  for fast remapping workloads. ]

Once you hold the PTE lock, you can validate the vma you looked up,
since ->unmap() syncs against it. If at that time you find the
speculative vma is dead, you fail and re-try the fault.

[*] there still is the case of faulting on an address that didn't
previously have page-tables hence the unmap page table scan will have
skipped it -- my hacks simply leaked page tables here, but the idea was
to acquire the mmap_sem for reading and cleanup properly.
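
For that validation step, the speculative fault side would do roughly the
following (sketch; vma_is_dead() is a placeholder for whatever "this vma
died" test ends up being used):

    pgd = pgd_offset(mm, address);
    if (pgd_none(*pgd))
            return VM_FAULT_RETRY;      /* no page tables: retry under mmap_sem */
    pud = pud_offset(pgd, address);
    if (pud_none(*pud))
            return VM_FAULT_RETRY;
    pmd = pmd_offset(pud, address);
    if (pmd_none(*pmd))
            return VM_FAULT_RETRY;

    pte = pte_offset_map_lock(mm, pmd, address, &ptl);
    if (vma_is_dead(vma)) {             /* ->unmap() already ran, or is running */
            pte_unmap_unlock(pte, ptl);
            return VM_FAULT_RETRY;
    }
    /* unmap has to take this ptl to zap the pte, so from here on it
       serializes behind us for this page */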


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28 10:30         ` Peter Zijlstra
@ 2009-12-28 10:40           ` Peter Zijlstra
  2010-01-02 16:14             ` Peter Zijlstra
  2009-12-28 10:57           ` [RFC PATCH] asynchronous " KAMEZAWA Hiroyuki
  1 sibling, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28 10:40 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 11:30 +0100, Peter Zijlstra wrote:
> On Mon, 2009-12-28 at 18:58 +0900, KAMEZAWA Hiroyuki wrote:
> > Peter Zijlstra wrote:
> > > On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
> > >>
> > >> > The idea is to let the RCU lock span whatever length you need the vma
> > >> > for, the easy way is to simply use PREEMPT_RCU=y for now,
> > >>
> > >> I tried to remove this kind of reference count trick but I can't do that
> > >> without synchronize_rcu() somewhere in unmap code. I don't like that and
> > >> use this refcnt.
> > >
> > > Why, because otherwise we can access page tables for an already unmapped
> > > vma? Yeah that is the interesting bit ;-)
> > >
> > Without that
> >   vma->a_ops->fault()
> > and
> >   vma->a_ops->unmap()
> > can be called at the same time. and vma->vm_file can be dropped while
> > vma->a_ops->fault() is called. etc...
> 
> Right, so acquiring the PTE lock will either instantiate page tables for
> a non-existing vma, leaving you with an interesting mess to clean up, or
> you can also RCU free the page tables (in the same RCU domain as the
> vma) which will mostly[*] avoid that issue.
> 
> [ To make live really really interesting you could even re-use the
>   page-tables and abort the RCU free when the region gets re-mapped
>   before the RCU callbacks happen, this will avoid a free/alloc cycle
>   for fast remapping workloads. ]
> 
> Once you hold the PTE lock, you can validate the vma you looked up,
> since ->unmap() syncs against it. If at that time you find the
> speculative vma is dead, you fail and re-try the fault.
> 
> [*] there still is the case of faulting on an address that didn't
> previously have page-tables hence the unmap page table scan will have
> skipped it -- my hacks simply leaked page tables here, but the idea was
> to acquire the mmap_sem for reading and cleanup properly.

Alternatively, we could mark vma's dead in some way before we do the
unmap, then whenever we hit the page-table alloc path, we check against
the speculative vma and bail if it died.
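
In rough pseudo-C (vm_dead is a made-up flag; the real marking could just
as well be RB_EMPTY_NODE() or a seqcount):

    /* unmap side, before zapping and freeing the page tables */
    vma->vm_dead = 1;
    smp_wmb();

    /* speculative fault side, before any page-table allocation it would
       otherwise do (the authoritative re-check still happens later,
       under the pte lock): */
    smp_rmb();
    if (vma->vm_dead)
            return VM_FAULT_RETRY;      /* raced with munmap, redo under mmap_sem */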

That might just work.. will need to ponder it a bit more.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28 10:30         ` Peter Zijlstra
  2009-12-28 10:40           ` Peter Zijlstra
@ 2009-12-28 10:57           ` KAMEZAWA Hiroyuki
  2009-12-28 11:06             ` Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-12-28 10:57 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

Peter Zijlstra wrote:
> On Mon, 2009-12-28 at 18:58 +0900, KAMEZAWA Hiroyuki wrote:
>> Peter Zijlstra wrote:
>> > On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
>> >>
>> >> > The idea is to let the RCU lock span whatever length you need the
>> vma
>> >> > for, the easy way is to simply use PREEMPT_RCU=y for now,
>> >>
>> >> I tried to remove this kind of reference count trick but I can't do
>> that
>> >> without synchronize_rcu() somewhere in unmap code. I don't like that
>> and
>> >> use this refcnt.
>> >
>> > Why, because otherwise we can access page tables for an already
>> unmapped
>> > vma? Yeah that is the interesting bit ;-)
>> >
>> Without that
>>   vma->a_ops->fault()
>> and
>>   vma->a_ops->unmap()
>> can be called at the same time. and vma->vm_file can be dropped while
>> vma->a_ops->fault() is called. etc...
>
> Right, so acquiring the PTE lock will either instantiate page tables for
> a non-existing vma, leaving you with an interesting mess to clean up, or
> you can also RCU free the page tables (in the same RCU domain as the
> vma) which will mostly[*] avoid that issue.
>
> [ To make live really really interesting you could even re-use the
>   page-tables and abort the RCU free when the region gets re-mapped
>   before the RCU callbacks happen, this will avoid a free/alloc cycle
>   for fast remapping workloads. ]
>
> Once you hold the PTE lock, you can validate the vma you looked up,
> since ->unmap() syncs against it. If at that time you find the
> speculative vma is dead, you fail and re-try the fault.
>
My previous version did something similar but still used vma->refcnt. I'll think it over again.

> [*] there still is the case of faulting on an address that didn't
> previously have page-tables hence the unmap page table scan will have
> skipped it -- my hacks simply leaked page tables here, but the idea was
> to acquire the mmap_sem for reading and cleanup properly.
>
Hmm, thank you for the hints.

But the current implementation is the way it is for a few reasons:
  - pmd has some troubles because of quicklists... I didn't want to touch
    their free routines.
  - a pmd can be removed asynchronously while a page fault is going on.
  - I'd like to avoid modifying free_pte_range etc.

I feel the pmd/page-table-lock is a harder object to handle than expected.

I'll think about a per-thread approach, a split-vma approach, a scalable
range lock, or some synchronization without heavy atomic ops.

Anyway, I think I've shown that something can be done without modifying mmap_sem.
See you next year.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28 10:57           ` [RFC PATCH] asynchronous " KAMEZAWA Hiroyuki
@ 2009-12-28 11:06             ` Peter Zijlstra
  0 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28 11:06 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 19:57 +0900, KAMEZAWA Hiroyuki wrote:
>   - pmd has some troubles because of quicklists... I didn't want to touch
>     their free routines.

I really doubt the value of that quicklist horror. IIRC x86 stopped
supporting that a while ago as well.

I would suspect the page-table retention scheme possible with RCU-freed
page tables could be far more efficient than quicklists, but then that's
all speculation, since I don't know what kind of workloads we're talking
about and there's this glaring lack of an implementation to compare against.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  0:36   ` KAMEZAWA Hiroyuki
                       ` (2 preceding siblings ...)
  2009-12-28  8:55     ` Peter Zijlstra
@ 2009-12-28 11:43     ` Peter Zijlstra
  3 siblings, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2009-12-28 11:43 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Mon, 2009-12-28 at 09:36 +0900, KAMEZAWA Hiroyuki wrote:
> > the hard way
> > is to also incorporate the drop-mmap_sem on blocking patches from a
> > while ago.
> > 
> "drop-mmap_sem if block" is no help for this false-sharing problem.

No but it does help with the problem of RCU-read-lock not being able to
sleep.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28  8:32     ` Peter Zijlstra
@ 2009-12-29  9:54       ` Balbir Singh
  0 siblings, 0 replies; 32+ messages in thread
From: Balbir Singh @ 2009-12-29  9:54 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl

* Peter Zijlstra <peterz@infradead.org> [2009-12-28 09:32:53]:

> On Mon, 2009-12-28 at 06:27 +0530, Balbir Singh wrote:
> > * Peter Zijlstra <peterz@infradead.org> [2009-12-27 12:19:56]:
> > 
> > > Your changelog states as much.
> > > 
> > > "Even if RB-tree rotation occurs while we walk tree for look-up, we just
> > > miss vma without oops."
> > > 
> > > However, since this is the case, do we still need the
> > > rcu_assign_pointer() conversion your patch does? All I can see it do is
> > > slow down all RB-tree users, without any gain.
> > 
> > Don't we need the rcu_assign_pointer() on the read side primarily to
> > make sure the pointer is still valid and assignments (writes) are not
> > re-ordered? Are you suggesting that the pointer assignment paths be
> > completely atomic?
> 
> rcu_assign_pointer() is the write side, but if you need a barrier, you
> can make do with a single smp_wmb() after doing the rb-tree op. There is
> no need to add multiple in the tree-ops themselves.
>

Yes, that makes sense.
 
> You cannot make the assignment paths atomic (without locks) that's the
> whole problem.
>

True, but pre-emption can be nasty in some cases. But I am no expert
on the atomicity of operations like assignments across architectures;
I assume all word and long assignments are atomic.
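
On the read side that mostly comes down to ACCESS_ONCE() on naturally
aligned pointer loads; e.g. a lockless containing-vma walk (under
rcu_read_lock(), and find_vma_lockless() is just an illustrative name)
could look like:

    struct vm_area_struct *find_vma_lockless(struct mm_struct *mm,
                                             unsigned long address)
    {
            struct rb_node *node = ACCESS_ONCE(mm->mm_rb.rb_node);

            while (node) {
                    struct vm_area_struct *vma;

                    vma = rb_entry(node, struct vm_area_struct, vm_rb);
                    if (address < vma->vm_start)
                            node = ACCESS_ONCE(node->rb_left);  /* one aligned load */
                    else if (address >= vma->vm_end)
                            node = ACCESS_ONCE(node->rb_right);
                    else
                            return vma;     /* address is inside this vma */
            }
            return NULL;    /* not mapped, or we raced with a rotation */
    }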

-- 
	Balbir

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-28 10:40           ` Peter Zijlstra
@ 2010-01-02 16:14             ` Peter Zijlstra
  2010-01-04  3:02               ` Paul E. McKenney
  2010-01-04 13:48               ` [RFC PATCH -v2] speculative " Peter Zijlstra
  0 siblings, 2 replies; 32+ messages in thread
From: Peter Zijlstra @ 2010-01-02 16:14 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, minchan.kim, cl, Paul E. McKenney,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Mon, 2009-12-28 at 11:40 +0100, Peter Zijlstra wrote:

> > Right, so acquiring the PTE lock will either instantiate page tables for
> > a non-existing vma, leaving you with an interesting mess to clean up, or
> > you can also RCU free the page tables (in the same RCU domain as the
> > vma) which will mostly[*] avoid that issue.
> > 
> > [ To make live really really interesting you could even re-use the
> >   page-tables and abort the RCU free when the region gets re-mapped
> >   before the RCU callbacks happen, this will avoid a free/alloc cycle
> >   for fast remapping workloads. ]
> > 
> > Once you hold the PTE lock, you can validate the vma you looked up,
> > since ->unmap() syncs against it. If at that time you find the
> > speculative vma is dead, you fail and re-try the fault.
> > 
> > [*] there still is the case of faulting on an address that didn't
> > previously have page-tables hence the unmap page table scan will have
> > skipped it -- my hacks simply leaked page tables here, but the idea was
> > to acquire the mmap_sem for reading and cleanup properly.
> 
> Alternatively, we could mark vma's dead in some way before we do the
> unmap, then whenever we hit the page-table alloc path, we check against
> the speculative vma and bail if it died.
> 
> That might just work.. will need to ponder it a bit more.

Right, so I don't think we need RCU page tables on x86. All we need is
some extension to the fast_gup() stuff.

Nor do we need to modify the page-table alloc paths. All we need to do
is have 2 versions of the page table walks like those in
handle_mm_fault().

What we do need is to have call_srcu() for the VMAs since all this fault
stuff can block in many ways. And we need to tag 'dead' VMAs as such
(before doing the unmap).

[ Paul, pretty please? :-) ]
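
Assuming call_srcu() grows the same shape as call_rcu() -- it doesn't
exist yet, which is the request above -- the vma side would look roughly
like:

    static struct srcu_struct vma_srcu;     /* init_srcu_struct()ed at boot */

    /* fault side: srcu_read_lock() may be held across a sleeping ->fault() */
    idx = srcu_read_lock(&vma_srcu);
    vma = find_vma(mm, address);            /* speculative lookup, no mmap_sem */
    ...
    srcu_read_unlock(&vma_srcu, idx);

    /* free side, instead of an immediate kmem_cache_free(vm_area_cachep, vma) */
    call_srcu(&vma_srcu, &vma->vm_rcu_head, free_vma_rcu);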

We also need to introduce FAULT_FLAG_SPECULATIVE to tell the rest of the
fault code about us not holding mmap_sem. And add a fault return state
like VM_FAULT_RETRY to retry the fault while holding the mmap_sem.

Then we need alternative page table walkers; currently things like
handle_mm_fault() use the p*_alloc*() like routines, but for
FAULT_FLAG_SPECULATIVE we need to use the p*_offset*() variants like in
follow_pte(). If that fails to find the pte, we return VM_FAULT_RETRY.

[ One sad consequence is that this still requires the mmap_sem for
  every page table alloc, but since a pte can hold lots of pages it
  should hopefully work out nicely ]

The above is the tricky bit since that can race with unmap, which is
where the fast_gup() stuff comes into play. fast_gup() has the exact
same problem, and already solved it for us. So we need a speculative
page table walker that does whatever fast_gup() does, which on x86 is
disabling IRQs (powerpc has RCU-freed page tables).

Now, at all sites where we actually lock the ptl we need to redo that
page table walk; failing to find the pte will again return
VM_FAULT_RETRY. Once we lock the ptl we need to check the VMA's dead
state; if it is dead we also bail with VM_FAULT_RETRY, otherwise we're
good and can continue.


Something like the below, which 'sometimes' boots on my dual core and
when it boots seems to survive building a kernel... definitely needs
more work.

---
 arch/x86/mm/fault.c      |    8 ++
 arch/x86/mm/gup.c        |   10 ++
 include/linux/mm.h       |   14 +++
 include/linux/mm_types.h |    4 +
 init/Kconfig             |   34 +++---
 kernel/sched.c           |    9 ++-
 mm/memory.c              |  293 +++++++++++++++++++++++++++++++--------------
 mm/mmap.c                |   31 +++++-
 mm/util.c                |   12 ++-
 9 files changed, 302 insertions(+), 113 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f627779..c748529 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1040,6 +1040,14 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 		return;
 	}
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address,
+				error_code & PF_WRITE ? FAULT_FLAG_WRITE : 0);
+
+		if (!(fault & (VM_FAULT_ERROR | VM_FAULT_RETRY)))
+			return;
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 71da1bc..6eeaef7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -373,3 +373,13 @@ slow_irqon:
 		return ret;
 	}
 }
+
+void pin_page_tables(void)
+{
+	local_irq_disable();
+}
+
+void unpin_page_tables(void)
+{
+	local_irq_enable();
+}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2265f28..7bc94f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -136,6 +136,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_SPECULATIVE	0x08
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -711,6 +712,7 @@ static inline int page_mapped(struct page *page)
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+#define VM_FAULT_RETRY  0x0400
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
@@ -763,6 +765,14 @@ unsigned long unmap_vmas(struct mmu_gather **tlb,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 
+static inline int vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = vma->vm_sequence.sequence;
+	smp_rmb();
+	return ret || (seq & 1) || seq != sequence;
+}
+
 /**
  * mm_walk - callbacks for walk_page_range
  * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
@@ -819,6 +829,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+			unsigned long address, unsigned int flags);
 #else
 static inline int handle_mm_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
@@ -838,6 +850,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+void pin_page_tables(void);
+void unpin_page_tables(void);
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 84a524a..0727300 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,8 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/seqlock.h>
+#include <linux/rcupdate.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -186,6 +188,8 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
diff --git a/init/Kconfig b/init/Kconfig
index 06dab27..5edae47 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -314,19 +314,19 @@ menu "RCU Subsystem"
 
 choice
 	prompt "RCU Implementation"
-	default TREE_RCU
+	default TREE_PREEMPT_RCU
 
-config TREE_RCU
-	bool "Tree-based hierarchical RCU"
-	help
-	  This option selects the RCU implementation that is
-	  designed for very large SMP system with hundreds or
-	  thousands of CPUs.  It also scales down nicely to
-	  smaller systems.
+#config TREE_RCU
+#	bool "Tree-based hierarchical RCU"
+#	help
+#	  This option selects the RCU implementation that is
+#	  designed for very large SMP system with hundreds or
+#	  thousands of CPUs.  It also scales down nicely to
+#	  smaller systems.
 
 config TREE_PREEMPT_RCU
 	bool "Preemptable tree-based hierarchical RCU"
-	depends on PREEMPT
+#	depends on PREEMPT
 	help
 	  This option selects the RCU implementation that is
 	  designed for very large SMP systems with hundreds or
@@ -334,14 +334,14 @@ config TREE_PREEMPT_RCU
 	  is also required.  It also scales down nicely to
 	  smaller systems.
 
-config TINY_RCU
-	bool "UP-only small-memory-footprint RCU"
-	depends on !SMP
-	help
-	  This option selects the RCU implementation that is
-	  designed for UP systems from which real-time response
-	  is not required.  This option greatly reduces the
-	  memory footprint of RCU.
+#config TINY_RCU
+#	bool "UP-only small-memory-footprint RCU"
+#	depends on !SMP
+#	help
+#	  This option selects the RCU implementation that is
+#	  designed for UP systems from which real-time response
+#	  is not required.  This option greatly reduces the
+#	  memory footprint of RCU.
 
 endchoice
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 22c14eb..21cdc52 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9689,7 +9689,14 @@ void __init sched_init(void)
 #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
 static inline int preempt_count_equals(int preempt_offset)
 {
-	int nested = (preempt_count() & ~PREEMPT_ACTIVE) + rcu_preempt_depth();
+	int nested = (preempt_count() & ~PREEMPT_ACTIVE)
+		/*
+		 * remove this for we need preemptible RCU
+		 * exactly because it needs to sleep..
+		 *
+		 + rcu_preempt_depth()
+		 */
+		;
 
 	return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
 }
diff --git a/mm/memory.c b/mm/memory.c
index 09e4b1b..ace6645 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1919,31 +1919,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
 /*
- * handle_pte_fault chooses page fault handler according to an entry
- * which was read non-atomically.  Before making any commitment, on
- * those architectures or configurations (e.g. i386 with PAE) which
- * might give a mix of unmatched parts, do_swap_page and do_file_page
- * must check under lock before unmapping the pte and proceeding
- * (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page and do_no_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
-/*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
  * servicing faults for write access.  In the normal case, do always want
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
@@ -1982,6 +1957,52 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 		copy_user_highpage(dst, src, va, vma);
 }
 
+static int pte_map_lock(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		unsigned int seq, pte_t **ptep, spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+
+	if (!(flags & FAULT_FLAG_SPECULATIVE)) {
+		*ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+		return 1;
+	}
+
+	pin_page_tables();
+
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out;
+
+	if (pmd_huge(*pmd))
+		goto out;
+
+	*ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+	if (!*ptep)
+		goto out;
+
+	if (vma && vma_is_dead(vma, seq))
+		goto unlock;
+
+	unpin_page_tables();
+	return 1;
+
+unlock:
+	pte_unmap_unlock(*ptep, *ptlp);
+out:
+	unpin_page_tables();
+	return 0;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2002,7 +2023,8 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte)
+		spinlock_t *ptl, unsigned int flags, pte_t orig_pte,
+		unsigned int seq)
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
@@ -2034,8 +2056,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			page_cache_get(old_page);
 			pte_unmap_unlock(page_table, ptl);
 			lock_page(old_page);
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
+
+			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+						&page_table, &ptl)) {
+				unlock_page(old_page);
+				ret = VM_FAULT_RETRY;
+				goto err;
+			}
+
 			if (!pte_same(*page_table, orig_pte)) {
 				unlock_page(old_page);
 				page_cache_release(old_page);
@@ -2077,14 +2105,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (unlikely(tmp &
 					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 				ret = tmp;
-				goto unwritable_page;
+				goto err;
 			}
 			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
 				lock_page(old_page);
 				if (!old_page->mapping) {
 					ret = 0; /* retry the fault */
 					unlock_page(old_page);
-					goto unwritable_page;
+					goto err;
 				}
 			} else
 				VM_BUG_ON(!PageLocked(old_page));
@@ -2095,8 +2123,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * they did, we just return, as we can count on the
 			 * MMU to tell us if they didn't also make it writable.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
+			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+						&page_table, &ptl)) {
+				unlock_page(old_page);
+				ret = VM_FAULT_RETRY;
+				goto err;
+			}
+
 			if (!pte_same(*page_table, orig_pte)) {
 				unlock_page(old_page);
 				page_cache_release(old_page);
@@ -2128,17 +2161,23 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
+	if (unlikely(anon_vma_prepare(vma))) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
 
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
-		if (!new_page)
-			goto oom;
+		if (!new_page) {
+			ret = VM_FAULT_OOM;
+			goto err;
+		}
 	} else {
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
-		if (!new_page)
-			goto oom;
+		if (!new_page) {
+			ret = VM_FAULT_OOM;
+			goto err;
+		}
 		cow_user_page(new_page, old_page, address, vma);
 	}
 	__SetPageUptodate(new_page);
@@ -2153,13 +2192,20 @@ gotten:
 		unlock_page(old_page);
 	}
 
-	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
-		goto oom_free_new;
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) {
+		ret = VM_FAULT_OOM;
+		goto err_free_new;
+	}
 
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		mem_cgroup_uncharge_page(new_page);
+		ret = VM_FAULT_RETRY;
+		goto err_free_new;
+	}
+
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -2258,9 +2304,9 @@ unlock:
 			file_update_time(vma->vm_file);
 	}
 	return ret;
-oom_free_new:
+err_free_new:
 	page_cache_release(new_page);
-oom:
+err:
 	if (old_page) {
 		if (page_mkwrite) {
 			unlock_page(old_page);
@@ -2268,10 +2314,6 @@ oom:
 		}
 		page_cache_release(old_page);
 	}
-	return VM_FAULT_OOM;
-
-unwritable_page:
-	page_cache_release(old_page);
 	return ret;
 }
 
@@ -2508,22 +2550,23 @@ int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		pte_t orig_pte, unsigned int seq)
 {
 	spinlock_t *ptl;
 	struct page *page;
 	swp_entry_t entry;
-	pte_t pte;
+	pte_t *page_table, pte;
 	struct mem_cgroup *ptr = NULL;
 	int ret = 0;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
+			if (flags & FAULT_FLAG_SPECULATIVE) {
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
@@ -2544,7 +2587,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+						&page_table, &ptl)) {
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
 			if (likely(pte_same(*page_table, orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2581,7 +2628,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		ret = VM_FAULT_RETRY;
+		goto out_nolock;
+	}
+
 	if (unlikely(!pte_same(*page_table, orig_pte)))
 		goto out_nomap;
 
@@ -2622,7 +2673,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unlock_page(page);
 
 	if (flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+		ret |= do_wp_page(mm, vma, address, page_table, pmd,
+				ptl, flags, pte, seq);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
@@ -2635,8 +2687,9 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge_swapin(ptr);
 	pte_unmap_unlock(page_table, ptl);
+out_nolock:
+	mem_cgroup_cancel_charge_swapin(ptr);
 out_page:
 	unlock_page(page);
 out_release:
@@ -2650,18 +2703,19 @@ out_release:
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags)
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		unsigned int seq)
 {
 	struct page *page;
 	spinlock_t *ptl;
-	pte_t entry;
+	pte_t entry, *page_table;
 
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
-		ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
+		if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+					&page_table, &ptl))
+			return VM_FAULT_RETRY;
 		if (!pte_none(*page_table))
 			goto unlock;
 		goto setpte;
@@ -2684,7 +2738,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		mem_cgroup_uncharge_page(page);
+		page_cache_release(page);
+		return VM_FAULT_RETRY;
+	}
+
 	if (!pte_none(*page_table))
 		goto release;
 
@@ -2722,8 +2781,8 @@ oom:
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd, pgoff_t pgoff,
+		unsigned int flags, pte_t orig_pte, unsigned int seq)
 {
 	pte_t *page_table;
 	spinlock_t *ptl;
@@ -2823,7 +2882,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	}
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		ret = VM_FAULT_RETRY;
+		goto out_uncharge;
+	}
 
 	/*
 	 * This silly early PAGE_DIRTY setting removes a race
@@ -2856,7 +2918,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, address, entry);
+		pte_unmap_unlock(page_table, ptl);
 	} else {
+		pte_unmap_unlock(page_table, ptl);
+out_uncharge:
 		if (charged)
 			mem_cgroup_uncharge_page(page);
 		if (anon)
@@ -2865,8 +2930,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			anon = 1; /* no anon but release faulted_page */
 	}
 
-	pte_unmap_unlock(page_table, ptl);
-
 out:
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
@@ -2900,14 +2963,13 @@ unwritable_page:
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd,
+		unsigned int flags, pte_t orig_pte, unsigned int seq)
 {
 	pgoff_t pgoff = (((address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	pte_unmap(page_table);
-	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
 }
 
 /*
@@ -2920,16 +2982,13 @@ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd,
+		unsigned int flags, pte_t orig_pte, unsigned int seq)
 {
 	pgoff_t pgoff;
 
 	flags |= FAULT_FLAG_NONLINEAR;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
-		return 0;
-
 	if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
 		/*
 		 * Page table corrupted: show pte and kill process.
@@ -2939,7 +2998,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	pgoff = pte_to_pgoff(orig_pte);
-	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
 }
 
 /*
@@ -2957,37 +3016,38 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+		pte_t entry, pmd_t *pmd, unsigned int flags,
+		unsigned int seq)
 {
-	pte_t entry;
 	spinlock_t *ptl;
+	pte_t *pte;
 
-	entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
 					return do_linear_fault(mm, vma, address,
-						pte, pmd, flags, entry);
+						pmd, flags, entry, seq);
 			}
 			return do_anonymous_page(mm, vma, address,
-						 pte, pmd, flags);
+						 pmd, flags, seq);
 		}
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
-					pte, pmd, flags, entry);
+					pmd, flags, entry, seq);
 		return do_swap_page(mm, vma, address,
-					pte, pmd, flags, entry);
+					pmd, flags, entry, seq);
 	}
 
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &pte, &ptl))
+		return VM_FAULT_RETRY;
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
 	if (flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
 			return do_wp_page(mm, vma, address,
-					pte, pmd, ptl, entry);
+					pte, pmd, ptl, flags, entry, seq);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
@@ -3017,7 +3077,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
+	pte_t *pte, entry;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -3037,9 +3097,60 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte)
 		return VM_FAULT_OOM;
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+	entry = *pte;
+
+	pte_unmap(pte);
+
+	return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
+}
+
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+		unsigned int flags)
+{
+	pmd_t *pmd = NULL;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	struct vm_area_struct *vma;
+	unsigned int seq;
+	int ret = VM_FAULT_RETRY;
+	int dead;
+
+	__set_current_state(TASK_RUNNING);
+	flags |= FAULT_FLAG_SPECULATIVE;
+
+	count_vm_event(PGFAULT);
+
+	rcu_read_lock();
+	if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
+		goto out_unlock;
+
+	vma = find_vma(mm, address);
+	if (!(vma && vma->vm_end > address && vma->vm_start <= address))
+		goto out_unmap;
+
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = vma->vm_sequence.sequence;
+	smp_rmb();
+
+	if (dead || seq & 1)
+		goto out_unmap;
+
+	entry = *pte;
+
+	pte_unmap_unlock(pte, ptl);
+
+	ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
+
+out_unlock:
+	rcu_read_unlock();
+	return ret;
+
+out_unmap:
+	pte_unmap_unlock(pte, ptl);
+	goto out_unlock;
 }
 
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
diff --git a/mm/mmap.c b/mm/mmap.c
index d9c77b2..024e406 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -222,6 +222,19 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+static void free_vma_rcu(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_rcu(&vma->vm_rcu_head, free_vma_rcu);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -238,7 +251,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 			removed_exe_file_vma(vma->vm_mm);
 	}
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -488,6 +501,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	prev->vm_next = vma->vm_next;
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
+	smp_wmb();
+	RB_CLEAR_NODE(&vma->vm_rb);
 	if (mm->mmap_cache == vma)
 		mm->mmap_cache = prev;
 }
@@ -512,6 +527,10 @@ void vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin(&next->vm_sequence);
+
 	if (next && !insert) {
 		if (end >= next->vm_end) {
 			/*
@@ -640,18 +659,24 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
 		if (remove_next == 2) {
+			write_seqcount_end(&next->vm_sequence);
 			next = vma->vm_next;
+			write_seqcount_begin(&next->vm_sequence);
 			goto again;
 		}
 	}
 
+	if (next)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 }
 
@@ -1808,6 +1833,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	do {
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
+		smp_wmb();
+		RB_CLEAR_NODE(&vma->vm_rb);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
diff --git a/mm/util.c b/mm/util.c
index b377ce4..1f5cfb7 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -257,8 +257,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
  * callers need to carefully consider what to use. On many architectures,
  * get_user_pages_fast simply falls back to get_user_pages.
  */
-int __attribute__((weak)) get_user_pages_fast(unsigned long start,
-				int nr_pages, int write, struct page **pages)
+int __weak get_user_pages_fast(unsigned long start,
+			       int nr_pages, int write, struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
@@ -272,6 +272,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
+void __weak pin_page_tables(void)
+{
+}
+
+void __weak unpin_page_tables(void)
+{
+}
+
 SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
 		unsigned long, prot, unsigned long, flags,
 		unsigned long, fd, unsigned long, pgoff)



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2009-12-25  1:51 [RFC PATCH] asynchronous page fault KAMEZAWA Hiroyuki
                   ` (2 preceding siblings ...)
  2009-12-27 12:03 ` Peter Zijlstra
@ 2010-01-02 21:45 ` Benjamin Herrenschmidt
  3 siblings, 0 replies; 32+ messages in thread
From: Benjamin Herrenschmidt @ 2010-01-02 21:45 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: linux-kernel, linux-mm, minchan.kim, cl

On Fri, 2009-12-25 at 10:51 +0900, KAMEZAWA Hiroyuki wrote:
> Speculative page fault v3.
> 
> This version is much simpler than old versions and doesn't use mm_accessor
> but use RCU. This is based on linux-2.6.33-rc2.
> 
> This patch is just my toy but shows...
>  - Once RB-tree is RCU-aware and no-lock in readside, we can avoid mmap_sem
>    in page fault. 
> So, what we need is not mm_accessor, but RCU-aware RB-tree, I think.
> 
> But yes, I may miss something critical ;)
> 
> After patch, statistics perf show is following. Test progam is attached.

One concern I have with this, not that it can't be addressed but we'll
have to be extra careful, is that the mmap_sem in the page fault path
tends to protect more than just the VMA tree.

One example on powerpc is the slice map used to keep track of page
sizes. I would also need some time to convince myself that no bits of
the MMU hash code assume that holding the
mmap_sem for writing prevents a PTE from being changed from !present to
present.

I wouldn't be surprised if there were more issues around fancy users of
->fault(): things like spufs, the DRM, etc.

Cheers,
Ben.
 


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2010-01-02 16:14             ` Peter Zijlstra
@ 2010-01-04  3:02               ` Paul E. McKenney
  2010-01-04  7:53                 ` Peter Zijlstra
  2010-01-04 13:48               ` [RFC PATCH -v2] speculative " Peter Zijlstra
  1 sibling, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2010-01-04  3:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Sat, Jan 02, 2010 at 05:14:04PM +0100, Peter Zijlstra wrote:
> On Mon, 2009-12-28 at 11:40 +0100, Peter Zijlstra wrote:
> 
> > > Right, so acquiring the PTE lock will either instantiate page tables for
> > > a non-existing vma, leaving you with an interesting mess to clean up, or
> > > you can also RCU free the page tables (in the same RCU domain as the
> > > vma) which will mostly[*] avoid that issue.
> > > 
> > > [ To make live really really interesting you could even re-use the
> > >   page-tables and abort the RCU free when the region gets re-mapped
> > >   before the RCU callbacks happen, this will avoid a free/alloc cycle
> > >   for fast remapping workloads. ]
> > > 
> > > Once you hold the PTE lock, you can validate the vma you looked up,
> > > since ->unmap() syncs against it. If at that time you find the
> > > speculative vma is dead, you fail and re-try the fault.
> > > 
> > > [*] there still is the case of faulting on an address that didn't
> > > previously have page-tables hence the unmap page table scan will have
> > > skipped it -- my hacks simply leaked page tables here, but the idea was
> > > to acquire the mmap_sem for reading and cleanup properly.
> > 
> > Alternatively, we could mark vma's dead in some way before we do the
> > unmap, then whenever we hit the page-table alloc path, we check against
> > the speculative vma and bail if it died.
> > 
> > That might just work.. will need to ponder it a bit more.
> 
> Right, so I don't think we need RCU page tables on x86. All we need is
> some extension to the fast_gup() stuff.
> 
> Nor do we need to modify the page-table alloc paths. All we need to do
> is have 2 versions of the page table walks like those in
> handle_mm_fault().
> 
> What we do need is to have call_srcu() for the VMAs since all this fault
> stuff can block in many ways. And we need to tag 'dead' VMAs as such
> (before doing the unmap).
> 
> [ Paul, pretty please? :-) ]

It would not be all that hard for me to make a call_srcu(), but...

1.	How are you avoiding OOM by SRCU callback?  (I am sure you
	have this worked out, but I do have to ask!)

2.	How many srcu_struct data structures are you envisioning?
	One globally?  One per process?  One per struct vma?
	(Not necessary to know this for call_srcu(), but will be needed
	as I work out how to make SRCU scale with large numbers of CPUs.)

							Thanx, Paul

> We also need to introduce FAULT_FLAG_SPECULATIVE to tell the rest of the
> fault code about us not holding mmap_sem. And add a return fault return
> state like VM_FAULT_RETRY to retry the fault holding the mmap_sem.
> 
> Then we need alternative page table walkers, currently things like
> handle_mm_fault() use the p*_alloc*() like routines, but for
> FAULT_FLAG_SPECULATIVE we need to use the p*_offset*() variants like in
> follow_pte(). If that fails to find the pte, we return VM_FAULT_RETRY.
> 
> [ One sad consequence is that this still requires the mmap_sem for
>   every page table alloc, but since a pte can hold lots of pages it
>   should hopefully work out nicely ]
> 
> The above is the tricky bit since that can race with unmap, which is
> where the fast_gup() stuff comes into play. fast_gup() has the exact
> same problem, and already solved it for us. So we need a speculative
> page table walker that does whatever fast_gup() does, which on x86 is
> disable IRQs (powerpc has RCU freed page tables).
> 
> Now all sites where we actually lock the ptl we need to actually redo
> that page table walk, failing to find the pte will again return
> VM_FAULT_RETRY, once we lock the ptl we need to check the VMA's dead
> state, if dead we also bail with VM_FAULT_RETRY, otherwise we're good
> and can continue.
> 
> 
> Something like the below, which 'sometimes' boots on my dual core and
> when it boots seems to survive building a kernel... definitely needs
> more work.
> 
> ---
>  arch/x86/mm/fault.c      |    8 ++
>  arch/x86/mm/gup.c        |   10 ++
>  include/linux/mm.h       |   14 +++
>  include/linux/mm_types.h |    4 +
>  init/Kconfig             |   34 +++---
>  kernel/sched.c           |    9 ++-
>  mm/memory.c              |  293 +++++++++++++++++++++++++++++++--------------
>  mm/mmap.c                |   31 +++++-
>  mm/util.c                |   12 ++-
>  9 files changed, 302 insertions(+), 113 deletions(-)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index f627779..c748529 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1040,6 +1040,14 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
>  		return;
>  	}
> 
> +	if (error_code & PF_USER) {
> +		fault = handle_speculative_fault(mm, address,
> +				error_code & PF_WRITE ? FAULT_FLAG_WRITE : 0);
> +
> +		if (!(fault & (VM_FAULT_ERROR | VM_FAULT_RETRY)))
> +			return;
> +	}
> +
>  	/*
>  	 * When running in the kernel we expect faults to occur only to
>  	 * addresses in user space.  All other faults represent errors in
> diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
> index 71da1bc..6eeaef7 100644
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -373,3 +373,13 @@ slow_irqon:
>  		return ret;
>  	}
>  }
> +
> +void pin_page_tables(void)
> +{
> +	local_irq_disable();
> +}
> +
> +void unpin_page_tables(void)
> +{
> +	local_irq_enable();
> +}
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 2265f28..7bc94f9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -136,6 +136,7 @@ extern pgprot_t protection_map[16];
>  #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
>  #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
>  #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
> +#define FAULT_FLAG_SPECULATIVE	0x08
> 
>  /*
>   * This interface is used by x86 PAT code to identify a pfn mapping that is
> @@ -711,6 +712,7 @@ static inline int page_mapped(struct page *page)
> 
>  #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
>  #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
> +#define VM_FAULT_RETRY  0x0400
> 
>  #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
> 
> @@ -763,6 +765,14 @@ unsigned long unmap_vmas(struct mmu_gather **tlb,
>  		unsigned long end_addr, unsigned long *nr_accounted,
>  		struct zap_details *);
> 
> +static inline int vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
> +{
> +	int ret = RB_EMPTY_NODE(&vma->vm_rb);
> +	unsigned seq = vma->vm_sequence.sequence;
> +	smp_rmb();
> +	return ret || (seq & 1) || seq != sequence;
> +}
> +
>  /**
>   * mm_walk - callbacks for walk_page_range
>   * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
> @@ -819,6 +829,8 @@ int invalidate_inode_page(struct page *page);
>  #ifdef CONFIG_MMU
>  extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			unsigned long address, unsigned int flags);
> +extern int handle_speculative_fault(struct mm_struct *mm,
> +			unsigned long address, unsigned int flags);
>  #else
>  static inline int handle_mm_fault(struct mm_struct *mm,
>  			struct vm_area_struct *vma, unsigned long address,
> @@ -838,6 +850,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>  			struct page **pages, struct vm_area_struct **vmas);
>  int get_user_pages_fast(unsigned long start, int nr_pages, int write,
>  			struct page **pages);
> +void pin_page_tables(void);
> +void unpin_page_tables(void);
>  struct page *get_dump_page(unsigned long addr);
> 
>  extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 84a524a..0727300 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -12,6 +12,8 @@
>  #include <linux/completion.h>
>  #include <linux/cpumask.h>
>  #include <linux/page-debug-flags.h>
> +#include <linux/seqlock.h>
> +#include <linux/rcupdate.h>
>  #include <asm/page.h>
>  #include <asm/mmu.h>
> 
> @@ -186,6 +188,8 @@ struct vm_area_struct {
>  #ifdef CONFIG_NUMA
>  	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
>  #endif
> +	seqcount_t vm_sequence;
> +	struct rcu_head vm_rcu_head;
>  };
> 
>  struct core_thread {
> diff --git a/init/Kconfig b/init/Kconfig
> index 06dab27..5edae47 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -314,19 +314,19 @@ menu "RCU Subsystem"
> 
>  choice
>  	prompt "RCU Implementation"
> -	default TREE_RCU
> +	default TREE_PREEMPT_RCU
> 
> -config TREE_RCU
> -	bool "Tree-based hierarchical RCU"
> -	help
> -	  This option selects the RCU implementation that is
> -	  designed for very large SMP system with hundreds or
> -	  thousands of CPUs.  It also scales down nicely to
> -	  smaller systems.
> +#config TREE_RCU
> +#	bool "Tree-based hierarchical RCU"
> +#	help
> +#	  This option selects the RCU implementation that is
> +#	  designed for very large SMP system with hundreds or
> +#	  thousands of CPUs.  It also scales down nicely to
> +#	  smaller systems.
> 
>  config TREE_PREEMPT_RCU
>  	bool "Preemptable tree-based hierarchical RCU"
> -	depends on PREEMPT
> +#	depends on PREEMPT
>  	help
>  	  This option selects the RCU implementation that is
>  	  designed for very large SMP systems with hundreds or
> @@ -334,14 +334,14 @@ config TREE_PREEMPT_RCU
>  	  is also required.  It also scales down nicely to
>  	  smaller systems.
> 
> -config TINY_RCU
> -	bool "UP-only small-memory-footprint RCU"
> -	depends on !SMP
> -	help
> -	  This option selects the RCU implementation that is
> -	  designed for UP systems from which real-time response
> -	  is not required.  This option greatly reduces the
> -	  memory footprint of RCU.
> +#config TINY_RCU
> +#	bool "UP-only small-memory-footprint RCU"
> +#	depends on !SMP
> +#	help
> +#	  This option selects the RCU implementation that is
> +#	  designed for UP systems from which real-time response
> +#	  is not required.  This option greatly reduces the
> +#	  memory footprint of RCU.
> 
>  endchoice
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 22c14eb..21cdc52 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9689,7 +9689,14 @@ void __init sched_init(void)
>  #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
>  static inline int preempt_count_equals(int preempt_offset)
>  {
> -	int nested = (preempt_count() & ~PREEMPT_ACTIVE) + rcu_preempt_depth();
> +	int nested = (preempt_count() & ~PREEMPT_ACTIVE)
> +		/*
> +		 * remove this for we need preemptible RCU
> +		 * exactly because it needs to sleep..
> +		 *
> +		 + rcu_preempt_depth()
> +		 */
> +		;
> 
>  	return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
>  }
> diff --git a/mm/memory.c b/mm/memory.c
> index 09e4b1b..ace6645 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1919,31 +1919,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>  EXPORT_SYMBOL_GPL(apply_to_page_range);
> 
>  /*
> - * handle_pte_fault chooses page fault handler according to an entry
> - * which was read non-atomically.  Before making any commitment, on
> - * those architectures or configurations (e.g. i386 with PAE) which
> - * might give a mix of unmatched parts, do_swap_page and do_file_page
> - * must check under lock before unmapping the pte and proceeding
> - * (but do_wp_page is only called after already making such a check;
> - * and do_anonymous_page and do_no_page can safely check later on).
> - */
> -static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
> -				pte_t *page_table, pte_t orig_pte)
> -{
> -	int same = 1;
> -#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
> -	if (sizeof(pte_t) > sizeof(unsigned long)) {
> -		spinlock_t *ptl = pte_lockptr(mm, pmd);
> -		spin_lock(ptl);
> -		same = pte_same(*page_table, orig_pte);
> -		spin_unlock(ptl);
> -	}
> -#endif
> -	pte_unmap(page_table);
> -	return same;
> -}
> -
> -/*
>   * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
>   * servicing faults for write access.  In the normal case, do always want
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
> @@ -1982,6 +1957,52 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
>  		copy_user_highpage(dst, src, va, vma);
>  }
> 
> +static int pte_map_lock(struct mm_struct *mm, struct vm_area_struct *vma,
> +		unsigned long address, pmd_t *pmd, unsigned int flags,
> +		unsigned int seq, pte_t **ptep, spinlock_t **ptlp)
> +{
> +	pgd_t *pgd;
> +	pud_t *pud;
> +
> +	if (!(flags & FAULT_FLAG_SPECULATIVE)) {
> +		*ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
> +		return 1;
> +	}
> +
> +	pin_page_tables();
> +
> +	pgd = pgd_offset(mm, address);
> +	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> +		goto out;
> +
> +	pud = pud_offset(pgd, address);
> +	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
> +		goto out;
> +
> +	pmd = pmd_offset(pud, address);
> +	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
> +		goto out;
> +
> +	if (pmd_huge(*pmd))
> +		goto out;
> +
> +	*ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
> +	if (!*ptep)
> +		goto out;
> +
> +	if (vma && vma_is_dead(vma, seq))
> +		goto unlock;
> +
> +	unpin_page_tables();
> +	return 1;
> +
> +unlock:
> +	pte_unmap_unlock(*ptep, *ptlp);
> +out:
> +	unpin_page_tables();
> +	return 0;
> +}
> +
>  /*
>   * This routine handles present pages, when users try to write
>   * to a shared page. It is done by copying the page to a new address
> @@ -2002,7 +2023,8 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
>   */
>  static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  		unsigned long address, pte_t *page_table, pmd_t *pmd,
> -		spinlock_t *ptl, pte_t orig_pte)
> +		spinlock_t *ptl, unsigned int flags, pte_t orig_pte,
> +		unsigned int seq)
>  {
>  	struct page *old_page, *new_page;
>  	pte_t entry;
> @@ -2034,8 +2056,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			page_cache_get(old_page);
>  			pte_unmap_unlock(page_table, ptl);
>  			lock_page(old_page);
> -			page_table = pte_offset_map_lock(mm, pmd, address,
> -							 &ptl);
> +
> +			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
> +						&page_table, &ptl)) {
> +				unlock_page(old_page);
> +				ret = VM_FAULT_RETRY;
> +				goto err;
> +			}
> +
>  			if (!pte_same(*page_table, orig_pte)) {
>  				unlock_page(old_page);
>  				page_cache_release(old_page);
> @@ -2077,14 +2105,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			if (unlikely(tmp &
>  					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
>  				ret = tmp;
> -				goto unwritable_page;
> +				goto err;
>  			}
>  			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
>  				lock_page(old_page);
>  				if (!old_page->mapping) {
>  					ret = 0; /* retry the fault */
>  					unlock_page(old_page);
> -					goto unwritable_page;
> +					goto err;
>  				}
>  			} else
>  				VM_BUG_ON(!PageLocked(old_page));
> @@ -2095,8 +2123,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 * they did, we just return, as we can count on the
>  			 * MMU to tell us if they didn't also make it writable.
>  			 */
> -			page_table = pte_offset_map_lock(mm, pmd, address,
> -							 &ptl);
> +			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
> +						&page_table, &ptl)) {
> +				unlock_page(old_page);
> +				ret = VM_FAULT_RETRY;
> +				goto err;
> +			}
> +
>  			if (!pte_same(*page_table, orig_pte)) {
>  				unlock_page(old_page);
>  				page_cache_release(old_page);
> @@ -2128,17 +2161,23 @@ reuse:
>  gotten:
>  	pte_unmap_unlock(page_table, ptl);
> 
> -	if (unlikely(anon_vma_prepare(vma)))
> -		goto oom;
> +	if (unlikely(anon_vma_prepare(vma))) {
> +		ret = VM_FAULT_OOM;
> +		goto err;
> +	}
> 
>  	if (is_zero_pfn(pte_pfn(orig_pte))) {
>  		new_page = alloc_zeroed_user_highpage_movable(vma, address);
> -		if (!new_page)
> -			goto oom;
> +		if (!new_page) {
> +			ret = VM_FAULT_OOM;
> +			goto err;
> +		}
>  	} else {
>  		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
> -		if (!new_page)
> -			goto oom;
> +		if (!new_page) {
> +			ret = VM_FAULT_OOM;
> +			goto err;
> +		}
>  		cow_user_page(new_page, old_page, address, vma);
>  	}
>  	__SetPageUptodate(new_page);
> @@ -2153,13 +2192,20 @@ gotten:
>  		unlock_page(old_page);
>  	}
> 
> -	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
> -		goto oom_free_new;
> +	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) {
> +		ret = VM_FAULT_OOM;
> +		goto err_free_new;
> +	}
> 
>  	/*
>  	 * Re-check the pte - we dropped the lock
>  	 */
> -	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
> +		mem_cgroup_uncharge_page(new_page);
> +		ret = VM_FAULT_RETRY;
> +		goto err_free_new;
> +	}
> +
>  	if (likely(pte_same(*page_table, orig_pte))) {
>  		if (old_page) {
>  			if (!PageAnon(old_page)) {
> @@ -2258,9 +2304,9 @@ unlock:
>  			file_update_time(vma->vm_file);
>  	}
>  	return ret;
> -oom_free_new:
> +err_free_new:
>  	page_cache_release(new_page);
> -oom:
> +err:
>  	if (old_page) {
>  		if (page_mkwrite) {
>  			unlock_page(old_page);
> @@ -2268,10 +2314,6 @@ oom:
>  		}
>  		page_cache_release(old_page);
>  	}
> -	return VM_FAULT_OOM;
> -
> -unwritable_page:
> -	page_cache_release(old_page);
>  	return ret;
>  }
> 
> @@ -2508,22 +2550,23 @@ int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
>  static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long address, pte_t *page_table, pmd_t *pmd,
> -		unsigned int flags, pte_t orig_pte)
> +		unsigned long address, pmd_t *pmd, unsigned int flags,
> +		pte_t orig_pte, unsigned int seq)
>  {
>  	spinlock_t *ptl;
>  	struct page *page;
>  	swp_entry_t entry;
> -	pte_t pte;
> +	pte_t *page_table, pte;
>  	struct mem_cgroup *ptr = NULL;
>  	int ret = 0;
> 
> -	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> -		goto out;
> -
>  	entry = pte_to_swp_entry(orig_pte);
>  	if (unlikely(non_swap_entry(entry))) {
>  		if (is_migration_entry(entry)) {
> +			if (flags & FAULT_FLAG_SPECULATIVE) {
> +				ret = VM_FAULT_RETRY;
> +				goto out;
> +			}
>  			migration_entry_wait(mm, pmd, address);
>  		} else if (is_hwpoison_entry(entry)) {
>  			ret = VM_FAULT_HWPOISON;
> @@ -2544,7 +2587,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			 * Back out if somebody else faulted in this pte
>  			 * while we released the pte lock.
>  			 */
> -			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
> +						&page_table, &ptl)) {
> +				ret = VM_FAULT_RETRY;
> +				goto out;
> +			}
>  			if (likely(pte_same(*page_table, orig_pte)))
>  				ret = VM_FAULT_OOM;
>  			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
> @@ -2581,7 +2628,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	/*
>  	 * Back out if somebody else already faulted in this pte.
>  	 */
> -	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
> +		ret = VM_FAULT_RETRY;
> +		goto out_nolock;
> +	}
> +
>  	if (unlikely(!pte_same(*page_table, orig_pte)))
>  		goto out_nomap;
> 
> @@ -2622,7 +2673,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	unlock_page(page);
> 
>  	if (flags & FAULT_FLAG_WRITE) {
> -		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
> +		ret |= do_wp_page(mm, vma, address, page_table, pmd,
> +				ptl, flags, pte, seq);
>  		if (ret & VM_FAULT_ERROR)
>  			ret &= VM_FAULT_ERROR;
>  		goto out;
> @@ -2635,8 +2687,9 @@ unlock:
>  out:
>  	return ret;
>  out_nomap:
> -	mem_cgroup_cancel_charge_swapin(ptr);
>  	pte_unmap_unlock(page_table, ptl);
> +out_nolock:
> +	mem_cgroup_cancel_charge_swapin(ptr);
>  out_page:
>  	unlock_page(page);
>  out_release:
> @@ -2650,18 +2703,19 @@ out_release:
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
>  static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long address, pte_t *page_table, pmd_t *pmd,
> -		unsigned int flags)
> +		unsigned long address, pmd_t *pmd, unsigned int flags,
> +		unsigned int seq)
>  {
>  	struct page *page;
>  	spinlock_t *ptl;
> -	pte_t entry;
> +	pte_t entry, *page_table;
> 
>  	if (!(flags & FAULT_FLAG_WRITE)) {
>  		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
>  						vma->vm_page_prot));
> -		ptl = pte_lockptr(mm, pmd);
> -		spin_lock(ptl);
> +		if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
> +					&page_table, &ptl))
> +			return VM_FAULT_RETRY;
>  		if (!pte_none(*page_table))
>  			goto unlock;
>  		goto setpte;
> @@ -2684,7 +2738,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (vma->vm_flags & VM_WRITE)
>  		entry = pte_mkwrite(pte_mkdirty(entry));
> 
> -	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
> +		mem_cgroup_uncharge_page(page);
> +		page_cache_release(page);
> +		return VM_FAULT_RETRY;
> +	}
> +
>  	if (!pte_none(*page_table))
>  		goto release;
> 
> @@ -2722,8 +2781,8 @@ oom:
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
>  static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long address, pmd_t *pmd,
> -		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
> +		unsigned long address, pmd_t *pmd, pgoff_t pgoff,
> +		unsigned int flags, pte_t orig_pte, unsigned int seq)
>  {
>  	pte_t *page_table;
>  	spinlock_t *ptl;
> @@ -2823,7 +2882,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> 
>  	}
> 
> -	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
> +	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
> +		ret = VM_FAULT_RETRY;
> +		goto out_uncharge;
> +	}
> 
>  	/*
>  	 * This silly early PAGE_DIRTY setting removes a race
> @@ -2856,7 +2918,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> 
>  		/* no need to invalidate: a not-present page won't be cached */
>  		update_mmu_cache(vma, address, entry);
> +		pte_unmap_unlock(page_table, ptl);
>  	} else {
> +		pte_unmap_unlock(page_table, ptl);
> +out_uncharge:
>  		if (charged)
>  			mem_cgroup_uncharge_page(page);
>  		if (anon)
> @@ -2865,8 +2930,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  			anon = 1; /* no anon but release faulted_page */
>  	}
> 
> -	pte_unmap_unlock(page_table, ptl);
> -
>  out:
>  	if (dirty_page) {
>  		struct address_space *mapping = page->mapping;
> @@ -2900,14 +2963,13 @@ unwritable_page:
>  }
> 
>  static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long address, pte_t *page_table, pmd_t *pmd,
> -		unsigned int flags, pte_t orig_pte)
> +		unsigned long address, pmd_t *pmd,
> +		unsigned int flags, pte_t orig_pte, unsigned int seq)
>  {
>  	pgoff_t pgoff = (((address & PAGE_MASK)
>  			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> 
> -	pte_unmap(page_table);
> -	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> +	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
>  }
> 
>  /*
> @@ -2920,16 +2982,13 @@ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   * We return with mmap_sem still held, but pte unmapped and unlocked.
>   */
>  static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> -		unsigned long address, pte_t *page_table, pmd_t *pmd,
> -		unsigned int flags, pte_t orig_pte)
> +		unsigned long address, pmd_t *pmd,
> +		unsigned int flags, pte_t orig_pte, unsigned int seq)
>  {
>  	pgoff_t pgoff;
> 
>  	flags |= FAULT_FLAG_NONLINEAR;
> 
> -	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
> -		return 0;
> -
>  	if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
>  		/*
>  		 * Page table corrupted: show pte and kill process.
> @@ -2939,7 +2998,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	}
> 
>  	pgoff = pte_to_pgoff(orig_pte);
> -	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
> +	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
>  }
> 
>  /*
> @@ -2957,37 +3016,38 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>   */
>  static inline int handle_pte_fault(struct mm_struct *mm,
>  		struct vm_area_struct *vma, unsigned long address,
> -		pte_t *pte, pmd_t *pmd, unsigned int flags)
> +		pte_t entry, pmd_t *pmd, unsigned int flags,
> +		unsigned int seq)
>  {
> -	pte_t entry;
>  	spinlock_t *ptl;
> +	pte_t *pte;
> 
> -	entry = *pte;
>  	if (!pte_present(entry)) {
>  		if (pte_none(entry)) {
>  			if (vma->vm_ops) {
>  				if (likely(vma->vm_ops->fault))
>  					return do_linear_fault(mm, vma, address,
> -						pte, pmd, flags, entry);
> +						pmd, flags, entry, seq);
>  			}
>  			return do_anonymous_page(mm, vma, address,
> -						 pte, pmd, flags);
> +						 pmd, flags, seq);
>  		}
>  		if (pte_file(entry))
>  			return do_nonlinear_fault(mm, vma, address,
> -					pte, pmd, flags, entry);
> +					pmd, flags, entry, seq);
>  		return do_swap_page(mm, vma, address,
> -					pte, pmd, flags, entry);
> +					pmd, flags, entry, seq);
>  	}
> 
> -	ptl = pte_lockptr(mm, pmd);
> -	spin_lock(ptl);
> +	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &pte, &ptl))
> +		return VM_FAULT_RETRY;
>  	if (unlikely(!pte_same(*pte, entry)))
>  		goto unlock;
>  	if (flags & FAULT_FLAG_WRITE) {
> -		if (!pte_write(entry))
> +		if (!pte_write(entry)) {
>  			return do_wp_page(mm, vma, address,
> -					pte, pmd, ptl, entry);
> +					pte, pmd, ptl, flags, entry, seq);
> +		}
>  		entry = pte_mkdirty(entry);
>  	}
>  	entry = pte_mkyoung(entry);
> @@ -3017,7 +3077,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	pgd_t *pgd;
>  	pud_t *pud;
>  	pmd_t *pmd;
> -	pte_t *pte;
> +	pte_t *pte, entry;
> 
>  	__set_current_state(TASK_RUNNING);
> 
> @@ -3037,9 +3097,60 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
>  	if (!pte)
>  		return VM_FAULT_OOM;
> 
> -	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> +	entry = *pte;
> +
> +	pte_unmap(pte);
> +
> +	return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
> +}
> +
> +int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
> +		unsigned int flags)
> +{
> +	pmd_t *pmd = NULL;
> +	pte_t *pte, entry;
> +	spinlock_t *ptl;
> +	struct vm_area_struct *vma;
> +	unsigned int seq;
> +	int ret = VM_FAULT_RETRY;
> +	int dead;
> +
> +	__set_current_state(TASK_RUNNING);
> +	flags |= FAULT_FLAG_SPECULATIVE;
> +
> +	count_vm_event(PGFAULT);
> +
> +	rcu_read_lock();
> +	if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
> +		goto out_unlock;
> +
> +	vma = find_vma(mm, address);
> +	if (!(vma && vma->vm_end > address && vma->vm_start <= address))
> +		goto out_unmap;
> +
> +	dead = RB_EMPTY_NODE(&vma->vm_rb);
> +	seq = vma->vm_sequence.sequence;
> +	smp_rmb();
> +
> +	if (dead || seq & 1)
> +		goto out_unmap;
> +
> +	entry = *pte;
> +
> +	pte_unmap_unlock(pte, ptl);
> +
> +	ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
> +
> +out_unlock:
> +	rcu_read_unlock();
> +	return ret;
> +
> +out_unmap:
> +	pte_unmap_unlock(pte, ptl);
> +	goto out_unlock;
>  }
> 
> +
>  #ifndef __PAGETABLE_PUD_FOLDED
>  /*
>   * Allocate page upper directory.
> diff --git a/mm/mmap.c b/mm/mmap.c
> index d9c77b2..024e406 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -222,6 +222,19 @@ void unlink_file_vma(struct vm_area_struct *vma)
>  	}
>  }
> 
> +static void free_vma_rcu(struct rcu_head *head)
> +{
> +	struct vm_area_struct *vma =
> +		container_of(head, struct vm_area_struct, vm_rcu_head);
> +
> +	kmem_cache_free(vm_area_cachep, vma);
> +}
> +
> +static void free_vma(struct vm_area_struct *vma)
> +{
> +	call_rcu(&vma->vm_rcu_head, free_vma_rcu);
> +}
> +
>  /*
>   * Close a vm structure and free it, returning the next.
>   */
> @@ -238,7 +251,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
>  			removed_exe_file_vma(vma->vm_mm);
>  	}
>  	mpol_put(vma_policy(vma));
> -	kmem_cache_free(vm_area_cachep, vma);
> +	free_vma(vma);
>  	return next;
>  }
> 
> @@ -488,6 +501,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
>  {
>  	prev->vm_next = vma->vm_next;
>  	rb_erase(&vma->vm_rb, &mm->mm_rb);
> +	smp_wmb();
> +	RB_CLEAR_NODE(&vma->vm_rb);
>  	if (mm->mmap_cache == vma)
>  		mm->mmap_cache = prev;
>  }
> @@ -512,6 +527,10 @@ void vma_adjust(struct vm_area_struct *vma, unsigned long start,
>  	long adjust_next = 0;
>  	int remove_next = 0;
> 
> +	write_seqcount_begin(&vma->vm_sequence);
> +	if (next)
> +		write_seqcount_begin(&next->vm_sequence);
> +
>  	if (next && !insert) {
>  		if (end >= next->vm_end) {
>  			/*
> @@ -640,18 +659,24 @@ again:			remove_next = 1 + (end > next->vm_end);
>  		}
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
> -		kmem_cache_free(vm_area_cachep, next);
> +		free_vma(next);
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
>  		 * we must remove another next too. It would clutter
>  		 * up the code too much to do both in one go.
>  		 */
>  		if (remove_next == 2) {
> +			write_seqcount_end(&next->vm_sequence);
>  			next = vma->vm_next;
> +			write_seqcount_begin(&next->vm_sequence);
>  			goto again;
>  		}
>  	}
> 
> +	if (next)
> +		write_seqcount_end(&next->vm_sequence);
> +	write_seqcount_end(&vma->vm_sequence);
> +
>  	validate_mm(mm);
>  }
> 
> @@ -1808,6 +1833,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
>  	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
>  	do {
>  		rb_erase(&vma->vm_rb, &mm->mm_rb);
> +		smp_wmb();
> +		RB_CLEAR_NODE(&vma->vm_rb);
>  		mm->map_count--;
>  		tail_vma = vma;
>  		vma = vma->vm_next;
> diff --git a/mm/util.c b/mm/util.c
> index b377ce4..1f5cfb7 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -257,8 +257,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
>   * callers need to carefully consider what to use. On many architectures,
>   * get_user_pages_fast simply falls back to get_user_pages.
>   */
> -int __attribute__((weak)) get_user_pages_fast(unsigned long start,
> -				int nr_pages, int write, struct page **pages)
> +int __weak get_user_pages_fast(unsigned long start,
> +			       int nr_pages, int write, struct page **pages)
>  {
>  	struct mm_struct *mm = current->mm;
>  	int ret;
> @@ -272,6 +272,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
>  }
>  EXPORT_SYMBOL_GPL(get_user_pages_fast);
> 
> +void __weak pin_page_tables(void)
> +{
> +}
> +
> +void __weak unpin_page_tables(void)
> +{
> +}
> +
>  SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
>  		unsigned long, prot, unsigned long, flags,
>  		unsigned long, fd, unsigned long, pgoff)
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2010-01-04  3:02               ` Paul E. McKenney
@ 2010-01-04  7:53                 ` Peter Zijlstra
  2010-01-04 15:55                   ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2010-01-04  7:53 UTC (permalink / raw)
  To: paulmck
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Sun, 2010-01-03 at 19:02 -0800, Paul E. McKenney wrote:
> It would not be all that hard for me to make a call_srcu(), but...
> 
> 1.      How are you avoiding OOM by SRCU callback?  (I am sure you
>         have this worked out, but I do have to ask!)

Well, I was thinking of having SRCU force a quiescent state in
call_srcu(), much like you did for preemptible RCU.

Alternatively, we could actively throttle the call_srcu() call when we've
got too much pending work.

> 2.      How many srcu_struct data structures are you envisioning?
>         One globally?  One per process?  One per struct vma?
>         (Not necessary to know this for call_srcu(), but will be needed
>         as I work out how to make SRCU scale with large numbers of CPUs.)

For this patch in particular, one global one, covering all vmas.

One reason to keep the VMA RCU domain separate from other RCU objects is
that these VMA read-side critical sections can be rather long because of
the sleeping involved. So mixing them in with other RCU users, which have
much better defined grace periods, will just degrade everything and bring
that OOM scenario much closer.
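
For concreteness, a minimal sketch of what one global SRCU domain covering
all VMAs could look like on the read side. The vma_srcu name, the
do_speculative_fault() helper and the call_srcu() used on the free side are
all assumptions -- call_srcu() is precisely the primitive that does not
exist yet, so today the free path would have to block in synchronize_srcu().

/* Sketch only: a single global SRCU domain covering all VMAs. */
static struct srcu_struct vma_srcu;	/* init_srcu_struct(&vma_srcu) at boot */

int handle_speculative_fault_srcu(struct mm_struct *mm, unsigned long address,
				  unsigned int flags)
{
	int idx, ret;

	/* Readers may sleep (page I/O), hence SRCU rather than plain RCU. */
	idx = srcu_read_lock(&vma_srcu);
	ret = do_speculative_fault(mm, address, flags);	/* assumed helper */
	srcu_read_unlock(&vma_srcu, idx);
	return ret;
}

/* SRCU variant of the free_vma() added in the mm/mmap.c hunk. */
static void free_vma(struct vm_area_struct *vma)
{
	/* Hypothetical API: run free_vma_rcu() after an SRCU grace period. */
	call_srcu(&vma_srcu, &vma->vm_rcu_head, free_vma_rcu);
}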

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC PATCH -v2] speculative page fault
  2010-01-02 16:14             ` Peter Zijlstra
  2010-01-04  3:02               ` Paul E. McKenney
@ 2010-01-04 13:48               ` Peter Zijlstra
  1 sibling, 0 replies; 32+ messages in thread
From: Peter Zijlstra @ 2010-01-04 13:48 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: linux-kernel, linux-mm, minchan.kim, cl, Paul E. McKenney,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Sat, 2010-01-02 at 17:14 +0100, Peter Zijlstra wrote:

> Something like the below, which 'sometimes' boots on my dual core and
> when it boots seems to survive building a kernel... definitely needs
> more work.

OK, the below seems to work much better. It appears we're flushing TLBs
while holding the ptl, which meant that my pte_map_lock(), which on x86
ended up taking the ptl with IRQs disabled, could deadlock against another
CPU holding that same ptl while trying to flush the TLBs, since it won't
service the IPIs.

I crudely hacked around that by making pte_map_lock() a spinning loop that
drops the IRQ disable between attempts... not pretty. At least it boots.
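
Stripped of the pgd/pud/pmd checks and error paths, the workaround in the
pte_map_lock() hunk below boils down to the following (simplified excerpt,
not a drop-in replacement):

/* Simplified form of the retry loop in pte_map_lock() below. */
static int pte_map_trylock_sketch(struct mm_struct *mm, pmd_t *pmd,
				  unsigned long address,
				  pte_t **ptep, spinlock_t **ptlp)
{
again:
	pin_page_tables();		/* local_irq_disable() on x86 */

	*ptlp = pte_lockptr(mm, pmd);
	*ptep = pte_offset_map(pmd, address);
	if (!spin_trylock(*ptlp)) {
		/*
		 * The ptl holder may be waiting for this CPU to ack its
		 * TLB-flush IPI; re-enable IRQs so the IPI can be
		 * serviced, then retry.
		 */
		pte_unmap(*ptep);
		unpin_page_tables();	/* local_irq_enable() */
		goto again;
	}

	unpin_page_tables();
	return 1;
}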

The added counters seem to indicate pretty much all faults end up being
speculative faults, which should be good.

The multi-fault.c test case from Kame doesn't show much (if any)
improvement on my dual-core (shared cache, etc.) but does show a
significant increase in cache misses, which I guess is attributable to
the extra page-table walking (I'll examine this more closely after I try
to get that TLB flushing crap sorted).

What do people think of the general approach?

---
 arch/x86/mm/fault.c      |   20 ++-
 arch/x86/mm/gup.c        |   10 ++
 drivers/char/sysrq.c     |    3 +
 include/linux/mm.h       |   14 ++
 include/linux/mm_types.h |    4 +
 include/linux/sched.h    |    1 +
 init/Kconfig             |   34 +++---
 kernel/sched.c           |    9 ++-
 mm/memory.c              |  301 ++++++++++++++++++++++++++++++++--------------
 mm/mmap.c                |   31 +++++-
 mm/util.c                |   12 ++-
 11 files changed, 320 insertions(+), 119 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index f627779..0b7b32c 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -786,8 +786,6 @@ out_of_memory(struct pt_regs *regs, unsigned long error_code,
 	 * We ran out of memory, call the OOM killer, and return the userspace
 	 * (which will retry the fault, or kill us if we got oom-killed):
 	 */
-	up_read(&current->mm->mmap_sem);
-
 	pagefault_out_of_memory();
 }
 
@@ -799,8 +797,6 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
 	struct mm_struct *mm = tsk->mm;
 	int code = BUS_ADRERR;
 
-	up_read(&mm->mmap_sem);
-
 	/* Kernel mode? Handle exceptions or die: */
 	if (!(error_code & PF_USER))
 		no_context(regs, error_code, address);
@@ -1030,6 +1026,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 		pgtable_bad(regs, error_code, address);
 
 	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);
+	tsk->total_flt++;
 
 	/*
 	 * If we're in an interrupt, have no user context or are running
@@ -1040,6 +1037,16 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
 		return;
 	}
 
+	if (error_code & PF_USER) {
+		fault = handle_speculative_fault(mm, address,
+				error_code & PF_WRITE ? FAULT_FLAG_WRITE : 0);
+
+		if (!(fault & VM_FAULT_RETRY)) {
+			tsk->spec_flt++;
+			goto done;
+		}
+	}
+
 	/*
 	 * When running in the kernel we expect faults to occur only to
 	 * addresses in user space.  All other faults represent errors in
@@ -1119,7 +1126,10 @@ good_area:
 	 */
 	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
 
+	up_read(&mm->mmap_sem);
+done:
 	if (unlikely(fault & VM_FAULT_ERROR)) {
+		tsk->err_flt++;
 		mm_fault_error(regs, error_code, address, fault);
 		return;
 	}
@@ -1135,6 +1145,4 @@ good_area:
 	}
 
 	check_v8086_mode(regs, address, tsk);
-
-	up_read(&mm->mmap_sem);
 }
diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index 71da1bc..6eeaef7 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -373,3 +373,13 @@ slow_irqon:
 		return ret;
 	}
 }
+
+void pin_page_tables(void)
+{
+	local_irq_disable();
+}
+
+void unpin_page_tables(void)
+{
+	local_irq_enable();
+}
diff --git a/drivers/char/sysrq.c b/drivers/char/sysrq.c
index 1ae2de7..e805b83 100644
--- a/drivers/char/sysrq.c
+++ b/drivers/char/sysrq.c
@@ -253,6 +253,9 @@ static void sysrq_handle_showregs(int key, struct tty_struct *tty)
 	if (regs)
 		show_regs(regs);
 	perf_event_print_debug();
+	printk(KERN_INFO "faults: %lu, err: %lu, min: %lu, maj: %lu, spec: %lu\n",
+			current->total_flt, current->err_flt, current->min_flt,
+			current->maj_flt, current->spec_flt);
 }
 static struct sysrq_key_op sysrq_showregs_op = {
 	.handler	= sysrq_handle_showregs,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2265f28..7bc94f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -136,6 +136,7 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_WRITE	0x01	/* Fault was a write access */
 #define FAULT_FLAG_NONLINEAR	0x02	/* Fault was via a nonlinear mapping */
 #define FAULT_FLAG_MKWRITE	0x04	/* Fault was mkwrite of existing pte */
+#define FAULT_FLAG_SPECULATIVE	0x08
 
 /*
  * This interface is used by x86 PAT code to identify a pfn mapping that is
@@ -711,6 +712,7 @@ static inline int page_mapped(struct page *page)
 
 #define VM_FAULT_NOPAGE	0x0100	/* ->fault installed the pte, not return page */
 #define VM_FAULT_LOCKED	0x0200	/* ->fault locked the returned page */
+#define VM_FAULT_RETRY  0x0400
 
 #define VM_FAULT_ERROR	(VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_HWPOISON)
 
@@ -763,6 +765,14 @@ unsigned long unmap_vmas(struct mmu_gather **tlb,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
 
+static inline int vma_is_dead(struct vm_area_struct *vma, unsigned int sequence)
+{
+	int ret = RB_EMPTY_NODE(&vma->vm_rb);
+	unsigned seq = vma->vm_sequence.sequence;
+	smp_rmb();
+	return ret || (seq & 1) || seq != sequence;
+}
+
 /**
  * mm_walk - callbacks for walk_page_range
  * @pgd_entry: if set, called for each non-empty PGD (top-level) entry
@@ -819,6 +829,8 @@ int invalidate_inode_page(struct page *page);
 #ifdef CONFIG_MMU
 extern int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			unsigned long address, unsigned int flags);
+extern int handle_speculative_fault(struct mm_struct *mm,
+			unsigned long address, unsigned int flags);
 #else
 static inline int handle_mm_fault(struct mm_struct *mm,
 			struct vm_area_struct *vma, unsigned long address,
@@ -838,6 +850,8 @@ int get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
 			struct page **pages, struct vm_area_struct **vmas);
 int get_user_pages_fast(unsigned long start, int nr_pages, int write,
 			struct page **pages);
+void pin_page_tables(void);
+void unpin_page_tables(void);
 struct page *get_dump_page(unsigned long addr);
 
 extern int try_to_release_page(struct page * page, gfp_t gfp_mask);
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 84a524a..0727300 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -12,6 +12,8 @@
 #include <linux/completion.h>
 #include <linux/cpumask.h>
 #include <linux/page-debug-flags.h>
+#include <linux/seqlock.h>
+#include <linux/rcupdate.h>
 #include <asm/page.h>
 #include <asm/mmu.h>
 
@@ -186,6 +188,8 @@ struct vm_area_struct {
 #ifdef CONFIG_NUMA
 	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
 #endif
+	seqcount_t vm_sequence;
+	struct rcu_head vm_rcu_head;
 };
 
 struct core_thread {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f2f842d..8763962 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1347,6 +1347,7 @@ struct task_struct {
 	struct timespec real_start_time;	/* boot based time */
 /* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
 	unsigned long min_flt, maj_flt;
+	unsigned long total_flt, err_flt, spec_flt;
 
 	struct task_cputime cputime_expires;
 	struct list_head cpu_timers[3];
diff --git a/init/Kconfig b/init/Kconfig
index 06dab27..5edae47 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -314,19 +314,19 @@ menu "RCU Subsystem"
 
 choice
 	prompt "RCU Implementation"
-	default TREE_RCU
+	default TREE_PREEMPT_RCU
 
-config TREE_RCU
-	bool "Tree-based hierarchical RCU"
-	help
-	  This option selects the RCU implementation that is
-	  designed for very large SMP system with hundreds or
-	  thousands of CPUs.  It also scales down nicely to
-	  smaller systems.
+#config TREE_RCU
+#	bool "Tree-based hierarchical RCU"
+#	help
+#	  This option selects the RCU implementation that is
+#	  designed for very large SMP system with hundreds or
+#	  thousands of CPUs.  It also scales down nicely to
+#	  smaller systems.
 
 config TREE_PREEMPT_RCU
 	bool "Preemptable tree-based hierarchical RCU"
-	depends on PREEMPT
+#	depends on PREEMPT
 	help
 	  This option selects the RCU implementation that is
 	  designed for very large SMP systems with hundreds or
@@ -334,14 +334,14 @@ config TREE_PREEMPT_RCU
 	  is also required.  It also scales down nicely to
 	  smaller systems.
 
-config TINY_RCU
-	bool "UP-only small-memory-footprint RCU"
-	depends on !SMP
-	help
-	  This option selects the RCU implementation that is
-	  designed for UP systems from which real-time response
-	  is not required.  This option greatly reduces the
-	  memory footprint of RCU.
+#config TINY_RCU
+#	bool "UP-only small-memory-footprint RCU"
+#	depends on !SMP
+#	help
+#	  This option selects the RCU implementation that is
+#	  designed for UP systems from which real-time response
+#	  is not required.  This option greatly reduces the
+#	  memory footprint of RCU.
 
 endchoice
 
diff --git a/kernel/sched.c b/kernel/sched.c
index 22c14eb..21cdc52 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9689,7 +9689,14 @@ void __init sched_init(void)
 #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
 static inline int preempt_count_equals(int preempt_offset)
 {
-	int nested = (preempt_count() & ~PREEMPT_ACTIVE) + rcu_preempt_depth();
+	int nested = (preempt_count() & ~PREEMPT_ACTIVE)
+		/*
+		 * remove this for we need preemptible RCU
+		 * exactly because it needs to sleep..
+		 *
+		 + rcu_preempt_depth()
+		 */
+		;
 
 	return (nested == PREEMPT_INATOMIC_BASE + preempt_offset);
 }
diff --git a/mm/memory.c b/mm/memory.c
index 09e4b1b..2b8f60c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1919,31 +1919,6 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
 EXPORT_SYMBOL_GPL(apply_to_page_range);
 
 /*
- * handle_pte_fault chooses page fault handler according to an entry
- * which was read non-atomically.  Before making any commitment, on
- * those architectures or configurations (e.g. i386 with PAE) which
- * might give a mix of unmatched parts, do_swap_page and do_file_page
- * must check under lock before unmapping the pte and proceeding
- * (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page and do_no_page can safely check later on).
- */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
-{
-	int same = 1;
-#if defined(CONFIG_SMP) || defined(CONFIG_PREEMPT)
-	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
-		spin_unlock(ptl);
-	}
-#endif
-	pte_unmap(page_table);
-	return same;
-}
-
-/*
  * Do pte_mkwrite, but only if the vma says VM_WRITE.  We do this when
  * servicing faults for write access.  In the normal case, do always want
  * pte_mkwrite.  But get_user_pages can cause write faults for mappings
@@ -1982,6 +1957,60 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
 		copy_user_highpage(dst, src, va, vma);
 }
 
+static int pte_map_lock(struct mm_struct *mm, struct vm_area_struct *vma,
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		unsigned int seq, pte_t **ptep, spinlock_t **ptlp)
+{
+	pgd_t *pgd;
+	pud_t *pud;
+
+	if (!(flags & FAULT_FLAG_SPECULATIVE)) {
+		*ptep = pte_offset_map_lock(mm, pmd, address, ptlp);
+		return 1;
+	}
+
+again:
+	pin_page_tables();
+
+	pgd = pgd_offset(mm, address);
+	if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
+		goto out;
+
+	pud = pud_offset(pgd, address);
+	if (pud_none(*pud) || unlikely(pud_bad(*pud)))
+		goto out;
+
+	pmd = pmd_offset(pud, address);
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		goto out;
+
+	if (pmd_huge(*pmd))
+		goto out;
+
+	*ptlp = pte_lockptr(mm, pmd);
+	*ptep = pte_offset_map(pmd, address);
+	if (!spin_trylock(*ptlp)) {
+		pte_unmap(*ptep);
+		unpin_page_tables();
+		goto again;
+	}
+
+	if (!*ptep)
+		goto out;
+
+	if (vma && vma_is_dead(vma, seq))
+		goto unlock;
+
+	unpin_page_tables();
+	return 1;
+
+unlock:
+	pte_unmap_unlock(*ptep, *ptlp);
+out:
+	unpin_page_tables();
+	return 0;
+}
+
 /*
  * This routine handles present pages, when users try to write
  * to a shared page. It is done by copying the page to a new address
@@ -2002,7 +2031,8 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo
  */
 static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		spinlock_t *ptl, pte_t orig_pte)
+		spinlock_t *ptl, unsigned int flags, pte_t orig_pte,
+		unsigned int seq)
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
@@ -2034,8 +2064,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			page_cache_get(old_page);
 			pte_unmap_unlock(page_table, ptl);
 			lock_page(old_page);
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
+
+			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+						&page_table, &ptl)) {
+				unlock_page(old_page);
+				ret = VM_FAULT_RETRY;
+				goto err;
+			}
+
 			if (!pte_same(*page_table, orig_pte)) {
 				unlock_page(old_page);
 				page_cache_release(old_page);
@@ -2077,14 +2113,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			if (unlikely(tmp &
 					(VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
 				ret = tmp;
-				goto unwritable_page;
+				goto err;
 			}
 			if (unlikely(!(tmp & VM_FAULT_LOCKED))) {
 				lock_page(old_page);
 				if (!old_page->mapping) {
 					ret = 0; /* retry the fault */
 					unlock_page(old_page);
-					goto unwritable_page;
+					goto err;
 				}
 			} else
 				VM_BUG_ON(!PageLocked(old_page));
@@ -2095,8 +2131,13 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * they did, we just return, as we can count on the
 			 * MMU to tell us if they didn't also make it writable.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address,
-							 &ptl);
+			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+						&page_table, &ptl)) {
+				unlock_page(old_page);
+				ret = VM_FAULT_RETRY;
+				goto err;
+			}
+
 			if (!pte_same(*page_table, orig_pte)) {
 				unlock_page(old_page);
 				page_cache_release(old_page);
@@ -2128,17 +2169,23 @@ reuse:
 gotten:
 	pte_unmap_unlock(page_table, ptl);
 
-	if (unlikely(anon_vma_prepare(vma)))
-		goto oom;
+	if (unlikely(anon_vma_prepare(vma))) {
+		ret = VM_FAULT_OOM;
+		goto err;
+	}
 
 	if (is_zero_pfn(pte_pfn(orig_pte))) {
 		new_page = alloc_zeroed_user_highpage_movable(vma, address);
-		if (!new_page)
-			goto oom;
+		if (!new_page) {
+			ret = VM_FAULT_OOM;
+			goto err;
+		}
 	} else {
 		new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
-		if (!new_page)
-			goto oom;
+		if (!new_page) {
+			ret = VM_FAULT_OOM;
+			goto err;
+		}
 		cow_user_page(new_page, old_page, address, vma);
 	}
 	__SetPageUptodate(new_page);
@@ -2153,13 +2200,20 @@ gotten:
 		unlock_page(old_page);
 	}
 
-	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL))
-		goto oom_free_new;
+	if (mem_cgroup_newpage_charge(new_page, mm, GFP_KERNEL)) {
+		ret = VM_FAULT_OOM;
+		goto err_free_new;
+	}
 
 	/*
 	 * Re-check the pte - we dropped the lock
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		mem_cgroup_uncharge_page(new_page);
+		ret = VM_FAULT_RETRY;
+		goto err_free_new;
+	}
+
 	if (likely(pte_same(*page_table, orig_pte))) {
 		if (old_page) {
 			if (!PageAnon(old_page)) {
@@ -2258,9 +2312,9 @@ unlock:
 			file_update_time(vma->vm_file);
 	}
 	return ret;
-oom_free_new:
+err_free_new:
 	page_cache_release(new_page);
-oom:
+err:
 	if (old_page) {
 		if (page_mkwrite) {
 			unlock_page(old_page);
@@ -2268,10 +2322,6 @@ oom:
 		}
 		page_cache_release(old_page);
 	}
-	return VM_FAULT_OOM;
-
-unwritable_page:
-	page_cache_release(old_page);
 	return ret;
 }
 
@@ -2508,22 +2558,23 @@ int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		pte_t orig_pte, unsigned int seq)
 {
 	spinlock_t *ptl;
 	struct page *page;
 	swp_entry_t entry;
-	pte_t pte;
+	pte_t *page_table, pte;
 	struct mem_cgroup *ptr = NULL;
 	int ret = 0;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
-		goto out;
-
 	entry = pte_to_swp_entry(orig_pte);
 	if (unlikely(non_swap_entry(entry))) {
 		if (is_migration_entry(entry)) {
+			if (flags & FAULT_FLAG_SPECULATIVE) {
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
 			migration_entry_wait(mm, pmd, address);
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
@@ -2544,7 +2595,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			 * Back out if somebody else faulted in this pte
 			 * while we released the pte lock.
 			 */
-			page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+			if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+						&page_table, &ptl)) {
+				ret = VM_FAULT_RETRY;
+				goto out;
+			}
 			if (likely(pte_same(*page_table, orig_pte)))
 				ret = VM_FAULT_OOM;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
@@ -2581,7 +2636,11 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		ret = VM_FAULT_RETRY;
+		goto out_nolock;
+	}
+
 	if (unlikely(!pte_same(*page_table, orig_pte)))
 		goto out_nomap;
 
@@ -2622,7 +2681,8 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	unlock_page(page);
 
 	if (flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(mm, vma, address, page_table, pmd, ptl, pte);
+		ret |= do_wp_page(mm, vma, address, page_table, pmd,
+				ptl, flags, pte, seq);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
@@ -2635,8 +2695,9 @@ unlock:
 out:
 	return ret;
 out_nomap:
-	mem_cgroup_cancel_charge_swapin(ptr);
 	pte_unmap_unlock(page_table, ptl);
+out_nolock:
+	mem_cgroup_cancel_charge_swapin(ptr);
 out_page:
 	unlock_page(page);
 out_release:
@@ -2650,18 +2711,19 @@ out_release:
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags)
+		unsigned long address, pmd_t *pmd, unsigned int flags,
+		unsigned int seq)
 {
 	struct page *page;
 	spinlock_t *ptl;
-	pte_t entry;
+	pte_t entry, *page_table;
 
 	if (!(flags & FAULT_FLAG_WRITE)) {
 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
 						vma->vm_page_prot));
-		ptl = pte_lockptr(mm, pmd);
-		spin_lock(ptl);
+		if (!pte_map_lock(mm, vma, address, pmd, flags, seq,
+					&page_table, &ptl))
+			return VM_FAULT_RETRY;
 		if (!pte_none(*page_table))
 			goto unlock;
 		goto setpte;
@@ -2684,7 +2746,12 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (vma->vm_flags & VM_WRITE)
 		entry = pte_mkwrite(pte_mkdirty(entry));
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		mem_cgroup_uncharge_page(page);
+		page_cache_release(page);
+		return VM_FAULT_RETRY;
+	}
+
 	if (!pte_none(*page_table))
 		goto release;
 
@@ -2722,8 +2789,8 @@ oom:
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pmd_t *pmd,
-		pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd, pgoff_t pgoff,
+		unsigned int flags, pte_t orig_pte, unsigned int seq)
 {
 	pte_t *page_table;
 	spinlock_t *ptl;
@@ -2823,7 +2890,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	}
 
-	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &page_table, &ptl)) {
+		ret = VM_FAULT_RETRY;
+		goto out_uncharge;
+	}
 
 	/*
 	 * This silly early PAGE_DIRTY setting removes a race
@@ -2856,7 +2926,10 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, address, entry);
+		pte_unmap_unlock(page_table, ptl);
 	} else {
+		pte_unmap_unlock(page_table, ptl);
+out_uncharge:
 		if (charged)
 			mem_cgroup_uncharge_page(page);
 		if (anon)
@@ -2865,8 +2938,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 			anon = 1; /* no anon but release faulted_page */
 	}
 
-	pte_unmap_unlock(page_table, ptl);
-
 out:
 	if (dirty_page) {
 		struct address_space *mapping = page->mapping;
@@ -2900,14 +2971,13 @@ unwritable_page:
 }
 
 static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd,
+		unsigned int flags, pte_t orig_pte, unsigned int seq)
 {
 	pgoff_t pgoff = (((address & PAGE_MASK)
 			- vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
 
-	pte_unmap(page_table);
-	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
 }
 
 /*
@@ -2920,16 +2990,13 @@ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  * We return with mmap_sem still held, but pte unmapped and unlocked.
  */
 static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long address, pte_t *page_table, pmd_t *pmd,
-		unsigned int flags, pte_t orig_pte)
+		unsigned long address, pmd_t *pmd,
+		unsigned int flags, pte_t orig_pte, unsigned int seq)
 {
 	pgoff_t pgoff;
 
 	flags |= FAULT_FLAG_NONLINEAR;
 
-	if (!pte_unmap_same(mm, pmd, page_table, orig_pte))
-		return 0;
-
 	if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) {
 		/*
 		 * Page table corrupted: show pte and kill process.
@@ -2939,7 +3006,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	}
 
 	pgoff = pte_to_pgoff(orig_pte);
-	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+	return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte, seq);
 }
 
 /*
@@ -2957,37 +3024,38 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
  */
 static inline int handle_pte_fault(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long address,
-		pte_t *pte, pmd_t *pmd, unsigned int flags)
+		pte_t entry, pmd_t *pmd, unsigned int flags,
+		unsigned int seq)
 {
-	pte_t entry;
 	spinlock_t *ptl;
+	pte_t *pte;
 
-	entry = *pte;
 	if (!pte_present(entry)) {
 		if (pte_none(entry)) {
 			if (vma->vm_ops) {
 				if (likely(vma->vm_ops->fault))
 					return do_linear_fault(mm, vma, address,
-						pte, pmd, flags, entry);
+						pmd, flags, entry, seq);
 			}
 			return do_anonymous_page(mm, vma, address,
-						 pte, pmd, flags);
+						 pmd, flags, seq);
 		}
 		if (pte_file(entry))
 			return do_nonlinear_fault(mm, vma, address,
-					pte, pmd, flags, entry);
+					pmd, flags, entry, seq);
 		return do_swap_page(mm, vma, address,
-					pte, pmd, flags, entry);
+					pmd, flags, entry, seq);
 	}
 
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
+	if (!pte_map_lock(mm, vma, address, pmd, flags, seq, &pte, &ptl))
+		return VM_FAULT_RETRY;
 	if (unlikely(!pte_same(*pte, entry)))
 		goto unlock;
 	if (flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
+		if (!pte_write(entry)) {
 			return do_wp_page(mm, vma, address,
-					pte, pmd, ptl, entry);
+					pte, pmd, ptl, flags, entry, seq);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
@@ -3017,7 +3085,7 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
-	pte_t *pte;
+	pte_t *pte, entry;
 
 	__set_current_state(TASK_RUNNING);
 
@@ -3037,9 +3105,60 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!pte)
 		return VM_FAULT_OOM;
 
-	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+	entry = *pte;
+
+	pte_unmap(pte);
+
+	return handle_pte_fault(mm, vma, address, entry, pmd, flags, 0);
 }
 
+int handle_speculative_fault(struct mm_struct *mm, unsigned long address,
+		unsigned int flags)
+{
+	pmd_t *pmd = NULL;
+	pte_t *pte, entry;
+	spinlock_t *ptl;
+	struct vm_area_struct *vma;
+	unsigned int seq;
+	int ret = VM_FAULT_RETRY;
+	int dead;
+
+	__set_current_state(TASK_RUNNING);
+	flags |= FAULT_FLAG_SPECULATIVE;
+
+	count_vm_event(PGFAULT);
+
+	rcu_read_lock();
+	if (!pte_map_lock(mm, NULL, address, pmd, flags, 0, &pte, &ptl))
+		goto out_unlock;
+
+	vma = find_vma(mm, address);
+	if (!(vma && vma->vm_end > address && vma->vm_start <= address))
+		goto out_unmap;
+
+	dead = RB_EMPTY_NODE(&vma->vm_rb);
+	seq = vma->vm_sequence.sequence;
+	smp_rmb();
+
+	if (dead || seq & 1)
+		goto out_unmap;
+
+	entry = *pte;
+
+	pte_unmap_unlock(pte, ptl);
+
+	ret = handle_pte_fault(mm, vma, address, entry, pmd, flags, seq);
+
+out_unlock:
+	rcu_read_unlock();
+	return ret;
+
+out_unmap:
+	pte_unmap_unlock(pte, ptl);
+	goto out_unlock;
+}
+
+
 #ifndef __PAGETABLE_PUD_FOLDED
 /*
  * Allocate page upper directory.
diff --git a/mm/mmap.c b/mm/mmap.c
index ee22989..da7d34c 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -222,6 +222,19 @@ void unlink_file_vma(struct vm_area_struct *vma)
 	}
 }
 
+static void free_vma_rcu(struct rcu_head *head)
+{
+	struct vm_area_struct *vma =
+		container_of(head, struct vm_area_struct, vm_rcu_head);
+
+	kmem_cache_free(vm_area_cachep, vma);
+}
+
+static void free_vma(struct vm_area_struct *vma)
+{
+	call_rcu(&vma->vm_rcu_head, free_vma_rcu);
+}
+
 /*
  * Close a vm structure and free it, returning the next.
  */
@@ -238,7 +251,7 @@ static struct vm_area_struct *remove_vma(struct vm_area_struct *vma)
 			removed_exe_file_vma(vma->vm_mm);
 	}
 	mpol_put(vma_policy(vma));
-	kmem_cache_free(vm_area_cachep, vma);
+	free_vma(vma);
 	return next;
 }
 
@@ -488,6 +501,8 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	prev->vm_next = vma->vm_next;
 	rb_erase(&vma->vm_rb, &mm->mm_rb);
+	smp_wmb();
+	RB_CLEAR_NODE(&vma->vm_rb);
 	if (mm->mmap_cache == vma)
 		mm->mmap_cache = prev;
 }
@@ -512,6 +527,10 @@ void vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	long adjust_next = 0;
 	int remove_next = 0;
 
+	write_seqcount_begin(&vma->vm_sequence);
+	if (next)
+		write_seqcount_begin(&next->vm_sequence);
+
 	if (next && !insert) {
 		if (end >= next->vm_end) {
 			/*
@@ -640,18 +659,24 @@ again:			remove_next = 1 + (end > next->vm_end);
 		}
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		kmem_cache_free(vm_area_cachep, next);
+		free_vma(next);
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
 		 * we must remove another next too. It would clutter
 		 * up the code too much to do both in one go.
 		 */
 		if (remove_next == 2) {
+			write_seqcount_end(&next->vm_sequence);
 			next = vma->vm_next;
+			write_seqcount_begin(&next->vm_sequence);
 			goto again;
 		}
 	}
 
+	if (next)
+		write_seqcount_end(&next->vm_sequence);
+	write_seqcount_end(&vma->vm_sequence);
+
 	validate_mm(mm);
 }
 
@@ -1848,6 +1873,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	do {
 		rb_erase(&vma->vm_rb, &mm->mm_rb);
+		smp_wmb();
+		RB_CLEAR_NODE(&vma->vm_rb);
 		mm->map_count--;
 		tail_vma = vma;
 		vma = vma->vm_next;
diff --git a/mm/util.c b/mm/util.c
index 7c35ad9..ba5350c 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -253,8 +253,8 @@ void arch_pick_mmap_layout(struct mm_struct *mm)
  * callers need to carefully consider what to use. On many architectures,
  * get_user_pages_fast simply falls back to get_user_pages.
  */
-int __attribute__((weak)) get_user_pages_fast(unsigned long start,
-				int nr_pages, int write, struct page **pages)
+int __weak get_user_pages_fast(unsigned long start,
+			       int nr_pages, int write, struct page **pages)
 {
 	struct mm_struct *mm = current->mm;
 	int ret;
@@ -268,6 +268,14 @@ int __attribute__((weak)) get_user_pages_fast(unsigned long start,
 }
 EXPORT_SYMBOL_GPL(get_user_pages_fast);
 
+void __weak pin_page_tables(void)
+{
+}
+
+void __weak unpin_page_tables(void)
+{
+}
+
 /* Tracepoints definitions. */
 EXPORT_TRACEPOINT_SYMBOL(kmalloc);
 EXPORT_TRACEPOINT_SYMBOL(kmem_cache_alloc);



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2010-01-04  7:53                 ` Peter Zijlstra
@ 2010-01-04 15:55                   ` Paul E. McKenney
  2010-01-04 16:02                     ` Peter Zijlstra
  0 siblings, 1 reply; 32+ messages in thread
From: Paul E. McKenney @ 2010-01-04 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Mon, Jan 04, 2010 at 08:53:23AM +0100, Peter Zijlstra wrote:
> On Sun, 2010-01-03 at 19:02 -0800, Paul E. McKenney wrote:
> > It would not be all that hard for me to make a call_srcu(), but...
> > 
> > 1.      How are you avoiding OOM by SRCU callback?  (I am sure you
> >         have this worked out, but I do have to ask!)
> 
> Well, I was thinking srcu to have this force quiescent state in
> call_srcu() much like you did for the preemptible rcu.

Ah, so the idea would be that you register a function with the srcu_struct
that is invoked when some readers are stuck for too long in their SRCU
read-side critical sections?  Presumably you also supply a time value for
"too long" as well.  Hmmm...  What do you do, cancel the corresponding
I/O or something?

This would not be hard once I get SRCU folded into the treercu
infrastructure.  However, at the moment, SRCU has no way of tracking
which tasks are holding things up.  So not 2.6.34 material, but definitely
doable longer term.

> Alternatively we could actively throttle the call_srcu() call when we've
> got too much pending work.

This could be done with the existing SRCU implementation, for example as
a call to a function you registered.  In theory, it would be possible
to make call_srcu() refuse to enqueue a callback when there were too
many, but that really badly violates the spirit of the call_rcu() family
of functions.
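
As a rough sketch, a throttling call_srcu() along those lines might look
like the following. This is entirely hypothetical: neither call_srcu() nor
the callback list and pending counter exist in the current srcu_struct, and
blocking in synchronize_srcu() of course restricts callers to sleepable
context.

#define SRCU_CB_THRESH	10000	/* assumed threshold */

void call_srcu(struct srcu_struct *sp, struct rcu_head *head,
	       void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = NULL;

	spin_lock(&sp->cb_lock);	/* assumed field */
	*sp->cb_tail = head;		/* assumed field */
	sp->cb_tail = &head->next;
	sp->cb_pending++;		/* assumed field */
	spin_unlock(&sp->cb_lock);

	/*
	 * Throttle: force a grace period once the backlog gets too deep,
	 * rather than refusing to enqueue the callback.
	 */
	if (sp->cb_pending > SRCU_CB_THRESH)
		synchronize_srcu(sp);
}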

> > 2.      How many srcu_struct data structures are you envisioning?
> >         One globally?  One per process?  One per struct vma?
> >         (Not necessary to know this for call_srcu(), but will be needed
> >         as I work out how to make SRCU scale with large numbers of CPUs.)
> 
> For this patch in particular, one global one, covering all vmas.

Whew!!!  ;-)

Then it would still be feasible for the CPU-hotplug code to scan all
SRCU structures, if need be.  (The reason that SRCU gets away without
worrying about CPU hotplug now is that it doesn't keep track of which
tasks are in read-side critical sections.)

> One reason to keep the vma RCU domain separate from other RCU objects is
> that these VMA thingies can have rather long quiescent periods due to
> this sleep stuff. So mixing that in with other RCU users which have much
> better defined periods will just degrade everything bringing that OOM
> scenario much closer.

Fair enough!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2010-01-04 15:55                   ` Paul E. McKenney
@ 2010-01-04 16:02                     ` Peter Zijlstra
  2010-01-04 16:56                       ` Paul E. McKenney
  0 siblings, 1 reply; 32+ messages in thread
From: Peter Zijlstra @ 2010-01-04 16:02 UTC (permalink / raw)
  To: paulmck
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Mon, 2010-01-04 at 07:55 -0800, Paul E. McKenney wrote:
> > Well, I was thinking srcu to have this force quiescent state in
> > call_srcu() much like you did for the preemptible rcu.
> 
> Ah, so the idea would be that you register a function with the srcu_struct
> that is invoked when some readers are stuck for too long in their SRCU
> read-side critical sections?  Presumably you also supply a time value for
> "too long" as well.  Hmmm...  What do you do, cancel the corresponding
> I/O or something? 

Hmm, I was more thinking along the lines of:

say IDX is the current counter idx.

if (pending > thresh) {
  flush(!IDX);
  force_flip_counter();
}

Since we explicitly hold a reference on IDX, we can actually wait for
!IDX to reach 0 and flush those callbacks.

We then force-flip the counter, so that even if all callbacks (or the
majority) were not for !IDX but part of IDX, we'd be able to flush them
on the next call_srcu() because that will then hold a ref on the new
counter index.


Or am I missing something obvious?
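
A hedged sketch of the above, in case the fragment is hard to read on its
own.  It assumes a two-sided pending-callback list keyed by the counter
index; call_srcu() did not exist at the time, and srcu_flush_side() /
srcu_force_flip() below merely stand in for the flush(!IDX) and
force_flip_counter() steps:

#include <linux/list.h>

#define SRCU_CB_THRESH	1000		/* arbitrary threshold for the sketch */

struct srcu_cb {
	struct list_head node;
	void (*func)(struct srcu_cb *cb);
};

struct srcu_cb_state {
	int		 idx;		/* current counter index, "IDX" */
	unsigned long	 nr_pending;
	struct list_head pending[2];	/* callbacks queued against each side */
};

static void srcu_flush_side(struct srcu_cb_state *st, int side)
{
	/* body omitted: wait for readers of 'side' to drain to zero, then
	 * run and drop everything on st->pending[side], decrementing
	 * st->nr_pending accordingly */
}

static void srcu_force_flip(struct srcu_cb_state *st)
{
	/* the real version would also rotate the per-cpu reader counters */
	st->idx = !st->idx;
}

/* Hypothetical call_srcu() with the throttling described above. */
static void call_srcu_sketch(struct srcu_cb_state *st, struct srcu_cb *cb,
			     void (*func)(struct srcu_cb *))
{
	if (st->nr_pending > SRCU_CB_THRESH) {
		/* New work, including this callback, references IDX, so
		 * readers of !IDX can be waited out and its callbacks run. */
		srcu_flush_side(st, !st->idx);
		/* Force-flip so callbacks queued against the old IDX become
		 * flushable by the next call_srcu(). */
		srcu_force_flip(st);
	}

	cb->func = func;
	list_add_tail(&cb->node, &st->pending[st->idx]);
	st->nr_pending++;
}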


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH] asynchronous page fault.
  2010-01-04 16:02                     ` Peter Zijlstra
@ 2010-01-04 16:56                       ` Paul E. McKenney
  0 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2010-01-04 16:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: KAMEZAWA Hiroyuki, linux-kernel, linux-mm, minchan.kim, cl,
	hugh.dickins, Nick Piggin, Ingo Molnar, Linus Torvalds

On Mon, Jan 04, 2010 at 05:02:54PM +0100, Peter Zijlstra wrote:
> On Mon, 2010-01-04 at 07:55 -0800, Paul E. McKenney wrote:
> > > Well, I was thinking srcu to have this force quiescent state in
> > > call_srcu() much like you did for the preemptible rcu.
> > 
> > Ah, so the idea would be that you register a function with the srcu_struct
> > that is invoked when some readers are stuck for too long in their SRCU
> > read-side critical sections?  Presumably you also supply a time value for
> > "too long" as well.  Hmmm...  What do you do, cancel the corresponding
> > I/O or something? 
> 
> Hmm, I was more thinking along the lines of:
> 
> say IDX is the current counter idx.
> 
> if (pending > thresh) {
>   flush(!IDX);

This flushes pending I/Os?

>   force_flip_counter();

If this is internal to SRCU, what it would do is check for CPUs being
offline or in dyntick-idle state.  Or was your thought that this is
where I invoke callbacks into your code to do whatever can be done to
wake up the sleeping readers?

> }
> 
> Since we explicitly hold a reference on IDX, we can actually wait for
> !IDX to reach 0 and flush those callbacks.

One other thing -- if I merge SRCU into the tree-based infrastructure,
I should be able to eliminate the need for srcu_read_lock() to return
the index (and thus for srcu_read_unlock() to take it as an argument).
So the index would be strictly internal, as it currently is with the
other flavors of RCU.
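
For reference, the read-side calling convention being discussed is real API;
only the index-less form in the comment is hypothetical.  The vma_srcu name
is made up for the example and would need init_srcu_struct() somewhere:

#include <linux/srcu.h>

static struct srcu_struct vma_srcu;	/* initialised elsewhere */

static void srcu_reader_example(void)
{
	int idx;

	idx = srcu_read_lock(&vma_srcu);	/* existing API returns the side */
	/* ... the speculative vma lookup would run here ... */
	srcu_read_unlock(&vma_srcu, idx);	/* and must be handed it back */
}

/*
 * What folding SRCU into the tree infrastructure would allow, with the
 * index kept internal (hypothetical at the time of this thread):
 *
 *	srcu_read_lock(&vma_srcu);
 *	...
 *	srcu_read_unlock(&vma_srcu);
 */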

> We then force-flip the counter, so that even if all callbacks (or the
> majority) were not for !IDX but part of IDX, we'd be able to flush them
> on the next call_srcu() because that will then hold a ref on the new
> counter index.

We can certainly defer callbacks to a later grace period.  What we cannot
do is advance the counter until all readers for the current grace period
have exited their SRCU read-side critical sections.

> Or am I missing something obvious?

Or maybe I am.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2010-01-04 16:57 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-12-25  1:51 [RFC PATCH] asynchronous page fault KAMEZAWA Hiroyuki
2009-12-27  9:47 ` Minchan Kim
2009-12-27 23:59   ` KAMEZAWA Hiroyuki
2009-12-27 11:19 ` Peter Zijlstra
2009-12-28  0:00   ` KAMEZAWA Hiroyuki
2009-12-28  0:57   ` Balbir Singh
2009-12-28  1:05     ` KAMEZAWA Hiroyuki
2009-12-28  2:58       ` Balbir Singh
2009-12-28  3:13         ` KAMEZAWA Hiroyuki
2009-12-28  8:34         ` Peter Zijlstra
2009-12-28  8:32     ` Peter Zijlstra
2009-12-29  9:54       ` Balbir Singh
2009-12-27 12:03 ` Peter Zijlstra
2009-12-28  0:36   ` KAMEZAWA Hiroyuki
2009-12-28  1:19     ` KAMEZAWA Hiroyuki
2009-12-28  8:30     ` Peter Zijlstra
2009-12-28  9:58       ` KAMEZAWA Hiroyuki
2009-12-28 10:30         ` Peter Zijlstra
2009-12-28 10:40           ` Peter Zijlstra
2010-01-02 16:14             ` Peter Zijlstra
2010-01-04  3:02               ` Paul E. McKenney
2010-01-04  7:53                 ` Peter Zijlstra
2010-01-04 15:55                   ` Paul E. McKenney
2010-01-04 16:02                     ` Peter Zijlstra
2010-01-04 16:56                       ` Paul E. McKenney
2010-01-04 13:48               ` [RFC PATCH -v2] speculative " Peter Zijlstra
2009-12-28 10:57           ` [RFC PATCH] asynchronous " KAMEZAWA Hiroyuki
2009-12-28 11:06             ` Peter Zijlstra
2009-12-28  8:55     ` Peter Zijlstra
2009-12-28 10:08       ` KAMEZAWA Hiroyuki
2009-12-28 11:43     ` Peter Zijlstra
2010-01-02 21:45 ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).