linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/14] Concurrent Page Cache
@ 2007-01-28 13:13 Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 01/14] radix-tree: use indirect bit Peter Zijlstra
                   ` (14 more replies)
  0 siblings, 15 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

With Nick leading the way on getting rid of the read side of the
tree_lock, this work continues by breaking up the write side of that
lock as well.

Aside from breaking MTD, this version of the concurrent page cache
seems rock solid on my dual-core x86_64 box.




* [PATCH 01/14] radix-tree: use indirect bit
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 02/14] radix-tree: gang_lookup_slot Peter Zijlstra
                   ` (13 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra, Nick Piggin

[-- Attachment #1: radix-tree-use-indirect-bit.patch --]
[-- Type: text/plain, Size: 9982 bytes --]

From: Nick Piggin <npiggin@suse.de>

Rather than tag direct radix-tree pointers (root->rnode pointing
straight at a data item) with a special bit, tag the indirect pointer
that hangs off the root. This means that, for a lookup_slot operation,
an invalid result can be distinguished from a valid one (previously,
valid results could have the bit either set or clear).

This does not affect slot lookups which occur under lock -- they can
never return an invalid result. It is needed in future for the
lockless pagecache.
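
For illustration only (not part of the patch): a minimal sketch of how
an RCU-side caller is expected to consume slots once the following
patches make the lookup itself RCU-safe. The helper name is made up;
RADIX_TREE_RETRY and radix_tree_deref_slot() are the ones introduced
here.

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

static void *lookup_item_rcu(struct radix_tree_root *root, unsigned long index)
{
	void **slot;
	void *item;

	rcu_read_lock();
repeat:
	item = NULL;
	slot = radix_tree_lookup_slot(root, index);
	if (slot) {
		item = radix_tree_deref_slot(slot);
		/* A concurrent shrink left an indirect pointer here: retry. */
		if (item == RADIX_TREE_RETRY)
			goto repeat;
	}
	/*
	 * A real user (e.g. the pagecache) must pin the item before
	 * dropping rcu_read_lock; see the speculative get patch.
	 */
	rcu_read_unlock();
	return item;
}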

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/radix-tree.h |   40 ++++++++++++++------------
 lib/radix-tree.c           |   69 ++++++++++++++++++++++++++++-----------------
 2 files changed, 65 insertions(+), 44 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h	2007-01-22 20:09:50.000000000 +0100
+++ linux-2.6/include/linux/radix-tree.h	2007-01-22 20:09:53.000000000 +0100
@@ -26,28 +26,31 @@
 #include <linux/rcupdate.h>
 
 /*
- * A direct pointer (root->rnode pointing directly to a data item,
- * rather than another radix_tree_node) is signalled by the low bit
- * set in the root->rnode pointer.
- *
- * In this case root->height is also NULL, but the direct pointer tests are
- * needed for RCU lookups when root->height is unreliable.
+ * An indirect pointer (root->rnode pointing to a radix_tree_node, rather
+ * than a data item) is signalled by the low bit set in the root->rnode
+ * pointer.
+ *
+ * In this case root->height is > 0, but the indirect pointer tests are
+ * needed for RCU lookups (because root->height is unreliable). The only
+ * time callers need worry about this is when doing a lookup_slot under
+ * RCU.
  */
-#define RADIX_TREE_DIRECT_PTR	1
+#define RADIX_TREE_INDIRECT_PTR	1
+#define RADIX_TREE_RETRY ((void *)-1UL)
 
-static inline void *radix_tree_ptr_to_direct(void *ptr)
+static inline void *radix_tree_ptr_to_indirect(void *ptr)
 {
-	return (void *)((unsigned long)ptr | RADIX_TREE_DIRECT_PTR);
+	return (void *)((unsigned long)ptr | RADIX_TREE_INDIRECT_PTR);
 }
 
-static inline void *radix_tree_direct_to_ptr(void *ptr)
+static inline void *radix_tree_indirect_to_ptr(void *ptr)
 {
-	return (void *)((unsigned long)ptr & ~RADIX_TREE_DIRECT_PTR);
+	return (void *)((unsigned long)ptr & ~RADIX_TREE_INDIRECT_PTR);
 }
 
-static inline int radix_tree_is_direct_ptr(void *ptr)
+static inline int radix_tree_is_indirect_ptr(void *ptr)
 {
-	return (int)((unsigned long)ptr & RADIX_TREE_DIRECT_PTR);
+	return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
 }
 
 /*** radix-tree API starts here ***/
@@ -130,7 +133,10 @@ do {									\
  */
 static inline void *radix_tree_deref_slot(void **pslot)
 {
-	return radix_tree_direct_to_ptr(*pslot);
+	void *ret = *pslot;
+	if (unlikely(radix_tree_is_indirect_ptr(ret)))
+		ret = RADIX_TREE_RETRY;
+	return ret;
 }
 /**
  * radix_tree_replace_slot	- replace item in a slot
@@ -142,10 +148,8 @@ static inline void *radix_tree_deref_slo
  */
 static inline void radix_tree_replace_slot(void **pslot, void *item)
 {
-	BUG_ON(radix_tree_is_direct_ptr(item));
-	rcu_assign_pointer(*pslot,
-		(void *)((unsigned long)item |
-			((unsigned long)*pslot & RADIX_TREE_DIRECT_PTR)));
+	BUG_ON(radix_tree_is_indirect_ptr(item));
+	rcu_assign_pointer(*pslot, item);
 }
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2007-01-22 20:09:50.000000000 +0100
+++ linux-2.6/lib/radix-tree.c	2007-01-22 20:09:53.000000000 +0100
@@ -104,7 +104,7 @@ radix_tree_node_alloc(struct radix_tree_
 			rtp->nr--;
 		}
 	}
-	BUG_ON(radix_tree_is_direct_ptr(ret));
+	BUG_ON(radix_tree_is_indirect_ptr(ret));
 	return ret;
 }
 
@@ -239,7 +239,7 @@ static int radix_tree_extend(struct radi
 			return -ENOMEM;
 
 		/* Increase the height.  */
-		node->slots[0] = radix_tree_direct_to_ptr(root->rnode);
+		node->slots[0] = radix_tree_indirect_to_ptr(root->rnode);
 
 		/* Propagate the aggregated tag info into the new root */
 		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
@@ -250,6 +250,7 @@ static int radix_tree_extend(struct radi
 		newheight = root->height+1;
 		node->height = newheight;
 		node->count = 1;
+		node = radix_tree_ptr_to_indirect(node);
 		rcu_assign_pointer(root->rnode, node);
 		root->height = newheight;
 	} while (height > root->height);
@@ -273,7 +274,7 @@ int radix_tree_insert(struct radix_tree_
 	int offset;
 	int error;
 
-	BUG_ON(radix_tree_is_direct_ptr(item));
+	BUG_ON(radix_tree_is_indirect_ptr(item));
 
 	/* Make sure the tree is high enough.  */
 	if (index > radix_tree_maxindex(root->height)) {
@@ -282,7 +283,8 @@ int radix_tree_insert(struct radix_tree_
 			return error;
 	}
 
-	slot = root->rnode;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
+
 	height = root->height;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
@@ -297,7 +299,8 @@ int radix_tree_insert(struct radix_tree_
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
 			} else
-				rcu_assign_pointer(root->rnode, slot);
+				rcu_assign_pointer(root->rnode,
+					radix_tree_ptr_to_indirect(slot));
 		}
 
 		/* Go a level down */
@@ -317,7 +320,7 @@ int radix_tree_insert(struct radix_tree_
 		BUG_ON(tag_get(node, 0, offset));
 		BUG_ON(tag_get(node, 1, offset));
 	} else {
-		rcu_assign_pointer(root->rnode, radix_tree_ptr_to_direct(item));
+		rcu_assign_pointer(root->rnode, item);
 		BUG_ON(root_tag_get(root, 0));
 		BUG_ON(root_tag_get(root, 1));
 	}
@@ -349,11 +352,12 @@ void **radix_tree_lookup_slot(struct rad
 	if (node == NULL)
 		return NULL;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (index > 0)
 			return NULL;
 		return (void **)&root->rnode;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -397,11 +401,12 @@ void *radix_tree_lookup(struct radix_tre
 	if (node == NULL)
 		return NULL;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (index > 0)
 			return NULL;
-		return radix_tree_direct_to_ptr(node);
+		return node;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -446,7 +451,7 @@ void *radix_tree_tag_set(struct radix_tr
 	height = root->height;
 	BUG_ON(index > radix_tree_maxindex(height));
 
-	slot = root->rnode;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 
 	while (height > 0) {
@@ -496,7 +501,7 @@ void *radix_tree_tag_clear(struct radix_
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
-	slot = root->rnode;
+	slot = radix_tree_indirect_to_ptr(root->rnode);
 
 	while (height > 0) {
 		int offset;
@@ -561,8 +566,9 @@ int radix_tree_tag_get(struct radix_tree
 	if (node == NULL)
 		return 0;
 
-	if (radix_tree_is_direct_ptr(node))
+	if (!radix_tree_is_indirect_ptr(node))
 		return (index == 0);
+	node = radix_tree_indirect_to_ptr(node);
 
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
@@ -679,13 +685,13 @@ radix_tree_gang_lookup(struct radix_tree
 	if (!node)
 		return 0;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		node = radix_tree_direct_to_ptr(node);
-		results[0] = rcu_dereference(node);
+		results[0] = node;
 		return 1;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -807,13 +813,13 @@ radix_tree_gang_lookup_tag(struct radix_
 	if (!node)
 		return 0;
 
-	if (radix_tree_is_direct_ptr(node)) {
+	if (!radix_tree_is_indirect_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		node = radix_tree_direct_to_ptr(node);
-		results[0] = rcu_dereference(node);
+		results[0] = node;
 		return 1;
 	}
+	node = radix_tree_indirect_to_ptr(node);
 
 	max_index = radix_tree_maxindex(node->height);
 
@@ -843,12 +849,22 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_tag
 static inline void radix_tree_shrink(struct radix_tree_root *root)
 {
 	/* try to shrink tree height */
-	while (root->height > 0 &&
-			root->rnode->count == 1 &&
-			root->rnode->slots[0]) {
+	while (root->height > 0) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
 
+		BUG_ON(!radix_tree_is_indirect_ptr(to_free));
+		to_free = radix_tree_indirect_to_ptr(to_free);
+
+		/*
+		 * If the candidate node has more than one child, or its only
+		 * child is not at the leftmost slot, we cannot shrink.
+		 */
+		if (to_free->count != 1)
+			break;
+		if (!to_free->slots[0])
+			break;
+
 		/*
 		 * We don't need rcu_assign_pointer(), since we are simply
 		 * moving the node from one part of the tree to another. If
@@ -857,8 +873,8 @@ static inline void radix_tree_shrink(str
 		 * one (root->rnode).
 		 */
 		newptr = to_free->slots[0];
-		if (root->height == 1)
-			newptr = radix_tree_ptr_to_direct(newptr);
+		if (root->height > 1)
+			newptr = radix_tree_ptr_to_indirect(newptr);
 		root->rnode = newptr;
 		root->height--;
 		/* must only free zeroed nodes into the slab */
@@ -893,12 +909,12 @@ void *radix_tree_delete(struct radix_tre
 		goto out;
 
 	slot = root->rnode;
-	if (height == 0 && root->rnode) {
-		slot = radix_tree_direct_to_ptr(slot);
+	if (height == 0 /* XXX: bugfix? */) {
 		root_tag_clear_all(root);
 		root->rnode = NULL;
 		goto out;
 	}
+	slot = radix_tree_indirect_to_ptr(slot);
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 	pathp->node = NULL;
@@ -940,7 +956,8 @@ void *radix_tree_delete(struct radix_tre
 			radix_tree_node_free(to_free);
 
 		if (pathp->node->count) {
-			if (pathp->node == root->rnode)
+			if (pathp->node ==
+					radix_tree_indirect_to_ptr(root->rnode))
 				radix_tree_shrink(root);
 			goto out;
 		}

--



* [PATCH 02/14] radix-tree: gang_lookup_slot
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 01/14] radix-tree: use indirect bit Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 03/14] radix-tree: gang_lookup_tag_slot Peter Zijlstra
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra, Nick Piggin

[-- Attachment #1: radix-tree-gang_lookup_slot.patch --]
[-- Type: text/plain, Size: 6189 bytes --]

From: Nick Piggin <npiggin@suse.de>

Introduce a gang_lookup_slot() function, which is used by the lockless
pagecache.
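
A hedged caller-side sketch (not in the patch; the function and the
batch size are made up): slots returned by radix_tree_gang_lookup_slot()
must be re-dereferenced with radix_tree_deref_slot() and the lookup
retried if that returns RADIX_TREE_RETRY.

#include <linux/radix-tree.h>
#include <linux/rcupdate.h>

static unsigned int collect_items_rcu(struct radix_tree_root *root,
				      unsigned long start, void **items,
				      unsigned int max)
{
	void **slots[16];
	unsigned int nr, i, ret;

	if (max > 16)
		max = 16;

	rcu_read_lock();
restart:
	ret = 0;
	nr = radix_tree_gang_lookup_slot(root, slots, start, max);
	for (i = 0; i < nr; i++) {
		void *item = radix_tree_deref_slot(slots[i]);

		if (!item)			/* slot emptied after lookup */
			continue;
		if (item == RADIX_TREE_RETRY)	/* tree shrunk under us */
			goto restart;
		items[ret++] = item;		/* real users pin the item here */
	}
	rcu_read_unlock();
	return ret;
}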

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/radix-tree.h |    7 +++
 lib/radix-tree.c           |   86 +++++++++++++++++++++++++++++++++++++++------
 2 files changed, 82 insertions(+), 11 deletions(-)

Index: linux-2.6-git2/include/linux/radix-tree.h
===================================================================
--- linux-2.6-git2.orig/include/linux/radix-tree.h	2006-12-20 22:26:03.000000000 +0100
+++ linux-2.6-git2/include/linux/radix-tree.h	2006-12-20 22:26:20.000000000 +0100
@@ -99,12 +99,14 @@ do {									\
  *
  * The notable exceptions to this rule are the following functions:
  * radix_tree_lookup
+ * radix_tree_lookup_slot
  * radix_tree_tag_get
  * radix_tree_gang_lookup
+ * radix_tree_gang_lookup_slot
  * radix_tree_gang_lookup_tag
  * radix_tree_tagged
  *
- * The first 4 functions are able to be called locklessly, using RCU. The
+ * The first 6 functions are able to be called locklessly, using RCU. The
  * caller must ensure calls to these functions are made within rcu_read_lock()
  * regions. Other readers (lock-free or otherwise) and modifications may be
  * running concurrently.
@@ -159,6 +161,9 @@ void *radix_tree_delete(struct radix_tre
 unsigned int
 radix_tree_gang_lookup(struct radix_tree_root *root, void **results,
 			unsigned long first_index, unsigned int max_items);
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items);
 int radix_tree_preload(gfp_t gfp_mask);
 void radix_tree_init(void);
 void *radix_tree_tag_set(struct radix_tree_root *root,
Index: linux-2.6-git2/lib/radix-tree.c
===================================================================
--- linux-2.6-git2.orig/lib/radix-tree.c	2006-12-20 22:26:03.000000000 +0100
+++ linux-2.6-git2/lib/radix-tree.c	2006-12-20 22:26:07.000000000 +0100
@@ -337,18 +337,17 @@ EXPORT_SYMBOL(radix_tree_insert);
  *	Returns:  the slot corresponding to the position @index in the
  *	radix tree @root. This is useful for update-if-exists operations.
  *
- *	This function cannot be called under rcu_read_lock, it must be
- *	excluded from writers, as must the returned slot for subsequent
- *	use by radix_tree_deref_slot() and radix_tree_replace slot.
- *	Caller must hold tree write locked across slot lookup and
- *	replace.
+ *	This function can be called under rcu_read_lock iff the slot is not
+ *	modified by radix_tree_replace_slot, otherwise it must be called
+ *	exclusive from other writers. Any dereference of the slot must be done
+ *	using radix_tree_deref_slot.
  */
 void **radix_tree_lookup_slot(struct radix_tree_root *root, unsigned long index)
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
 
-	node = root->rnode;
+	node = rcu_dereference(root->rnode);
 	if (node == NULL)
 		return NULL;
 
@@ -368,7 +367,7 @@ void **radix_tree_lookup_slot(struct rad
 	do {
 		slot = (struct radix_tree_node **)
 			(node->slots + ((index>>shift) & RADIX_TREE_MAP_MASK));
-		node = *slot;
+		node = rcu_dereference(*slot);
 		if (node == NULL)
 			return NULL;
 
@@ -605,7 +604,7 @@ EXPORT_SYMBOL(radix_tree_tag_get);
 #endif
 
 static unsigned int
-__lookup(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup(struct radix_tree_node *slot, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index)
 {
 	unsigned int nr_found = 0;
@@ -643,7 +642,7 @@ __lookup(struct radix_tree_node *slot, v
 		index++;
 		node = slot->slots[i];
 		if (node) {
-			results[nr_found++] = rcu_dereference(node);
+			results[nr_found++] = &(slot->slots[i]);
 			if (nr_found == max_items)
 				goto out;
 		}
@@ -697,6 +696,73 @@ radix_tree_gang_lookup(struct radix_tree
 
 	ret = 0;
 	while (ret < max_items) {
+		unsigned int nr_found, i, j;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		nr_found = __lookup(node, (void ***)results + ret, cur_index,
+					max_items - ret, &next_index);
+		for (i = j = 0; i < nr_found; i++) {
+			struct radix_tree_node *slot;
+			slot = rcu_dereference(*(((void ***)results)[ret + i]));
+			if (!slot)
+				continue;
+			results[ret + j] = slot;
+			j++;
+		}
+		ret += j;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup);
+
+/**
+ *	radix_tree_gang_lookup_slot - perform multiple slot lookup on radix tree
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *
+ *	Performs an index-ascending scan of the tree for present items.  Places
+ *	their slots at *@results and returns the number of items which were
+ *	placed at *@results.
+ *
+ *	The implementation is naive.
+ *
+ *	Like radix_tree_gang_lookup as far as RCU and locking goes. Slots must
+ *	be dereferenced with radix_tree_deref_slot, and if using only RCU
+ *	protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_slot(struct radix_tree_root *root, void ***results,
+			unsigned long first_index, unsigned int max_items)
+{
+	unsigned long max_index;
+	struct radix_tree_node *node;
+	unsigned long cur_index = first_index;
+	unsigned int ret;
+
+	node = rcu_dereference(root->rnode);
+	if (!node)
+		return 0;
+
+	if (!radix_tree_is_indirect_ptr(node)) {
+		if (first_index > 0)
+			return 0;
+		results[0] = (void **)&root->rnode;
+		return 1;
+	}
+	node = radix_tree_indirect_to_ptr(node);
+
+	max_index = radix_tree_maxindex(node->height);
+
+	ret = 0;
+	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
@@ -712,7 +778,7 @@ radix_tree_gang_lookup(struct radix_tree
 
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup);
+EXPORT_SYMBOL(radix_tree_gang_lookup_slot);
 
 /*
  * FIXME: the two tag_get()s here should use find_next_bit() instead of

--



* [PATCH 03/14] radix-tree: gang_lookup_tag_slot
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 01/14] radix-tree: use indirect bit Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 02/14] radix-tree: gang_lookup_slot Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 04/14] mm: speculative get page Peter Zijlstra
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: radix-tree-gang_lookup_tag_slot.patch --]
[-- Type: text/plain, Size: 4820 bytes --]

A simple implementation of radix_tree_gang_lookup_tag_slot(), the
tagged counterpart of radix_tree_gang_lookup_slot().

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/radix-tree.h |    5 ++
 lib/radix-tree.c           |   81 ++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 82 insertions(+), 4 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h	2007-01-22 20:10:00.000000000 +0100
+++ linux-2.6/include/linux/radix-tree.h	2007-01-22 20:10:02.000000000 +0100
@@ -104,6 +104,7 @@ do {									\
  * radix_tree_gang_lookup
  * radix_tree_gang_lookup_slot
  * radix_tree_gang_lookup_tag
+ * radix_tree_gang_lookup_tag_slot
  * radix_tree_tagged
  *
  * The first 6 functions are able to be called locklessly, using RCU. The
@@ -176,6 +177,10 @@ unsigned int
 radix_tree_gang_lookup_tag(struct radix_tree_root *root, void **results,
 		unsigned long first_index, unsigned int max_items,
 		unsigned int tag);
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+		unsigned long first_index, unsigned int max_items,
+		unsigned int tag);
 int radix_tree_tagged(struct radix_tree_root *root, unsigned int tag);
 
 static inline void radix_tree_preload_end(void)
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2007-01-22 20:10:00.000000000 +0100
+++ linux-2.6/lib/radix-tree.c	2007-01-22 20:10:02.000000000 +0100
@@ -785,7 +785,7 @@ EXPORT_SYMBOL(radix_tree_gang_lookup_slo
  * open-coding the search.
  */
 static unsigned int
-__lookup_tag(struct radix_tree_node *slot, void **results, unsigned long index,
+__lookup_tag(struct radix_tree_node *slot, void ***results, unsigned long index,
 	unsigned int max_items, unsigned long *next_index, unsigned int tag)
 {
 	unsigned int nr_found = 0;
@@ -831,8 +831,7 @@ __lookup_tag(struct radix_tree_node *slo
 				 * rely on its value remaining the same).
 				 */
 				if (node) {
-					node = rcu_dereference(node);
-					results[nr_found++] = node;
+					results[nr_found++] = &(slot->slots[j]);
 					if (nr_found == max_items)
 						goto out;
 				}
@@ -891,6 +890,80 @@ radix_tree_gang_lookup_tag(struct radix_
 
 	ret = 0;
 	while (ret < max_items) {
+		unsigned int nr_found, i, j;
+		unsigned long next_index;	/* Index of next search */
+
+		if (cur_index > max_index)
+			break;
+		nr_found = __lookup_tag(node, (void ***)results + ret, cur_index,
+					max_items - ret, &next_index, tag);
+		for (i = j = 0; i < nr_found; i++) {
+			struct radix_tree_node *slot;
+			slot = rcu_dereference(*(((void ***)results)[ret + i]));
+			if (!slot)
+				continue;
+			results[ret + j] = slot;
+			j++;
+		}
+		ret += j;
+		if (next_index == 0)
+			break;
+		cur_index = next_index;
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+
+/**
+ *	radix_tree_gang_lookup_tag_slot - perform multiple slot lookup on a
+ *	                                  radix tree based on a tag
+ *	@root:		radix tree root
+ *	@results:	where the results of the lookup are placed
+ *	@first_index:	start the lookup from this key
+ *	@max_items:	place up to this many items at *results
+ *	@tag:		the tag index (< RADIX_TREE_MAX_TAGS)
+ *
+ *	Performs an index-ascending scan of the tree for present items which
+ *	have the tag indexed by @tag set.  Places their slots at *@results and
+ *	returns the number of items which were placed at *@results.
+ *
+ *	The implementation is naive.
+ *
+ *	Like radix_tree_gang_lookup_tag as far as RCU and locking goes. Slots
+ *	must be dereferenced with radix_tree_deref_slot, and if using only RCU
+ *	protection, radix_tree_deref_slot may fail requiring a retry.
+ */
+unsigned int
+radix_tree_gang_lookup_tag_slot(struct radix_tree_root *root, void ***results,
+		unsigned long first_index, unsigned int max_items,
+		unsigned int tag)
+{
+	struct radix_tree_node *node;
+	unsigned long max_index;
+	unsigned long cur_index = first_index;
+	unsigned int ret;
+
+	/* check the root's tag bit */
+	if (!root_tag_get(root, tag))
+		return 0;
+
+	node = rcu_dereference(root->rnode);
+	if (!node)
+		return 0;
+
+	if (!radix_tree_is_indirect_ptr(node)) {
+		if (first_index > 0)
+			return 0;
+		results[0] = node;
+		return 1;
+	}
+	node = radix_tree_indirect_to_ptr(node);
+
+	max_index = radix_tree_maxindex(node->height);
+
+	ret = 0;
+	while (ret < max_items) {
 		unsigned int nr_found;
 		unsigned long next_index;	/* Index of next search */
 
@@ -906,7 +979,7 @@ radix_tree_gang_lookup_tag(struct radix_
 
 	return ret;
 }
-EXPORT_SYMBOL(radix_tree_gang_lookup_tag);
+EXPORT_SYMBOL(radix_tree_gang_lookup_tag_slot);
 
 /**
  *	radix_tree_shrink    -    shrink height of a radix tree to minimal

--



* [PATCH 04/14] mm: speculative get page
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 03/14] radix-tree: gang_lookup_tag_slot Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 05/14] mm: lockless pagecache lookups Peter Zijlstra
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra, Nick Piggin

[-- Attachment #1: mm-speculative-get-page.patch --]
[-- Type: text/plain, Size: 11345 bytes --]

From: Nick Piggin <npiggin@suse.de>

If we can be sure that elevating the page_count on a pagecache
page will pin it, we can speculatively run this operation, and
subsequently check to see if we hit the right page rather than
relying on holding a lock or otherwise pinning a reference to the
page.

This can be done if get_page/put_page behaves consistently
throughout the whole tree (ie. if we "get" the page after it has
been used for something else, we must be able to free it with a
put_page).

Actually, there is a period where the count behaves differently:
when the page is free or if it is a constituent page of a compound
page. We need an atomic_inc_not_zero operation to ensure we don't
try to grab the page in either case.

This patch introduces the core locking protocol to the pagecache
(ie. adds page_cache_get_speculative, and tweaks some update-side
code to make it work).


[Hugh notices that PG_nonewrefs might be dispensed with entirely
 if the current SetPageNoNewRefs sites instead atomically saved the
 page count and temporarily set it to zero.

 This is a nice idea, and it simplifies find_get_page very much, but
 it cannot be applied to all current SetPageNoNewRefs sites. We need
 to verify that add_to_page_cache and add_to_swap_cache can cope
 without it or can make do some other way.

 In the meantime, this version is a slightly more mechanical
 replacement.]
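
As a hedged sketch of the remove side (heavily simplified from
remove_mapping(); clearing page->mapping, dropping the pagecache
reference and the statistics accounting are all omitted, and the
function name is made up). The A-D labels refer to the protocol
comment added to pagemap.h below.

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

static int try_remove_from_pagecache(struct address_space *mapping,
				     struct page *page)
{
	int removed = 0;

	SetPageNoNewRefs(page);				/* A */
	smp_wmb();
	write_lock_irq(&mapping->tree_lock);
	if (page_count(page) == 2) {			/* B: cache + caller */
		radix_tree_delete(&mapping->page_tree, page->index); /* C */
		removed = 1;
	}
	write_unlock_irq(&mapping->tree_lock);
	smp_wmb();
	ClearPageNoNewRefs(page);			/* D */
	return removed;
}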

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/page-flags.h |    8 +++
 include/linux/pagemap.h    |  105 +++++++++++++++++++++++++++++++++++++++++++++
 mm/filemap.c               |    4 +
 mm/migrate.c               |    9 +++
 mm/swap_state.c            |    4 +
 mm/vmscan.c                |   12 +++--
 6 files changed, 136 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2007-01-13 21:04:10.000000000 +0100
+++ linux-2.6/include/linux/page-flags.h	2007-01-22 20:10:56.000000000 +0100
@@ -91,7 +91,8 @@
 #define PG_nosave_free		18	/* Used for system suspend/resume */
 #define PG_buddy		19	/* Page is free, on buddy lists */
 
-
+#define PG_nonewrefs		20	/* Block concurrent pagecache lookups
+					 * while testing refcount */
 #if (BITS_PER_LONG > 32)
 /*
  * 64-bit-only flags build down from bit 31
@@ -251,6 +252,11 @@ static inline void SetPageUptodate(struc
 #define SetPageUncached(page)	set_bit(PG_uncached, &(page)->flags)
 #define ClearPageUncached(page)	clear_bit(PG_uncached, &(page)->flags)
 
+#define PageNoNewRefs(page)	test_bit(PG_nonewrefs, &(page)->flags)
+#define SetPageNoNewRefs(page)	set_bit(PG_nonewrefs, &(page)->flags)
+#define ClearPageNoNewRefs(page) clear_bit(PG_nonewrefs, &(page)->flags)
+#define __ClearPageNoNewRefs(page) __clear_bit(PG_nonewrefs, &(page)->flags)
+
 struct page;	/* forward declaration */
 
 extern void cancel_dirty_page(struct page *page, unsigned int account_size);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-01-13 21:04:10.000000000 +0100
+++ linux-2.6/include/linux/pagemap.h	2007-01-22 20:10:30.000000000 +0100
@@ -11,6 +11,8 @@
 #include <linux/compiler.h>
 #include <asm/uaccess.h>
 #include <linux/gfp.h>
+#include <linux/page-flags.h>
+#include <linux/hardirq.h> /* for in_interrupt() */
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -51,6 +53,109 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+/*
+ * speculatively take a reference to a page.
+ * If the page is free (_count == 0), then _count is untouched, and 0
+ * is returned. Otherwise, _count is incremented by 1 and 1 is returned.
+ *
+ * This function must be run in the same rcu_read_lock() section as has
+ * been used to lookup the page in the pagecache radix-tree: this allows
+ * allocators to use a synchronize_rcu() to stabilize _count.
+ *
+ * Unless an RCU grace period has passed, the count of all pages coming out
+ * of the allocator must be considered unstable. page_count may return higher
+ * than expected, and put_page must be able to do the right thing when the
+ * page has been finished with (because put_page is what is used to drop an
+ * invalid speculative reference).
+ *
+ * After incrementing the refcount, this function spins until PageNoNewRefs
+ * is clear, then a read memory barrier is issued.
+ *
+ * This forms the core of the lockless pagecache locking protocol, where
+ * the lookup-side (eg. find_get_page) has the following pattern:
+ * 1. find page in radix tree
+ * 2. conditionally increment refcount
+ * 3. wait for PageNoNewRefs
+ * 4. check the page is still in pagecache
+ *
+ * Remove-side (that cares about _count, eg. reclaim) has the following:
+ * A. SetPageNoNewRefs
+ * B. check refcount is correct
+ * C. remove page
+ * D. ClearPageNoNewRefs
+ *
+ * There are 2 critical interleavings that matter:
+ * - 2 runs before B: in this case, B sees elevated refcount and bails out
+ * - B runs before 2: in this case, 3 ensures 4 will not run until *after* C
+ *   (after D, even). In which case, 4 will notice C and lookup side can retry
+ *
+ * It is possible that between 1 and 2, the page is removed then the exact same
+ * page is inserted into the same position in pagecache. That's OK: the
+ * old find_get_page using tree_lock could equally have run before or after
+ * the write-side, depending on timing.
+ *
+ * Pagecache insertion isn't a big problem: either 1 will find the page or
+ * it will not. Likewise, the old find_get_page could run either before the
+ * insertion or afterwards, depending on timing.
+ */
+static inline int page_cache_get_speculative(struct page *page)
+{
+	VM_BUG_ON(in_interrupt());
+
+#ifndef CONFIG_SMP
+# ifdef CONFIG_PREEMPT
+	VM_BUG_ON(!in_atomic());
+# endif
+	/*
+	 * Preempt must be disabled here - we rely on rcu_read_lock doing
+	 * this for us.
+	 *
+	 * Pagecache won't be truncated from interrupt context, so if we have
+	 * found a page in the radix tree here, we have pinned its refcount by
+	 * disabling preempt, and hence no need for the "speculative get" that
+	 * SMP requires.
+	 */
+	VM_BUG_ON(page_count(page) == 0);
+	atomic_inc(&page->_count);
+
+#else
+	if (unlikely(!get_page_unless_zero(page)))
+		return 0; /* page has been freed */
+
+	/*
+	 * Note that get_page_unless_zero provides a memory barrier.
+	 * This is needed to ensure PageNoNewRefs is evaluated after the
+	 * page refcount has been raised. See below comment.
+	 */
+
+	while (unlikely(PageNoNewRefs(page)))
+		cpu_relax();
+
+	/*
+	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
+	 * is performed before a future load used to ensure the page is
+	 * the correct one (usually: page->mapping and page->index).
+	 *
+	 * Those places that set PageNoNewRefs have the following pattern:
+	 * 	SetPageNoNewRefs(page)
+	 * 	wmb();
+	 * 	if (page_count(page) == X)
+	 * 		remove page from pagecache
+	 * 	wmb();
+	 * 	ClearPageNoNewRefs(page)
+	 *
+	 * If the load was out of order, page->mapping might be loaded before
+	 * the page is removed from pagecache but PageNoNewRefs evaluated
+	 * after the ClearPageNoNewRefs().
+	 */
+	smp_rmb();
+
+#endif
+	VM_BUG_ON(PageCompound(page) && (struct page *)page_private(page) != page);
+
+	return 1;
+}
+
 #ifdef CONFIG_NUMA
 extern struct page *__page_cache_alloc(gfp_t gfp);
 #else
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-01-13 21:04:13.000000000 +0100
+++ linux-2.6/mm/vmscan.c	2007-01-22 20:10:30.000000000 +0100
@@ -390,6 +390,8 @@ int remove_mapping(struct address_space 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
+	SetPageNoNewRefs(page);
+	smp_wmb();
 	write_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
@@ -427,17 +429,21 @@ int remove_mapping(struct address_space 
 		__delete_from_swap_cache(page);
 		write_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
-		__put_page(page);	/* The pagecache ref */
-		return 1;
+		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
 	write_unlock_irq(&mapping->tree_lock);
-	__put_page(page);
+
+free_it:
+	smp_wmb();
+	__ClearPageNoNewRefs(page);
+	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
 	write_unlock_irq(&mapping->tree_lock);
+	ClearPageNoNewRefs(page);
 	return 0;
 }
 
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-13 21:04:13.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-22 20:10:30.000000000 +0100
@@ -440,6 +440,8 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		SetPageNoNewRefs(page);
+		smp_wmb();
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
@@ -451,6 +453,8 @@ int add_to_page_cache(struct page *page,
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&mapping->tree_lock);
+		smp_wmb();
+		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-01-13 21:04:13.000000000 +0100
+++ linux-2.6/mm/swap_state.c	2007-01-22 20:10:30.000000000 +0100
@@ -78,6 +78,8 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		SetPageNoNewRefs(page);
+		smp_wmb();
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
@@ -90,6 +92,8 @@ static int __add_to_swap_cache(struct pa
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
+		smp_wmb();
+		ClearPageNoNewRefs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2007-01-13 21:04:13.000000000 +0100
+++ linux-2.6/mm/migrate.c	2007-01-22 20:10:30.000000000 +0100
@@ -303,6 +303,8 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
+	SetPageNoNewRefs(page);
+	smp_wmb();
 	write_lock_irq(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
@@ -311,6 +313,7 @@ static int migrate_page_move_mapping(str
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		write_unlock_irq(&mapping->tree_lock);
+		ClearPageNoNewRefs(page);
 		return -EAGAIN;
 	}
 
@@ -326,6 +329,10 @@ static int migrate_page_move_mapping(str
 #endif
 
 	radix_tree_replace_slot(pslot, newpage);
+	page->mapping = NULL;
+  	write_unlock_irq(&mapping->tree_lock);
+	smp_wmb();
+	ClearPageNoNewRefs(page);
 
 	/*
 	 * Drop cache reference from old page.
@@ -333,8 +340,6 @@ static int migrate_page_move_mapping(str
 	 */
 	__put_page(page);
 
-	write_unlock_irq(&mapping->tree_lock);
-
 	return 0;
 }
 

--



* [PATCH 05/14] mm: lockless pagecache lookups
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 04/14] mm: speculative get page Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 06/14] mm: fix speculative page get preemption bug Peter Zijlstra
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra, Nick Piggin

[-- Attachment #1: mm-lockless-pagecache-lookups.patch --]
[-- Type: text/plain, Size: 7463 bytes --]

From: Nick Piggin <npiggin@suse.de>

Combine page_cache_get_speculative with lockless radix tree lookups to
introduce lockless page cache lookups (ie. no mapping->tree_lock on
the read-side).

The only atomicity change this introduces is that the gang pagecache
lookup functions now behave as if they were implemented with multiple
find_get_page calls, rather than operating on a snapshot of the pages.
In practice, this atomicity guarantee is not used anyway, and it is
difficult to see how it could be. Gang pagecache lookups are designed
to replace individual lookups, so these semantics are natural.
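
Illustrative sketch only (not equivalent code -- find_get_pages() scans
forward for up to nr_pages present pages rather than probing nr_pages
consecutive indices): the guarantee is now per page, as if each page
had been found by its own find_get_page() call.

#include <linux/pagemap.h>

static unsigned int naive_get_pages(struct address_space *mapping,
				    pgoff_t start, unsigned int nr_pages,
				    struct page **pages)
{
	unsigned int i, ret = 0;

	for (i = 0; i < nr_pages; i++) {
		/*
		 * Each page is individually valid at the moment it is
		 * looked up; there is no snapshot of the whole range.
		 */
		struct page *page = find_get_page(mapping, start + i);

		if (page)
			pages[ret++] = page;
	}
	return ret;
}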

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/filemap.c        |  133 +++++++++++++++++++++++++++++++++++++---------------
 mm/page-writeback.c |    5 -
 mm/readahead.c      |    6 --
 3 files changed, 101 insertions(+), 43 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-22 20:10:30.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-22 20:11:03.000000000 +0100
@@ -596,15 +596,31 @@ void fastcall __lock_page_nosync(struct 
  * Is there a pagecache struct page at the given (mapping, offset) tuple?
  * If yes, increment its refcount and return it; if no, return NULL.
  */
-struct page * find_get_page(struct address_space *mapping, unsigned long offset)
+struct page *find_get_page(struct address_space *mapping, unsigned long offset)
 {
+	void **pagep;
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page)
-		page_cache_get(page);
-	read_unlock_irq(&mapping->tree_lock);
+	rcu_read_lock();
+repeat:
+	page = NULL;
+	pagep = radix_tree_lookup_slot(&mapping->page_tree, offset);
+	if (pagep) {
+		page = radix_tree_deref_slot(pagep);
+		if (unlikely(!page || page == RADIX_TREE_RETRY))
+			goto repeat;
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *pagep)) {
+			page_cache_release(page);
+			goto repeat;
+		}
+	}
+	rcu_read_unlock();
+
 	return page;
 }
 EXPORT_SYMBOL(find_get_page);
@@ -644,26 +660,19 @@ struct page *find_lock_page(struct addre
 {
 	struct page *page;
 
-	read_lock_irq(&mapping->tree_lock);
 repeat:
-	page = radix_tree_lookup(&mapping->page_tree, offset);
+	page = find_get_page(mapping, offset);
 	if (page) {
-		page_cache_get(page);
-		if (TestSetPageLocked(page)) {
-			read_unlock_irq(&mapping->tree_lock);
-			__lock_page(page);
-			read_lock_irq(&mapping->tree_lock);
-
-			/* Has the page been truncated while we slept? */
-			if (unlikely(page->mapping != mapping ||
-				     page->index != offset)) {
-				unlock_page(page);
-				page_cache_release(page);
-				goto repeat;
-			}
+		lock_page(page);
+		/* Has the page been truncated? */
+		if (unlikely(page->mapping != mapping
+				|| page->index != offset)) {
+			unlock_page(page);
+			page_cache_release(page);
+			goto repeat;
 		}
 	}
-	read_unlock_irq(&mapping->tree_lock);
+
 	return page;
 }
 EXPORT_SYMBOL(find_lock_page);
@@ -733,13 +742,39 @@ unsigned find_get_pages(struct address_s
 {
 	unsigned int i;
 	unsigned int ret;
+	unsigned int nr_found;
 
-	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup(&mapping->page_tree,
-				(void **)pages, start, nr_pages);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
-	read_unlock_irq(&mapping->tree_lock);
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, start, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		/*
+		 * this can only trigger if nr_found == 1, making livelock
+		 * a non issue.
+		 */
+		if (unlikely(page == RADIX_TREE_RETRY))
+			goto restart;
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[ret] = page;
+		ret++;
+	}
+	rcu_read_unlock();
 	return ret;
 }
 
@@ -760,19 +795,44 @@ unsigned find_get_pages_contig(struct ad
 {
 	unsigned int i;
 	unsigned int ret;
+	unsigned int nr_found;
 
-	read_lock_irq(&mapping->tree_lock);
-	ret = radix_tree_gang_lookup(&mapping->page_tree,
-				(void **)pages, index, nr_pages);
-	for (i = 0; i < ret; i++) {
-		if (pages[i]->mapping == NULL || pages[i]->index != index)
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_slot(&mapping->page_tree,
+				(void ***)pages, index, nr_pages);
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void **)pages[i]);
+		if (unlikely(!page))
+			continue;
+		/*
+		 * this can only trigger if nr_found == 1, making livelock
+		 * a non issue.
+		 */
+		if (unlikely(page == RADIX_TREE_RETRY))
+			goto restart;
+
+		if (page->mapping == NULL || page->index != index)
 			break;
 
-		page_cache_get(pages[i]);
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[ret] = page;
+		ret++;
 		index++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
-	return i;
+	rcu_read_unlock();
+	return ret;
 }
 
 /**
@@ -793,6 +853,7 @@ unsigned find_get_pages_tag(struct addre
 	unsigned int ret;
 
 	read_lock_irq(&mapping->tree_lock);
+	/* TODO: implement lookup_tag_slot and make this lockless */
 	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
 				(void **)pages, *index, nr_pages, tag);
 	for (i = 0; i < ret; i++)
Index: linux-2.6/mm/readahead.c
===================================================================
--- linux-2.6.orig/mm/readahead.c	2007-01-13 21:04:13.000000000 +0100
+++ linux-2.6/mm/readahead.c	2007-01-22 20:11:03.000000000 +0100
@@ -281,27 +281,25 @@ __do_page_cache_readahead(struct address
 	/*
 	 * Preallocate as many pages as we will need.
 	 */
-	read_lock_irq(&mapping->tree_lock);
 	for (page_idx = 0; page_idx < nr_to_read; page_idx++) {
 		pgoff_t page_offset = offset + page_idx;
 		
 		if (page_offset > end_index)
 			break;
 
+		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
+		rcu_read_unlock();
 		if (page)
 			continue;
 
-		read_unlock_irq(&mapping->tree_lock);
 		page = page_cache_alloc_cold(mapping);
-		read_lock_irq(&mapping->tree_lock);
 		if (!page)
 			break;
 		page->index = page_offset;
 		list_add(&page->lru, &page_pool);
 		ret++;
 	}
-	read_unlock_irq(&mapping->tree_lock);
 
 	/*
 	 * Now start the IO.  We ignore I/O errors - if the page is not
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-01-22 17:29:38.000000000 +0100
+++ linux-2.6/mm/page-writeback.c	2007-01-22 20:11:03.000000000 +0100
@@ -959,12 +959,11 @@ EXPORT_SYMBOL(test_set_page_writeback);
  */
 int mapping_tagged(struct address_space *mapping, int tag)
 {
-	unsigned long flags;
 	int ret;
 
-	read_lock_irqsave(&mapping->tree_lock, flags);
+	rcu_read_lock();
 	ret = radix_tree_tagged(&mapping->page_tree, tag);
-	read_unlock_irqrestore(&mapping->tree_lock, flags);
+	rcu_read_unlock();
 	return ret;
 }
 EXPORT_SYMBOL(mapping_tagged);

--



* [PATCH 06/14] mm: fix speculative page get preemption bug
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (4 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 05/14] mm: lockless pagecache lookups Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 07/14] mm: speculative find_get_pages_tag Peter Zijlstra
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: mm-lockless-preempt-fixup.patch --]
[-- Type: text/plain, Size: 5511 bytes --]

Fix a livelock scenario pointed out by Nick: a writer can be preempted
while it has PG_nonewrefs set, and a reader spinning in
page_cache_get_speculative() (which, under rcu_read_lock, runs with
preemption disabled) then never lets the writer run again:

SetPageNoNewRefs(page);
	  *** preempted here ***
		      page_cache_get_speculative() {
			while (PageNoNewRefs(page)) /* livelock */
		      }
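
A minimal sketch of the resulting write-side pattern (the wrapper is
made up, the helpers are the ones introduced below): the whole
PG_nonewrefs window now runs with preemption disabled, so a reader
spinning in page_cache_get_speculative() can no longer be stuck behind
a preempted writer on the same CPU.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/radix-tree.h>

static void replace_page_slot(struct address_space *mapping,
			      struct page *page, void **pslot,
			      struct page *newpage)
{
	set_page_no_new_refs(page);	/* preempt_disable + SetPageNoNewRefs */
	write_lock_irq(&mapping->tree_lock);
	radix_tree_replace_slot(pslot, newpage);
	write_unlock_irq(&mapping->tree_lock);
	end_page_no_new_refs(page);	/* ClearPageNoNewRefs + preempt_enable */
}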

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/pagemap.h |   25 +++++++++++++++++++++++--
 mm/filemap.c            |    6 ++----
 mm/migrate.c            |    8 +++-----
 mm/swap_state.c         |    6 ++----
 mm/vmscan.c             |    8 +++-----
 5 files changed, 33 insertions(+), 20 deletions(-)

Index: linux-2.6-rt/include/linux/pagemap.h
===================================================================
--- linux-2.6-rt.orig/include/linux/pagemap.h	2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/include/linux/pagemap.h	2006-11-29 14:20:55.000000000 +0100
@@ -53,6 +53,28 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+static inline void set_page_no_new_refs(struct page *page)
+{
+	VM_BUG_ON(PageNoNewRefs(page));
+	preempt_disable();
+	SetPageNoNewRefs(page);
+	smp_wmb();
+}
+
+static inline void end_page_no_new_refs(struct page *page)
+{
+	VM_BUG_ON(!PageNoNewRefs(page));
+	smp_wmb();
+	ClearPageNoNewRefs(page);
+	preempt_enable();
+}
+
+static inline void wait_on_new_refs(struct page *page)
+{
+	while (unlikely(PageNoNewRefs(page)))
+		cpu_relax();
+}
+
 /*
  * speculatively take a reference to a page.
  * If the page is free (_count == 0), then _count is untouched, and 0
@@ -128,8 +150,7 @@ static inline int page_cache_get_specula
 	 * page refcount has been raised. See below comment.
 	 */
 
-	while (unlikely(PageNoNewRefs(page)))
-		cpu_relax();
+	wait_on_new_refs(page);
 
 	/*
 	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
Index: linux-2.6-rt/mm/filemap.c
===================================================================
--- linux-2.6-rt.orig/mm/filemap.c	2006-11-29 14:20:52.000000000 +0100
+++ linux-2.6-rt/mm/filemap.c	2006-11-29 14:20:55.000000000 +0100
@@ -440,8 +440,7 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
-		SetPageNoNewRefs(page);
-		smp_wmb();
+		set_page_no_new_refs(page);
 		write_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
@@ -453,8 +452,7 @@ int add_to_page_cache(struct page *page,
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&mapping->tree_lock);
-		smp_wmb();
-		ClearPageNoNewRefs(page);
+		end_page_no_new_refs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6-rt/mm/migrate.c
===================================================================
--- linux-2.6-rt.orig/mm/migrate.c	2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/migrate.c	2006-11-29 14:20:55.000000000 +0100
@@ -303,8 +303,7 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
-	SetPageNoNewRefs(page);
-	smp_wmb();
+	set_page_no_new_refs(page);
 	write_lock_irq(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
@@ -313,7 +312,7 @@ static int migrate_page_move_mapping(str
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
 		write_unlock_irq(&mapping->tree_lock);
-		ClearPageNoNewRefs(page);
+		end_page_no_new_refs(page);
 		return -EAGAIN;
 	}
 
@@ -331,8 +330,7 @@ static int migrate_page_move_mapping(str
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
   	write_unlock_irq(&mapping->tree_lock);
-	smp_wmb();
-	ClearPageNoNewRefs(page);
+	end_page_no_new_refs(page);
 
 	/*
 	 * Drop cache reference from old page.
Index: linux-2.6-rt/mm/swap_state.c
===================================================================
--- linux-2.6-rt.orig/mm/swap_state.c	2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/swap_state.c	2006-11-29 14:20:55.000000000 +0100
@@ -78,8 +78,7 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
-		SetPageNoNewRefs(page);
-		smp_wmb();
+		set_page_no_new_refs(page);
 		write_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
@@ -92,8 +91,7 @@ static int __add_to_swap_cache(struct pa
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		write_unlock_irq(&swapper_space.tree_lock);
-		smp_wmb();
-		ClearPageNoNewRefs(page);
+		end_page_no_new_refs(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6-rt/mm/vmscan.c
===================================================================
--- linux-2.6-rt.orig/mm/vmscan.c	2006-11-29 14:20:48.000000000 +0100
+++ linux-2.6-rt/mm/vmscan.c	2006-11-29 14:20:55.000000000 +0100
@@ -390,8 +390,7 @@ int remove_mapping(struct address_space 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	SetPageNoNewRefs(page);
-	smp_wmb();
+	set_page_no_new_refs(page);
 	write_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
@@ -436,14 +435,13 @@ int remove_mapping(struct address_space 
 	write_unlock_irq(&mapping->tree_lock);
 
 free_it:
-	smp_wmb();
-	__ClearPageNoNewRefs(page);
+	end_page_no_new_refs(page);
 	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
 	write_unlock_irq(&mapping->tree_lock);
-	ClearPageNoNewRefs(page);
+	end_page_no_new_refs(page);
 	return 0;
 }
 

--



* [PATCH 07/14] mm: speculative find_get_pages_tag
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (5 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 06/14] mm: fix speculative page get preemption bug Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 08/14] mm: remove find_trylock_page Peter Zijlstra
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: mm-lockless-find_get_pages_tag.patch --]
[-- Type: text/plain, Size: 1670 bytes --]

Implement the speculative (lockless) find_get_pages_tag(), using the
new radix_tree_gang_lookup_tag_slot().
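
For reference, a hedged caller-side sketch (the function is made up;
writeback-style callers follow this pattern): the calling convention is
unchanged, only the lookup underneath becomes lockless.

#include <linux/fs.h>
#include <linux/pagemap.h>

static void for_each_dirty_page(struct address_space *mapping)
{
	struct page *pages[16];
	pgoff_t index = 0;
	unsigned int i, nr;

	while ((nr = find_get_pages_tag(mapping, &index, PAGECACHE_TAG_DIRTY,
					16, pages))) {
		for (i = 0; i < nr; i++) {
			/* ... inspect pages[i] ... */
			page_cache_release(pages[i]);
		}
	}
}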

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 mm/filemap.c |   42 +++++++++++++++++++++++++++++++++---------
 1 file changed, 33 insertions(+), 9 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-22 20:11:07.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-22 20:11:09.000000000 +0100
@@ -849,16 +849,40 @@ unsigned find_get_pages_tag(struct addre
 {
 	unsigned int i;
 	unsigned int ret;
+	unsigned int nr_found;
 
-	read_lock_irq(&mapping->tree_lock);
-	/* TODO: implement lookup_tag_slot and make this lockless */
-	ret = radix_tree_gang_lookup_tag(&mapping->page_tree,
-				(void **)pages, *index, nr_pages, tag);
-	for (i = 0; i < ret; i++)
-		page_cache_get(pages[i]);
-	if (ret)
-		*index = pages[ret - 1]->index + 1;
-	read_unlock_irq(&mapping->tree_lock);
+	rcu_read_lock();
+restart:
+	nr_found = radix_tree_gang_lookup_tag_slot(&mapping->page_tree,
+			(void ***)pages, *index, nr_pages, tag);
+
+	ret = 0;
+	for (i = 0; i < nr_found; i++) {
+		struct page *page;
+repeat:
+		page = radix_tree_deref_slot((void**)pages[i]);
+		if (unlikely(!page))
+			continue;
+		/*
+		 * this can only trigger if nr_found == 1, making livelocks
+		 * a non issue.
+		 */
+		if (unlikely(page == RADIX_TREE_RETRY))
+			goto restart;
+
+		if (!page_cache_get_speculative(page))
+			goto repeat;
+
+		/* Has the page moved? */
+		if (unlikely(page != *((void **)pages[i]))) {
+			page_cache_release(page);
+			goto repeat;
+		}
+
+		pages[ret] = page;
+		ret++;
+	}
+	rcu_read_unlock();
 	return ret;
 }
 

--



* [PATCH 08/14] mm: remove find_trylock_page
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (6 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 07/14] mm: speculative find_get_pages_tag Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 09/14] mm: change tree_lock into a spinlock Peter Zijlstra
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: kill-find_trylock_page.patch --]
[-- Type: text/plain, Size: 2029 bytes --]

find_trylock_page() is the last read_lock user of the tree_lock, and
since it is unused, remove it rather than convert it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/pagemap.h |    2 --
 mm/filemap.c            |   20 --------------------
 2 files changed, 22 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-22 20:11:09.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-22 20:11:13.000000000 +0100
@@ -624,26 +624,6 @@ repeat:
 EXPORT_SYMBOL(find_get_page);
 
 /**
- * find_trylock_page - find and lock a page
- * @mapping: the address_space to search
- * @offset: the page index
- *
- * Same as find_get_page(), but trylock it instead of incrementing the count.
- */
-struct page *find_trylock_page(struct address_space *mapping, unsigned long offset)
-{
-	struct page *page;
-
-	read_lock_irq(&mapping->tree_lock);
-	page = radix_tree_lookup(&mapping->page_tree, offset);
-	if (page && TestSetPageLocked(page))
-		page = NULL;
-	read_unlock_irq(&mapping->tree_lock);
-	return page;
-}
-EXPORT_SYMBOL(find_trylock_page);
-
-/**
  * find_lock_page - locate, pin and lock a pagecache page
  * @mapping: the address_space to search
  * @offset: the page index
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-01-22 20:11:07.000000000 +0100
+++ linux-2.6/include/linux/pagemap.h	2007-01-22 20:11:13.000000000 +0100
@@ -202,8 +202,6 @@ extern struct page * find_get_page(struc
 				unsigned long index);
 extern struct page * find_lock_page(struct address_space *mapping,
 				unsigned long index);
-extern __deprecated_for_modules struct page * find_trylock_page(
-			struct address_space *mapping, unsigned long index);
 extern struct page * find_or_create_page(struct address_space *mapping,
 				unsigned long index, gfp_t gfp_mask);
 unsigned find_get_pages(struct address_space *mapping, pgoff_t start,

--



* [PATCH 09/14] mm: change tree_lock into a spinlock
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (7 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 08/14] mm: remove find_trylock_page Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 10/14] radix-tree: concurrent write side support Peter Zijlstra
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: tree_lock-to-spinlock.patch --]
[-- Type: text/plain, Size: 13013 bytes --]

With all the read_lock uses of the tree_lock gone, change it into
a spinlock.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                     |    4 ++--
 fs/inode.c                      |    2 +-
 include/asm-arm/cacheflush.h    |    4 ++--
 include/asm-parisc/cacheflush.h |    4 ++--
 include/linux/fs.h              |    2 +-
 mm/filemap.c                    |    8 ++++----
 mm/migrate.c                    |    6 +++---
 mm/page-writeback.c             |   12 ++++++------
 mm/swap_state.c                 |   10 +++++-----
 mm/swapfile.c                   |    4 ++--
 mm/truncate.c                   |    6 +++---
 mm/vmscan.c                     |    8 ++++----
 12 files changed, 35 insertions(+), 35 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-01-28 13:59:07.000000000 +0100
+++ linux-2.6/fs/buffer.c	2007-01-28 13:59:24.000000000 +0100
@@ -728,7 +728,7 @@ int __set_page_dirty_buffers(struct page
 	if (TestSetPageDirty(page))
 		return 0;
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
@@ -737,7 +737,7 @@ int __set_page_dirty_buffers(struct page
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return 1;
 }
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h	2007-01-28 13:59:07.000000000 +0100
+++ linux-2.6/include/asm-arm/cacheflush.h	2007-01-28 13:59:24.000000000 +0100
@@ -368,9 +368,9 @@ static inline void flush_anon_page(struc
 }
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_user_range(vma,page,addr,len) \
 	flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h	2007-01-28 13:59:07.000000000 +0100
+++ linux-2.6/include/asm-parisc/cacheflush.h	2007-01-28 13:59:24.000000000 +0100
@@ -59,9 +59,9 @@ flush_user_icache_range(unsigned long st
 extern void flush_dcache_page(struct page *page);
 
 #define flush_dcache_mmap_lock(mapping) \
-	write_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->tree_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	write_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->tree_lock)
 
 #define flush_icache_page(vma,page)	do { flush_kernel_dcache_page(page); flush_kernel_icache_page(page_address(page)); } while (0)
 
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2007-01-28 13:59:07.000000000 +0100
+++ linux-2.6/include/linux/fs.h	2007-01-28 13:59:24.000000000 +0100
@@ -433,7 +433,7 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	rwlock_t		tree_lock;	/* and rwlock protecting it */
+	spinlock_t		tree_lock;	/* and lock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-28 13:59:23.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-28 13:59:24.000000000 +0100
@@ -128,9 +128,9 @@ void remove_from_page_cache(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 }
 
 static int sync_page(void *word)
@@ -441,7 +441,7 @@ int add_to_page_cache(struct page *page,
 
 	if (error == 0) {
 		set_page_no_new_refs(page);
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			page_cache_get(page);
@@ -451,7 +451,7 @@ int add_to_page_cache(struct page *page,
 			mapping->nrpages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		end_page_no_new_refs(page);
 		radix_tree_preload_end();
 	}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2007-01-28 13:59:21.000000000 +0100
+++ linux-2.6/mm/migrate.c	2007-01-28 13:59:24.000000000 +0100
@@ -304,14 +304,14 @@ static int migrate_page_move_mapping(str
 	}
 
 	set_page_no_new_refs(page);
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		end_page_no_new_refs(page);
 		return -EAGAIN;
 	}
@@ -329,7 +329,7 @@ static int migrate_page_move_mapping(str
 
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
-  	write_unlock_irq(&mapping->tree_lock);
+  	spin_unlock_irq(&mapping->tree_lock);
 	end_page_no_new_refs(page);
 
 	/*
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-01-28 13:59:20.000000000 +0100
+++ linux-2.6/mm/page-writeback.c	2007-01-28 13:59:24.000000000 +0100
@@ -765,7 +765,7 @@ int __set_page_dirty_nobuffers(struct pa
 		if (!mapping)
 			return 1;
 
-		write_lock_irq(&mapping->tree_lock);
+		spin_lock_irq(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -776,7 +776,7 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -913,13 +913,13 @@ int test_clear_page_writeback(struct pag
 	if (mapping) {
 		unsigned long flags;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestClearPageWriteback(page);
 		if (ret)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
@@ -934,7 +934,7 @@ int test_set_page_writeback(struct page 
 	if (mapping) {
 		unsigned long flags;
 
-		write_lock_irqsave(&mapping->tree_lock, flags);
+		spin_lock_irqsave(&mapping->tree_lock, flags);
 		ret = TestSetPageWriteback(page);
 		if (!ret)
 			radix_tree_tag_set(&mapping->page_tree,
@@ -944,7 +944,7 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		write_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-01-28 13:59:21.000000000 +0100
+++ linux-2.6/mm/swap_state.c	2007-01-28 13:59:24.000000000 +0100
@@ -38,7 +38,7 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= __RW_LOCK_UNLOCKED(swapper_space.tree_lock),
+	.tree_lock	= __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
@@ -79,7 +79,7 @@ static int __add_to_swap_cache(struct pa
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
 		set_page_no_new_refs(page);
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
@@ -90,7 +90,7 @@ static int __add_to_swap_cache(struct pa
 			total_swapcache_pages++;
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 		end_page_no_new_refs(page);
 		radix_tree_preload_end();
 	}
@@ -202,9 +202,9 @@ void delete_from_swap_cache(struct page 
 
 	entry.val = page_private(page);
 
-	write_lock_irq(&swapper_space.tree_lock);
+	spin_lock_irq(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	write_unlock_irq(&swapper_space.tree_lock);
+	spin_unlock_irq(&swapper_space.tree_lock);
 
 	swap_free(entry);
 	page_cache_release(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2007-01-28 13:59:08.000000000 +0100
+++ linux-2.6/mm/swapfile.c	2007-01-28 13:59:24.000000000 +0100
@@ -367,13 +367,13 @@ int remove_exclusive_swap_page(struct pa
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
-		write_lock_irq(&swapper_space.tree_lock);
+		spin_lock_irq(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		write_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock_irq(&swapper_space.tree_lock);
 	}
 	spin_unlock(&swap_lock);
 
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-01-28 13:59:08.000000000 +0100
+++ linux-2.6/mm/truncate.c	2007-01-28 13:59:24.000000000 +0100
@@ -333,18 +333,18 @@ invalidate_complete_page2(struct address
 	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	return 0;
 }
 
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-01-28 13:59:21.000000000 +0100
+++ linux-2.6/mm/vmscan.c	2007-01-28 13:59:24.000000000 +0100
@@ -391,7 +391,7 @@ int remove_mapping(struct address_space 
 	BUG_ON(mapping != page_mapping(page));
 
 	set_page_no_new_refs(page);
-	write_lock_irq(&mapping->tree_lock);
+	spin_lock_irq(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -426,13 +426,13 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		write_unlock_irq(&mapping->tree_lock);
+		spin_unlock_irq(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 
 free_it:
 	end_page_no_new_refs(page);
@@ -440,7 +440,7 @@ free_it:
 	return 1;
 
 cannot_free:
-	write_unlock_irq(&mapping->tree_lock);
+	spin_unlock_irq(&mapping->tree_lock);
 	end_page_no_new_refs(page);
 	return 0;
 }
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2007-01-28 13:59:07.000000000 +0100
+++ linux-2.6/fs/inode.c	2007-01-28 13:59:24.000000000 +0100
@@ -193,7 +193,7 @@ void inode_init_once(struct inode *inode
 	mutex_init(&inode->i_mutex);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	rwlock_init(&inode->i_data.tree_lock);
+	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);

--


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 10/14] radix-tree: concurrent write side support
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (8 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 09/14] mm: change tree_lock into a spinlock Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 11/14] atomic_ulong_t Peter Zijlstra
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: radix-tree-concurrent.patch --]
[-- Type: text/plain, Size: 20049 bytes --]

Provide support for concurrent write side operations without changing the API
for all current uses.

Concurrency is realized by means of two locking models; the simpler one is
ladder locking, the more complex one is path locking.

Ladder locking is like climbing down a ladder: you place one foot on the rung
below the one supporting your other foot, and so on; there is never a moment
with both feet in the air. Likewise when walking down the tree, you lock a
node below the current one before releasing the current node's lock.

This allows other modifying operations to start as soon as you release the
lock on the root node, and even to complete before you do if they walk a
different path downward.

The modifying operations (insert, lookup_slot and set_tag) use this simple
method.
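
In code, one ladder step of the downward walk looks roughly like this
condensed sketch of the insert path further down (child node allocation
and error handling omitted):

  while (height > 0) {
          offset = (index >> shift) & RADIX_TREE_MAP_MASK;
          node = slot;

          /* take the child's lock, then drop the lock held above it */
          radix_ladder_lock(context, node);

          slot = node->slots[offset];
          shift -= RADIX_TREE_MAP_SHIFT;
          height--;
  }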

The more complex path locking method is needed for operations that must walk
upwards again after walking down: tag_clear and delete.

These lock their whole path downwards and release whole sections at points
where it can be determined that the upward walk will stop, thus also allowing
concurrency.

Finding the conditions under which the upward walk terminates, while still on
the way down, is the 'interesting' part of this approach.
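
Concretely, for tag_clear the upward walk can stop at a node as soon as
either the tag is not set for the slot just taken, or some other slot in
that node still has the tag set. The sketch below mirrors the
radix_tree_unlock_tag() check added further down and shows how the locks
held above such a node are released during the downward walk:

  this  = tag_get(pathp->node, tag, pathp->offset);
  other = any_tag_set_but(pathp->node, tag, pathp->offset);

  if (!this || other) {
          /* clearing the tag below can never propagate past this node,
           * so everything locked above it can be dropped */
          for (; punlock < pathp; punlock++)
                  radix_path_unlock(context, punlock);
  }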

The remaining - unmodified - operations will have exclusive locking (since
they're unmodified, they never move the lock downwards from the root node).

The API for this looks like:

  DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree)

  radix_tree_lock(&ctx)
  ... do _1_ modifying operation ...
  radix_tree_unlock(&ctx)

Note that before the radix operation the lock on the root node is held and
provides exclusive locking; after the operation the lock still held might
only be enough to protect a single item.
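
As a concrete example, a (hypothetical) page cache insertion would use
this as follows; the variables mapping, offset and page are the caller's,
and the actual conversion of the page cache call sites is done later in
this series:

  DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);

  radix_tree_lock(&ctx);
  error = radix_tree_insert(&mapping->page_tree, offset, page);
  radix_tree_unlock(&ctx);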

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/radix-tree.h |   87 ++++++++++++-
 init/Kconfig               |    4 
 lib/radix-tree.c           |  282 +++++++++++++++++++++++++++++++++++----------
 3 files changed, 310 insertions(+), 63 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h
+++ linux-2.6/include/linux/radix-tree.h
@@ -62,23 +62,73 @@ struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	*rnode;
+	spinlock_t		lock;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	struct radix_tree_context *context;
+	struct task_struct	*owner;
+#endif
 };
 
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+#define __RADIX_TREE_CONCURRENT_INIT					\
+	.context = NULL,						\
+	.owner = NULL,
+#else
+#define __RADIX_TREE_CONCURRENT_INIT
+#endif
+
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
 	.gfp_mask = (mask),						\
 	.rnode = NULL,							\
+	.lock = __SPIN_LOCK_UNLOCKED(radix_tree_root.lock),		\
+	__RADIX_TREE_CONCURRENT_INIT					\
 }
 
 #define RADIX_TREE(name, mask) \
 	struct radix_tree_root name = RADIX_TREE_INIT(mask)
 
-#define INIT_RADIX_TREE(root, mask)					\
-do {									\
-	(root)->height = 0;						\
-	(root)->gfp_mask = (mask);					\
-	(root)->rnode = NULL;						\
-} while (0)
+static inline void INIT_RADIX_TREE(struct radix_tree_root *root, gfp_t gfp_mask)
+{
+	root->height = 0;
+	root->gfp_mask = gfp_mask;
+	root->rnode = NULL;
+	spin_lock_init(&root->lock);
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	root->context = NULL;
+	root->owner = NULL;
+#endif
+}
+
+struct radix_tree_context {
+	struct radix_tree_root	*root;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spinlock_t		*locked;
+#endif
+};
+
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+#define __RADIX_TREE_CONTEXT_INIT					\
+		.locked = NULL,
+#else
+#define __RADIX_TREE_CONTEXT_INIT
+#endif
+
+#define DECLARE_RADIX_TREE_CONTEXT(context, tree) 			\
+	struct radix_tree_context context = { 				\
+		.root = (tree), 					\
+		__RADIX_TREE_CONTEXT_INIT				\
+       	}
+
+static inline void
+init_radix_tree_context(struct radix_tree_context *wctx,
+		struct radix_tree_root *root)
+{
+	wctx->root = root;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	wctx->locked = NULL;
+#endif
+}
 
 /**
  * Radix-tree synchronization
@@ -155,6 +205,31 @@ static inline void radix_tree_replace_sl
 	rcu_assign_pointer(*pslot, item);
 }
 
+static inline void radix_tree_lock(struct radix_tree_context *context)
+{
+	struct radix_tree_root *root = context->root;
+	rcu_read_lock();
+	spin_lock(&root->lock);
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	BUG_ON(context->locked);
+	context->locked = &root->lock;
+	root->owner = current;
+	root->context = context;
+#endif
+}
+
+static inline void radix_tree_unlock(struct radix_tree_context *context)
+{
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	BUG_ON(!context->locked);
+	spin_unlock(context->locked);
+	context->locked = NULL;
+#else
+	spin_unlock(&context->root->lock);
+#endif
+	rcu_read_unlock();
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c
+++ linux-2.6/lib/radix-tree.c
@@ -32,6 +32,7 @@
 #include <linux/string.h>
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
+#include <linux/spinlock.h>
 
 
 #ifdef __KERNEL__
@@ -52,11 +53,17 @@ struct radix_tree_node {
 	struct rcu_head	rcu_head;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spinlock_t	lock;
+#endif
 };
 
 struct radix_tree_path {
 	struct radix_tree_node *node;
 	int offset;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spinlock_t *locked;
+#endif
 };
 
 #define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
@@ -64,6 +71,10 @@ struct radix_tree_path {
 
 static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH] __read_mostly;
 
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH];
+#endif
+
 /*
  * Radix tree node cache.
  */
@@ -88,7 +99,7 @@ static inline gfp_t root_gfp_mask(struct
  * that the caller has pinned this thread of control to the current CPU.
  */
 static struct radix_tree_node *
-radix_tree_node_alloc(struct radix_tree_root *root)
+radix_tree_node_alloc(struct radix_tree_root *root, int height)
 {
 	struct radix_tree_node *ret;
 	gfp_t gfp_mask = root_gfp_mask(root);
@@ -105,6 +116,11 @@ radix_tree_node_alloc(struct radix_tree_
 		}
 	}
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spin_lock_init(&ret->lock);
+	lockdep_set_class(&ret->lock, &radix_node_class[height]);
+#endif
+	ret->height = height;
 	return ret;
 }
 
@@ -205,6 +221,22 @@ static inline int any_tag_set(struct rad
 	return 0;
 }
 
+static inline int any_tag_set_but(struct radix_tree_node *node,
+		unsigned int tag, int offset)
+{
+	int idx;
+	int offset_idx = offset / BITS_PER_LONG;
+	unsigned long offset_mask = ~(1UL << (offset % BITS_PER_LONG));
+	for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
+		unsigned long mask = ~0UL;
+		if (idx == offset_idx)
+			mask = offset_mask;
+		if (node->tags[tag][idx] & mask)
+			return 1;
+	}
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -234,8 +266,8 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
-		unsigned int newheight;
-		if (!(node = radix_tree_node_alloc(root)))
+		unsigned int newheight = root->height + 1;
+		if (!(node = radix_tree_node_alloc(root, newheight)))
 			return -ENOMEM;
 
 		/* Increase the height.  */
@@ -247,8 +279,6 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
-		newheight = root->height+1;
-		node->height = newheight;
 		node->count = 1;
 		node = radix_tree_ptr_to_indirect(node);
 		rcu_assign_pointer(root->rnode, node);
@@ -258,6 +288,77 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+static inline struct radix_tree_context *
+radix_tree_get_context(struct radix_tree_root *root)
+{
+	struct radix_tree_context *context = NULL;
+
+	if (root->owner == current && root->context) {
+		context = root->context;
+		root->context = NULL;
+	}
+	return context;
+}
+
+#define RADIX_TREE_CONTEXT(context, root) \
+	struct radix_tree_context *context = radix_tree_get_context(root)
+
+static inline spinlock_t *radix_node_lock(struct radix_tree_root *root,
+		struct radix_tree_node *node)
+{
+	spinlock_t *locked = &node->lock;
+	spin_lock(locked);
+	return locked;
+}
+
+static inline void radix_ladder_lock(struct radix_tree_context *context,
+		struct radix_tree_node *node)
+{
+	if (context) {
+		struct radix_tree_root *root = context->root;
+		spinlock_t *locked = radix_node_lock(root, node);
+		if (locked) {
+			spin_unlock(context->locked);
+			context->locked = locked;
+		}
+	}
+}
+
+static inline void radix_path_init(struct radix_tree_context *context,
+		struct radix_tree_path *pathp)
+{
+	pathp->locked = context ? context->locked : NULL;
+}
+
+static inline void radix_path_lock(struct radix_tree_context *context,
+		struct radix_tree_path *pathp, struct radix_tree_node *node)
+{
+	if (context) {
+		struct radix_tree_root *root = context->root;
+		spinlock_t *locked = radix_node_lock(root, node);
+		if (locked)
+			context->locked = locked;
+		pathp->locked = locked;
+	} else
+		pathp->locked = NULL;
+}
+
+static inline void radix_path_unlock(struct radix_tree_context *context,
+		struct radix_tree_path *punlock)
+{
+	if (context && punlock->locked &&
+			context->locked != punlock->locked)
+		spin_unlock(punlock->locked);
+}
+#else
+#define RADIX_TREE_CONTEXT(context, root) do { } while (0)
+#define radix_ladder_lock(context, node) do { } while (0)
+#define radix_path_init(context, pathp) do { } while (0)
+#define radix_path_lock(context, pathp, node) do { } while (0)
+#define radix_path_unlock(context, punlock) do { } while (0)
+#endif
+
 /**
  *	radix_tree_insert    -    insert into a radix tree
  *	@root:		radix tree root
@@ -273,6 +374,8 @@ int radix_tree_insert(struct radix_tree_
 	unsigned int height, shift;
 	int offset;
 	int error;
+	int tag;
+	RADIX_TREE_CONTEXT(context, root);
 
 	BUG_ON(radix_tree_is_indirect_ptr(item));
 
@@ -292,9 +395,8 @@ int radix_tree_insert(struct radix_tree_
 	while (height > 0) {
 		if (slot == NULL) {
 			/* Have to add a child node.  */
-			if (!(slot = radix_tree_node_alloc(root)))
+			if (!(slot = radix_tree_node_alloc(root, height)))
 				return -ENOMEM;
-			slot->height = height;
 			if (node) {
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
@@ -306,6 +408,9 @@ int radix_tree_insert(struct radix_tree_
 		/* Go a level down */
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		node = slot;
+
+		radix_ladder_lock(context, node);
+
 		slot = node->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
@@ -317,12 +422,12 @@ int radix_tree_insert(struct radix_tree_
 	if (node) {
 		node->count++;
 		rcu_assign_pointer(node->slots[offset], item);
-		BUG_ON(tag_get(node, 0, offset));
-		BUG_ON(tag_get(node, 1, offset));
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			BUG_ON(tag_get(node, tag, offset));
 	} else {
 		rcu_assign_pointer(root->rnode, item);
-		BUG_ON(root_tag_get(root, 0));
-		BUG_ON(root_tag_get(root, 1));
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			BUG_ON(root_tag_get(root, tag));
 	}
 
 	return 0;
@@ -346,6 +451,7 @@ void **radix_tree_lookup_slot(struct rad
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
+	RADIX_TREE_CONTEXT(context, root);
 
 	node = rcu_dereference(root->rnode);
 	if (node == NULL)
@@ -371,6 +477,8 @@ void **radix_tree_lookup_slot(struct rad
 		if (node == NULL)
 			return NULL;
 
+		radix_ladder_lock(context, node);
+
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	} while (height > 0);
@@ -446,6 +554,7 @@ void *radix_tree_tag_set(struct radix_tr
 {
 	unsigned int height, shift;
 	struct radix_tree_node *slot;
+	RADIX_TREE_CONTEXT(context, root);
 
 	height = root->height;
 	BUG_ON(index > radix_tree_maxindex(height));
@@ -453,9 +562,15 @@ void *radix_tree_tag_set(struct radix_tr
 	slot = radix_tree_indirect_to_ptr(root->rnode);
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 
+	/* set the root's tag bit */
+	if (slot && !root_tag_get(root, tag))
+		root_tag_set(root, tag);
+
 	while (height > 0) {
 		int offset;
 
+		radix_ladder_lock(context, slot);
+
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
@@ -465,14 +580,24 @@ void *radix_tree_tag_set(struct radix_tr
 		height--;
 	}
 
-	/* set the root's tag bit */
-	if (slot && !root_tag_get(root, tag))
-		root_tag_set(root, tag);
-
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_set);
 
+/*
+ * the change can never propagate upwards from here.
+ */
+static inline int radix_tree_unlock_tag(struct radix_tree_root *root,
+		struct radix_tree_path *pathp, int tag)
+{
+	int this, other;
+
+	this = tag_get(pathp->node, tag, pathp->offset);
+	other = any_tag_set_but(pathp->node, tag, pathp->offset);
+
+	return !this || other;
+}
+
 /**
  *	radix_tree_tag_clear - clear a tag on a radix tree node
  *	@root:		radix tree root
@@ -491,15 +616,19 @@ void *radix_tree_tag_clear(struct radix_
 			unsigned long index, unsigned int tag)
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
+	struct radix_tree_path *punlock = path, *piter;
 	struct radix_tree_node *slot = NULL;
 	unsigned int height, shift;
+	RADIX_TREE_CONTEXT(context, root);
+
+	pathp->node = NULL;
+	radix_path_init(context, pathp);
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
 		goto out;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	pathp->node = NULL;
 	slot = radix_tree_indirect_to_ptr(root->rnode);
 
 	while (height > 0) {
@@ -509,10 +638,17 @@ void *radix_tree_tag_clear(struct radix_
 			goto out;
 
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
-		pathp[1].offset = offset;
-		pathp[1].node = slot;
-		slot = slot->slots[offset];
 		pathp++;
+		pathp->offset = offset;
+		pathp->node = slot;
+		radix_path_lock(context, pathp, slot);
+
+		if (radix_tree_unlock_tag(root, pathp, tag)) {
+			for (; punlock < pathp; punlock++)
+				radix_path_unlock(context, punlock);
+		}
+
+		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
@@ -520,20 +656,22 @@ void *radix_tree_tag_clear(struct radix_
 	if (slot == NULL)
 		goto out;
 
-	while (pathp->node) {
-		if (!tag_get(pathp->node, tag, pathp->offset))
-			goto out;
-		tag_clear(pathp->node, tag, pathp->offset);
-		if (any_tag_set(pathp->node, tag))
-			goto out;
-		pathp--;
+	for (piter = pathp; piter >= punlock; piter--) {
+		if (piter->node) {
+			if (!tag_get(piter->node, tag, piter->offset))
+				break;
+			tag_clear(piter->node, tag, piter->offset);
+			if (any_tag_set(piter->node, tag))
+				break;
+		} else {
+			if (root_tag_get(root, tag))
+				root_tag_clear(root, tag);
+		}
 	}
 
-	/* clear the root's tag bit */
-	if (root_tag_get(root, tag))
-		root_tag_clear(root, tag);
-
 out:
+	for (; punlock < pathp; punlock++)
+		radix_path_unlock(context, punlock);
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_clear);
@@ -955,7 +1093,7 @@ radix_tree_gang_lookup_tag_slot(struct r
 	if (!radix_tree_is_indirect_ptr(node)) {
 		if (first_index > 0)
 			return 0;
-		results[0] = node;
+		results[0] = (void **)&root->rnode;
 		return 1;
 	}
 	node = radix_tree_indirect_to_ptr(node);
@@ -991,6 +1129,7 @@ static inline void radix_tree_shrink(str
 	while (root->height > 0) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
+		int tag;
 
 		BUG_ON(!radix_tree_is_indirect_ptr(to_free));
 		to_free = radix_tree_indirect_to_ptr(to_free);
@@ -1017,14 +1156,29 @@ static inline void radix_tree_shrink(str
 		root->rnode = newptr;
 		root->height--;
 		/* must only free zeroed nodes into the slab */
-		tag_clear(to_free, 0, 0);
-		tag_clear(to_free, 1, 0);
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			tag_clear(to_free, tag, 0);
 		to_free->slots[0] = NULL;
 		to_free->count = 0;
-		radix_tree_node_free(to_free);
 	}
 }
 
+static inline int radix_tree_unlock_all(struct radix_tree_root *root,
+		struct radix_tree_path *pathp)
+{
+	int tag;
+	int unlock = 1;
+
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (!radix_tree_unlock_tag(root, pathp, tag)) {
+			unlock = 0;
+			break;
+		}
+	}
+
+	return unlock;
+}
+
 /**
  *	radix_tree_delete    -    delete an item from a radix tree
  *	@root:		radix tree root
@@ -1037,11 +1191,15 @@ static inline void radix_tree_shrink(str
 void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
+	struct radix_tree_path *punlock = path, *piter;
 	struct radix_tree_node *slot = NULL;
-	struct radix_tree_node *to_free;
 	unsigned int height, shift;
 	int tag;
 	int offset;
+	RADIX_TREE_CONTEXT(context, root);
+
+	pathp->node = NULL;
+	radix_path_init(context, pathp);
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
@@ -1056,7 +1214,6 @@ void *radix_tree_delete(struct radix_tre
 	slot = radix_tree_indirect_to_ptr(slot);
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	pathp->node = NULL;
 
 	do {
 		if (slot == NULL)
@@ -1066,6 +1223,13 @@ void *radix_tree_delete(struct radix_tre
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		pathp->offset = offset;
 		pathp->node = slot;
+		radix_path_lock(context, pathp, slot);
+
+		if (slot->count > 2 && radix_tree_unlock_all(root, pathp)) {
+			for (; punlock < pathp; punlock++)
+				radix_path_unlock(context, punlock);
+		}
+
 		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
@@ -1078,41 +1242,45 @@ void *radix_tree_delete(struct radix_tre
 	 * Clear all tags associated with the just-deleted item
 	 */
 	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-		if (tag_get(pathp->node, tag, pathp->offset))
-			radix_tree_tag_clear(root, index, tag);
+		for (piter = pathp; piter >= punlock; piter--) {
+			if (piter->node) {
+				if (!tag_get(piter->node, tag, piter->offset))
+					break;
+				tag_clear(piter->node, tag, piter->offset);
+				if (any_tag_set(piter->node, tag))
+					break;
+			} else {
+				if (root_tag_get(root, tag))
+					root_tag_clear(root, tag);
+			}
+		}
 	}
 
-	to_free = NULL;
-	/* Now free the nodes we do not need anymore */
-	while (pathp->node) {
-		pathp->node->slots[pathp->offset] = NULL;
-		pathp->node->count--;
-		/*
-		 * Queue the node for deferred freeing after the
-		 * last reference to it disappears (set NULL, above).
-		 */
-		if (to_free)
-			radix_tree_node_free(to_free);
+	/* Now unhook the nodes we do not need anymore */
+	for (piter = pathp; piter >= punlock && piter->node; piter--) {
+		piter->node->slots[piter->offset] = NULL;
+		piter->node->count--;
 
-		if (pathp->node->count) {
-			if (pathp->node ==
+		if (piter->node->count) {
+			if (piter->node ==
 					radix_tree_indirect_to_ptr(root->rnode))
 				radix_tree_shrink(root);
 			goto out;
 		}
+	}
 
-		/* Node with zero slots in use so free it */
-		to_free = pathp->node;
-		pathp--;
+	BUG_ON(piter->node);
 
-	}
 	root_tag_clear_all(root);
 	root->height = 0;
 	root->rnode = NULL;
-	if (to_free)
-		radix_tree_node_free(to_free);
 
 out:
+	for (; punlock <= pathp; punlock++) {
+		radix_path_unlock(context, punlock);
+		if (punlock->node && punlock->node->count == 0)
+			radix_tree_node_free(punlock->node);
+	}
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_delete);
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig
+++ linux-2.6/init/Kconfig
@@ -316,6 +316,10 @@ config TASK_IO_ACCOUNTING
 config SYSCTL
 	bool
 
+config RADIX_TREE_CONCURRENT
+	bool "Enable concurrent radix tree operations (EXPERIMENTAL)"
+	default y if SMP
+
 menuconfig EMBEDDED
 	bool "Configure standard kernel features (for small systems)"
 	help

--


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 11/14] atomic_ulong_t
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (9 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 10/14] radix-tree: concurrent write side support Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-29 17:11   ` Christoph Lameter
  2007-01-28 13:13 ` [PATCH 12/14] mm/fs: abstract address_space::nrpages Peter Zijlstra
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: atomic_ulong_t.patch --]
[-- Type: text/plain, Size: 1611 bytes --]

Provide an unsigned long atomic type.
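
A minimal usage sketch, with a hypothetical nr_items counter (not part of
this patch):

  static atomic_ulong_t nr_items = ATOMIC_ULONG_INIT(0);

  static void add_item(void)
  {
          atomic_ulong_inc(&nr_items);
  }

  static unsigned long item_count(void)
  {
          return atomic_ulong_read(&nr_items);
  }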

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/asm-generic/atomic.h |   45 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

Index: linux-2.6-git2/include/asm-generic/atomic.h
===================================================================
--- linux-2.6-git2.orig/include/asm-generic/atomic.h	2006-12-15 14:13:20.000000000 +0100
+++ linux-2.6-git2/include/asm-generic/atomic.h	2006-12-20 22:28:23.000000000 +0100
@@ -115,4 +115,49 @@ static inline void atomic_long_sub(long 
 
 #endif  /*  BITS_PER_LONG == 64  */
 
+typedef atomic_long_t atomic_ulong_t;
+
+#define ATOMIC_ULONG_INIT(i)	ATOMIC_LONG_INIT(i)
+static inline unsigned long atomic_ulong_read(atomic_ulong_t *l)
+{
+	atomic_long_t *v = (atomic_long_t *)l;
+
+	return (unsigned long)atomic_long_read(v);
+}
+
+static inline void atomic_ulong_set(atomic_ulong_t *l, unsigned long i)
+{
+	atomic_long_t *v = (atomic_long_t *)l;
+
+	atomic_long_set(v, i);
+}
+
+static inline void atomic_ulong_inc(atomic_ulong_t *l)
+{
+	atomic_long_t *v = (atomic_long_t *)l;
+
+	atomic_long_inc(v);
+}
+
+static inline void atomic_ulong_dec(atomic_ulong_t *l)
+{
+	atomic_long_t *v = (atomic_long_t *)l;
+
+	atomic_long_dec(v);
+}
+
+static inline void atomic_ulong_add(unsigned long i, atomic_ulong_t *l)
+{
+	atomic_long_t *v = (atomic_long_t *)l;
+
+	atomic_long_add(i, v);
+}
+
+static inline void atomic_ulong_sub(unsigned long i, atomic_ulong_t *l)
+{
+	atomic_long_t *v = (atomic_long_t *)l;
+
+	atomic_long_sub(i, v);
+}
+
 #endif  /*  _ASM_GENERIC_ATOMIC_H  */

--


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 12/14] mm/fs: abstract address_space::nrpages
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (10 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 11/14] atomic_ulong_t Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 13/14] mm: lock_page_ref Peter Zijlstra
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: mapping_nrpages.patch --]
[-- Type: text/plain, Size: 19357 bytes --]

Currently the tree_lock protects mapping->nrpages; this will not be
possible much longer. Hence abstract the access to this variable so that
it can easily be replaced by an atomic_ulong_t.
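
For illustration only (not part of this patch, and assuming __nrpages is
later redeclared as an atomic_ulong_t), the accessors introduced here
could then become:

  static inline unsigned long mapping_nrpages(struct address_space *mapping)
  {
          return atomic_ulong_read(&mapping->__nrpages);
  }

  static inline void mapping_nrpages_inc(struct address_space *mapping)
  {
          atomic_ulong_inc(&mapping->__nrpages);
  }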

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sh64/lib/dbg.c          |    2 +-
 fs/block_dev.c               |    2 +-
 fs/buffer.c                  |    2 +-
 fs/gfs2/glock.c              |    2 +-
 fs/gfs2/glops.c              |    6 +++---
 fs/gfs2/meta_io.c            |    2 +-
 fs/hugetlbfs/inode.c         |    2 +-
 fs/inode.c                   |   10 +++++-----
 fs/jffs/inode-v23.c          |    2 +-
 fs/jffs2/dir.c               |    4 ++--
 fs/jffs2/fs.c                |    2 +-
 fs/libfs.c                   |    2 +-
 fs/nfs/inode.c               |    6 +++---
 fs/xfs/linux-2.6/xfs_vnode.h |    2 +-
 include/linux/fs.h           |   22 +++++++++++++++++++++-
 include/linux/swap.h         |    2 +-
 ipc/shm.c                    |    4 ++--
 mm/filemap.c                 |   12 ++++++------
 mm/shmem.c                   |    8 ++++----
 mm/swap_state.c              |    4 ++--
 mm/truncate.c                |    2 +-
 21 files changed, 60 insertions(+), 40 deletions(-)

Index: linux-2.6/arch/sh64/lib/dbg.c
===================================================================
--- linux-2.6.orig/arch/sh64/lib/dbg.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/arch/sh64/lib/dbg.c	2007-01-28 14:02:36.000000000 +0100
@@ -424,6 +424,6 @@ void print_page(struct page *page)
 	printk("  page[%p] -> index 0x%lx,  count 0x%x,  flags 0x%lx\n",
 	       page, page->index, page_count(page), page->flags);
 	printk("       address_space = %p, pages =%ld\n", page->mapping,
-	       page->mapping->nrpages);
+	       mapping_nrpages(page->mapping));
 
 }
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2007-01-27 20:36:02.000000000 +0100
+++ linux-2.6/fs/block_dev.c	2007-01-28 14:02:36.000000000 +0100
@@ -554,7 +554,7 @@ long nr_blockdev_pages(void)
 	list_for_each(p, &all_bdevs) {
 		struct block_device *bdev;
 		bdev = list_entry(p, struct block_device, bd_list);
-		ret += bdev->bd_inode->i_mapping->nrpages;
+		ret += mapping_nrpages(bdev->bd_inode->i_mapping);
 	}
 	spin_unlock(&bdev_lock);
 	return ret;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/fs/buffer.c	2007-01-28 14:02:36.000000000 +0100
@@ -336,7 +336,7 @@ void invalidate_bdev(struct block_device
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping_nrpages(mapping) == 0)
 		return;
 
 	invalidate_bh_lrus();
Index: linux-2.6/fs/gfs2/glock.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glock.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/gfs2/glock.c	2007-01-28 14:02:36.000000000 +0100
@@ -2110,7 +2110,7 @@ static int dump_glock(struct gfs2_glock 
 		    (list_empty(&gl->gl_reclaim)) ? "no" : "yes");
 	if (gl->gl_aspace)
 		printk(KERN_INFO "  aspace = 0x%p nrpages = %lu\n", gl->gl_aspace,
-		       gl->gl_aspace->i_mapping->nrpages);
+		       mapping_nrpages(gl->gl_aspace->i_mapping));
 	else
 		printk(KERN_INFO "  aspace = no\n");
 	printk(KERN_INFO "  ail = %d\n", atomic_read(&gl->gl_ail_count));
Index: linux-2.6/fs/gfs2/glops.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glops.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/gfs2/glops.c	2007-01-28 14:02:36.000000000 +0100
@@ -243,7 +243,7 @@ static void inode_go_inval(struct gfs2_g
 
 	if (ip && S_ISREG(ip->i_inode.i_mode)) {
 		truncate_inode_pages(ip->i_inode.i_mapping, 0);
-		gfs2_assert_withdraw(GFS2_SB(&ip->i_inode), !ip->i_inode.i_mapping->nrpages);
+		gfs2_assert_withdraw(GFS2_SB(&ip->i_inode), !mapping_nrpages(ip->i_inode.i_mapping));
 		clear_bit(GIF_PAGED, &ip->i_flags);
 	}
 }
@@ -260,7 +260,7 @@ static int inode_go_demote_ok(struct gfs
 	struct gfs2_sbd *sdp = gl->gl_sbd;
 	int demote = 0;
 
-	if (!gl->gl_object && !gl->gl_aspace->i_mapping->nrpages)
+	if (!gl->gl_object && !mapping_nrpages(gl->gl_aspace->i_mapping))
 		demote = 1;
 	else if (!sdp->sd_args.ar_localcaching &&
 		 time_after_eq(jiffies, gl->gl_stamp +
@@ -360,7 +360,7 @@ static void inode_greedy(struct gfs2_glo
 
 static int rgrp_go_demote_ok(struct gfs2_glock *gl)
 {
-	return !gl->gl_aspace->i_mapping->nrpages;
+	return !mapping_nrpages(gl->gl_aspace->i_mapping);
 }
 
 /**
Index: linux-2.6/fs/gfs2/meta_io.c
===================================================================
--- linux-2.6.orig/fs/gfs2/meta_io.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/gfs2/meta_io.c	2007-01-28 14:02:36.000000000 +0100
@@ -104,7 +104,7 @@ void gfs2_meta_inval(struct gfs2_glock *
 	truncate_inode_pages(mapping, 0);
 	atomic_dec(&aspace->i_writecount);
 
-	gfs2_assert_withdraw(sdp, !mapping->nrpages);
+	gfs2_assert_withdraw(sdp, !mapping_nrpages(mapping));
 }
 
 /**
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/hugetlbfs/inode.c	2007-01-28 14:02:36.000000000 +0100
@@ -214,7 +214,7 @@ static void truncate_hugepages(struct in
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
+	BUG_ON(!lstart && mapping_nrpages(mapping));
 	hugetlb_unreserve_pages(inode, start, freed);
 }
 
Index: linux-2.6/fs/jffs/inode-v23.c
===================================================================
--- linux-2.6.orig/fs/jffs/inode-v23.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/jffs/inode-v23.c	2007-01-28 14:02:36.000000000 +0100
@@ -1352,7 +1352,7 @@ jffs_create(struct inode *dir, struct de
 	inode->i_op = &jffs_file_inode_operations;
 	inode->i_fop = &jffs_file_operations;
 	inode->i_mapping->a_ops = &jffs_address_operations;
-	inode->i_mapping->nrpages = 0;
+	mapping_nrpages_init(inode->i_mapping);
 
 	d_instantiate(dentry, inode);
  jffs_create_end:
Index: linux-2.6/fs/jffs2/dir.c
===================================================================
--- linux-2.6.orig/fs/jffs2/dir.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/jffs2/dir.c	2007-01-28 14:02:36.000000000 +0100
@@ -206,7 +206,7 @@ static int jffs2_create(struct inode *di
 	inode->i_op = &jffs2_file_inode_operations;
 	inode->i_fop = &jffs2_file_operations;
 	inode->i_mapping->a_ops = &jffs2_file_address_operations;
-	inode->i_mapping->nrpages = 0;
+	mapping_nrpages_init(inode->i_mapping);
 
 	f = JFFS2_INODE_INFO(inode);
 	dir_f = JFFS2_INODE_INFO(dir_i);
@@ -230,7 +230,7 @@ static int jffs2_create(struct inode *di
 	d_instantiate(dentry, inode);
 
 	D1(printk(KERN_DEBUG "jffs2_create: Created ino #%lu with mode %o, nlink %d(%d). nrpages %ld\n",
-		  inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, inode->i_mapping->nrpages));
+		  inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, mapping_nrpages(inode->i_mapping)));
 	return 0;
 
  fail:
Index: linux-2.6/fs/jffs2/fs.c
===================================================================
--- linux-2.6.orig/fs/jffs2/fs.c	2007-01-27 20:36:02.000000000 +0100
+++ linux-2.6/fs/jffs2/fs.c	2007-01-28 14:02:36.000000000 +0100
@@ -293,7 +293,7 @@ void jffs2_read_inode (struct inode *ino
 		inode->i_op = &jffs2_file_inode_operations;
 		inode->i_fop = &jffs2_file_operations;
 		inode->i_mapping->a_ops = &jffs2_file_address_operations;
-		inode->i_mapping->nrpages = 0;
+		mapping_nrpages_init(inode->i_mapping);
 		break;
 
 	case S_IFBLK:
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/libfs.c	2007-01-28 14:02:36.000000000 +0100
@@ -16,7 +16,7 @@ int simple_getattr(struct vfsmount *mnt,
 {
 	struct inode *inode = dentry->d_inode;
 	generic_fillattr(inode, stat);
-	stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9);
+	stat->blocks = mapping_nrpages(inode->i_mapping) << (PAGE_CACHE_SHIFT - 9);
 	return 0;
 }
 
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c	2007-01-27 20:36:03.000000000 +0100
+++ linux-2.6/fs/nfs/inode.c	2007-01-28 14:03:13.000000000 +0100
@@ -93,7 +93,7 @@ int nfs_sync_mapping(struct address_spac
 {
 	int ret;
 
-	if (mapping->nrpages == 0)
+	if (mapping_nrpages(mapping) == 0)
 		return 0;
 	unmap_mapping_range(mapping, 0, 0, 0);
 	ret = filemap_write_and_wait(mapping);
@@ -133,7 +133,7 @@ void nfs_zap_caches(struct inode *inode)
 
 void nfs_zap_mapping(struct inode *inode, struct address_space *mapping)
 {
-	if (mapping->nrpages != 0) {
+	if (mapping_nrpages(mapping) != 0) {
 		spin_lock(&inode->i_lock);
 		NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
 		spin_unlock(&inode->i_lock);
@@ -669,7 +669,7 @@ static int nfs_invalidate_mapping_nolock
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
 	
-	if (mapping->nrpages != 0) {
+	if (mapping_nrpages(mapping) != 0) {
 		int ret = invalidate_inode_pages2(mapping);
 		if (ret < 0)
 			return ret;
Index: linux-2.6/fs/xfs/linux-2.6/xfs_vnode.h
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_vnode.h	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/fs/xfs/linux-2.6/xfs_vnode.h	2007-01-28 14:02:36.000000000 +0100
@@ -548,7 +548,7 @@ static inline void vn_atime_to_time_t(bh
  * Some useful predicates.
  */
 #define VN_MAPPED(vp)	mapping_mapped(vn_to_inode(vp)->i_mapping)
-#define VN_CACHED(vp)	(vn_to_inode(vp)->i_mapping->nrpages)
+#define VN_CACHED(vp)	mapping_nrpages(vn_to_inode(vp)->i_mapping)
 #define VN_DIRTY(vp)	mapping_tagged(vn_to_inode(vp)->i_mapping, \
 					PAGECACHE_TAG_DIRTY)
 #define VN_TRUNC(vp)	((vp)->v_flag & VTRUNCATED)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/include/linux/fs.h	2007-01-28 14:02:36.000000000 +0100
@@ -439,7 +439,7 @@ struct address_space {
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
-	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		__nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
@@ -454,6 +454,26 @@ struct address_space {
 	 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
 	 */
 
+static inline void mapping_nrpages_init(struct address_space *mapping)
+{
+	mapping->__nrpages = 0;
+}
+
+static inline unsigned long mapping_nrpages(struct address_space *mapping)
+{
+	return mapping->__nrpages;
+}
+
+static inline void mapping_nrpages_inc(struct address_space *mapping)
+{
+	mapping->__nrpages++;
+}
+
+static inline void mapping_nrpages_dec(struct address_space *mapping)
+{
+	mapping->__nrpages--;
+}
+
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c	2007-01-27 20:36:03.000000000 +0100
+++ linux-2.6/ipc/shm.c	2007-01-28 14:02:36.000000000 +0100
@@ -499,11 +499,11 @@ static void shm_get_stat(struct ipc_name
 
 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping_nrpages(mapping);
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);
-			*rss += inode->i_mapping->nrpages;
+			*rss += mapping_nrpages(inode->i_mapping);
 			*swp += info->swapped;
 			spin_unlock(&info->lock);
 		}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-28 14:02:36.000000000 +0100
@@ -118,7 +118,7 @@ void __remove_from_page_cache(struct pag
 
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
-	mapping->nrpages--;
+	mapping_nrpages_dec(mapping);
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 }
 
@@ -190,7 +190,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = mapping_nrpages(mapping) * 2,
 		.range_start = start,
 		.range_end = end,
 	};
@@ -372,7 +372,7 @@ int filemap_write_and_wait(struct addres
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping_nrpages(mapping)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -406,7 +406,7 @@ int filemap_write_and_wait_range(struct 
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping_nrpages(mapping)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
@@ -448,7 +448,7 @@ int add_to_page_cache(struct page *page,
 			SetPageLocked(page);
 			page->mapping = mapping;
 			page->index = offset;
-			mapping->nrpages++;
+			mapping_nrpages_inc(mapping);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		spin_unlock_irq(&mapping->tree_lock);
@@ -2477,7 +2477,7 @@ generic_file_direct_IO(int rw, struct ki
 	if (retval == 0) {
 		retval = mapping->a_ops->direct_IO(rw, iocb, iov,
 						offset, nr_segs);
-		if (rw == WRITE && mapping->nrpages) {
+		if (rw == WRITE && mapping_nrpages(mapping)) {
 			pgoff_t end = (offset + write_len - 1)
 						>> PAGE_CACHE_SHIFT;
 			int err = invalidate_inode_pages2_range(mapping,
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2007-01-22 22:43:27.000000000 +0100
+++ linux-2.6/mm/shmem.c	2007-01-28 14:02:36.000000000 +0100
@@ -211,8 +211,8 @@ static void shmem_free_blocks(struct ino
  * We have to calculate the free blocks since the mm can drop
  * undirtied hole pages behind our back.
  *
- * But normally   info->alloced == inode->i_mapping->nrpages + info->swapped
- * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
+ * But normally   info->alloced == mapping_nrpages(inode->i_mapping) + info->swapped
+ * So mm freed is info->alloced - (mapping_nrpages(inode->i_mapping) + info->swapped)
  *
  * It has to be called with the spinlock held.
  */
@@ -221,7 +221,7 @@ static void shmem_recalc_inode(struct in
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	long freed;
 
-	freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
+	freed = info->alloced - info->swapped - mapping_nrpages(inode->i_mapping);
 	if (freed > 0) {
 		info->alloced -= freed;
 		shmem_unacct_blocks(info->flags, freed);
@@ -607,7 +607,7 @@ static void shmem_truncate_range(struct 
 done1:
 	shmem_dir_unmap(dir);
 done2:
-	if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) {
+	if (mapping_nrpages(inode->i_mapping) && (info->flags & SHMEM_PAGEIN)) {
 		/*
 		 * Call truncate_inode_pages again: racing shmem_unuse_inode
 		 * may have swizzled a page in from swap since vmtruncate or
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/swap_state.c	2007-01-28 14:02:36.000000000 +0100
@@ -87,7 +87,7 @@ static int __add_to_swap_cache(struct pa
 			SetPageLocked(page);
 			SetPageSwapCache(page);
 			set_page_private(page, entry.val);
-			total_swapcache_pages++;
+			mapping_nrpages_inc(&swapper_space);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		spin_unlock_irq(&swapper_space.tree_lock);
@@ -133,7 +133,7 @@ void __delete_from_swap_cache(struct pag
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
-	total_swapcache_pages--;
+	mapping_nrpages_dec(&swapper_space);
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
 }
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/truncate.c	2007-01-28 14:02:36.000000000 +0100
@@ -163,7 +163,7 @@ void truncate_inode_pages_range(struct a
 	pgoff_t next;
 	int i;
 
-	if (mapping->nrpages == 0)
+	if (mapping_nrpages(mapping) == 0)
 		return;
 
 	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h	2007-01-22 22:43:26.000000000 +0100
+++ linux-2.6/include/linux/swap.h	2007-01-28 14:02:36.000000000 +0100
@@ -222,7 +222,7 @@ extern int end_swap_bio_read(struct bio 
 
 /* linux/mm/swap_state.c */
 extern struct address_space swapper_space;
-#define total_swapcache_pages  swapper_space.nrpages
+#define total_swapcache_pages  mapping_nrpages(&swapper_space)
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, gfp_t);
 extern void __delete_from_swap_cache(struct page *);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/fs/inode.c	2007-01-28 14:02:36.000000000 +0100
@@ -246,7 +246,7 @@ void clear_inode(struct inode *inode)
 	might_sleep();
 	invalidate_inode_buffers(inode);
        
-	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(mapping_nrpages(&inode->i_data));
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	wait_on_inode(inode);
@@ -279,7 +279,7 @@ static void dispose_list(struct list_hea
 		inode = list_entry(head->next, struct inode, i_list);
 		list_del(&inode->i_list);
 
-		if (inode->i_data.nrpages)
+		if (mapping_nrpages(&inode->i_data))
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 
@@ -371,7 +371,7 @@ static int can_unuse(struct inode *inode
 		return 0;
 	if (atomic_read(&inode->i_count))
 		return 0;
-	if (inode->i_data.nrpages)
+	if (mapping_nrpages(&inode->i_data))
 		return 0;
 	return 1;
 }
@@ -410,7 +410,7 @@ static void prune_icache(int nr_to_scan)
 			list_move(&inode->i_list, &inode_unused);
 			continue;
 		}
-		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+		if (inode_has_buffers(inode) || mapping_nrpages(&inode->i_data)) {
 			__iget(inode);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
@@ -1057,7 +1057,7 @@ static void generic_forget_inode(struct 
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
-	if (inode->i_data.nrpages)
+	if (mapping_nrpages(&inode->i_data))
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
 	wake_up_inode(inode);

--


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 13/14] mm: lock_page_ref
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (11 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 12/14] mm/fs: abstract address_space::nrpages Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-28 13:13 ` [PATCH 14/14] mm: concurrent pagecache write side Peter Zijlstra
  2007-01-29 17:20 ` [PATCH 00/14] Concurrent Page Cache Christoph Lameter
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: lock_page_ref.patch --]
[-- Type: text/plain, Size: 13130 bytes --]

Change the PG_nonewrefs operations into locking primitives and place them
so that they provide page-level serialization with regard to the page_tree
operations (basically, replace the tree_lock with a per-page lock).

The normal page lock has sufficiently different (and overlapping) scope and
protection rules that this second lock is needed.
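
Roughly, the intended interplay between the write side and the speculative
read side looks as follows (an illustrative sketch only, not part of the
patch; the writer mirrors remove_from_page_cache() below, and the reader
roughly follows page_cache_get_speculative() from patch 04):

	/* Writer: serialize a page_tree update on the page itself. */
	static void example_writer(struct address_space *mapping, struct page *page)
	{
		lock_page_ref_irq(page);		/* take PG_nonewrefs as a bit spinlock */
		spin_lock(&mapping->tree_lock);		/* the tree_lock stays until patch 14 */
		__remove_from_page_cache(page);
		spin_unlock(&mapping->tree_lock);
		unlock_page_ref_irq(page);		/* speculative readers may proceed again */
	}

	/* Reader: only trust the speculative reference once no writer holds the bit. */
	static int example_reader(struct page *page)
	{
		if (!get_page_unless_zero(page))
			return 0;			/* the page was already free */
		wait_on_unlock_page_ref(page);
		/* ... re-check page->mapping and page->index as the real code does ... */
		return 1;
	}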

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c             |    6 ++++--
 include/linux/pagemap.h |   44 ++++++++++++++++++++++++++++++++------------
 mm/filemap.c            |   14 ++++++++------
 mm/migrate.c            |   12 ++++++------
 mm/page-writeback.c     |   18 ++++++++++++------
 mm/swap_state.c         |   14 ++++++++------
 mm/swapfile.c           |    6 ++++--
 mm/truncate.c           |    9 ++++++---
 mm/vmscan.c             |   14 +++++++-------
 9 files changed, 87 insertions(+), 50 deletions(-)

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-01-28 13:59:23.000000000 +0100
+++ linux-2.6/include/linux/pagemap.h	2007-01-28 14:03:21.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/gfp.h>
 #include <linux/page-flags.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/bit_spinlock.h>
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -53,28 +54,47 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
-static inline void set_page_no_new_refs(struct page *page)
+static inline void lock_page_ref(struct page *page)
 {
-	VM_BUG_ON(PageNoNewRefs(page));
-	preempt_disable();
-	SetPageNoNewRefs(page);
+	bit_spin_lock(PG_nonewrefs, &page->flags);
 	smp_wmb();
 }
 
-static inline void end_page_no_new_refs(struct page *page)
+static inline void unlock_page_ref(struct page *page)
 {
-	VM_BUG_ON(!PageNoNewRefs(page));
-	smp_wmb();
-	ClearPageNoNewRefs(page);
-	preempt_enable();
+	bit_spin_unlock(PG_nonewrefs, &page->flags);
 }
 
-static inline void wait_on_new_refs(struct page *page)
+static inline void wait_on_unlock_page_ref(struct page *page)
 {
-	while (unlikely(PageNoNewRefs(page)))
+	while (unlikely(test_bit(PG_nonewrefs, &page->flags)))
 		cpu_relax();
 }
 
+#define lock_page_ref_irq(page)					\
+	do {							\
+		local_irq_disable();				\
+		lock_page_ref(page);				\
+	} while (0)
+
+#define unlock_page_ref_irq(page)				\
+	do {							\
+		unlock_page_ref(page);				\
+		local_irq_enable();				\
+	} while (0)
+
+#define lock_page_ref_irqsave(page, flags)			\
+	do {							\
+		local_irq_save(flags);				\
+		lock_page_ref(page);				\
+	} while (0)
+
+#define unlock_page_ref_irqrestore(page, flags)			\
+	do {							\
+		unlock_page_ref(page);				\
+		local_irq_restore(flags);			\
+	} while (0)
+
 /*
  * speculatively take a reference to a page.
  * If the page is free (_count == 0), then _count is untouched, and 0
@@ -150,7 +170,7 @@ static inline int page_cache_get_specula
 	 * page refcount has been raised. See below comment.
 	 */
 
-	wait_on_new_refs(page);
+	wait_on_unlock_page_ref(page);
 
 	/*
 	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-01-28 14:02:36.000000000 +0100
+++ linux-2.6/mm/filemap.c	2007-01-28 14:03:21.000000000 +0100
@@ -128,9 +128,11 @@ void remove_from_page_cache(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 }
 
 static int sync_page(void *word)
@@ -440,8 +442,8 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
-		set_page_no_new_refs(page);
-		spin_lock_irq(&mapping->tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			page_cache_get(page);
@@ -451,8 +453,8 @@ int add_to_page_cache(struct page *page,
 			mapping_nrpages_inc(mapping);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock_irq(&mapping->tree_lock);
-		end_page_no_new_refs(page);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/migrate.c	2007-01-28 14:03:21.000000000 +0100
@@ -303,16 +303,16 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
-	set_page_no_new_refs(page);
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
-		spin_unlock_irq(&mapping->tree_lock);
-		end_page_no_new_refs(page);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irq(page);
 		return -EAGAIN;
 	}
 
@@ -329,8 +329,8 @@ static int migrate_page_move_mapping(str
 
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
-  	spin_unlock_irq(&mapping->tree_lock);
-	end_page_no_new_refs(page);
+  	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 
 	/*
 	 * Drop cache reference from old page.
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-01-28 14:02:36.000000000 +0100
+++ linux-2.6/mm/swap_state.c	2007-01-28 14:03:21.000000000 +0100
@@ -78,8 +78,8 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
-		set_page_no_new_refs(page);
-		spin_lock_irq(&swapper_space.tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
@@ -90,8 +90,8 @@ static int __add_to_swap_cache(struct pa
 			mapping_nrpages_inc(&swapper_space);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock_irq(&swapper_space.tree_lock);
-		end_page_no_new_refs(page);
+		spin_unlock(&swapper_space.tree_lock);
+		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
 	return error;
@@ -202,9 +202,11 @@ void delete_from_swap_cache(struct page 
 
 	entry.val = page_private(page);
 
-	spin_lock_irq(&swapper_space.tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock_irq(&swapper_space.tree_lock);
+	spin_unlock(&swapper_space.tree_lock);
+	unlock_page_ref_irq(page);
 
 	swap_free(entry);
 	page_cache_release(page);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/vmscan.c	2007-01-28 14:03:21.000000000 +0100
@@ -390,8 +390,8 @@ int remove_mapping(struct address_space 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	set_page_no_new_refs(page);
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -426,22 +426,22 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
 
 free_it:
-	end_page_no_new_refs(page);
+	unlock_page_ref_irq(page);
 	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
-	spin_unlock_irq(&mapping->tree_lock);
-	end_page_no_new_refs(page);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-01-28 14:02:36.000000000 +0100
+++ linux-2.6/fs/buffer.c	2007-01-28 14:03:21.000000000 +0100
@@ -728,7 +728,8 @@ int __set_page_dirty_buffers(struct page
 	if (TestSetPageDirty(page))
 		return 0;
 
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
@@ -737,7 +738,8 @@ int __set_page_dirty_buffers(struct page
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return 1;
 }
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/page-writeback.c	2007-01-28 14:03:21.000000000 +0100
@@ -765,7 +765,8 @@ int __set_page_dirty_nobuffers(struct pa
 		if (!mapping)
 			return 1;
 
-		spin_lock_irq(&mapping->tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -776,7 +777,8 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irq(page);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -913,13 +915,15 @@ int test_clear_page_writeback(struct pag
 	if (mapping) {
 		unsigned long flags;
 
-		spin_lock_irqsave(&mapping->tree_lock, flags);
+		lock_page_ref_irqsave(page, flags);
+		spin_lock(&mapping->tree_lock);
 		ret = TestClearPageWriteback(page);
 		if (ret)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irqrestore(page, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
 	}
@@ -934,7 +938,8 @@ int test_set_page_writeback(struct page 
 	if (mapping) {
 		unsigned long flags;
 
-		spin_lock_irqsave(&mapping->tree_lock, flags);
+		lock_page_ref_irqsave(page, flags);
+		spin_lock(&mapping->tree_lock);
 		ret = TestSetPageWriteback(page);
 		if (!ret)
 			radix_tree_tag_set(&mapping->page_tree,
@@ -944,7 +949,8 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irqrestore(page, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
 	}
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2007-01-28 13:59:24.000000000 +0100
+++ linux-2.6/mm/swapfile.c	2007-01-28 14:03:21.000000000 +0100
@@ -367,13 +367,15 @@ int remove_exclusive_swap_page(struct pa
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
-		spin_lock_irq(&swapper_space.tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock(&swapper_space.tree_lock);
+		unlock_page_ref_irq(page);
 	}
 	spin_unlock(&swap_lock);
 
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-01-28 14:02:36.000000000 +0100
+++ linux-2.6/mm/truncate.c	2007-01-28 14:03:21.000000000 +0100
@@ -333,18 +333,21 @@ invalidate_complete_page2(struct address
 	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	return 0;
 }
 

--


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 14/14] mm: concurrent pagecache write side
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (12 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 13/14] mm: lock_page_ref Peter Zijlstra
@ 2007-01-28 13:13 ` Peter Zijlstra
  2007-01-29 17:20 ` [PATCH 00/14] Concurrent Page Cache Christoph Lameter
  14 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-28 13:13 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Andrew Morton, Nick Piggin, Christoph Lameter, Ingo Molnar,
	Rik van Riel, Peter Zijlstra

[-- Attachment #1: mm-concurrent-pagecache.patch --]
[-- Type: text/plain, Size: 13087 bytes --]

Remove the tree_lock, change address_space::nrpages to atomic_ulong_t
because it is no longer protected by a lock, and use the concurrent radix tree
API to protect the modifying radix tree operations.
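
The conversion below repeatedly applies one pattern; sketched here for
reference only, not as an additional change (DECLARE_RADIX_TREE_CONTEXT,
radix_tree_lock() and radix_tree_unlock() come from patch 10; my reading is
that radix_tree_lock() takes the root lock and the per-node locks are then
taken hand-over-hand during the modification):

	static int example_insert(struct address_space *mapping,
				  struct page *page, pgoff_t offset)
	{
		int error;
		DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);

		lock_page_ref_irq(page);	/* per-page serialization from patch 13 */
		radix_tree_lock(&ctx);
		error = radix_tree_insert(&mapping->page_tree, offset, page);
		radix_tree_unlock(&ctx);	/* drops whatever node lock is still held */
		if (!error)
			mapping_nrpages_inc(mapping);	/* atomic now, no tree_lock needed */
		unlock_page_ref_irq(page);

		return error;
	}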

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c         |    6 ++++--
 fs/inode.c          |    1 -
 include/linux/fs.h  |   11 +++++------
 mm/filemap.c        |   13 +++++++------
 mm/migrate.c        |    9 +++++----
 mm/page-writeback.c |   28 +++++++++++++++++++---------
 mm/swap_state.c     |   13 ++++++++-----
 mm/swapfile.c       |    2 --
 mm/truncate.c       |    3 ---
 mm/vmscan.c         |    4 ----
 10 files changed, 48 insertions(+), 42 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -729,16 +729,18 @@ int __set_page_dirty_buffers(struct page
 		return 0;
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
+		DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
+		radix_tree_lock(&ctx);
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+		radix_tree_unlock(&ctx);
 	}
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return 1;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c
+++ linux-2.6/fs/inode.c
@@ -193,7 +193,6 @@ void inode_init_once(struct inode *inode
 	mutex_init(&inode->i_mutex);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	spin_lock_init(&inode->i_data.tree_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -433,13 +433,12 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	spinlock_t		tree_lock;	/* and rwlock protecting it */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
-	unsigned long		__nrpages;	/* number of total pages */
+	atomic_ulong_t		__nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
@@ -456,22 +455,22 @@ struct address_space {
 
 static inline void mapping_nrpages_init(struct address_space *mapping)
 {
-	mapping->__nrpages = 0;
+	mapping->__nrpages = (atomic_ulong_t)ATOMIC_ULONG_INIT(0);
 }
 
 static inline unsigned long mapping_nrpages(struct address_space *mapping)
 {
-	return mapping->__nrpages;
+	return atomic_ulong_read(&mapping->__nrpages);
 }
 
 static inline void mapping_nrpages_inc(struct address_space *mapping)
 {
-	mapping->__nrpages++;
+	atomic_ulong_inc(&mapping->__nrpages);
 }
 
 static inline void mapping_nrpages_dec(struct address_space *mapping)
 {
-	mapping->__nrpages--;
+	atomic_ulong_dec(&mapping->__nrpages);
 }
 
 struct block_device {
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -115,8 +115,11 @@ generic_file_direct_IO(int rw, struct ki
 void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
 
+	radix_tree_lock(&ctx);
 	radix_tree_delete(&mapping->page_tree, page->index);
+	radix_tree_unlock(&ctx);
 	page->mapping = NULL;
 	mapping_nrpages_dec(mapping);
 	__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -124,14 +127,10 @@ void __remove_from_page_cache(struct pag
 
 void remove_from_page_cache(struct page *page)
 {
-	struct address_space *mapping = page->mapping;
-
 	BUG_ON(!PageLocked(page));
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 }
 
@@ -442,9 +441,12 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
 		lock_page_ref_irq(page);
-		spin_lock(&mapping->tree_lock);
+		radix_tree_lock(&ctx);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
+		radix_tree_unlock(&ctx);
 		if (!error) {
 			page_cache_get(page);
 			SetPageLocked(page);
@@ -453,7 +455,6 @@ int add_to_page_cache(struct page *page,
 			mapping_nrpages_inc(mapping);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock(&mapping->tree_lock);
 		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c
+++ linux-2.6/mm/migrate.c
@@ -295,6 +295,7 @@ static int migrate_page_move_mapping(str
 		struct page *newpage, struct page *page)
 {
 	void **pslot;
+	struct radix_tree_context ctx;
 
 	if (!mapping) {
 		/* Anonymous page */
@@ -303,15 +304,15 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
+	init_radix_tree_context(&ctx, &mapping->page_tree);
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
-
+	radix_tree_lock(&ctx);
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
-		spin_unlock(&mapping->tree_lock);
+		radix_tree_unlock(&ctx);
 		unlock_page_ref_irq(page);
 		return -EAGAIN;
 	}
@@ -329,7 +330,7 @@ static int migrate_page_move_mapping(str
 
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
-  	spin_unlock(&mapping->tree_lock);
+	radix_tree_unlock(&ctx);
 	unlock_page_ref_irq(page);
 
 	/*
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c
+++ linux-2.6/mm/page-writeback.c
@@ -766,18 +766,20 @@ int __set_page_dirty_nobuffers(struct pa
 			return 1;
 
 		lock_page_ref_irq(page);
-		spin_lock(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
+			DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
+			radix_tree_lock(&ctx);
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+			radix_tree_unlock(&ctx);
 		}
-		spin_unlock(&mapping->tree_lock);
 		unlock_page_ref_irq(page);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -916,13 +918,16 @@ int test_clear_page_writeback(struct pag
 		unsigned long flags;
 
 		lock_page_ref_irqsave(page, flags);
-		spin_lock(&mapping->tree_lock);
 		ret = TestClearPageWriteback(page);
-		if (ret)
+		if (ret) {
+			DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
+			radix_tree_lock(&ctx);
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		spin_unlock(&mapping->tree_lock);
+			radix_tree_unlock(&ctx);
+		}
 		unlock_page_ref_irqrestore(page, flags);
 	} else {
 		ret = TestClearPageWriteback(page);
@@ -937,19 +942,24 @@ int test_set_page_writeback(struct page 
 
 	if (mapping) {
 		unsigned long flags;
+		DECLARE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
 
 		lock_page_ref_irqsave(page, flags);
-		spin_lock(&mapping->tree_lock);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
+		if (!ret) {
+			radix_tree_lock(&ctx);
 			radix_tree_tag_set(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		if (!PageDirty(page))
+			radix_tree_unlock(&ctx);
+		}
+		if (!PageDirty(page)) {
+			radix_tree_lock(&ctx);
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		spin_unlock(&mapping->tree_lock);
+			radix_tree_unlock(&ctx);
+		}
 		unlock_page_ref_irqrestore(page, flags);
 	} else {
 		ret = TestSetPageWriteback(page);
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c
+++ linux-2.6/mm/swap_state.c
@@ -38,7 +38,6 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
@@ -78,10 +77,13 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		DECLARE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree);
+
 		lock_page_ref_irq(page);
-		spin_lock(&swapper_space.tree_lock);
+		radix_tree_lock(&ctx);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
+		radix_tree_unlock(&ctx);
 		if (!error) {
 			page_cache_get(page);
 			SetPageLocked(page);
@@ -90,7 +92,6 @@ static int __add_to_swap_cache(struct pa
 			mapping_nrpages_inc(&swapper_space);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock(&swapper_space.tree_lock);
 		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
@@ -125,12 +126,16 @@ static int add_to_swap_cache(struct page
  */
 void __delete_from_swap_cache(struct page *page)
 {
+	DECLARE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree);
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
 	BUG_ON(PageWriteback(page));
 	BUG_ON(PagePrivate(page));
 
+	radix_tree_lock(&ctx);
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
+	radix_tree_unlock(&ctx);
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	mapping_nrpages_dec(&swapper_space);
@@ -203,9 +208,7 @@ void delete_from_swap_cache(struct page 
 	entry.val = page_private(page);
 
 	lock_page_ref_irq(page);
-	spin_lock(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock(&swapper_space.tree_lock);
 	unlock_page_ref_irq(page);
 
 	swap_free(entry);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c
+++ linux-2.6/mm/swapfile.c
@@ -368,13 +368,11 @@ int remove_exclusive_swap_page(struct pa
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
 		lock_page_ref_irq(page);
-		spin_lock(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock(&swapper_space.tree_lock);
 		unlock_page_ref_irq(page);
 	}
 	spin_unlock(&swap_lock);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c
+++ linux-2.6/mm/truncate.c
@@ -327,19 +327,16 @@ invalidate_complete_page2(struct address
 		return 0;
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	return 0;
 }
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -391,7 +391,6 @@ int remove_mapping(struct address_space 
 	BUG_ON(mapping != page_mapping(page));
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -426,13 +425,11 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		spin_unlock(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->tree_lock);
 
 free_it:
 	unlock_page_ref_irq(page);
@@ -440,7 +437,6 @@ free_it:
 	return 1;
 
 cannot_free:
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	return 0;
 }

--


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 11/14] atomic_ulong_t
  2007-01-28 13:13 ` [PATCH 11/14] atomic_ulong_t Peter Zijlstra
@ 2007-01-29 17:11   ` Christoph Lameter
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Lameter @ 2007-01-29 17:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Andrew Morton, Nick Piggin, Ingo Molnar,
	Rik van Riel

On Sun, 28 Jan 2007, Peter Zijlstra wrote:

> provide an unsigned long atomic type.

Is this really necessary? We have no atomic_uint_t type either.

Could you use atomic_long_t instead?


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 00/14] Concurrent Page Cache
  2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
                   ` (13 preceding siblings ...)
  2007-01-28 13:13 ` [PATCH 14/14] mm: concurrent pagecache write side Peter Zijlstra
@ 2007-01-29 17:20 ` Christoph Lameter
  2007-01-29 18:05   ` Peter Zijlstra
  14 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2007-01-29 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Andrew Morton, Nick Piggin, Ingo Molnar,
	Rik van Riel

On Sun, 28 Jan 2007, Peter Zijlstra wrote:

> With Nick leading the way to getting rid of the read side of the tree_lock,
> this work continues by breaking the write side of said lock.

Could we get the read side in separately from the write side? I think I 
get the read side but the write side still looks scary to me and 
introduces new ways of locking. Ladder locking?
 
> Aside from breaking MTD this version of the concurrent page cache seems
> rock solid on my dual core x86_64 box.

What exactly is the MTD doing and how does it break?

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 00/14] Concurrent Page Cache
  2007-01-29 17:20 ` [PATCH 00/14] Concurrent Page Cache Christoph Lameter
@ 2007-01-29 18:05   ` Peter Zijlstra
  2007-01-29 18:15     ` Christoph Lameter
  0 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-29 18:05 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, Andrew Morton, Nick Piggin, Ingo Molnar,
	Rik van Riel

On Mon, 2007-01-29 at 09:20 -0800, Christoph Lameter wrote:
> On Sun, 28 Jan 2007, Peter Zijlstra wrote:
> 
> > With Nick leading the way to getting rid of the read side of the tree_lock,
> > this work continues by breaking the write side of said lock.
> 
> Could we get the read side in separately from the write side?

Sure, apply patches 1 through 9. 10 and up are the write side.

>  I think I 
> get the read side but the write side still looks scary to me and 
> introduces new ways of locking. Ladder locking?

It's all quite simple really; imagine locking the whole tree path
beginning at the root node. This obviously doesn't provide any concurrency,
since holding the root node locked prevents all other operations from
starting.

However, as soon as you've locked a node on the next level and can
determine that you will never need to traverse the tree path upwards
from this point, you can drop the lock(s) on the previous level.

In the trivial case where you will never traverse up again, this reduces
to ladder locking.

Perhaps it is best illustrated with a picture (forgive the ASCII art)

                          A0
                  B0              B1
              C0      C1      C2      C3
            D0  D1  D2  D3  D4  D5  D6  D7

Say we need D5, which yields the path: A0, B1, C2, D5.

Ladder locking would end up:

lock A0
lock B1
unlock A0 -> a new operation can start
lock C2
unlock B1
lock D5
unlock C2
** we do stuff to D5
unlock D5

For path locking, this would end up being something like this:
(say we can determine we'll never cross C2 back up)

lock A0
lock B1
lock C2
unlock A0 -> a new operation can start
unlock B1
lock D5
** we do stuff to D5 and walk back up to C2
unlock C2
unlock D5
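
In C-like pseudocode the descent would look something like this
(illustration only; the node layout and the child_index() helper here are
made up, the real thing is in the radix-tree patches):

	struct node {
		spinlock_t	lock;
		struct node	*slot[RADIX_TREE_MAP_SIZE];
	};

	static struct node *ladder_descend(struct node *node, unsigned long index,
					   unsigned int height)
	{
		spin_lock(&node->lock);
		while (height--) {
			struct node *child = node->slot[child_index(index, height)];

			spin_lock(&child->lock);	/* lock the next level... */
			spin_unlock(&node->lock);	/* ...then drop the previous one */
			node = child;
		}
		return node;	/* returned locked; the caller unlocks when done */
	}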

> > Aside from breaking MTD this version of the concurrent page cache seems
> > rock solid on my dual core x86_64 box.
> 
> What exactly is the MTD doing 

drivers/mtd/devices/block2mtd.c - fudges with the pagecache; haven't
spent any time converting that yet. Shouldn't be hard.

> and how does it break?

Doesn't compile anymore.

(could use atomic_long_t for mapping::nr_pages I guess)
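
Something like this, say (a sketch assuming __nrpages becomes a plain
atomic_long_t; only the read side needs a cast back to unsigned long):

	static inline unsigned long mapping_nrpages(struct address_space *mapping)
	{
		return (unsigned long)atomic_long_read(&mapping->__nrpages);
	}

	static inline void mapping_nrpages_inc(struct address_space *mapping)
	{
		atomic_long_inc(&mapping->__nrpages);
	}

	static inline void mapping_nrpages_dec(struct address_space *mapping)
	{
		atomic_long_dec(&mapping->__nrpages);
	}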


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 00/14] Concurrent Page Cache
  2007-01-29 18:05   ` Peter Zijlstra
@ 2007-01-29 18:15     ` Christoph Lameter
  2007-01-29 18:56       ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread
From: Christoph Lameter @ 2007-01-29 18:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-mm, Andrew Morton, Nick Piggin, Ingo Molnar,
	Rik van Riel

On Mon, 29 Jan 2007, Peter Zijlstra wrote:

> Ladder locking would end up:
> 
> lock A0
> lock B1
> unlock A0 -> a new operation can start
> lock C2
> unlock B1
> lock D5
> unlock C2
> ** we do stuff to D5
> unlock D5
> 

Instead of taking one lock we would need to take 4? Won't doing so cause
significant locking overhead? We probably would want to run some
benchmarks. Maybe disable the scheme for systems with a small number of
processors?



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 00/14] Concurrent Page Cache
  2007-01-29 18:15     ` Christoph Lameter
@ 2007-01-29 18:56       ` Peter Zijlstra
  0 siblings, 0 replies; 20+ messages in thread
From: Peter Zijlstra @ 2007-01-29 18:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: linux-kernel, linux-mm, Andrew Morton, Nick Piggin, Ingo Molnar,
	Rik van Riel

On Mon, 2007-01-29 at 10:15 -0800, Christoph Lameter wrote:
> On Mon, 29 Jan 2007, Peter Zijlstra wrote:
> 
> > Ladder locking would end up:
> > 
> > lock A0
> > lock B1
> > unlock A0 -> a new operation can start
> > lock C2
> > unlock B1
> > lock D5
> > unlock C2
> > ** we do stuff to D5
> > unlock D5
> > 
> 
> Instead of taking one lock we would need to take 4?

Yep.

> Won't doing so cause significant locking overhead?
> We probably would want to run some benchmarks. 

Right, I was hoping the extra locking overhead would be more than
compensated by the reduction in lock contention time. But testing is
indeed in order.

> Maybe disable the scheme for systems with a small number of 
> processors?

CONFIG_RADIX_TREE_CONCURRENT does exactly this.


^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2007-01-29 18:56 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-01-28 13:13 [PATCH 00/14] Concurrent Page Cache Peter Zijlstra
2007-01-28 13:13 ` [PATCH 01/14] radix-tree: use indirect bit Peter Zijlstra
2007-01-28 13:13 ` [PATCH 02/14] radix-tree: gang_lookup_slot Peter Zijlstra
2007-01-28 13:13 ` [PATCH 03/14] radix-tree: gang_lookup_tag_slot Peter Zijlstra
2007-01-28 13:13 ` [PATCH 04/14] mm: speculative get page Peter Zijlstra
2007-01-28 13:13 ` [PATCH 05/14] mm: lockless pagecache lookups Peter Zijlstra
2007-01-28 13:13 ` [PATCH 06/14] mm: fix speculative page get preemption bug Peter Zijlstra
2007-01-28 13:13 ` [PATCH 07/14] mm: speculative find_get_pages_tag Peter Zijlstra
2007-01-28 13:13 ` [PATCH 08/14] mm: remove find_trylock_page Peter Zijlstra
2007-01-28 13:13 ` [PATCH 09/14] mm: change tree_lock into a spinlock Peter Zijlstra
2007-01-28 13:13 ` [PATCH 10/14] radix-tree: concurrent write side support Peter Zijlstra
2007-01-28 13:13 ` [PATCH 11/14] atomic_ulong_t Peter Zijlstra
2007-01-29 17:11   ` Christoph Lameter
2007-01-28 13:13 ` [PATCH 12/14] mm/fs: abstract address_space::nrpages Peter Zijlstra
2007-01-28 13:13 ` [PATCH 13/14] mm: lock_page_ref Peter Zijlstra
2007-01-28 13:13 ` [PATCH 14/14] mm: concurrent pagecache write side Peter Zijlstra
2007-01-29 17:20 ` [PATCH 00/14] Concurrent Page Cache Christoph Lameter
2007-01-29 18:05   ` Peter Zijlstra
2007-01-29 18:15     ` Christoph Lameter
2007-01-29 18:56       ` Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).