All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6] concurrent pagecache
@ 2007-04-18 20:12 Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 1/6] radix-tree: concurrent write side support Peter Zijlstra
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

The latest version of the concurrent pagecache work; it goes on top of Nick's
latest lockless pagecache patches.

The biggest change from the previous version is the addition of optimistic
locking. This avoids taking the upper level locks where possible, and
significantly reduces cache-line bouncing and lock contention. Scalability
is now determined by density and number of the elements in the tree.

Radix tree benchmarks on 2 cpus; 2 threads performing modifying operations on 
the same tree. The first series uses separate sequential ranges of the tree.
The second interleaves the same range.

The sequental has minimal shared cache-lines, the interleaved maximal
(in one test one can even see the threads drift apart due to cache-line
bouncing so that they don't share the exact same leaf anymore)

-----

start: 0 end: 800000 intv: 1
start: 800000 end: 1000000 intv: 1

CONFIG_RADIX_TREE_CONCURRENT=n

[ffff81007d7f60c0] insert 0 done in 15286 ms
[ffff810076f16040] insert 0 done in 16250 ms
[ffff81007d7f60c0] tag 0 done in 14401 ms
[ffff810076f16040] tag 0 done in 15420 ms
[ffff81007d7f60c0] untag 0 done in 17868 ms
[ffff810076f16040] untag 0 done in 18251 ms
[ffff81007d7f60c0] tag 1 done in 15331 ms
[ffff810076f16040] tag 1 done in 15287 ms
[ffff81007d7f60c0] untag 1 done in 18342 ms
[ffff810076f16040] untag 1 done in 18092 ms
[ffff81007d7f60c0] remove 0 done in 15412 ms
[ffff810076f16040] remove 0 done in 13802 ms

CONFIG_RADIX_TREE_CONCURRENT=y

[ffff8100587e00c0] insert 0 done in 14750 ms
[ffff8100587e1080] insert 0 done in 14834 ms
[ffff8100587e00c0] tag 0 done in 14489 ms
[ffff8100587e1080] tag 0 done in 14567 ms
[ffff8100587e00c0] untag 0 done in 15016 ms
[ffff8100587e1080] untag 0 done in 15066 ms
[ffff8100587e00c0] tag 1 done in 14593 ms
[ffff8100587e1080] tag 1 done in 14674 ms
[ffff8100587e00c0] untag 1 done in 14984 ms
[ffff8100587e1080] untag 1 done in 15043 ms
[ffff8100587e00c0] remove 0 done in 16307 ms
[ffff8100587e1080] remove 0 done in 16193 ms

CONFIG_RADIX_TREE_OPTIMISTIC=y

[ffff81006b36e040] insert 0 done in 3443 ms
[ffff81006b3620c0] insert 0 done in 3449 ms
[ffff81006b3620c0] tag 0 done in 5338 ms
[ffff81006b36e040] tag 0 done in 5375 ms
[ffff81006b3620c0] untag 0 done in 4107 ms
[ffff81006b36e040] untag 0 done in 4151 ms
[ffff81006b3620c0] tag 1 done in 5349 ms
[ffff81006b36e040] tag 1 done in 5376 ms
[ffff81006b3620c0] untag 1 done in 4110 ms
[ffff81006b36e040] untag 1 done in 4137 ms
[ffff81006b3620c0] remove 0 done in 6482 ms
[ffff81006b36e040] remove 0 done in 6494 ms

levels skipped  hits
        0            8785
        1          544094
        2        34312192
        3        65798168
        4              52
        5               0
        6               0
        7               0
        8               0
        9               0
       10               0
       11               0
failed                232

-----

start: 0 end: 1000000 intv: 2
start: 1 end: 1000000 intv: 2

CONFIG_RADIX_TREE_CONCURRENT=n

[ffff8100679b9080] insert 0 done in 16007 ms
[ffff8100764e40c0] insert 0 done in 16004 ms
[ffff8100679b9080] tag 0 done in 14964 ms
[ffff8100764e40c0] tag 0 done in 15007 ms
[ffff8100679b9080] untag 0 done in 17414 ms
[ffff8100764e40c0] untag 0 done in 17564 ms
[ffff8100679b9080] tag 1 done in 14909 ms
[ffff8100764e40c0] tag 1 done in 15076 ms
[ffff8100679b9080] untag 1 done in 17455 ms
[ffff8100764e40c0] untag 1 done in 17628 ms
[ffff8100679b9080] remove 0 done in 14067 ms
[ffff8100764e40c0] remove 0 done in 14358 ms

CONFIG_RADIX_TREE_CONCURRENT=y

[ffff81006becc0c0] insert 0 done in 19483 ms
[ffff81006bec8080] insert 0 done in 19486 ms
[ffff81006bec8080] tag 0 done in 15604 ms
[ffff81006becc0c0] tag 0 done in 15632 ms
[ffff81006bec8080] untag 0 done in 16952 ms
[ffff81006becc0c0] untag 0 done in 16968 ms
[ffff81006bec8080] tag 1 done in 15444 ms
[ffff81006becc0c0] tag 1 done in 15471 ms
[ffff81006bec8080] untag 1 done in 16996 ms
[ffff81006becc0c0] untag 1 done in 17010 ms
[ffff81006bec8080] remove 0 done in 16145 ms
[ffff81006becc0c0] remove 0 done in 16867 ms

CONFIG_RADIX_TREE_OPTIMISTIC=y

[ffff8100606260c0] insert 0 done in 12036 ms
[ffff810067c20040] insert 0 done in 12033 ms
[ffff810067c20040] tag 0 done in 9438 ms
[ffff8100606260c0] tag 0 done in 9438 ms
[ffff8100606260c0] untag 0 done in 4067 ms
[ffff810067c20040] untag 0 done in 4208 ms
[ffff8100606260c0] tag 1 done in 5424 ms
[ffff810067c20040] tag 1 done in 5368 ms
[ffff8100606260c0] untag 1 done in 4072 ms
[ffff810067c20040] untag 1 done in 4195 ms
[ffff8100606260c0] remove 0 done in 6111 ms
[ffff810067c20040] remove 0 done in 6948 ms

levels skipped  hits
        0            9053
        1          556631
        2        34790358
        3        65307226
        4              74
        5               0
        6               0
        7               0
        8               0
        9               0
       10               0
       11               0
failed                250


-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 1/6] radix-tree: concurrent write side support
  2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
@ 2007-04-18 20:12 ` Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 2/6] mm/fs: abstract address_space::nrpages Peter Zijlstra
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

[-- Attachment #1: radix-tree-concurrent.patch --]
[-- Type: text/plain, Size: 20088 bytes --]

Provide support for concurrent write side operations without changing the API
for all other users.

Concurrency is realized by means of two locking models; the simple one is 
ladder locking, the more complex one is path locking.

Ladder locking (aka Lock-Coupling) is the simplest form of traversing a node
locked tree. One will only ever hold two locks at the same time. Take a child
node's lock while holding the current. Once acquired, release the upper node's
lock. This overlap is needed, because if we'd release a node's lock before
acquiring a child node's lock, there is nothing stopping the tree from
changing right under out free.

This allows other modifying operations to start as soon as you release the
lock on the root node and even complete before you if they walk another path
downward.

The modifying operations: insert, lookup_slot and set_tag, use this simple
method.

The more complex path locking method is needed for operations that need to
walk upwards again after they walked down, those are: tag_clear and delete.

These lock their whole path downwards and release whole sections at points
where it can be determined the walk upwards will stop, thus also allowing
concurrency.

Finding the conditions for the terminated walk upwards while doing the downward
walk is the 'interesting' part of this approach.

The API for this looks like:

  DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree)

  radix_tree_lock(&ctx)
  ... do _1_ modifying operation ...
  radix_tree_unlock(&ctx)

Note that, while it allows for concurrency the overhead of cache-line bouncing
the locks around on SMP machine will severly reduce the advantage.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/radix-tree.h |   87 ++++++++++++-
 init/Kconfig               |    4 
 lib/radix-tree.c           |  282 +++++++++++++++++++++++++++++++++++----------
 3 files changed, 310 insertions(+), 63 deletions(-)

Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h	2007-04-13 11:23:47.000000000 +0200
+++ linux-2.6/include/linux/radix-tree.h	2007-04-13 14:13:05.000000000 +0200
@@ -62,23 +62,65 @@ struct radix_tree_root {
 	unsigned int		height;
 	gfp_t			gfp_mask;
 	struct radix_tree_node	*rnode;
+	spinlock_t		lock;
 };
 
 #define RADIX_TREE_INIT(mask)	{					\
 	.height = 0,							\
 	.gfp_mask = (mask),						\
 	.rnode = NULL,							\
+	.lock = __SPIN_LOCK_UNLOCKED(radix_tree_root.lock),		\
 }
 
 #define RADIX_TREE(name, mask) \
 	struct radix_tree_root name = RADIX_TREE_INIT(mask)
 
-#define INIT_RADIX_TREE(root, mask)					\
-do {									\
-	(root)->height = 0;						\
-	(root)->gfp_mask = (mask);					\
-	(root)->rnode = NULL;						\
-} while (0)
+static inline void INIT_RADIX_TREE(struct radix_tree_root *root, gfp_t gfp_mask)
+{
+	root->height = 0;
+	root->gfp_mask = gfp_mask;
+	root->rnode = NULL;
+	spin_lock_init(&root->lock);
+}
+
+struct radix_tree_context {
+	struct radix_tree_root	*tree;
+	struct radix_tree_root	*root;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spinlock_t		*locked;
+#endif
+};
+
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+#define RADIX_CONTEXT_ROOT(context)					\
+	((struct radix_tree_root *)(((unsigned long)context) + 1))
+
+#define __RADIX_TREE_CONTEXT_INIT(context, _tree)			\
+		.tree = RADIX_CONTEXT_ROOT(&context),			\
+		.locked = NULL,
+#else
+#define __RADIX_TREE_CONTEXT_INIT(context, _tree)			\
+		.tree = (_tree),
+#endif
+
+#define DEFINE_RADIX_TREE_CONTEXT(context, _tree) 			\
+	struct radix_tree_context context = { 				\
+		.root = (_tree), 					\
+		__RADIX_TREE_CONTEXT_INIT(context, _tree)		\
+       	}
+
+static inline void
+init_radix_tree_context(struct radix_tree_context *ctx,
+		struct radix_tree_root *root)
+{
+	ctx->root = root;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	ctx->tree = RADIX_CONTEXT_ROOT(ctx);
+	ctx->locked = NULL;
+#else
+	ctx->tree = root;
+#endif
+}
 
 /**
  * Radix-tree synchronization
@@ -155,6 +197,29 @@ static inline void radix_tree_replace_sl
 	rcu_assign_pointer(*pslot, item);
 }
 
+static inline void radix_tree_lock(struct radix_tree_context *context)
+{
+	struct radix_tree_root *root = context->root;
+	rcu_read_lock();
+	spin_lock(&root->lock);
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	BUG_ON(context->locked);
+	context->locked = &root->lock;
+#endif
+}
+
+static inline void radix_tree_unlock(struct radix_tree_context *context)
+{
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	BUG_ON(!context->locked);
+	spin_unlock(context->locked);
+	context->locked = NULL;
+#else
+	spin_unlock(&context->root->lock);
+#endif
+	rcu_read_unlock();
+}
+
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);
 void **radix_tree_lookup_slot(struct radix_tree_root *, unsigned long);
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2007-04-13 11:23:47.000000000 +0200
+++ linux-2.6/lib/radix-tree.c	2007-04-13 12:26:29.000000000 +0200
@@ -32,6 +32,7 @@
 #include <linux/string.h>
 #include <linux/bitops.h>
 #include <linux/rcupdate.h>
+#include <linux/spinlock.h>
 
 
 #ifdef __KERNEL__
@@ -52,11 +53,17 @@ struct radix_tree_node {
 	struct rcu_head	rcu_head;
 	void		*slots[RADIX_TREE_MAP_SIZE];
 	unsigned long	tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spinlock_t	lock;
+#endif
 };
 
 struct radix_tree_path {
 	struct radix_tree_node *node;
 	int offset;
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spinlock_t *locked;
+#endif
 };
 
 #define RADIX_TREE_INDEX_BITS  (8 /* CHAR_BIT */ * sizeof(unsigned long))
@@ -64,6 +71,10 @@ struct radix_tree_path {
 
 static unsigned long height_to_maxindex[RADIX_TREE_MAX_PATH] __read_mostly;
 
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH];
+#endif
+
 /*
  * Radix tree node cache.
  */
@@ -88,7 +99,7 @@ static inline gfp_t root_gfp_mask(struct
  * that the caller has pinned this thread of control to the current CPU.
  */
 static struct radix_tree_node *
-radix_tree_node_alloc(struct radix_tree_root *root)
+radix_tree_node_alloc(struct radix_tree_root *root, int height)
 {
 	struct radix_tree_node *ret;
 	gfp_t gfp_mask = root_gfp_mask(root);
@@ -105,6 +116,11 @@ radix_tree_node_alloc(struct radix_tree_
 		}
 	}
 	BUG_ON(radix_tree_is_indirect_ptr(ret));
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+	spin_lock_init(&ret->lock);
+	lockdep_set_class(&ret->lock, &radix_node_class[height]);
+#endif
+	ret->height = height;
 	return ret;
 }
 
@@ -205,6 +221,22 @@ static inline int any_tag_set(struct rad
 	return 0;
 }
 
+static inline int any_tag_set_but(struct radix_tree_node *node,
+		unsigned int tag, int offset)
+{
+	int idx;
+	int offset_idx = offset / BITS_PER_LONG;
+	unsigned long offset_mask = ~(1UL << (offset % BITS_PER_LONG));
+	for (idx = 0; idx < RADIX_TREE_TAG_LONGS; idx++) {
+		unsigned long mask = ~0UL;
+		if (idx == offset_idx)
+			mask = offset_mask;
+		if (node->tags[tag][idx] & mask)
+			return 1;
+	}
+	return 0;
+}
+
 /*
  *	Return the maximum key which can be store into a
  *	radix tree with height HEIGHT.
@@ -234,8 +266,8 @@ static int radix_tree_extend(struct radi
 	}
 
 	do {
-		unsigned int newheight;
-		if (!(node = radix_tree_node_alloc(root)))
+		unsigned int newheight = root->height + 1;
+		if (!(node = radix_tree_node_alloc(root, newheight)))
 			return -ENOMEM;
 
 		/* Increase the height.  */
@@ -247,8 +279,6 @@ static int radix_tree_extend(struct radi
 				tag_set(node, tag, 0);
 		}
 
-		newheight = root->height+1;
-		node->height = newheight;
 		node->count = 1;
 		node = radix_tree_ptr_to_indirect(node);
 		rcu_assign_pointer(root->rnode, node);
@@ -258,6 +288,80 @@ out:
 	return 0;
 }
 
+#ifdef CONFIG_RADIX_TREE_CONCURRENT
+static inline struct radix_tree_context *
+radix_tree_get_context(struct radix_tree_root **rootp)
+{
+	struct radix_tree_context *context = NULL;
+	unsigned long addr = (unsigned long)*rootp;
+
+	if (addr & 1) {
+		context = (struct radix_tree_context *)(addr - 1);
+		*rootp = context->root;
+	}
+
+	return context;
+}
+
+#define RADIX_TREE_CONTEXT(context, root) \
+	struct radix_tree_context *context =	\
+		radix_tree_get_context(&root)
+
+static inline spinlock_t *radix_node_lock(struct radix_tree_root *root,
+		struct radix_tree_node *node)
+{
+	spinlock_t *locked = &node->lock;
+	spin_lock(locked);
+	return locked;
+}
+
+static inline void radix_ladder_lock(struct radix_tree_context *context,
+		struct radix_tree_node *node)
+{
+	if (context) {
+		struct radix_tree_root *root = context->root;
+		spinlock_t *locked = radix_node_lock(root, node);
+		if (locked) {
+			spin_unlock(context->locked);
+			context->locked = locked;
+		}
+	}
+}
+
+static inline void radix_path_init(struct radix_tree_context *context,
+		struct radix_tree_path *pathp)
+{
+	pathp->locked = context ? context->locked : NULL;
+}
+
+static inline void radix_path_lock(struct radix_tree_context *context,
+		struct radix_tree_path *pathp, struct radix_tree_node *node)
+{
+	if (context) {
+		struct radix_tree_root *root = context->root;
+		spinlock_t *locked = radix_node_lock(root, node);
+		if (locked)
+			context->locked = locked;
+		pathp->locked = locked;
+	} else
+		pathp->locked = NULL;
+}
+
+static inline void radix_path_unlock(struct radix_tree_context *context,
+		struct radix_tree_path *punlock)
+{
+	if (context && punlock->locked &&
+			context->locked != punlock->locked)
+		spin_unlock(punlock->locked);
+}
+#else
+#define RADIX_TREE_CONTEXT(context, root) do { } while (0)
+#define radix_ladder_lock(context, node) do { } while (0)
+#define radix_path_init(context, pathp) do { } while (0)
+#define radix_path_lock(context, pathp, node) do { } while (0)
+#define radix_path_unlock(context, punlock) do { } while (0)
+#endif
+
 /**
  *	radix_tree_insert    -    insert into a radix tree
  *	@root:		radix tree root
@@ -273,6 +377,8 @@ int radix_tree_insert(struct radix_tree_
 	unsigned int height, shift;
 	int offset;
 	int error;
+	int tag;
+	RADIX_TREE_CONTEXT(context, root);
 
 	BUG_ON(radix_tree_is_indirect_ptr(item));
 
@@ -292,9 +398,8 @@ int radix_tree_insert(struct radix_tree_
 	while (height > 0) {
 		if (slot == NULL) {
 			/* Have to add a child node.  */
-			if (!(slot = radix_tree_node_alloc(root)))
+			if (!(slot = radix_tree_node_alloc(root, height)))
 				return -ENOMEM;
-			slot->height = height;
 			if (node) {
 				rcu_assign_pointer(node->slots[offset], slot);
 				node->count++;
@@ -306,6 +411,9 @@ int radix_tree_insert(struct radix_tree_
 		/* Go a level down */
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		node = slot;
+
+		radix_ladder_lock(context, node);
+
 		slot = node->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
@@ -317,12 +425,12 @@ int radix_tree_insert(struct radix_tree_
 	if (node) {
 		node->count++;
 		rcu_assign_pointer(node->slots[offset], item);
-		BUG_ON(tag_get(node, 0, offset));
-		BUG_ON(tag_get(node, 1, offset));
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			BUG_ON(tag_get(node, tag, offset));
 	} else {
 		rcu_assign_pointer(root->rnode, item);
-		BUG_ON(root_tag_get(root, 0));
-		BUG_ON(root_tag_get(root, 1));
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			BUG_ON(root_tag_get(root, tag));
 	}
 
 	return 0;
@@ -346,6 +454,7 @@ void **radix_tree_lookup_slot(struct rad
 {
 	unsigned int height, shift;
 	struct radix_tree_node *node, **slot;
+	RADIX_TREE_CONTEXT(context, root);
 
 	node = rcu_dereference(root->rnode);
 	if (node == NULL)
@@ -371,6 +480,8 @@ void **radix_tree_lookup_slot(struct rad
 		if (node == NULL)
 			return NULL;
 
+		radix_ladder_lock(context, node);
+
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	} while (height > 0);
@@ -446,6 +557,7 @@ void *radix_tree_tag_set(struct radix_tr
 {
 	unsigned int height, shift;
 	struct radix_tree_node *slot;
+	RADIX_TREE_CONTEXT(context, root);
 
 	height = root->height;
 	BUG_ON(index > radix_tree_maxindex(height));
@@ -453,9 +565,15 @@ void *radix_tree_tag_set(struct radix_tr
 	slot = radix_tree_indirect_to_ptr(root->rnode);
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
 
+	/* set the root's tag bit */
+	if (slot && !root_tag_get(root, tag))
+		root_tag_set(root, tag);
+
 	while (height > 0) {
 		int offset;
 
+		radix_ladder_lock(context, slot);
+
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
@@ -465,14 +583,24 @@ void *radix_tree_tag_set(struct radix_tr
 		height--;
 	}
 
-	/* set the root's tag bit */
-	if (slot && !root_tag_get(root, tag))
-		root_tag_set(root, tag);
-
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_set);
 
+/*
+ * the change can never propagate upwards from here.
+ */
+static inline int radix_tree_unlock_tag(struct radix_tree_root *root,
+		struct radix_tree_path *pathp, int tag)
+{
+	int this, other;
+
+	this = tag_get(pathp->node, tag, pathp->offset);
+	other = any_tag_set_but(pathp->node, tag, pathp->offset);
+
+	return !this || other;
+}
+
 /**
  *	radix_tree_tag_clear - clear a tag on a radix tree node
  *	@root:		radix tree root
@@ -491,15 +619,19 @@ void *radix_tree_tag_clear(struct radix_
 			unsigned long index, unsigned int tag)
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
+	struct radix_tree_path *punlock = path, *piter;
 	struct radix_tree_node *slot = NULL;
 	unsigned int height, shift;
+	RADIX_TREE_CONTEXT(context, root);
+
+	pathp->node = NULL;
+	radix_path_init(context, pathp);
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
 		goto out;
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	pathp->node = NULL;
 	slot = radix_tree_indirect_to_ptr(root->rnode);
 
 	while (height > 0) {
@@ -509,10 +641,17 @@ void *radix_tree_tag_clear(struct radix_
 			goto out;
 
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
-		pathp[1].offset = offset;
-		pathp[1].node = slot;
-		slot = slot->slots[offset];
 		pathp++;
+		pathp->offset = offset;
+		pathp->node = slot;
+		radix_path_lock(context, pathp, slot);
+
+		if (radix_tree_unlock_tag(root, pathp, tag)) {
+			for (; punlock < pathp; punlock++)
+				radix_path_unlock(context, punlock);
+		}
+
+		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
 	}
@@ -520,20 +659,22 @@ void *radix_tree_tag_clear(struct radix_
 	if (slot == NULL)
 		goto out;
 
-	while (pathp->node) {
-		if (!tag_get(pathp->node, tag, pathp->offset))
-			goto out;
-		tag_clear(pathp->node, tag, pathp->offset);
-		if (any_tag_set(pathp->node, tag))
-			goto out;
-		pathp--;
+	for (piter = pathp; piter >= punlock; piter--) {
+		if (piter->node) {
+			if (!tag_get(piter->node, tag, piter->offset))
+				break;
+			tag_clear(piter->node, tag, piter->offset);
+			if (any_tag_set(piter->node, tag))
+				break;
+		} else {
+			if (root_tag_get(root, tag))
+				root_tag_clear(root, tag);
+		}
 	}
 
-	/* clear the root's tag bit */
-	if (root_tag_get(root, tag))
-		root_tag_clear(root, tag);
-
 out:
+	for (; punlock < pathp; punlock++)
+		radix_path_unlock(context, punlock);
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_tag_clear);
@@ -985,6 +1126,7 @@ static inline void radix_tree_shrink(str
 	while (root->height > 0) {
 		struct radix_tree_node *to_free = root->rnode;
 		void *newptr;
+		int tag;
 
 		BUG_ON(!radix_tree_is_indirect_ptr(to_free));
 		to_free = radix_tree_indirect_to_ptr(to_free);
@@ -1011,14 +1153,29 @@ static inline void radix_tree_shrink(str
 		root->rnode = newptr;
 		root->height--;
 		/* must only free zeroed nodes into the slab */
-		tag_clear(to_free, 0, 0);
-		tag_clear(to_free, 1, 0);
+		for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++)
+			tag_clear(to_free, tag, 0);
 		to_free->slots[0] = NULL;
 		to_free->count = 0;
-		radix_tree_node_free(to_free);
 	}
 }
 
+static inline int radix_tree_unlock_all(struct radix_tree_root *root,
+		struct radix_tree_path *pathp)
+{
+	int tag;
+	int unlock = 1;
+
+	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
+		if (!radix_tree_unlock_tag(root, pathp, tag)) {
+			unlock = 0;
+			break;
+		}
+	}
+
+	return unlock;
+}
+
 /**
  *	radix_tree_delete    -    delete an item from a radix tree
  *	@root:		radix tree root
@@ -1031,11 +1188,15 @@ static inline void radix_tree_shrink(str
 void *radix_tree_delete(struct radix_tree_root *root, unsigned long index)
 {
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
+	struct radix_tree_path *punlock = path, *piter;
 	struct radix_tree_node *slot = NULL;
-	struct radix_tree_node *to_free;
 	unsigned int height, shift;
 	int tag;
 	int offset;
+	RADIX_TREE_CONTEXT(context, root);
+
+	pathp->node = NULL;
+	radix_path_init(context, pathp);
 
 	height = root->height;
 	if (index > radix_tree_maxindex(height))
@@ -1050,7 +1211,6 @@ void *radix_tree_delete(struct radix_tre
 	slot = radix_tree_indirect_to_ptr(slot);
 
 	shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
-	pathp->node = NULL;
 
 	do {
 		if (slot == NULL)
@@ -1060,6 +1220,13 @@ void *radix_tree_delete(struct radix_tre
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		pathp->offset = offset;
 		pathp->node = slot;
+		radix_path_lock(context, pathp, slot);
+
+		if (slot->count > 2 && radix_tree_unlock_all(root, pathp)) {
+			for (; punlock < pathp; punlock++)
+				radix_path_unlock(context, punlock);
+		}
+
 		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
@@ -1072,41 +1239,45 @@ void *radix_tree_delete(struct radix_tre
 	 * Clear all tags associated with the just-deleted item
 	 */
 	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-		if (tag_get(pathp->node, tag, pathp->offset))
-			radix_tree_tag_clear(root, index, tag);
+		for (piter = pathp; piter >= punlock; piter--) {
+			if (piter->node) {
+				if (!tag_get(piter->node, tag, piter->offset))
+					break;
+				tag_clear(piter->node, tag, piter->offset);
+				if (any_tag_set(piter->node, tag))
+					break;
+			} else {
+				if (root_tag_get(root, tag))
+					root_tag_clear(root, tag);
+			}
+		}
 	}
 
-	to_free = NULL;
-	/* Now free the nodes we do not need anymore */
-	while (pathp->node) {
-		pathp->node->slots[pathp->offset] = NULL;
-		pathp->node->count--;
-		/*
-		 * Queue the node for deferred freeing after the
-		 * last reference to it disappears (set NULL, above).
-		 */
-		if (to_free)
-			radix_tree_node_free(to_free);
+	/* Now unhook the nodes we do not need anymore */
+	for (piter = pathp; piter >= punlock && piter->node; piter--) {
+		piter->node->slots[piter->offset] = NULL;
+		piter->node->count--;
 
-		if (pathp->node->count) {
-			if (pathp->node ==
+		if (piter->node->count) {
+			if (piter->node ==
 					radix_tree_indirect_to_ptr(root->rnode))
 				radix_tree_shrink(root);
 			goto out;
 		}
+	}
 
-		/* Node with zero slots in use so free it */
-		to_free = pathp->node;
-		pathp--;
+	BUG_ON(piter->node);
 
-	}
 	root_tag_clear_all(root);
 	root->height = 0;
 	root->rnode = NULL;
-	if (to_free)
-		radix_tree_node_free(to_free);
 
 out:
+	for (; punlock <= pathp; punlock++) {
+		radix_path_unlock(context, punlock);
+		if (punlock->node && punlock->node->count == 0)
+			radix_tree_node_free(punlock->node);
+	}
 	return slot;
 }
 EXPORT_SYMBOL(radix_tree_delete);
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig	2007-04-13 11:23:32.000000000 +0200
+++ linux-2.6/init/Kconfig	2007-04-13 11:23:47.000000000 +0200
@@ -342,6 +342,10 @@ config CC_OPTIMIZE_FOR_SIZE
 config SYSCTL
 	bool
 
+config RADIX_TREE_CONCURRENT
+	bool "Enable concurrent radix tree operations (EXPERIMENTAL)"
+	default y if SMP
+
 menuconfig EMBEDDED
 	bool "Configure standard kernel features (for small systems)"
 	help

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 2/6] mm/fs: abstract address_space::nrpages
  2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 1/6] radix-tree: concurrent write side support Peter Zijlstra
@ 2007-04-18 20:12 ` Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 3/6] mm: lock_page_ref Peter Zijlstra
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

[-- Attachment #1: mapping_nrpages.patch --]
[-- Type: text/plain, Size: 19065 bytes --]

Currently the tree_lock protects mapping->nrpages, this will not be
possible much longer. Hence abstract the access to this variable so that
it can be easily replaced by an atomic_long_t.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/sh64/lib/dbg.c          |    2 +-
 fs/block_dev.c               |    2 +-
 fs/buffer.c                  |    2 +-
 fs/gfs2/glock.c              |    2 +-
 fs/gfs2/glops.c              |    6 +++---
 fs/gfs2/meta_io.c            |    2 +-
 fs/hugetlbfs/inode.c         |    2 +-
 fs/inode.c                   |   10 +++++-----
 fs/jffs/inode-v23.c          |    2 +-
 fs/jffs2/dir.c               |    4 ++--
 fs/jffs2/fs.c                |    2 +-
 fs/libfs.c                   |    2 +-
 fs/nfs/inode.c               |    6 +++---
 fs/xfs/linux-2.6/xfs_vnode.h |    2 +-
 include/linux/fs.h           |   22 +++++++++++++++++++++-
 include/linux/swap.h         |    2 +-
 ipc/shm.c                    |    4 ++--
 mm/filemap.c                 |   12 ++++++------
 mm/shmem.c                   |    8 ++++----
 mm/swap_state.c              |    4 ++--
 mm/truncate.c                |    2 +-
 21 files changed, 60 insertions(+), 40 deletions(-)

Index: linux-2.6/arch/sh64/lib/dbg.c
===================================================================
--- linux-2.6.orig/arch/sh64/lib/dbg.c	2007-02-06 17:45:59.000000000 +0100
+++ linux-2.6/arch/sh64/lib/dbg.c	2007-04-12 18:39:35.000000000 +0200
@@ -424,6 +424,6 @@ void print_page(struct page *page)
 	printk("  page[%p] -> index 0x%lx,  count 0x%x,  flags 0x%lx\n",
 	       page, page->index, page_count(page), page->flags);
 	printk("       address_space = %p, pages =%ld\n", page->mapping,
-	       page->mapping->nrpages);
+	       mapping_nrpages(page->mapping));
 
 }
Index: linux-2.6/fs/block_dev.c
===================================================================
--- linux-2.6.orig/fs/block_dev.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/block_dev.c	2007-04-12 18:39:35.000000000 +0200
@@ -595,7 +595,7 @@ long nr_blockdev_pages(void)
 	list_for_each(p, &all_bdevs) {
 		struct block_device *bdev;
 		bdev = list_entry(p, struct block_device, bd_list);
-		ret += bdev->bd_inode->i_mapping->nrpages;
+		ret += mapping_nrpages(bdev->bd_inode->i_mapping);
 	}
 	spin_unlock(&bdev_lock);
 	return ret;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-12 18:33:48.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-12 18:39:35.000000000 +0200
@@ -337,7 +337,7 @@ void invalidate_bdev(struct block_device
 {
 	struct address_space *mapping = bdev->bd_inode->i_mapping;
 
-	if (mapping->nrpages == 0)
+	if (mapping_nrpages(mapping) == 0)
 		return;
 
 	invalidate_bh_lrus();
Index: linux-2.6/fs/gfs2/glock.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glock.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/gfs2/glock.c	2007-04-12 18:39:35.000000000 +0200
@@ -1942,7 +1942,7 @@ static int dump_glock(struct gfs2_glock 
 		    (list_empty(&gl->gl_reclaim)) ? "no" : "yes");
 	if (gl->gl_aspace)
 		printk(KERN_INFO "  aspace = 0x%p nrpages = %lu\n", gl->gl_aspace,
-		       gl->gl_aspace->i_mapping->nrpages);
+		       mapping_nrpages(gl->gl_aspace->i_mapping));
 	else
 		printk(KERN_INFO "  aspace = no\n");
 	printk(KERN_INFO "  ail = %d\n", atomic_read(&gl->gl_ail_count));
Index: linux-2.6/fs/gfs2/glops.c
===================================================================
--- linux-2.6.orig/fs/gfs2/glops.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/gfs2/glops.c	2007-04-12 18:39:35.000000000 +0200
@@ -261,7 +261,7 @@ static int inode_go_demote_ok(struct gfs
 	struct gfs2_sbd *sdp = gl->gl_sbd;
 	int demote = 0;
 
-	if (!gl->gl_object && !gl->gl_aspace->i_mapping->nrpages)
+	if (!gl->gl_object && !mapping_nrpages(gl->gl_aspace->i_mapping))
 		demote = 1;
 	else if (!sdp->sd_args.ar_localcaching &&
 		 time_after_eq(jiffies, gl->gl_stamp +
@@ -328,7 +328,7 @@ static void inode_go_unlock(struct gfs2_
 
 static int rgrp_go_demote_ok(struct gfs2_glock *gl)
 {
-	return !gl->gl_aspace->i_mapping->nrpages;
+	return !mapping_nrpages(gl->gl_aspace->i_mapping);
 }
 
 /**
Index: linux-2.6/fs/gfs2/meta_io.c
===================================================================
--- linux-2.6.orig/fs/gfs2/meta_io.c	2007-04-03 13:58:17.000000000 +0200
+++ linux-2.6/fs/gfs2/meta_io.c	2007-04-12 18:39:35.000000000 +0200
@@ -104,7 +104,7 @@ void gfs2_meta_inval(struct gfs2_glock *
 	truncate_inode_pages(mapping, 0);
 	atomic_dec(&aspace->i_writecount);
 
-	gfs2_assert_withdraw(sdp, !mapping->nrpages);
+	gfs2_assert_withdraw(sdp, !mapping_nrpages(mapping));
 }
 
 /**
Index: linux-2.6/fs/hugetlbfs/inode.c
===================================================================
--- linux-2.6.orig/fs/hugetlbfs/inode.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/hugetlbfs/inode.c	2007-04-12 18:39:35.000000000 +0200
@@ -214,7 +214,7 @@ static void truncate_hugepages(struct in
 		}
 		huge_pagevec_release(&pvec);
 	}
-	BUG_ON(!lstart && mapping->nrpages);
+	BUG_ON(!lstart && mapping_nrpages(mapping));
 	hugetlb_unreserve_pages(inode, start, freed);
 }
 
Index: linux-2.6/fs/jffs2/dir.c
===================================================================
--- linux-2.6.orig/fs/jffs2/dir.c	2007-04-03 13:58:17.000000000 +0200
+++ linux-2.6/fs/jffs2/dir.c	2007-04-12 18:39:35.000000000 +0200
@@ -205,7 +205,7 @@ static int jffs2_create(struct inode *di
 	inode->i_op = &jffs2_file_inode_operations;
 	inode->i_fop = &jffs2_file_operations;
 	inode->i_mapping->a_ops = &jffs2_file_address_operations;
-	inode->i_mapping->nrpages = 0;
+	mapping_nrpages_init(inode->i_mapping);
 
 	f = JFFS2_INODE_INFO(inode);
 	dir_f = JFFS2_INODE_INFO(dir_i);
@@ -229,7 +229,7 @@ static int jffs2_create(struct inode *di
 	d_instantiate(dentry, inode);
 
 	D1(printk(KERN_DEBUG "jffs2_create: Created ino #%lu with mode %o, nlink %d(%d). nrpages %ld\n",
-		  inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, inode->i_mapping->nrpages));
+		  inode->i_ino, inode->i_mode, inode->i_nlink, f->inocache->nlink, mapping_nrpages(inode->i_mapping)));
 	return 0;
 
  fail:
Index: linux-2.6/fs/jffs2/fs.c
===================================================================
--- linux-2.6.orig/fs/jffs2/fs.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/jffs2/fs.c	2007-04-12 18:39:35.000000000 +0200
@@ -293,7 +293,7 @@ void jffs2_read_inode (struct inode *ino
 		inode->i_op = &jffs2_file_inode_operations;
 		inode->i_fop = &jffs2_file_operations;
 		inode->i_mapping->a_ops = &jffs2_file_address_operations;
-		inode->i_mapping->nrpages = 0;
+		mapping_nrpages_init(inode->i_mapping);
 		break;
 
 	case S_IFBLK:
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/libfs.c	2007-04-12 18:39:35.000000000 +0200
@@ -16,7 +16,7 @@ int simple_getattr(struct vfsmount *mnt,
 {
 	struct inode *inode = dentry->d_inode;
 	generic_fillattr(inode, stat);
-	stat->blocks = inode->i_mapping->nrpages << (PAGE_CACHE_SHIFT - 9);
+	stat->blocks = mapping_nrpages(inode->i_mapping) << (PAGE_CACHE_SHIFT - 9);
 	return 0;
 }
 
Index: linux-2.6/fs/nfs/inode.c
===================================================================
--- linux-2.6.orig/fs/nfs/inode.c	2007-04-12 18:28:22.000000000 +0200
+++ linux-2.6/fs/nfs/inode.c	2007-04-12 18:39:35.000000000 +0200
@@ -98,7 +98,7 @@ int nfs_sync_mapping(struct address_spac
 {
 	int ret;
 
-	if (mapping->nrpages == 0)
+	if (mapping_nrpages(mapping) == 0)
 		return 0;
 	unmap_mapping_range(mapping, 0, 0, 0);
 	ret = filemap_write_and_wait(mapping);
@@ -138,7 +138,7 @@ void nfs_zap_caches(struct inode *inode)
 
 void nfs_zap_mapping(struct inode *inode, struct address_space *mapping)
 {
-	if (mapping->nrpages != 0) {
+	if (mapping_nrpages(mapping) != 0) {
 		spin_lock(&inode->i_lock);
 		NFS_I(inode)->cache_validity |= NFS_INO_INVALID_DATA;
 		spin_unlock(&inode->i_lock);
@@ -677,7 +677,7 @@ static int nfs_invalidate_mapping_nolock
 {
 	struct nfs_inode *nfsi = NFS_I(inode);
 	
-	if (mapping->nrpages != 0) {
+	if (mapping_nrpages(mapping) != 0) {
 		int ret = invalidate_inode_pages2(mapping);
 		if (ret < 0)
 			return ret;
Index: linux-2.6/fs/xfs/linux-2.6/xfs_vnode.h
===================================================================
--- linux-2.6.orig/fs/xfs/linux-2.6/xfs_vnode.h	2007-04-03 13:58:18.000000000 +0200
+++ linux-2.6/fs/xfs/linux-2.6/xfs_vnode.h	2007-04-12 18:39:35.000000000 +0200
@@ -548,7 +548,7 @@ static inline void vn_atime_to_time_t(bh
  * Some useful predicates.
  */
 #define VN_MAPPED(vp)	mapping_mapped(vn_to_inode(vp)->i_mapping)
-#define VN_CACHED(vp)	(vn_to_inode(vp)->i_mapping->nrpages)
+#define VN_CACHED(vp)	mapping_nrpages(vn_to_inode(vp)->i_mapping)
 #define VN_DIRTY(vp)	mapping_tagged(vn_to_inode(vp)->i_mapping, \
 					PAGECACHE_TAG_DIRTY)
 #define VN_TRUNC(vp)	((vp)->v_flag & VTRUNCATED)
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2007-04-12 18:33:48.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2007-04-12 18:39:35.000000000 +0200
@@ -440,7 +440,7 @@ struct address_space {
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
-	unsigned long		nrpages;	/* number of total pages */
+	unsigned long		__nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
@@ -455,6 +455,26 @@ struct address_space {
 	 * of struct page's "mapping" pointer be used for PAGE_MAPPING_ANON.
 	 */
 
+static inline void mapping_nrpages_init(struct address_space *mapping)
+{
+	mapping->__nrpages = 0;
+}
+
+static inline unsigned long mapping_nrpages(struct address_space *mapping)
+{
+	return mapping->__nrpages;
+}
+
+static inline void mapping_nrpages_inc(struct address_space *mapping)
+{
+	mapping->__nrpages++;
+}
+
+static inline void mapping_nrpages_dec(struct address_space *mapping)
+{
+	mapping->__nrpages--;
+}
+
 struct block_device {
 	dev_t			bd_dev;  /* not a kdev_t - it's a search key */
 	struct inode *		bd_inode;	/* will die */
Index: linux-2.6/ipc/shm.c
===================================================================
--- linux-2.6.orig/ipc/shm.c	2007-04-12 18:28:29.000000000 +0200
+++ linux-2.6/ipc/shm.c	2007-04-12 18:39:35.000000000 +0200
@@ -559,11 +559,11 @@ static void shm_get_stat(struct ipc_name
 
 		if (is_file_hugepages(shp->shm_file)) {
 			struct address_space *mapping = inode->i_mapping;
-			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping->nrpages;
+			*rss += (HPAGE_SIZE/PAGE_SIZE)*mapping_nrpages(mapping);
 		} else {
 			struct shmem_inode_info *info = SHMEM_I(inode);
 			spin_lock(&info->lock);
-			*rss += inode->i_mapping->nrpages;
+			*rss += mapping_nrpages(inode->i_mapping);
 			*swp += info->swapped;
 			spin_unlock(&info->lock);
 		}
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-04-12 18:33:48.000000000 +0200
+++ linux-2.6/mm/filemap.c	2007-04-12 18:40:35.000000000 +0200
@@ -118,7 +118,7 @@ void __remove_from_page_cache(struct pag
 
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
-	mapping->nrpages--;
+	mapping_nrpages_dec(mapping);
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 }
 
@@ -190,7 +190,7 @@ int __filemap_fdatawrite_range(struct ad
 	int ret;
 	struct writeback_control wbc = {
 		.sync_mode = sync_mode,
-		.nr_to_write = mapping->nrpages * 2,
+		.nr_to_write = mapping_nrpages(mapping) * 2,
 		.range_start = start,
 		.range_end = end,
 	};
@@ -372,7 +372,7 @@ int filemap_write_and_wait(struct addres
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping_nrpages(mapping)) {
 		err = filemap_fdatawrite(mapping);
 		/*
 		 * Even if the above returned error, the pages may be
@@ -406,7 +406,7 @@ int filemap_write_and_wait_range(struct 
 {
 	int err = 0;
 
-	if (mapping->nrpages) {
+	if (mapping_nrpages(mapping)) {
 		err = __filemap_fdatawrite_range(mapping, lstart, lend,
 						 WB_SYNC_ALL);
 		/* See comment of filemap_write_and_wait() */
@@ -448,7 +448,7 @@ int add_to_page_cache(struct page *page,
 			SetPageLocked(page);
 			page->mapping = mapping;
 			page->index = offset;
-			mapping->nrpages++;
+			mapping_nrpages_inc(mapping);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		spin_unlock_irq(&mapping->tree_lock);
@@ -2496,7 +2496,7 @@ generic_file_direct_IO(int rw, struct ki
 	 * about to write.  We do this *before* the write so that we can return
 	 * -EIO without clobbering -EIOCBQUEUED from ->direct_IO().
 	 */
-	if (rw == WRITE && mapping->nrpages) {
+	if (rw == WRITE && mapping_nrpages(mapping)) {
 		retval = invalidate_inode_pages2_range(mapping,
 					offset >> PAGE_CACHE_SHIFT, end);
 		if (retval)
@@ -2514,7 +2514,7 @@ generic_file_direct_IO(int rw, struct ki
 	 * thing to do, so we don't support it 100%.  If this invalidation
 	 * fails and we have -EIOCBQUEUED we ignore the failure.
 	 */
-	if (rw == WRITE && mapping->nrpages) {
+	if (rw == WRITE && mapping_nrpages(mapping)) {
 		int err = invalidate_inode_pages2_range(mapping,
 					      offset >> PAGE_CACHE_SHIFT, end);
 		if (err && retval >= 0)
Index: linux-2.6/mm/shmem.c
===================================================================
--- linux-2.6.orig/mm/shmem.c	2007-04-12 18:28:30.000000000 +0200
+++ linux-2.6/mm/shmem.c	2007-04-12 18:39:35.000000000 +0200
@@ -211,8 +211,8 @@ static void shmem_free_blocks(struct ino
  * We have to calculate the free blocks since the mm can drop
  * undirtied hole pages behind our back.
  *
- * But normally   info->alloced == inode->i_mapping->nrpages + info->swapped
- * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
+ * But normally   info->alloced == mapping_nrpages(inode->i_mapping) + info->swapped
+ * So mm freed is info->alloced - (mapping_nrpages(inode->i_mapping) + info->swapped)
  *
  * It has to be called with the spinlock held.
  */
@@ -221,7 +221,7 @@ static void shmem_recalc_inode(struct in
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	long freed;
 
-	freed = info->alloced - info->swapped - inode->i_mapping->nrpages;
+	freed = info->alloced - info->swapped - mapping_nrpages(inode->i_mapping);
 	if (freed > 0) {
 		info->alloced -= freed;
 		shmem_unacct_blocks(info->flags, freed);
@@ -667,7 +667,7 @@ static void shmem_truncate_range(struct 
 done1:
 	shmem_dir_unmap(dir);
 done2:
-	if (inode->i_mapping->nrpages && (info->flags & SHMEM_PAGEIN)) {
+	if (mapping_nrpages(inode->i_mapping) && (info->flags & SHMEM_PAGEIN)) {
 		/*
 		 * Call truncate_inode_pages again: racing shmem_unuse_inode
 		 * may have swizzled a page in from swap since vmtruncate or
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-04-12 18:33:48.000000000 +0200
+++ linux-2.6/mm/swap_state.c	2007-04-12 18:40:51.000000000 +0200
@@ -87,7 +87,7 @@ static int __add_to_swap_cache(struct pa
 			page_cache_get(page);
 			SetPageSwapCache(page);
 			set_page_private(page, entry.val);
-			total_swapcache_pages++;
+			mapping_nrpages_inc(&swapper_space);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
 		spin_unlock_irq(&swapper_space.tree_lock);
@@ -133,7 +133,7 @@ void __delete_from_swap_cache(struct pag
 	radix_tree_delete(&swapper_space.page_tree, page_private(page));
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
-	total_swapcache_pages--;
+	mapping_nrpages_dec(&swapper_space);
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	INC_CACHE_INFO(del_total);
 }
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-04-12 18:33:48.000000000 +0200
+++ linux-2.6/mm/truncate.c	2007-04-12 18:39:35.000000000 +0200
@@ -163,7 +163,7 @@ void truncate_inode_pages_range(struct a
 	pgoff_t next;
 	int i;
 
-	if (mapping->nrpages == 0)
+	if (mapping_nrpages(mapping) == 0)
 		return;
 
 	BUG_ON((lend & (PAGE_CACHE_SIZE - 1)) != (PAGE_CACHE_SIZE - 1));
Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h	2007-04-12 18:28:29.000000000 +0200
+++ linux-2.6/include/linux/swap.h	2007-04-12 18:39:35.000000000 +0200
@@ -224,7 +224,7 @@ extern int end_swap_bio_read(struct bio 
 
 /* linux/mm/swap_state.c */
 extern struct address_space swapper_space;
-#define total_swapcache_pages  swapper_space.nrpages
+#define total_swapcache_pages  mapping_nrpages(&swapper_space)
 extern void show_swap_cache_info(void);
 extern int add_to_swap(struct page *, gfp_t);
 extern void __delete_from_swap_cache(struct page *);
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2007-04-12 18:33:48.000000000 +0200
+++ linux-2.6/fs/inode.c	2007-04-12 18:39:35.000000000 +0200
@@ -246,7 +246,7 @@ void clear_inode(struct inode *inode)
 	might_sleep();
 	invalidate_inode_buffers(inode);
        
-	BUG_ON(inode->i_data.nrpages);
+	BUG_ON(mapping_nrpages(&inode->i_data));
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
 	wait_on_inode(inode);
@@ -279,7 +279,7 @@ static void dispose_list(struct list_hea
 		inode = list_entry(head->next, struct inode, i_list);
 		list_del(&inode->i_list);
 
-		if (inode->i_data.nrpages)
+		if (mapping_nrpages(&inode->i_data))
 			truncate_inode_pages(&inode->i_data, 0);
 		clear_inode(inode);
 
@@ -371,7 +371,7 @@ static int can_unuse(struct inode *inode
 		return 0;
 	if (atomic_read(&inode->i_count))
 		return 0;
-	if (inode->i_data.nrpages)
+	if (mapping_nrpages(&inode->i_data))
 		return 0;
 	return 1;
 }
@@ -410,7 +410,7 @@ static void prune_icache(int nr_to_scan)
 			list_move(&inode->i_list, &inode_unused);
 			continue;
 		}
-		if (inode_has_buffers(inode) || inode->i_data.nrpages) {
+		if (inode_has_buffers(inode) || mapping_nrpages(&inode->i_data)) {
 			__iget(inode);
 			spin_unlock(&inode_lock);
 			if (remove_inode_buffers(inode))
@@ -1058,7 +1058,7 @@ static void generic_forget_inode(struct 
 	inode->i_state |= I_FREEING;
 	inodes_stat.nr_inodes--;
 	spin_unlock(&inode_lock);
-	if (inode->i_data.nrpages)
+	if (mapping_nrpages(&inode->i_data))
 		truncate_inode_pages(&inode->i_data, 0);
 	clear_inode(inode);
 	wake_up_inode(inode);

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 3/6] mm: lock_page_ref
  2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 1/6] radix-tree: concurrent write side support Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 2/6] mm/fs: abstract address_space::nrpages Peter Zijlstra
@ 2007-04-18 20:12 ` Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 4/6] mm: concurrent pagecache write side Peter Zijlstra
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

[-- Attachment #1: lock_page_ref.patch --]
[-- Type: text/plain, Size: 13756 bytes --]

Change the PG_nonewref operations into locking primitives and place them
so that they provide page level serialization with regard to the page_tree
operations. (basically replace the tree_lock with a per page lock).

The normal page lock has sufficiently different (and overlapping) scope and
protection rules that this second lock is needed.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c             |    6 ++++--
 include/linux/pagemap.h |   44 ++++++++++++++++++++++++++++++++------------
 mm/filemap.c            |   14 ++++++++------
 mm/migrate.c            |   12 ++++++------
 mm/page-writeback.c     |   18 ++++++++++++------
 mm/swap_state.c         |   14 ++++++++------
 mm/swapfile.c           |    6 ++++--
 mm/truncate.c           |    9 ++++++---
 mm/vmscan.c             |   14 +++++++-------
 9 files changed, 87 insertions(+), 50 deletions(-)

Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/include/linux/pagemap.h	2007-04-13 12:26:43.000000000 +0200
@@ -13,6 +13,7 @@
 #include <linux/gfp.h>
 #include <linux/page-flags.h>
 #include <linux/hardirq.h> /* for in_interrupt() */
+#include <linux/bit_spinlock.h>
 
 /*
  * Bits in mapping->flags.  The lower __GFP_BITS_SHIFT bits are the page
@@ -53,6 +54,47 @@ static inline void mapping_set_gfp_mask(
 #define page_cache_release(page)	put_page(page)
 void release_pages(struct page **pages, int nr, int cold);
 
+static inline void lock_page_ref(struct page *page)
+{
+	bit_spin_lock(PG_nonewrefs, &page->flags);
+	smp_wmb();
+}
+
+static inline void unlock_page_ref(struct page *page)
+{
+	bit_spin_unlock(PG_nonewrefs, &page->flags);
+}
+
+static inline void wait_on_page_ref(struct page *page)
+{
+	while (unlikely(test_bit(PG_nonewrefs, &page->flags)))
+		cpu_relax();
+}
+
+#define lock_page_ref_irq(page)					\
+	do {							\
+		local_irq_disable();				\
+		lock_page_ref(page);				\
+	} while (0)
+
+#define unlock_page_ref_irq(page)				\
+	do {							\
+		unlock_page_ref(page);				\
+		local_irq_enable();				\
+	} while (0)
+
+#define lock_page_ref_irqsave(page, flags)			\
+	do {							\
+		local_irq_save(flags);				\
+		lock_page_ref(page);				\
+	} while (0)
+
+#define unlock_page_ref_irqrestore(page, flags)			\
+	do {							\
+		unlock_page_ref(page);				\
+		local_irq_restore(flags);			\
+	} while (0)
+
 /*
  * speculatively take a reference to a page.
  * If the page is free (_count == 0), then _count is untouched, and 0
@@ -128,8 +170,7 @@ static inline int page_cache_get_specula
 	 * page refcount has been raised. See below comment.
 	 */
 
-	while (unlikely(PageNoNewRefs(page)))
-		cpu_relax();
+	wait_on_page_ref(page);
 
 	/*
 	 * smp_rmb is to ensure the load of page->flags (for PageNoNewRefs())
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-04-13 12:26:41.000000000 +0200
+++ linux-2.6/mm/filemap.c	2007-04-13 12:26:43.000000000 +0200
@@ -128,9 +128,11 @@ void remove_from_page_cache(struct page 
 
 	BUG_ON(!PageLocked(page));
 
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 }
 
 static int sync_page(void *word)
@@ -440,8 +442,8 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
-		set_page_nonewrefs(page);
-		spin_lock_irq(&mapping->tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&mapping->tree_lock);
 		error = radix_tree_insert(&mapping->page_tree, offset, page);
 		if (!error) {
 			page_cache_get(page);
@@ -451,8 +453,8 @@ int add_to_page_cache(struct page *page,
 			mapping_nrpages_inc(mapping);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock_irq(&mapping->tree_lock);
-		clear_page_nonewrefs(page);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
 	return error;
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/mm/migrate.c	2007-04-13 12:26:43.000000000 +0200
@@ -303,16 +303,16 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
-	set_page_nonewrefs(page);
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 
 	pslot = radix_tree_lookup_slot(&mapping->page_tree,
  					page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
-		spin_unlock_irq(&mapping->tree_lock);
-		clear_page_nonewrefs(page);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irq(page);
 		return -EAGAIN;
 	}
 
@@ -329,8 +329,8 @@ static int migrate_page_move_mapping(str
 
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
-  	spin_unlock_irq(&mapping->tree_lock);
-	clear_page_nonewrefs(page);
+  	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 
 	/*
 	 * Drop cache reference from old page.
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-04-13 12:26:41.000000000 +0200
+++ linux-2.6/mm/swap_state.c	2007-04-13 12:26:43.000000000 +0200
@@ -79,8 +79,8 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
-		set_page_nonewrefs(page);
-		spin_lock_irq(&swapper_space.tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&swapper_space.tree_lock);
 		error = radix_tree_insert(&swapper_space.page_tree,
 						entry.val, page);
 		if (!error) {
@@ -90,8 +90,8 @@ static int __add_to_swap_cache(struct pa
 			mapping_nrpages_inc(&swapper_space);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock_irq(&swapper_space.tree_lock);
-		clear_page_nonewrefs(page);
+		spin_unlock(&swapper_space.tree_lock);
+		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
 	return error;
@@ -202,9 +202,11 @@ void delete_from_swap_cache(struct page 
 
 	entry.val = page_private(page);
 
-	spin_lock_irq(&swapper_space.tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock_irq(&swapper_space.tree_lock);
+	spin_unlock(&swapper_space.tree_lock);
+	unlock_page_ref_irq(page);
 
 	swap_free(entry);
 	page_cache_release(page);
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2007-04-13 12:26:43.000000000 +0200
@@ -390,8 +390,8 @@ int remove_mapping(struct address_space 
 	BUG_ON(!PageLocked(page));
 	BUG_ON(mapping != page_mapping(page));
 
-	set_page_nonewrefs(page);
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -426,22 +426,22 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
 
 free_it:
-	__clear_page_nonewrefs(page);
+	unlock_page_ref_irq(page);
 	__put_page(page); /* The pagecache ref */
 	return 1;
 
 cannot_free:
-	spin_unlock_irq(&mapping->tree_lock);
-	clear_page_nonewrefs(page);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	return 0;
 }
 
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-13 12:26:41.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-13 12:26:43.000000000 +0200
@@ -729,7 +729,8 @@ int __set_page_dirty_buffers(struct page
 	if (TestSetPageDirty(page))
 		return 0;
 
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
@@ -738,7 +739,8 @@ int __set_page_dirty_buffers(struct page
 		radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 	}
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return 1;
 }
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-13 12:26:43.000000000 +0200
@@ -776,7 +776,8 @@ int __set_page_dirty_nobuffers(struct pa
 		if (!mapping)
 			return 1;
 
-		spin_lock_irq(&mapping->tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
 			BUG_ON(mapping2 != mapping);
@@ -787,7 +788,8 @@ int __set_page_dirty_nobuffers(struct pa
 			radix_tree_tag_set(&mapping->page_tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
 		}
-		spin_unlock_irq(&mapping->tree_lock);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irq(page);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
 			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -924,13 +926,15 @@ int test_clear_page_writeback(struct pag
 		unsigned long flags;
 		int ret;
 
-		spin_lock_irqsave(&mapping->tree_lock, flags);
+		lock_page_ref_irqsave(page, flags);
+		spin_lock(&mapping->tree_lock);
 		ret = TestClearPageWriteback(page);
 		if (ret)
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irqrestore(page, flags);
 		return ret;
 	}
 	return TestClearPageWriteback(page);
@@ -944,7 +948,8 @@ int test_set_page_writeback(struct page 
 		unsigned long flags;
 		int ret;
 
-		spin_lock_irqsave(&mapping->tree_lock, flags);
+		lock_page_ref_irqsave(page, flags);
+		spin_lock(&mapping->tree_lock);
 		ret = TestSetPageWriteback(page);
 		if (!ret)
 			radix_tree_tag_set(&mapping->page_tree,
@@ -954,7 +959,8 @@ int test_set_page_writeback(struct page 
 			radix_tree_tag_clear(&mapping->page_tree,
 						page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		spin_unlock(&mapping->tree_lock);
+		unlock_page_ref_irqrestore(page, flags);
 		return ret;
 	}
 	return TestSetPageWriteback(page);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/mm/swapfile.c	2007-04-13 12:26:43.000000000 +0200
@@ -367,13 +367,15 @@ int remove_exclusive_swap_page(struct pa
 	retval = 0;
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
-		spin_lock_irq(&swapper_space.tree_lock);
+		lock_page_ref_irq(page);
+		spin_lock(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock_irq(&swapper_space.tree_lock);
+		spin_unlock(&swapper_space.tree_lock);
+		unlock_page_ref_irq(page);
 	}
 	spin_unlock(&swap_lock);
 
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-04-13 12:26:41.000000000 +0200
+++ linux-2.6/mm/truncate.c	2007-04-13 12:26:43.000000000 +0200
@@ -328,18 +328,21 @@ invalidate_complete_page2(struct address
 	if (PagePrivate(page) && !try_to_release_page(page, GFP_KERNEL))
 		return 0;
 
-	spin_lock_irq(&mapping->tree_lock);
+	lock_page_ref_irq(page);
+	spin_lock(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	spin_unlock_irq(&mapping->tree_lock);
+	spin_unlock(&mapping->tree_lock);
+	unlock_page_ref_irq(page);
 	return 0;
 }
 
Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/include/linux/page-flags.h	2007-04-13 12:26:49.000000000 +0200
@@ -273,25 +273,4 @@ static inline void set_page_writeback(st
 	test_set_page_writeback(page);
 }
 
-static inline void set_page_nonewrefs(struct page *page)
-{
-	preempt_disable();
-	SetPageNoNewRefs(page);
-	smp_wmb();
-}
-
-static inline void __clear_page_nonewrefs(struct page *page)
-{
-	smp_wmb();
-	__ClearPageNoNewRefs(page);
-	preempt_enable();
-}
-
-static inline void clear_page_nonewrefs(struct page *page)
-{
-	smp_wmb();
-	ClearPageNoNewRefs(page);
-	preempt_enable();
-}
-
 #endif	/* PAGE_FLAGS_H */

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 4/6] mm: concurrent pagecache write side
  2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
                   ` (2 preceding siblings ...)
  2007-04-18 20:12 ` [PATCH 3/6] mm: lock_page_ref Peter Zijlstra
@ 2007-04-18 20:12 ` Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 5/6] radix-tree: optimistic locking Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 6/6] debug: optimistic lock histogram Peter Zijlstra
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

[-- Attachment #1: mm-concurrent-pagecache.patch --]
[-- Type: text/plain, Size: 16421 bytes --]

Remove the tree_lock, change address_space::nrpages to atomic_long_t because
its not protected any longer and use the concurrent radix tree API to protect
the modifying radix tree operations.

The tree_lock is actually renamed to priv_lock and its only remaining user will
be the __flush_dcache_page logic on arm an parisc. Another potential user would
be the per address_space node mask allocation Christoph is working on.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/buffer.c                     |    6 ++++--
 fs/inode.c                      |    2 +-
 include/asm-arm/cacheflush.h    |    4 ++--
 include/asm-parisc/cacheflush.h |    4 ++--
 include/linux/fs.h              |   12 ++++++------
 mm/filemap.c                    |   13 +++++++------
 mm/migrate.c                    |    9 +++++----
 mm/page-writeback.c             |   28 +++++++++++++++++++---------
 mm/swap_state.c                 |   13 ++++++++-----
 mm/swapfile.c                   |    2 --
 mm/truncate.c                   |    3 ---
 mm/vmscan.c                     |    4 ----
 12 files changed, 54 insertions(+), 46 deletions(-)

Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/fs/buffer.c	2007-04-13 12:27:11.000000000 +0200
@@ -730,16 +730,18 @@ int __set_page_dirty_buffers(struct page
 		return 0;
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
+		DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
 		if (mapping_cap_account_dirty(mapping)) {
 			__inc_zone_page_state(page, NR_FILE_DIRTY);
 			task_io_account_write(PAGE_CACHE_SIZE);
 		}
-		radix_tree_tag_set(&mapping->page_tree,
+		radix_tree_lock(&ctx);
+		radix_tree_tag_set(ctx.tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+		radix_tree_unlock(&ctx);
 	}
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
 	return 1;
Index: linux-2.6/fs/inode.c
===================================================================
--- linux-2.6.orig/fs/inode.c	2007-04-13 12:26:41.000000000 +0200
+++ linux-2.6/fs/inode.c	2007-04-13 12:27:05.000000000 +0200
@@ -193,7 +193,7 @@ void inode_init_once(struct inode *inode
 	mutex_init(&inode->i_mutex);
 	init_rwsem(&inode->i_alloc_sem);
 	INIT_RADIX_TREE(&inode->i_data.page_tree, GFP_ATOMIC);
-	spin_lock_init(&inode->i_data.tree_lock);
+	spin_lock_init(&inode->i_data.priv_lock);
 	spin_lock_init(&inode->i_data.i_mmap_lock);
 	INIT_LIST_HEAD(&inode->i_data.private_list);
 	spin_lock_init(&inode->i_data.private_lock);
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2007-04-13 12:26:41.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2007-04-13 12:27:05.000000000 +0200
@@ -434,13 +434,13 @@ struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
 	struct radix_tree_root	page_tree;	/* radix tree of all pages */
-	spinlock_t		tree_lock;	/* and lock protecting it */
+	spinlock_t		priv_lock;	/* spinlock protecting various stuffs */
 	unsigned int		i_mmap_writable;/* count VM_SHARED mappings */
 	struct prio_tree_root	i_mmap;		/* tree of private and shared mappings */
 	struct list_head	i_mmap_nonlinear;/*list VM_NONLINEAR mappings */
 	spinlock_t		i_mmap_lock;	/* protect tree, count, list */
 	unsigned int		truncate_count;	/* Cover race condition with truncate */
-	unsigned long		__nrpages;	/* number of total pages */
+	atomic_long_t		__nrpages;	/* number of total pages */
 	pgoff_t			writeback_index;/* writeback starts here */
 	const struct address_space_operations *a_ops;	/* methods */
 	unsigned long		flags;		/* error bits/gfp mask */
@@ -457,22 +457,22 @@ struct address_space {
 
 static inline void mapping_nrpages_init(struct address_space *mapping)
 {
-	mapping->__nrpages = 0;
+	mapping->__nrpages = (atomic_long_t)ATOMIC_LONG_INIT(0);
 }
 
 static inline unsigned long mapping_nrpages(struct address_space *mapping)
 {
-	return mapping->__nrpages;
+	return (unsigned long)atomic_long_read(&mapping->__nrpages);
 }
 
 static inline void mapping_nrpages_inc(struct address_space *mapping)
 {
-	mapping->__nrpages++;
+	atomic_long_inc(&mapping->__nrpages);
 }
 
 static inline void mapping_nrpages_dec(struct address_space *mapping)
 {
-	mapping->__nrpages--;
+	atomic_long_dec(&mapping->__nrpages);
 }
 
 struct block_device {
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/filemap.c	2007-04-13 12:27:11.000000000 +0200
@@ -115,8 +115,11 @@ generic_file_direct_IO(int rw, struct ki
 void __remove_from_page_cache(struct page *page)
 {
 	struct address_space *mapping = page->mapping;
+	DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
 
-	radix_tree_delete(&mapping->page_tree, page->index);
+	radix_tree_lock(&ctx);
+	radix_tree_delete(ctx.tree, page->index);
+	radix_tree_unlock(&ctx);
 	page->mapping = NULL;
 	mapping_nrpages_dec(mapping);
 	__dec_zone_page_state(page, NR_FILE_PAGES);
@@ -124,14 +127,10 @@ void __remove_from_page_cache(struct pag
 
 void remove_from_page_cache(struct page *page)
 {
-	struct address_space *mapping = page->mapping;
-
 	BUG_ON(!PageLocked(page));
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 }
 
@@ -442,9 +441,12 @@ int add_to_page_cache(struct page *page,
 	int error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
 
 	if (error == 0) {
+		DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
 		lock_page_ref_irq(page);
-		spin_lock(&mapping->tree_lock);
-		error = radix_tree_insert(&mapping->page_tree, offset, page);
+		radix_tree_lock(&ctx);
+		error = radix_tree_insert(ctx.tree, offset, page);
+		radix_tree_unlock(&ctx);
 		if (!error) {
 			page_cache_get(page);
 			SetPageLocked(page);
@@ -453,7 +455,6 @@ int add_to_page_cache(struct page *page,
 			mapping_nrpages_inc(mapping);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock(&mapping->tree_lock);
 		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/migrate.c	2007-04-13 12:27:11.000000000 +0200
@@ -295,6 +295,7 @@ static int migrate_page_move_mapping(str
 		struct page *newpage, struct page *page)
 {
 	void **pslot;
+	struct radix_tree_context ctx;
 
 	if (!mapping) {
 		/* Anonymous page */
@@ -303,15 +304,14 @@ static int migrate_page_move_mapping(str
 		return 0;
 	}
 
+	init_radix_tree_context(&ctx, &mapping->page_tree);
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
-
-	pslot = radix_tree_lookup_slot(&mapping->page_tree,
- 					page_index(page));
+	radix_tree_lock(&ctx);
+	pslot = radix_tree_lookup_slot(ctx.tree, page_index(page));
 
 	if (page_count(page) != 2 + !!PagePrivate(page) ||
 			(struct page *)radix_tree_deref_slot(pslot) != page) {
-		spin_unlock(&mapping->tree_lock);
+		radix_tree_unlock(&ctx);
 		unlock_page_ref_irq(page);
 		return -EAGAIN;
 	}
@@ -329,7 +329,7 @@ static int migrate_page_move_mapping(str
 
 	radix_tree_replace_slot(pslot, newpage);
 	page->mapping = NULL;
-  	spin_unlock(&mapping->tree_lock);
+	radix_tree_unlock(&ctx);
 	unlock_page_ref_irq(page);
 
 	/*
Index: linux-2.6/mm/page-writeback.c
===================================================================
--- linux-2.6.orig/mm/page-writeback.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/page-writeback.c	2007-04-13 12:27:11.000000000 +0200
@@ -777,18 +777,20 @@ int __set_page_dirty_nobuffers(struct pa
 			return 1;
 
 		lock_page_ref_irq(page);
-		spin_lock(&mapping->tree_lock);
 		mapping2 = page_mapping(page);
 		if (mapping2) { /* Race with truncate? */
+			DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
 			BUG_ON(mapping2 != mapping);
 			if (mapping_cap_account_dirty(mapping)) {
 				__inc_zone_page_state(page, NR_FILE_DIRTY);
 				task_io_account_write(PAGE_CACHE_SIZE);
 			}
-			radix_tree_tag_set(&mapping->page_tree,
+			radix_tree_lock(&ctx);
+			radix_tree_tag_set(ctx.tree,
 				page_index(page), PAGECACHE_TAG_DIRTY);
+			radix_tree_unlock(&ctx);
 		}
-		spin_unlock(&mapping->tree_lock);
 		unlock_page_ref_irq(page);
 		if (mapping->host) {
 			/* !PageAnon && !swapper_space */
@@ -927,13 +929,15 @@ int test_clear_page_writeback(struct pag
 		int ret;
 
 		lock_page_ref_irqsave(page, flags);
-		spin_lock(&mapping->tree_lock);
 		ret = TestClearPageWriteback(page);
-		if (ret)
-			radix_tree_tag_clear(&mapping->page_tree,
-						page_index(page),
+		if (ret) {
+			DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
+
+			radix_tree_lock(&ctx);
+			radix_tree_tag_clear(ctx.tree, page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		spin_unlock(&mapping->tree_lock);
+			radix_tree_unlock(&ctx);
+		}
 		unlock_page_ref_irqrestore(page, flags);
 		return ret;
 	}
@@ -947,19 +951,22 @@ int test_set_page_writeback(struct page 
 	if (mapping) {
 		unsigned long flags;
 		int ret;
+		DEFINE_RADIX_TREE_CONTEXT(ctx, &mapping->page_tree);
 
 		lock_page_ref_irqsave(page, flags);
-		spin_lock(&mapping->tree_lock);
 		ret = TestSetPageWriteback(page);
-		if (!ret)
-			radix_tree_tag_set(&mapping->page_tree,
-						page_index(page),
+		if (!ret) {
+			radix_tree_lock(&ctx);
+			radix_tree_tag_set(ctx.tree, page_index(page),
 						PAGECACHE_TAG_WRITEBACK);
-		if (!PageDirty(page))
-			radix_tree_tag_clear(&mapping->page_tree,
-						page_index(page),
+			radix_tree_unlock(&ctx);
+		}
+		if (!PageDirty(page)) {
+			radix_tree_lock(&ctx);
+			radix_tree_tag_clear(ctx.tree, page_index(page),
 						PAGECACHE_TAG_DIRTY);
-		spin_unlock(&mapping->tree_lock);
+			radix_tree_unlock(&ctx);
+		}
 		unlock_page_ref_irqrestore(page, flags);
 		return ret;
 	}
Index: linux-2.6/mm/swap_state.c
===================================================================
--- linux-2.6.orig/mm/swap_state.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/swap_state.c	2007-04-13 12:27:11.000000000 +0200
@@ -38,7 +38,6 @@ static struct backing_dev_info swap_back
 
 struct address_space swapper_space = {
 	.page_tree	= RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN),
-	.tree_lock	= __SPIN_LOCK_UNLOCKED(swapper_space.tree_lock),
 	.a_ops		= &swap_aops,
 	.i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
 	.backing_dev_info = &swap_backing_dev_info,
@@ -79,10 +78,12 @@ static int __add_to_swap_cache(struct pa
 	BUG_ON(PagePrivate(page));
 	error = radix_tree_preload(gfp_mask);
 	if (!error) {
+		DEFINE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree);
+
 		lock_page_ref_irq(page);
-		spin_lock(&swapper_space.tree_lock);
-		error = radix_tree_insert(&swapper_space.page_tree,
-						entry.val, page);
+		radix_tree_lock(&ctx);
+		error = radix_tree_insert(ctx.tree, entry.val, page);
+		radix_tree_unlock(&ctx);
 		if (!error) {
 			page_cache_get(page);
 			SetPageSwapCache(page);
@@ -90,7 +91,6 @@ static int __add_to_swap_cache(struct pa
 			mapping_nrpages_inc(&swapper_space);
 			__inc_zone_page_state(page, NR_FILE_PAGES);
 		}
-		spin_unlock(&swapper_space.tree_lock);
 		unlock_page_ref_irq(page);
 		radix_tree_preload_end();
 	}
@@ -125,12 +125,16 @@ static int add_to_swap_cache(struct page
  */
 void __delete_from_swap_cache(struct page *page)
 {
+	DEFINE_RADIX_TREE_CONTEXT(ctx, &swapper_space.page_tree);
+
 	BUG_ON(!PageLocked(page));
 	BUG_ON(!PageSwapCache(page));
 	BUG_ON(PageWriteback(page));
 	BUG_ON(PagePrivate(page));
 
-	radix_tree_delete(&swapper_space.page_tree, page_private(page));
+	radix_tree_lock(&ctx);
+	radix_tree_delete(ctx.tree, page_private(page));
+	radix_tree_unlock(&ctx);
 	set_page_private(page, 0);
 	ClearPageSwapCache(page);
 	mapping_nrpages_dec(&swapper_space);
@@ -203,9 +207,7 @@ void delete_from_swap_cache(struct page 
 	entry.val = page_private(page);
 
 	lock_page_ref_irq(page);
-	spin_lock(&swapper_space.tree_lock);
 	__delete_from_swap_cache(page);
-	spin_unlock(&swapper_space.tree_lock);
 	unlock_page_ref_irq(page);
 
 	swap_free(entry);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/swapfile.c	2007-04-13 12:27:05.000000000 +0200
@@ -368,13 +368,11 @@ int remove_exclusive_swap_page(struct pa
 	if (p->swap_map[swp_offset(entry)] == 1) {
 		/* Recheck the page count with the swapcache lock held.. */
 		lock_page_ref_irq(page);
-		spin_lock(&swapper_space.tree_lock);
 		if ((page_count(page) == 2) && !PageWriteback(page)) {
 			__delete_from_swap_cache(page);
 			SetPageDirty(page);
 			retval = 1;
 		}
-		spin_unlock(&swapper_space.tree_lock);
 		unlock_page_ref_irq(page);
 	}
 	spin_unlock(&swap_lock);
Index: linux-2.6/mm/truncate.c
===================================================================
--- linux-2.6.orig/mm/truncate.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/truncate.c	2007-04-13 12:27:05.000000000 +0200
@@ -329,19 +329,16 @@ invalidate_complete_page2(struct address
 		return 0;
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	if (PageDirty(page))
 		goto failed;
 
 	BUG_ON(PagePrivate(page));
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	ClearPageUptodate(page);
 	page_cache_release(page);	/* pagecache ref */
 	return 1;
 failed:
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	return 0;
 }
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2007-04-13 12:26:43.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2007-04-13 12:27:05.000000000 +0200
@@ -391,7 +391,6 @@ int remove_mapping(struct address_space 
 	BUG_ON(mapping != page_mapping(page));
 
 	lock_page_ref_irq(page);
-	spin_lock(&mapping->tree_lock);
 	/*
 	 * The non racy check for a busy page.
 	 *
@@ -426,13 +425,11 @@ int remove_mapping(struct address_space 
 	if (PageSwapCache(page)) {
 		swp_entry_t swap = { .val = page_private(page) };
 		__delete_from_swap_cache(page);
-		spin_unlock(&mapping->tree_lock);
 		swap_free(swap);
 		goto free_it;
 	}
 
 	__remove_from_page_cache(page);
-	spin_unlock(&mapping->tree_lock);
 
 free_it:
 	unlock_page_ref_irq(page);
@@ -440,7 +437,6 @@ free_it:
 	return 1;
 
 cannot_free:
-	spin_unlock(&mapping->tree_lock);
 	unlock_page_ref_irq(page);
 	return 0;
 }
Index: linux-2.6/include/asm-arm/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-arm/cacheflush.h	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/include/asm-arm/cacheflush.h	2007-04-13 12:27:05.000000000 +0200
@@ -405,9 +405,9 @@ static inline void flush_anon_page(struc
 }
 
 #define flush_dcache_mmap_lock(mapping) \
-	spin_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->priv_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	spin_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->priv_lock)
 
 #define flush_icache_user_range(vma,page,addr,len) \
 	flush_dcache_page(page)
Index: linux-2.6/include/asm-parisc/cacheflush.h
===================================================================
--- linux-2.6.orig/include/asm-parisc/cacheflush.h	2007-04-13 12:26:07.000000000 +0200
+++ linux-2.6/include/asm-parisc/cacheflush.h	2007-04-13 12:27:05.000000000 +0200
@@ -45,9 +45,9 @@ void flush_cache_mm(struct mm_struct *mm
 extern void flush_dcache_page(struct page *page);
 
 #define flush_dcache_mmap_lock(mapping) \
-	spin_lock_irq(&(mapping)->tree_lock)
+	spin_lock_irq(&(mapping)->priv_lock)
 #define flush_dcache_mmap_unlock(mapping) \
-	spin_unlock_irq(&(mapping)->tree_lock)
+	spin_unlock_irq(&(mapping)->priv_lock)
 
 #define flush_icache_page(vma,page)	do { 		\
 	flush_kernel_dcache_page(page);			\

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 5/6] radix-tree: optimistic locking
  2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
                   ` (3 preceding siblings ...)
  2007-04-18 20:12 ` [PATCH 4/6] mm: concurrent pagecache write side Peter Zijlstra
@ 2007-04-18 20:12 ` Peter Zijlstra
  2007-04-18 20:12 ` [PATCH 6/6] debug: optimistic lock histogram Peter Zijlstra
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

[-- Attachment #1: radix-tree-optimistic.patch --]
[-- Type: text/plain, Size: 11963 bytes --]

Implement optimistic locking for the concurrent radix tree.

Optimistic locking is aimed at avoiding taking higher level node locks to
reduce cache-line bouncing (and lock contention). We descent the tree using an
RCU lookup, looking for the lowest modification termination point.

If found, we try to acquire the lock of that node. After we have obtained this
lock, we will need to validate if the initial conditions still hold true.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/radix-tree.h |   27 +++++-
 init/Kconfig               |    6 +
 lib/radix-tree.c           |  175 ++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 187 insertions(+), 21 deletions(-)

Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2007-04-18 09:56:20.000000000 +0200
+++ linux-2.6/lib/radix-tree.c	2007-04-18 09:56:42.000000000 +0200
@@ -362,6 +362,117 @@ static inline void radix_path_unlock(str
 #define radix_path_unlock(context, punlock) do { } while (0)
 #endif
 
+#ifdef CONFIG_RADIX_TREE_OPTIMISTIC
+typedef int (*radix_valid_fn)(struct radix_tree_node *, int, int);
+
+static struct radix_tree_node *
+radix_optimistic_lookup(struct radix_tree_context *context, unsigned long index,
+		int tag, radix_valid_fn valid)
+{
+	unsigned int height, shift;
+	struct radix_tree_node *node, *ret = NULL, **slot;
+	struct radix_tree_root *root = context->root;
+
+	node = rcu_dereference(root->rnode);
+	if (node == NULL)
+		return NULL;
+
+	if (!radix_tree_is_indirect_ptr(node))
+			return NULL;
+
+	node = radix_tree_indirect_to_ptr(node);
+
+	height = node->height;
+	if (index > radix_tree_maxindex(height))
+		return NULL;
+
+	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
+	do {
+		int offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		if ((*valid)(node, offset, tag))
+			ret = node;
+		slot = (struct radix_tree_node **)(node->slots + offset);
+		node = rcu_dereference(*slot);
+		if (!node)
+			break;
+
+		shift -= RADIX_TREE_MAP_SHIFT;
+		height--;
+	} while (height > 0);
+
+	return ret;
+}
+
+static struct radix_tree_node *
+__radix_optimistic_lock(struct radix_tree_context *context, unsigned long index,
+	       	int tag, radix_valid_fn valid)
+{
+	struct radix_tree_node *node;
+	spinlock_t *locked;
+	unsigned int shift, offset;
+
+	node = radix_optimistic_lookup(context, index, tag, valid);
+	if (!node)
+		goto out;
+
+	locked = radix_node_lock(context->root, node);
+	if (!locked)
+		goto out;
+
+#if 0
+	if (node != radix_optimistic_lookup(context, index, tag, valid))
+		goto out_unlock;
+#else
+	/* check if the node got freed */
+	if (!node->count)
+		goto out_unlock;
+
+	/* check if the node is still a valid termination point */
+	shift = (node->height - 1) * RADIX_TREE_MAP_SHIFT;
+	offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+	if (!(*valid)(node, offset, tag))
+		goto out_unlock;
+#endif
+
+	context->locked = locked;
+	return node;
+
+out_unlock:
+	spin_unlock(locked);
+out:
+	return NULL;
+}
+
+static struct radix_tree_node *
+radix_optimistic_lock(struct radix_tree_context *context, unsigned long index,
+		int tag, radix_valid_fn valid)
+{
+	struct radix_tree_node *node = NULL;
+
+	if (context) {
+		node = __radix_optimistic_lock(context, index, tag, valid);
+		if (!node) {
+			BUG_ON(context->locked);
+			spin_lock(&context->root->lock);
+			context->locked = &context->root->lock;
+		}
+	}
+	return node;
+}
+
+static int radix_valid_always(struct radix_tree_node *node, int offset, int tag)
+{
+	return 1;
+}
+
+static int radix_valid_tag(struct radix_tree_node *node, int offset, int tag)
+{
+	return tag_get(node, tag, offset);
+}
+#else
+#define radix_optimistic_lock(context, index, tag, valid) NULL
+#endif
+
 /**
  *	radix_tree_insert    -    insert into a radix tree
  *	@root:		radix tree root
@@ -382,6 +493,13 @@ int radix_tree_insert(struct radix_tree_
 
 	BUG_ON(radix_tree_is_indirect_ptr(item));
 
+	node = radix_optimistic_lock(context, index, 0, radix_valid_always);
+	if (node) {
+		height = node->height;
+		shift = (height-1) * RADIX_TREE_MAP_SHIFT;
+		goto optimistic;
+	}
+
 	/* Make sure the tree is high enough.  */
 	if (index > radix_tree_maxindex(root->height)) {
 		error = radix_tree_extend(root, index);
@@ -390,7 +508,6 @@ int radix_tree_insert(struct radix_tree_
 	}
 
 	slot = radix_tree_indirect_to_ptr(root->rnode);
-
 	height = root->height;
 	shift = (height-1) * RADIX_TREE_MAP_SHIFT;
 
@@ -409,11 +526,11 @@ int radix_tree_insert(struct radix_tree_
 		}
 
 		/* Go a level down */
-		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		node = slot;
-
 		radix_ladder_lock(context, node);
 
+optimistic:
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		slot = node->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
@@ -456,6 +573,10 @@ void **radix_tree_lookup_slot(struct rad
 	struct radix_tree_node *node, **slot;
 	RADIX_TREE_CONTEXT(context, root);
 
+	node = radix_optimistic_lock(context, index, 0, radix_valid_always);
+	if (node)
+		goto optimistic;
+
 	node = rcu_dereference(root->rnode);
 	if (node == NULL)
 		return NULL;
@@ -467,6 +588,7 @@ void **radix_tree_lookup_slot(struct rad
 	}
 	node = radix_tree_indirect_to_ptr(node);
 
+optimistic:
 	height = node->height;
 	if (index > radix_tree_maxindex(height))
 		return NULL;
@@ -559,6 +681,13 @@ void *radix_tree_tag_set(struct radix_tr
 	struct radix_tree_node *slot;
 	RADIX_TREE_CONTEXT(context, root);
 
+	slot = radix_optimistic_lock(context, index, tag, radix_valid_tag);
+	if (slot) {
+		height = slot->height;
+		shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+		goto optimistic;
+	}
+
 	height = root->height;
 	BUG_ON(index > radix_tree_maxindex(height));
 
@@ -574,6 +703,7 @@ void *radix_tree_tag_set(struct radix_tr
 
 		radix_ladder_lock(context, slot);
 
+optimistic:
 		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
 		if (!tag_get(slot, tag, offset))
 			tag_set(slot, tag, offset);
@@ -590,13 +720,13 @@ EXPORT_SYMBOL(radix_tree_tag_set);
 /*
  * the change can never propagate upwards from here.
  */
-static inline int radix_tree_unlock_tag(struct radix_tree_root *root,
-		struct radix_tree_path *pathp, int tag)
+static
+int radix_valid_tag_clear(struct radix_tree_node *node, int offset, int tag)
 {
 	int this, other;
 
-	this = tag_get(pathp->node, tag, pathp->offset);
-	other = any_tag_set_but(pathp->node, tag, pathp->offset);
+	this = tag_get(node, tag, offset);
+	other = any_tag_set_but(node, tag, offset);
 
 	return !this || other;
 }
@@ -621,9 +751,22 @@ void *radix_tree_tag_clear(struct radix_
 	struct radix_tree_path path[RADIX_TREE_MAX_PATH], *pathp = path;
 	struct radix_tree_path *punlock = path, *piter;
 	struct radix_tree_node *slot = NULL;
-	unsigned int height, shift;
+	unsigned int height, shift, offset;
+
 	RADIX_TREE_CONTEXT(context, root);
 
+	slot = radix_optimistic_lock(context, index, tag,
+			radix_valid_tag_clear);
+	if (slot) {
+		height = slot->height;
+		shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		pathp->offset = offset;
+		pathp->node = slot;
+		radix_path_init(context, pathp);
+		goto optimistic;
+	}
+
 	pathp->node = NULL;
 	radix_path_init(context, pathp);
 
@@ -635,8 +778,6 @@ void *radix_tree_tag_clear(struct radix_
 	slot = radix_tree_indirect_to_ptr(root->rnode);
 
 	while (height > 0) {
-		int offset;
-
 		if (slot == NULL)
 			goto out;
 
@@ -646,11 +787,12 @@ void *radix_tree_tag_clear(struct radix_
 		pathp->node = slot;
 		radix_path_lock(context, pathp, slot);
 
-		if (radix_tree_unlock_tag(root, pathp, tag)) {
+		if (radix_valid_tag_clear(slot, offset, tag)) {
 			for (; punlock < pathp; punlock++)
 				radix_path_unlock(context, punlock);
 		}
 
+optimistic:
 		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
@@ -1160,14 +1302,20 @@ static inline void radix_tree_shrink(str
 	}
 }
 
-static inline int radix_tree_unlock_all(struct radix_tree_root *root,
-		struct radix_tree_path *pathp)
+static
+int radix_valid_delete(struct radix_tree_node *node, int offset, int tag)
 {
-	int tag;
-	int unlock = 1;
+	/*
+	 * we need to check for > 2, because nodes with a single child
+	 * can still be deleted, see radix_tree_shrink().
+	 */
+	int unlock = (node->count > 2);
+
+	if (!unlock)
+		return unlock;
 
 	for (tag = 0; tag < RADIX_TREE_MAX_TAGS; tag++) {
-		if (!radix_tree_unlock_tag(root, pathp, tag)) {
+		if (!radix_valid_tag_clear(node, offset, tag)) {
 			unlock = 0;
 			break;
 		}
@@ -1195,6 +1343,17 @@ void *radix_tree_delete(struct radix_tre
 	int offset;
 	RADIX_TREE_CONTEXT(context, root);
 
+	slot = radix_optimistic_lock(context, index, 0, radix_valid_delete);
+	if (slot) {
+		height = slot->height;
+		shift = (height - 1) * RADIX_TREE_MAP_SHIFT;
+		offset = (index >> shift) & RADIX_TREE_MAP_MASK;
+		pathp->offset = offset;
+		pathp->node = slot;
+		radix_path_init(context, pathp);
+		goto optimistic;
+	}
+
 	pathp->node = NULL;
 	radix_path_init(context, pathp);
 
@@ -1222,11 +1381,12 @@ void *radix_tree_delete(struct radix_tre
 		pathp->node = slot;
 		radix_path_lock(context, pathp, slot);
 
-		if (slot->count > 2 && radix_tree_unlock_all(root, pathp)) {
+		if (radix_valid_delete(slot, offset, 0)) {
 			for (; punlock < pathp; punlock++)
 				radix_path_unlock(context, punlock);
 		}
 
+optimistic:
 		slot = slot->slots[offset];
 		shift -= RADIX_TREE_MAP_SHIFT;
 		height--;
Index: linux-2.6/init/Kconfig
===================================================================
--- linux-2.6.orig/init/Kconfig	2007-04-18 09:56:20.000000000 +0200
+++ linux-2.6/init/Kconfig	2007-04-18 09:56:27.000000000 +0200
@@ -344,8 +344,14 @@ config SYSCTL
 
 config RADIX_TREE_CONCURRENT
 	bool "Enable concurrent radix tree operations (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
 	default y if SMP
 
+config RADIX_TREE_OPTIMISTIC
+	bool "Enabled optimistic locking (EXPERIMENTAL)"
+	depends on RADIX_TREE_CONCURRENT
+	default y
+
 menuconfig EMBEDDED
 	bool "Configure standard kernel features (for small systems)"
 	help
Index: linux-2.6/include/linux/radix-tree.h
===================================================================
--- linux-2.6.orig/include/linux/radix-tree.h	2007-04-18 09:56:20.000000000 +0200
+++ linux-2.6/include/linux/radix-tree.h	2007-04-18 09:56:27.000000000 +0200
@@ -197,28 +197,47 @@ static inline void radix_tree_replace_sl
 	rcu_assign_pointer(*pslot, item);
 }
 
+#if defined(CONFIG_RADIX_TREE_OPTIMISTIC)
+static inline void radix_tree_lock(struct radix_tree_context *context)
+{
+	rcu_read_lock();
+	BUG_ON(context->locked);
+}
+#elif defined(CONFIG_RADIX_TREE_CONCURRENT)
 static inline void radix_tree_lock(struct radix_tree_context *context)
 {
 	struct radix_tree_root *root = context->root;
+
 	rcu_read_lock();
 	spin_lock(&root->lock);
-#ifdef CONFIG_RADIX_TREE_CONCURRENT
 	BUG_ON(context->locked);
 	context->locked = &root->lock;
-#endif
 }
+#else
+static inline void radix_tree_lock(struct radix_tree_context *context)
+{
+	struct radix_tree_root *root = context->root;
+
+	rcu_read_lock();
+	spin_lock(&root->lock);
+}
+#endif
 
+#if defined(CONFIG_RADIX_TREE_CONCURRENT)
 static inline void radix_tree_unlock(struct radix_tree_context *context)
 {
-#ifdef CONFIG_RADIX_TREE_CONCURRENT
 	BUG_ON(!context->locked);
 	spin_unlock(context->locked);
 	context->locked = NULL;
+	rcu_read_unlock();
+}
 #else
+static inline void radix_tree_unlock(struct radix_tree_context *context)
+{
 	spin_unlock(&context->root->lock);
-#endif
 	rcu_read_unlock();
 }
+#endif
 
 int radix_tree_insert(struct radix_tree_root *, unsigned long, void *);
 void *radix_tree_lookup(struct radix_tree_root *, unsigned long);

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH 6/6] debug: optimistic lock histogram
  2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
                   ` (4 preceding siblings ...)
  2007-04-18 20:12 ` [PATCH 5/6] radix-tree: optimistic locking Peter Zijlstra
@ 2007-04-18 20:12 ` Peter Zijlstra
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Zijlstra @ 2007-04-18 20:12 UTC (permalink / raw)
  To: linux-mm; +Cc: npiggin, akpm, clameter, a.p.zijlstra

[-- Attachment #1: radix-tree-optimistic-hist.patch --]
[-- Type: text/plain, Size: 4659 bytes --]

A simple histogram allowing insight into the efficiency of the optimistic
locking for the subjected workload.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 fs/proc/proc_misc.c |   22 +++++++++++
 lib/radix-tree.c    |  101 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 123 insertions(+)

Index: linux-2.6/fs/proc/proc_misc.c
===================================================================
--- linux-2.6.orig/fs/proc/proc_misc.c	2007-04-18 09:56:42.000000000 +0200
+++ linux-2.6/fs/proc/proc_misc.c	2007-04-18 09:56:50.000000000 +0200
@@ -268,6 +268,25 @@ static const struct file_operations proc
 	.release	= seq_release,
 };
 
+#ifdef CONFIG_RADIX_TREE_OPTIMISTIC
+extern struct seq_operations optimistic_op;
+static int optimistic_open(struct inode *inode, struct file *file)
+{
+	(void)inode;
+	return seq_open(file, &optimistic_op);
+}
+
+extern ssize_t optimistic_write(struct file *, const char __user *, size_t, loff_t *);
+
+static struct file_operations optimistic_file_operations = {
+	.open	= optimistic_open,
+	.read	= seq_read,
+	.llseek	= seq_lseek,
+	.release = seq_release,
+	.write	= optimistic_write,
+};
+#endif
+
 static int devinfo_show(struct seq_file *f, void *v)
 {
 	int i = *(loff_t *) v;
@@ -701,6 +720,9 @@ void __init proc_misc_init(void)
 			entry->proc_fops = &proc_kmsg_operations;
 	}
 #endif
+#ifdef CONFIG_RADIX_TREE_OPTIMISTIC
+	create_seq_entry("radix_optimistic", 0, &optimistic_file_operations);
+#endif
 	create_seq_entry("devices", 0, &proc_devinfo_operations);
 	create_seq_entry("cpuinfo", 0, &proc_cpuinfo_operations);
 #ifdef CONFIG_BLOCK
Index: linux-2.6/lib/radix-tree.c
===================================================================
--- linux-2.6.orig/lib/radix-tree.c	2007-04-18 09:56:42.000000000 +0200
+++ linux-2.6/lib/radix-tree.c	2007-04-18 10:11:26.000000000 +0200
@@ -75,6 +75,105 @@ static unsigned long height_to_maxindex[
 static struct lock_class_key radix_node_class[RADIX_TREE_MAX_PATH];
 #endif
 
+#ifdef CONFIG_RADIX_TREE_OPTIMISTIC
+static DEFINE_PER_CPU(unsigned long[RADIX_TREE_MAX_PATH+1], optimistic_histogram);
+
+static void optimistic_hit(unsigned long height)
+{
+	if (height > RADIX_TREE_MAX_PATH)
+		height = RADIX_TREE_MAX_PATH;
+
+	__get_cpu_var(optimistic_histogram)[height]++;
+}
+
+#ifdef CONFIG_PROC_FS
+
+#include <linux/seq_file.h>
+#include <linux/uaccess.h>
+
+static void *frag_start(struct seq_file *m, loff_t *pos)
+{
+	if (*pos < 0 || *pos > RADIX_TREE_MAX_PATH)
+		return NULL;
+
+	m->private = (void *)(unsigned long)*pos;
+	return pos;
+}
+
+static void *frag_next(struct seq_file *m, void *arg, loff_t *pos)
+{
+	if (*pos < RADIX_TREE_MAX_PATH) {
+		(*pos)++;
+		(*((unsigned long *)&m->private))++;
+		return pos;
+	}
+	return NULL;
+}
+
+static void frag_stop(struct seq_file *m, void *arg)
+{
+}
+
+unsigned long get_optimistic_stat(unsigned long index)
+{
+	unsigned long total = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		total += per_cpu(optimistic_histogram, cpu)[index];
+	}
+	return total;
+}
+
+static int frag_show(struct seq_file *m, void *arg)
+{
+	unsigned long index = (unsigned long)m->private;
+	unsigned long hits = get_optimistic_stat(index);
+
+	if (index == 0)
+		seq_printf(m, "levels skipped\thits\n");
+
+	if (index < RADIX_TREE_MAX_PATH)
+		seq_printf(m, "%9lu\t%9lu\n", index, hits);
+	else
+		seq_printf(m, "failed\t%9lu\n", hits);
+
+	return 0;
+}
+
+struct seq_operations optimistic_op = {
+	.start = frag_start,
+	.next = frag_next,
+	.stop = frag_stop,
+	.show = frag_show,
+};
+
+static void optimistic_reset(void)
+{
+	int cpu;
+	int height;
+	for_each_possible_cpu(cpu) {
+		for (height = 0; height <= RADIX_TREE_MAX_PATH; height++)
+			per_cpu(optimistic_histogram, cpu)[height] = 0;
+	}
+}
+
+ssize_t optimistic_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	if (count) {
+		char c;
+		if (get_user(c, buf))
+			return -EFAULT;
+		if (c == '0')
+			optimistic_reset();
+	}
+	return count;
+}
+
+#endif // CONFIG_PROC_FS
+#endif // CONFIG_RADIX_TREE_OPTIMISTIC
+
 /*
  * Radix tree node cache.
  */
@@ -455,7 +554,9 @@ radix_optimistic_lock(struct radix_tree_
 			BUG_ON(context->locked);
 			spin_lock(&context->root->lock);
 			context->locked = &context->root->lock;
-		}
+			optimistic_hit(RADIX_TREE_MAX_PATH);
+		} else
+			optimistic_hit(context->root->height - node->height);
 	}
 	return node;
 }

-- 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2007-04-18 20:12 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-18 20:12 [PATCH 0/6] concurrent pagecache Peter Zijlstra
2007-04-18 20:12 ` [PATCH 1/6] radix-tree: concurrent write side support Peter Zijlstra
2007-04-18 20:12 ` [PATCH 2/6] mm/fs: abstract address_space::nrpages Peter Zijlstra
2007-04-18 20:12 ` [PATCH 3/6] mm: lock_page_ref Peter Zijlstra
2007-04-18 20:12 ` [PATCH 4/6] mm: concurrent pagecache write side Peter Zijlstra
2007-04-18 20:12 ` [PATCH 5/6] radix-tree: optimistic locking Peter Zijlstra
2007-04-18 20:12 ` [PATCH 6/6] debug: optimistic lock histogram Peter Zijlstra

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.